Bespoke Systems - custom analyses

Google BigQuery in Kaggle - tutorial setup

Recently, Kaggle released a feature that allows their kernels - the hosted Jupyter notebooks that power their competitions - to access Google BigQuery. The blog post that provides an overview points to an example tutorial kernel, “How to use BigQuery in Kaggle Kernels”.

To get set up and use this kernel, the assumption was made that one had already done the following:

I wasn’t completely clear from the tutorial kernel as to how to do this, so here are my notes on how I loaded the Ames Housing price data into BigQuery using the command-line.

There are a few setup pieces I’m not going to cover, namely establishing a Google Cloud Platform account (free trial available) and installing the Kaggle api command-line interface.

The steps that’ll be performed are:

  1. Have a GCP account and make sure the gcloud cli is associated with that project
  2. Create a GCP Google Cloud Storage bucket
  3. Download the Ames housing price data
  4. Upload the data to GCP GCS
  5. Create a dataset in GCP BigQuery
  6. Load the training data into a BigQuery table
# Switch to your Google Cloud Project, named YOURPROJECT
gcloud config set project YOURPROJECT

# Create a GCP bucket to house the data (remember that it has to be a unique name)
gsutil mb gs://${BUCKETNAME}

# Obtain the Kaggle Ames data locallly
# this will download 3 csv files and a text file
kaggle competitions download -c house-prices-advanced-regression-techniques

# Upload the data to the GCS bucket
gsutil -m cp *.csv gs://${BUCKETNAME}ameshousing/

# create a dataset in BigQuery
bq mk ameshousing

# load in the training data
bq load --source_format=CSV --skip_leading_rows=1 --ignore_unknown_values=true --autodetect=true ameshousing.train gs://${BUCKETNAME}/ameshousing/train.csv

Once that’s all done, you can clone the example tutorial kernel and proceed with the BigQuery & Kaggle tutorial!

Make sure that you note your Google Cloud Project ID and substitute that throughout the example python code, paying close attention to the SQL strings that contain references to my-example-housing-dataset, which is the tutorial’s example GCP project.

The power of BigQuery’s data warehousing storage capabilities and the BigQuery Machine Learning (BQML) is a welcome addition to Kaggle kernels!

Quoting from the tutorial kernel author, Jessica Li:

This is where the BQ and Kernels integration really shines. Before, you would have to query and download the output in BigQuery, then re-upload the data into a local Integrated Development Environment (IDE), configure related settings, and then do your analysis. Now, with just a few clicks, you can do all of that in the same place right here in Kernels!

Further, if you really like what’s being showing with BQML here, there are two more great resources to check out

_This article was also published on Medium_