BigQuery in Kaggle — setup for tutorial
Recently, Kaggle released a feature that allows their kernels — the hosted Jupyter notebooks that power their competitions — to access Google BigQuery. The blog post that provides an overview points to an example tutorial kernel, “How to use BigQuery in Kaggle Kernels”.
To get this tutorial kernel set up and ready to use, the assumption was made that one had already done the following:
- Set up a Google Cloud Platform account
- Uploaded the sample data from one of their competitions into a Google Cloud Platform account’s BigQuery table
I wasn’t completely clear from the tutorial as to how to do the 2nd part, so here are my notes on how I loaded the Ames Housing price data into BigQuery using the command-line.
There are a few setup pieces I’m not going to cover, namely establishing a Google Cloud Platform account (free trial available) and installing the Kaggle api command-line interface: kaggle api cli (and have a kaggle account, and grab your kaggle api from your account profile page). The steps that’ll be performed are:
- Have a GCP account and make sure the `gcloud` cli is associated with that project
- Create a GCP Google Cloud Storage bucket
- Download the Ames housing price data
- Upload the data to GCP GCS
- Create a dataset in GCP BigQuery
- Load the training data into a BigQuery table
# Switch to your Google Cloud Project, named YOURPROJECT
gcloud config set project YOURPROJECT# Create a GCP bucket to house the data (remember that it has to be a unique name)
BUCKETNAME=my-unique-gcs-bucket-001
gsutil mb gs://${BUCKETNAME}# Obtain the Kaggle Ames data locallly
# this will download 3 csv files and a text file
kaggle competitions download -c house-prices-advanced-regression-techniques# Upload the data to the GCS bucket
# into a folder called "ameshousing"
gsutil -m cp *.csv gs://${BUCKETNAME}/ameshousing/# create a dataset in BigQuery
# this corresponds to the dataset reference in the notebook
bq mk ameshousing# load in the training data from a GCS bucket to a table
# Specify, via flags, the source format is CSV, skip the first
# row (it's the table column headers), ignore unknown values, and
# automatically determine a schema
# see https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
bq load --source_format=CSV --skip_leading_rows=1 --ignore_unknown_values=true --autodetect=true ameshousing.train gs://${BUCKETNAME}/ameshousing/train.csv# note on test.csv, if you try to import the test.csv, it
# contains NA values that will need to be cleaned up prior
Once that’s all done, you can clone the example tutorial kernel and proceed with the BigQuery & Kaggle tutorial!
Make sure that you note your Google Cloud Project ID and substitute that throughout the example python code, paying close attention to the SQL strings that contain references to “my-example-housing-dataset”, which is the tutorial’s example GCP project.
The power of BigQuery’s data warehousing storage capabilities and the BigQuery Machine Learning (BQML) is a welcome addition to Kaggle kernels!
Quoting from the tutorial kernel author, Jessica Li:
This is where the BQ and Kernels integration really shines. Before, you would have to query and download the output in BigQuery, then re-upload the data into a local Integrated Development Environment (IDE), configure related settings, and then do your analysis. Now, with just a few clicks, you can do all of that in the same place right here in Kernels!
Further, if you like what’s being shown here with BQML, there are two more great resources to check out:
- Qwiklabs Quest, “BigQuery for Machine Learning” to get more hands on with a series of labs! Qwiklabs, the on-line learning platform that provisions temporary GCP accounts for you for the duration of the labs.
- An advanced Kaggle kernel tutorial on Getting started with BigQuery ML