Why a post about Snowflake? What does Alteryx and Tableau have to do with Snowflake?
Here is some context: as I was planning to launch my Data Lake consulting business, which is now launched, I was looking for a place to store and share prototypes, which need to be queried from Tableau with high performance. Cloud databases offer the flexibility in terms of capacity, performance, pricing, and portability that I was looking for.
The leaders in that field, AWS and Azure, offer a lot of solutions, but I decided to take a closer look at Snowflake, not only because it is a hot startup, but also because it promises Uncompromising simplicity, read: no need for a DB Admin! I consider that feature a huge advantage over Redshift, which takes quite some technical skills to get even started…
My experience testing the platform proved that that specific promise of easy administration is met, even though that simplicity does not extend to the overall experience. To put it another way, Snowflake is not Citizen Analyst ready yet, but with a bit of tweaking, and the research I will share below, it comes pretty close!
To illustrate the step by step process, I will use the business scenario of a previous post: storing 27k rows of review data for a Word Cloud analysis.
Setting Up your Snowflake Instance
- Snowflake offers a free trial with some sizable credits to start playing:
It took me 5 days to get my instance approved, so you might want to include that time into your plans, spontaneity here is quite limited…
- Once your instance is created, you get to create your first warehouse:
I strongly suggest to stick with the X-Small Size instance for as long as possible to save your credits. In my experience so far, it is surprisingly powerful and ended up meeting my needs for quite large queries against 20+ million large records…
- Then proceed to create your database:
- Next step is to create the database table you want to load into Snowflake. Alteryx comes very handy for that task, to save you a lot of manual SQL labor. But first, you will need to connect Alteryx to Snowflake by declaring an ODBC connection. After downloading the ODBC driver from here and installing it, open Alteryx and go to Options/Advanced Options/ Manage Connections and Add Other/ODBC:
- Then Click on ODBC Admin:
- Click Add to create a new Data Source,and pick the Snowflake driver you just installed
- Populate the entries:
- Snowflake should now appear under System DSN
- And you should be good to go in Alteryx too:
- Last Step is to enter the name of the Table you want to create in the Output tool when you pick a connection:
Create an Alteryx workflow to load your data the Quick & Dirty way
It gets pretty easy in Alteryx, just take your flat file or any other source, drag the output tool and set it to the Snowflake connection, and kick it off!
No SQL or other code required! If you want to optimize your size and performance, it would be good to insert the Auto Field tool, as described in that earlier post.
One thing you will notice though, is that it took 12 minutes to load 27k records, which is really long. This is because through ODBC, Snowflake inserts each row, one at a time. Tedious… There has to be a better way…
Load Swiftly Data in Bulk to Redshift with Alteryx
The slow insertion issue is not happening when using Alteryx for bulk loads to SQL Server or Redshift for instance, because the Alteryx connector offers a Bulk loader code built in. That bulk loader is effectively copying a compressed CSV to an intermediary stage, AWS S3 in the case of Redshift, and then inserts into the DB from S3, which it turns out, is a much more efficient process. This approach is also advocated by Snowflake, as per this useful video:
Since Snowflake is hosted by AWS, it is using currently S3 for its “Internal” storage, and therefore, you don’t need an external S3 account to follow that process, as it is included in your Snowflake account. You can also use your existing S3 account, but for this guide, I will assume you stay within Snowflake.
The bad news is that, whereas the process is transparent when using Redshift or SQL Server, it gets much more involved with Snowflake at the moment. Eventually, Alteryx will offer a more streamlined experience, and I heard it is planned, but today the bulk loading to Snowflake feature is not included in Alteryx. [2019 Update: Alteryx now offers a Bulk upload connector to Snowflake, BUT it requires credentials to a S3 bucket, and does not make any use of the Snowflake internal staging resource, which comes free with a Snowflake account. The S3 settings on the connector are also REALLY hard to configure. That Bulk connector released by Alteryx is a deeply flawed approach in my opinion, since the purpose of using Snowflake for most analyst users is to get rid of Redshift, which is hard to manage. And therefore they do NOT want to manage and pay for a S3 bucket! So the approach through SnowSQL I describe in this post remains relevant I’m afraid…].
Furthermore, more crucially, Snowflake does not yet support the PUT command through ODBC, which is a key step of the process as we will see [2019 Update: now it does if your Snowflake instance is hosted on AWS as indicated at the link above]. The good news is that we have here a great opportunity to understand what goes on under the hood when bulk loading, and this step by step guide will reveal it all!
1. Modify your Alteryx workflow to only create the table
We will use the actual data to create a table without having to type all the column definitions, restricted to 10 rows to not have to wait, using the Sample tool:
Make sure you also input the following command into the Snowflake Output tool:
DELETE FROM “WORDS_CLOUDS”.”PUBLIC”.”reviews”;
2. Install SNOWSQL
SnowSQL is the command line tool for Snowflake, the equivalent of TABCMD for the Tableau folks. It is NOT sexy but it becomes a required install as it does not have the limitations of the ODBC connection.
- Download the install files from here:
- After installing the App, find the Config file in the snowsql folder and go edit it to input your credentials:
make sure you input those lines above the [connections.example], save and close it…
- Open your Command line editor (in Windows type CMD in Start) and then input the command snowsql. You should see this:
3. Stage the files to the cloud
Here, we use the internal storage offered by Snowflake to stage efficiently the files from your end to the cloud using the PUT command
- Convert your data into a CSV file, then compress it to a GZIP format, using the Open Source 7-Zip tool available for all systems.
- Create a STAGE in Snowflake:
I called it REVIEWS_STAGE, and no settings are needed…
- Use your SNOWSQL command line to input PUT command which will upload to the Stage, Internal table in that case, of Snowflake:
As you can see, it took only 12 seconds, not minutes! I tested loading 20 million records, quite large columns, and it took only minutes, with the smallest instance!
4. Load from the cloud file to your Table
- Create a simple File format to indicate to Snowflake the incoming format of your CSV:
- Run the Copy Into command from SNOWSQL, which will write the CSV to your table:
and this time it took only 3 seconds… Snowflake handles the GZIP format transparently…
- go check your newly loaded table:
with the expected 27k records…