Over the years, I have been monitoring the analytics software market with a quite simple requirement: I am (still) looking for a tool to let me generate full data pipelines quickly from a browser with the least amount of code. As an analyst, I don’t have the luxury of time that developers and data scientists have, I need to crank out pipelines quickly and can’t waste my clients’ time futzing with a Command Line Interface. A nice UI makes the documentation of a complex process much easier. I also want the ability to look at the actual data at each stage while I develop a long process, without having to devise each time a fancy query just for testing. Last but not least, I need frictionless provisioning and deployment, à la Snowflake: just sign up on a webpage and start using without having to summon any AWS engineers nor other IT / Dev Ops rare resource. Believe it or not, that market is still up for grab, as I haven’t found a decent solution yet. All the products I tried have significant flaws. Here are a few, I have evaluated:
Alteryx Designer Cloud
Domo Magic ETL
Databricks Delta Lake
I would say the ones coming closer to that vision are Matillion, Dataiku and Domo in its recent iterations with Magic ETL. All the others fail, in their current state, by being too expensive, too limited or way too complex (looking at you Delta Lake). There is still room for new entrants as this market segment is far from being satisfied. Analysts are usually happy to pay within reason for solutions that boost their productivity.
I have been paying close attention to the solutions available, monitoring fairly easily the relevant ones going to market with Snowflake, through their Partner Connect awesome feature. Yet, I recently came across one of those new entrants, a solution that is Snowflake centric, but not yet present on Snowflake Partner Connect, called Datameer. It offers what I deem a compelling positioning statement:
Working from a browser without having to install any application, nor patches, that is refreshing. Now, is Datameer delivering on the promise? Read on for my test of an early release…
Suppose you have a set of transactions, such as Superstore orders for instance, each with one to many line items, with each line item belonging to a product category:
Note that this is a single order with actually 14 line items, but my data set has 10,000 records with 5,009 distinct orders. My goal is to obtain one row per order, a count of line items, and most important, a single field with all categories found in that order to enable filter by keyword:
What I mean by history table is what is commonly called a Slowly Changing Dimension, or SCD table of type 2, which retains efficiently all the previous values I decide to track, for accurate historical reports. Why do we care for historical values? In a SFDC CRM context, suppose you are reporting on the number of opportunities processed by each rep over a year, based on the opportunity creation date. If the opportunity owner changes over time before close, you will assign the opportunity to the bucket of the most recent owner, instead of the legitimate first owner, who won’t get the credit. A SCD type 2 setup will archive all the owner values each opportunity took over time, and will store those records only if there was a change. That represents much less storage and processing resource than a daily snapshot table. I need to decide early which columns I want to track if they are likely to change, here in blue for that example:
Now suppose you have a simple setup of type: Fivetran copies SFDC table Opportunity to Snowflake, and overwrites all changes originating from SFDC. How do you build that history table?
One of the many appeals of using Snowflake for me, is the user interface in the browser, which allows not only to perform many admin tasks, but also to poke around data. Yet, that legacy environment, while the envy of many competitors (Hi Redshift and BigQuery!), is still lacking many features necessary to perform analytical tasks, most notably the visualization aspects: no graph possible. Nevertheless, pursuing its vision towards becoming a one stop shop data cloud, Snowflake introduced in June 2020 a free preview for an exciting feature: Snowsight.
Snowsight originates from Snowflake’s first acquisition in March 2019, Numeracy. the goal is to enhance the experience by easing data exploration, SQL coding and provide visualization, still within the browser. I have to say, while this is still a preview, it is already very compelling… Let’s take a quick tour after selecting the preview from your regular Snowflake browser UI:
Data for analytics can be structured in two different ways: Wide or Long. Here is an illustration of the two types of format for the same data set:
When would you use one vs. the other? In general, wide models work better for analytical models and offer a better visual layout for humans, whereas the long model works better for columnar DBs like Snowflake, Redshift or Tableau. Their engine is optimized to crunch data at scale within columns.
Now suppose you use Snowflake as a staging area for your analytics, which I highly recommend. You ought to write your data in the Long format. How do you perform that conversion, sometimes called reshaping a table from wide to long? That operation is also called to transpose (Alteryx), unpivot (Snowflake), pivot Columns to Rows (Tableau Prep) or melt a table (Panda). We will also cover the flip side operation, from Long to Wide, also called to cross tab (Alteryx), pivot (Snowflake, Panda) or pivot Rows to Columns (Tableau Prep).