An early-stage founder’s guide to building out your data stack

In this post, I’ll summarize learnings from hundreds of conversations I’ve had with early-stage founders and data practitioners and provide some recommendations on navigating the data landscape.

An early-stage founder’s guide to building out your data stack
Trying to figure out which data tools to buy? 😱 Source: https://mattturck.com/data2020/

In this post, I’ll summarize learnings from hundreds of conversations I’ve had with founders and data practitioners to provide some recommendations on the best tools to solve your problems.

Who this post is for

You’re an early-stage founder, likely an operator or engineer, and know how to work a bit with data: SQL, Excel, maybe Python, or R. Either way, you’ve lived a life far away from data scientists, data engineers, and the “modern data stack.” You see some data issues at your startup, but you don’t know where or when to begin to solve them.

What is a data stack?

Let’s start with what a “modern data stack” actually is. It consists of four parts:

  1. A data warehouse to store your data
  2. An “Extract and Load” tool to send data from your SaaS tools to the warehouse
  3. A transformation tool to model your data so that it's easy for analysis
  4. A frontend to visualize and explore the data to make everyday decisions

Next, I’ll break down each category, describe the problems it solves, and provide my recommendations on which tools you should consider to solve those problems.

A data warehouse to store your data

The key difference between a database and a warehouse is that a database is relational while a warehouse is columnar. The result is that queries are almost always going to be faster in a warehouse.

You might have Postgres or Mongo today, but you’re wondering when you’ll need a warehouse. The timing varies but I typically see teams getting a warehouse after a dozen or so hires and they’re having issues centralizing all of their data for analysis.

My recommendations:

  • Snowflake is the best in class. Faster than the others, better UI, simpler user management. But, most expensive.
  • Google’s BigQuery and Amazon’s Redshift are nice if you’re already on those clouds for the simple reason that it's one less account to worry about.

It might also be worth calling out that this is somewhat commoditized, so if you want to save money this is a reasonable place to do so.

An “Extract and Load” tool to send data from your SaaS tools to the warehouse

To query and join data from across your SaaS tools, you’ll want to centralize data from these tools in the warehouse. To make this possible, you need an “Extract and Load” tool.

Most companies I talk to use either Fivetran or Stitch to get disparate data sources into their warehouse. If you don’t think you have the resources to do this right now, I’ve seen Mozart Data, Syncari, and Workato as alternatives.

My recommendation:

  • Fivetran just works. It has the most connectors but it can get expensive.

A transformation tool to model your data so that it’s easy for analysis

Let’s face it — data quality is awful at most startups. You’re focused on hitting your numbers in the early days, not making them perfect. But sooner or later, questions like this will start to slow your team down:

  • Why is there test data in these numbers?
  • What are these weird characters I’m seeing in this query?
  • Why is your number different than mine?

These issues are most likely the result of poor or non-existent data transformation.

Data transformation (the “T” in “ELT”) is the process of turning data from one format to another. Data transformation typically involves filtering, splitting, merging, enriching, and eliminating duplicate data to reach a clean and usable form.

There are a few types of transformations:

  1. Constructive (addition and replication of data)
  2. Destructive (deletion of records)
  3. Aesthetic (standardization of names)
  4. Structural (renaming and merging database columns)

Traditionally, teams have used views in their database, warehouse, or BI tool to create standardized definitions of objects or metrics. This process is still okay in the early days when you might not need a large set of defined metrics for analytical and data science use cases. However, if you’re looking for a great way to clean your data, make sure it’s reliable with testing, and document it with metadata for your stakeholders, dbt is by far the tool of choice.

My recommendations:

  • Create views in your database or warehouse if you have a handful of straightforward transformations
  • Fivetran and Stich mentioned in the previous section also have transformation capabilities which should be fine early on
  • Get dbt if you’re looking for a best-in-class tool for creating a trustworthy data set for many stakeholders to analyze

A frontend to visualize and explore the data to make everyday decisions

So we’ve arrived at the frontend. You’re probably familiar with some of these tools: Tableau, Looker, and Mode have dominated the space for the past decade. This space, called Business Intelligence (BI), exists to give your employees a way to check metrics and explore data with or without knowing SQL. Let’s break down the different frontends you might be interested in:

Your  business teams are more comfortable with data, and you have some folks that need to write Python and R:

  • Looker and Mode are the big players here
  • Hex is the new kid on the block — basically a collaborative notebook on steroids

Your business teams are mostly reliant on your technical teams to help create or manage dashboards and answer ad-hoc questions:

  • Tableau is the big enterprise player here
  • Shameless plug for my company, Canvas

We started Canvas because we were frustrated with the current state of Business Intelligence. Most charts are too high-level for operators, forcing them to manually export to sheets or wait in line for answers from technical teams. That’s why we built Canvas, a spreadsheet-based data exploration tool for automating reports, building dashboards, and answering ad-hoc questions without SQL.

If you’re overwhelmed by this list and want to get started quickly, we can put you on a managed stack of the best-in-class tools mentioned above. If you have a warehouse, you can connect it in less than five minutes for free at canvasapp.com.

Also happy to chat and help offer some more recommendations. You can reach me at [email protected].