What is dbt? A beginner’s guide to the next big thing in data
If you’re a data practitioner, you already know what dbt is. But if you’re not a data person and you’re starting to look at how to clean up and transform your data, you might be seeing dbt more and more and wondering:
- Why do I keep getting a bunch of results about dialectical behavior therapy when I google dbt?
- Why are so many people in their Slack community?
- What am I missing out on?
In this post, we’ll cover what dbt is, why it’s exploding in popularity, and why we at Canvas are betting on it being the next platform for modern data products.
What is dbt?
dbt (data build tool) is an open-source tool that simplifies data transformation. Data analysts and engineers write SQL SELECT statements, and dbt turns those statements into tables and views.
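To make the "SELECT in, table or view out" idea concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not dbt's implementation: the `materialize` helper, the `raw_orders` table, and the `paid_orders` model name are all hypothetical, and the sketch only shows the general pattern of wrapping a SELECT statement in the DDL that persists it.

```python
import sqlite3


def materialize(conn, name, select_sql, materialized="view"):
    """Wrap a SELECT statement in the DDL needed to persist it.

    Hypothetical helper for illustration only; dbt does far more than this.
    """
    if materialized not in ("view", "table"):
        raise ValueError(f"unsupported materialization: {materialized}")
    kind = materialized.upper()
    conn.execute(f"DROP {kind} IF EXISTS {name}")
    conn.execute(f"CREATE {kind} {name} AS {select_sql}")


# Made-up raw data standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 25.0, "refunded"), (3, 40.0, "paid")],
)

# The analyst only writes the SELECT; the helper handles the rest.
paid_orders_sql = "SELECT id, amount FROM raw_orders WHERE status = 'paid'"
materialize(conn, "paid_orders", paid_orders_sql, materialized="view")
materialize(conn, "paid_orders_tbl", paid_orders_sql, materialized="table")

paid_rows = conn.execute("SELECT id, amount FROM paid_orders ORDER BY id").fetchall()
```

The analyst never writes CREATE or DROP statements by hand. Switching between a view and a table is a one-word change rather than a rewrite.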
By mixing modular SQL with the best practices in software engineering, dbt makes data transformation fast and reliable. With dbt, data analysts can write business logic via SQL, automate data quality testing, execute the code, and deliver data documentation along with the code. This matters when teams are dealing with massive volumes of data and data engineers are scarce.
Why use dbt?
Over 5,500 companies use dbt every week. As their CEO, Tristan Handy, puts it, dbt has officially become mainstream. But what value does it actually provide?
dbt lets you do all the work in SQL. It allows data analysts to write transformations via SELECT statements. This eliminates the need for boilerplate code and allows analysts to transform data even if they’re unfamiliar with other programming languages.
With dbt, you can neatly arrange all data transformations into discrete data models. Each dbt model converts raw data into the target dataset or functions as an intermediate conversion step. dbt allows you to organize and materialize frequently used business logic in a collaborative, version-controlled, and fast way.
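Because some models feed other models, a project forms a dependency graph, and the models have to be built in an order that respects it. The sketch below shows one way to derive such an order in Python. The three model names and their SQL are made up, and extracting `ref()` calls with a regular expression is a simplification for illustration, not how dbt parses a project.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: two staging steps and one model built on top of them.
models = {
    "stg_orders": "SELECT id, customer_id, amount FROM raw.orders",
    "stg_customers": "SELECT id, name FROM raw.customers",
    "customer_revenue": (
        "SELECT c.name, SUM(o.amount) AS revenue "
        "FROM {{ ref('stg_orders') }} o "
        "JOIN {{ ref('stg_customers') }} c ON c.id = o.customer_id "
        "GROUP BY c.name"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def build_order(models):
    """Return model names ordered so every model comes after its dependencies."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


run_order = build_order(models)
```

In this sketch, `customer_revenue` always lands after both staging models, whichever order the models are listed in.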
dbt automates documentation generation around descriptions, model dependencies, model SQL, sources, and tests. The documentation displays existing models, relevant database objects, and detailed information about each model.
dbt makes data documentation transparent and visible through the lineage graphs it generates. It renders the project’s documentation in a web app, which contains information about the project (model code, project DAG, tests added to a column) and the data warehouse (column data types, table sizes).
Analytics as code
Because dbt integrates with Git, any new code can be safely tested, reviewed, and documented before being merged into the master branch. This means the risk of accidentally overwriting or modifying a production table while working on something new is a lot lower.
dbt Cloud, a hosted service that helps push dbt deployments into production, provides continuous integration. It allows for continuous deployment and less time spent testing. This is possible because dbt Cloud removes the need to push an entire repository when changes need to be deployed. Instead, only the components that need to be changed are addressed. Together, dbt Cloud and Git let you automate continuous integration pipelines, saving you time in management and simplifying the process.
Simpler data refresh and quality checks
With dbt Cloud, you don’t have to host an orchestration tool. It has a feature that fully automates the scheduling of production refreshes at whatever pace or frequency you want.
dbt also provides several ways to create and enforce data quality checks. It lets you create data integrity checks when you create documentation for a given model. It also lets you write custom data tests driven by business logic. Lastly, it enables you to build snapshot tables that track modifications to the data. Snapshots are most helpful with mutable data, since they give you full access to all previously made changes in the source data.
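A useful way to think about a data test is as a query that returns the rows that break a rule: if nothing comes back, the test passes. The Python sketch below illustrates that idea with two common checks, "not null" and "unique", run against an in-memory SQLite database. The helper functions and the `customers` table are hypothetical and are not part of dbt.

```python
import sqlite3


def not_null_failures(conn, table, column):
    """Return rows where the column is NULL; an empty list means the check passes."""
    return conn.execute(f"SELECT * FROM {table} WHERE {column} IS NULL").fetchall()


def unique_failures(conn, table, column):
    """Return (value, count) pairs for values that appear more than once."""
    return conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()


# Made-up data containing one missing email and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

null_email_rows = not_null_failures(conn, "customers", "email")
duplicate_ids = unique_failures(conn, "customers", "id")
```

Here both checks fail, and the returned rows show exactly which records caused the problem.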
dbt makes testing data integrity pretty effortless. Because dbt allows you to combine Jinja with SQL, you can turn your dbt project into a programming environment for SQL, which allows you to do things you can’t normally do in SQL (e.g. using control structures and environment variables). dbt also lets you apply a test to a given column by simply referencing it in the same YAML file.
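To give a feel for what control structures and environment variables add to SQL, here is a rough analogy in plain Python rather than Jinja. A loop generates one aggregate column per payment method, and the schema name comes from an environment variable. The function name, the `payments` table, the payment methods, and the `TARGET_SCHEMA` variable are all assumptions made for this sketch.

```python
import os


def render_payments_model(payment_methods, schema):
    """Build a SELECT with one SUM(CASE ...) column per payment method."""
    columns = [
        f"SUM(CASE WHEN payment_method = '{method}' THEN amount ELSE 0 END) "
        f"AS {method}_amount"
        for method in payment_methods
    ]
    column_sql = ",\n    ".join(columns)
    return (
        "SELECT\n"
        "    order_id,\n"
        f"    {column_sql}\n"
        f"FROM {schema}.payments\n"
        "GROUP BY order_id"
    )


# TARGET_SCHEMA is a hypothetical variable; fall back to a default if unset.
target_schema = os.environ.get("TARGET_SCHEMA", "analytics")
rendered_sql = render_payments_model(["card", "bank_transfer"], target_schema)
print(rendered_sql)
```

Adding a new payment method means adding one item to the list instead of copying and editing another block of SQL by hand.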
dbt has an open-source offering and a huge library of reference documents, step-by-step installation guides, and FAQs.
The open-source community of dbt also provides access to dbt packages. This means you can get your hands on libraries of models and macros that address a specific problem someone else has already solved. Their Slack community is one of the most engaged and spam-free that I've seen of any product.
dbt has quickly become the transformation tool of choice for modern data teams as it eliminates many common problems that data analysts face while also building a reliable collaboration and knowledge exchange platform.
We at Canvas believe that dbt will be the backend that powers the front-ends of the modern data stack. Canvas is a collaborative data exploration tool for operators to make decisions without writing SQL or relying on data teams for help.
Canvas integrates natively with dbt, so you can connect via dbt Cloud or GitHub in minutes. Once connected, you can select which models to publish for your business teams to explore. Even better, your dbt definitions will be synced automatically so your teams can quickly understand what each table and column means.
Sign up today and try Canvas for free.