Use dbt as a near real-time transformation tool


Sep 6, 2022

ACCEPTED

Ester Ramos Carmona, Morgan Sadr-Hashemi

#data #dev-tools

Context and Problem Statement

Our data warehouse is getting more complex as the business grows and we need to find a tool to control and transform the data, so we can gain insights from it. Currently, we get all the raw data into BigQuery using Estuary but we don't perform any kind of clean-up or transformation to it.

Decision Drivers

  • Widely used
  • Friendly language
  • Can connect with Bigquery
  • Testing
  • Version Control

Considered Options

  • dbt (data building tool)
  • Dataform
  • Datameer
  • Alteryx

Decision Outcome

Chosen option: "dbt", because from all the other tools it's the only one focused on the T on ETL, so you get more features and options. Some of the other options don't offer a command-line, or we have to join a waitlist to use it, or they're really expensive. Furthermore, several people in our team are already familiarised with dbt so that's a plus.

Some of the features:

  • It's written on SQL which makes it more accessible for everyone on the tech team compared to other languages.
  • It has version control and integrates with GitHub.
  • It has dependency management: you only run what you need to and not everything everytime.
  • It writes documentation automatically.
  • We'll be able to test the data quality, types and logic among other things.
  • It connects with Bigquery seamlessly.
  • I (Ester) have used it before so the setup will be easier than with other tools.

Positive Consequences

  • More control over the transformation changes (we're currently manually creating views on BigQuery)
  • We'll be able to test the data.
  • Better visibility of the table dependencies.
  • We'll be able to have near real-time data using lambda architectures.

Negative Consequences

  • SQL makes it less readable compared to an interactive UI.
  • dbt can be very complex at times, requiring us to rewrite macros used at the backend. This can slow development.

Pros and Cons of the Options

Dataform

Create and manage a centralized data model the same way you build software. SQLX extends the BigQuery SQL dialect to add features that support dependency management, testing, documentation, and more.

  • Good, because it was acquired by Google which means it supports BigQuery
  • Good, because it supports SQL statements. We can test and have version control.
  • Bad, because the signup process has finished and we'd have to join a waitlist to use it.

Datameer

Datameer is a SaaS data transformation platform that runs in Snowflake.

  • Good, because it has UI and command-line options allowing technical and non-technical members to collaborate.
  • Bad, because it's too focused on Snowflake, so it might not be as good with other platforms like BigQuery.
  • Bad (personal opinion), because they have a big part of their docs dedicated to dissing dbt but most/all the arguments against dbt were about its problem data visibility because it's all SQL (I don't think something being difficult means it lacks visibility).

Alteryx

Alteryx is more of a light-weight ETL tool for analysts.

  • Good, because it's usable for non-technical users.
  • Bad, because the server version is $55k annually.
  • Bad, because it doesn't have a command line option. You depend on the UI features for the transformations.