Exciting news! You have decided to use dbt as your data modeling tool, and it's a great choice. dbt not only allows you to build excellent data models, but also provides additional functionality such as data tests, documentation, and much more.
By design, dbt is meant to be flexible, giving you complete freedom to decide how to build your project. However, without proper planning, the project can quickly become a mess of unmaintainable code. It is therefore a good idea to establish some ground rules and set some conventions.
This post provides a handful of tips that can make your project more scalable and maintainable in the long run.
Tip 1: Define the project structure
dbt is a very flexible tool that allows you to organize your project in your own way. However, an ad-hoc structure can lead to scaling problems down the road. Although dbt doesn't enforce a standard way of organizing models, dbt Labs does suggest some best practices that you can use in your project.
In the article "How we structure our dbt projects", dbt Labs recommends organizing all models into layers, with each layer serving a specific purpose and containing unique features. I refer to this approach as “modeling layers”, as it describes how to model data using layers.
There are three basic layers:
staging layer — describes your raw tables, with light transformations on top
intermediate layer — where you start applying transformations and the first business logic; these are modular, reusable building blocks
marts layer — contains models that will be exposed to end users and BI tools
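For illustration, a project organized this way might look like the tree below (all folder and file names are invented for the example):
models
├── staging
│   └── mysql
│       ├── stg_mysql__users.sql
│       └── stg_mysql__user_events.sql
├── intermediate
│   └── int_events_sessionized.sql
└── marts
    ├── dim_users.sql
    └── fct_sessions.sql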
At first, this split into layers may sound redundant, but it really pays off in the long run. This approach brings many benefits to your project. Let's start with separation of concerns: you'll know exactly how your models are organized, which leads to a clearer dependency graph. Next, you'll finally be speaking the same language as your colleagues, which makes scaling a much easier process. It also speeds up the onboarding of new developers who are already familiar with this practice.
Take your time and read the article I mentioned — you’ll find a lot of cool tips about project organization.
Next, to ensure consistency and clarity, it is important to establish model naming conventions. The previously mentioned article suggests naming rules for each layer. In general, it boils down to the following:
don’t be afraid of longer names; it’s better to make your models self-descriptive
use a layer prefix, for example stg_ for staging and int_ for intermediate models
staging models should include the name of the data source and the entity they describe
intermediate models should describe the transformation they perform
marts models simply describe the entity they provide
sometimes marts are prefixed with fct_ or dim_ (facts and dimensions) to better describe what is inside
overall, you can (and sometimes should!) adjust these rules to better suit your needs
once rules are defined, stick to them!
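Applied to the example tree above, these rules read like this (names are illustrative):
stg_mysql__user_events — staging: data source (mysql) + entity (user_events)
int_events_sessionized — intermediate: the transformation being performed
fct_sessions / dim_users — marts: the fact and dimension entities exposed to end users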
Tip 2: Adjust default settings
When starting a new project, dbt_project.yml comes with sensible default settings. However, these defaults are generic and may not reflect your particular setup.
Start by setting the default schema and materialization type for all your assets.
For example, if you followed tip #1, you probably divided your models into different layers, which means that you can set defaults for each layer. Additionally, do the same exercise for seeds and snapshots if you want to keep them in a separate schema.
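Here is a minimal sketch of such defaults in dbt_project.yml (my_project stands in for your actual project name; the schema names and materializations are assumptions to adjust):
models:
  my_project:
    staging:
      +schema: staging
      +materialized: view
    intermediate:
      +schema: intermediate
      +materialized: ephemeral
    marts:
      +schema: marts
      +materialized: table

seeds:
  my_project:
    +schema: seeds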
You can find more settings in the documentation.
Next, you may want to alter the default behavior of how dbt handles database schemas. By default, the name of the custom schema is appended to the target schema. To change this behavior, add a generate_schema_name macro to your project and adjust it; the built-in implementation, which serves as the starting point, looks like this:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
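Adding this macro as-is changes nothing, since it mirrors the built-in logic; the point is to edit it. For example, a common variation (a sketch; adapt it to your own environment strategy) drops the target-schema prefix and writes to the custom schema name directly:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}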
Tip 3: Limit data usage in development
When working with large source tables, such as user events or website sessions, your models can take a long time to build. That may be acceptable in production, but it is not very pleasant during development. Fortunately, there is a trick that limits data usage during development, and it involves a bit of Jinja.
The easiest implementation is to add the following piece to your staging models:
select *
from {{ source('mysql', 'user_events') }}
{% if target.name == 'dev' %}
where date(event_timestamp) >= date_add(current_date(), interval -3 day)
{% endif %}
This snippet checks the current target and, if it is the development environment, limits the data to the last three days. Alternatively, you could use a variable to switch this manually whenever necessary, like so:
{% if var('is_dev', false) == true %}
where date(event_timestamp) >= date_add(current_date(), interval -3 day)
{% endif %}
So now a plain “dbt run” will process all rows; however, if you set the is_dev variable to true, the limit will be applied:
dbt run --vars '{is_dev: true}'
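If many staging models need this, you can avoid copy-pasting the condition by wrapping it in a small macro (a sketch; limit_dev_data and the default three-day window are my own choices):
{% macro limit_dev_data(timestamp_column, days=3) %}
{# Applies a date filter only when the is_dev variable is set #}
{% if var('is_dev', false) == true %}
where date({{ timestamp_column }}) >= date_add(current_date(), interval -{{ days }} day)
{% endif %}
{% endmacro %}
Then a staging model shrinks to:
select *
from {{ source('mysql', 'user_events') }}
{{ limit_dev_data('event_timestamp') }}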
Tip 4: Add an SQL linter
If you are starting a dbt project in a team, this tip is a must: when working with several people on the same codebase, you will eventually encounter situations where your code style differs from others. In this case, you need to do two things: create a code style guide and implement a linter. For dbt projects, you specifically need an SQL linter.
Creating an SQL style guide is entirely up to you and your team. You need to mutually agree on the conventions and rules. You can borrow a few rules from the dbt style guide, GitLab’s guide, or any other set of rules out there. Simply make it your own and come to an agreement with your team.
For linting SQL code, I recommend SQLFluff, a Python library that checks your SQL code against a customizable list of rules. In some cases, it can even fix issues automatically, such as leading vs trailing commas and uppercase vs lowercase keywords. More importantly, there is a templater plugin for dbt-style code.
Adding SQLFluff to your project is simple:
Install the package along with the plugin:
pip install sqlfluff sqlfluff-templater-dbt
Create a .sqlfluff file and configure SQL rules inside (see the example config after these steps)
Create a .sqlfluffignore file with a list of folders that should be skipped during linting/fixing
Run the linter, pointing it at your models folder:
sqlfluff lint models
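As a starting point, a minimal .sqlfluff for a dbt project might look like this (the bigquery dialect is just an example; set your own):
[sqlfluff]
templater = dbt
dialect = bigquery

[sqlfluff:templater:dbt]
project_dir = .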
Additionally, check out this nice article that provides more details and examples of SQLFluff in practice.
Tip 5: Use GitHub Actions as a scheduler
When starting your dbt project, you have several options for triggering the pipeline. You could use dbt Cloud, an external scheduler like Apache Airflow or Dagster, or you could use GitHub Actions. The latter option is usually overlooked in favor of the first two, so let's discuss it.
Setting up GitHub Actions is a straightforward process. First, create a configuration file that includes the workflow and a cron schedule. Then, set the Data Warehouse credentials as repository secrets, and you can easily run periodic refreshes of your models.
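As an illustration, here is a minimal sketch of such a workflow (the dbt-bigquery adapter, the schedule, and the secret name are assumptions; it also assumes a profiles.yml in the repo that reads credentials via env_var()):
name: dbt-scheduled-run

on:
  schedule:
    - cron: "0 6 * * *"  # every day at 06:00 UTC
  workflow_dispatch:     # also allow manual runs from the UI

jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-bigquery  # swap in the adapter for your warehouse
      - run: dbt deps
      - run: dbt run --target prod
        env:
          # example secret name; profiles.yml is assumed to read it via env_var()
          DBT_ENV_SECRET_KEYFILE_JSON: ${{ secrets.DBT_KEYFILE_JSON }}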
Many people use GitHub Actions to run dbt refreshes for their projects. If you search for "github actions dbt run" on Google, you will find numerous resources. Two great articles that explain how to create your own GitHub Actions workflow for running dbt refreshes can be found here and here.
Thank you for reading this post till the end!
I hope you found these tips useful and helpful for your dbt project. If you want to learn more about dbt and data modeling, please subscribe to this newsletter and stay tuned for more updates.
Happy modeling! 😊