A common question for data professionals is, "When was the data/dashboard last updated?" The answer often involves checking the created_at or updated_at columns in your data models for the latest timestamp. However, this might not be entirely accurate: your models are only as fresh as your data sources.
Another critical aspect of data modeling is ensuring the quality of your source data. The saying "garbage in, garbage out" is relevant here. Even flawless SQL models won't compensate for poor incoming data quality.
Therefore, it's crucial to invest time in setting up proper data quality testing for your data sources. This will help reduce errors in your final dashboards. Let's explore how to accomplish this with dbt.
Freshness tests
A freshness test is a simple test you can perform on your data sources. As the name implies, it checks how recently each source was loaded, alerting you about stale or outdated data.
The method dbt uses to check freshness is simple. The first step is to find the timestamp of the last row loaded into the source table. Then, dbt calculates the difference between that last available timestamp and the current time. Based on this difference, dbt can tell you whether your data is stale. As different tables may have varying refresh intervals, you can set individual thresholds for each one.
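Conceptually, the check boils down to a query like this (a simplified sketch; the actual SQL dbt generates depends on your adapter):

select
    max(loaded_at) as max_loaded_at,       -- latest load timestamp in the table
    current_timestamp as snapshotted_at    -- "now", as seen by the warehouse
from raw_data.stripe.charges

dbt then compares the gap between max_loaded_at and snapshotted_at against the thresholds you configure.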
Basic configuration for setting up a freshness test looks like this:
sources:
  - name: stripe
    database: raw_data
    schema: stripe
    tables:
      - name: charges
        freshness:
          warn_after: { count: 1, period: day }
          error_after: { count: 2, period: day }
        loaded_at_field: loaded_at
This config means the following:
we are checking the charges table for freshness
dbt will raise a warning if the data is more than 1 day old
it will raise an error if the data is more than 2 days stale
lastly, we specify that loaded_at is the timestamp column to check for freshness
You can tweak this config in several ways. For example, you may add the freshness block at the source level (instead of to each individual table). In that case, the same freshness rules will be applied to all tables.
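A sketch of such a source-level block (the refunds table here is hypothetical, added only to show the rules being shared):

sources:
  - name: stripe
    database: raw_data
    schema: stripe
    freshness:
      warn_after: { count: 1, period: day }
      error_after: { count: 2, period: day }
    loaded_at_field: loaded_at
    tables:
      - name: charges
      - name: refunds  # hypothetical second table; inherits the same freshness rules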
Another improvement is to add a filter parameter and limit the number of scanned rows to, say, the last week (use your database's syntax here):
freshness:
  error_after: { count: 1, period: day }
  filter: datediff('day', loaded_at, current_timestamp) <= 7
To trigger freshness tests, run the following command:
dbt source freshness
If you need to run freshness tests only for specific sources or tables, use node selection syntax:
# check all tables in the source named "stripe"
dbt source freshness --select source:stripe
# check "charges" table only
dbt source freshness --select source:stripe.charges
How often should you perform freshness tests? At a minimum, run them before each model run to confirm you're not building on stale data. Because freshness tests are separate from the dbt run/build commands, you can also schedule them more frequently than your production runs. Learning that your sources are outdated earlier gives you more time to correct the issue before production models are materialized.
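For example, a minimal production job might chain the two steps (a sketch; adapt it to your scheduler or CI tool):

# abort the build if any source hits its error_after threshold,
# since dbt source freshness exits with a non-zero code on errors
dbt source freshness && dbt build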
Assertion tests
Similarly to SQL models, you can add assertion data tests to your sources.
⚠️ Pay attention: since dbt 1.8, the tests parameter has been renamed to data_tests. The old syntax will be deprecated over time. See explanations here.
Start by adding generic tests to columns:
sources:
  - name: stripe
    tables:
      - name: charges
        columns:
          - name: id
            data_tests:
              - not_null
              - unique
Or you can add a singular test by putting a .sql file into the /tests folder (e.g., a hypothetical tests/assert_no_negative_charges.sql):
-- a singular test fails if the query returns any rows;
-- charge amounts should never be negative
select *
from {{ source('stripe', 'charges') }}
where amount < 0
To run tests on sources, use the following syntax:
dbt test --select "source:*"
# or
dbt test --select "source:stripe"
# or
dbt test --select "source:stripe.charges"
If you are using modeling layers, particularly a staging layer for your sources, you can move all those tests from sources to staging models. This is only feasible if your staging models are a one-to-one representation of your sources. Another advantage of testing staging models is that if you use the dbt build command instead of dbt run, failed tests on staging models will prevent issues from propagating to downstream models.
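As a sketch, the tests from the earlier example could move to a hypothetical staging model stg_stripe__charges like this:

models:
  - name: stg_stripe__charges  # hypothetical staging model built on source('stripe', 'charges')
    columns:
      - name: id
        data_tests:
          - not_null
          - unique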
What tests should I perform on the sources?
While this is a very open-ended question, I typically pick something from a set of “golden tests” that should ensure the correctness of my data sources.
Here you can find a few recommendations:
Check for expected columns (a lighter version of "data contracts"). For instance, try the expect_table_columns_to_contain_set test from the dbt_expectations package (see the sketch after this list).
Check for duplicates. If the source has a primary key, this task is straightforward. If not, create a surrogate key and test it for uniqueness and non-null values.
Check for completeness. This somewhat complex test should alert you about missing dates or particular column values (e.g., a missing partition). I typically create custom SQL tests for such cases, but there may be a package available for this purpose.
Check for relationships. The standard relationships data test is useful here. I use it to check sources that share identifiers, e.g., all user_id's from the orders table must match those in the users table (also shown in the sketch below).
Check for volume anomalies. This is particularly useful when the data is recent but contains a significantly lower or higher number of rows than usual. Consider using the volume_anomalies test from the elementary package.
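To make a few of these concrete, here is a sketch assuming the dbt_expectations and elementary packages are installed; the users table and the column names are illustrative:

sources:
  - name: stripe
    tables:
      - name: charges
        data_tests:
          # expected columns: a lighter version of a data contract (dbt_expectations)
          - dbt_expectations.expect_table_columns_to_contain_set:
              column_list: ["id", "user_id", "amount", "created_at"]
          # row-count anomaly detection (elementary)
          - elementary.volume_anomalies
        columns:
          - name: user_id
            data_tests:
              # every user_id in charges must exist in the users source table
              - relationships:
                  to: source('stripe', 'users')
                  field: id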
If you liked this issue, please subscribe and share it with your colleagues; this greatly helps me develop this newsletter!
See you next time 👋