The Core Principles of Robust Data Modeling (Part 2)
Let's talk about the performance and data quality of analytical models
In the previous post, I discussed the first two important principles of robust data modeling:
📈 scalability, and
♻️ reusability.
In this post, I'll explore the other two critical principles:
⚡️ performance, and
✅ data quality.
Let's dive in!
⚡️ Performance
In analytical data modeling, performance can be measured by:
how quickly the model can be built
how rapidly business users can retrieve results from the data model
the costs associated with building and querying the model
In cloud data warehouses, speed and cost are deeply interconnected. Faster model building leads to lower costs, while longer query processing times result in higher expenses.
Simply put: optimize speed and you'll optimize costs💲.
But performance matters not only because it saves money: business stakeholders need dashboards that load quickly, and no one will wait hours for a dashboard to finish loading. So the key to optimizing performance is finding the right balance between speed and cost.
Let's explore some approaches to achieve this balance.
🛠️ Choose suitable materialization
The easiest way to optimize performance is to choose the correct data materialization.
Usually, data models are materialized as database views or tables. Tables, in turn, can be rebuilt from scratch on each run (full refresh) or loaded incrementally.
Database views are useful when data models consist of light transformations and don't require joins or aggregations. Examples of light transformations are:
CASE WHEN statements
formulas on columns
data type casting
column renaming
Most database engines can apply such transformations on the fly, so materializing these models as tables could be a waste of time and compute. For example, dbt Labs recommends materializing staging-layer models as views because they don't contain joins or aggregations (GROUP BY).
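For illustration, here is a minimal sketch of a staging model that applies only light transformations, so a view is enough (the model, source, and column names are hypothetical):

```sql
-- models/staging/stg_orders.sql (hypothetical staging model)
{{ config(materialized='view') }}

select
    id                       as order_id,     -- column renaming
    cast(ordered_at as date) as order_date,   -- data type casting
    amount_cents / 100.0     as order_value,  -- formula on a column
    case
        when status = 'shipped' then true
        else false
    end                      as is_shipped    -- CASE WHEN statement
from {{ source('shop', 'orders') }}
```

Since the engine can compute these expressions on the fly, querying such a view is usually cheap, and you avoid storing a near-copy of the source table.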
Data models should be materialized as tables in two cases:
The model is a user-facing table that requires the best performance. All models in the Marts (or Gold) layer should be materialized as tables to give end users the quickest response time.
The model is heavy to compute and is used by multiple downstream models. Some intermediate models contain complex business logic with multiple joins and are slow to query, so it's better to materialize them as tables as well, especially when several downstream models rely on them in their calculations.
My general rule of thumb 👍 is to materialize data marts as tables and everything else as views, and if a view becomes too slow, switch it to a table as well.
As I mentioned earlier, tables can be created in two ways: by re-creating from scratch each run or by loading incrementally, processing only the new data.
Both approaches have their pros and cons.
Re-creating a table from scratch each run (i.e., full refresh) is definitely easier because you don't have to deal with schema migration or any additional DDL commands. But this method might be expensive, especially for big data sources.
Loading tables incrementally, on the other hand, allows you to process only the new data, saving processing time and money. On the flip side, you'll have to deal with occasional schema migrations (adding or removing columns) and backfills. My general tip here is to start with the full-refresh approach and switch to incremental once the data is large enough that the savings justify the more complex setup.
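As an illustration, here is a minimal sketch of an incremental dbt model; the model, source, and column names are hypothetical:

```sql
-- models/intermediate/int_order_events.sql (hypothetical incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    order_id,
    event_type,
    created_at
from {{ source('shop', 'order_events') }}

{% if is_incremental() %}
  -- On incremental runs, process only rows newer than what is already in the table
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```

A full refresh (`dbt run --full-refresh`) is still available when you need a backfill or after a schema change.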
Action points:
❓ Check your data models to see if you can improve performance by applying the correct materialization strategy
❌ If all your data models are materialized as either only tables or only views, your model performance is likely suboptimal
✅ Ensure that models with light transformations are views, user-facing models are tables, and large tables are built incrementally. dbt Labs recommends this approach in their famous “How we structure our dbt projects” guide
🕵️♂️ Add Data Observability
The second important aspect is implementing robust data observability.
Data observability is the practice of monitoring and understanding your data pipelines. It helps with:
tracking model performance, for example run times
data lineage, to identify dependencies between models
monitoring data quality
When a model moves to production, its runtime can silently increase as data grows or the project becomes more complex. Having reports that show build times for each data model helps prevent unexpected cost increases.
Another problem data observability can help with is so-called "modelnecks": bottlenecks where one long-running model slows down the entire pipeline because dependent models must wait for it to finish. Analyzing model build times alongside data lineage helps you spot and proactively resolve these modelnecks.
You can improve a slow-running model by selecting the right materialization strategy, implementing incremental loading, or refactoring it into smaller models to increase the throughput.
The final component of effective data observability is the ability to logically break down the DAG into smaller, more manageable parts. Many data lineage tools allow you to add metadata to models, such as tags, helping you identify critical pipeline components to prioritize during optimization. For example, dbt supports adding tags to models and selecting models based on these tags when visualizing or building models. This feature makes pipeline management significantly more efficient.
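For instance, in dbt you could tag a critical model directly in its config; the model and tag names below are just an illustration:

```sql
-- models/marts/finance/fct_revenue.sql (hypothetical mart model)
{{ config(
    materialized='table',
    tags=['finance', 'critical']
) }}

select
    order_id,
    order_date,
    order_value
from {{ ref('stg_orders') }}
```

You can then build or inspect just that slice of the DAG with `dbt build --select tag:critical`.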
Action points:
❓ Check which metrics you track regarding model performance: run times, number of failures, and time in queue
❌ Inability to see these metrics over time may lead to silent cost increases and performance degradation
✅ Choose from different options to implement data observability. For dbt projects, you can start with dbt Cloud (paid) or install open-source tools like Elementary
🏎️ Utilize database optimizations
Last but not least is utilizing all possible technical optimizations available in your database. This tip alone could fill an entire book, so I'll provide some general guidance instead.
First, understand that different database types require different optimization techniques. Postgres, BigQuery, and Apache Spark have fundamentally different architectures, so you need to learn the specific advantages and limitations of your particular database or processing engine.
At the same time, some tips are applicable across a range of technologies. For example, all columnar data warehouses (BigQuery, Snowflake, Redshift) share common optimization techniques, such as:
utilizing partitioning and clustering
applying data pruning as early as possible
avoiding ORDER BY when unnecessary
preferring window functions over self-joins
and so on
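As a small illustration of the last two points, the commented-out self-join below can be replaced with a window function, and filtering early on a partition column (here a hypothetical order_date) lets the engine prune data before the heavy work starts:

```sql
-- Instead of a self-join to find each customer's latest order...
-- select o.*
-- from orders o
-- join (
--     select customer_id, max(order_date) as max_order_date
--     from orders
--     group by customer_id
-- ) latest
--   on o.customer_id = latest.customer_id
--  and o.order_date = latest.max_order_date

-- ...use a window function and prune early on the partition column:
select *
from (
    select
        order_id,
        customer_id,
        order_date,
        row_number() over (
            partition by customer_id
            order by order_date desc
        ) as rn
    from orders
    where order_date >= date '2024-01-01'  -- early pruning on the partition column
) as ranked_orders
where rn = 1
```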
Therefore, the best approach is to begin by researching optimization techniques specific to your technology stack.
If you use the Snowflake data platform, I recommend checking out the SELECT blog. The company specializes in optimizing Snowflake costs and shares valuable advice on how to tune data warehouse performance using not only the techniques mentioned above, but also vendor-specific settings like thread count or warehouse size.
Finally, an important tip is to learn how to read execution plans using the EXPLAIN keyword. This SQL command shows how your query will be executed in the database, revealing the actual operations performed.
Execution plans are very valuable for finding bottlenecks in SQL queries. I particularly look for:
full table scans instead of filtered scans (typically caused by missing or misplaced WHERE clauses)
cardinality explosions (usually resulting from incorrect or poorly placed JOINs)
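A minimal example using Postgres-style syntax (the tables and the filter are just an illustration):

```sql
-- Show the execution plan without running the query
explain
select
    c.customer_id,
    count(*) as order_count
from customers c
join orders o
  on o.customer_id = c.customer_id
where o.order_date >= date '2024-01-01'
group by c.customer_id;
```

In the resulting plan, a sequential scan over a large table or a join with a suspiciously high estimated row count is usually the first thing to investigate.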
Some databases, like Snowflake, provide a user-friendly visualization of execution plans, making it easier to debug and identify performance bottlenecks.
Action points:
❓ Learn more about the underlying technology of your data stack
❌ Using only generic or random optimization tips cannot guarantee optimal performance for your specific technology
✅ Look for specialists or companies that focus on data stack optimizations. Excellent examples are this blog post about Snowflake or this post about Databricks optimization
✅ Data quality
The last crucial principle of building robust data models is maintaining the highest standards of data quality.
Data quality measures how well data meets consumer expectations. For example, consumers of your data models expect the data to be accurate, consistent, reliable, and much more. Higher quality data builds more trust and leads to greater use in business decisions.
Let's explore how to achieve great data quality in your models.
🔁 Apply end-to-end data testing
A naive approach to data quality is to measure the characteristics mentioned above (like accuracy and consistency) only for business-facing data models. This means applying periodic testing just to tables that end users interact with. However, data quality measures can be applied to a much wider range of use cases throughout the data pipeline.
If we carefully examine how data flows into the data warehouse and is subsequently transformed into data models, we can distinguish several stages where data quality can be checked:
🌱 External data used to build analytical models should be monitored for unexpected values, incompleteness, and timeliness (freshness)
👷 During development of new models or modification of existing ones, data tests can be run on the final models to signal if the new or modified code breaks any assumptions about the data
🚢 Before deploying new changes to production, automated tests can assure us that they won't break any existing reports
⏱️ Finally, periodic tests on production tables can tell us if newly added data still meets our quality standards
This approach resembles the "Shift Left" approach I explained in the previous article. The earlier you catch a problem, the sooner you can fix it and avoid building production data models with errors.
Different types of tests should be applied to different stages.
For example, dbt freshness tests can be applied to source tables to ensure they don't have stale data. Also, data contracts can be applied to make sure that our assumptions about the incoming data still hold. Data contracts can check that the number of columns hasn't changed, or that data types remain the same as before.
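In dbt, freshness checks and contracts are typically configured in YAML, but to sketch the idea in SQL, here is a hypothetical singular test that fails when a source hasn't received new rows in the last 24 hours (the source and column names are assumptions):

```sql
-- tests/assert_orders_source_is_fresh.sql (hypothetical singular test)
-- dbt treats any returned rows as test failures
select max(loaded_at) as last_loaded_at
from {{ source('shop', 'orders') }}
having max(loaded_at) < current_timestamp - interval '24 hours'
```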
During development, data tests and unit tests can greatly help in delivering high-quality datasets. Data tests check assumptions about the data, for example that a column is not empty or contains only expected values. Unit tests check the model's transformation logic: by using mock inputs and expected outputs, they can signal when changes break the logic of data models.
Periodic tests in production typically include data tests and freshness tests. Freshness tests confirm that new data has arrived and models can be rebuilt, while data tests ensure the new data is still correct and contains expected values.
Action points:
❓ Inspect different stages of the data flow and examine what kinds of tests are currently applied
❌ If you don't have any tests, implement them throughout your entire data pipeline, from ingestion to final data marts
✅ If you're using dbt, you can immediately start using data tests, unit tests, and freshness tests. For other stacks, try Soda tests or Great Expectations
☯️ Balance coverage and redundancy
While it may seem logical to test every possible model in your data pipeline, this approach is often counterproductive.
First, covering all data models with tests can significantly increase testing time and incur additional costs.
Second, many data tests will duplicate each other because some columns are simply carried from model to model without changes.
⚖️ You need to balance test coverage with redundancy: you want good coverage, but not at the expense of longer waiting times and higher costs.
To streamline this process, you should implement testing conventions. These conventions should clearly define what to test, when to test it, and how frequently tests should be run.
Here are the testing conventions I use in my work:
Always add a primary key test to your models. If a model doesn't have a primary key, generate a surrogate key and test it (see the sketch after this list).
Test columns as early in the pipeline as possible (applying the Shift Left principle). In dbt projects, this is typically in the staging layer.
Define tests based on column category:
Primary keys should have not_null and uniqueness tests
Categorical columns should have accepted values tests
Booleans should be checked for accepted values (ensuring they contain BOOLEAN type, not STRING like "true")
Numbers should be checked according to business logic (e.g., order_value should always be positive)
And so on
In downstream models, only test columns that were modified or newly introduced. Testing columns carried over unchanged is redundant.
Redundancy is acceptable in the final layer with business-related models. You can add primary key tests and additional checks to all important columns, like testing for missing values, expected values, and business logic (for example, shipping date should always be later than the order date).
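As a sketch of the first convention, here is how a surrogate key could be generated and tested in dbt, assuming the dbt_utils package is installed; the model and column names are hypothetical:

```sql
-- models/marts/fct_order_lines.sql (hypothetical model without a natural primary key)
select
    -- Build a surrogate key from the columns that define the grain
    {{ dbt_utils.generate_surrogate_key(['order_id', 'line_number']) }} as order_line_key,
    order_id,
    line_number,
    quantity
from {{ ref('stg_order_lines') }}
```

```sql
-- tests/assert_order_line_key_is_unique.sql (hypothetical singular test)
-- Any returned rows mean the surrogate key is not unique
select order_line_key, count(*) as duplicates
from {{ ref('fct_order_lines') }}
group by order_line_key
having count(*) > 1
```

In practice you would more likely attach the built-in not_null and unique tests to the key column in the model's YAML; the singular test above just expresses the same check in SQL.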
Action points:
❓ Check if your team has any conventions regarding data testing
❌ If you don't have any, your testing strategy is probably suboptimal: you may be missing some test cases or, conversely, applying too many tests
✅ Start with the list of conventions I provided above, then extend and adapt them to your needs. Try the dbt-project-evaluator package, which can measure test coverage and other useful metrics about your project
🧠 Build with testing in mind
In software engineering, a popular development approach called test-driven development (TDD) involves writing tests before writing the code. Once the tests are established, developers must create software that meets all quality standards to pass these tests.
This approach isn't always suitable for data models. We often design a model, begin implementation, adjust the initial design based on new information, and iterate again. In this context, writing tests first can be challenging.
That's why I suggest a more flexible alternative which I call test-informed development.
It's based on several principles:
When building models, consider which tests can be applied to them
Add necessary tests shortly after completing development, including those that might be valuable in the future
Follow a build-test-adjust-retest cycle to ensure your assumptions about the data remain valid after changing the code of the model
I'm advocating for intentional testing, where models are designed with testability in mind. This approach doesn't strictly follow TDD, but it prioritizes quality and correctness during development.
Classic TDD is technically possible when utilizing unit testing. For example, dbt 1.8 introduced unit tests as first-class citizens, allowing analytics engineers to test SQL logic without using any underlying data.
However, technical possibility doesn't guarantee universal adoption. Rather than requiring everyone to write unit tests first, it's better to embrace a more flexible approach where model developers naturally incorporate testing considerations into their process.
After building up a solid collection of tests in your project, you'll gain greater confidence when modifying logic, as comprehensive test coverage can identify bugs and potential issues before code reaches production. If you later decide to implement unit testing, refactoring large models becomes significantly easier because you can test not only the model's output but also its underlying logic directly.
Action points:
❓ Evaluate how your team develops and merges data models to Production
❌ Lack of testing guidelines and deploying to Production without proper tests indicates poor engineering practices that can lead to serious failures
✅ Educate your team about data testing importance, add "I have run data tests" to your PR checklist, and incorporate tests into your CI pipeline
Summary
This is the final post of “The Core Principles of Robust Data Modeling,” a mini-series in which I covered the last two principles: performance and data quality. Check out the first post here.
If you liked it, give it a heart and share it with a colleague.
Subscribe here or on LinkedIn to learn more about data modeling and analytics engineering!









