The Core Principles of Robust Data Modeling (Part 1)
How to build scalable and reusable analytical models
One of the main ways analytics engineers provide value is by building a robust system of data models that supports the business's decision-making process.
This system rests on four key principles:
📈 scalability
♻️ reusability
⚡️ performance
✅ data quality
By focusing on all four, analytics engineers can build a truly remarkable analytical system that supports business stakeholders and data analysts, leading to faster insights and better decisions.
These four principles are interconnected. Progress on one will inevitably improve the others, as the boundaries between them are often blurry.
In this post series, I'll share practical tips on how to achieve each aspect of robust data modeling.
📈 Scalability
In analytics engineering, scalability is about handling changes with greater speed and confidence. This means data models should be designed for easy modification and extension.
In the data world, requirements constantly evolve: new metrics, new dimensions, ever-changing business logic. The ability to implement changes quickly defines how scalable your project is.
Scalability also involves handling increased data volume, but we'll address that when discussing the performance aspects of analytical systems.
Several best practices can help achieve scalability.
🍔 Modeling layers abstraction
The first step in achieving scalability is organizing your data pipeline's assets into a coherent system.
When transformation logic is just a pile of SQL and Python scripts hooked together, it's easy to get lost and make incorrect decisions about how to grow the system. There should be a clear pattern.
A classic example of such a pattern is the Medallion Architecture.
It suggests organizing all the data in your data warehouse or data lake into three layers (Bronze, Silver, and Gold) to progressively improve data models:
The Bronze layer holds raw data ingested from external sources
The Silver layer contains data from the Bronze layer that has been cleaned, validated, and refined of inconsistencies and errors
The Gold layer contains curated, enriched data ready for analysts, data scientists, and Business Intelligence tools; it is usually denormalized and structured for efficient querying
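To make this concrete, here's a minimal SQL sketch of such a flow. The schema, table, and column names (a bronze/silver/gold setup fed by an orders feed) are made up for illustration:

```sql
-- Bronze: raw orders landed as-is from the source system (hypothetical names)
CREATE TABLE bronze.orders_raw AS
SELECT * FROM external_source.orders;

-- Silver: cleaned and validated version of the same data
CREATE TABLE silver.orders AS
SELECT
    CAST(order_id AS BIGINT)       AS order_id,
    LOWER(TRIM(customer_email))    AS customer_email,
    CAST(order_ts AS TIMESTAMP)    AS ordered_at,
    CAST(amount AS DECIMAL(12, 2)) AS amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;

-- Gold: denormalized, analysis-ready table for BI and analysts
CREATE TABLE gold.daily_revenue AS
SELECT
    CAST(ordered_at AS DATE) AS order_date,
    SUM(amount)              AS revenue
FROM silver.orders
GROUP BY 1;
```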
If you use the dbt framework, another popular option is the modeling layers approach from dbt Labs.
It suggests organizing data models into three basic layers:
Staging — to perform basic clean-up and alignment of raw data
Intermediate — to build reusable building blocks that contain business logic
Marts — for final and polished data marts ready for analysis
Moreover, your project can easily extend beyond these basic layers. For example, you can add a reverse ETL layer, semantic models, or a reporting layer. For inspiration, check out how GitLab structures its data warehouse layers.
Both approaches (medallion and modeling layers) share the same core idea: models should transform data from its raw state into a usable analytical model through progressive enhancements.
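In dbt, this progressive flow usually shows up as a chain of ref() calls. Here's a rough sketch with three hypothetical models (each would live in its own file):

```sql
-- models/staging/stg_orders.sql: basic clean-up and renaming of the raw source
SELECT
    id                            AS order_id,
    user_id,
    CAST(created_at AS TIMESTAMP) AS ordered_at,
    amount
FROM {{ source('shop', 'orders') }}

-- models/intermediate/int_orders_enriched.sql: reusable business logic
SELECT
    o.order_id,
    o.ordered_at,
    o.amount,
    u.country
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_users') }} AS u USING (user_id)

-- models/marts/fct_orders.sql: final, analysis-ready fact table
SELECT * FROM {{ ref('int_orders_enriched') }}
```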
By logically organizing data models, we bring order to the project and ensure that changes can be implemented faster.
Action points:
❓ Look at your analytics project and check whether you are using layers to gradually transform raw data into final analytical tables
❌ If you don't see any notion of layering, think about how you can logically split all your data assets into layers to promote standardization
✅ If you're already using medallion architecture or modeling layers - you're on the right track!
⭐ Data modeling methodology
Although the term "data modeling" means different things depending on context, in this post we're specifically discussing data modeling for analytics purposes.
In the data world, data modeling methodology defines how data models are designed logically and subsequently implemented physically.
Several popular methodologies exist, including Kimball's dimensional modeling, Inmon's approach, and Data Vault.
In real projects, teams rarely use a pure implementation of any single approach. More commonly, they adopt a hybrid that borrows effective patterns from multiple methodologies.
For example, a popular approach is applying Kimball's dimensional modeling to business entities such as orders, users, or bookings. This helps create specific data models that serve business needs. Patterns such as the "star schema" or "snowflake schema" help enforce a consistent physical design.
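To give a flavor of what a star schema looks like in practice, here's a hypothetical query against a bookings fact table and two dimensions (all names invented for illustration):

```sql
-- Star schema sketch: a central fact table joined to its dimensions
SELECT
    d.calendar_month,
    u.country,
    SUM(f.booking_amount) AS total_bookings
FROM fct_bookings AS f
JOIN dim_users    AS u ON f.user_key = u.user_key
JOIN dim_date     AS d ON f.date_key = d.date_key
GROUP BY d.calendar_month, u.country;
```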
For practical guidance on applying dimensional modeling in your analytics, I highly recommend the BEAM framework (Business Event Analysis & Modeling).
This framework outlines how analytics engineers should iteratively build data models by:
collaborating with business stakeholders
capturing essential details of future models
building models that are both useful and well-structured
Here is an awesome video that explains the implementation of BEAM in more detail.
Integrating data modeling techniques with the modeling layers creates a more scalable architecture with exceptional clarity.
When you need to change logic or add a new data model, you'll know exactly where to make changes. A new model typically starts in the exposed layer (Gold or Marts), and you then work backward through the intermediate layers to the raw data. Existing models are easy to adjust because dimensional modeling promotes modularity and clear relationships between entities.
Action points:
❓ Review your data models, especially the final layer that contains business-facing tables
❌ If you see a set of disconnected tables that lack clear relationships or contain partially duplicated logic — rethink your approach and try the model-storming exercise from the BEAM framework to see if you can unify your models
✅ If you see well-organized marts that are easy to join and work with — nicely done!
🧱 Naming and coding conventions
This simple technique brings significant order to all analytics projects.
It's about consistently naming things and using the same coding conventions throughout your codebase.
This reduces cognitive load, allowing you to focus on business logic rather than worrying about file naming.
The same principle applies to file organization. Files that are properly arranged into layers naturally indicate where to create new models or locate existing ones.
Organized systems tend to stay organized.
Additionally, properly organized files are easier to maintain and simplify the onboarding process for new team members.
Take inspiration from the same guide from dbt Labs about modeling layers:
Layer folders sit at the top of your models directory (e.g., /staging, /intermediate, /marts)
Within each layer, files are organized either by source system in the staging layer (e.g., facebook, postgres, salesforce) or by business domain in the intermediate and marts layers (e.g., product, finance, marketing, core)
Files follow specific naming conventions, such as the stg_ prefix in the staging layer, int_ for intermediate models, and fct_/dim_ for marts
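Put together, a project following these conventions might look roughly like this (folder and model names are illustrative, loosely following the dbt Labs guide):

```
models/
├── staging/
│   ├── salesforce/
│   │   ├── stg_salesforce__accounts.sql
│   │   └── stg_salesforce__opportunities.sql
│   └── postgres/
│       └── stg_postgres__orders.sql
├── intermediate/
│   └── finance/
│       └── int_payments_pivoted_to_orders.sql
└── marts/
    ├── finance/
    │   ├── fct_orders.sql
    │   └── dim_customers.sql
    └── marketing/
        └── fct_campaign_performance.sql
```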
Action points:
❓ Review existing coding and naming conventions for your projects
❌ If you don't have any — it's time to design them! Sit with the team and discuss how you can standardize the naming and coding conventions
✅ Already have strong conventions? Try tools like SQLFluff, Ruff, or other specialized packages to check your existing codebase and enforce the standards within the team (a minimal config sketch follows below)!
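As a starting point, a minimal .sqlfluff config for a dbt project might look like this. Treat the dialect, templater, and rule choices as assumptions to adapt to your own stack (the dbt templater needs the sqlfluff-templater-dbt package):

```ini
# .sqlfluff: a minimal linter config sketch for a dbt project
# (dialect and rule choices are assumptions; adjust to your stack)
[sqlfluff]
dialect = snowflake
templater = dbt
max_line_length = 100

# Enforce one keyword capitalisation style across the codebase
[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper
```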
♻️ Reusability
The second vital pillar of a robust data modeling system is reusability.
Reusability lets you build new models on top of existing ones instead of duplicating logic. You leverage code that has already been built and tested, which means you write less new code overall.
Here are a few approaches to achieve reusability in your project.
📦 Build Once, Use everywhere
The first principle is very straightforward — when creating a data model, think ahead about potential usage of that model beyond your current scope.
Quite often you might find yourself needing the same functionality over and over again. So instead of copy-pasting the code, you could implement a model that can be reused many times.
The main principle of reusable models is modularity and single-purpose responsibility.
General rule: one model = one responsibility.
Don't build overly complicated models that encapsulate a lot of logic. Such models are over-engineered and may be hard to reuse due to complexity or performance issues.
A particular red flag is overloaded intermediate models. Intermediate models should be small building blocks, not final tables stuffed with every piece of business logic.
Not only models can be modularized; individual pieces of code can be too. For example, in dbt you can use Jinja macros to create snippets of repeated logic. Things like currency or timestamp conversions, complex CASE statements, etc. can be encapsulated and reused across the project.
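For instance, a small dbt macro can wrap a conversion you'd otherwise copy-paste across models. The macro name and rate logic below are made up for illustration:

```sql
-- macros/cents_to_usd.sql: one place for a repeated conversion
{% macro cents_to_usd(column_name, decimal_places=2) %}
    ROUND({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}

-- usage inside any model:
-- SELECT {{ cents_to_usd('amount_cents') }} AS amount_usd
-- FROM {{ ref('stg_payments') }}
```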
Action points:
❓ Search for data models in your project that are different yet have a lot of overlapping logic
❌ It's probably hard to refactor all the repeated models at once, so a good first step is simply creating a backlog of such models
✅ Start tackling the backlog as a second step, gradually cleaning up your DAG and making it more straightforward
😄 Fun fact: we used a metric called "number of screenshots required to capture a DAG" to measure how well we improved our models. We went from 8 to 3 screenshots after six months of work!
⬅️ Shift Left
The second popular approach to increase reusability is called "shift left."
This means implementing changes as early as possible in the data pipeline so they can be easily reused by all downstream models.
For example, if your “users” table contains first_name and last_name fields, and you frequently concatenate them to get the full_name in your data marts, this transformation is a good candidate to "shift left" to the staging model, making it available for everyone.
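A minimal sketch of that staging model (source and column names are hypothetical):

```sql
-- models/staging/stg_users.sql: full_name is derived once here,
-- so every downstream mart reuses the same definition
SELECT
    id AS user_id,
    first_name,
    last_name,
    CONCAT(first_name, ' ', last_name) AS full_name
FROM {{ source('app', 'users') }}
```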
While this may sound trivial, I often find myself reinventing the same formula or dimension repeatedly in data marts when the better solution would be to push this calculation as far left as possible in the DAG.
Great examples of the “shifted left” pieces:
Mathematical formulas (e.g., calculating tax rates)
Timestamp conversions and unification
CASE statement logic
Logic that requires JOINs (usually in intermediate models)
Another advantage of this approach is traceability. When all downstream processes use the same formula from an upstream source, it becomes much easier to troubleshoot and trace changes back to their source.
Action points:
❓ When working with data marts, start noticing similar (derived) columns that span multiple models, and check whether they come from a single upstream calculation
❌ If they don't, shift them left!
✅ In many cases, reusable logic can be shifted up to the staging layer, significantly reducing the number of copy-pasted pieces in downstream models
📝 Document your models
Including documentation in the Reusability section might seem unexpected, but stay with me on this one.
Engineers often create redundant models simply because they're unaware of existing models that could meet their needs. This is where comprehensive documentation prevents redundancy by making existing models more discoverable and easier to understand.
Effective documentation clearly states the model's purpose, applied filters, limitations, and edge cases. And by reusing existing models, you benefit from the experience of the original developer, who has likely already handled edge cases and unusual data behavior.
Of course, if an existing model covers only 10-20% of your requirements, creating a new one is preferable. When you do, document it and explain why you chose to create a new model instead of extending an existing one.
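In dbt, this kind of documentation usually lives in a YAML file next to the models. A small sketch (model and column names are hypothetical):

```yaml
# models/marts/finance/_finance__models.yml
version: 2

models:
  - name: fct_orders
    description: >
      One row per completed order. Test and cancelled orders are filtered out.
      Amounts are converted to USD upstream in staging.
    columns:
      - name: order_id
        description: Primary key of the order.
      - name: amount_usd
        description: Order amount in USD, rounded to 2 decimal places.
```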
Well-documented models, combined with proper organization (using the layers and naming conventions discussed earlier), can significantly reduce redundant code and enhance reusability across your analytics ecosystem.
Action points:
❓ Review the documentation coverage and its quality: do all models have descriptions that clearly explain their purpose?
❌ If documentation is sparse or unclear, prioritize documenting the most frequently used models first
✅ Consider implementing documentation conventions, such as always including the model's purpose, limitations, and usage examples
To be continued
In the next post I'll talk about the last two principles of robust data modeling: performance and data quality.
Make sure to subscribe here or on LinkedIn so you don't miss the second part!