Self-service data modeling tools like Power BI Desktop, Power Pivot and SSAS Tabular promote the advantage of using data from multiple data sources. I’m including SSAS Tabular on the “self-service” list because it is derived from Power Pivot and often used for “quicker” IT-managed BI projects. The alleged advantage of sourcing data directly from within the modeling tool might seem to suggest that a formal ETL process and dimensional modeling exercise is unnecessary. The more experience I gain with these newer tools in business settings has me ever thinking about using best practices and the right approach for scaling BI solutions of all sizes.
The bottom line, before getting into the weeds on this topic, is that there are cases where I think it makes perfect sense to skip the usual steps of staging, transforming and modeling data into a relational data mart. Consequently, by using this approach, you are also removing the safety net and limiting options to scale and expand the solution. I love tools like Power Query and Power BI Desktop to develop simple BI solutions quickly. Perhaps a related topic for another time but always consider options to incorporate enterprise scale with modern analytics. Solutions always grow.
The Self-Service Modeling Paradigm
More and more of the projects I inherit from consulting customers and previous developers resemble the following diagram:
In less-formal solutions, the “ODS” layer in this diagram may be little more than a small database on an existing server with lookup tables populated by a weekly scheduled SSIS package or stored procedure. A lot of these projects are either departmental or the result of a self-service project handed-off to the IT organization to be “right-sized” and transitioned from the desktop to a server-hosted solution.
Power Pivot, Power BI Desktop or SSAS Tabular are often used to mash-up data from multiple data sources. Some of the data may be staged and transformed to simplify a complicated query or to work-around security restrictions from the original data source. The data feeding multiple tables in the model arrives in separate streams where the keys then need to be manipulated in order to match-up fact table keys to identifying keys in the lookup/dimension tables.
What’s the Problem With This Approach?
In some solutions with a manageable scale and a reasonable tolerance for a certain amount of data loss and inconsistency, this approach may be just fine. There are very good reasons for inconsistencies between sets of data that come from different sources. When 20 million fact rows are obtained from an online ordering system and .1% don’t have matching customers records that come from the system used to manage the master customer list, it may simply be a question of timing. Or, maybe the definition of “customer” is slightly different in the two systems. Chances are that there are also legitimate data quality issues affecting a small percentage of these records.
Whatever the case may be, a data conformity or potential data quality issue in this design scenario falls on the model designer to either fix or ignore. Again, this may or may not be a concern but it is a decision point and might raise the question of ownership when questions about data quality are raised.
Before I continue, let’s revisit how this common issue is handled in more conventional Business Intelligence solutions…
Managing Data Conformity in Traditional BI
Here’s a quick overview of the conventional Business Intelligence approach. For the past twenty years or so, we’ve promoted strict dimensional modeling principles that can be summarized in the following diagram. This diagram is my interpretation of The Kimball Method, applied to my experience and tools available from Microsoft. I learned to orchestrate BI solutions using this approach; however tedious, time-consuming and expensive. This is my frame of reference for any BI project. Many variables can influence adaptations from this model such as product choices, big data sources, real-time analysis, data volume, budget or the stakeholder’s realm of influence or ownership – but this is my starting point for a BI project done “right”.
If we take this general approach and scale it down to a departmental or business unit level, where an enterprise data warehouse doesn’t exist to satisfy analytic reporting requirements; we commonly end-up with a solution that looks more like the following diagram. Data is still procured from multiple sources through extract and transform processes, loaded into a staging database and/or directly into a dimensional data mart. Even when using an Analysis Services Tabular project to build the semantic model, the model only has one data source with all the fact and dimension tables populated from the relational data mart. The dimensionalized data is then surfaced through a semantic model and then presented through ad-hoc analysis, analytic and operational reports, and executive dashboards.
Different Solution Architecture – Same Problem
The most significant difference between the previous two diagrams and the one I started with is that the solution developer has control over the data before it enters the semantic model. Data conformity and quality can be managed in different places which include:
- At the source
- In the ETL process after source data is staged, and before it enters the data warehouse/data mart
- Within the semantic model
In theory, fixing “bad” or non-conforming data at the source seems to be ideal for everyone. In a utopian setting, the business can incorporate a master data management solution to let data stewards correct erroneous or unmatched records and then feed them back into source systems. Operational reports produced by business applications should match analytical reports and dashboards from the data warehouse/data marts and the models they feed. Please don’t misunderstand – many businesses incorporate MDM solutions and they work well. The reality in most scenarios is that this is a difficult undertaking. It requires business process changes and an enormous level of cooperation and coordination across business units, leadership and interdependent teams.
Weeding out bad records and making corrections as data flows downstream to the data warehouse is often a practical trade-off that has many advantages. Reports produced from the data warehouse or data marts should match-up to analytic reports and dashboards fed by semantic models which uses congruent data.
Any provisions to compensate for data anomalies within the data model will produce some degree of variation between reports produced upstream and analytic reports produced from the semantic model. This may be acceptable as long as users understand and expect certain discrepancies. Reconciling financial results and auditing to a narrow margin can be a difficult and time-consuming exercise. Often times user confidence can be maintained when the analytic model can produce the same bottom-line figure and discrepancies are evident only in different grouping buckets or categories – like a “catch-all” or “Unknown” member for records that don’t match.
How Analysis Services Multidimensional and Tabular Deal With Data Non-Conformity
SQL Server Analysis Services (which we now call “Multidimensional SSAS”) has handled mismatched keys the same way for several years. When fact rows are processed with key values that don’t match a related dimension, errors are reported and the model processing fails. The developer must make a deliberate exception for these records to match a specially-generated “Unknown” member of the dimension. Under the default behavior, a solution developer following The Kimball Method would have defined Unknown member records in every dimension table and then mapped the unmatched keys in the ETL process.
For the purpose of this discussion and as far as we’re concerned; SSAS Tabular, Power Pivot and Power BI Desktop are the same. All these tools, which are derived from different versions of the same code base, all deal with mismatched keys the same way.
SSAS Tabular is significantly different from multidimensional in that unmatched fact rows simply flow through model processing and a blank member is generated in each affected dimension table. Visually, this shows up as an aggregated value for an unlabeled heading. Casual users will miss this and won’t notice that a certain number of records have fallen through the cracks. There are currently no properties or settings to change this behavior. In fact, a Connect Item filed back in Feb 2014 was dismissed as a low priority suggestion. I’m not debating the issue but just informing you that this is how it works. I’d personally like to at least have the option to label the blank dimension record as “Unknown”. I’ve found some creative ways to work-around this limitation. The best documented approach is this one that Javier Guillen posted in 2012.
One of the reasons I like SSAS Tabular is that it’s simple. But, “simple” means that is has less “stuff”, less “clutter” and fewer features than its more complicated ancestor. I supposed that if every SSAS practitioner were able to get the product developers to add their favorite feature, we might end up right where we started – and that’s a dilemma. This is an over-simplification as there is always room for product improvement.
This is a topic worthy of deeper discussion so it’s on the list of items for me to explore in follow-up posts. Please let me know what you think, post your questions and provide some feedback.