This is the first in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. The rest of the lessons are as follows:
- If We Build It, Will They Come? Discusses the most important aspect of a data warehouse, business user input.
- Dimensions, It’s All Meaningless Without Them Covers the basics of building dimensions with examples.
- Facts, Show Me The Money! Discusses Fact table basics including the common types. Also covers crucial fact table aspects such as declaring and sticking to the grain.
- Resistance is Futile Brings it all together with descriptions of Star and Snowflake schemas. Includes inexpensive ways to learn these skills and how to convince your boss to sponsor a data warehouse initiative.
Welcome to those of you from SQL University attending this series of “classes” on Dimensional Modeling Fundamentals. If you are new to my blog, you may want to take a moment to read about me and why I blog.
This series of articles will serve as an introduction to Dimensional Modeling and Data Warehouses. While I am a SQL Server Business Intelligence DBA, the fundamentals of dimensional modeling are universal and platform agnostic. I’m a big proponent of Kimball Methodology, and most of what I will share with you is based on it.
What is Dimensional Modeling and what is a Data Warehouse?
Dimensional Modeling is a term for a set of techniques and concepts used to design data warehouses. Different people have their own definitions of the term data warehouse, but the one we’ll use is simple. A Data Warehouse is a relational database that is optimized for reporting. Typically, a data warehouse contains information from many different data sources and is loaded from those sources on a periodic basis.
Why do I need a Data Warehouse?
What would you do with a brain if you had one? – Dorothy, The Wizard of Oz
There are many reasons why you need a data warehouse. Let’s cover a few.
Your Database Design Sucks
Well, it sucks for reporting anyway. How can I make that blanket statement? Well, when most of us think of a database, we think of an On-Line Transaction Processing (OLTP) database whether we know it or not. This type of database is optimized for data entry: inserts, updates, and deletes. Part of this optimization is due to normalization. I won’t go into the gory details, but normalized databases are generally composed of narrow tables of closely related data, which link to other related tables to minimize redundancy. This results in a logical layout for the data, reduces the possibility of multiple values for the same attribute, and reduces database size.
For example, let’s assume you were designing the sales order tables of a database for manufacturing companies (like Made2Manage). You could not have all of your sales information in a single table. Logically, at the very minimum, you would require the following tables. I’ve listed some common fields at each level.
- Customer Table (SLCDPM) – customer name, discount, credit rating, billing address.
- Sales Order Header (SOMAST) – purchase order number, order date, shipping method.
- Sales Order Detail (SOITEM) – quantity ordered, unit price, part number.
Of course, I’m simplifying things. Even in a small database like Made2Manage, the sales order module encompasses more like 10 tables, as you can see from the simplified model below.
One of the tables is in red because it’s a Visual FoxPro table. I have seen production systems with as many as 15 to 20 tables involved in the sales process. While this may make sense in a transactional system, reporting from it can be nightmarish.
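To make the reporting pain concrete, here is a minimal sketch of the join gymnastics a normalized design forces on every report. It uses Python’s built-in sqlite3 so it runs anywhere; the table names echo the Made2Manage tables above, but the columns and data are simplified, hypothetical examples, not the real schema.

```python
import sqlite3

# Toy, hypothetical versions of the three normalized tables described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SLCDPM (cust_id INTEGER PRIMARY KEY, name TEXT, credit_rating TEXT);
CREATE TABLE SOMAST (so_id INTEGER PRIMARY KEY, cust_id INTEGER,
                     order_date TEXT, po_number TEXT);
CREATE TABLE SOITEM (item_id INTEGER PRIMARY KEY, so_id INTEGER,
                     part_no TEXT, qty INTEGER, unit_price REAL);
INSERT INTO SLCDPM VALUES (1, 'Acme Corp', 'A');
INSERT INTO SOMAST VALUES (100, 1, '2010-03-01', 'PO-9');
INSERT INTO SOITEM VALUES (1, 100, 'WIDGET-1', 5, 10.0),
                          (2, 100, 'WIDGET-2', 2, 25.0);
""")

# Even this toy "who ordered what, for how much" question needs two joins;
# a real system with 15 to 20 tables multiplies the pain.
rows = con.execute("""
    SELECT c.name, m.order_date, i.part_no, i.qty * i.unit_price AS ext_price
    FROM   SLCDPM c
    JOIN   SOMAST m ON m.cust_id = c.cust_id
    JOIN   SOITEM i ON i.so_id   = m.so_id
    ORDER  BY i.item_id
""").fetchall()
print(rows)
```

And remember, without declared foreign keys, even figuring out which columns to join on is left to the report writer.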
A dimensional model for a data warehouse is designed differently. We transform the standard OLTP structure into a simplified structure with two basic types of tables: fact tables, which generally contain measurements such as dollar amounts, and dimension tables, which contain descriptive attributes for those measurements. I’ll go into this in more detail later this week, but in the dimensional model, tables are combined and “flattened” out, which makes reporting easier for the users and faster for the optimizer.
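The same question against a flattened dimensional design takes a single join. This is a deliberately tiny sketch, again runnable via Python’s sqlite3; the DimCustomer/FactSales names and columns are hypothetical, chosen only to illustrate the fact/dimension split.

```python
import sqlite3

# A hypothetical two-table star: one fact table plus one "flattened" dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimCustomer (customer_key INTEGER PRIMARY KEY,
                          customer_name TEXT, state TEXT, credit_rating TEXT);
CREATE TABLE FactSales   (customer_key INTEGER, order_date TEXT,
                          quantity INTEGER, extended_price REAL);
INSERT INTO DimCustomer VALUES (1, 'Acme Corp', 'CA', 'A');
INSERT INTO FactSales VALUES (1, '2010-03-01', 5, 50.0),
                             (1, '2010-03-02', 2, 50.0);
""")

# One join from the fact table to each dimension answers most questions.
rows = con.execute("""
    SELECT d.customer_name, SUM(f.extended_price) AS total_sales
    FROM   FactSales f
    JOIN   DimCustomer d ON d.customer_key = f.customer_key
    GROUP  BY d.customer_name
""").fetchall()
print(rows)
```

Notice that descriptive attributes the OLTP design scattered across several tables now live together on the dimension row, which is exactly what makes drag-and-drop pivot reporting practical.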
Normalization isn’t the only problem with reporting from a typical OLTP database. Other common issues include:
- Lacking Primary and Foreign Key Relationships – That’s the reason the first illustration lacks the arrows. Made2Manage lacks primary and foreign keys, so most query programs will not pick up the relationships for the users automatically.
- Incomplete Normalization – Your database may repeat the customer name in both the customer master and sales order header tables. This leads to situations where the customer has changed names, but the sales orders reflect the previous name.
- “Bad” Field Names – While developers may understand that Fsono stands for sales order number, the users shouldn’t have to. Some databases have field names in foreign languages, or use meaningless names like F111, F143, etc. Notice that in the second illustration, the field names are in plain English.
- Indexing issues – As mentioned before, OLTP databases are optimized for data manipulation and indexes tend to slow those processes. However, when retrieving millions of records from 15 joined tables and constraining by multiple values, indexing is critical for performance.
- Hard Updates and Deletes / Auditing – Some OLTP databases update records in place which makes it difficult to track changes. They may also “hard delete” as well, meaning that they delete the record from the table rather than marking it with some flag indicating that it has been deleted. While the data might be accurate as of now, it becomes difficult to track history which causes auditors to have fits. Often auditing functionality can be built into the extract, transform, and load (ETL) process for loading a data warehouse.
- Improper Data Types – Databases may save data in the wrong data type, for example using the Varchar or Char data types for dates or numbers. The data is stored, but manipulating and reporting on it becomes difficult.
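The data-type problem in the last bullet is easy to demonstrate. Dates stored as text don’t sort or compare chronologically, which is why a typical ETL process casts them to a real date type once, at load time. A small sketch (the "M/D/YYYY" format here is just an assumed example):

```python
from datetime import datetime

# Dates stored in a Varchar column (here in an assumed "M/D/YYYY" style)
# sort alphabetically, not chronologically.
varchar_dates = ['1/9/2020', '12/1/2019', '2/10/2019']
alpha_sorted = sorted(varchar_dates)
print(alpha_sorted)  # '1/9/2020' sorts first, even though it is the latest date

# A typical ETL fix: parse to a real date type once, during the load.
iso_sorted = [d.strftime('%Y-%m-%d')
              for d in sorted(datetime.strptime(d, '%m/%d/%Y')
                              for d in varchar_dates)]
print(iso_sorted)    # now in true chronological order
```

Every report that touches the raw Varchar column has to repeat this conversion; doing it once in the ETL means no report ever does.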
Even if you are fortunate enough to have a well designed OLTP system, it is most likely not a good reporting platform.
Your Data Sucks
The consistency and completeness of data required in a data warehouse is much greater than in a typical OLTP database. These issues are typically corrected during the extract, transform, and load (ETL) process. For example, in a typical OLTP source database, you may run into any of the following:
- Blank or Null Values. If an address is lacking a country designation, the package will still likely be delivered. However, when reporting, blank values make analysis difficult.
- Inconsistent or misspelled values. Made2Manage doesn’t have any restrictions on its State fields. This allows the users to enter things like “IT” for Italy. Also, my database contains multiple versions of the same state name. The great state of California is represented as CA, Cali, California, Califrnia, etc. This is incredibly frustrating for reporting.
- Duplicate Entities. Many database systems allow users to accidentally enter duplicate entities like customers, contacts, etc. Reports that group by these entities are often inaccurate because of this.
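Cleanup rules for issues like these usually live in the ETL layer. Here’s a minimal sketch of one such rule for the State example above; the mapping table and the 'UNKNOWN' bucket are my own assumed conventions, not anything Made2Manage provides.

```python
# Hypothetical ETL cleanup rule: map known bad State values to a canonical
# code before loading the dimension table.
STATE_FIXES = {
    'CA': 'CA', 'CALI': 'CA', 'CALIFORNIA': 'CA', 'CALIFRNIA': 'CA',
}

def clean_state(raw):
    """Return a canonical state code, or 'UNKNOWN' so blanks and typos
    land in one auditable bucket instead of fragmenting every report."""
    key = (raw or '').strip().upper()
    return STATE_FIXES.get(key, 'UNKNOWN')

# The messy source values described above, plus a Null.
source_rows = ['CA', 'Cali', 'California', 'Califrnia', None, 'IT']
cleaned = [clean_state(s) for s in source_rows]
print(cleaned)
```

In a real warehouse this mapping would be a lookup table maintained alongside the ETL, and rows landing in the 'UNKNOWN' bucket would be flagged for data stewards to fix at the source.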
User Extracts Suck
Rest assured that if you aren’t extracting your data for analysis, your users are. They’re exporting reports from the source systems, using Excel to pull the data, etc. They’re most likely doing so with inefficient queries, and at inopportune times. Their interpretations of that data may be inaccurate and they often litter your servers with multiple copies that are at varying stages of obsolescence.
Multiple Data Sources Suck
Most larger companies have numerous legacy systems, which don’t “talk” to each other. Different systems for Customer Relationship Management (CRM), Shipping, Manufacturing, Accounting, Payroll, etc. can all be merged into one well designed data warehouse for analysis.
Another issue arises when you need to change your source OLTP system. Suppose you’re switching ERP products. Well, you’ve already spent forever creating dashboards and reports against the old system, and now you’ll have to do it all over again. Source systems may change, but a properly designed data warehouse doesn’t necessarily have to. If the business itself doesn’t fundamentally change, the data warehouse structure remains effective. You can simply change your ETL process to load your existing structure, or add functionality to your current data warehouse.
Date Grouping and Manipulation Sucks
Without the use of a date table/dimension, date grouping is a royal pain, especially if you use non-standard fiscal periods. My current employer uses 4-4-5 fiscal periods, so when grouping by fiscal period, I have to use a date table. Data warehouses almost always include a date dimension.
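The trick is that the fiscal calendar logic runs once, when the date dimension is loaded, so queries never compute it. A tiny runnable sketch via Python’s sqlite3; the table names, the FY2010-P1/P2 labels, and the period boundaries are all assumed for illustration, not a real 4-4-5 calendar.

```python
import sqlite3

# Sketch of a date dimension carrying a precomputed fiscal period column.
# The 4-4-5 boundaries are hypothetical; in practice the whole calendar is
# generated once, at load time.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimDate (date_key TEXT PRIMARY KEY, fiscal_period TEXT);
INSERT INTO DimDate VALUES ('2010-01-25', 'FY2010-P1'),
                           ('2010-02-01', 'FY2010-P2'),
                           ('2010-02-08', 'FY2010-P2');
CREATE TABLE FactSales (date_key TEXT, amount REAL);
INSERT INTO FactSales VALUES ('2010-01-25', 100.0),
                             ('2010-02-01',  40.0),
                             ('2010-02-08',  60.0);
""")

# Grouping by fiscal period is now a plain join: no date math in the query.
rows = con.execute("""
    SELECT d.fiscal_period, SUM(f.amount)
    FROM   FactSales f
    JOIN   DimDate d ON d.date_key = f.date_key
    GROUP  BY d.fiscal_period
    ORDER  BY d.fiscal_period
""").fetchall()
print(rows)
```

The same pattern handles holidays, week numbers, selling seasons, or any other attribute of a date: add a column to the dimension once, and every report gets it for free.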
The Report Process Sucks
First, let me tell you what the majority of your business users think about reports. When they want a report, they send a request to IT. It sits in the queue for weeks or perhaps longer. Eventually the DBA asks them a ton of questions which the business user thinks should be a matter of common sense. After several drafts, the approval is finally made and the user forgets why they requested the report in the first place. Reports evoke visions of green bar paper and dust. Wouldn’t it be nice for them to open Excel and simply drag fields into a pivot table, or fire up PowerPivot and create an interactive application? They want analytics, not reports.
And what about you, the DBA or Report Writer? How many times can you write the same report with slightly different groupings and parameters? It’s a never-ending treadmill, and that’s one of the reasons I pursued this path in my free time.
Data Warehousing Doesn’t Suck
A few years ago I attended a SQL Server Analysis Services class with the great Mike Hotek, and he started out with dimensional modeling. He spent less than 20 minutes discussing it, briefly describing dimension and fact tables. However, when he did, the clouds parted, a beam of light shone through, and I heard angels singing “aaaah”.
It’s a UNIX system! I know this! – Lex, Jurassic Park
Dimensional modeling just clicked with me and I instantly “got it.” I proceeded to purchase nearly every Kimball Group book on the subject and have been studying them ever since. Data warehousing is fascinating to me: it gets me out of the daily grind, and it’s good for the business as well. It’s like getting paid to work on puzzles all day long. In short, it’s full of win.
Tune in every day this week, and I’ll show you why.