Agile Data Warehouse Development using Biml

Agile development is good, Waterfall is bad. We hear this all the time, along with undocumented statistics claiming that roughly 80% of Data Warehouse projects fail. So, we as Business Intelligence Professionals are told that we must be Agile. Whenever I taught Dimensional Modeling and Data Warehouse classes, I could count on at least one student asking me how to apply Agile principles to this process. My answer, up until several months ago, was that true Agile Data Warehouse development was a pipe dream. The reason is fairly obvious if you follow the Kimball method. In a nutshell, when building Star or Snowflake Schemas, one should design, build, and load related Dimensions before Fact Tables. Dimension tables require far more time and resources to develop than Facts, and since each Fact Table, which contains your metrics, typically relates to a majority of the Dimensions, the first Fact Table load requires a great deal of time from the Data Warehouse development team and business stakeholders. After that first “wave,” if you’ll pardon the pun, additional Fact Tables and metrics can be delivered quite rapidly. The natural consequence of this is a form of Waterfall development.

We’ve always tried to ameliorate this problem by prioritizing business processes by their “Business Value” and feasibility. In other words, we typically chose to create our first Fact Table(s) for business processes that were worth a lot of money and were the easiest to build and load. We traded a true Waterfall, delivering the entire solution at the end, for a partial Waterfall. Less painful? Sure, but that doesn’t equate to Agile development.

Agile Data Warehouse Design

Several months ago Tim Costello (Blog/Twitter), a good friend of mine from the community, urged me to read the following book:

Buy This Book


Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema

I have a great deal of respect for Tim, so I read it and the book is excellent. It offers comprehensive tools and techniques for improving the entire process of designing a data warehouse. The book emphasizes quick iterations, each of which involves:

  1. ModelStorming – the authors’ new method of conducting business process interviews.
  2. JIT (Just In Time) Dimensional Modeling
  3. ETL Development
  4. Visual BI Delivery
  5. Revisions

These ideas and concepts are great and they address several issues with the traditional Data Warehouse Lifecycle. However, up until recently it was impossible to build a data warehouse this way.

Agile Data Warehouse Development?

In an ideal world, building a data warehouse would be an Agile process. The first sprint would deliver one or more Fact Tables, and possibly a couple of prototype or primitive Dimensions. In each successive sprint, more Fact Tables could be deployed, new Dimension tables could be added, and existing Dimensions could have additional attributes created. Existing tables would easily “adapt” to the changes. For example, existing Fact Tables would automatically be related to new Dimension tables, if appropriate. Ideally, this whole process would be fluid and quick. You know umm… Agile. However, anyone who develops ETL processes knows this is not the case.

Let’s consider the work involved in making just one small change, such as adding a new attribute to an existing Dimension. The process may involve the following steps:

  1. Alter the Dimensional Model if using a dedicated tool.
  2. Alter the Source Query(ies).
  3. Dependent objects, such as staging tables, stored procedures, and user views, must be altered as well.
  4. T-SQL must be written or generated to make these changes.
  5. The SSIS Package(s) must be edited.
  6. If SSAS Cubes are in use, they must be edited as well.
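
To make the manual effort concrete, here is a hedged T-SQL sketch of just steps 3 and 4 above. The table, view, and column names (`stg.Customer`, `dbo.DimCustomer`, `CustomerSegment`, etc.) are hypothetical, and this assumes the objects already exist; the SSIS and SSAS edits would still remain on top of this.

```sql
-- Hypothetical example: adding a CustomerSegment attribute to a Customer dimension.
-- Every dependent object must be altered by hand, one at a time.

-- Widen the staging table.
ALTER TABLE stg.Customer
    ADD CustomerSegment NVARCHAR(50) NULL;

-- Widen the dimension table, defaulting existing rows.
ALTER TABLE dbo.DimCustomer
    ADD CustomerSegment NVARCHAR(50) NOT NULL
        CONSTRAINT DF_DimCustomer_CustomerSegment DEFAULT (N'Unknown');
GO

-- Recreate the user-facing view so reports can see the new attribute.
ALTER VIEW rpt.vDimCustomer
AS
SELECT CustomerKey,
       CustomerName,
       CustomerSegment
FROM dbo.DimCustomer;
```

And that is before touching the stored procedures, the SSIS Package(s), or the cube.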

In traditional ETL development, none of the objects above reacts dynamically to the change; each must be handled separately. And that is only scratching the surface. Best practices suggest keeping all of these objects under Source Control, which is very time consuming. All of these changes should also be fully tested in a separate environment before being deployed to production. So, this single small incremental improvement could require several days (perhaps weeks) of development work. Why, then, did I preface this section with “until recently”?

Biml Facilitates Agile Data Warehousing

“BimlScript enables developers to build and load a Data Warehouse nearly as fast as they can draw the boxes and lines.”

Using a quality BimlScript framework eliminates a lot of the pain, tedium, and costs involved in Data Warehouse Development and ETL. How would the same change from above be accomplished using a good BimlScript framework?

  1. Alter the Dimensional Model if using a dedicated tool.
  2. Alter the Source Query(ies).
  3. Automatically import the changes into Biml.
  4. Automatically regenerate every single object listed above including staging tables, user views, SSIS Packages, SSAS Cubes, etc.
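
A minimal sketch of what that metadata might look like in Biml (table, schema, and column names are hypothetical, and a real project would also declare the connections and databases this references). The point is that the single new attribute is declared once, and a BimlScript framework expands it into the staging tables, packages, and cube objects:

```xml
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Tables>
    <Table Name="DimCustomer" SchemaName="DW.dbo">
      <Columns>
        <Column Name="CustomerKey" DataType="Int32" />
        <Column Name="CustomerName" DataType="String" Length="100" />
        <!-- The sprint's only hand-made change: one new attribute.
             Everything downstream is regenerated from this definition. -->
        <Column Name="CustomerSegment" DataType="String" Length="50" />
      </Columns>
    </Table>
  </Tables>
</Biml>
```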

Well, what about Source Control? If the BimlScript framework automatically generates all of those objects above from the Dimensional Model and its metadata, what really needs to be checked into Source Control? The Dimensional Model and the BimlScripts. Since the only object that was altered during this “sprint” was the Dimensional Model itself, that’s the only object which needs to be checked in and out of Source Control.

What about testing? If the results of your BimlScripts have been tested in prior sprints, and you didn’t change any of that code, how much QA testing is really required? I would argue that only a cursory amount of testing would be required to validate your metadata.

In short, BimlScript is the perfect companion to the Agile design methods proposed in that book. Stay tuned to the blog for more details.
