Archives

If We Build It, Will They Come?

Note: This is the second in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. Click here to read the rest of the series.

Creating a data warehouse is a daunting process. It’s not something you just whip out in a weekend. So, let’s get a few things straight to start with.

Who Designs the Data Warehouse (Dimensional Model)?

If you build it, will they come? It seems obvious that you design the data warehouse, that’s why you’re taking your time read this blog, or maybe you’re just fascinated with my movie quotes. But seriously, who designs this thing? The business users.

The business users have a large part in designing a data warehouse. This isn’t an IT process, its a business process. The subjects of the data warehouse are decided by the business, as well as what information needs to be kept on each. In reality the data warehouse exists to assist users in making better decisions. The data warehouse architect is there to assist and guide in this process. One sure way to fail is to have IT department centered data warehouse because the users won’t see value in it.

Notice that I didn’t suggest that you should ask the business users what they want. If you do so, you are handing over the design role to the business user. They don’t know what they want because they don’t know what you provide. They will tell you something, but because they don’t understand what you can do, it will be wrong. It’s up to you to bridge the gap and learn the business to so you can design a successful dimensional model.

So, the business users don’t understand tables, keys, referential integrity, etc. Part of the data architect’s job is to educate the users about what can be done so they can provide better input into the process. So, how can you teach them the basics of fact and dimension tables? The same way I’m going to teach you.

Let’s start out with pivot tables, which most users understand. Ask the user for a typical reporting requirement that they might have. Suppose they say, “I want to know incoming order amounts by fiscal year, period, and product category.” Well, the most users understand how to use Excel pivot tables so that’s the method we’ll use.

Open Excel, query the data, and create a pivot table. The user may already know that they need to drag the amount to the Values box, and already know how to pull the values for year, month, and product category to the columns and rows as shown. The values, or the column you want to measure, are your facts. The things you measure facts by (by fiscal year, period, and product category) are your dimensions. Dimensions and facts go in separate tables, but the users may never see them. Now that they understand that, what’s next?

How do we model the model?

bus_matrix_example

Do you want to know what the Matrix is Neo? – Morpheus The Matrix

Start with the Kimball Bus Matrix, which is kind of like a high level road map of your project. Along the left hand side are the business processes which bring value to your organization. The columns are the descriptive objects (dimensions) for your organization. If a dimension is involved with a specific business process, place an X in the intersecting cell. Notice that I have added a column called “Business Priority” which I use to indicate the order in which I will handle each process.

Typically the Bus Matrix is much larger because enterprises typically have more business processes. The example presented is a simplified version demo purposes.

How do we determine which order to tackle the business processes?

Well, the first business process takes the longest. You’re going to find that dealing with the dimensions typically takes a lot longer than fact tables. When you work on your first business process, you will need to create all of the dimensions associated with it. In my example above, if we start with Incoming Orders, every dimension but Vendors has to be created. However, each additional business process becomes easier because we re-use the already existing dimensions.
business_process_order
So, which business process do you start with? Start with the low hanging fruit. You want to pick a process which is relatively easy and has high business value. Perhaps that value represents money savings, increasing sales opportunities, or something that is a mandated compliance issue. Once again, all of this is considered with a focus on the business. We want a quick return on investment so we continue to have the support of management.

So, you’ve identified the business process and the facts and dimensions associated with it. Note that we haven’t started fleshing out the dimension fields with the descriptive data. That’s what we’ll handle next when I show you the basics of modeling dimensions.

SQL University - Dimensional Modeling and Why It Doesn't Suck

This is the first in a series of five “lessons” which I created for SQL University on the topics of Dimensional Modeling and Data Warehousing. The rest of the lessons are as follows:

SQL University

Welcome to those of you from SQL University attending this series of “classes” on Dimensional Modeling Fundamentals. If you are new to my blog, you may want to take a moment to read about me and why I blog.

This series of articles will serve as an introduction to Dimensional Modeling and Data Warehouses. While I am a SQL Server Business Intelligence DBA, the fundamentals of dimensional modeling are universal and platform agnostic. I’m a big proponent of Kimball Methodology, and most of what I will share with you is based on it.

What is Dimensional Modeling and what is a Data Warehouse?

Dimensional Modeling is a term for a set of techniques and concepts used to design data warehouses. Different people have their own definitions of the term data warehouse, but the one we’ll use is simple. A Data Warehouse is a relational database that is optimized for reporting. Typically they contain information from many different data sources and are loaded from those sources on a periodic basis.

Why do I need a Data Warehouse?

What would you do with a brain if you had one? – Dorothy. The Wizard of Oz

There are many reasons why you need a data warehouse. Let’s cover a few.

Your Database Design Sucks

Well, it sucks for reporting anyway. How can I make that blanket statement? Well, when most of us think of a database, we think of an On-Line Transaction Processing (OLTP) database whether we know it or not. This type of database is optimized for data entry: inserts, updates, and deletes. Part of this optimization is due to normalization. I won’t go into the gory details, but normalized databases are generally composed of narrow tables of closely related data, which link to other related tables to minimize redundancy. This results in a logical layout for the data, reduces the possibility of multiple values for the same attribute, and reduces database size.

For example, let’s assume you were designing the sales order tables of a database for manufacturing companies (like Made2Manage). You could not have all of your sales information in a single table. Logically, at the very minimum, you would require the following tables. I’ve listed some common fields at each level.

  1. Customer Table (SLCDPM), which would contain customer name, discount, credit rating, billing address.
  2. Sales Order Header (SOMAST)- Purchase Order Number, Order Date, Shipping Method.
  3. Sales Order Detail (SOITEM)- Quantity Ordered, Unit Price, Part Number.

Of course, I’m simplifying things. Even in a small database like Made2Manage the sales order module encompases more like 10 tables as you can see from the simplified model below.

One of the tables is in red because its a Visual FoxPro table. I have seen production systems which have as many as 15 – 20 tables involved in the sales process. While this may make sense in a transactional system, reporting from it can be nightmarish.

A dimensional model for a data warehouse is designed differently. We transform the standard OLTP structure into a simplified structure with two basic types of tables. Fact tables generally contain measurements such as dollar amounts, and dimensions which contain descriptive facts about them. I’ll go into this in more detail later this week, but in the dimensional model tables are combined and “flattened” out which makes reporting easier for the users and faster for the optimizer.

After being dimensionalized, the tables look similar to the following:

  • Lacking Primary and Foreign Key Relationships – That’s the reason the first illustration lacks the arrows. Made2Manage lacks primary and foreign keys. Therefore, most query programs will not pick up the relationships for the users automatically.
  • Incomplete Normalization – Your database may repeat the customer name in both the customer master and sales order header tables. This leads to situations where the customer has changed names, but the sales orders reflect the previous name.
  • “Bad” Field Names – While developers may understand that Fsono stands for sales order number, the users should not have to see that. Some databases have field names that are in foreign languages, or use field names like F111, F143, etc. Notice that in the second illustration, the field names are typically in real English.
  • Indexing issues – As mentioned before, OLTP databases are optimized for data manipulation and indexes tend to slow those processes. However, when retrieving millions of records from 15 joined tables and constraining by multiple values, indexing is critical for performance.
  • Hard Updates and Deletes / Auditing – Some OLTP databases update records in place which makes it difficult to track changes. They may also “hard delete” as well, meaning that they delete the record from the table rather than marking it with some flag indicating that it has been deleted. While the data might be accurate as of now, it becomes difficult to track history which causes auditors to have fits. Often auditing functionality can be built into the extract, transform, and load (ETL) process for loading a data warehouse.
  • Improper Data Types – Databases may save data in the wrong data type. For example, they use the Varchar or Char data types for dates, numbers, etc. While the data is saved, this can make data manipulation and reporting difficult.

Even if you are fortunate enough to have a well designed OLTP system, it is most likely not a good reporting platform.

Your Data Sucks

The consitency and completeness of data required in a data warehouse is much greater than typical OLTP databases. These issues are typically corrected during the extract, transform, and load (ETL) process. For example, in a typical OLTP source database, you may run into any of the following:

  • Blank or Null Values. If an address is lacking a country designation, the package will still likely be delivered. However, when reporting blank values make analysis difficult.
  • Inconsistent or mis-spelled values. Made2Manage doesn’t have any restrictions on its State fields. This allows the users to enter things like “IT” for Italy. Also, my database contains multiple versions of the same state name. The great state of California is represented as CA, Cali, California, Califrnia, etc. This incredibly frustrating for reporting.
  • Duplicate Entities. Many database systems allow users to accidentally enter duplicate entities like customers, contacts, etc. Reports that group by these entities are often inaccurate because of this.

User Extracts Suck

Rest assured that if you aren’t extracting your data for analysis, your users are. They’re exporting reports from the source systems, using Excel to pull the data, etc. They’re most likely doing so with inefficient queries, and at inopportune times. Their interpretations of that data may be inaccurate and they often litter your servers with multiple copies that are at varying stages of obsolescence.

Multiple Data Sources Suck

Most larger companies have numerous legacy systems, which don’t “talk” to each other. Different systems for Customer Relationship Management (CRM), Shipping, Manufacturing, Accounting, Payroll, etc. can all be merged into one well designed data warehouse for analysis.

Another issue arises when you need to change your source OLTP system. Suppose you’re switching your ERP system product. Well, you’ve already spent forever creating dashboards and reports on that system and now you’ll have to do it all over again. Source systems may change, but a properly designed data warehouse doesn’t necessarily have to. If the business itself doesn’t fundamentally change, then the data warehouse structure remains effective. You can simply change your ETL process to load your current structure or add additional functionality to your current data warehouse.

Date Grouping and Manipulation Sucks

Without the use of a date table/dimension date grouping is a royal pain, especially if you use non-standard fiscal periods. My current employer uses 4-4-5 fiscal periods. Therefore, when grouping by fiscal period, I simply have to use a date table. Data warehouses almost always have a date dimension.

The Report Process Sucks

First, let me tell you what the majority of your business users think about reports. When they want a report, they send a request to IT. It sits in the queue for weeks or perhaps longer. Eventually the DBA asks them a ton of questions which the business user thinks should be a matter of common sense. After several drafts, the approval is finally made and the user forgets why they requested the report in the first place. Reports evoke visions of green bar paper and dust. Wouldn’t it be nice for them to open Excel and simply drag fields into a pivot table or fire PowerPivot up and create an interactive application? They want analytics, not reports.

And what about you, the DBA or Report Writer? How many times can you write the same report with slightly different groupings and parameters? It’s a never ending treadmill and that’s one of the reasons I pursued this path in my free time.

Data Warehousing Doesn’t Suck


A few years ago I attended a SQL Server Analysis Services class with the great Mike Hotek, and he started out with dimensional modeling. He spent less than 20 minutes discussing it, briefly describing dimension and fact tables. However, when he did the clouds parted and a beam of light shown through, and I heard angels singing “aaaah”.

It’s a UNIX system! I know this! – Lex. Jurassic Park

Dimensional modeling just clicked with me and I instantly “got it.” I proceeded to purchase nearly Kimball Group book on the subject and have been studying it ever since. Data warehousing is fascinating to me as it gets me out of the daily grind, and it’s good for the business as well. Its like getting paid to work on puzzles all day long. In short, its full of win.

Tune in every day this week, and I’ll show you why.

Upcoming Presentation and Blog Events

I mentioned in the last article that I am taking a temporary break from the T-SQL Basics Series of articles. I’ll return to it in a few weeks, but in the meantime I’ve been working other stuff for your amusement.

Next week Wednesday, I’ll be presenting at the Fort Worth SQL Server User Group. Come see me present:

Zombie Database Survival Guide: Implementing PowerPivot with Zombie Databases

Do you have a Zombie Database? Does your database suffer from roaming hordes of problems like the lack of primary and foreign keys, poor normalization, in place updates and deletes, and an insatiable drive to eat your braaaaaaaaaaaaaains (CPU cycles).

Fear not fellow survivor, for I’ll show you how to control the zombie apocalypse and prepare your data for use with PowerPivot, a free add on for Excel 2010. I’ll show you how to use PowerPivot to slice and dice your data and keep your users safe from the zombie databases which threaten their very way of life.

A great time will be had by all.

SQL University

My good friend Jorge Segarra (Blog/Twitter) aka SQLChicken has asked me to become a professor at SQL University. Next week I’ll present a series of articles on Dimensional Modeling Fundamentals.

Tune in and you learn the following:

  • Why your database sucks, and what you can do about it.
  • The reasons you actually have to listen to business folks.
  • What’s with all the dimensional modeling techno-babble? What does it really mean?
  • How to break a Data Warehouse project into digestible chunks so you don’t get overwhelmed from the start.
  • What you really need to know about Dimension and Fact Tables.

There will be chills, thrills, and most likely spills; charts, pretty pictures, dancing girls…. Well, maybe not dancing girls, but I’ll share my views on Dimensional Modeling.

So, attend my presentation in Forth Worth if you can, and tune in next week to see if I can’t get myself disbarred from SQL University.

T-SQL Basics: Different Types of Joins

In this seventh article of the T-SQL Basics Series, we will be covering both the basics as well as some specific issues with SQL Joins. If you’ll recall our last lesson, we wrote a sales query for Master Po. He then requested that we add customer information to the report.

Note, after receiving e-mails from non-M2M users complaining that they couldn’t follow along, I’ve decided to switch to using a database that everyone can use. So, I’ve created a quick database for our Kung Fu Temple Gift Shop. If you want to follow along, please run the following code on a test server: Kung Fu Database Script. It will create a small and imperfect database for us to use in this series.

Prerequisite Reading

Before we begin this lesson, I ask that you read an ancient manuscript by Master Jeff Atwood (Blog/Twitter). Yes, I know he belongs to the developer’s temple, but he’s already written the excellent primer, A Visual Explanation of SQL Joins. Please read the text and come back when you’re finished so I can fill in the details. Go ahead Grasshopper, I’ll wait.

Additional Wisdom about Joins

I love the primer because of his use of Venn Diagrams. The first time I worked with SQL Joins, memories of middle school math lessons about Venn Diagrams popped into my mind. While nothing he wrote is incorrect, I wish to add to it and flesh out your knowledge of SQL Joins.

Let me give you a brief run down of the tables in the Kung Fu Temple Gift Shop Database, and how they are related. The tables we will be using are listed below. I’ve listed the Primary Keys, but have not yet set up Foreign Keys relationships. We’ll discuss those in a later lesson.

Table NamePrimary Key
CustomersCustomerID
ProductsProductID
OrderMasterOrderID
OrderDetailsOrderID, LineNo

Inner Join

The Inner Join is most commonly used and is in fact the default join type. This type of join only returns records which match from both tables. There are two basic ways to write it. Let’s link the Order Master and Order Items tables and return a few fields from each.

Select
  OM.OrderID
 ,OM.CustomerID
 ,OM.OrderDate
 ,OD.[LineNo]
 ,OD.ProductID
 ,OD.OrderQty
From
  dbo.OrderMaster OM
 ,dbo.OrderDetails OD
where
  OM.OrderID = OD.OrderID

Does that code look strange to you? Well, if you’ve been following my writings, it should. Some of the M2M Reports, which I’ll cover in a later lesson, use this old syntax and there is technically nothing wrong with writing it. The optimizer processes it the same way as the syntax I will cover in a minute and it runs just as fast. However, Master Po has forbidden it as an affront to the laws of nature, so you will never see me use it again. Why? I’m glad you asked.

  1. It’s hard to read. This example looks simple because it only includes two tables. However, when more tables are involved its just too easy to forget one of the Where clause conditions. Therefore, using this joining method is…..
  2. Dangerous. If you forget to include a table join in the Where clause, you will create a cartesian product with that join. A cartesian product returns every combination of each table where the join is either missing or invalid. For example, in our Kung Fu Gift Shop database there are 4 records in Order Master and 5 in Order Items, if we neglect to add the Where clause, the recordset jumps to 20 (4 * 5) records. In a production database with millions of rows in each table, the result set becomes ridiculously large. Also, every additional join you miss exponentially increases your recordset.

For these reasons we will explicitly the Join this way:

SELECT
  OM.OrderID
 ,OM.CustomerID
 ,OM.OrderDate
 ,OD.[LineNo]
 ,OD.ProductID
 ,OD.OrderQty
FROM
  dbo.OrderMaster OM INNER JOIN dbo.OrderDetails OD ON OM.OrderID = OD.OrderID

In T-SQL the use of the word Inner is optional as well. I personally add it, but many don’t.

SELECT
  OM.OrderID
 ,OM.CustomerID
 ,OM.OrderDate
 ,OD.[LineNo]
 ,OD.ProductID
 ,OD.OrderQty
FROM
  dbo.OrderMaster OM JOIN dbo.OrderDetails OD ON OM.OrderID = OD.OrderID

Please note that you can mix both styles in the same query, but I don’t recommend it for the same reasons.

SELECT
  OM.OrderID
 ,OM.CustomerID
 ,OM.OrderDate
 ,OD.[LineNo]
 ,OD.ProductID
 ,OD.OrderQty
FROM
  dbo.Products PRO, dbo.OrderMaster OM -- Do Not do this. 
  JOIN
    dbo.OrderDetails OD
  ON OM.OrderID = OD.OrderID
WHERE
  PRO.ProductID = OD.ProductID -- Join between Products and Order Details.

Outer Joins – Left, Right, and Full

Left Outer Join

The Left Outer Join returns all of the records (not filtered by a where clause) from the field to the left of the Join and then any corresponding records from the table to the right of the Join. The records from the first (left) table are returned even if there are no corresponding records in the other table. As with other Joins the word Outer is optional as well. Left Outer Join works the same as Left Join.

For example, lets say that Master Po wants a list of all of the Kung Fu Temple Gift Shop’s customers and any of the orders they have placed. You could write this query two ways. First, let’s use a Left Outer Join.
Option 1 Left Outer Join

SELECT
  CUST.CustomerID
 ,CUST.Title
 ,CUST.CustFirstName
 ,CUST.CustLastName
 ,OM.OrderID
 ,OM.OrderDate
 ,OM.DueDate
 ,OM.PONo
FROM
  dbo.Customers CUST
  Left JOIN dbo.OrderMaster OM
    ON CUST.CustomerID = OM.CustomerID

Take a look at the first couple rows of the result set for this query. 5 Total Records were returned.

My name is listed twice because I placed 2 orders. Also, notice the Null values for Master Kan? The values for Order Master fields is null because there is a customer record for him but he hasn’t placed any sales orders.

Let’s look at the same query with an Inner Join. 61 Total Records were returned, one less than the previous query because we only have 1 customer which hasn’t placed orders.
Option 2 – Inner Join

SELECT
  CUST.CustomerID
 ,CUST.Title
 ,CUST.CustFirstName
 ,CUST.CustLastName
 ,OM.OrderID
 ,OM.OrderDate
 ,OM.DueDate
 ,OM.PONo
FROM
    dbo.Customers CUST
  INNER JOIN
    dbo.OrderMaster OM
  ON CUST.CustomerID = OM.CustomerID

When run, the record for Master Kan is removed. So, which is right? Well, it depends on what Master Po meant. Does he want to include customers which have no orders? Option 1 includes all customers whereas option 2 omits customers without orders. I used this particular example to emphasize that you need to learn to ask the right questions in order to get the right answers. In this case, we’ll stick with the Left Join.

What if Master Po wants a list of only those customers who never placed an order? We can achieve this using a Left Join and adding “WHERE OM.OrderID IS NULL”. In this case, only one record shows. Do you remember seeing that type of join before? I used it to find orphan records in your M2M Database. M2M Support still hasn’t forgiven me for posting that as so many of you called them asking them to fix your database. I’ll cover more about Left Join usage and problems with Nulls in a future lesson.

Right Outer Join

Take everything I just said for Left Outer Join and reverse it. Period. In fact, I have NEVER written a Right Outer Join. There isn’t a reason that why one would need to use one. There’s a rumor that persists that Left Joins are faster than Right Joins, but that isn’t true either. I only use Left Joins for consistency sake, and I suggest you do as well.

Full Outer Join

A Full Outer Join is returns all rows from both tables regardless of whether there is a match. In a way, it combines a Left and Right Outer Join. It’s not all that commonly used. The most common instance in which I use it is to compare the results of two queries. For example, let’s compare simplified versions the Inner Join and Outer Join queries I listed previously.

SELECT First.CustomerID FirstCustID,
       First.OrderID FirstOrderId,
       Second.CustomerID SecondCustID,
       Second.OrderID SecondorderID
  FROM    (SELECT CUST.CustomerID, OM.OrderID
             FROM    dbo.Customers CUST
                  INNER JOIN
                     dbo.OrderMaster OM
                  ON CUST.CustomerID = OM.CustomerID) First
       FULL OUTER JOIN
          (SELECT CUST.CustomerID, OM.OrderID
             FROM    dbo.Customers CUST
                  LEFT JOIN
                     dbo.OrderMaster OM
                  ON CUST.CustomerID = OM.CustomerID) Second
       ON First.OrderID = Second.OrderID
 WHERE    First.CustomerID <> Second.CustomerID
       OR Second.CustomerID IS NULL
       OR First.CustomerID IS NULL


I know it looks complicated, and the results are underwhelming, but we’re working with a small, simplistic data set. Basically I took the first query and creatively aliased it as “First”, the second as “Second”, put them together with a Full Outer Join, and then looked for any differences in their CustomerID fields. Don’t be thrown off by the “Is Null” statement as we’ll be covering Nulls in detail in a future article.

Another real world example where I use Full Outer Joins is if I’m changing report logic and need to prove that the results are the same. I’ll join the results of the first query, with my new query, and look for differences. There are also some 3rd party products which do that kind of comparison automatically.

Non Equi Joins

All of the examples we’ve used so far are Equi Joins because the join conditions have always been equal. Table A Inner Join Table B on Field A = Field B. Non Equi Joins are not equal, greater than, less than, etc. One common use for Non Equi Joins is joining dates to a table of fiscal periods. I’ll cover dates and calendar tables in detail later, but let’s assume that we had a table with a list of fiscal periods with a begin and end date for each period (Like M2M’s GLRULE table). In psudo-code you would join them like this:

SELECT *
  FROM    dbo.OrderMaster OM
       INNER JOIN
          FiscalPeriods FIS
       ON OM.OrderDate >= Fis.BeginDate AND OM.OrderDate < Fis.BeginDate

SQL Statements in Review

So far in this series, we have discussed the basics of T-SQL Syntax, importance of coding consistently, elements of the Where clause, and summarizing and grouping data. Let’s take a moment to review the order of statements in a SQL clause as a way of closing this series of lessons in the T-SQL Basics Series.

  1. Select – followed by a list of fields, functions, etc. and should be properly aliased.
  2. From – followed by the tables to be used in the query and how they are to be joined.
  3. Where – followed by the conditions by which individual records are to be filtered out.
  4. Group By – followed by fields by which records are to be grouped. Remember that order is important.
  5. Having – followed by the conditions by which the groups are to be filtered out.
  6. Order By – followed by the fields by which the final result set is to be ordered.

This is not the end of our lessons in T-SQL. In future installments I’ll discuss in more depth topics like sub-queries, the challenges associated with Nulls, common functions and their usage, working with dates and date tables, as many others. Patience Grasshopper, we’ll get there.

What other topics would you like me to cover?

T-SQL Basics - Ordering, Summarizing, and Grouping Data the Kung Fu Way

Today Grasshopper, I will teach you the basics of ordering, summarizing, and grouping your data to give it meaning. Let’s say that the Kung Fu Temple Gift Shop has had poor sales. Master Po wants a list of sales details from you, but he wants them ordered by the part number.

Note for the non Made2Manage people, I am using the M2M Educational Database. Sorels is the Sales Order Release table. I’m using it because it is the sales table which has dollar values. I’ve attempted to document the code with comments so students from other temples can follow the lesson.

Also, all of the screenshots are the data grids produced by Toad for SQL Server which recently released a new beta. Anyone can use the full beta until March 3rd.

Order By

The Order By clause does just what the name suggests, orders records. The Default order is ascending (ASC), but you can also return them in descending (DESC) order as well.

SELECT
  SOR.fsono                             SalesNo
 ,SOR.fenumber                          ItemNo
 ,SOR.frelease                          Release
 ,SOR.fpartno                           PartNo
 ,(SOR.forderqty * SOR.funetprice)      NetAmount -- Order Quantity * Net Unit Price 
FROM
  dbo.sorels SOR
ORDER BY
  SOR.partno -- ASC is default and optional, DESC needs to be specified.

Master Po now decides that he wants the records sorted in descending (DESC) order and then sorted by part revision. Note that I’ve used the aliased field names as well, more on that later.

SELECT
  SOR.fsono                             SalesNo
 ,SOR.fenumber                          ItemNo
 ,SOR.frelease                          Release
 ,SOR.fpartno                           PartNo
 ,SOR.fpartrev                          PartRev
 ,(SOR.forderqty * SOR.funetprice)      NetAmount -- Order Quantity * Net Unit Price 
FROM
  dbo.sorels SOR
ORDER BY
  PartNo Desc, PartRev

Master Po likes the report, but he would like to have the details summarized for him. How do we accomplish this?

Using Aggregate Functions

The most commonly used aggregate functions are listed below. The complete list and definitions can be found here.

  • Sum(SomeField)
  • Avg(SomeField)
  • Count(SomeField)
  • Max(SomeField)
  • Min(SomeField)

So Grasshopper, rather than tediously explaining what you can already guess, let’s look at the following examples.

SELECT
  sum(SOR.forderqty * SOR.funetprice)   TotalAmountOrdered
 ,Avg(SOR.funetprice)                   AverageUnitPrice
 ,count(*)                              SorelsRecordCount
 ,count(distinct SOR.fsono)             SalesOrderCount -- distinct removes duplicates
 ,Max(SOR.funetprice)                   MaxUnitPrice
 ,MIN(SOR.funetprice)                   MinUnitPrice
FROM
  dbo.sorels SOR


This works, but Master Po is not happy. The data has little meaning because it doesn’t differentiate by Part Number, Customer, Date, or any other useful attributes. So, how can you summarize the data by part number for him? Well, you might think that you could do this:

SELECT SOR.fpartno                      PartNumber
 ,sum(SOR.forderqty * SOR.funetprice)   TotalAmountOrdered
 ,Avg(SOR.funetprice)                   AverageUnitPrice
 ,count(*)                              SorelsRecordCount
 ,count(distinct SOR.fsono)             SalesOrderCount
 ,Max(SOR.funetprice)                   MaxUnitPrice
 ,MIN(SOR.funetprice)                   MinUnitPrice
FROM
  dbo.sorels SOR

However, SQL Server will return an error like “Column ‘dbo.sorels.fpartno’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.”

Using Group By

Grouping returns a single row for each unique value in the selected column with the corresponding aggregate value. To appease Master Po, let’s add a Group By clause to the above query.

SELECT SOR.fpartno                      PartNumber
 ,sum(SOR.forderqty * SOR.funetprice)   TotalAmountOrdered
 ,Avg(SOR.funetprice)                   AverageUnitPrice
 ,count(*)                              SorelsRecordCount
 ,count(distinct SOR.fsono)             SalesOrderCount
 ,Max(SOR.funetprice)                   MaxUnitPrice
 ,MIN(SOR.funetprice)                   MinUnitPrice
FROM
  dbo.sorels SOR
  Group By SOR.fpartno


So, what have we returned? Well, now the information has more meaning. Look at it again, after I’ve adjusted the alias names.

We can now see important patterns in the data. For example, why is there such a large discrepancy between the minimum and maximum unit price for part number CP2010 and CP0500? Could it be that we’ve been swindled by Ninjas? Group by can also be used for more than one column. In the following example, we are grouping first by the Part Number and then by the Sales Order Number.

SELECT SOR.fpartno                      PartNumber
 ,SOR.fsono                             SalesNo
 ,sum(SOR.forderqty * SOR.funetprice)   TotalAmountOrdered
 ,Avg(SOR.funetprice)                   AverageUnitPrice
 ,count(*)                              SorelsRecordCount
 ,count(distinct SOR.fsono)             SalesOrderCount
 ,Max(SOR.funetprice)                   MaxUnitPrice
 ,MIN(SOR.funetprice)                   MinUnitPrice
FROM
  dbo.sorels SOR
    Group By SOR.fpartno, SOR.fsono
    order by SOR.fpartno, SOR.fsono


Granted, the results are not earth shattering because I have a very limited data set. Also, since we have not joined to other tables, I can’t group by less specific attributes like Customer or Product Class, but you get the idea.

Let’s say that Master Po wants to know the total quantity due to ship per month per part number in the year 2010. In the future, I’ll show you a much easier way to do this using a date table, for now, we’ll use the following code.

SELECT
  SOR.fpartno           PartNumber
 ,month(SOR.fduedate)   MonthDue
 ,Sum(forderqty)        OrderQty
FROM
  dbo.sorels SOR
WHERE
  SOR.fduedate >= '1/1/2010' AND
  SOR.fduedate < '1/1/2011'
GROUP BY
  SOR.fpartno
 ,month(SOR.fduedate)
ORDER BY
  SOR.fpartno
 ,month(SOR.fduedate)


The Month function returns the month number of a specific date as a number. Master Po can now scan the list, or better yet you could prepare a good looking report for him using SQL Server Reporting Services (SSRS). Using it he’ll know when certain parts are due and can schedule his monk’s time accordingly. Before any of you geeks out there call me on this, yes I know Master Po was blind, but he was also magical and I’m sure he could “read” SSRS reports using those abilities.

Master Po is a demanding Sifu, he must be considering what Caine had to do with his forearms before leaving the temple, and suspects you have some duplicate records in the sorels table because your figures seem a little too high. Bitter experience has shown Master Po that M2M was designed by the Drunken Master and lacks primary and foreign keys. How can we verify that we do not have duplicates in sorels and be allowed to leave the temple?

The Having Clause

If Master Po had designed M2M the sorels would have a primary key consisting of Sales Order Number, Sales Order Line Item, and Sales Order Release. You can learn more about Primary Keys and Referential Integrity problems in M2M here.

Having is nearly identical to the Where clause. Having filters groups in the same way that Where filters records. Since we want to know if we have duplicates, we’ll group by the columns which should be unique in each record.

SELECT   count(*)
FROM     dbo.sorels SOR
GROUP BY SOR.fsono, SOR.finumber, SOR.frelease
HAVING   count(*) > 1

Zero records are returned, meaning you have no duplicates. After telling Master Po, he wants another variation of the report. He wants to know the total dollar amount per sales order, but only wants to see those with a total more than $10,000. Yes, I know he would want it in Yen but my data set is limited to USD so just play along.

So, how would we do this?

Select SOR.fsono
,sum(SOR.forderqty * SOR.funetprice) TotalPerOrder
From dbo.sorels SOR
Group by SOR.fsono
Having sum(SOR.forderqty * SOR.funetprice) > 10000
Order by sum(SOR.forderqty * SOR.funetprice) desc

To impress Master Po further, we ordered them from highest to lowest dollar amount.

Field Name Alias Usage

Notice that the aliased field name cannot be used in the Group By or Where clause , but they are permissable in the Order By clause.

 Group by PartNumber -- not legal
Order by PartNumber --Legal

Why? The reason field aliases cannot be used in these clauses statements is due to the order in which SQL Server evaluates statements. Basically, SQL Server evaluates the statement in the following steps.

  1. The From clause including the Joins (more on Joins later).
  2. The Where clause filters out any unnecessary records.
  3. The Group By is then applied.
  4. Having filters out unneeded groups.
  5. The Select is evaluated including the aliased field names assuming they’ve been assigned.
  6. The Order By then sorts the records.

Since the Select portion of the SQL statement is evaluated after the Where, Group By, and Having clauses, none of them can use aliased field names. SQL Server really has no idea that sor.fpartno is PartNumber at that point. The table alias can be used throughout because it is assigned in the first step. Make Sense?

The Demanding Master Po

Anyway, back to our story. You’ve accomplished what Master Po asked, verified that your table has no duplicates, and you are sure he will be pleased. However, when you bring it to him, he smiles and says, “Very good Grasshopper… but which customers placed these orders? Oh, and be sure to practice your pebble snatching as well.

Well Grasshoppers, I’ll show you how to get the customer information in the next lesson. As far as snatching pebbles, you’re on your own.

Does anyone else have pebble snatching tips to share?

Ooh! Ooh! Pick Me!

horshack

Pick Me Mr. Kotter!

SQL Rally is a brand new event for this year. It’s kind of a hybrid between the PASS Summit and a SQL Saturday. It’s fairly low cost and being held May 11-13 in Orlando, Florida.

Anyway, the reason for this post is completely self serving. I really want to present at this event. Please register to vote and vote for my sessions. Mine are in the 10th section in the SSRS category. The abstracts I submitted are:

Title: Choose Your Own (Reporting) Adventure – SSRS Survivor Series
Abstract: Create a story of guided analytics using Reporting Services drill down, drill through, and expressions. Sure, you may be using drill down already, but I’ll show you how to make the most of Reporting Services interactions, and more importantly how to avoid becoming dragon chow at the end.

Title: The Power of Parameters – SSRS Survivor Series
Abstract: Want to know the best techniques to help your user narrow their search in the most efficient ways possible? Come watch me demonstrate which parameter techniques are best for certain situations and field types. Real world problems and applications are stressed, and I’ll demonstrate why some commonly accepted methods simply do not work and should be avoided.

–My name is Made2Mentor, and I approved this message.

T-SQL Basics - Filtering Records with the Where Clause

All of the queries we’ve looked at so far will return all of the records in a table. This isn’t very efficient. The Where clause filters records out of your result set that you do not want.

For example, suppose we’re only interested in Sales Order Master records for “Acme Rockets Inc.” The following select statement will work:

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SalesNo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM dbo.somast SOM
  where SOM.fcompany = 'Acme Rockets Inc.'

Notice that I have used single quotes around the company name, they are required for character values, but not numerics. Please note that many fields that M2M refers to as numbers are actually not numeric fields. Most record number fields such as Sales Order Number, Purchase Order Number, and Invoice Number are 6 digit character fields even though they may contain only numeric values. Therefore, this is how to write the statement if you only want to return information for Sales Order “Number” 000001.

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SalesNo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM dbo.somast SOM
  where SOM.fsono = '000001'

And/Or Operators

What if we took the above example and wanted to retrieve 2 sales order records? An Or statement will return as true if either expression is true. Therefore, an Or statement like this will retrieve both records.

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SalesNo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM dbo.somast SOM
  where SOM.fsono = '000001'
  or SOM.fsono = '000002'

Let’s say you wanted to check for records for Acme Rockets Inc. with a particular Sales Order Number. The And statement will only evaluate to true if both conditions are true.

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SalesNo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM dbo.somast SOM
  where SOM.fsono = '000001'
  and SOM.fcompany =  'Acme Rockets Inc.'

I recorded a quick video to explain the gotchas of using And with Or in a Where clause.

Parenthesis Use

As I mentioned in the video, parenthesis can be used to modify expression evaluation in a Where clause. Suppose you wanted to run the previous query for two separate sales order numbers. In English we are looking for any Acme orders with an order number of ‘000001’ or ‘000002’ that are for Acme Rockets Inc.

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SalesNo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM dbo.somast SOM
  where (SOM.fsono = '000001' or SOM.fsono = '000002') -- evaluated as a group
  and SOM.fcompany =  'Acme Rockets Inc.'

Greater Than, Less Than, Not Equal To

Where clauses are not limited to equal conditions. Not Equal To (<> , !=), Greater Than or Equal To (>=) Less Than or Equal To (<=) and simple Greater Than or Less Than are all supported. So, the previous query could also be written:

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SalesNo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM
  dbo.somast SOM
where SOM.fsono <= '000002'
and SOM.fcompany =  'Acme Rockets Inc.'

It’s important to note that since somast.fsono is a character field the less than or equal to is treated a like an alphabetical lookup rather than a numeric one. Typically this isn’t a problem because the numbers consistently include preceding zeros, but it is something to keep in mind.

The Between Statement

Suppose we wanted to return all of the Sales Order Numbers between ‘000001’ and ‘000100’. Well, we’d use the Between statement of course.

Where fsono between '000001' and '000100'

Keep in mind that the statement is inclusive meaning that the first and last value are also returned. If you didn’t want to include ‘000100’ you would need to edit the last value or use a less than instead. The most common use of between is with date comparison and there are gotchas with that. I’ll explain date manipulation and comparison in a later blog article.

The IN Operator

Let’s say that you want retrieve Sales Orders from several of Acme’s subsidiaries. In this example, you want to extract records for “Acme Rockets Inc.” and “Acme Rocket Skates” as well. You could use the following Where clause:

Where SOM.fcompany = 'Acme Rockets Inc.' or SOM.fcompany = 'Acme Rocket Skates'

However, suppose Acme has 20 subsidiary companies all with some version of Acme in the name. Using Or would be unwieldy. Therefore, you can use the In statement like this:

WHERE
  SOM.fcompany IN
    ('Acme Rockets Inc.'
    ,'Acme Rocket Skates'
    ,'The Acme Company'
    ,'Acme Inc'
    ,'A Company Making Everything'
    ,'Acme'
    ,'Acme Rackets Inc.'
    ,'Acme Rickets Inc.')
    -- etc.

While this is better, listing 20 companies this way would still be a pain. So, for this case, we’re going to use the Like statement.

Using Like and Pattern Matching

The most commonly used wild card character is %. % matches any string of characters or no characters at all. For the sake of argument, let’s say that all of the variations of Acme actually started with the word Acme. They could be returned like this:

WHERE
  SOM.fcompany LIKE ('Acme%')

Returns:
CustName
Acme Rockets Inc.
Acme Rocket Skates
Acme Inc
Acme
Acme Rackets Inc.

Notice that we lost “The Acme Company”. If you wanted to include them the following statement will work.

WHERE
  SOM.fcompany LIKE ('%Acme%')

Returns:
CustName
Acme Rockets Inc.
Acme Rocket Skates
The Acme Company
Acme Inc
Acme
Acme Rackets Inc.

Use a preceding % sparingly because the statement becomes non-Sargable. Basically that query will be unable to use an index placed on the field and may cause performance problems.

The astute reader will notice that we missed the Acme company named “A Company Making Everything”. How can we make sure they are returned as well? More than one wildcard character can be used.

WHERE
  SOM.fcompany LIKE 'a%c%m%e%'

Returns:
CustName
Acme Rockets Inc.
Acme Rocket Skates
Acme Inc
A Company Making Everything
Acme
Acme Rackets Inc.
Acme Rickets Inc.

Important to note. You will notice that my last wildcard statement is in a different case than the records in the table. Depending on the collation setting of your database, your results may vary as some are case sensitive.

An Underscore (_) replaces a single character.

WHERE
  SOM.fcompany LIKE 'Acme R_ckets Inc.'

Returns:
CustName
Acme Rockets Inc.
Acme Rackets Inc.
Acme Rickets Inc.

Let’s raise the complexity a notch. Brackets [] can also be used to match a single character in a range.

WHERE
  SOM.fcompany LIKE 'Acme R[a-i]ckets Inc.'

Returns:
CustName
Acme Rackets Inc.
Acme Rickets Inc.

The range can also represent distinct letters.

WHERE
  SOM.fcompany LIKE 'Acme R[a,o]ckets Inc.'

CustName
Acme Rockets Inc.
Acme Rackets Inc.

Brackets can also exclude characters in a range. The symbol ^ (typically over the 6 key) represents Not.

WHERE
  SOM.fcompany LIKE 'Acme R[^a-i]ckets Inc.'

Only Acme Rockets Inc is returned. Brackets and underscores are pretty powerful, but aren’t used nearly as frequently as the general wildcard %.

Using Not

I’ve already eluded to the use of Not with wildcards. However, it’s easy to use in non-wildcard clauses as well. Not negates any expression that it precedes. Using the sample database the following query:

SELECT
 SOM.fsono SalesNo
FROM
  dbo.somast SOM
WHERE
  SOM.fcustno = '000104' AND
  SOM.fcompany NOT LIKE ('Acme Rockets Inc.')

Returns:
SalesNo
000003
000020
000052

That about covers the basics of filtering records using a Where clause. In future articles I’ll discuss Joins, Grouping, and the use of Sub Queries.

Any questions? Is there anything in particular you’d like me to cover?

How to Code with Toad

I mentioned in my last article that I would be posting a video showing you the basics of working with Toad for SQL Server.

What do you think of my first effort?

My Way of Coding T-SQL Consistently

As I mentioned in the last article, you need to develop or adopt a standardized way of writing T-SQL code. In this article, I’m going to share the habits I’ve developed, “my way” if you will.

Vizzini: Finish him. Finish him, your way.
Fezzik: Oh good, my way. Thank you Vizzini… what’s my way?
Vizzini: Pick up one of those rocks, get behind a boulder, in a few minutes the man in black will come running around the bend, the minute his head is in view, hit it with the rock.
Fezzik: My way’s not very sportsman-like.
The Princess Bride

Just like Fezzik, you may be asking, “Well, which way is my way?” The best way for me to help you with that is to share the method(s) I’ve developed and reasons for doing so. Keep in mind that many people will disagree with my way, and that’s alright. This article however, is all about me, and since these methods work for me, I think they’ll work for you too. Adoption of a 3rd party query tool helps to keep your code consistent.

Capitalization Usage

I write all of the T-SQL functions in ALL CAPS and typically CamelCase field and database names. More on this later.

Field Order

I list the fields in an order which makes sense to me. Consider the script I shared with you in the last article:

SELECT  fcompany, fcustno, fcustpono, forderdate, fsono, fsorev FROM somast

While the computer doesn’t care, I list my fields in roughly most general to most specific order. Therefore, I would order those in the following manner:

SELECT  fcustno, fcompany, fsono, fsorev, fcustpono, forderdate  FROM somast

This order just “feels” right to me. Customers “are larger” or less specific than Sales Orders and the fields are listed in this order. This will become a lot more important when we discuss joining tables.

I use a lot of vertical space in my queries. The query above becomes this:

SELECT fcustno,
       fcompany,
       fsono,
       fsorev,
       fcustpono,
       forderdate
  FROM somast

Once again, the computer couldn’t care less, it just makes everything easier for me to read.

Leading vs. Trailing Commas

I prefer to use leading commas. The query now looks like this with leading commas:

SELECT fcustno
 ,fcompany
 ,fsono
 ,fsorev
 ,fcustpono
 ,forderdate
FROM
  somast

It may take you awhile to get used to seeing the query like that, but there’s a good reason for doing so. If you are editing a long query with trailing commas, and you need to remove the last entry in the list, you will invariably forget to remove the final comma and receive an error.

SELECT fcustno,
       fcompany,
       fsono,
       fsorev,
       fcustpono,
  --     forderdate  **** Using Double dashes comments out this line. This code now ends with a comma and will error out. 
  FROM somast

Leading comma queries are easier to edit in my opinion.

Aliasing

Butch: Would you hand me a dry towel, miss beautiful tulip?
Fabienne: Oh, I like that. I like being called a tulip. Tulip is much better than mongoloid.
Pulp Fiction

Aliasing is used in two places in a SQL Statement, the Select clause and the From clause. The From clause alias renames the database object and I typically use 3 or 4 character alias names. I also include the database owner or schema as well. Alias names can be assigned following the AS keyword, but it is optional. For example, the leading comma query becomes this:

SELECT SOM.fcustno
 ,SOM.fcompany
 ,SOM.fsono
 ,SOM.fsorev
 ,SOM.fcustpono
 ,SOM.forderdate
FROM
  dbo.somast AS SOM  -- Or dbo.somast SOM

I know that seems like extra work, but its absolutely necessary when you write multiple table queries. Some people alias their tables with meaningless single letter values like A, B, and C. I can’t stand this because when I read a SQL statement aliased in this way, I have to constantly look at the from clause to remind myself what each alias represents.

Once again, consistency is key. I always use the following alias names:

somast = SOM
soitem = SOI
sorels = SOR
slcdpm = SLC
etc.

Since I am consistent, I always know that a field aliased with ARI represents Aritem; and I can cut and paste between queries as well. Incidentally, FastForms adds _EXT to the name of it’s extension tables.. Therefore, I add a lowercase e to any extension table alias. Soitem_EXT becomes SOIe. Keep it simple.

The second alias type applies to the field name. What the heck does fsono mean to a user? Therefore, I alias my queries like this:

SELECT SOM.fcustno CustNo
 ,SOM.fcompany CustName
 ,SOM.fsono SONo
 ,SOM.fsorev SORev
 ,SOM.fcustpono CustPONo
 ,SOM.forderdate OrderDate
FROM
  dbo.somast SOM

Much Easier to Read


Also, because I’ve aliased my table names in CamelCase, many query tools such as SSRS will automatically format them as two words. “OrderDate” becomes “Order Date”. You can include spaces in your alias names, but you must put brackets around them. [Order Date] will render properly in almost any reporting tool, but I dislike all the extra typing.

Tomorrow, I’ll post my first video which will be a short demonstration of how I follow these “rules” with Toad for SQL Server.

T-SQL Basics

“First learn stand, then learn fly. Nature rule, Daniel-san, not mine.” – Karate Kid

What is SQL?

SQL (Structured Query Language) is a language for manipulating database data and structure. It can be pronounced by stating it’s letters “S-Q-L” as well as the word “sequel”, as in the Matrix had two crappy sequels.

SQL is mostly standardized across Relational Database Management Systems RDMBS, but each, like SQL Server, may have it’s own unique syntax. T-SQL is Microsoft’s version of the language and it’s the version I focus on.

Two Main SQL Statement Types

There are other types of SQL Statements but we’ll restrict our discussions to these two.

  • Data Definition Language (DDL) – These define database structure and schema. Eg. Create Database.
  • Data Manipulation Language (DML) – These statements manipulate data. Eg. Select * from

Most of what I’ll cover with you are DML since most Made2Manage Admins will not be creating a lot of databases, manipulating indexes, etc.

Basic Syntax Rules of Using SQL

“Do you understand the words that are coming out of my mouth? – Rush Hour

These rules are true for the most part and in most circumstances. I’m generalizing to make things as simple to learn as possible. The syntax requirements of T-SQL are incredibly flexible. In this article, I’ll explain what you can do, and in a future article, I’ll explain what I think you should do.

I’m explaining these rules in the context of a very basic select statement. This following statement requests all of the values in the column (or field) fsono from the somast table.

SELECT fsono FROM somast
  1. T-SQL statements are case insensitive. For example the following scripts will function the same way:
  2. SELECT fsono FROM somast

    SeleCT fSoNO froM SOMast

  3. The “*” is a wildcard and will select all fields from a table. “Never” use this as it wastes resources and creates maintenance headaches. If you delve into M2M’s own code however, you will see many instances where they do that very thing.
  4. When selecting fields, separate them with commas. The order of fields selected doesn’t matter to the computer, but as I said earlier I’ll explain my preferences in a future post.
  5. White Space, and placement of some punctuation, doesn’t matter. The following statements are equivalent to the computer.
  6. SELECT  fcompany, fcustno, fcustpono, forderdate, fsono, fsorev FROM somast
     
    -- Extra Spaces
    SELECT  fcompany       , fcustno,       fcustpono, forderdate, fsono, fsorev FROM somast
     
    -- Extra Carriage Returns
    SELECT fcompany,
           fcustno,
           fcustpono,
           forderdate,
           fsono,
           fsorev
      FROM somast
     
    --Leading Commas
    SELECT
      fcompany
     ,fcustno
     ,fcustpono
     ,forderdate
     ,fsono
     ,fsorev
    FROM
      somast

Consistency

If there’s one thing I’ve learned over the years its that consistency is key. Since most M2M Admins are the lone DBA and reporting person, you must create your own standard ways of writing SQL code and stick to them. Consistency has several benefits.

  1. Your code will be easily read by you and others.
  2. You will be able to cut and paste code snippets between SQL Scripts (more on this later).
  3. Your efficiency goes up dramatically when you work this way.

So Daniel-san, as I said before this is your first lesson. Tomorrow, I’ll show you around our dojo that will use for future lessons in this T-SQL Basics Series.

Page 10 of 33« First...89101112...2030...Last »