Source data often arrives with inconsistencies and typing errors, so working on data quality is one of the components of a BI system. As an example, Auckland might be written as "Auck land" in some Excel files, or be typed as "Aukland" by a user in an input form.
As a solution to improve data quality, Microsoft provides Data Quality Services (DQS). DQS works based on Knowledge Base domains: a Knowledge Base can be created for different domains, and the Knowledge Base is maintained and improved by a data steward over time.
There are also matching policies that can be used to standardize the data.

A data warehouse is a database built for analysis and reporting. In other words, a data warehouse is a database whose only data entry point is ETL, and whose primary purpose is to cover reporting and data analysis requirements.
This definition clarifies that a data warehouse is not like the transactional databases that operational systems write data into. Because no operational system works directly with the data warehouse, and because its main purpose is reporting, its design differs from that of transactional databases.
If you recall the database normalization concepts, the main purpose of normalization is to reduce redundancy and dependency. Let's elaborate on this with an example. The following table shows customers' data along with their geographical information:
As you can see from the preceding list, the geographical information in the records is redundant. This redundancy makes it difficult to apply changes. For example, if Remuera, for any reason, is no longer part of Auckland, then the change has to be applied to every record that has Remuera as its suburb. So, a normalized approach is to remove the geographical information from the customer table and put it into another table. The following screenshot shows the tables of geographical information:
Then, the customer table would only hold a key pointing to that table. In this way, every time the value Remuera changes, only one record in the geographical table changes and the key value remains unchanged.
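To make the idea concrete, here is a minimal T-SQL sketch of such a normalized design (the table and column names are hypothetical, not taken from an actual schema):

```sql
-- Hypothetical normalized design: geographical attributes live in their own table
CREATE TABLE GeoRegion
(
    GeoRegionID INT IDENTITY(1,1) PRIMARY KEY,
    Suburb      NVARCHAR(50),
    City        NVARCHAR(50),
    State       NVARCHAR(50),
    Country     NVARCHAR(50)
);

CREATE TABLE Customer
(
    CustomerID   INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(100),
    GeoRegionID  INT REFERENCES GeoRegion(GeoRegionID)  -- only a key, no redundant text
);

-- Renaming the suburb now touches a single row instead of every customer record
UPDATE GeoRegion SET Suburb = 'Remuera (renamed)' WHERE Suburb = 'Remuera';
```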
So, you can see that normalization is highly efficient in transactional systems. However, this approach is not as effective for analytical databases. If you consider a sales database with many related tables, normalized at least up to the third normal form (3NF), then analytical queries on such a database may require more than 10 joins, which slows down the query response.
In other words, from the point of view of reporting, it is better to denormalize and flatten the data to make it as easy to query as possible. This means the first, flattened design might be better for reporting. However, real-world query and reporting requirements are not that simple, and the business domains in the database are not as small as two or three tables. Real-world problems are solved with a special design method for data warehouses called dimensional modeling.
There are two well-known methods for designing a data warehouse: the Kimball and Inmon methodologies, named after their creators, Ralph Kimball and Bill Inmon. Both methods are in use nowadays. The main difference between them is that Inmon's approach is top-down while Kimball's is bottom-up. In this chapter, we will explain the Kimball method. The books describing these methodologies are must-reads for BI and DW professionals and are reference works recommended for the bookshelf of every BI team.
This chapter draws on The Data Warehouse Toolkit, so for a detailed discussion, read that book. To understand data warehouse design and dimensional modeling, it's best to first learn the components and terminology of a DW.
A DW consists of Fact tables and dimensions. The relationship between a Fact table and dimensions is based on foreign and primary keys (the primary key of the dimension table appears in the Fact table as a foreign key). Facts are numeric and additive values in the business process. For example, in a sales business, a fact can be the sales amount, discount amount, or quantity of items sold.
All of these measures or facts are numeric and additive. Additive means that adding the values of several records together produces a meaningful result. For example, adding the sales amount across all records gives the grand total of sales. Dimension tables contain descriptive information: for example, a customer's name, job title, company, and even the geographical information of where the customer lives.
Each dimension table contains a list of columns, and the columns of the dimension table are called attributes. Each attribute contains some descriptive information, and attributes that are related to each other will be placed in a dimension. For example, the customer dimension would contain the attributes listed earlier. Each dimension has a primary key, which is called the surrogate key.
The surrogate key is usually an auto-increment integer value. The primary key of the source system is stored in the dimension table as the business key. A Fact table contains a list of related facts and measures, with foreign keys pointing to the surrogate keys of the dimension tables.
Fact tables usually store a large number of records, and most of the data warehouse space (around 80 percent) is filled by them. Grain is one of the most important terms used in designing a data warehouse.
Grain defines the level of detail stored in the Fact table. For example, you could build a data warehouse for sales in which the Grain is the most detailed level of transactions in the retail shop, that is, one record per transaction, at a specific date and time, for a specific customer and salesperson.
Understanding Grain is important because it defines which dimensions are required. There are two different schemas for creating the relationship between facts and dimensions: the snowflake and star schemas. In the star schema, the Fact table is at the center as a hub, and dimensions are connected to the fact through a single-level relationship; ideally, no dimension relates to the fact through another dimension. The following diagram shows the two schemas: The snowflake schema, as you can see in the preceding diagram, relates some dimensions to the Fact table through intermediate dimensions.
If you look more carefully at the snowflake schema, you will notice that it resembles a normalized design; in fact, a fully snowflaked design of the fact and dimensions will be in 3NF.
The snowflake schema requires more joins to answer an analytical query, so it responds more slowly. Hence, the star schema is the preferred design for a data warehouse. Obviously, you cannot always build a pure star schema, and sometimes a level of snowflaking is required. However, the best practice is to avoid snowflaking as much as possible. After this quick definition of the most common dimensional modeling terminology, it's now time to start designing a small data warehouse.
One of the best ways of learning a concept and method is to see how it is applied to a sample problem. Assume that you want to build a data warehouse for the sales side of a business that runs a chain of supermarkets; each supermarket sells products to customers, and the transactional data is stored in an operational system.
Our mission is to build a data warehouse that supports analysis of the sales information. Before thinking about the design of the data warehouse, the very first question is: what is the goal of the data warehouse? What kind of analytical reports will be required from the BI system? Answering these questions is the first and also the most important step.
This step not only clarifies the scope of the work but also provides a clue about the Grain. Defining the goal can also be called requirement analysis. Your job as a data warehouse designer is to analyze the required reports, KPIs, and dashboards. After requirement analysis, the dimensional modeling phase starts.
Based on Kimball's best practices, dimensional modeling can be done in the following four steps:

1. Select the business process.
2. Declare the Grain.
3. Identify the dimensions.
4. Identify the facts.

In our example, there is only one business process, that is, sales.
Grain, as described earlier, is the level of detail that will be stored in the Fact table. Based on the requirements, the Grain here is one record per sales transaction, per date, per customer, per product, and per store. Once the Grain is defined, it is easy to identify the dimensions; based on this Grain, they are date, store, customer, and product. It is useful to name dimensions with a Dim prefix so that they are easy to identify in the list of tables.
The next step is to identify the Fact table; in this example, it is a single Fact table named FactSales, which will store the defined Grain. After identifying the Fact and dimension tables, it's time to go into more detail about each table and think about the attributes of the dimensions and the measures of the Fact table.
Next, we will get into the details of the Fact table and then into each dimension. There is only one Grain for this business process, and this means that one Fact table would be required.
To connect to each dimension, there would be a foreign key in the Fact table that points to the primary key of the dimension table. The table would also contain measures or facts.
For the sales business process, the facts that can be measured (numeric and additive values) are SalesAmount, DiscountAmount, and QuantitySold. The Fact table only contains the relationships to the dimensions plus the measures. The following diagram shows some columns of FactSales:
As you can see, the preceding diagram shows a star schema. We will go through the dimensions in the next step to explore them in more detail. Fact tables usually don't have many columns, because the number of measures and related tables is relatively small. However, Fact tables contain many records.
The Fact table in our example will store one record per transaction. As it will contain millions of records, you should think carefully about the design of this table. String data types are not recommended in the Fact table because they don't add any numeric or additive value.
The relationship between the Fact table and a dimension is based on the surrogate key of the dimension. The best practice is to set the data type of surrogate keys to integer; this is cost-effective in terms of the disk space required in the Fact table, because the integer data type takes only 4 bytes, while a string data type takes much more.
Using an integer as the surrogate key also speeds up the join between the fact and the dimension, because the join and filter criteria operate on integer values, which are much faster to compare than strings.
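As a rough sketch of what this could look like in T-SQL (hypothetical data types; the dimension tables DimDate, DimCustomer, DimProduct, and DimStore are assumed to exist already with integer surrogate keys):

```sql
-- Hypothetical FactSales: integer foreign keys to the dimensions plus additive measures only
CREATE TABLE FactSales
(
    DateKey        INT   NOT NULL REFERENCES DimDate(DateKey),
    CustomerKey    INT   NOT NULL REFERENCES DimCustomer(CustomerKey),
    ProductKey     INT   NOT NULL REFERENCES DimProduct(ProductKey),
    StoreKey       INT   NOT NULL REFERENCES DimStore(StoreKey),
    SalesAmount    MONEY NOT NULL,
    DiscountAmount MONEY NOT NULL,
    QuantitySold   INT   NOT NULL
);
```

Note that there are no string columns here; everything is either a 4-byte integer key or an additive measure.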
If you are thinking about adding comments made by a salesperson on a sales transaction as another column of the Fact table, first think about the analysis you want to do based on those comments. No one does analysis on a free-text field; if you wish to analyze free text, you can categorize the text values through the ETL process and build another dimension for that.
Then, add a foreign key-primary key relationship between that dimension and the Fact table.

The customer dimension stores the customer's information, such as the customer name, customer job, customer city, and so on.
You may think that the customer's city belongs in another dimension, such as a Geo dimension. But the important point is that our goal in dimensional modeling is not normalization, so resist the tendency to normalize tables. For a data warehouse, it is much better to store more customer-related attributes in the customer dimension itself rather than designing a snowflake schema.
The following diagram shows sample columns of the DimCustomer table: The DimCustomer dimension may contain many more attributes; the number of attributes in a dimension is usually high. In fact, a dimension table with a large number of attributes is the power of your data warehouse, because attributes are your filter criteria in analysis and the user can slice and dice data by them. So, it is good to think about all possible attributes for a dimension and add them at this step.
As we discussed earlier, you see attributes such as Suburb, City, State, and Country inside the customer dimension. This is not a normalized design, and it would definitely not be a good design for a transactional database because it adds redundancy and makes changes inconsistent. For a data warehouse design, however, the redundancy does not matter; it actually speeds up analytical queries and prevents snowflaking.
CustomerKey is the surrogate key and the primary key of the dimension in the data warehouse. It is an auto-incremented integer field. It is important that the surrogate key is not an encoded value or a string key; if something is coded somewhere, it should be decoded and stored in the relevant attributes.
The surrogate key should be different from the primary key of the table in the source system. There are multiple reasons for this; for example, operational systems sometimes recycle their primary keys, which means they reuse the key value of a customer that is no longer active for a new customer.
CustomerAlternateKey is the primary key of the source system. It is important to keep the source system's primary key stored in the dimension, because it is needed to identify changes in the source table and apply them to the dimension. The primary key of the source system is called the business key or alternate key.
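A minimal sketch of DimCustomer along these lines (hypothetical column names and sizes) could be:

```sql
-- Hypothetical DimCustomer: auto-incremented integer surrogate key, business key from the
-- source system, and denormalized geographical attributes stored directly in the dimension
CREATE TABLE DimCustomer
(
    CustomerKey          INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerAlternateKey NVARCHAR(20) NOT NULL,          -- business (alternate) key from the source system
    CustomerName         NVARCHAR(100),
    JobTitle             NVARCHAR(50),
    Suburb               NVARCHAR(50),
    City                 NVARCHAR(50),
    State                NVARCHAR(50),
    Country              NVARCHAR(50)
);
```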
The date dimension is one of the dimensions that you will find in most business processes; there are only rare situations where a Fact table doesn't store date-related information. You may think that storing just the full date column is enough, as all the other columns can be derived from it with date functions, but deriving them at query time adds extra processing time.
So, when designing dimensions, don't worry about disk space and add as many attributes as required. The following diagram shows sample columns of the date dimension: It would be useful to store holidays, weekdays, and weekends in the date dimension, because a holiday or weekend definitely affects the sales transactions and amounts.
The user will want to understand why sales are higher on a specific date than on other days. You may also add another attribute for promotions, which states whether a specific date is a promotion date or not. The date dimension has one record for each date.
The following screenshot shows sample records of the date dimension: As you can see in the preceding records, the surrogate key of the date dimension (DateKey) holds a meaningful value.
This is one of the rare exceptions where we keep the surrogate key as an integer but with the YYYYMMDD format, so that it also carries meaning. In this example, if we store time information, where do you think the time attributes should go? Inside the date dimension? Definitely not.
The date dimension stores one record per day, so it will have 365 records per year and about 3,650 records for 10 years. If time were stored in the same dimension, down to the minute, there would be 1,440 records per day, which is more than 5 million records over 10 years. However, 5 million records for a single dimension is too many; dimensions are usually narrow and only occasionally have more than a million records.
So, in this case, the best practice is to add another dimension, DimTime, and put all the time-related attributes in that dimension. The following screenshot shows some example records and attributes of DimTime:
Usually, the date and time dimensions are generic and static, so you won't need to populate them through ETL every night; you just load them once and then use them. I've written two general-purpose scripts on my blog that create and populate date and time dimensions, which you can use.
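As a rough, simplified illustration of the idea (not the actual scripts from the blog, and assuming a DimDate table with the columns below already exists), a one-off load could use a recursive CTE:

```sql
-- Hypothetical one-off load of a simple date dimension for the years 2014 to 2023
;WITH Dates AS
(
    SELECT CAST('2014-01-01' AS DATE) AS FullDate
    UNION ALL
    SELECT DATEADD(DAY, 1, FullDate) FROM Dates WHERE FullDate < '2023-12-31'
)
INSERT INTO DimDate (DateKey, FullDate, [Year], [Month], [Day], WeekdayName, IsWeekend)
SELECT
    CAST(CONVERT(CHAR(8), FullDate, 112) AS INT),   -- meaningful YYYYMMDD surrogate key
    FullDate,
    YEAR(FullDate),
    MONTH(FullDate),
    DAY(FullDate),
    DATENAME(WEEKDAY, FullDate),
    CASE WHEN DATENAME(WEEKDAY, FullDate) IN ('Saturday', 'Sunday') THEN 1 ELSE 0 END
FROM Dates
OPTION (MAXRECURSION 0);   -- the series is longer than the default recursion limit of 100
```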
The product dimension will have a ProductKey, which is the surrogate key, and a business key, which is the primary key of the product in the source system (something like the product's unique number).
The product dimension will also contain information about product categories. Again, the dimension is denormalized: in this case, the product subcategory and category are placed in the product dimension with redundant values.
However, this decision is made in order to avoid snowflaking and improve the performance of the join between the fact and the dimensions.
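A minimal sketch of such a denormalized product dimension (hypothetical names) might look like this:

```sql
-- Hypothetical DimProduct: subcategory and category denormalized into the product dimension
CREATE TABLE DimProduct
(
    ProductKey          INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    ProductAlternateKey NVARCHAR(25) NOT NULL,          -- business key (the product's unique number)
    ProductName         NVARCHAR(100),
    Subcategory         NVARCHAR(50),                   -- redundant value repeated for every product
    Category            NVARCHAR(50)                    -- redundant value; avoids snowflaking to a category table
);
```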
We are not going to go through the attributes of the store dimension in detail. The most important aspect of this dimension is that it can have a relationship to the date dimension. For example, a store's opening date will be a key related to the date dimension. This type of snowflaking is unavoidable, because you cannot copy all of the date dimension's attributes into every other dimension that relates to it.
On the other hand, the date dimension is used by many other dimensions and facts, so it is better to have a conformed date dimension.
Outrigger is Kimball's term for a dimension, such as date, that is conformed and is used in a many-to-one relationship between dimensions for just one layer. In the previous example, you learned about the transactional fact: a Fact table that has one record per transaction. This type of Fact table usually has the most detailed Grain. There is also another type of fact, the snapshot Fact table, in which each record is an aggregation of transactional records over a snapshot period of time.
For example, consider financial periods; you can create a snapshot Fact table with one record for each financial period, and the details of the transactions will be aggregated into that record. Transactional facts are a good source for detailed and atomic reports.
They are also a good source for aggregations and dashboards. Snapshot Fact tables provide a very fast response for dashboards and aggregated queries, but they don't cover detailed transactional records. Based on your requirement analysis, you can create both kinds of facts or only one of them. There is also another type of Fact table called the accumulating Fact table, which is useful for storing processes and activities, such as order management. You can read more about the different types of Fact tables in The Data Warehouse Toolkit, Ralph Kimball, Wiley, which was referenced earlier in this chapter.
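As a rough illustration of the snapshot idea (hypothetical table and column names, building on the FactSales and DimDate sketches above), a monthly snapshot could be loaded by aggregating the transactional fact:

```sql
-- Hypothetical monthly snapshot: one aggregated record per store, product, and month
INSERT INTO FactSalesMonthlySnapshot
    (YearNumber, MonthNumber, StoreKey, ProductKey, TotalSalesAmount, TotalQuantitySold)
SELECT
    d.[Year],
    d.[Month],
    f.StoreKey,
    f.ProductKey,
    SUM(f.SalesAmount),
    SUM(f.QuantitySold)
FROM FactSales AS f
INNER JOIN DimDate AS d ON d.DateKey = f.DateKey
GROUP BY d.[Year], d.[Month], f.StoreKey, f.ProductKey;
```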
We've explained that Fact tables usually contain FKs of dimensions and some measures. However, there are times when you would require a Fact table without any measure. These types of Fact tables are usually used to show the non-existence of a fact.
For example, assume that the sales business process includes promotions as well, and you have a promotion dimension. Each entry in the Fact table then shows that customer X purchased product Y on date Z from store S while promotion P (such as a New Year's sale) was on. This Fact table covers every requirement that queries information about sales that happened, or in other words, transactions that occurred.
However, there are times when a promotion is on but no transactions happen! This is a valuable analytical insight for the decision maker, who will want to investigate what was wrong with a promotion that didn't generate sales. This is an example of a requirement that the existing Fact table, with its sales amount and other measures, doesn't fulfill. To cover it, you can create another Fact table that simply records which promotions were running on which products, at which stores, and on which dates. This Fact table doesn't have any fact or measure; it just has FKs to the dimensions.
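A minimal sketch of such a table (hypothetical names, assuming a DimPromotion dimension in addition to the dimensions sketched earlier):

```sql
-- Hypothetical Factless Fact table: no measures at all, only dimension keys recording
-- which promotion ran on which product, at which store, on which date.
-- A LEFT JOIN from this table to FactSales (keeping rows with no match) reveals
-- promotions that produced no sales.
CREATE TABLE FactPromotionCoverage
(
    DateKey      INT NOT NULL REFERENCES DimDate(DateKey),
    StoreKey     INT NOT NULL REFERENCES DimStore(StoreKey),
    ProductKey   INT NOT NULL REFERENCES DimProduct(ProductKey),
    PromotionKey INT NOT NULL REFERENCES DimPromotion(PromotionKey)
);
```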
However, it is very informative, because it tells us on which dates a promotion ran at specific stores on specific products. We call this a Factless Fact table or Bridge table. Through these examples, we've explored the usual dimensions, such as customer and date. When a dimension participates in more than one business process and is shared across different data marts (such as the date dimension), it is called a conformed dimension.
Sometimes, a dimension is required more than once in the same Fact table. For example, in the FactSales table, you may want to store the order date, the shipping date, and the transaction date; all three columns point to the date dimension. In Kimball's terminology, a dimension used in several such roles is called a role-playing dimension.
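A short sketch of how this looks when querying (hypothetical key names; it assumes FactSales carries three date keys, OrderDateKey, ShippingDateKey, and TransactionDateKey): DimDate is simply joined several times under different aliases.

```sql
-- Hypothetical query: DimDate plays three roles (order, shipping, and transaction date)
SELECT
    od.FullDate        AS OrderDate,
    sd.FullDate        AS ShippingDate,
    td.FullDate        AS TransactionDate,
    SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
INNER JOIN DimDate AS od ON od.DateKey = f.OrderDateKey
INNER JOIN DimDate AS sd ON sd.DateKey = f.ShippingDateKey
INNER JOIN DimDate AS td ON td.DateKey = f.TransactionDateKey
GROUP BY od.FullDate, sd.FullDate, td.FullDate;
```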