Skip to main content

Introduction to Datawarehouse

In shot what is DWH (Data warehouse)?

A DWH is a Large Database which consists of Dimension and Fact tables.

Or Collection of DataMart is called as a DWH.

Father of Data Warehouse:

Dr. William H. Inmon

Dr. Ralph Kimball

-Introduced Dimension Modelling to design DWH in 1990.

What is Data Warehouse?

As per Dr. William Inmon,

Data Warehouse is a

-      Subject Oriented (Loans, Mutual Funds etc.)

-      Integrated (Get data from various sources)

-      Time Variant (Different time period data)

 

·         *Datamart is Subject Oriented

·         *DWH is a collection of Datamart

·         *DWH is a superset of Datamart

·         *Datamart is also another database.

Top-down Approach: (DWH->Datamarts)

 

  • Proposed by Dr. Bill Inmon (used by 20% companies)
  • DWH is created first and then the Data Marts are derived from it.
  • Data Marts have consistent data and subject Oriented Data
  • It is robust against business changes. The cost of the project is very high.

 

Bottom-up Approach: (Datamarts->DWH)

 

  • Proposed by Dr.Ralph Kimball (used by 80% companies)
  • Data Marts are created first and then the integrated and a comprehensive DWH is created.
  • As the data marts are created first, business solutions can be answered quickly.

Dimension Modeling:

 

  • Developed by Dr. Kimball
  • It’s a DWH design concept
  • It is a process to convert OLTP data into OLAP data in the form of Dimension tables and Fact tables and maintain relation b/w them.
  • Dimension Modeling is heart of data warehouse.

 

What is the dimension modeling?

  • Divide the data into dimension and Facts
  • Have relation between them.

 

Fact or Measure:

 

  • Measures are the numeric values that we want to aggregate, slice, dice and analyze.
  • Measure has following properties:

è How they should be formatted

è How they should aggregate up

è How they interact with specific dimensions and so on

 

ð  Measure always answers questions for ‘How much’ or ‘How many’ etc

ð  Data in the fact table can be filtered and grouped (“Sliced and diced”)

 

Dimension:

 

  • Dimension is a string, which always answer the questions like ‘When’,’Where’,’What’,’Who’,’Which’ etc
  • Examples for Dimensions:

-      Customer, Product

-      Time, Geography etc

  • Dimension Table PK is FK of Fact table.
  • Dimension contains multiple Hierarchies
  • Hierarchy supports drilling up and down.
  • Hierarchy speeds the aggregation in a cube while processing.
  • Even though Hierarchies are optional; these are recommended.

 

Master tables -> Dimension table (20% data)

Transactional tables-> Fact tables (80% data)

 

Master/Dimension tables:

  • Primary key
  • String data
  • Less Data

 

Transactional/Fact tables:

  • Foreign Keys
  • Numeric data
  • Large data

 

To create best DWH:

  1. Study OLTP database
  2. Study Client data Analytics/Reports/Data science/Dashboards/ MIS/Visualizations requirement.

 

OLTP to OLAP Table conversion:

 

ð  OLTP contains

o   Master Tables

§  Product

§  Course

§  Student etc

 

o   Transactional Tables

§  Payment Details

§  Order Details etc

ð  In OLAP

o   Master Table converted as Dimensional Tables

o   Transactional Tables converted as Fact Tables

 

Why DWH must contain less no. of tables?

=> More tables’ means more joins, more joins means reading will be slow.

 

DWH – Best Practice to create a DWH:

=========================

 

  1. What does Fact table contain?

Ans: FK and Measures only.

 

  1. DWH must contain less no of tables.

                        Dim -- More  --- reduce no of Dim tables

                                Fact --  Less

  1. Reduce no of columns in Fact tables (Fact table size need to be reduced)

 

DWH Design Best Practices:

==================

  • Reduce no of Tables in DWH.
  • Study fully OLTP DB
  • Study Reporting Requirements/Data Analytics
  • Proper Relationships b/w tables (Dimension and Fact tables)
  • Reduce no. of columns in Fact Table
  • Reduce size of keys (PK and FKs)
  • Have Junk Dimensions
  • Have Degenerate Dimensions


Popular posts from this blog

How to remove special characters in the Data(using Script component Task and Regex)

Start with the Below screen shot ( Before Script Task) and you will come to know about the senario. Note : The Script Component also works for the special characters within the Data as well . for ex "A$hish" Hover over the below Regex. it will show > Show potential fixes > in that Select System.Text.RegularExpressions. The Below will disappear.  Use the Below Script.  public override void Input0_ProcessInputRow(Input0Buffer Row) { Row.CustomerID = RemoveSpecialCharacters(Row.CustomerID); Row.NameStyle = RemoveSpecialCharacters(Row.NameStyle); Row.Title = RemoveSpecialCharacters(Row.Title); Row.FirstName = RemoveSpecialCharacters(Row.FirstName); Row.LastName = RemoveSpecialCharacters(Row.LastName); Row.CompanyName = RemoveSpecialCharacters(Row.CompanyName); Row.SalesPerson = RemoveSpecialCharacters(Row.SalesPerson); Row.EmailAddress = RemoveSpecialCharacters(Row.EmailAddress); Row.Phone = RemoveSpecialCharacters(Row.Phone); } public static string ...

DBMS

Database : database can be defined as a collection of information stored at a particular place written in a predetermined manner, belongs to a particular topic.  DBMS: (DATA BASE MANAGEMENT SYSTEM):  it is defined as software which is present inside the database which is used o maintain or manage the data within the database .                                 In the above diagram, if the user needs any info from the database. He has to contact with  dbms , then the dbms goes to the database, takes the information from the  database and gives it back to the user. Database models :  these models are broadly divided into the following types: 1.         FMS/FDMS 2.         HMS/HDMS 3.         NDBMS 4.         RDBMS 5.      ...

Different Schemas in DWH

  SCHEMAS:   Star Schema: -       Star Schema means Dimension Tables are directly linked to the Fact Table. Snow Flake Schema: -       Dimension Tables are indirectly linked to the Fact Table OR -       Dimension table is linking to another dimension table .       Star Schema   Snow Flake Schema 1 Has redundant data and hence tough to maintain/change 1 No redundancy and hence more easy to maintain/change 2 Less complex queries and hence easy to understand 2 More complex queries and hence less easy to understand 3 Less foreign keys so faster execution time of cube. 3 More foreign keys so take more execution of cube 4 Has De-Normalized Tables 4 Has Normalized Tabl...