
SSIS Basics: Different Transformations in the Data Flow Task


Control Flow contains:

·         Tasks

·         Containers

Data Flow Task contains (ETL):

·         Sources

·         Destinations

·         Transformations

 

Data Flow Tasks:

Data Flow – Sources and Destinations:

1.     ADO.NET Source/Dest

2.     Excel Source/Dest

3.     Flat File (or CSV) Source/Dest

4.     OLEDB Source/Dest: reads/writes data in any RDBMS, such as SQL Server, Oracle, etc.

5.     Raw File Source/Dest

6.     XML (eXtensible Markup Language) Source

7.     CDC (Change Data Capture) Source – SQL Server 2012: used for incremental loading of data.

Destinations:

1.     ADO.NET Destination: loads data using an ADO.NET provider.

2.     Data Mining Model Training: trains SSAS data mining models.

3.     Data Reader Destination: exposes the output so a .NET application can read it through the ADO.NET DataReader interface.

4.     Dimension Processing: processes SSAS cube dimensions.

5.     Excel Destination

6.     Flat File Destination

7.     OLEDB Destination

8.     ODBC Destination – SQL Server 2012

9.     Partition Processing: processes SSAS cube partitions.

10.  Raw File Destination

11.  Recordset Destination: writes the output to an in-memory ADO recordset (an object variable).

12.  SQL Server Compact Destination: loads data into SQL Server Compact, a lightweight database for small devices.

13.  SQL Server Destination

Flat File vs. Raw File:

Flat File                                | Raw File
Has structure                            | No structure
Holds original source data               | Used for staging (temporary data)
Lower performance                        | Higher performance
Requires a connection (driver required)  | No connection required (no driver required)
Extension is .txt                        | Extension is .raw

 

Raw files:

·         These are faster.

·         They are not parsed.

·         They do not require drivers.

·         They can be used as staging files in SSIS.

Types of Flat Files:

1.    Delimited Flat Files:

·         The column data is separated by a delimiter such as a tab, a space, ',' or '.', etc.

·         TSV (Tab-Separated Values), CSV (Comma-Separated Values)

Ex:  EID, NAME, SALARY, GENDER, DEPT

            1, RAM, 500, M, 10

            2, KIRAN, 2000, M, 10

2.    Fixed Width Flat File:

·         Fixed column widths define the columns and rows; there are no delimiters.

Ex: eidnamesalgender

     001Peter     01000M

Here the column widths are fixed: eid is 3, name 10, sal 5, gender 1.

Column size is fixed, and row size is fixed.

3.    Row Ragged Flat File:

·         Rows can vary in length; each row (including the last) ends with a newline.

4.    Ragged Right Flat File:

·         Every column has a fixed width except the last one, which can vary and is terminated by the row delimiter.

Excel Files:

·         Excel Source/Dest

o   Emp.xls -> workbook name

o   Sheet$  -> the $ indicates a worksheet

OLEDB vs. SQL Server Destination:

OLEDB Destination                    | SQL Server Destination
Works with local and remote servers  | Works only with the local server
Slower                               | Faster
Works with any RDBMS                 | Works only with SQL Server

 

Transformations:

Transformations fall into 3 categories:

1.   Blocked Transformations (Asynchronous):

·         These transformations cannot send a row on to the next transformations until all of the input rows have been read.

·         These will degrade the performance of the Data Flow.

 

1)  Aggregate:

·         Can use Avg, Sum, Min, Max, and Count on source data.

·         Supports GROUP BY clause

Ex: here we don't need the source data as-is; we need aggregated data in the destination.

SOURCE:

EID | NAME   | CITY
1   | Ram    | CHN
2   | John   | BLR
3   | Venu   | CHN
4   | Salman | HYD

DEST:

CITY | NO OF EMP
CHN  | 2
HYD  | 1
BLR  | 1
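The Aggregate behaviour above — a COUNT with a GROUP BY on CITY — can be sketched in Python as an analogy (SSIS configures this in the transformation editor; the rows are the sample data from the tables above):

```python
from collections import Counter

# Source rows (EID, NAME, CITY), as in the example above
source = [(1, "Ram", "CHN"), (2, "John", "BLR"), (3, "Venu", "CHN"), (4, "Salman", "HYD")]

# GROUP BY CITY with a COUNT of employees: one output row per city
emp_count = Counter(city for _, _, city in source)
print(dict(emp_count))  # {'CHN': 2, 'BLR': 1, 'HYD': 1}
```

Note that, like any blocked transformation, the count for a city is only known after every input row has been read.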


2)  Balanced Data Distributor Transformation:

·         Distributes data to multiple destinations.

·         Enables faster inserts at the destination.

·         Used for load balancing.

·         Uses a round-robin technique to distribute the rows.












3)  Cache Transformation:

·         This caches data from a source to speed up processing.

·         The Lookup transformation can then perform lookups against the cached data.

*** An RDBMS is disk-based technology: it stores data on the hard drive, not in RAM. (RAM is temporary/volatile — when you shut down the computer, its contents are gone.)

SSAS:

Multidimensional Model -- disk-based

Tabular Model -- RAM-based -- faster

4)  Sort Transformation:

o   Sorts the input data in ASC/DESC order.

o   Can eliminate rows with duplicate sort values.

o   It is a blocked transformation.

Tips for Sort:

·         Avoid sort operations in the data flow for large data sets.

·         If necessary, use the Sort transformation only on smaller data sets.

·         When the data is large, perform the sort before the data flow (e.g. ORDER BY in the source query) or after it.
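As a rough analogy (not the SSIS engine itself), sorting with duplicate elimination looks like this; the rows are hypothetical:

```python
# Unsorted input with one duplicate row
rows = [("Venu", 3), ("Ram", 1), ("Ram", 1), ("John", 2)]

# Sort ascending on the name column and drop duplicate rows,
# mirroring the Sort transformation's duplicate-elimination option
deduped = sorted(set(rows), key=lambda r: r[0])
print(deduped)  # [('John', 2), ('Ram', 1), ('Venu', 3)]
```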

 

5)  Fuzzy Grouping Transformation: *****

·         This performs data-cleaning tasks by identifying rows of data that are likely to be duplicates.

·         It is a blocked transformation.

·         Ex:

o   To identify duplicate ration cards, PAN cards, Aadhaar cards

o   To identify duplicate rows

o   In the example below, 70-80% of the data is similar, so it can be treated as a duplicate.
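A rough sketch of the idea in Python, using difflib's character-level similarity as a stand-in for SSIS's fuzzy matching (the 0.8 threshold mirrors the 70-80% similarity mentioned above; the city names are hypothetical):

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    # Character-level similarity ratio as a stand-in for fuzzy matching
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

names = ["Hyderabad", "Hyderbad", "Chennai", "Hydarabad"]

# Group values that cross the similarity threshold; the first value
# in each group acts as the canonical (clean) representative
groups = []
for name in names:
    for group in groups:
        if similar(name, group[0]):
            group.append(name)
            break
    else:
        groups.append([name])

print(groups)  # [['Hyderabad', 'Hyderbad', 'Hydarabad'], ['Chennai']]
```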



Very Imp: It is used to group values.



Ex: Below, it groups the duplicate rows and keeps the smallest key value in a temp table.

 


Q. How do you eliminate duplicates?

 

6)  Fuzzy Lookup Transformation:

·         Lookup searches for an Exact Match.

·         Fuzzy Lookup searches for a Close Match (not an exact/100% match).

·         This performs data-cleaning tasks such as standardizing data, correcting data, and providing missing values.

·         It uses fuzzy matching to return one or more close matches from the reference table.

·         It supports Unicode and non-Unicode strings.

·         It is a blocked transformation.

·         Ex: Below, the city Hyderabad is misspelled; in such cases Fuzzy Lookup returns the closest match for the city.



NOTE:

Use Fuzzy Lookup to correct misspelled data.

Use Fuzzy Grouping to remove duplicate data.
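The lookup-against-a-reference-table idea can be sketched with difflib's close-match search (an analogy only; the reference cities and cutoff are assumptions, not SSIS defaults):

```python
from difflib import get_close_matches

# Reference table of correct city names (hypothetical data)
reference = ["Hyderabad", "Chennai", "Bangalore"]

def fuzzy_lookup(value, cutoff=0.7):
    # Return the closest reference value, or None when nothing is close enough
    matches = get_close_matches(value, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_lookup("Hyderbad"))  # Hyderabad  (misspelling corrected)
print(fuzzy_lookup("Delhi"))     # None       (no close match)
```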

Q. What is the difference between Lookup and Joins?

Ans:

7)  Merge:

·         This combines TWO sorted INPUTS into a single sorted OUTPUT.

·         It requires sorted data for its inputs.

·         Sort the data on a key column.

·         Sources can be homogeneous or heterogeneous.

            Limitations:

-      You cannot have more than 2 inputs.

-      The number (and types) of columns should match.
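Conceptually, Merge behaves like interleaving two already-sorted streams, e.g. Python's heapq.merge (an analogy, not SSIS internals; the rows are hypothetical):

```python
from heapq import merge

# Two inputs, each already sorted on the key column (EID)
input_a = [(1, "Ram"), (3, "Venu")]
input_b = [(2, "John"), (4, "Salman")]

# Interleave the two sorted inputs into a single sorted output
merged = list(merge(input_a, input_b, key=lambda r: r[0]))
print(merged)  # [(1, 'Ram'), (2, 'John'), (3, 'Venu'), (4, 'Salman')]
```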

 


 

 

2.   Partially Blocked Transformations:

1.    Data Mining

 

2.    Merge Join Transformation:

·         This is similar to JOINs in an RDBMS.

·         It is used to join two sorted datasets using a FULL, LEFT, or INNER join.

·         Two inputs and one output.

·         Does not support an error output.

·         Requires sorted data for its inputs.

Limitations:

o   Only 2 inputs

o   Requires SORTED inputs
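The need for sorted inputs comes from the classic two-pointer merge-join algorithm, sketched here as an inner join (illustrative only; the sample rows are hypothetical):

```python
# Inner merge join of two sorted inputs on the first field
left  = [(1, "Ram"), (2, "John"), (4, "Salman")]   # sorted on key
right = [(2, "BLR"), (3, "CHN"), (4, "HYD")]       # sorted on key

joined, i, j = [], 0, 0
while i < len(left) and j < len(right):
    if left[i][0] < right[j][0]:
        i += 1                                     # advance the smaller side
    elif left[i][0] > right[j][0]:
        j += 1
    else:                                          # keys match: emit a joined row
        joined.append((left[i][0], left[i][1], right[j][1]))
        i += 1
        j += 1

print(joined)  # [(2, 'John', 'BLR'), (4, 'Salman', 'HYD')]
```

Because both sides are sorted, each input is scanned only once — which is why the transformation can stream rows without reading everything first.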

 

RDBMS Joins                          | SSIS Joins (Merge Join)
1. Inner Join                        | 1. Inner Join
2. Outer Join (Left, Right, Full)    | 2. Outer Join (Left, Full)
3. Cross Join                        |
4. Self Join                         |
Max 255 tables can be joined         | Max 2 tables can be joined
Sorting is not needed (uses indexes) | Sorted input is required

 

 

           

Interview Questions:

·         How do you eliminate/separate Duplicates in source data?

·         How do you find duplicates in the given data? Fuzzy Grouping

·         How do you perform delta loading or incremental loading?

 

3.    Pivot Transformation:

·         This converts ROWS into COLUMNS.

·         It turns normalized data into de-normalized data.

Un Pivot Transformation:

·         This converts COLUMNS into ROWS.

·         It turns de-normalized data into normalized data.
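Both operations can be sketched with plain dictionaries (a concept sketch with hypothetical sales figures, not the SSIS pivot editor):

```python
# Normalized input rows: (year, quarter, sales)
rows = [(2023, "Q1", 10), (2023, "Q2", 20), (2024, "Q1", 30)]

# Pivot: quarter values become columns, one output row per year
pivoted = {}
for year, quarter, sales in rows:
    pivoted.setdefault(year, {})[quarter] = sales
print(pivoted)  # {2023: {'Q1': 10, 'Q2': 20}, 2024: {'Q1': 30}}

# Unpivot: columns go back to rows, restoring the normalized shape
unpivoted = sorted((y, q, s) for y, cols in pivoted.items() for q, s in cols.items())
assert unpivoted == sorted(rows)
```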

 

4.    Term Extraction Transformation:

·         This extracts terms from text in an input column and writes them to an output column.

·         For example, text from e-mail messages may provide useful feedback about products; the extracted terms help analyze that feedback.

 

5.    Term Lookup Transformation:

·         It matches terms extracted from text in an input column with terms in a reference table.

·         It then counts the number of times a term in the lookup table occurs in the input data set and writes to output.

·         This transformation is useful for creating a custom word list based on the input text, complete with word frequency statistics.
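The "count term occurrences against a reference list" idea can be sketched with a word counter (illustrative only; the terms and feedback messages are hypothetical):

```python
from collections import Counter
import re

# Reference list of terms and some input text
terms = ["battery", "screen"]
messages = ["The battery is great",
            "Battery life and screen quality",
            "Nice screen"]

# Count how often each reference term occurs in the input, case-insensitively
words = Counter(w for m in messages for w in re.findall(r"\w+", m.lower()))
frequency = {t: words[t] for t in terms}
print(frequency)  # {'battery': 2, 'screen': 2}
```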

 

6.    Union All Transformation:

·         This combines multiple inputs into one output.

·         For Example, the data from five different Flat file sources can be combined as one output.

·         It does not require sorted data for its input.

 

Merge vs. Union All:

Merge                    | Union All
Takes only 2 inputs      | Takes multiple inputs
Requires sorted data     | Sorting not required
Produces sorted output   | Does not produce sorted output

 

Q. Do you know Merge?

Ans: Yes.

In SQL: the MERGE statement - used to perform incremental loading.

In SSIS: the Merge transformation - used to combine data.

 

 

3.   Non Blocked Transformations (Synchronous):

 

1)   Audit Transformation:

·         This captures data about the environment in which the package runs.

·         E.g. the package name, computer name, and operator name.

·         SSIS includes system variables that supply this info.
 

2)   Character Map Transformation:

·         Converts string data, e.g. to uppercase or lowercase.

·         Works on string data only.

·         Can add a new column or replace the existing column.

 

3)   Conditional Split Transformation:

·         Splits the input data on a filter condition.

·         Unmatched rows go to the default output.
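The routing behaviour can be sketched in Python: each row is tested against the condition, and anything that doesn't match falls through to the default output (sample rows are hypothetical):

```python
# Rows are routed on a filter condition; unmatched rows go to the default output
rows = [("Ram", "CHN"), ("John", "BLR"), ("Venu", "CHN")]

chn_output, default_output = [], []
for row in rows:
    (chn_output if row[1] == "CHN" else default_output).append(row)

print(chn_output)      # [('Ram', 'CHN'), ('Venu', 'CHN')]
print(default_output)  # [('John', 'BLR')]
```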



 

4)   Copy Column Transformation:

·         It is used to create a copy of given input column.

·         Perform new operations on new column.

·         Multi Cast creates multiple copies of whole data whereas Copy Column creates copy of a given column.



5)    Data Conversion Transformation:

·         This converts data from one type to another.

·         Can apply multiple conversions to a single input column.

Characters (Data):

Non-Unicode                               | Unicode
English only                              | All languages + English
1 char = 1 byte                           | 1 char = 2 bytes
SQL Server types: char, varchar, text     | SQL Server types: nchar, nvarchar, ntext (n stands for national)
SSIS data type: DT_STR                    | SSIS data type: DT_WSTR
Eg: Notepad, text files                   | Eg: Excel, XML

 

6)   Derived Column Transformation:

·         Its purpose is to evaluate complex expressions.

·         It creates a new column from existing columns.

·         It can add a new column or replace an existing column.

·         Ex: take only the first letter of the Gender column:

o   Female -> F

o   Male -> M

·         Ex:

o   If Gender == 'M', then Name must become Mr.Name

o   If Gender == 'F', then Name must become Ms.Name

·         It can also trim string data.
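The two examples above — trimming and building a prefixed name from Gender — can be sketched together (an analogy for the expression, not SSIS expression syntax; the rows are hypothetical):

```python
# Derive a new Name column from existing columns: trim stray spaces and
# prefix Mr./Ms. based on Gender
rows = [(" Ram ", "M"), ("Sita", "F")]

derived = [
    (("Mr." if gender == "M" else "Ms.") + name.strip(), gender)
    for name, gender in rows
]
print(derived)  # [('Mr.Ram', 'M'), ('Ms.Sita', 'F')]
```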

7)   Export Column Transformation:

·         This reads data in the data flow and inserts it into files.

·         For example, a picture of each product could be exported to a separate file.

8)   Import Column Transformation:

·         This reads data from files and adds it to columns in the data flow.

·         Using this, a package can add text and images stored in separate files to a data flow.

·         Note: these are used to move data between databases and the file system, and vice versa.

            


9)    Multicast Transformation:

·         It distributes its input to one or more outputs.

·         Does not have properties to set

·         1 input many outputs.

        


 

10)        Lookup Transformation:

·         Performs a lookup (search) by comparing the input with reference-table data.

·         The reference table can be a cache file, a SQL table, or a query.

·         By default, rows without matching entries are treated as errors.

·         It must therefore be configured with an error output.

·         Lookup matching is case sensitive.

·         Lookup is similar to an inner join, but not exactly the same.

·         Lookup returns exact matches.

Difference between InnerJoin and Lookup:



Lookup Cache Options:

A)   Full Cache:

·         Default mode in Lookup

·         In this mode, the entire reference table is loaded into the cache during the pre-execution phase.

·         This approach uses the most memory and adds extra startup time to the data flow.

·         Lookup operations are very fast during execution.

·         The lookup cache is not swapped to disk, so the SSIS package may fail if the system runs out of memory.

When to use Full Cache?

·         When we access a large portion of the reference table.

o   Need to have huge memory.

o   Takes extra Pre-Execution time to load data from reference table.

·         When database of reference table is remote or under heavy load.

Execution Phases of SSIS in Full Cache Mode: (Package.dtsx)

1.     Pre-Execution Phase:

-      Validations, e.g. whether the table exists.

-      Load all the data from the reference table into the local cache/RAM/buffer.

-      Disadvantages:

o   More time to load the data.

o   Requires more RAM.

-      Advantage:

o   Faster lookups.

2.     Progress:

-      Transform/ Loading

3.     Post Execution:

-      Clean up like temp table, memory etc

 

B)   Partial Cache:

·         In this mode, the lookup cache starts out empty.

·         When a new row comes in, the lookup transform checks its cache for the matching values.

·         If no match is found, it queries the reference table. If the match is found at the reference table, the values are cached so they can be used the next time a matching row comes in.

·         Advantage:

o   Less Memory

o   Pre-Execution time is less

·         Disadvantage:

o   Lookup operation is slow, may hit the database more often.

Execution Phases of SSIS in Partial Cache Mode: (Package.dtsx)

1.     Pre-Execution Phase:

-      Validations like if table exists or not etc

2.     Progress:

-      Transform/ Loading

-      RAM is Empty

-      Step1: Read a row from source.

-      Step2: Checks in Cache/RAM

-      Step3: If not exists in the RAM it checks in Reference table.

-      Step4: If exists in Reference table, the value loaded in RAM.

3.     Post Execution:

-      Clean up like temp table, memory etc

 

C)   No Cache:

·         In this mode, Lookup Transform doesn’t maintain a lookup cache.

·         Hitting the reference table for every row.

·         This is used when there is no enough cache/RAM.
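The partial-cache behaviour (Steps 1-4 above) is essentially memoization, sketched here with a dict standing in for the reference table (hypothetical data; real SSIS queries the database on a miss):

```python
# The reference "table" is a dict standing in for a database query
reference_table = {1: "CHN", 2: "BLR"}
cache = {}
queries = 0                               # counts hits against the reference table

def lookup(key):
    global queries
    if key not in cache:                  # cache miss: query the reference table
        queries += 1
        cache[key] = reference_table.get(key)
    return cache[key]                     # later rows with the same key hit the cache

results = [lookup(eid) for eid in [1, 2, 1, 1]]
print(results, queries)  # ['CHN', 'BLR', 'CHN', 'CHN'] 2
```

Only the first occurrence of each key reaches the reference table; No Cache mode would instead query it for all four rows, and Full Cache would load everything up front.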

 

11)        Percentage Sampling Transformation:

·         Used to sample data (a percentage of randomly selected rows) to audit the quality of the data.

12)        Row count :

·         Counts the number of rows returned by the previous transformation/task and stores it in a variable.

 

13)        Row Sampling Transformation:

·         Used to sample data (a fixed number of randomly selected rows) to audit the quality of the data.
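The difference between the two sampling transformations — a fixed row count vs. a probability per row — can be sketched like this (illustrative only; SSIS does not use Python's RNG):

```python
import random

random.seed(0)            # deterministic for the example
rows = list(range(100))

# Row Sampling: a fixed number of randomly selected rows
row_sample = random.sample(rows, 10)

# Percentage Sampling: each row is kept with the given probability (~25% here),
# so the output size varies from run to run
pct_sample = [r for r in rows if random.random() < 0.25]

print(len(row_sample))    # 10
```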

 

14)        Script Component Transformation:

·         This transformation is used to integrate script code (C#.NET or VB.NET).

·         Scripts are written using VSTA (Visual Studio Tools for Applications).

·         When to use a script: to perform complex operations.

Script Task vs. Script Component:

Script Task                          | Script Component
Lives in the Control Flow as a task  | Lives in the Data Flow as a source, transformation, or destination
Used to perform complex operations   | Used to perform complex operations

 

 

 

 




 

Synchronous vs. Asynchronous:

Synchronous (Non-Blocked)                          | Asynchronous (Blocked)
Process rows one by one                            | Process a block/set of rows
Reuse the input buffer (no new buffer is created)  | Create a new output buffer
Do not accumulate data                             | Accumulate data
Faster                                             | Slower
Eg: Data Conversion, Conditional Split             | Eg: Sort, Merge, Aggregate

 

OLEDB Command Transformation:******

·         This is used to execute SQL commands inside the Data Flow.

·         It is similar to the Execute SQL Task in the Control Flow.

·         It performs row-by-row operations, hence it is slow.

Execute SQL Task vs. OLEDB Command:*****

Execute SQL Task               | OLEDB Command
Part of the Control Flow       | Part of the Data Flow
Performs batch operations      | Performs row-by-row operations
Faster                         | Slower

 

***Errors in SSIS:

·         In SSIS, errors are of 2 types:

a.    Common errors: any kind of error.

b.    Truncation errors: occur because of data type size problems in the data flow.

·         Note: Sort does not support an error (red pipe) output.

Error Output Configurations:

The following are the options to configure errors in SSIS.

A.    Fail Component (the default setting in SSIS): once an error occurs, data flow execution stops.

B.    Ignore Failure: Data flow execution is continued with next steps, by ignoring the error.

C.    Redirect Rows (Using Red Pipe): It redirects the mismatched rows to some other destination and data flow execution is continued.

Unit testing (Things to check):

 

-      Number of rows in the source

-      Number of rows in the destination

-      Date columns

Rolling Back:

Transactions (ALL/NONE - do all or nothing):

Go to Control Flow and R/Click -> Properties-> TransactionOption

-      Supported – By default supported

-      Not Supported

-      Required

*** SSIS does not have Transactions by default.

Q. Is it recommended transactions to be implemented in SSIS? Why?

=> Generally no: if we load millions of rows and a failure occurs on one row, all the data is rolled back.

Transactions can be implemented, but they are not recommended and not a must.

 

Questions for Errors:

ð  How do you handle error in SSIS?

ð  If there is an error in the SSIS, how do you handle?

ð  There is an error data in the source, how you handle the error data in SSIS?

 

Interview Questions:

1.     An OLEDB destination can send data to any RDBMS, so why do we also have a SQL Server Destination? In which case do we use the OLEDB destination vs. the SQL Server Destination?

Ans: The OLEDB Destination works with local and remote servers; the SQL Server Destination works only with the local server (but is faster).



2.     How do you trim the data using SSIS package?

Ans:

3.    How do you load flat-file data into a destination database? Which task do you use: the Bulk Insert Task or a Data Flow Task?

Ans: The Bulk Insert Task is faster but cannot clean/transform the data. If you also need to clean the data, use a Data Flow Task (Flat File source and OLEDB destination).

4.    Why do you use raw files?

Ans: Raw files are faster because they are not parsed. Inserting data into a raw file is fast, no driver needs to be installed, and both reading and writing are fast.

