Control Flow Contains | Data Flow Task (ETL) Contains
Tasks | Sources
Containers | Destinations
 | Transformations
Data Flow Tasks:
Data Flow – Sources and Destinations:
Sources:
1. ADO.NET Source/Dest
2. EXCEL Source/Dest
3. FLAT FILE or CSV Source/Dest
4. OLEDB Source/Dest: You can read the data from any RDBMS, like SQL Server, Oracle, etc.
5. RAW File Source/Dest
6. XML (eXtensible Markup Language) Source
7. CDC (Change Data Capture) Source – SQL Server 2012: Used in incremental loading of data
Destinations:
1. ADO.NET Destination (any Microsoft app)
2. Data Mining Model Training
3. DataReader Destination: Used when the data is to be read by a .NET application.
4. Dimension Processing: Processes the data of dimensions (Cube)
5. Excel Destination
6. Flat File Destination
7. OLEDB Destination
8. ODBC Destination – SQL Server 2012
9. Partition Processing: For cube processing purposes
10. Raw File Destination
11. Recordset Destination
12. SQL Server Compact Destination: To load data into small/compact databases, e.g. on mobile devices
13. SQL Server Destination
Flat File vs Raw File:
Flat File | Raw File
Has structure | No structure
Original source data | Used for staging (temporary)
Lower performance | Higher performance
Requires a connection (driver required) | No connection required (no driver required)
Extension is .txt | Extension is .raw
Raw files:
· These are faster.
· They do not require parsing.
· They do not require drivers.
· They can be used as staging files in SSIS.
Types of Flat Files:
1. Delimited Flat Files:
· The column data in the flat file is separated using a delimiter like a tab, space, ',' or '.' etc.
· TSV (Tab Separated Values), CSV (Comma Separated Values)
Ex: EID, NAME, SALARY, GENDER, DEPT
1, RAM, 500, M, 10
2, KIRAN, 2000, M, 10
2. Fixed Width Flat File:
· It uses fixed widths to define columns and rows.
Ex:
eidnamesalgender
001Peter    01000M
Here, the column widths are defined: eid is 3, name is 10, sal is 5, gender is 1.
Column size is fixed.
Row size is fixed.
3. Row Ragged Flat File:
· The last line of the flat file ends with a new line.
4. Ragged Right Flat File:
· A space is added after the end of each row.
Excel Files:
· Excel Source/Dest
  o Emp.xls -> Workbook name
  o Sheet$ -> $ indicates the sheet
OLEDB vs SQL Server Destinations:
OLEDB Destination | SQL Server Destination
Works for local and remote servers. | Works with local servers only.
Slower. | Faster.
Works with any RDBMS. | SQL Server only.
Transformations:
There are 3 categories:
1. Blocked Transformations (Asynchronous):
· These transformations cannot send a row on to the next transformation until all of the input rows have been read.
· These degrade the performance of the data flow.
1) Aggregate:
· Can apply Avg, Sum, Min, Max, and Count to source data.
· Supports a GROUP BY clause.
Ex: Here we don't need the source data as it is; rather, we need aggregated data in the destination (see the SQL sketch after the tables below).
SOURCE:
EID | NAME | CITY
1 | Ram | CHN
2 | John | BLR
3 | Venu | CHN
4 | Salman | HYD

DEST:
CITY | NO OF EMP
CHN | 2
HYD | 1
BLR | 1
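A rough SQL equivalent of the Aggregate output above (a sketch only; the source table name Emp is an assumption):
SELECT CITY, COUNT(*) AS [NO OF EMP]
FROM Emp            -- assumed source table with EID, NAME, CITY
GROUP BY CITY;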
2) Balanced Data Distributor Transformation:
· Distributes data to many destinations.
· Faster inserts into the destinations.
· Used for load balancing.
· Uses a round-robin technique to distribute the data.
3) Cache Transformation:
· This caches the data from the source to speed up the process.
· The Lookup transformation then performs lookups against the cached data.
*** An RDBMS is disk-based technology: it stores data on the hard drive and not in RAM (RAM is temporary; when you shut down the computer, that data is no longer available).
SSAS:
Multidimensional Model -- Disk
Tabular Model -- RAM -- Faster
4) Sort Transformation:
o Sorts the input data in ASC/DESC order.
o Can eliminate duplicate rows.
o It is a blocked transformation.
Tips for Sort:
· Avoid sorting operations during the data flow for large data sets.
· If necessary, use the Sort transformation only for smaller data sets.
· When the data size is large, perform the SORT operation prior to the data flow OR after the data flow, as shown in the sketch below.
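A minimal sketch of sorting before the data flow (the Emp table is an assumption): the source query returns pre-sorted rows, so the blocked Sort transformation is not needed. In SSIS, the source output would also be marked as sorted (IsSorted / SortKeyPosition in the Advanced Editor).
SELECT EID, NAME, CITY
FROM Emp            -- assumed source table
ORDER BY EID;       -- sorting is done by the database engine, outside the data flow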
5) Fuzzy Grouping Transformation: *****
· This performs data-cleaning tasks by identifying rows of data that are likely to be duplicates.
· It is a blocked transformation.
· Ex:
  o To identify duplicate Ration Cards, PAN Cards, Aadhaar Cards
  o To identify duplicate rows
  o In the example below, 70-80% of the data is similar and hence can be considered duplicate.
Very Imp: It is used to group similar values.
Ex: Below, it groups the duplicate rows and keeps the smallest key value in the temp table.
Q. How do you eliminate duplicates? (See the sketch below.)
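For exact duplicates (Fuzzy Grouping targets near-duplicates), a common SQL-side sketch uses ROW_NUMBER; the Emp table and its columns are assumptions:
WITH d AS
(
    SELECT EID, NAME, CITY,
           ROW_NUMBER() OVER (PARTITION BY NAME, CITY ORDER BY EID) AS rn
    FROM Emp
)
DELETE FROM d WHERE rn > 1;   -- keeps the row with the smallest EID in each duplicate group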
6) Fuzzy Lookup Transformation:
· Lookup searches for an exact match.
· Fuzzy Lookup searches for a close match (not an exact, 100% match).
· This performs data-cleaning tasks such as standardizing data, correcting data, and providing missing values.
· It uses fuzzy matching to return one or more close matches from the reference table.
· It supports Unicode and non-Unicode strings.
· It is a blocked transformation.
· Ex: Below, see the misspelled data for the city Hyderabad; in such cases Fuzzy Lookup takes the closest match for the data/city.
NOTE:
Use Fuzzy Lookup to correct misspelled data.
Use Fuzzy Grouping to remove duplicate data.
Q. What is the difference between Lookup and Joins?
Ans:
7) Merge:
· This combines TWO sorted INPUTS into a single OUTPUT.
· It requires sorted data for its inputs.
· Sort the data using a key column.
· Sources can be homogeneous or heterogeneous.
Limitations:
- You cannot have more than 2 sources.
- The number of columns should match.
2. Partially Blocked Transformations:
1. Data Mining
2. Merge Join Transformation:
· This is similar to JOINs in an RDBMS (a SQL analogue is sketched below).
· This is used for joining two sorted datasets using a FULL, LEFT, or INNER join.
· Two inputs and one output.
· It does not support an error output.
· It requires sorted data for its inputs.
Limitations:
o Only 2 inputs
o Requires SORTED inputs
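A rough SQL analogue of what Merge Join produces (a sketch; the Emp and Dept tables and the DEPT key are assumptions):
SELECT e.EID, e.NAME, d.DNAME
FROM Emp AS e
INNER JOIN Dept AS d      -- LEFT and FULL are the other join types Merge Join supports
    ON e.DEPT = d.DEPT;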
RDBMS Joins | SSIS Joins
1. Inner Join  2. Outer Join (a. Left  b. Right  c. Full)  3. Cross Join  4. Self Join | 1. Inner Join  2. Outer Join (a. Left  b. Full)
Max 255 tables can be joined. | Max 2 tables can be joined.
Sorting is not needed (Index). | Sorting is required.
Interview Questions:
· How do you eliminate/separate duplicates in source data?
· How do you find duplicates in the given data? (Fuzzy Grouping)
· How do you perform delta loading or incremental loading?
3. Pivot Transformation:
· This converts ROWS into COLUMNS (see the T-SQL sketch below).
· This turns normalized data into de-normalized data.
Unpivot Transformation:
· This converts COLUMNS into ROWS.
· This turns de-normalized data into normalized data.
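A rough T-SQL analogue of the Pivot transformation (a sketch; the Sales table with CITY and AMOUNT columns is an assumption):
SELECT [CHN], [HYD], [BLR]
FROM (SELECT CITY, AMOUNT FROM Sales) AS src
PIVOT (SUM(AMOUNT) FOR CITY IN ([CHN], [HYD], [BLR])) AS p;
-- UNPIVOT reverses this, turning the columns back into rows.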
4. Term Extraction Transformation:
· This extracts terms from text and writes them to the output.
· For example, text from e-mail messages may provide useful feedback about products; this helps analyze the product feedback.
5. Term Lookup Transformation:
· It matches terms extracted from text in an input column with terms in a reference table.
· It then counts the number of times a term in the lookup table occurs in the input data set and writes the counts to the output.
· This transformation is useful for creating a custom word list based on the input text, complete with word-frequency statistics.
6. Union All Transformation:
· This combines multiple inputs into one output (see the SQL sketch below).
· For example, the data from five different flat file sources can be combined into one output.
· It does not require sorted data for its inputs.
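A rough SQL analogue of Union All (a sketch; the two staging table names are assumptions and must have the same column layout):
SELECT EID, NAME, CITY FROM Emp_File1
UNION ALL
SELECT EID, NAME, CITY FROM Emp_File2;   -- no sorting required; duplicates are kept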
Merge vs Union All:
Merge | Union All
Takes only 2 inputs | Takes multiple inputs
Requires sorted data | Sorting not required
Produces sorted output | Does not produce sorted output
Q. Do you know Merge?
Ans: Yes.
In SQL: the MERGE statement – to perform incremental loading (see the sketch below).
In SSIS: the Merge transformation – to combine the data.
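A minimal sketch of the SQL MERGE statement used for incremental (delta) loading; the Emp_Target and Emp_Stage table names are assumptions:
MERGE Emp_Target AS t
USING Emp_Stage AS s
    ON t.EID = s.EID
WHEN MATCHED THEN
    UPDATE SET t.NAME = s.NAME, t.CITY = s.CITY
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EID, NAME, CITY) VALUES (s.EID, s.NAME, s.CITY);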
3. Non-Blocked Transformations (Synchronous):
1) Audit Transformation:
· This gets data about the environment in which the package runs.
· E.g. the name of the package, the computer, and the operator names.
· SSIS includes system variables to get the above info.
· Ex:
2) Character Map Transformation:
· Converts to uppercase or lowercase.
· Works only on string data.
· Can add a new column or replace an existing column.
3) Conditional Split Transformation:
· Splits the input data on a filter condition, as in the sketch below.
· Unmatched rows go to the default output.
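A rough SQL analogue of a conditional split (a sketch; the Emp table and GENDER column are assumptions). In the transformation itself, the filters are written as SSIS expressions such as GENDER == "M".
SELECT * FROM Emp WHERE GENDER = 'M';   -- output 1
SELECT * FROM Emp WHERE GENDER = 'F';   -- output 2
-- rows matching neither condition go to the default output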
4) Copy Column Transformation:
· It is used to create a copy of a given input column.
· New operations can then be performed on the new column.
· Multicast creates multiple copies of the whole data set, whereas Copy Column creates a copy of a given column.
Ex:
5) Data Conversion Transformation:
· This converts data from one type to another.
· Multiple conversions can be applied to a single input column.
Characters (Data):
Non-Unicode | Unicode
English | All languages + English
1 char = 1 byte | 1 char = 2 bytes
SQL Server data types: char, varchar, text | SQL Server data types: nchar, nvarchar, ntext (n stands for national)
SSIS data types: DT_STR | SSIS data types: DT_WSTR
Eg: Notepad, text files | Eg: EXCEL, XML
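A small T-SQL sketch illustrating the storage difference shown above:
SELECT DATALENGTH('A')  AS NonUnicodeBytes,   -- 1 byte  (char/varchar)
       DATALENGTH(N'A') AS UnicodeBytes;      -- 2 bytes (nchar/nvarchar)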
6) Derived Column Transformation:
· The purpose is to evaluate complex expressions.
· This creates a new column by using existing columns.
· It can add a new column or replace an existing column.
· Ex: Take only the 1st letter from the Gender column (see the sketch below):
  o Female -> F
  o Male -> M
· Ex:
  o If Gender == 'M' then Name must be Mr.Name
  o If Gender == 'F' then Name must be Ms.Name
· Trim string data
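A SQL sketch of the same derived-column logic (the Emp table and the NAME/GENDER columns are assumptions); in the Derived Column transformation itself this would be written in the SSIS expression language:
SELECT LEFT(GENDER, 1)       AS GenderCode,     -- Female -> F, Male -> M
       CASE WHEN LEFT(GENDER, 1) = 'M' THEN 'Mr.' + NAME
            WHEN LEFT(GENDER, 1) = 'F' THEN 'Ms.' + NAME
            ELSE NAME END    AS TitledName,
       LTRIM(RTRIM(NAME))    AS TrimmedName
FROM Emp;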
7) Export Column Transformation:
· This reads data in a data flow and inserts the data into files.
· For example, a picture of each product could be exported to a different file.
8) Import Column Transformation:
· This reads data from files and adds the data to columns in a data flow.
· Using this, a package can add text and images stored in separate files to a data flow.
· Note: Export/Import Column are used to move data between databases and file systems, and vice versa.
9) Multicast Transformation:
· It distributes its input to one or more outputs.
· It does not have properties to set.
· 1 input, many outputs.
10) Lookup Transformation:
· Performs a lookup (search) by comparing the input with reference table data.
· The reference table can be a cache file, a SQL table, or a query.
· By default, this treats rows without matching entries as errors.
· It must be configured for error output.
· Lookup data is case sensitive.
· Lookup is similar to an Inner Join, but not exactly.
· Lookup gets exact matches.
Difference between Inner Join and Lookup:
Lookup Cache Options:
A) Full Cache:
· The default mode in Lookup.
· In this mode, the entire reference table is loaded into the cache during the pre-execution phase.
· This approach uses the most memory and adds additional startup time to the data flow.
· Lookup operations will be very fast during execution.
· Lookup will not swap memory, so the SSIS package may fail if we run out of memory on the system.
When to use Full Cache?
· When we access a large portion of the reference table.
  o Needs a lot of memory.
  o Takes extra pre-execution time to load data from the reference table.
· When the database holding the reference table is remote or under heavy load.
Execution Phases of SSIS in Full Cache Mode: (Package.dtsx)
1. Pre-Execution Phase:
- Validations, like whether the table exists or not, etc.
- Load all the data from the reference table to the local cache/RAM/buffer.
- Disadvantages:
  o More time to load the data.
  o You must have more RAM space.
- Advantage:
  o Faster lookups.
2. Progress:
- Transform / Loading
3. Post-Execution:
- Clean up, like temp tables, memory, etc.
B) Partial Cache:
· In this mode, the lookup cache starts empty at the beginning.
· When a new row comes in, the Lookup transform checks its cache for the matching values.
· If no match is found, it queries the reference table. If the match is found in the reference table, the values are cached so they can be used the next time a matching row comes in.
· Advantages:
  o Less memory.
  o Pre-execution time is less.
· Disadvantage:
  o Lookup operations are slow and may hit the database more often.
Execution Phases of SSIS in Partial Cache Mode: (Package.dtsx)
1. Pre-Execution Phase:
- Validations, like whether the table exists or not, etc.
2. Progress:
- Transform / Loading
- RAM is empty at the start.
- Step 1: Read a row from the source.
- Step 2: Check in the cache/RAM.
- Step 3: If it does not exist in RAM, check the reference table.
- Step 4: If it exists in the reference table, the value is loaded into RAM.
3. Post-Execution:
- Clean up, like temp tables, memory, etc.
C) No Cache:
· In this mode, the Lookup transform doesn't maintain a lookup cache.
· It hits the reference table for every row.
· This is used when there is not enough cache/RAM.
11) Percentage Sampling Transformation:
· Used to sample data (a percentage of randomly selected rows) to audit the quality of the data (see the SQL sketch below).
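A rough T-SQL analogue of percentage sampling (a sketch; the Emp table is an assumption):
SELECT * FROM Emp TABLESAMPLE (10 PERCENT);   -- roughly 10% of the rows, chosen at random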
12) Row Count:
· Counts the number of rows returned by the previous transformations/tasks.
13) Row Sampling Transformation:
· Used to sample data (a number of randomly selected rows) to audit the quality of the data.
14) Script Component Transformation:
· This transformation is used to integrate scripting code (C#.NET or VB.NET).
· Scripts are written using VSTA (Visual Studio Tools for Applications).
· When to use Script:
  · To perform complex operations.
Script Task vs Script Component:
Script Task | Script Component
It is in the Control Flow, as a task. | It is in the Data Flow, as a source, transformation, or destination.
To perform complex operations. | To perform complex operations.
Synchronous vs. Asynchronous:
Synchronous (Non-Blocked) | Asynchronous (Blocked)
These process row by row. | These process a block or set of rows.
No buffer, hence it has to interact with the external source. | Has a buffer.
These do not keep data. | These accumulate data.
These are faster. | These are slower.
Eg. Data Conversion, Conditional Split transformations, etc. | Eg. Sort, Merge, Aggregate transformations, etc.
OLEDB Command Transformation: ******
· This is used to execute SQL commands in the Data Flow (see the sketch below).
· This is similar to the Execute SQL task in the Control Flow.
· It performs row-by-row operations, hence it is slow.
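A typical statement configured inside an OLE DB Command (parameters are marked with ?; the Emp table and the column mapping are assumptions for illustration):
UPDATE Emp SET SALARY = ?
WHERE EID = ?        -- executed once per incoming row, which is why it is slow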
Execute SQL Task vs OLEDB Command: *****
Execute SQL Task | OLEDB Command
It is part of the Control Flow. | It is part of the Data Flow.
Performs batch operations. | Performs row-by-row operations.
Faster. | Slower.
*** Errors in SSIS:
· In SSIS, there are 2 types of errors:
  a. Common errors: any kind of error.
  b. Truncation errors: these occur because of data type size problems in the data flow.
· Note: SORT does not support an error (red pipe) output.
Error Output Configurations:
The following are the options to configure error handling in SSIS:
A. Fail Component (default setting in SSIS): Once an error occurs, data flow execution is stopped.
B. Ignore Failure: Data flow execution continues with the next steps, ignoring the error.
C. Redirect Rows (using the red pipe): Redirects the mismatched rows to some other destination, and data flow execution continues.
Unit Testing (things to check):
· Number of rows in the source
· Number of rows in the destination
· Date columns
Rolling Back:
Transactions (ALL/NONE - do all or nothing):
Go to the Control Flow, right-click -> Properties -> TransactionOption:
- Supported – the default
- NotSupported
- Required
*** SSIS does not use transactions by default.
Q. Is it recommended for transactions to be implemented in SSIS? Why?
=> No, because if we have to load millions of rows and a failure occurs in 1 row, all the data will be rolled back.
Transactions can be implemented, but they are not recommended and are not a must.
Questions for Errors:
· How do you handle errors in SSIS?
· If there is an error in the SSIS package, how do you handle it?
· If there is error data in the source, how do you handle the error data in SSIS?
Interview Questions:
1. An OLEDB destination means sending data to any RDBMS, so why do we have a SQL Server destination? In which case do we use the OLEDB destination and in which the SQL Server destination?
Ans: The OLEDB Destination works for local and remote servers; the SQL Server Destination works with local servers only.
2. How do you trim the data using an SSIS package?
Ans:
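One possible answer (a sketch only, not the author's recorded answer): trim in the source query, or use a Derived Column with the SSIS LTRIM()/RTRIM() expression functions. A SQL version, assuming an Emp table with a NAME column:
SELECT LTRIM(RTRIM(NAME)) AS NAME
FROM Emp;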
3. How do you load flat file data to a destination database? Which task do you use: the Bulk Insert task or the Data Flow task?
Ans: The Bulk Insert task is faster, but it cannot clean the data. If you need to clean the data as well, then a Data Flow task (flat file source and OLEDB destination) can be used.
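A minimal BULK INSERT sketch for comparison (the table name and file path are assumptions):
BULK INSERT Emp
FROM 'C:\data\emp.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);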
4. Why do you use raw files?
Ans: Raw files are faster because they don't parse anything. Whenever you insert data into a raw file, it will be faster. They do not need a driver to be installed. Reading and writing are faster.
5