Control Flow Contains | Data Flow Task (ETL) Contains
Tasks | Sources
Containers | Destinations
 | Transformations
Data Flow Tasks:
Data Flow – Sources and Destinations:
Sources:
1. ADO.NET Source/Dest
2. EXCEL Source/Dest
3. FLAT FILE or CSV Source/Dest
4. OLEDB Source/Dest: You can read the data from any RDBMS, like SQL Server, Oracle, etc.
5. RAW File Source/Dest
6. XML (eXtensible Markup Language) Source
7. CDC (Change Data Capture) Source – SQL Server 2012: Used in incremental loading of data
Destinations:
1. ADO.NET Destination (any Microsoft app)
2. Data Mining Model Training
3. DataReader Destination: Used when the data is to be read by a .NET application.
4. Dimension Processing: Processes the data of dimensions (Cube)
5. Excel Destination
6. Flat File Destination
7. OLEDB Destination
8. ODBC Destination – SQL Server 2012
9. Partition Processing: For cube processing purposes
10. Raw File Destination
11. Recordset Destination
12. SQL Server Compact Destination: To load data into small/compact databases, e.g. on mobile devices
13. SQL Server Destination
Flat File vs Raw File:
Flat File | Raw File
Has structure | No structure
Original source data | Used for staging (temporary)
Lower performance | Higher performance
Requires a connection (driver required) | No connection required (no driver required)
Extension is .txt | Extension is .raw
Raw files:
· These are faster.
· They do not require parsing.
· They do not require drivers.
· They can be used as staging files in SSIS.
Types of Flat Files:
1. Delimited Flat Files:
· The column data in the flat file is separated using a delimiter like a tab, space, ',' or '.' etc.
· TSV (Tab Separated Values), CSV (Comma Separated Values)
Ex: EID, NAME, SALARY, GENDER, DEPT
1, RAM, 500, M, 10
2, KIRAN, 2000, M, 10
2. Fixed Width Flat File:
· It uses fixed widths to define columns and rows.
Ex:
eidnamesalgender
001Peter    01000M
Here, the column widths are defined: eid is 3, name is 10, sal is 5, gender is 1.
Column size is fixed.
Row size is fixed.
3. Row Ragged Flat File:
· The last line of the flat file ends with a new line.
4. Ragged Right Flat File:
· A space is added after the end of each row.
Excel Files:
· Excel Source/Dest
  o Emp.xls -> Workbook name
  o Sheet$ -> $ indicates the sheet
OLEDB vs SQL Server Destinations:
OLEDB Destination | SQL Server Destination
Works for local and remote servers. | Works with local servers only.
Slower. | Faster.
Works with any RDBMS. | SQL Server only.
Transformations:
There are 3 categories:
1. Blocked Transformations (Asynchronous):
· These transformations cannot send a row on to the next transformation until all of the input rows have been read.
· These degrade the performance of the data flow.
1) Aggregate:
· Can apply Avg, Sum, Min, Max, and Count to source data.
· Supports a GROUP BY clause.
Ex: Here we don't need the source data as it is; rather, we need aggregated data in the destination (see the SQL sketch after the tables below).
SOURCE:
EID | NAME | CITY
1 | Ram | CHN
2 | John | BLR
3 | Venu | CHN
4 | Salman | HYD

DEST:
CITY | NO OF EMP
CHN | 2
HYD | 1
BLR | 1
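A rough SQL equivalent of the Aggregate output above (a sketch only; the source table name Emp is an assumption):
SELECT CITY, COUNT(*) AS [NO OF EMP]
FROM Emp            -- assumed source table with EID, NAME, CITY
GROUP BY CITY;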
2) Balanced Data Distributor Transformation:
· Distributes data to many destinations.
· Faster inserts into the destinations.
· Used for load balancing.
· Uses a round-robin technique to distribute the data.
3) Cache Transformation:
· This caches the data from the source to speed up the process.
· The Lookup transformation then performs lookups against the cached data.
*** An RDBMS is disk-based technology: it stores data on the hard drive and not in RAM (RAM is temporary; when you shut down the computer, that data is no longer available).
SSAS:
Multidimensional Model -- Disk
Tabular Model -- RAM -- Faster
4) Sort Transformation:
o Sorts the input data in ASC/DESC order.
o Can eliminate duplicate rows.
o It is a blocked transformation.
Tips for Sort:
· Avoid sorting operations during the data flow for large data sets.
· If necessary, use the Sort transformation only for smaller data sets.
· When the data size is large, perform the SORT operation prior to the data flow OR after the data flow, as shown in the sketch below.
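A minimal sketch of sorting before the data flow (the Emp table is an assumption): the source query returns pre-sorted rows, so the blocked Sort transformation is not needed. In SSIS, the source output would also be marked as sorted (IsSorted / SortKeyPosition in the Advanced Editor).
SELECT EID, NAME, CITY
FROM Emp            -- assumed source table
ORDER BY EID;       -- sorting is done by the database engine, outside the data flow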
5) Fuzzy Grouping Transformation: *****
· This performs data-cleaning tasks by identifying rows of data that are likely to be duplicates.
· It is a blocked transformation.
· Ex:
  o To identify duplicate Ration Cards, PAN Cards, Aadhaar Cards
  o To identify duplicate rows
  o In the example below, 70-80% of the data is similar and hence can be considered duplicate.
Very Imp: It is used to group similar values.
Ex: Below, it groups the duplicate rows and keeps the smallest key value in the temp table.
Q. How do you eliminate duplicates? (See the sketch below.)
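For exact duplicates (Fuzzy Grouping targets near-duplicates), a common SQL-side sketch uses ROW_NUMBER; the Emp table and its columns are assumptions:
WITH d AS
(
    SELECT EID, NAME, CITY,
           ROW_NUMBER() OVER (PARTITION BY NAME, CITY ORDER BY EID) AS rn
    FROM Emp
)
DELETE FROM d WHERE rn > 1;   -- keeps the row with the smallest EID in each duplicate group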
6) Fuzzy Lookup Transformation:
· Lookup searches for an exact match.
· Fuzzy Lookup searches for a close match (not an exact, 100% match).
· This performs data-cleaning tasks such as standardizing data, correcting data, and providing missing values.
· It uses fuzzy matching to return one or more close matches from the reference table.
· It supports Unicode and non-Unicode strings.
· It is a blocked transformation.
· Ex: Below, see the misspelled data for the city Hyderabad; in such cases Fuzzy Lookup takes the closest match for the data/city.
NOTE:
Use Fuzzy Lookup to correct misspelled data.
Use Fuzzy Grouping to remove duplicate data.
Q. What is the difference between Lookup and Joins?
Ans:
7) Merge:
· This combines TWO sorted INPUTS into a single OUTPUT.
· It requires sorted data for its inputs.
· Sort the data using a key column.
· Sources can be homogeneous or heterogeneous.
Limitations:
- You cannot have more than 2 sources.
- The number of columns should match.
2. Partially Blocked Transformations:
1. Data Mining
2. Merge Join Transformation:
· This is similar to JOINs in an RDBMS (a SQL analogue is sketched below).
· This is used for joining two sorted datasets using a FULL, LEFT, or INNER join.
· Two inputs and one output.
· It does not support an error output.
· It requires sorted data for its inputs.
Limitations:
o Only 2 inputs
o Requires SORTED inputs
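A rough SQL analogue of what Merge Join produces (a sketch; the Emp and Dept tables and the DEPT key are assumptions):
SELECT e.EID, e.NAME, d.DNAME
FROM Emp AS e
INNER JOIN Dept AS d      -- LEFT and FULL are the other join types Merge Join supports
    ON e.DEPT = d.DEPT;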
RDBMS Joins | SSIS Joins
1. Inner Join  2. Outer Join (a. Left  b. Right  c. Full)  3. Cross Join  4. Self Join | 1. Inner Join  2. Outer Join (a. Left  b. Full)
Max 255 tables can be joined. | Max 2 tables can be joined.
Sorting is not needed (Index). | Sorting is required.
Interview Questions:
· How do you eliminate/separate duplicates in source data?
· How do you find duplicates in the given data? (Fuzzy Grouping)
· How do you perform delta loading or incremental loading?
3. Pivot Transformation:
· This converts ROWS into COLUMNS (see the T-SQL sketch below).
· This turns normalized data into de-normalized data.
Unpivot Transformation:
· This converts COLUMNS into ROWS.
· This turns de-normalized data into normalized data.
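A rough T-SQL analogue of the Pivot transformation (a sketch; the Sales table with CITY and AMOUNT columns is an assumption):
SELECT [CHN], [HYD], [BLR]
FROM (SELECT CITY, AMOUNT FROM Sales) AS src
PIVOT (SUM(AMOUNT) FOR CITY IN ([CHN], [HYD], [BLR])) AS p;
-- UNPIVOT reverses this, turning the columns back into rows.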
4. Term Extraction Transformation:
· This extracts terms from text and writes them to the output.
· For example, text from e-mail messages may provide useful feedback about products; this helps analyze the product feedback.
5. Term Lookup Transformation:
· It matches terms extracted from text in an input column with terms in a reference table.
· It then counts the number of times a term in the lookup table occurs in the input data set and writes the counts to the output.
· This transformation is useful for creating a custom word list based on the input text, complete with word-frequency statistics.
6. Union All Transformation:
· This combines multiple inputs into one output (see the SQL sketch below).
· For example, the data from five different flat file sources can be combined into one output.
· It does not require sorted data for its inputs.
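A rough SQL analogue of Union All (a sketch; the two staging table names are assumptions and must have the same column layout):
SELECT EID, NAME, CITY FROM Emp_File1
UNION ALL
SELECT EID, NAME, CITY FROM Emp_File2;   -- no sorting required; duplicates are kept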
Merge vs Union All:
Merge | Union All
Takes only 2 inputs | Takes multiple inputs
Requires sorted data | Sorting not required
Produces sorted output | Does not produce sorted output
Q. Do you know Merge?
Ans: Yes.
In SQL: the MERGE statement – to perform incremental loading (see the sketch below).
In SSIS: the Merge transformation – to combine the data.
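A minimal sketch of the SQL MERGE statement used for incremental (delta) loading; the Emp_Target and Emp_Stage table names are assumptions:
MERGE Emp_Target AS t
USING Emp_Stage AS s
    ON t.EID = s.EID
WHEN MATCHED THEN
    UPDATE SET t.NAME = s.NAME, t.CITY = s.CITY
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EID, NAME, CITY) VALUES (s.EID, s.NAME, s.CITY);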
3. Non-Blocked Transformations (Synchronous):
1) Audit Transformation:
· This gets data about the environment in which the package runs.
· E.g. the name of the package, the computer, and the operator names.
· SSIS includes system variables to get the above info.
· Ex:
2) Character Map Transformation:
· Converts to uppercase or lowercase.
· Works only on string data.
· Can add a new column or replace an existing column.
3) Conditional Split Transformation:
· Splits the input data on a filter condition, as in the sketch below.
· Unmatched rows go to the default output.
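A rough SQL analogue of a conditional split (a sketch; the Emp table and GENDER column are assumptions). In the transformation itself, the filters are written as SSIS expressions such as GENDER == "M".
SELECT * FROM Emp WHERE GENDER = 'M';   -- output 1
SELECT * FROM Emp WHERE GENDER = 'F';   -- output 2
-- rows matching neither condition go to the default output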
4) Copy Column Transformation:
· It is used to create a copy of a given input column.
· New operations can then be performed on the new column.
· Multicast creates multiple copies of the whole data set, whereas Copy Column creates a copy of a given column.
Ex:
5) Data Conversion Transformation:
· This converts data from one type to another.
· Multiple conversions can be applied to a single input column.
Characters (Data):
Non-Unicode | Unicode
English | All languages + English
1 char = 1 byte | 1 char = 2 bytes
SQL Server data types: char, varchar, text | SQL Server data types: nchar, nvarchar, ntext (n stands for national)
SSIS data types: DT_STR | SSIS data types: DT_WSTR
Eg: Notepad, text files | Eg: EXCEL, XML
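A small T-SQL sketch illustrating the storage difference shown above:
SELECT DATALENGTH('A')  AS NonUnicodeBytes,   -- 1 byte  (char/varchar)
       DATALENGTH(N'A') AS UnicodeBytes;      -- 2 bytes (nchar/nvarchar)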
6) Derived Column Transformation:
· The purpose is to evaluate complex expressions.
· This creates a new column by using existing columns.
· It can add a new column or replace an existing column.
· Ex: Take only the 1st letter from the Gender column (see the sketch below):
  o Female -> F
  o Male -> M
· Ex:
  o If Gender == 'M' then Name must be Mr.Name
  o If Gender == 'F' then Name must be Ms.Name
· Trim string data
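A SQL sketch of the same derived-column logic (the Emp table and the NAME/GENDER columns are assumptions); in the Derived Column transformation itself this would be written in the SSIS expression language:
SELECT LEFT(GENDER, 1)       AS GenderCode,     -- Female -> F, Male -> M
       CASE WHEN LEFT(GENDER, 1) = 'M' THEN 'Mr.' + NAME
            WHEN LEFT(GENDER, 1) = 'F' THEN 'Ms.' + NAME
            ELSE NAME END    AS TitledName,
       LTRIM(RTRIM(NAME))    AS TrimmedName
FROM Emp;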
7) Export Column Transformation:
· This reads data in a data flow and inserts the data into files.
· For example, a picture of each product could be exported to a different file.
8) Import Column Transformation:
· This reads data from files and adds the data to columns in a data flow.
· Using this, a package can add text and images stored in separate files to a data flow.
· Note: Export/Import Column are used to move data between databases and file systems, and vice versa.
9) Multicast Transformation:
· It distributes its input to one or more outputs.
· It does not have properties to set.
· 1 input, many outputs.
10) Lookup Transformation:
· Performs a lookup (search) by comparing the input with reference table data.
· The reference table can be a cache file, a SQL table, or a query.
· By default, this treats rows without matching entries as errors.
· It must be configured for error output.
· Lookup data is case sensitive.
· Lookup is similar to an Inner Join, but not exactly.
· Lookup gets exact matches.
Difference between Inner Join and Lookup:
Lookup Cache Options:
A) Full Cache:
· The default mode in Lookup.
· In this mode, the entire reference table is loaded into the cache during the pre-execution phase.
· This approach uses the most memory and adds additional startup time to the data flow.
· Lookup operations will be very fast during execution.
· Lookup will not swap memory, so the SSIS package may fail if we run out of memory on the system.
When to use Full Cache?
· When we access a large portion of the reference table.
  o Needs a lot of memory.
  o Takes extra pre-execution time to load data from the reference table.
· When the database holding the reference table is remote or under heavy load.
Execution Phases of SSIS in Full Cache Mode: (Package.dtsx)
1. Pre-Execution Phase:
- Validations, like whether the table exists or not, etc.
- Load all the data from the reference table to the local cache/RAM/buffer.
- Disadvantages:
  o More time to load the data.
  o You must have more RAM space.
- Advantage:
  o Faster lookups.
2. Progress:
- Transform / Loading
3. Post-Execution:
- Clean up, like temp tables, memory, etc.
B) Partial Cache:
· In this mode, the lookup cache starts empty at the beginning.
· When a new row comes in, the Lookup transform checks its cache for the matching values.
· If no match is found, it queries the reference table. If the match is found in the reference table, the values are cached so they can be used the next time a matching row comes in.
· Advantages:
  o Less memory.
  o Pre-execution time is less.
· Disadvantage:
  o Lookup operations are slow and may hit the database more often.
Execution Phases of SSIS in Partial Cache Mode: (Package.dtsx)
1. Pre-Execution Phase:
- Validations, like whether the table exists or not, etc.
2. Progress:
- Transform / Loading
- RAM is empty at the start.
- Step 1: Read a row from the source.
- Step 2: Check in the cache/RAM.
- Step 3: If it does not exist in RAM, check the reference table.
- Step 4: If it exists in the reference table, the value is loaded into RAM.
3. Post-Execution:
- Clean up, like temp tables, memory, etc.
C) No Cache:
· In this mode, the Lookup transform doesn't maintain a lookup cache.
· It hits the reference table for every row.
· This is used when there is not enough cache/RAM.
11) Percentage Sampling Transformation:
· Used to sample data (a percentage of randomly selected rows) to audit the quality of the data (see the SQL sketch below).
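A rough T-SQL analogue of percentage sampling (a sketch; the Emp table is an assumption):
SELECT * FROM Emp TABLESAMPLE (10 PERCENT);   -- roughly 10% of the rows, chosen at random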
12) Row Count:
· Counts the number of rows returned by the previous transformations/tasks.
13) Row Sampling Transformation:
· Used to sample data (a number of randomly selected rows) to audit the quality of the data.
14) Script Component Transformation:
· This transformation is used to integrate scripting code (C#.NET or VB.NET).
· Scripts are written using VSTA (Visual Studio Tools for Applications).
· When to use Script:
  · To perform complex operations.
Script Task vs Script Component:
Script Task | Script Component
It is in the Control Flow, as a task. | It is in the Data Flow, as a source, transformation, or destination.
To perform complex operations. | To perform complex operations.
Synchronous vs. Asynchronous:
Synchronous (Non-Blocked) | Asynchronous (Blocked)
These process row by row. | These process a block or set of rows.
No buffer, hence it has to interact with the external source. | Has a buffer.
These do not keep data. | These accumulate data.
These are faster. | These are slower.
Eg. Data Conversion, Conditional Split transformations, etc. | Eg. Sort, Merge, Aggregate transformations, etc.
OLEDB Command Transformation: ******
· This is used to execute SQL commands in the Data Flow (see the sketch below).
· This is similar to the Execute SQL task in the Control Flow.
· It performs row-by-row operations, hence it is slow.
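A typical statement configured inside an OLE DB Command (parameters are marked with ?; the Emp table and the column mapping are assumptions for illustration):
UPDATE Emp SET SALARY = ?
WHERE EID = ?        -- executed once per incoming row, which is why it is slow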
Execute SQL Task vs OLEDB Command: *****
Execute SQL Task | OLEDB Command
It is part of the Control Flow. | It is part of the Data Flow.
Performs batch operations. | Performs row-by-row operations.
Faster. | Slower.
*** Errors in SSIS:
· In SSIS, there are 2 types of errors:
  a. Common errors: any kind of error.
  b. Truncation errors: these occur because of data type size problems in the data flow.
· Note: SORT does not support an error (red pipe) output.
Error Output Configurations:
The following are the options to configure error handling in SSIS:
A. Fail Component (default setting in SSIS): Once an error occurs, data flow execution is stopped.
B. Ignore Failure: Data flow execution continues with the next steps, ignoring the error.
C. Redirect Rows (using the red pipe): Redirects the mismatched rows to some other destination, and data flow execution continues.
Unit Testing (things to check):
· Number of rows in the source
· Number of rows in the destination
· Date columns
Rolling Back:
Transactions (ALL/NONE - do all or nothing):
Go to the Control Flow, right-click -> Properties -> TransactionOption:
- Supported – the default
- NotSupported
- Required
*** SSIS does not use transactions by default.
Q. Is it recommended for transactions to be implemented in SSIS? Why?
=> No, because if we have to load millions of rows and a failure occurs in 1 row, all the data will be rolled back.
Transactions can be implemented, but they are not recommended and are not a must.
Questions for Errors:
· How do you handle errors in SSIS?
· If there is an error in the SSIS package, how do you handle it?
· If there is error data in the source, how do you handle the error data in SSIS?
Interview Questions:
1. An OLEDB destination means sending data to any RDBMS, so why do we have a SQL Server destination? In which case do we use the OLEDB destination and in which the SQL Server destination?
Ans: The OLEDB Destination works for local and remote servers; the SQL Server Destination works with local servers only.
2. How do you trim the data using an SSIS package?
Ans:
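One possible answer (a sketch only, not the author's recorded answer): trim in the source query, or use a Derived Column with the SSIS LTRIM()/RTRIM() expression functions. A SQL version, assuming an Emp table with a NAME column:
SELECT LTRIM(RTRIM(NAME)) AS NAME
FROM Emp;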
3. How do you load flat file data to a destination database? Which task do you use: the Bulk Insert task or the Data Flow task?
Ans: The Bulk Insert task is faster, but it cannot clean the data. If you need to clean the data as well, then a Data Flow task (flat file source and OLEDB destination) can be used.
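A minimal BULK INSERT sketch for comparison (the table name and file path are assumptions):
BULK INSERT Emp
FROM 'C:\data\emp.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);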
4. Why do you use raw files?
Ans: Raw files are faster because they don't parse anything. Whenever you insert data into a raw file, it will be faster. They do not need a driver to be installed. Reading and writing are faster.
5