SSIS
ETL- Extract, Transform and Load Data
Transformation: Convert data from one format to
another in the Data flow.
ETL tools in the Market:
- ·
SSIS
- ·
Informatica
- ·
Ab
Initio
- ·
Data
Stage (IBM)
- ·
Talend
- ·
Pentaho
(Freeware)
- ·
ODI
– Oracle Data Integrator
Prior to SSIS:
-
SQL
server 2000 has DTS
-
DTS
–Data Transformation Services
-
DTS
is a ETL tool
-
Build
using COM component
-
DTS
package extension is .dts
DTS vs SSIS:
DTS |
SSIS |
COM based tool |
.NET based tool |
Memory leakage issue |
No memory leakage |
No robust error handling |
Has robust error handling |
No parallel execution allowed |
Parallel execution is allowed |
No high level language support,
only scripting language is supported (VB Script) |
Supports high level language (C# or
VB .NET) to perform complex operations |
Core Elements of SSIS Designer:
1. Control
Flow (Controller- main)
2. Data
Flow
3. Parameters
4. Event
Handlers
5. Package
Explorer
6. Connection
Manager
Control Flow:
·
Controls
the execution of a package
·
Data
flow is part of control Flow
·
Contains
one or more tasks
·
Contains
zero or more Data flows
·
Control
flow contains Sub Elements
o
Tasks
o
Containers
o
Precedence
Constraints
Data Flow:
·
It
is a Task
·
Contains
Sources, Destinations and Transformations
·
It
performs ETL operations
·
It
is part of Control flow
·
It
is not mandatory
IQ: What is
the difference between Control flow and Data flow?
Control flow |
Data Flow |
Controls the flow of executions in SSIS
package |
Transfers
the data from source to target |
It’s Process oriented |
It’s
Information oriented |
Smallest unit is Task |
Smallest
unit is Component |
Made up of Tasks connected with precedence
constraints |
Made
up of Source(s), Transformation(s) and Destination(s) |
Tasks in Control flow need to be completed
before moving to the next task |
None
of the component will wait for other component to work |
Control
Flow contains 3 Sub elements:
1.
Tasks
2.
Containers:
a.
For
Loop container
b.
Foreach
Loop Container
c.
Sequence
Container
3.
Precedence
Constraints
Parameters:
·
Parameters
are used to store the
static values for deployment purpose.
·
Parameters
are 2 types
o
Project Level: It can be used in all packages in
the project.
o
Package Level: It can be used in package only.
·
Cannot
change data @ runtime.
Event Handler:
·
Used
for executing package
related events and debugging purpose.
Package Explorer:
·
Displays
all the elements of the package.
o Connection mgrs.
o Logs
o Event Handler
o Variables etc
Connection Manager:
·
Has
connection information for Sources,
Destinations and Log Files etc.
Upgrade Packages:
·
Upgrade
All packages Wizard (SQL server 2005 and later packages)
·
Project
Menu->Upgrade All packages
New tasks in SSIS 2016 onwards:
ð
Hadoop
File system Task
ð
Hadoop
Hive Task
ð
Hadoop
Pig Task
·
SSIS now can integrate with Hadoop
Source and Destinations
·
SQL server 2016 can be installed in
Linux.
·
R Language is integrated in SQL server
2016.
Hadoop: (Structured/Semi
Structured/Unstructured)
·
HDFS(Hadoop
Distributed File system) (Store)
·
MapReduce
(Process) – Java
·
HIVE
(Process) -- sql
·
PIG
(Process) -- Script
·
No
SQL (Process)
·
HBase….etc
BI: (Structured Data)
·
DWH
(Store)
·
SSAS
(Process)
Drivers:
1. ODBC
(Open Database Connectivity)
a.
Only
for relational databases (SQL server, Oracle etc)
2. OLEDB
(Object Linking and embedding Database Connectivity)
a.
For
both relational and non-relational databases (Oracle,SQL server, Excel, raw
files ,etc)
Control Flow Tasks:
1. Data flow Task:
·
It
is used to perform ETL Operations
·
It
has sources, destinations and transformations
2. Execute SQL Task:
·
It
is used to execute SQL
statements (DDL, DML, TCL, DCL, Call SP, Call Functions)
·
It
performs batch operations.
·
It
is faster.
3. Analysis Services Processing Task:
·
It
is used to process the cube from SSAS.
4. Bulk Insert Task:
·
Used
to import text files data into a SQL server.
Advantages:
o
Faster
than the Data Flow Task
o
Can
get range of Rows
Disadvantages:
o
Source
must be flat file or text
file or CSV file (no other sources)
o
Destination
must be a SQL server table
(no other destination)
o
Cannot
validate, clean or transform data with this task.
o
Cannot
get error rows separately.
Important Configuration Properties:
·
Connection:
o
Connection
o
DestinationTable
o
ColumnDelimiter
o
File
·
Options:
o
BatchSize=0 (you can specify the value here, ex:
100, it means batch of 100 rows will be inserted)
o
FirstRow=2 ( if you are inserting value from
txt file and the first row will be column name then we can start inserting from
2nd row)
5. Data Profiling Task:
·
It
works only with data that
is stored in SQL server.
·
It
is used to sample the data for quality purpose.
·
Can
Profile
o
Column
Null Ratios
o
Column
Value Distribution
o
Key
Column Profile
o
Column
Pattern Profile etc.
6. Execute Package Task:
·
It
is used to call other packages
·
Can
create parent and child packages
·
Packages
are executed parallel.
·
Execute
Package Task Editor:=>ReferenceType: ProjectReference
7. Execute Process Task:
·
Used
to run 3rd party executable
Ex: .exe,
.bat files
·
Can
pass parameters
·
Usages: call any applications such as
o
Zip
o
Unzip
o
Send
SMSs etc
·
Execute
Process Task Editor for Zip:
o
Executable=.
Exe path e.g. C:\Program Files\WinRAR\WinRAR.exe
o
Arguments
(i.e. Dest)= a E:\MSBI_VIMP\Archive\SSASFiles
§ a
-> Archive means Zip
§ e
-> Extract means Unzip
o
Working
directory (i.e. source)= E:\MSBI_VIMP\Temp
·
Execute
Process Task Editor for UnZip:
o
Executable=.
Exe path e.g. C:\Program Files\WinRAR\WinRAR.exe
o
Arguments
(it become source here)= e E:\MSBI_VIMP\Archive\SSASFiles
§
a
-> Archive means Zip
§
e
-> Extract means Unzip
o
Working
directory (it become Destination here)= E:\MSBI_VIMP\Temp
8. Expression Task:
·
It
is used to write simple expressions
·
Usage: Find out number of files in a given
folder.
9. File System Task:
·
It
is used to create, delete,
move, copy, and rename files and folders.
(File system
server means: You install Operating system and leave it, no other
program is installed. It
is only used to store Files/Folder only.)
·
File
System Task Editor=>
o
UseDirectoryIfExists=
True (means if it exists don’t create it else create.)
10. FTP(File Transfer Protocall) Task:
·
Download and upload data files and manages directories on servers.
Ex: FTP
server: ftp.ed.ac.uk
Path: /pub/mmsurvey/*.*
11. Hadoop file system:
·
To
move local files to HDFS file system.
12. Script Task:
·
It
is used to perform complex
operations with C#.NET or VB.NET languages.
·
Purpose:
o
Access
data by using other technologies that are not supported by built-in connection
types.
o
Identify
whether specified files are empty or how many rows they contain.
o
Identify
whether given folder has files or not.
o
Identify
whether the file has received today or not.
13. Send Mail Task:
·
It
is used to send mail using SMTP server.
14. Web Service Task:
·
It
is used to call web services method.
15. XML Task:
·
It
is used to work with XML data.
Package Properties:
1)
MaxConcurrentExecutables= -1 -> Maximum how many tasks can run
parallelly.
If
you have i3 Processor (3 processor in your computer) so at a time you can run 2
tasks run simultaneously.
Q:
what is minus 1 in this property?
A:
Means no. of processors minus 1 tasks can be running.