Skip to main content

SSIS Introduction

 

SSIS

ETL- Extract, Transform and Load Data

Transformation: Convert data from one format to another in the Data flow.

ETL tools in the Market:

  • ·         SSIS
  • ·         Informatica
  • ·         Ab Initio
  • ·         Data Stage (IBM)
  • ·         Talend
  • ·         Pentaho (Freeware)
  • ·         ODI – Oracle Data Integrator

Prior to SSIS:

-      SQL server 2000 has DTS

-      DTS –Data Transformation Services

-      DTS is a ETL tool

-      Build using COM component

-      DTS package extension is .dts

DTS vs SSIS:

DTS

SSIS

COM based tool

.NET based tool

Memory leakage issue

No memory leakage

No robust error handling

Has robust error handling

No parallel execution allowed

Parallel execution is allowed

No high level language support, only scripting language is supported (VB Script)

Supports high level language (C# or VB .NET) to perform complex operations

 

Core Elements of SSIS Designer:

1.    Control Flow (Controller- main)

2.    Data Flow

3.    Parameters

4.    Event Handlers

5.    Package Explorer

6.    Connection Manager

Control Flow:

·         Controls the execution of a package

·         Data flow is part of control Flow

·         Contains one or more tasks

·         Contains zero or more Data flows

·         Control flow contains Sub Elements

o   Tasks

o   Containers

o   Precedence Constraints

Data Flow:

·         It is a Task

·         Contains Sources, Destinations and Transformations

·         It performs ETL operations

·         It is part of Control flow

·         It is not mandatory

IQ: What is the difference between Control flow and Data flow?

Control flow

Data Flow

Controls the flow of executions in SSIS package

Transfers the data from source to target

It’s Process oriented

It’s Information oriented

Smallest unit is Task

Smallest unit is Component

Made up of Tasks connected with precedence constraints

Made up of Source(s), Transformation(s) and Destination(s)

Tasks in Control flow need to be completed before moving to the next task

None of the component will wait for other component to work

 

Control Flow contains 3 Sub elements:

1.     Tasks

2.     Containers:

a.     For Loop container

b.    Foreach Loop Container

c.     Sequence Container

3.     Precedence Constraints

Parameters:

·         Parameters are used to store the static values for deployment purpose.

·         Parameters are 2 types

o   Project Level: It can be used in all packages in the project.

o   Package Level: It can be used in package only.

·         Cannot change data @ runtime.

Event Handler:

·         Used for executing package related events and debugging purpose.

Package Explorer:

·         Displays all the elements of the package.

o   Connection mgrs.

o   Logs

o   Event Handler

o   Variables etc

Connection Manager:

·         Has connection information for Sources, Destinations and Log Files etc.

Upgrade Packages:

·         Upgrade All packages Wizard (SQL server 2005 and later packages)

·         Project Menu->Upgrade All packages

New tasks in SSIS 2016 onwards:

ð  Hadoop File system Task

ð  Hadoop Hive Task

ð  Hadoop Pig Task

 

·         SSIS now can integrate with Hadoop Source and Destinations

·         SQL server 2016 can be installed in Linux.

·         R Language is integrated in SQL server 2016.

Hadoop: (Structured/Semi Structured/Unstructured)

·         HDFS(Hadoop Distributed File system) (Store)

·         MapReduce (Process) – Java

 

 

·         HIVE (Process) -- sql

·         PIG (Process) -- Script

·         No SQL (Process)

·         HBase….etc

BI: (Structured Data)

·         DWH (Store)

·         SSAS (Process)

Drivers:

1.    ODBC (Open Database Connectivity)

a.     Only for relational databases (SQL server, Oracle etc)

2.    OLEDB (Object Linking and embedding Database Connectivity)

a.     For both relational and non-relational databases (Oracle,SQL server, Excel, raw files ,etc)

 

Control Flow Tasks:

1.    Data flow Task:

·         It is used to perform ETL Operations

·         It has sources, destinations and transformations

 

2.    Execute SQL Task:

·         It is used to execute SQL statements (DDL, DML, TCL, DCL, Call SP, Call Functions)

·         It performs batch operations.

·         It is faster.

 

3.    Analysis Services Processing Task:

·         It is used to process the cube from SSAS.

 

4.    Bulk Insert Task:

·         Used to import text files data into a SQL server.

Advantages:

o   Faster than the Data Flow Task

o   Can get range of Rows

 

Disadvantages:

o   Source must be flat file or text file or CSV file (no other sources)

o   Destination must be a SQL server table (no other destination)

o   Cannot validate, clean or transform data with this task.

o   Cannot get error rows separately.

Important Configuration Properties:

·         Connection:

o   Connection

o   DestinationTable

o   ColumnDelimiter

o   File

·         Options:

o   BatchSize=0 (you can specify the value here, ex: 100, it means batch of 100 rows will be inserted)

o   FirstRow=2 ( if you are inserting value from txt file and the first row will be column name then we can start inserting from 2nd row)

 

5.    Data Profiling Task:

·         It works only with data that is stored in SQL server.

·         It is used to sample the data for quality purpose.

·         Can Profile

o   Column Null Ratios

o   Column Value Distribution

o   Key Column Profile

o   Column Pattern Profile etc.

 

6.    Execute Package Task:

·         It is used to call other packages

·         Can create parent and child packages

·         Packages are executed parallel.

         

·         Execute Package Task Editor:=>ReferenceType: ProjectReference

 

7.    Execute Process Task:

·         Used to run 3rd party executable

Ex: .exe, .bat files

·         Can pass parameters

·         Usages: call any applications such as

o   Zip

o   Unzip

o   Send SMSs etc

·         Execute Process Task Editor for Zip:

o   Executable=. Exe path e.g. C:\Program Files\WinRAR\WinRAR.exe

o   Arguments (i.e. Dest)= a E:\MSBI_VIMP\Archive\SSASFiles

§  a -> Archive means Zip

§  e -> Extract means Unzip

o   Working directory (i.e. source)= E:\MSBI_VIMP\Temp

 

·         Execute Process Task Editor for UnZip:

o   Executable=. Exe path e.g. C:\Program Files\WinRAR\WinRAR.exe

o   Arguments (it become source here)= e E:\MSBI_VIMP\Archive\SSASFiles

§  a -> Archive means Zip

§  e -> Extract means Unzip

o   Working directory (it become Destination here)= E:\MSBI_VIMP\Temp

 

 

8.    Expression Task:

·         It is used to write simple expressions

·         Usage: Find out number of files in a given folder.

 

9.    File System Task:

·         It is used to create, delete, move, copy, and rename files and folders.

 

(File system server means: You install Operating system and leave it, no other program is installed. It is only used to store Files/Folder only.)

·         File System Task Editor=>

o   UseDirectoryIfExists= True (means if it exists don’t create it else create.)

 

10. FTP(File Transfer Protocall) Task:

·         Download and upload data files and manages directories on servers.

Ex:  FTP server: ftp.ed.ac.uk

Path: /pub/mmsurvey/*.*

 

11. Hadoop file system:

·         To move local files to HDFS file system.

 

12. Script Task:

 

·         It is used to perform complex operations with C#.NET or VB.NET languages.

·         Purpose:

o   Access data by using other technologies that are not supported by built-in connection types.

o   Identify whether specified files are empty or how many rows they contain.

o   Identify whether given folder has files or not.

o   Identify whether the file has received today or not.

 

13. Send Mail Task:

·         It is used to send mail using SMTP server.

 

14. Web Service Task:

·         It is used to call web services method.

 

15. XML Task:

·         It is used to work with XML data.

Package Properties:

1)    MaxConcurrentExecutables= -1 -> Maximum how many tasks can run parallelly.

If you have i3 Processor (3 processor in your computer) so at a time you can run 2 tasks run simultaneously.

Q: what is minus 1 in this property?

A: Means no. of processors minus 1 tasks can be running.

 

Popular posts from this blog

How to remove special characters in the Data(using Script component Task and Regex)

Start with the Below screen shot ( Before Script Task) and you will come to know about the senario. Note : The Script Component also works for the special characters within the Data as well . for ex "A$hish" Hover over the below Regex. it will show > Show potential fixes > in that Select System.Text.RegularExpressions. The Below will disappear.  Use the Below Script.  public override void Input0_ProcessInputRow(Input0Buffer Row) { Row.CustomerID = RemoveSpecialCharacters(Row.CustomerID); Row.NameStyle = RemoveSpecialCharacters(Row.NameStyle); Row.Title = RemoveSpecialCharacters(Row.Title); Row.FirstName = RemoveSpecialCharacters(Row.FirstName); Row.LastName = RemoveSpecialCharacters(Row.LastName); Row.CompanyName = RemoveSpecialCharacters(Row.CompanyName); Row.SalesPerson = RemoveSpecialCharacters(Row.SalesPerson); Row.EmailAddress = RemoveSpecialCharacters(Row.EmailAddress); Row.Phone = RemoveSpecialCharacters(Row.Phone); } public static string ...

DBMS

Database : database can be defined as a collection of information stored at a particular place written in a predetermined manner, belongs to a particular topic.  DBMS: (DATA BASE MANAGEMENT SYSTEM):  it is defined as software which is present inside the database which is used o maintain or manage the data within the database .                                 In the above diagram, if the user needs any info from the database. He has to contact with  dbms , then the dbms goes to the database, takes the information from the  database and gives it back to the user. Database models :  these models are broadly divided into the following types: 1.         FMS/FDMS 2.         HMS/HDMS 3.         NDBMS 4.         RDBMS 5.      ...

Different Schemas in DWH

  SCHEMAS:   Star Schema: -       Star Schema means Dimension Tables are directly linked to the Fact Table. Snow Flake Schema: -       Dimension Tables are indirectly linked to the Fact Table OR -       Dimension table is linking to another dimension table .       Star Schema   Snow Flake Schema 1 Has redundant data and hence tough to maintain/change 1 No redundancy and hence more easy to maintain/change 2 Less complex queries and hence easy to understand 2 More complex queries and hence less easy to understand 3 Less foreign keys so faster execution time of cube. 3 More foreign keys so take more execution of cube 4 Has De-Normalized Tables 4 Has Normalized Tabl...