Monday, 23 June 2014

Informatica PowerCenter 9 Installation and Configuration Complete Guide


This article provides complete step-by-step instructions for the installation and configuration of Informatica PowerCenter 9.x. This includes the installation of server components, configuration of the different Informatica services, and client installation and configuration.
        • Download Installation Pack.
        • Unpack the Installation Package.
        • Install Informatica PowerCenter Server.
        • Configure the Domain.
        • Configure Repository Service.
        • Configure Integration Service.
        • Client Installation.
        • Installation Known Issues and Solutions.

Download Installation Pack

Step : 1
The Informatica PowerCenter trial version can be downloaded from https://edelivery.oracle.com
Log on to https://edelivery.oracle.com and accept the Terms and Conditions.

Step : 2
Choose the product package as shown below and click Continue.

Informatica PowerCenter 9 Download

Step : 3
Locate the download package as shown in the image below.

Informatica PowerCenter 9 Download

Step : 4
Download the packages to D:\INFA9X.

Informatica PowerCenter 9 Download

Unpack the Installation Package

Step : 1
Unzip all four downloaded zip files into D:\INFA9X. Hint : use a program such as WinRAR to unzip the files. After unzipping you will see the files and folders below.


Informatica PowerCenter 9 UnZip

Step : 2
Unzip dac_win_101314_infa_win_32bit_910.zip into the same folder, D:\INFA9X. After unzipping you will see the files and folders below.


Informatica PowerCenter 9 UnZip

Install Informatica PowerCenter Server

Step : 1
Navigate to D:\INFA9X\dac_win_101314_infa_win_32bit_910 as shown in the image below and double-click install.exe.

Informatica 9 Installation Steps

Step : 2
The installation wizard starts. Choose the installation type.
Click Next.

Informatica 9 Installation Steps

Step : 3
The installation prerequisites are shown before the installation starts, as below.
Click Next.

Informatica 9 Installation Steps

Step : 4
Enter the license key. You can find the license key at D:\INFA9X\EXTRACT\Oracle_All_OS_Prod.key.
Click Next.

Informatica 9 Installation Steps

Step : 5
The pre-installation summary lists the items that will be installed during the installation process, based on the license key.
Click Next.

Informatica 9 Installation Steps

Step : 6
The installation begins and takes a couple of minutes to finish. Soon after this step completes, the Configure Domain window opens. Continue with the steps in Domain Configuration.

Informatica 9 Installation Steps

Domain Configuration

Step : 1
    • Choose “Create a Domain” radio button.
    • Check “Enable HTTPS for Informatica Administrator”.
    • Leave the port number as it is and choose “Use a keystore file generated by the installer”.
Click Next.

Informatica 9 Installation Steps

Step : 2
Provide the repository database details as below.
    • Database Type : Choose your repository database (Oracle/SQL Server/Sybase).
    • Database User ID : Database user ID used to connect to the database.
    • User Password : Password for the database user.
    • Schema Name : If a schema name is not provided, the default schema is used.
    • Database Address and Port : The machine on which the database is installed, and the port number.
    • Database Service Name : Database name.
The image below shows the configuration using SQL Server.
Click Next.

Informatica 9 Installation Steps

Step : 3
Provide the domain and admin user details now.
    • Domain Name : Name of your domain.
    • Node Host Name : Name of the machine on which the Informatica server is running.
    • Node Name : Name of the node.
    • Node Port Number : Leave the default port number.
    • Domain User Name : This is the administrator user.
    • Domain Password : Administrator password.
Note : Remember your admin user ID and password; you need them to log on to the Admin Console later in the installation.

Informatica 9 Installation Steps

Step : 4
Use the default configuration and Click Next.

Informatica 9 Installation Steps

Step : 5
Installation is complete and you get the post-installation summary, with a link to the installation log file and a link to the Admin Console.
Click Done.

Informatica 9 Installation Steps

Configure Repository Service

Step : 1
Go to the Start menu and click “Informatica Administrator Home Page”. This opens the Admin Console in a web browser.

Informatica 9 Installation Steps

Step : 2
Log on to the Admin Console using your admin user ID and password. You set these in Step 3 of the “Domain Configuration” section.

Informatica 9 Installation Steps

Step : 3
Once you log on, you will see a screen like the one shown below.

Informatica 9 Installation Steps

Step : 4
Choose your domain name in the “Domain Navigator”, click “Actions”, then choose “New” and “PowerCenter Repository Service”.

Informatica 9 Installation Steps

Step : 5
A new screen will appear. Provide the details as shown below.
    • Repository Name : Your Repository Name.
    • Description : An optional description about the repository.
    • Location : Choose the domain you have already created. If you have only one domain, this value will be pre-populated.
    • License : Choose the license key from the drop down list.
    • Node : Choose the node name from the drop down list.
Click Next.

Informatica 9 Installation Steps

Step : 6
A new screen will appear. Provide the repository database details.
    • Database Type : Choose your repository database (Oracle/SQL Server/Sybase).
    • Username : Database user ID used to connect to the database.
    • Password : Database user password.
    • Connection String : Database connection string.
    • Code Page : Database code page.
    • Table Space : Database tablespace name.
    • Choose “No content exists under specified connection string. Create new content”.
Click Finish.

Informatica 9 Installation Steps

Step : 7
It takes a couple of minutes to create the repository content. After the repository is created, you will see the screen below.

Informatica 9 Installation Steps

Step : 8
The repository service will be running in “Exclusive” mode as shown below. This needs to be changed to “Normal” before we can configure the Integration Service.
Click “Edit” under the repository properties.

Informatica 9 Installation Steps

Step : 9
A pop-up window appears. Set the properties:
    • Operation Mode : Normal
    • Security Audit Trail : No
Click OK.

Click OK in the next two pop-up windows, which confirm the repository restart required to change the repository operating mode.

Informatica 9 Installation Steps

Configure Integration Service

Step : 1
Choose your domain name in the “Domain Navigator”, click “Actions”, then choose “New” and “PowerCenter Integration Service”.

Informatica 9 Installation Steps

Step : 2
A new window will appear. Provide the details as shown below.
    • Name : Your Integration Service name.
    • Description : An optional description of the Integration Service.
    • Location : Choose the domain you have already created. If you have only one domain, this value will be pre-populated.
    • License : Choose the license key from the drop-down list.
    • Node : Choose the node name from the drop-down list.
Click Next.

Informatica 9 Installation Steps

Step : 3
A new window will appear. Provide the details as shown below.
    • PowerCenter Repository Service : Choose your Repository Service Name from the drop down list.
    • Username : Admin user name.
    • Password : Admin password.
    • Data Movement Mode : ASCII.
Click Finish.

Informatica 9 Installation Steps

Step : 4
A pop-up window will appear. Choose the code page as ANSI.
Click OK.

Informatica 9 Installation Steps

Step : 5
The window closes and you can see all the configured services in the “Domain Navigator”.

Informatica 9 Installation Steps

With that, we are done with the installation and configuration of the Informatica PowerCenter server.

Client Installation

Step : 1
Go to D:\INFA9X as shown in the image below and double-click install.bat.

Informatica 9 Installation Steps

Step : 2
The installation wizard starts.
Click Start.

Informatica 9 Installation Steps

Step : 3
Choose the installation type as shown in the image below.
Click Next.

Informatica 9 Installation Steps

Step : 4
The installation prerequisites are shown before the installation starts, as below.
Click Next.

Informatica 9 Installation Steps

Step : 5
Choose the client tools you need. Only PowerCenter Client is mandatory.
Click Next.

Informatica 9 Installation Steps

Step : 6
Choose the client installation directory.
Click Next.

Informatica 9 Installation Steps
Step : 7
You can choose the type of Eclipse installation in this step. This window will be available if you choose to install “Informatica Developer” or “Data Transformation Studio”.
Click Next.

Informatica 9 Installation Steps

Step : 8
The pre-installation summary lists the items that will be installed during the installation process.
Click Next.

Informatica 9 Installation Steps
Step : 9
Installation Begins. It takes one or two minutes to complete this step.

Informatica 9 Installation Steps

Step : 10
Installation is complete and you get the post-installation summary.

Informatica 9 Installation Steps

With that, we are done with the installation and configuration of the Informatica PowerCenter client.

Friday, 20 June 2014

Why do we need Staging Area during ETL Load


"We have a simple data warehouse that takes data from a few RDBMS source systems and load the data in dimension and fact tables of the warehouse. I wonder why we have a staging layer in between. Why can’t we process everything on the fly and push them in the data warehouse?"

Truthfully, a staging area is not a necessity if we can handle everything on the fly. But can we? Here are a few reasons why you can’t avoid a staging area:

  1. Source systems are only available for extraction during a specific time slot, which is generally less than your overall data loading time. It’s a good idea to extract and keep the data at your end before you lose the connection to the source systems.
  2. You want to extract data based on some condition that requires you to join two or more different systems together. E.g. you want to extract only those customers who also exist in some other system. You will not be able to run a single SQL query joining two tables from two physically different databases.
  3. Various source systems have different allotted timings for data extraction.
  4. The data warehouse’s loading frequency does not match the refresh frequencies of the source systems.
  5. Extracted data from the same set of source systems is going to be used in multiple places (data warehouse loading, ODS loading, third-party applications, etc.).
  6. The ETL process involves complex data transformations that require extra space to temporarily stage the data.
  7. There are specific data reconciliation / debugging requirements that warrant the use of a staging area for pre-, during- or post-load data validations.

Clearly, a staging area gives a lot of flexibility during data loading. Shouldn’t we always have a separate staging area, then? Is there any impact of having a staging area? Yes, there are a few.
  1. A staging area increases latency – the time required for a change in a source system to take effect in the data warehouse. In many real-time / near-real-time applications, a staging area is avoided for this reason.
  2. Data in the staging area occupies extra space.
To me, in all practical senses, the benefits of having a staging area outweigh its problems. Hence, in general, I suggest designating a specific staging area in data warehousing projects.

How Dynamic Lookup Cache Works


A lookup cache does not change its data once built. But what if the underlying table upon which the lookup was done changes after the lookup cache is created? Is there a way to keep the cache up to date even as the underlying table changes?

Why do we need Dynamic Lookup Cache?

Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping you have a Lookup and in the Lookup, you are actually looking up the same target table you are loading.
You may ask me, "So? What's the big deal? We all do it quite often...". And yes you are right.
There is no "big deal" because Informatica (generally) caches the lookup table at the very beginning of the mapping, so records inserted into the target table through the mapping have no effect on the lookup cache.
The lookup will still hold the previously cached data, even if the underlying target table is changing.
But what if you want your Informatica lookup cache to be updated as and when the data in the underlying target table changes?
What if you want your lookup cache to always show the exact snapshot of the data in your target table at that point in time? Clearly this requirement cannot be fulfilled with a static cache. You will need a dynamic cache to handle this.
But in which scenario will someone need to use a dynamic cache? To understand this, let's first understand a static cache scenario.

Static Lookup Cache Scenario :


Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer Dimension table in your data warehouse. Your source customer table is a transaction system table, probably in 3rd normal form, and does not store history. Meaning, if a customer changes his address, the old address is overwritten with the new address.
But your data warehouse table stores the history (maybe in the form of SCD Type-II). There is a mapping that loads your data warehouse table from the source table. Typically you do a lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with the new record (if it has changed). This scenario - commonly known as the 'UPSERT' (update else insert) scenario - is illustrated below.
Static Cache
A static Lookup Cache to determine if a source record is new or updatable
You don't need dynamic Lookup cache for the above type of scenario.
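The UPSERT decision made with a static cache can be sketched in a few lines of Python. This is purely a hypothetical illustration - the `cust_id` key, the helper names, and the sample rows are invented here, not Informatica internals:

```python
# Hypothetical sketch of the UPSERT decision with a static lookup cache.
# The cache is built once from the target table and never changes during the load.

def build_static_cache(target_rows):
    """Cache the target table once, keyed by the customer's natural key."""
    return {row["cust_id"]: row for row in target_rows}

def upsert_flag(source_row, cache):
    """Return 'UPDATE' if the customer already exists in target, else 'INSERT'."""
    return "UPDATE" if source_row["cust_id"] in cache else "INSERT"

target = [{"cust_id": 1, "name": "John", "city": "Boston"}]
cache = build_static_cache(target)

print(upsert_flag({"cust_id": 1, "name": "John", "city": "Austin"}, cache))   # UPDATE
print(upsert_flag({"cust_id": 2, "name": "Linda", "city": "Dallas"}, cache))  # INSERT
```

Because the cache is never modified during the load, every source row is compared against the same target snapshot - which is exactly why this works for clean sources and breaks for duplicates, as we will see next.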

Dynamic Lookup Cache Scenario :


Notice in the previous example I mentioned that your source table is an RDBMS table. Generally speaking, this ensures that your source table does not have any duplicate records.
But what if you had a flat file as the source, with many duplicate records in the same batch of data that you are trying to load? (Even an RDBMS table may contain duplicate records.)
Would the scenario be the same if the batch of data I am loading contains duplicates?
Unfortunately not. Let's understand why from the illustration below. As you can see, the new customer "Linda" has been entered twice in the source system - most likely by mistake. The customer "Linda" is not present in your target system and hence does not exist in the target-side lookup cache.
When you try to load the target table, Informatica processes row 3 and inserts it into the target, as customer "Linda" does not exist there. Then Informatica processes row 4 and again inserts "Linda" into the target, since the lookup's static cache cannot detect that the customer "Linda" has already been inserted. This results in duplicate rows in the target.
Dynamic Lookup Cache
The problem arising from above scenario can be resolved by using dynamic lookup cache
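The contrast can be sketched in Python. This is a toy model with an invented `load` helper and made-up customer rows - it only mimics the membership check, not Informatica's actual caching:

```python
# Toy model: the same load run with a frozen (static) cache vs. a cache
# that is updated as rows are inserted (dynamic).

def load(source, cache, dynamic):
    """Insert each source row whose key is missing from the cache."""
    inserted = []
    for cust_id, name in source:
        if cust_id not in cache:
            inserted.append((cust_id, name))
            if dynamic:
                cache[cust_id] = name  # dynamic cache sees its own inserts
    return inserted

src = [(103, "Linda"), (103, "Linda")]  # duplicate new customer in the batch

print(len(load(src, {101: "John"}, dynamic=False)))  # 2 -> duplicate rows in target
print(len(load(src, {101: "John"}, dynamic=True)))   # 1 -> duplicate caught
```

With the static cache both "Linda" rows pass the not-in-cache test; with the dynamic cache, the first insert makes the second row detectable as a duplicate.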
Here are some more examples when you may consider using dynamic lookup,
  • Updating a master customer table with both new and updated customer information coming together, as shown above
  • Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so you load the dimension table before the fact table. But using a dynamic lookup, you can load both simultaneously.
  • Loading data from a file with many duplicate records, eliminating duplicates in the target by updating the duplicate row, i.e. keeping the most recent row or the initial row
  • Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example: if you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will arrive twice during the same load.

How does dynamic lookup cache work :


Once you have configured your lookup to use a dynamic cache (we will see below how to do that), when the Integration Service reads a row from the source, it updates the lookup cache by performing one of the following actions:
  • Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row into the cache based on input ports or a generated Sequence-ID. The Integration Service flags the row as insert.

  • Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports. The Integration Service flags the row as update.

  • Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.

Notice that the Integration Service actually flags the rows based on the above three conditions.
And that's a great thing, because if you know the flag, you can reroute the row to achieve different logic.
Fortunately, as soon as you create a dynamic lookup, Informatica adds one extra port to the lookup, called:
  • NewLookupRow

Using the value of this port, the rows can be routed for insert, update, or to do nothing. You just need to use a Router or Filter transformation followed by an Update Strategy.
Oh, I forgot to tell you the actual values that you can expect in the NewLookupRow port:
  • 0 = Integration Service does not update or insert the row in the cache.
  • 1 = Integration Service inserts the row into the cache.
  • 2 = Integration Service updates the row in the cache.

When the Integration Service reads a row, it changes the lookup cache depending on the results of the lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
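The three outcomes and their NewLookupRow values can be sketched as follows. This is a simplified Python model with an invented single-column cache - the real cache compares all associated ports, not one value:

```python
# Simplified model of the NewLookupRow flag assigned per source row:
# 0 = no change, 1 = inserted into cache, 2 = updated in cache.

NO_CHANGE, INSERT, UPDATE = 0, 1, 2

def process_row(cache, key, value):
    """Update the dynamic cache for one row and return the NewLookupRow flag."""
    if key not in cache:
        cache[key] = value
        return INSERT
    if cache[key] != value:
        cache[key] = value
        return UPDATE
    return NO_CHANGE

cache = {}
print(process_row(cache, 1, "Linda"))  # 1 -> row inserted into the cache
print(process_row(cache, 1, "Linda"))  # 0 -> row unchanged, cache untouched
print(process_row(cache, 1, "Lynda"))  # 2 -> row updated in the cache
```

Downstream, a Router can send flag 1 to an INSERT group and flag 2 to an UPDATE group (each followed by an Update Strategy), while flag 0 rows are filtered out.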

Configuring a Dynamic Lookup - Mapping Example :


OK, I have designed a mapping to show the dynamic lookup implementation, and I have given a full screenshot of it. Since the screenshot is slightly big, I link it below. Just click to expand the image.
Screen-shot of the mapping (click on the image to expand)

Dynamic Lookup Mapping
 
Screen-shot of the Lookup - Ports Tab (click on the image to expand)

Dynamic Lookup Ports
 
Screen-shot of the Lookup - Properties Tab (click on the image to expand)

 
Dynamic Lookup Properties
 
If you check the mapping screenshot, you will see I have used a Router to reroute the INSERT group and the UPDATE group. The router screenshot is also given below. New records are routed to the INSERT group and existing records to the UPDATE group.

Screen-shot of the Router (click on the image to expand)

Router Transformation
 

Dynamic Lookup Sequence ID :


While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the lookup condition.
When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence ID for each row it inserts into the lookup cache.
When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache for any port associated with a sequence ID. When inserting a new row into the cache, it generates a key for the port by incrementing the greatest existing sequence ID by one.
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
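The Sequence-ID behaviour described above can be approximated in Python. This is a rough sketch - `next_sequence_id` and the tiny `max_id` are invented for illustration; the real maximum is an internal Informatica limit:

```python
# Rough sketch of Sequence-ID generation for cache inserts:
# new rows get (greatest existing ID + 1); at the maximum the
# generator wraps to 1 and climbs until it finds an unused ID.

def next_sequence_id(existing, max_id):
    """Return the next sequence ID for a cache insert, wrapping at max_id."""
    if not existing:
        return 1
    candidate = max(existing) + 1
    if candidate <= max_id:
        return candidate
    # wrap-around: start over at one and look for a free ID
    for sid in range(1, max_id + 1):
        if sid not in existing:
            return sid
    raise RuntimeError("no unique sequence IDs left - the session fails")

print(next_sequence_id({1, 2, 3}, max_id=10))   # 4
print(next_sequence_id({2, 3, 10}, max_id=10))  # 1 (wrapped around)
```

The final `RuntimeError` mirrors the documented behaviour: once every ID in the range is taken, the session cannot continue.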

Dynamic Lookup Ports :


The lookup/output port output value depends on whether we choose to output old or new values when the Integration Service updates a row:
  • Output old values on update: The Integration Service outputs the value that existed in the cache before it updated the row.

  • Output new values on update: The Integration Service outputs the updated value that it writes in the cache. The lookup/output port value matches the input/output port value.

Note: We can configure to output old or new values using the Output Old Value On Update transformation property.
 

Handling NULL in dynamic LookUp :


If the input value is NULL and we select the Ignore Null inputs for Update property for the associated input port, the input value does not equal the lookup value or the value out of the input/output port. When you select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you pass null values to the target. You must verify that you do not pass null values to the target.
When you update a dynamic lookup cache and target table, the source data might contain some null values. The Integration Service can handle the null values in the following ways:
  • Insert null values: The Integration Service uses null values from the source and updates the lookup cache and target table using all values from the source.

  • Ignore Null inputs for Update property : The Integration Service ignores the null values in the source and updates the lookup cache and target table using only the not null values from the source.

If we know the source data contains null values, and we do not want the Integration Service to update the lookup cache or target with null values, then we need to check the Ignore Null property for the corresponding lookup/output port.
When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:
  • New values. Connect only lookup/output ports from the Lookup transformation to the target.
  • Old values. Add an Expression transformation after the Lookup transformation and before the Filter or Router transformation. Add output ports in the Expression transformation for each port in the target table and create expressions to ensure that we do not output null input values to the target.
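The two NULL-handling choices amount to a simple merge rule, sketched below in Python. The `merge_row` helper and the port names are invented for this illustration:

```python
# Sketch of the two NULL-handling options when a source row updates
# the cache (and target): either take NULLs as-is, or skip them.

def merge_row(cached, incoming, ignore_nulls):
    """Merge an incoming source row (dict of port values) into the cached row."""
    merged = dict(cached)
    for port, value in incoming.items():
        if value is None and ignore_nulls:
            continue  # Ignore Null Inputs for Update: keep the cached value
        merged[port] = value
    return merged

cached   = {"name": "Linda", "city": "Dallas"}
incoming = {"name": "Linda", "city": None}

print(merge_row(cached, incoming, ignore_nulls=False))  # city becomes None
print(merge_row(cached, incoming, ignore_nulls=True))   # city stays 'Dallas'
```

Either way, the same rule must effectively be applied to the target row, or cache and target drift apart - which is the synchronization risk the note above warns about.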

Some other details about Dynamic Lookup :


When we run a session that uses a dynamic lookup cache, the Integration Service compares the values in all lookup ports with the values in their associated input ports by default.
It compares the values to determine whether or not to update the row in the lookup cache. When a value in an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.
But what if we don't want to compare all ports?
We can choose the ports we want the Integration Service to ignore when it compares ports. The Designer only enables this property for lookup/output ports when the port is not used in the lookup condition. We can improve performance by ignoring some ports during comparison. (Learn how to improve performance of lookup transformation here)
We might want to do this when the source data includes a column that indicates whether or not the row contains data we need to update. Select the Ignore in Comparison property for all lookup ports except the port that indicates whether or not to update the row in the cache and target table.
Note: We must configure the Lookup transformation to compare at least one port; the Integration Service fails the session if we ignore all ports.
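The Ignore in Comparison behaviour boils down to comparing only a subset of ports, sketched below in Python with invented port names:

```python
# Sketch: decide whether a cached row changed by comparing all ports
# except those flagged "Ignore in Comparison".

def row_changed(cached, incoming, ignore=()):
    """Return True if any non-ignored port differs between cache and source."""
    return any(cached[p] != incoming[p] for p in cached if p not in ignore)

cached   = {"name": "Linda", "city": "Dallas", "load_ts": "2014-01-01"}
incoming = {"name": "Linda", "city": "Dallas", "load_ts": "2014-06-23"}

print(row_changed(cached, incoming))                       # True  (timestamp differs)
print(row_changed(cached, incoming, ignore=("load_ts",)))  # False (no cache update)
```

Ignoring a volatile column such as a load timestamp avoids needless cache updates, which is where the performance gain comes from.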

What is the difference between STOP and ABORT options in Workflow Monitor


STOP Command:

When we issue the STOP command on an executing session task, the Integration Service stops reading data from the source. It continues processing, writing, and committing the data to the targets. If the Integration Service cannot finish processing and committing data, we can issue the ABORT command.

ABORT Command:

In contrast, the ABORT command has a timeout period of 60 seconds. If the Integration Service cannot finish processing and committing data within the timeout period, it kills the DTM process and terminates the session.
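The contrast can be sketched as a toy model in Python. Everything here - the function, the row list, the return strings - is invented for illustration; only the 60-second timeout comes from the behaviour described above:

```python
# Toy model of STOP vs. ABORT semantics for a running session task.

ABORT_TIMEOUT_SECS = 60  # ABORT's documented grace period

def shutdown(command, buffered_rows, secs_needed_to_commit):
    """Simulate the outcome of a STOP or ABORT on buffered rows."""
    if command == "STOP":
        # stop reading the source, but finish writing/committing buffered rows
        return f"committed {len(buffered_rows)} rows"
    if command == "ABORT":
        if secs_needed_to_commit <= ABORT_TIMEOUT_SECS:
            return f"committed {len(buffered_rows)} rows"
        return "DTM process killed, session terminated"

print(shutdown("STOP", [1, 2, 3], secs_needed_to_commit=300))
print(shutdown("ABORT", [1, 2, 3], secs_needed_to_commit=300))
```

The key point the sketch captures: STOP has no deadline for the commit phase, while ABORT gives the DTM at most 60 seconds before it is killed.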

What are the new features of Informatica 9.x in developer level

  1. A Lookup can now be configured as an active transformation - it can return multiple rows on a successful match
  2. You can now write a SQL override on an un-cached lookup too. Previously you could do it only on a cached lookup
  3. You can control the size of your session log. In a real-time environment you can cap the session log by file size or by time
  4. Database deadlock resilience - this ensures that your session does not immediately fail if it encounters a database deadlock; it will now retry the operation. You can configure the number of retry attempts.
 

How to configure a Lookup as Active

In order to use this option, while creating the transformation we must set the Lookup transformation property "Lookup Policy on Multiple Match" to "Use All Values".
It is important to note that once the lookup is created, we can no longer change the mode between passive and active.
Whenever the Lookup Policy on Multiple Match attribute is set to Use All Values, the property becomes read-only afterwards.

Implementing a Lookup As Active :

Scenario: Suppose we have customer order data in a relational table. Each customer has multiple orders in the table. We can configure the Lookup transformation to return all the orders placed by a customer.
Now check the simple mapping below, where we want to return all employees in each department.
Go to the Transformation menu and click Create. Select the transformation type as Lookup and enter a name for the transformation.
Create LookUp Transformation
 
Next check the option Return All Values on Multiple Match.

Return All Values on Multiple Match
 
Here our source is the DEPT table and the EMP table is used as the lookup. The lookup condition is based on the department number.

Mapping to implement Active Lookup
 
 
Properties Tab of Active Lookup Transformation
 
 
Ports Tab of Lookup Transformation
 
Basically, we try to achieve the same result as the SQL select below:

SELECT DEPT.DEPTNO, DEPT.DNAME, DEPT.LOC, EMP.ENAME, EMP.SAL 
FROM DEPT LEFT OUTER JOIN EMP 
ON DEPT.DEPTNO = EMP.DEPTNO 

Active Lookup Transformation Restrictions:

  1. We cannot return multiple rows from an unconnected Lookup transformation.
  2. We cannot enable a dynamic cache for an Active Lookup transformation.
  3. An Active Lookup transformation that returns multiple rows cannot share a cache with a similar passive Lookup transformation that returns one matching row for each input row.