DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE: January 2017

Sunday, January 8, 2017

Performance turning of Lookup Transformations

Lookup transformations are used to lookup a set of values in another table. Lookups slows down the performance.

1. To improve performance, cache the lookup tables. Informatica can cache all the lookup and reference tables; this makes operations run very fast.

2. Even after caching, the performance can be further improved by minimizing the size of the lookup cache. Reduce the number of cached rows by using a sql override with a restriction.

Cache: Cache stores data in memory so that Informatica does not have to read the table each time it is referenced. This reduces the time taken by the process to a large extent. Cache is automatically generated by Informatica depending on the marked lookup ports or by a user defined sql query.

Example for caching by a user defined query: –

Suppose we need to lookup records where employee_id=eno.

‘employee_id’ is from the lookup table, EMPLOYEE_TABLE and ‘eno’ is the

input that comes from the from the source table, SUPPORT_TABLE.

We put the following sql query override in Lookup Transform

‘select employee_id from EMPLOYEE_TABLE’

If there are 50,000 employee_id, then size of the lookup cache will be 50,000.

Instead of the above query, we put the following:-

‘select emp employee_id from EMPLOYEE_TABLE e, SUPPORT_TABLE s

where e. employee_id=s.eno’

If there are 1000 eno, then the size of the lookup cache will be only 1000.But here the performance gain will happen only if the number of records in SUPPORT_TABLE is not huge. Our concern is to make the size of the cache as less as possible.

3. In lookup tables, delete all unused columns and keep only the fields that are used in the mapping.

4. If possible, replace lookups by joiner transformation or single source qualifier.Joiner transformation takes more time than source qualifier transformation.

5. If lookup transformation specifies several conditions, then place conditions that use equality operator ‘=’ first in the conditions that appear in the conditions tab.

6. In the sql override query of the lookup table, there will be an ORDER BY clause. Remove it if not needed or put fewer column names in the ORDER BY list.

7. Do not use caching in the following cases: –

-Source is small and lookup table is large.

-If lookup is done on the primary key of the lookup table.

8. Cache the lookup table columns definitely in the following case: –

-If lookup table is small and source is large.

9. If lookup data is static, use persistent cache. Persistent caches help to save and reuse cache files. If several sessions in the same job use the same lookup table, then using persistent cache will help the sessions to reuse cache files. In case of static lookups, cache files will be built from memory cache instead of from the database, which will improve the performance.

10. If source is huge and lookup table is also huge, then also use persistent cache.

11. If target table is the lookup table, then use dynamic cache. The Informatica server updates the lookup cache as it passes rows to the target.

12. Use only the lookups you want in the mapping. Too many lookups inside a mapping will slow down the session.

13. If lookup table has a lot of data, then it will take too long to cache or fit in memory. So move those fields to source qualifier and then join with the main table.

14. If there are several lookups with the same data set, then share the caches.

15. If we are going to return only 1 row, then use unconnected lookup.

16. All data are read into cache in the order the fields are listed in lookup ports. If we have an index that is even partially in this order, the loading of these lookups can be speeded up.

17. If the table that we use for look up has an index (or if we have privilege to add index to the table in the database, do so), then the performance would increase both for cached and un cached lookups.

Thanks:-

Saturday, January 7, 2017

Informatica Partitions

When to use :-

Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

The PowerCenter Integration Services creates a default partition type at each partition point. If you have the Partitioning option, you can change the partition type. The partition type controls how the PowerCenter Integration Service distributes data among partitions at partition points. When you configure the partitioning information for a pipeline, you must define a partition type at each partition point in the pipeline. The partition type determines how the PowerCenter Integration Service redistributes data across partition points.

Informatica Pipeline Partitioning Explained

Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel. Basically a partition is a pipeline stage that executes in a single reader, transformation, or writer thread.

The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions(number of partitions).

We can set the following attributes to partition a pipeline:-

Partition point:
Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads.

We cannot create partition points at Source instances or at Sequence Generator transformations.
Number of partitions:
A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
Partition types:
The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points.

We can define the following partition types here: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Database partitioning.

The PowerCenter Integration Service queries the IBM DB2 or Oracle system for table partition information. It reads partitioned data from the corresponding nodes in the database. Use database partitioning with Oracle or IBM DB2 source instances on a multi-node table space. Use database partitioning with DB2 targets.

Hash partitioning

. Use hash partitioning when you want the PowerCenter Integration Service to distribute rows to the partitions by group. For example, you need to sort items by item ID, but you do not know how many items have a particular ID number. You can use the following types of hash partitioning:

 Hash auto-keys:- The PowerCenter Integration Service uses all grouped or sorted ports as a compound partition key. You may need to use hash autokeys partitioning at Rank,Sorter, and unsorted Aggregator transformations.
 Hash user keys:- The PowerCenter Integration Service uses a hash function to group rows of data among partitions. You define the number of ports to generate the partition key.

Key range:-
You specify one or more ports to form a compound partition key. The PowerCenter Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.

Pass-through:- The PowerCenter Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

Round-robin:- The PowerCenter Integration Service distributes blocks of data to one or more partitions. Use round-robin partitioning so that each partition processes rows based on the number and size of the blocks.

Limitation:-
You cannot create partition points for the following transformations:
 Source definition
 Sequence Generator
 XML Parser
 XML target
 Unconnected transformations

We cannot create a partition key for hash auto-keys, round-robin, or pass-through types partitioning
If you have bitmap index defined upon the target and you are using pass-through partitioning to, say Update the target table - the session might fail as bitmap index creates serious locking problem in this scenario
Partitioning considerably increases the total DTM buffer memory requirement for the job. Ensure you have enough free memory in order to avoid memory allocation failures
When you do pass-through partitioning, Informatica will try to establish multiple connection requests to the database server. Ensure that database is configured to accept high number of connection requests
As an alternative to partitioning, you may also use native database options to increase degree of parallelism of query processing. For example in Oracle database you can either specify PARALLEL hint or alter the DOP of the table in subject.
If required you can even combine Informatica partitioning with native database level parallel options - e.g. you create 5 pass-through pipelines, each sending query to Oracle database with PARALLEL hint.

Thursday, January 5, 2017

Performance Tuning in Informatica

Improve the session performance for flat file source :-
How can we do Performance Tuning in Informatica

- What is the row size for each source/target?
- have you remembered to tick off the 'sorted input' option in joiner3?
- do you NEED to sort the data before joining them?
- are the master the smaller of the two incoming streams in all cases (as mentioned above)?
- is the join-columns for each joiner set up in the correct order (the one with the highest number of distinct values comes first) ?

1.Please use the smaller of the two tables as the master table,as it reduces the cache size.
2.Increase the Memory for the session
Maximum Memory allowed for Auto Memory Attributes
Maximum Percentage of total memory attributes.
3.Hope you have the succifient size for DTM
4.If you load more number of records inthe target,increase the commit interval.
5.If possible also try to have an partition when you read from the source.

Optimizing the Bottleneck’s:-
1) If the source is a flat file, ensure that the flat file is local to the Informatica server. If source is a relational table, then try not to use synonyms or aliases.
2)If the source is a flat file, reduce the number of bytes (By default it is 1024 bytes per line) the Informatica reads per line. If we do this, we can decrease the Line Sequential Buffer Length setting of the session

properties.
3)If possible, give a conditional query in the source qualifier so that the records are filtered off as soon as possible in the process.
In the source qualifier, if the query has ORDER BY or GROUP BY, then create an index on the source table and order by the index field of the source table.

PERFORMANCE TUNING OF TARGETS :-
------------------------------------------------------
1)If the target is a flat file, ensure that the flat file is local to the Informatica server.
2)If target is a relational table, then try not to use synonyms or aliases.
3)Use bulk load whenever possible.
4)Increase the commit level.
5)Drop constraints and indexes of the table before loading.

PERFORMANCE TUNING OF MAPPINGS:-
-------------------------------------------------------

Mapping helps to channel the flow of data from source to target with all the transformations in between. Mapping is the skeleton of Informatica loading process.

Avoid executing major sql queries from mapplets or mappings.
Use optimized queries when we are using them.
Reduce the number of transformations in the mapping. Active transformations like rank, joiner, filter, aggregator etc should be used as less as possible.
Remove all the unnecessary links between the transformations from mapping.
If a single mapping contains many targets, then dividing them into separate mappings can improve performance.
If we need to use a single source more than once in a mapping, then keep only one source and source qualifier in the mapping. Then create different data flows as required into different targets or same target.
If a session joins many source tables in one source qualifier, then an optimizing query will improve performance.
In the sql query that Informatica generates, ORDERBY will be present. Remove the ORDER BY clause if not needed or at least reduce the number of column names in that list. For better performance it is best to order

by the index field of that table.
Combine the mappings that use same set of source data.
On a mapping, field with the same information should be given the same type and length throughout the mapping. Otherwise time will be spent on field conversions.
Instead of doing complex calculation in query, use an expression transformer and do the calculation in the mapping.
If data is passing through multiple staging areas, removing the staging area will increase performance.
Stored procedures reduce performance. Try to keep the stored procedures simple in the mappings.
Unnecessary data type conversions should be avoided since the data type conversions impact performance.
Transformation errors result in performance degradation. Try running the mapping after removing all transformations. If it is taking significantly less time than with the transformations, then we have to fine-tune the

transformation.
Keep database interactions as less as possible.

PERFORMANCE TUNING OF SESSIONS:-
----------------------------------------------------

1)A session specifies the location from where the data is to be taken, where the transformations are done and where the data is to be loaded. It has various properties that help us to schedule and run the job in the way we want.

2)Partition the session: This creates many connections to the source and target, and loads data in parallel pipelines. Each pipeline will be independent of the other. But the performance of the session will not improve if the number of records is less. Also the performance will not improve if it does updates and deletes. So session partitioning should be used only if the volume of data is huge and the job is mainly insertion of data.

3)Run the sessions in parallel rather than serial to gain time, if they are independent of each other.

4)Drop constraints and indexes before we run session. Rebuild them after the session run completes. Dropping can be done in pre-session script and Rebuilding in post session script. But if data is too much, dropping indexes

5)and then rebuilding them etc. will be not possible. In such cases, stage all data, pre-create the index, use a transportable table space and then load into database.
6)Use bulk loading, external loading etc. Bulk loading can be used only if the table does not have an index.

7)In a session we have options to ‘Treat rows as ‘Data Driven, Insert, Update and Delete’. If update strategies are used, then we have to keep it as ‘Data Driven’. But when the session does only insertion of rows into target

table, it has to be kept as ‘Insert’ to improve performance.

8)Increase the database commit level (The point at which the Informatica server is set to commit data to the target table. For e.g. commit level can be set at every every 50,000 records)

9)By avoiding built in functions as much as possible, we can improve the performance. E.g. For concatenation, the operator ‘||’ is faster than the function CONCAT (). So use operators instead of functions, where possible. The

10)functions like IS_SPACES (), IS_NUMBER (), IFF (), DECODE () etc. reduce the performance to a big extent in this order. Preference should be in the opposite order.

11)String functions like substring, ltrim, and rtrim reduce the performance. In the sources, use delimited strings in case the source flat files or use varchar data type.

12)Manipulating high precision data types will slow down Informatica server. So disable ‘high precision’.

13Localize all source and target tables, stored procedures, views, sequences etc. Try not to connect across synonyms. Synonyms and aliases slow down the performance.

DATABASE OPTIMISATION:-
---------------------------------------

1)To gain the best Informatica performance, the database tables, stored procedures and queries used in Informatica should be tuned well.

2)If the source and target are flat files, then they should be present in the system in which the Informatica server is present.

3)Increase the network packet size.
The performance of the Informatica server is related to network connections.Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Thus network

4)connections often affect on session performance. So avoid network connections.
Optimize target databases.

PERFORMANCE TUNNING IN INFORMATICA

Performance tuning in Informatica-

The goal of performance tuning is optimize session performance so sessions run during the available load window for the Informatica Server. Increase the session performance by following-

1) Performance of the Informatica Server is related to network connections. Data generally moves across a network at less than 1 MB per second, whereas a local disk moves data five to twenty times faster. Thus network connections often affect on session performance. So avoid network connections.

2) Flat files: If your flat files stored on a machine other than the informatica server, move those files to the machine that consists of informatica server.

3) Relational data sources: Minimize the connections to sources, targets and informatica server to improve session performance. Moving target database into server system may improve session performance.

4) Staging areas: If you use staging areas you force informatica server to perform multiple data passes. Removing of staging areas may improve session performance.

5) You can run the multiple informatica servers against the same repository. Distributing the session load to multiple informatica servers may improve session performance.

6) Run the informatica server in ASCII data movement mode improves the session performance. Because ASCII data movement mode stores a character value in one byte. Unicode mode takes 2 bytes to store a character.

7) If a session joins multiple source tables in one Source Qualifier, optimizing the query may improve performance. Also, single table select statements with an ORDER BY or GROUP BY clause may benefit from optimization such as adding indexes.

8) We can improve the session performance by configuring the network packet size, which allows data to cross the network at one time. To do this go to server manger, choose server configure database connections.

9) If your target consist key constraints and indexes you slow the loading of data. To improve the session performance in this case drop constraints and indexes before u run the session and rebuild them after completion of session.

10) Running a parallel session by using concurrent batches will also reduce the time of loading the data. So concurrent batches may also increase the session performance.

11) Partitioning the session improves the session performance by creating multiple connections to sources and targets and loads data in parallel pipe lines.

12) In some cases if a session contains an aggregator transformation, you can use incremental aggregation to improve session performance.

13) Avoid transformation errors to improve the session performance.
If the session contains lookup transformation you can improve the session performance by enabling the look up cache.

14) If your session contains filter transformation, create that filter transformation nearer to the sources or you can use filter condition in source qualifier.

15) Aggregator, Rank and joiner transformation may often decrease the session performance, because they must group data before processing it. To improve session performance in this case use sorted ports option.

Improving Mapping Performance in Informatica-

Mapping optimization:

The best time in the development cycle is after system testing. Focus on mapping-level optimization only after optimizing the target and source databases.
Use Session Log to identify if the source, target or transformations are the performance bottleneck

Identifying Target Bottlenecks:

The most common performance bottleneck occurs when the Informatica Server writes to a target database. You can identify target bottlenecks by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck.

Tasks to be performed to increase performance:
* Drop indexes and key constraints.
* Increase checkpoint intervals.
* Use bulk loading.
* Use external loading.
* Increase database network packet size.
* Optimize target databases.

Identifying Source Bottlenecks:

If the session reads from relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks:

* Filter Transformation - measure the time taken to process a given amount of data, then add an always false filter transformation in the mapping after each source qualifier so that no data is processed past the filter transformation. You have a source bottleneck if the new session runs in about the same time.

* Read Test Session - compare the time taken to process a given set of data using the session with that for a session based on a copy of the mapping with all transformations after the source qualifier removed with the source qualifiers connected to file targets. You have a source bottleneck if the new session runs in about the same time.

* Extract the query from the session log and run it in a query tool. Measure the time taken to return the first row and the time to return all rows. If there is a significant difference in time, you can use an optimizer hint to eliminate the source bottleneck

Tasks to be performed to increase performance:

* Optimize the query.
* Use conditional filters.
* Increase database network packet size.
* Connect to Oracle databases using IPC protocol.

Identifying Mapping Bottlenecks

If you determine that you do not have a source bottleneck.

How to Increase Informatica Server Performance:

Many factors can affect session performance. Here are some points-

Before doing tuning that is specific to Informatica:

1. Check hard disks on related machines. (Slow disk access on source and target databases, source and target file systems, as well as the Informatica Server and repository machines can slow session performance.)

2. Improve network speed. (Slow network connections can slow session performance.)

3. Check CPUs on related machines (make sure the Informatica Server and related machines run on high performance CPUs.)

4. Configure physical memory for the Informatica Server to minimize disk I/O. (Configure the physical memory for the Informatica Server machine to minimize paging to disk.)

5. Optimize database configuration

6. Staging areas. If you use a staging area, you force the Informatica Server to perform multiple passes on your data. Where possible, remove staging areas to improve performance.

7. You can run multiple Informatica Servers on separate systems against the same repository. Distributing the session load to separate Informatica Server systems increases performance.

Informatica specific:
- Transformation tuning
- Using Caches
- Avoiding Lookups by using DECODE for smaller and frequently used tables
- Applying Filter at the earliest point in the data flow etc.

Informatica PowerCenter Partitioning Option

Delivering High Performance for Processing Massive Data Volumes

The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data processing, and it has been instrumental in establishing PowerCenter’s industry performance leadership. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.

Partitioning Option

Key Features

Data Smart Parallelism:

• Automatically aligns PowerCenter partitions with database table partitions to improve performance.

• Automatically guarantees data integrity by leveraging the parallel engine of

PowerCenter, which dynamically realigns data partitions for set-oriented transformations.

Session Design Tools:

• Create user-defined partitioning schemes quickly and easily

• Provide a graphical partitioning map for determining the best partitioning points

• Gather statistics on configurable session options, such as error handling, recovery

strategy, memory allocation, and logging, to maximize performance.

Integrated Monitoring Console:

• Gathers session statistics, such as throughput, rows/second, error details, and

performance optimizations, to identify potential bottlenecks and recognize trends

• Shows all session execution and dependency details.

Multiple Partition Schemes:

• Support parallelization through multiple mechanisms, including key range, hash

algorithm-based, round robin, or file partitions

• Maximize data throughput via concurrent processing of specified partitions along the

data transformation pipeline.

Partitioning Option Benefits:

Scale Cost-Effectively to Handle Large Data Volumes:

With the Partitioning Option, you can execute optimal parallel sessions by dividing data

processing into subsets that are run in parallel and spread among available CPUs

in a multiprocessor system. When different processors share the computational load,

large data volumes can be processed faster. When sourcing and targeting relational

databases, the Partitioning Option enables PowerCenter to automatically align its

partitions with database table partitions to improve performance. Unlike approaches

that require manual data partitioning, data integrity is automatically guaranteed because

the parallel engine of PowerCenter dynamically realigns data partitions for set-oriented

transformations (e.g., aggregators or sorters).

Enhance Developer Productivity:

The Partitioning Option provides intuitive, GUI-based, session design tools that reduce

the time spent on initial and ongoing configuration and performance tuning tasks. You can easily create user-defined partitioning schemes. A graphical partitioning

map helps you determine the best points of partitioning. Configurable session options,

such as error handling, recovery strategy, memory allocation, and logging, make it

easier to gather statistics used to maximize performance.

Optimize System Performance in Response to Changing Business Requirements:

The Partitioning Option lets you easily gather in-depth session statistics such as throughput, rows/second, error details, and performance optimizations. These statistics help you identify potential bottlenecks and recognize trends. An integrated monitoring console lets you view all session execution and dependency details. With the metadata-driven architecture of PowerCenter, data transformation logic is abstracted from the physical execution plan. This feature enables rapid performance tuning without compromising the logic and design of the original data mappings. You can continually and easily optimize system performance in the face of increasing data loads and changing business requirements.

Conclusion

The goal of performance tuning is optimize session performance so sessions run during the available load window for the Informatica Server.

Informatica is a leading provider of enterprise data integration software and services. With Informatica, organizations can gain greater business value by integrating all their information assets from across the enterprise. Thousands of companies worldwide rely on Informatica to reduce the cost and expedite the time to address data integration needs of any complexity and scale.

DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE

Friday, January 27, 2017

OBIEE 12C New Features

OBIEE 12C New Features

Migrate the Informatica code from Unix Server to Windows Server

Migrate the Informatica code from Unix Server to Windows Server

Sunday, January 8, 2017

Performance turning of Lookup Transformations

Performance turning of Lookup Transformations

Saturday, January 7, 2017

Informatica Partitions

Informatica Partitions

When to use :-

Informatica Pipeline Partitioning Explained

Thursday, January 5, 2017

Performance Tuning in Informatica

Conclusion

ADF functions

DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE