Monday, March 13, 2017

Managing Performance Tuning and Query Caching in OBIEE 11g

OBIEE 11g BI Server Cache Management

What is Cache?

Analytics systems reduce database calls by creating a ‘Cache’ of data on the same machine as the Analytics engine.  Caching has a small cost in terms of disk space for storage and a small number of I/O transactions on the server, but this cost is easily outweighed by the improvement in response time.

Types of Cache in OBIEE

There are two different types of cache in Oracle Business Intelligence Enterprise Edition (OBIEE):
  1. The Query Cache (BI Server Cache): The OBIEE Server can save the results of a query in cache files and then reuse those results later when a similar query is requested. This type of cache is referred to as Query cache.
  2. The Presentation Server Cache: when users run analyses, the Presentation Server can cache the results.

Why do we need to Purge Cache?

The main reason to implement a cache purge process is stale data: when data in the warehouse is updated frequently and a cache entry already exists for a query that hits the database, the cached numbers may differ from those in the database. Purging the cache resolves this issue.

Steps to Automatically Purge BI Server Cache

[This assumes an Oracle Business Intelligence Enterprise Edition (OBIEE) server and a Data Warehouse Administration Console (DAC) server installed on Linux machines.]
Here are the steps to purge the BI Server cache automatically in Oracle Business Intelligence Applications.  For our purposes, the OBIEE server and the DAC server are on two different Linux machines.
Step 1
Because the OBIEE and DAC servers are two different machines, we need to log in to the OBIEE server from the DAC server.  This can be achieved with the ssh command in Linux.
Step 2
Because the DAC server requires a password-less login, we either need to set up RSA keys or use the sshpass command to log in to the OBIEE server.  RSA keys are usually preferred by client personnel; setting them up is mostly the work of a DBA.
Step 3
Once password-less login is working, move to the OBIEE home and create two files: purge.txt and purgecache.sh.
We must also grant the privileges required to execute them.
After every ETL load the cache purge can then be carried out.
We can check the output of the purging process by navigating to the dac/log folder and looking for the post_etl.sh.log file.
The above steps purge the BI Server cache automatically after every DAC load.
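As an illustrative sketch (the file names, DSN and credentials below are placeholders, and purging the entire cache is just one possible choice), purge.txt holds nothing more than the Logical SQL call that purgecache.sh sends to the BI Server, typically via something like nqcmd -d AnalyticsWeb -u weblogic -p <password> -s purge.txt:

call SAPurgeAllCache();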

Presentation Server Cache Management

The Presentation Services cache is managed in the instanceconfig.xml file and it’s a one-time setup: add the cache elements just above the closing </ServerInstance> tag at the end of the file.  Don’t bother automating the purge of the Presentation Services cache.
Web cache (query result) settings, with the expiry values set to 1440 minutes, i.e. one day (element names per the standard OBIEE 11g instanceconfig.xml query cache settings; the MaxEntries value shown is illustrative):
<Cache>
  <Query>
    <MaxEntries>100</MaxEntries>
    <MaxExpireMinutes>1440</MaxExpireMinutes>
    <MinExpireMinutes>1440</MinExpireMinutes>
    <MinUserExpireMinutes>1440</MinUserExpireMinutes>
  </Query>
</Cache>
The account and catalog cache timeouts were set to 120, 180, 180, 120 and 600 seconds, and a 10-minute expiry applies only to entries that Presentation Services holds from the BI Server.

A cache management strategy sounds grand, doesn’t it? But it boils down to two things:
  1. Accuracy – flush any data from the cache that is now stale
  2. Speed – prime the cache so that as many queries as possible get a hit on it first time

Maintaining an Accurate Cache

Every query that is run through the BI Server, whether from a Dashboard, Answers, or more funky routes such as custom ODBC clients or JDBC, will end up in cache. It’s possible to “seed” (“prime”/“warmup”) the cache explicitly, and this is discussed later. The only time you won’t see data in the cache is if (a) you have BI Server caching disabled, or (b) you’ve disabled the Cacheable option for a physical table that is involved in providing the data for the query being run.




Purging Options

So we’ve a spread of queries run that hit various dimension and fact tables and created lots of cache entries. Now we’ve loaded data into our underlying database, so we need to make sure that the next time a user runs an OBIEE query that uses the new data they can see it. Otherwise we commit the cardinal sin of any analytical system and show the user incorrect data which is a Bad Thing. It may be fast, but it’s WRONG….
We can purge the whole cache, but that’s a pretty brutal approach. The cache is persisted to disk and can hold lots of data stretching back months - to blitz all of that just because one table has some new data is overkill. A more targeted approach is to purge by physical database, physical table, or even logical query. When would you use these?
  • Purge entire cache - the nuclear option, but also the simplest. If your data model is small and a large proportion of the underlying physical tables may have changed data, then go for this
  • Purge by Physical Database - less brutal than clearing the whole cache; if you have various data sources that are loaded at different points in the batch schedule then targeting a particular physical database makes sense.
  • Purge by Physical Table - if many tables within your database have remained unchanged, whilst a large proportion of particular tables have changed (or it’s a small table) then this is a sensible option to run for each affected table
  • Purge by Query - If you add a few thousand rows to a billion row fact table, purging all references to that table from the cache would be a waste. Imagine you have a table with sales by day. You load new sales figures daily, so purging the cache by query for recent data is obviously necessary, but data from previous weeks and months may well remain untouched so it makes sense to leave queries against those in the cache. The specifics of this choice are down to you and your ETL process and business rules inherent in the data (maybe there shouldn’t be old data loaded, but what happens if there is? See above re. serving wrong data to users). This option is the most complex to maintain because you risk leaving behind in the cache data that may be stale but doesn’t match the precise set of queries that you purge against.
Which one is correct depends on
  1. your data load and how many tables you’ve changed
  2. your level of reliance on the cache (can you afford low cache hit ratio until it warms up again?)
  3. time to reseed new content
If you are heavily dependent on the cache and have large amounts of data in it, you are probably going to need to invest time in a precise and potentially complex cache purge strategy. Conversely, if you use caching as the ‘icing on the cake’ and/or it’s quick to seed new content, then the simplest option is to purge the entire cache. Simple is good; OBIEE has enough moving parts without adding to its complexity unnecessarily.
Note that OBIEE itself will perform cache purges in some situations including if a dynamic repository variable used by a Business Model (e.g. in a Logical Column) gets a new value through a scheduled initialisation block.

Performing the Purge

There are several ways in which we can purge the cache. First I’ll discuss the ones that I would not recommend except for manual testing:
  1. Administration Tool -> Manage -> Cache -> Purge. Doing this every time your ETL runs is not a sensible idea unless you enjoy watching paint dry (or need to manually purge it as part of a deployment of a new RPD etc).
  2. In the Physical table, setting Cache persistence time. Why not? Because this time period starts from when the data was loaded into the cache, not when the data was loaded into your database.
    An easy mistake to make would be to think that with a daily ETL run, setting the Cache persistence time to 1 day might be a good idea. It’s not, because if your ETL runs at 06:00 and someone runs a report at 05:00, there is going to be a stale cache entry present for another 23 hours. Even if you use cache seeding, you’re still relinquishing control of the data accuracy in your cache. What happens if the ETL batch overruns or underruns?
    The only scenario in which I would use this option is if I was querying directly against a transactional system and wanted to minimise the number of hits OBIEE made against it - the trade-off being users would deliberately be seeing stale data (but sometimes this is an acceptable compromise, so long as it’s made clear in the presentation of the data).
So the two viable options for cache purging are:
  1. BI Server Cache Purge Procedures
  2. Event Polling Table

BI Server Cache Purge Procedures

These are often called “ODBC” procedures, but technically ODBC is just one of several ways that the commands can be sent to the BI Server.
As well as supporting queries for data from clients (such as Presentation Services) sent as Logical SQL, the BI Server also has its own set of procedures. Many of these are internal and mostly undocumented (Christian Berg does a great job of explaining them here, and they do creep into the documentation here and here), but there are some cache management ones that are fully supported and documented. They are:
  • SAPurgeCacheByQuery
  • SAPurgeCacheByTable
  • SAPurgeCacheByDatabase
  • SAPurgeAllCache
  • SAPurgeCacheBySubjectArea (>= 11.1.1.9)
  • SAPurgeCacheEntryByIDVector (>= 11.1.1.9)
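As a hedged illustration of how these map onto the purge options discussed above, the calls below can be sent to the BI Server as Logical SQL (via nqcmd, for example); the database, catalog, schema, table and query names are placeholders for the physical and logical names in your own repository:

call SAPurgeAllCache();
call SAPurgeCacheByDatabase( 'ORCL_DW' );
call SAPurgeCacheByTable( 'ORCL_DW', 'CAT_NAME', 'DW_SCHEMA', 'W_SALES_F' );
call SAPurgeCacheByQuery( 'SELECT "Time"."Month", "Sales"."Revenue" FROM "Sales Subject Area"' );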

Seeding the Cache




Monday, March 6, 2017

Hints in Oracle and Informatica

Hints for improving query performance in Oracle
The performance and the explain plan of a query can be improved by using hints in the query; here are a few of them:
1. /*+ parallel(table_name,8) */ can be used in a select statement.

Example: select /*+ parallel(emp,8) */ * from emp;
This helps in getting the results quickly; in this example the hint creates 8 parallel pipelines (parallel execution servers) to select the records from the emp table.
This hint can also be used with inserts, but it only helps when no database links are involved (that is, when copying data within the same database).

2. /*+ append */ used with insert statement

Example: insert /*+ append */ into emp select /*+ parallel(emp,8) */  * from xyz;
This should be used for large loads; it bypasses the buffer cache and performs a direct-path load.

3. /*+ use_hash(table1 table2...) */ used with select statement

Example: select /*+ use_hash(table1 table2) */ * from table1, table2 where table1.xyz = table2.yzx;
This is used to improve the explain plan of the query: the hint eliminates the nested loops and uses a hash join instead, which helps improve the performance of the query.

SELECT /*+ PARALLEL(employees 4) PARALLEL(departments 4) USE_HASH(employees)
ORDERED */
       MAX(salary), AVG(salary)
FROM employees, departments
WHERE employees.department_id = departments.department_id
GROUP BY employees.department_id;


Hints in Oracle


  • ALL_ROWS
One of the hints that invokes the cost-based optimizer. ALL_ROWS is usually used for batch processing or data warehousing systems.

(/*+ ALL_ROWS */)

  • FIRST_ROWS
One of the hints that invokes the cost-based optimizer. FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)
SELECT /*+ FIRST_ROWS(10) */ * FROM employees;

  • CHOOSE
One of the hints that invokes the cost-based optimizer. This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

  • HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes the other table and uses the hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash(table_name) */

Hints are most useful for optimizing query performance.


/*+ hint */
/*+ hint(argument) */
/*+ hint(argument-1 argument-2) */
All hints except /*+ rule */ cause the CBO to be used. Therefore, it is good practice to analyze the underlying tables if hints are used (or if the query is fully hinted).
There should be no schema names in hints. Hints must use aliases if alias names are used for table names. So the following is wrong:
select /*+ index(scott.emp ix_emp) */ ... from scott.emp emp_alias
better:
select /*+ index(emp_alias ix_emp) */ ... from scott.emp emp_alias

Why use hints

It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all.
Sometimes, however, the characteristics of the data in the database change rapidly, so that the optimizer (or more accurately, its statistics) becomes out of date. In this case, a hint could help.
It should also be noted that Oracle allows statistics to be locked when they look ideal, which should make hints unnecessary again.

Hint categories

Hints can be categorized as follows:
  • Hints for Optimization Approaches and Goals,
  • Hints for Access Paths,
  • Hints for Query Transformations,
  • Hints for Join Orders,
  • Hints for Join Operations,
  • Hints for Parallel Execution,
  • Additional Hints


Documented Hints

Hints for Optimization Approaches and Goals

  • ALL_ROWS
One of the hints that invokes the cost-based optimizer. ALL_ROWS is usually used for batch processing or data warehousing systems.
  • FIRST_ROWS
One of the hints that invokes the cost-based optimizer. FIRST_ROWS is usually used for OLTP systems.
  • CHOOSE
One of the hints that invokes the cost-based optimizer. This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.
  • RULE
The RULE hint should be considered deprecated, as it is dropped from Oracle 9i Release 2.
See also the following initialization parameters: optimizer_mode, optimizer_max_permutations, optimizer_index_cost_adj and optimizer_index_caching.
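For reference, the same optimizer goals can also be set for a whole session instead of per query; a small sketch (the exact values accepted depend on the Oracle version, and CHOOSE/RULE are no longer valid in recent releases):

alter session set optimizer_mode = ALL_ROWS;
alter session set optimizer_mode = FIRST_ROWS_10;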

    Hints for Access Paths

  • CLUSTER
Performs a nested loop by the cluster index of one of the tables.
  • FULL
Performs a full table scan.
  • HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes the other table and uses the hash index to find corresponding records. Therefore not suitable for < or > join conditions.
  • ROWID
Retrieves the row by rowid.
  • INDEX
Specifies that index index_name should be used on table tab_name:
    /*+ index (tab_name index_name) */
Specifying only the table lets the CBO use the index it thinks is most suitable (not always a good choice).
Starting with Oracle 10g, the index hint can also spell out the columns:
    /*+ index(my_tab my_tab(col_1, col_2)) */ uses the index on my_tab that starts with the columns col_1 and col_2.
  • INDEX_ASC
  • INDEX_COMBINE
  • INDEX_DESC
  • INDEX_FFS
  • INDEX_JOIN
  • NO_INDEX
  • AND_EQUAL
The AND_EQUAL hint explicitly chooses an execution plan that uses an access path that merges the scans on several single-column indexes.
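A brief worked example of the INDEX and FULL hints above; the employees table and the ix_emp_name index are illustrative names, and the aliasing rule from earlier still applies (hint the alias, not the schema-qualified table):

select /*+ index(e ix_emp_name) */ e.employee_id, e.last_name
from employees e
where e.last_name = 'Smith';

select /*+ full(e) */ count(*) from employees e;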

    Hints for Query Transformations

    • FACT
    The FACT hint is used in the context of the star transformation to indicate to the transformation that the hinted table should be considered as a fact table.
  • MERGE
  • NO_EXPAND
  • NO_EXPAND_GSET_TO_UNION
  • NO_FACT
  • NO_MERGE
  • NOREWRITE
  • REWRITE
  • STAR_TRANSFORMATION
  • USE_CONCAT
    Hints for Join Operations

    • DRIVING_SITE
    • HASH_AJ
    • HASH_SJ
    • LEADING
    • MERGE_AJ
    • MERGE_SJ
    • NL_AJ
    • NL_SJ
    • USE_HASH
    • USE_MERGE
    • USE_NL

    Hints for Parallel Execution

    • NOPARALLEL
    • PARALLEL
    • NOPARALLEL_INDEX
    • PARALLEL_INDEX
    • PQ_DISTRIBUTE

    Additional Hints

    • ANTIJOIN
    • APPEND
    If a table or an index is specified with nologging, this hint applied with an insert statement produces a direct path insert which reduces generation of redo.
  • BITMAP
  • BUFFER
  • CACHE
  • CARDINALITY
  • CPU_COSTING
  • DYNAMIC_SAMPLING
  • INLINE
  • MATERIALIZE
  • NO_ACCESS
  • NO_BUFFER
  • NO_MONITORING
  • NO_PUSH_PRED
  • NO_PUSH_SUBQ
  • NO_QKN_BUFF
  • NO_SEMIJOIN
  • NOAPPEND
  • NOCACHE
  • OR_EXPAND
  • ORDERED
  • ORDERED_PREDICATES
  • PUSH_PRED
  • PUSH_SUBQ
  • QB_NAME
  • RESULT_CACHE (Oracle 11g)
  • SELECTIVITY
  • SEMIJOIN
  • SEMIJOIN_DRIVER
  • STAR
  • The STAR hint forces a star query plan to be used, if possible. A star plan has the largest table in the query last in the join order and joins it with a nested loops join on a concatenated index. The STAR hint applies when there are at least three tables, the large table's concatenated index has at least three columns, and there are no conflicting access or join method hints. The optimizer also considers different permutations of the small tables.
  • SWAP_JOIN_INPUTS
  • USE_ANTI
  • USE_SEMI
    Undocumented Hints

    • BYPASS_RECURSIVE_CHECK
    Workaround for bug 1816154
  • BYPASS_UJVC
  • CACHE_CB
  • CACHE_TEMP_TABLE
  • CIV_GB
  • COLLECTIONS_GET_REFS
  • CUBE_GB
  • CURSOR_SHARING_EXACT
  • DEREF_NO_REWRITE
  • DML_UPDATE
  • DOMAIN_INDEX_NO_SORT
  • DOMAIN_INDEX_SORT
  • DYNAMIC_SAMPLING
  • DYNAMIC_SAMPLING_EST_CDN
  • EXPAND_GSET_TO_UNION
  • FORCE_SAMPLE_BLOCK
  • GBY_CONC_ROLLUP
  • GLOBAL_TABLE_HINTS
  • HWM_BROKERED
  • IGNORE_ON_CLAUSE
  • IGNORE_WHERE_CLAUSE
  • INDEX_RRS
  • INDEX_SS
  • INDEX_SS_ASC
  • INDEX_SS_DESC
  • LIKE_EXPAND
  • LOCAL_INDEXES
  • MV_MERGE
  • NESTED_TABLE_GET_REFS
  • NESTED_TABLE_SET_REFS
  • NESTED_TABLE_SET_SETID
  • NO_FILTERING
  • NO_ORDER_ROLLUPS
  • NO_PRUNE_GSETS
  • NO_STATS_GSETS
  • NO_UNNEST
  • NOCPU_COSTING
  • OVERFLOW_NOMOVE
  • PIV_GB
  • PIV_SSF
  • PQ_MAP
  • PQ_NOMAP
  • REMOTE_MAPPED
  • RESTORE_AS_INTERVALS
  • SAVE_AS_INTERVALS
  • SCN_ASCENDING
  • SKIP_EXT_OPTIMIZER
  • SQLLDR
  • SYS_DL_CURSOR
  • SYS_PARALLEL_TXN
  • SYS_RID_ORDER
  • TIV_GB
  • TIV_SSF
  • UNNEST
  • USE_TTT_FOR_GSETS
    Thanks


    Sunday, January 8, 2017

    Performance Tuning of Lookup Transformations


    Lookup transformations are used to look up a set of values in another table. Lookups slow down performance.
    1. To improve performance, cache the lookup tables. Informatica can cache all the lookup and reference tables; this makes operations run very fast. 
    2. Even after caching, the performance can be further improved by minimizing the size of the lookup cache. Reduce the number of cached rows by using a SQL override with a restriction.
    Cache: the cache stores data in memory so that Informatica does not have to read the table each time it is referenced. This reduces the time taken by the process to a large extent. The cache is automatically generated by Informatica based on the marked lookup ports or on a user-defined SQL query.
    Example for caching by a user defined query: –

    Suppose we need to look up records where employee_id = eno.
    ‘employee_id’ is from the lookup table, EMPLOYEE_TABLE, and ‘eno’ is the
    input that comes from the source table, SUPPORT_TABLE.
    We put the following SQL query override in the Lookup transformation:
    select employee_id from EMPLOYEE_TABLE
    If there are 50,000 employee_id values, then the size of the lookup cache will be 50,000 rows.
    Instead of the above query, we put the following:
    select e.employee_id from EMPLOYEE_TABLE e, SUPPORT_TABLE s
    where e.employee_id = s.eno
    If there are 1,000 eno values, then the size of the lookup cache will be only 1,000 rows. But the performance gain will happen only if the number of records in SUPPORT_TABLE is not huge. Our concern is to make the size of the cache as small as possible.
    3. In lookup tables, delete all unused columns and keep only the fields that are used in the mapping.
    4. If possible, replace lookups with a Joiner transformation or a single Source Qualifier. A Joiner transformation takes more time than a Source Qualifier transformation.
    5. If lookup transformation specifies several conditions, then place conditions that use equality operator ‘=’ first in the conditions that appear in the conditions tab.
    6. In the sql override query of the lookup table, there will be an ORDER BY clause. Remove it if not needed or put fewer column names in the ORDER BY list.
    7. Do not use caching in the following cases: –
    -Source is small and lookup table is large.
    -If lookup is done on the primary key of the lookup table.
    8. Cache the lookup table columns definitely in the following case: –
    -If lookup table is small and source is large.
    9. If lookup data is static, use persistent cache. Persistent caches help to save and reuse cache files. If several sessions in the same job use the same lookup table, then using persistent cache will help the sessions to reuse cache files. In case of static lookups, cache files will be built from memory cache instead of from the database, which will improve the performance.
    10. If source is huge and lookup table is also huge, then also use persistent cache.
    11. If target table is the lookup table, then use dynamic cache. The Informatica server updates the lookup cache as it passes rows to the target.
    12. Use only the lookups you want in the mapping. Too many lookups inside a mapping will slow down the session.
    13. If lookup table has a lot of data, then it will take too long to cache or fit in memory. So move those fields to source qualifier and then join with the main table.
    14. If there are several lookups with the same data set, then share the caches.
    15. If we are going to return only 1 row, then use unconnected lookup.
    16. All data is read into the cache in the order the fields are listed in the lookup ports. If we have an index that is even partially in this order, the loading of these lookups can be sped up.

    17. If the table that we use for the lookup has an index (or if we have the privilege to add an index to the table in the database, do so), then the performance will improve for both cached and uncached lookups.
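    As a sketch of point 17, assuming the lookup condition employee_id = eno from the earlier example and that we have the privilege to create indexes in the database, an index on the lookup column could be added (the index name is illustrative):
    create index idx_emp_lookup on EMPLOYEE_TABLE (employee_id);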
    Thanks:-

    Saturday, January 7, 2017

    Informatica Partitions

    When to use:-


    Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

    The PowerCenter Integration Service creates a default partition type at each partition point. If you have the Partitioning option, you can change the partition type. The partition type controls how the PowerCenter Integration Service distributes data among partitions at partition points. When you configure the partitioning information for a pipeline, you must define a partition type at each partition point in the pipeline. The partition type determines how the PowerCenter Integration Service redistributes data across partition points.

    Informatica Pipeline Partitioning Explained

    Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel. Basically a partition is a pipeline stage that executes in a single reader, transformation, or writer thread.
    The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.
    Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions (the number of partitions).
    We can set the following attributes to partition a pipeline:-

    1. Partition point:
      Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads.
      We cannot create partition points at Source instances or at Sequence Generator transformations.
    2. Number of partitions:
      A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
    3. Partition types:
      The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points.
      We can define the following partition types here: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

    Database partitioning.

     The PowerCenter Integration Service queries the IBM DB2 or Oracle system for table partition information. It reads partitioned data from the corresponding nodes in the database. Use database partitioning with Oracle or IBM DB2 source instances on a multi-node table space. Use database partitioning with DB2 targets.


    Hash partitioning

    Use hash partitioning when you want the PowerCenter Integration Service to distribute rows to the partitions by group. For example, you need to sort items by item ID, but you do not know how many items have a particular ID number. You can use the following types of hash partitioning:

    Hash auto-keys:- The PowerCenter Integration Service uses all grouped or sorted ports as a compound partition key. You may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
     Hash user keys:- The PowerCenter Integration Service uses a hash function to group rows of data among partitions. You define the number of ports to generate the partition key.

    Key range:-
    You specify one or more ports to form a compound partition key. The PowerCenter Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.

    Pass-through:-  The PowerCenter Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

    Round-robin:- The PowerCenter Integration Service distributes blocks of data to one or more partitions. Use round-robin partitioning so that each partition processes rows based on the number and size of the blocks.


    Limitation:-
    You cannot create partition points for the following transformations:
     Source definition
     Sequence Generator
     XML Parser
     XML target
     Unconnected transformations

    • We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning types.
    • If you have a bitmap index defined on the target and you are using pass-through partitioning to, say, update the target table, the session might fail because bitmap indexes create serious locking problems in this scenario.
    • Partitioning considerably increases the total DTM buffer memory requirement for the job. Ensure you have enough free memory in order to avoid memory allocation failures.
    • When you use pass-through partitioning, Informatica will try to establish multiple connection requests to the database server. Ensure that the database is configured to accept a high number of connection requests.
    • As an alternative to partitioning, you may also use native database options to increase the degree of parallelism of query processing. For example, in an Oracle database you can either specify a PARALLEL hint or alter the DOP of the table in question.
    • If required you can even combine Informatica partitioning with native database-level parallel options - e.g. you create 5 pass-through pipelines, each sending a query to the Oracle database with a PARALLEL hint (a SQL sketch follows this list).
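    A hedged SQL sketch of that last combination, assuming an Oracle source table named SALES and five pass-through pipelines; the degree of parallelism simply mirrors the number of pipelines:

    -- query sent by each pipeline's source qualifier, carrying a parallel hint
    select /*+ parallel(s, 5) */ * from sales s;

    -- alternatively, set a default degree of parallelism on the table itself
    alter table sales parallel 5;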


