DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE: March 2017

Monday, March 20, 2017

Informatica PowerCenter 9 Architecture

Before we learn how to use Informatica, we need to understand its important components and how they work together.

The Informatica tool consists of the following services and components:

1. Repository Service – Responsible for maintaining Informatica metadata and providing access to it for other services.

2. Integration Service – Responsible for the movement of data from sources to targets.

3. Reporting Service – Enables the generation of reports.

4. Nodes – The computing platform where the above services are executed.

5. Informatica Designer – Used to create mappings between source and target.

6. Workflow Manager – Used to create workflows and other tasks, and to execute them.

7. Workflow Monitor – Used to monitor the execution of workflows.

8. Repository Manager – Used to manage objects in the repository.

Informatica Repository:
The Informatica repository is at the center of the Informatica suite.

· The Informatica repository is a relational database that stores information, or metadata, used by the Informatica Server and Client tools.

· Metadata is data about data. It includes information such as source definitions, target definitions, mappings describing how to transform source data, sessions indicating when you want the Informatica Server to perform the transformations, and connect strings for sources and targets.

· The repository also stores administrative information such as usernames and passwords, permissions and privileges, and product version.

· Use the Repository Manager to create the repository. The Repository Manager connects to the repository database and runs the code needed to create the repository tables. These tables store metadata in the specific format that the Informatica Server and client tools use.

Informatica Components:

Server Components:

Repository Service
Integration Service

Client Components:

Repository Manager
Designer
Workflow Manager
Workflow Monitor

Server Components
1. Repository Service:

· The Repository Service manages the metadata in the repository database.

· The Repository Service manages connections to the repository from client applications.

· The Repository Service is a separate, multi-threaded process that retrieves, inserts, and updates metadata in the repository database tables. It ensures the consistency of metadata in the repository.

2. Integration Service:

· The Integration Service reads mapping and session information from the repository. It extracts data from the mapping sources and stores the data in memory while it applies the transformation rules that you configure in the mapping. It then loads the transformed data into the mapping targets.

· The Integration Service manages the scheduling and execution of workflows.

· The Integration Service can start and run multiple workflows concurrently. It can also concurrently process partitions within a single session.

Monday, March 13, 2017

Seeding the Cache IN OBIEE11G

Seeding the Cache

Bob runs an OBIEE11G dashboard, and the results are added to the cache so that when Bill runs the same dashboard he gets a great response time, because his dashboard runs straight from cache. Kinda sucks for Bob though, because his query ran slow as it wasn’t in the cache yet. What’d be nice would be for the results to already be in the cache for the first user on a dashboard. There are several options for seeding the cache. These all assume you’ve figured out the queries that you want to run in order to load the results into cache.

· Run the analysis manually, which will return the analysis data to you and insert it into the BI Server Cache too.
· Create an Agent to run the analysis with the destination set to Oracle BI Server Cache (For seeding cache), and then either:
1. Schedule the Agent to run the analysis on a schedule
2. Trigger it from a Web Service in order to couple it to your ETL data load / cache purge batch steps.
· Use the BI Server procedure SASeedQuery (which is what the Agent does in the background) to load the given query into cache without returning the data to the client. This is useful for doing over JDBC/ODBC/Web Service (as discussed for purging above). You could just run the Logical SQL itself, but you probably don’t want to pull the actual data back to the client, hence using the procedure call instead.

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',
DISABLE_CACHE_HIT=1:SELECT
   0 s_0,
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY
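
To pass a Logical SQL statement like this one to SASeedQuery, it has to be embedded as a quoted string argument. Here is a minimal Python sketch of that wrapping step. The helper name seed_query_call is made up for illustration, and the `call SASeedQuery('...')` form with doubled single quotes is an assumption about how your client (ODBC, JDBC or nqcmd) expects the statement, so check it against your own environment before relying on it.

```python
def seed_query_call(logical_sql: str) -> str:
    """Wrap a Logical SQL statement in a SASeedQuery procedure call.

    Assumption: the procedure takes the statement as one quoted string,
    so single quotes inside the statement are doubled.
    """
    # Collapse newlines and runs of whitespace into single spaces.
    # Note: this would also collapse repeated spaces inside quoted names.
    flattened = " ".join(logical_sql.split())
    escaped = flattened.replace("'", "''")
    return f"call SASeedQuery('{escaped}')"


logical_sql = """
SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',
DISABLE_CACHE_HIT=1:SELECT
   0 s_0,
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1
FROM "A - Sample Sales"
"""

print(seed_query_call(logical_sql))
```

The resulting string is what you would send over your chosen connection instead of the raw Logical SQL, so that the result set is loaded into cache rather than returned to the client.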

Checking the RPD for Cacheable Tables

The RPD Query Tool is great for finding objects matching certain criteria. However, it seems to invert results when looking for Cacheable Physical tables - if you add a filter of Cacheable = false you get physical tables where Cacheable is enabled! And the same in reverse (Cacheable = true -> shows Physical tables where Cacheable is disabled)

Cache Location

The BI Server cache is held on disk, so it goes without saying that storing it on fast (eg SSD) disk is a Good Idea. There's no harm in giving it its own filesystem on *nix to isolate it from other work (in terms of filesystems filling up) and to make monitoring it super easy.

Use the DATA_STORAGE_PATHS configuration element in NQSConfig.ini to change the location of the BI Server cache.
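
If the cache does get its own filesystem, keeping an eye on it can be as simple as the sketch below. It is only an illustration: the function name and the idea of passing in the cache directory are assumptions, and the path you pass should be whatever directory your NQSConfig.ini points the cache at.

```python
import shutil


def cache_fs_usage_percent(cache_path: str) -> float:
    """Return the percentage of the filesystem holding cache_path that is in use."""
    usage = shutil.disk_usage(cache_path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


# "." is a stand-in; pass your own BI Server cache directory here.
print(f"{cache_fs_usage_percent('.'):.1f}% used")
```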

Managing Performance Tuning and Query Caching In Obiee11g

OBIEE11G BI Server Cache Management:-

What is Cache?

Analytics systems reduce database calls by creating a ‘Cache’ of data on the same machine as the Analytics engine. Caching has a small cost in terms of disk space to store and a small number of I/O transactions on the server, but this cost is easily outweighed by the improvement in response time.

Types of Cache in OBIEE

There are two different types of cache in Oracle Business Intelligence Enterprise Edition (OBIEE):

The Query Cache (BI Server Cache): The OBIEE Server can save the results of a query in cache files and then reuse those results later when a similar query is requested. This type of cache is referred to as Query cache.
The Presentation Server Cache: When users run analytics, the Presentation server can cache the results.

Why do we need to Purge Cache?

The main reason to implement a cache purge process is that data in the warehouse is updated frequently. If a cache entry exists for a query, the BI Server answers from the cache instead of the database, so the numbers shown may differ from those now in the database. Purging the cache resolves the issue.

Steps to Automatically Purge BI Server Cache

[Assuming an Oracle Business Intelligence Enterprise Edition (OBIEE) server and a Data Warehouse Administration Console (DAC) server are installed on Linux machines.]

Here are the steps to purge the BI Server cache automatically in Oracle Business Intelligence Applications. For our purposes, the OBIEE server and the DAC server are on two different Linux machines.

Step 1

As the OBIEE and DAC servers are two different machines, we need to log in to the OBIEE server from the DAC server. This can be achieved using the ssh command in Linux.

Step 2

As the DAC server requires password-less login, we need to either set up RSA keys or use the ‘sshpass’ command to log in to the OBIEE server. RSA keys are usually the option that the client’s personnel prefer. Setting up RSA keys is mostly the work of a DBA.

Step 3

Once password-less login is working, we need to move to the OBIEE HOME and create two files: purge.txt and purgecache.sh

We must also grant the privileges required to execute them.

The cache can then be purged after every ETL load.

We can check the output of the purging process by navigating to the dac/log folder and checking the post_etl.sh.log file.

The above steps help us purge the BI Server cache automatically after every DAC load.
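
The contents of purge.txt and purgecache.sh are not listed here, so the following is only a sketch of what the two pieces typically amount to, written in Python rather than shell. It assumes purge.txt holds a single SAPurgeAllCache call and that the wrapper invokes nqcmd with a DSN, user, password and script file (-d, -u, -p, -s). The DSN and credentials shown are placeholders, not values from this setup.

```python
import tempfile
from pathlib import Path


def write_purge_script(directory: str) -> Path:
    """Create purge.txt containing one purge-everything call (assumed content)."""
    script = Path(directory) / "purge.txt"
    script.write_text("call SAPurgeAllCache();\n")
    return script


def nqcmd_args(dsn: str, user: str, password: str, script: Path) -> list:
    """Build the nqcmd argument list that a wrapper like purgecache.sh would run."""
    return ["nqcmd", "-d", dsn, "-u", user, "-p", password, "-s", str(script)]


with tempfile.TemporaryDirectory() as workdir:
    purge_file = write_purge_script(workdir)
    # Placeholder DSN and credentials; substitute your own.
    args = nqcmd_args("AnalyticsWeb", "weblogic", "changeme", purge_file)
    print(purge_file.read_text().strip())
    print(" ".join(args))
```

In a real post-ETL step the argument list would be handed to subprocess.run on the OBIEE server; here it is only printed, so nothing is executed.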

Presentation Server Cache Management

The Presentation server cache can be managed in the instanceconfig.xml file. It’s a one-time setup: add the tags shown below just above the closing tag at the end of the file. Don’t bother to automate the purge of the Presentation server cache.

Web Cache
1440 1440 1440

120 180 180 120 600

1440
1440

This one is only for the BI Server to Presentation server expiry: 10

A cache management strategy sounds grand doesn’t it? But it boils down to two things:

Accuracy – Flush any data from the cache that is now stale
Speed – Prime the cache so that as many queries get a hit on it, first time

Maintaining an Accurate Cache

Every query that is run through the BI Server, whether from a Dashboard, Answers, or more funky routes such as custom ODBC clients or JDBC, will end up in cache. It’s possible to “seed” (“prime”/“warmup”) the cache explicitly, and this is discussed later. The only time you won’t see data in the cache is if (a) you have BI Server caching disabled, or (b) you’ve disabled the Cacheable option for a physical table that is involved in providing the data for the query being run.

Purging Options

So we’ve a spread of queries run that hit various dimension and fact tables and created lots of cache entries. Now we’ve loaded data into our underlying database, so we need to make sure that the next time a user runs an OBIEE query that uses the new data they can see it. Otherwise we commit the cardinal sin of any analytical system and show the user incorrect data which is a Bad Thing. It may be fast, but it’s WRONG….

We can purge the whole cache, but that’s a pretty brutal approach. The cache is persisted to disk and can hold lots of data stretching back months - to blitz all of that just because one table has some new data is overkill. A more targeted approach is to purge by physical database, physical table, or even logical query. When would you use these?

Purge entire cache - the nuclear option, but also the simplest. If your data model is small and a large proportion of the underlying physical tables may have changed data, then go for this.
Purge by Physical Database - less brutal than clearing the whole cache. If you have various data sources that are loaded at different points in the batch schedule, then targeting a particular physical database makes sense.
Purge by Physical Table - if many tables within your database have remained unchanged, whilst a large proportion of particular tables have changed (or it’s a small table), then this is a sensible option to run for each affected table.
Purge by Query - If you add a few thousand rows to a billion row fact table, purging all references to that table from the cache would be a waste. Imagine you have a table with sales by day. You load new sales figures daily, so purging the cache by query for recent data is obviously necessary, but data from previous weeks and months may well remain untouched so it makes sense to leave queries against those in the cache. The specifics of this choice are down to you and your ETL process and business rules inherent in the data (maybe there shouldn’t be old data loaded, but what happens if there is? See above re. serving wrong data to users). This option is the most complex to maintain because you risk leaving behind in the cache data that may be stale but doesn’t match the precise set of queries that you purge against.

Which one is correct depends on

your data load and how many tables you’ve changed
your level of reliance on the cache (can you afford low cache hit ratio until it warms up again?)
time to reseed new content

If you are heavily dependent on the cache and have large amounts of data in it, you are probably going to need to invest time in a precise and potentially complex cache purge strategy. Conversely if you use caching as the ‘icing on the cake’ and/or it’s quick to seed new content then the simplest option is to purge the entire cache. Simple is good; OBIEE has enough moving parts without adding to its complexity unnecessarily.

Note that OBIEE itself will perform cache purges in some situations including if a dynamic repository variable used by a Business Model (e.g. in a Logical Column) gets a new value through a scheduled initialisation block.

Performing the Purge

There are several ways in which we can purge the cache. First I’ll discuss the ones that I would not recommend except for manual testing:

Administration Tool -> Manage -> Cache -> Purge. Doing this every time your ETL runs is not a sensible idea unless you enjoy watching paint dry (or need to manually purge it as part of a deployment of a new RPD etc).
In the Physical table, setting Cache persistence time. Why not? Because this time period starts from when the data was loaded into the cache, not when the data was loaded into your database.
An easy mistake to make would be to think that with a daily ETL run, setting the Cache persistence time to 1 day might be a good idea. It’s not, because if your ETL runs at 06:00 and someone runs a report at 05:00, there is going to be a stale cache entry present for another 23 hours. Even if you use cache seeding, you’re still relinquishing control of the data accuracy in your cache. What happens if the ETL batch overruns or underruns?
The only scenario in which I would use this option is if I was querying directly against a transactional system and wanted to minimise the number of hits OBIEE made against it - the trade-off being users would deliberately be seeing stale data (but sometimes this is an acceptable compromise, so long as it’s made clear in the presentation of the data).
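
The 23-hour figure above is just date arithmetic, and it is worth working through once. The sketch below assumes the ETL is finished at 06:00 and that the cache entry expires exactly one persistence period after it was created. The function name and the dates are made up for illustration.

```python
from datetime import datetime, timedelta


def stale_window(cached_at: datetime, etl_done_at: datetime,
                 persistence: timedelta) -> timedelta:
    """How long a cache entry keeps serving pre-ETL data after the ETL finishes."""
    expires_at = cached_at + persistence
    # Entries created after the ETL, or already expired by then, are never stale.
    if cached_at >= etl_done_at or expires_at <= etl_done_at:
        return timedelta(0)
    return expires_at - etl_done_at


report_run = datetime(2017, 3, 13, 5, 0)    # report run at 05:00 fills the cache
etl_finished = datetime(2017, 3, 13, 6, 0)  # ETL completes at 06:00
print(stale_window(report_run, etl_finished, timedelta(days=1)))  # 23:00:00
```

Shift the ETL finish time later (an overrun) and the stale window shrinks. Run the report just before the ETL finishes and it approaches the full persistence period. That is the loss of control described above.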

So the two viable options for cache purging are:

BI Server Cache Purge Procedures

These are often called “ODBC” Procedures but technically ODBC is just one - of several - ways that the commands can be sent to the BI Server to invoke.

As well as supporting queries for data from clients (such as Presentation Services) sent as Logical SQL, the BI Server also has its own set of procedures. Many of these are internal and mostly undocumented (Christian Berg has done a great job of explaining them, and they do creep into the official documentation in places), but there are some cache management ones that are fully supported and documented. They are:

SAPurgeCacheByQuery
SAPurgeCacheByTable
SAPurgeCacheByDatabase
SAPurgeAllCache
SAPurgeCacheBySubjectArea (>= 11.1.1.9)
SAPurgeCacheEntryByIDVector (>= 11.1.1.9)
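
As a rough illustration of how the first four procedures map onto the purge options discussed earlier, here is a small Python sketch that builds the call strings. The argument lists (a database name, the four parts of a fully qualified physical table name, or a Logical SQL string) reflect my understanding of the documented procedures, and the helper names and sample object names are hypothetical, so verify the exact syntax against the documentation for your version.

```python
def _quote(value: str) -> str:
    """Quote a string argument, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def purge_all() -> str:
    return "call SAPurgeAllCache();"


def purge_by_database(database: str) -> str:
    return f"call SAPurgeCacheByDatabase({_quote(database)});"


def purge_by_table(database: str, catalog: str, schema: str, table: str) -> str:
    # Assumed argument order: database, catalog, schema, table.
    args = ", ".join(_quote(part) for part in (database, catalog, schema, table))
    return f"call SAPurgeCacheByTable({args});"


def purge_by_query(logical_sql: str) -> str:
    return f"call SAPurgeCacheByQuery({_quote(logical_sql)});"


# Hypothetical object names, for illustration only.
print(purge_all())
print(purge_by_database("orcl"))
print(purge_by_table("orcl", "", "BISAMPLE", "SAMP_TIME_QTR_D"))
print(purge_by_query('SELECT "Time"."T02 Per Name Month" FROM "A - Sample Sales"'))
```

Each string can be put in a script file or sent over ODBC/JDBC in the same way as any other statement to the BI Server.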

Seeding the Cache

http://dineshng03.blogspot.in/2017/03/seeding-cache-in-obiee11g.html

Monday, March 6, 2017

Hints in oracle and Informatica

Hints for improving the performance of query in Oracle

The performance and the explain plan of a query can be improved by using hints in the query. Here are a few of them:
1. /*+ parallel(table_name,8) */ can be used in a select statement.

Example: select /*+ parallel(emp,8) */ * from emp;
This helps in getting the results quickly. In this example the hint requests a degree of parallelism of 8, so the records are selected from the emp table through 8 parallel pipelines.
This hint can also be used with inserts, but it only helps when no DB links are involved (meaning the data is copied within the same database).

2. /*+ append */ used with an insert statement

Example: insert /*+ append */ into emp select /*+ parallel(xyz,8) */ * from xyz;
This should be used for large loads. It bypasses the buffer cache and does a direct path load.

3. /*+ use_hash(table1 table2...) */ used with a select statement

Example: select /*+ use_hash(table1 table2...) */ * from table1, table2... where table1.xyz = table2.yzx;
This is used to improve the explain plan of the query. The hint replaces the nested loops with a hash join, which helps in improving the performance of the query.

SELECT /*+ PARALLEL(employees 4) PARALLEL(departments 4) USE_HASH(employees) ORDERED */
MAX(salary), AVG(salary)
FROM employees, departments
WHERE employees.department_id = departments.department_id
GROUP BY employees.department_id;

Hints in oracle

· ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing systems.

(/*+ ALL_ROWS */)

· FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)

SELECT /*+ FIRST_ROWS(10) */ * FROM employees;

· CHOOSE
One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

· HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash */
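
To see why a hash join only suits equality conditions, it helps to sketch the idea outside the database. The Python below is a toy in-memory version of the build-and-probe approach described above, not how Oracle implements it. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of dict rows on equality of build_key and probe_key."""
    # Build phase: scan one table fully and index its rows by join key.
    index = defaultdict(list)
    for row in build_rows:
        index[row[build_key]].append(row)

    # Probe phase: look up each row of the other table by exact key match.
    joined = []
    for row in probe_rows:
        for match in index.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined


departments = [
    {"department_id": 10, "department_name": "Sales"},
    {"department_id": 20, "department_name": "IT"},
]
employees = [
    {"employee_id": 1, "department_id": 10, "salary": 5000},
    {"employee_id": 2, "department_id": 20, "salary": 6000},
    {"employee_id": 3, "department_id": 30, "salary": 7000},
]

for row in hash_join(departments, employees, "department_id", "department_id"):
    print(row["employee_id"], row["department_name"], row["salary"])
```

The lookup finds only rows whose key hashes to exactly the same value. A condition such as < or > would need every bucket to be inspected, which throws away the benefit of the hash table.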

Hints are most useful to optimize the query performance.

/*+ hint */

/*+ hint(argument) */

/*+ hint(argument-1 argument-2) */

All hints except /*+ rule */ cause the CBO to be used. Therefore, it is good practice to analyze the underlying tables if hints are used (or the query is fully hinted).

There should be no schema names in hints. Hints must use aliases if alias names are used for table names. So the following is wrong:

select /*+ index(scott.emp ix_emp) */ ... from scott.emp emp_alias

better:

select /*+ index(emp_alias ix_emp) */ ... from scott.emp emp_alias

Why use hints

It is a perfectly valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all.

Sometimes, however, the characteristics of the data in the database change so rapidly that the optimizer (or more accurately, its statistics) is out of date. In this case, a hint could help.

It must also be noted that Oracle allows the statistics to be locked when they look ideal, which should make the hints meaningless again.

Hint categories

Hints can be categorized as follows:

Hints for Optimization Approaches and Goals,
Hints for Access Paths,
Hints for Query Transformations,
Hints for Join Orders,
Hints for Join Operations,
Hints for Parallel Execution,
Additional Hints

Documented Hints

Hints for Optimization Approaches and Goals

ALL_ROWS

One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing systems.

FIRST_ROWS

One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.

CHOOSE

One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose between ALL_ROWS and FIRST_ROWS, based on the statistics gathered.

RULE

The RULE hint should be considered deprecated as it is dropped from Oracle9i2.

See also the following initialization parameters: optimizer_mode, optimizer_max_permutations, optimizer_index_cost_adj and optimizer_index_caching.

Hints for Access Paths

CLUSTER

Performs a nested loop by the cluster index of one of the tables.

FULL

Performs full table scan.

HASH

Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash index to find corresponding records. Therefore not suitable for < or > join conditions.

ROWID

Retrieves the row by rowid

INDEX

Specifying that index index_name should be used on table tab_name:
/*+ index (tab_name index_name) */
If no index name is given, the index that the CBO thinks is most suitable is used. (Not always a good choice).
Starting with Oracle 10g, the index can also be described by its columns:

/*+ index(my_tab my_tab(col_1, col_2)) */

This uses the index on my_tab that starts with the columns col_1 and col_2.

INDEX_ASC

INDEX_COMBINE

INDEX_DESC

INDEX_FFS

INDEX_JOIN

NO_INDEX

AND_EQUAL

The AND_EQUAL hint explicitly chooses an execution plan that uses an access path that merges the scans on several single-column indexes.

Hints for Query Transformations

FACT

The FACT hint is used in the context of the star transformation to indicate to the transformation that the hinted table should be considered as a fact table.

MERGE

NO_EXPAND

NO_EXPAND_GSET_TO_UNION

NO_FACT

NO_MERGE

NOREWRITE

REWRITE

STAR_TRANSFORMATION

USE_CONCAT

Hints for Join Operations

DRIVING_SITE
HASH_AJ
HASH_SJ
LEADING
MERGE_AJ
MERGE_SJ
NL_AJ
NL_SJ
USE_HASH
USE_MERGE
USE_NL

Hints for Parallel Execution

NOPARALLEL
PARALLEL
NOPARALLEL_INDEX
PARALLEL_INDEX
PQ_DISTRIBUTE

Additional Hints

ANTIJOIN
APPEND

If a table or an index is specified with nologging, this hint applied with an insert statement produces a direct path insert which reduces generation of redo.

BITMAP

BUFFER

CACHE

CARDINALITY

CPU_COSTING

DYNAMIC_SAMPLING

INLINE

MATERIALIZE

NO_ACCESS

NO_BUFFER

NO_MONITORING

NO_PUSH_PRED

NO_PUSH_SUBQ

NO_QKN_BUFF

NO_SEMIJOIN

NOAPPEND

NOCACHE

OR_EXPAND

ORDERED

ORDERED_PREDICATES

PUSH_PRED

PUSH_SUBQ

QB_NAME

RESULT_CACHE (Oracle 11g)

SELECTIVITY

SEMIJOIN

SEMIJOIN_DRIVER

STAR

The STAR hint forces a star query plan to be used, if possible. A star plan has the largest table in the query last in the join order and joins it with a nested loops join on a concatenated index. The STAR hint applies when there are at least three tables, the large table's concatenated index has at least three columns, and there are no conflicting access or join method hints. The optimizer also considers different permutations of the small tables.

SWAP_JOIN_INPUTS

USE_ANTI

USE_SEMI

Undocumented hints:

BYPASS_RECURSIVE_CHECK

Workaround for bug 1816154

BYPASS_UJVC

CACHE_CB

CACHE_TEMP_TABLE

CIV_GB

COLLECTIONS_GET_REFS

CUBE_GB

CURSOR_SHARING_EXACT

DEREF_NO_REWRITE

DML_UPDATE

DOMAIN_INDEX_NO_SORT

DOMAIN_INDEX_SORT

DYNAMIC_SAMPLING

DYNAMIC_SAMPLING_EST_CDN

EXPAND_GSET_TO_UNION

FORCE_SAMPLE_BLOCK

GBY_CONC_ROLLUP

GLOBAL_TABLE_HINTS

HWM_BROKERED

IGNORE_ON_CLAUSE

IGNORE_WHERE_CLAUSE

INDEX_RRS

INDEX_SS

INDEX_SS_ASC

INDEX_SS_DESC

LIKE_EXPAND

LOCAL_INDEXES

MV_MERGE

NESTED_TABLE_GET_REFS

NESTED_TABLE_SET_REFS

NESTED_TABLE_SET_SETID

NO_FILTERING

NO_ORDER_ROLLUPS

NO_PRUNE_GSETS

NO_STATS_GSETS

NO_UNNEST

NOCPU_COSTING

OVERFLOW_NOMOVE

PIV_GB

PIV_SSF

PQ_MAP

PQ_NOMAP

REMOTE_MAPPED

RESTORE_AS_INTERVALS

SAVE_AS_INTERVALS

SCN_ASCENDING

SKIP_EXT_OPTIMIZER

SQLLDR

SYS_DL_CURSOR

SYS_PARALLEL_TXN

SYS_RID_ORDER

TIV_GB

TIV_SSF

UNNEST

USE_TTT_FOR_GSETS

Thanks