Saturday, May 27, 2017

Prune days and Change Capture in DAC


Here is a question I often get from customers regarding change capture by DAC when it is used for extracting Siebel data. The problem is equally valid for non-Siebel data sources if the default change capture mechanism provided in DAC is used for identifying changes in source data.
How can we ensure that data committed with an update timestamp between the start of an ETL run and the end of the extract is not excluded by the next ETL?
·    DAC stores a value called LAST_REFRESH_DATE at the end of an ETL extract.
·    During the next run, DAC captures changes to the source data by comparing this LAST_REFRESH_DATE with the LAST_UPD date of records in the source.
·    So if the LAST_REFRESH_DATE is less than the LAST_UPD of a source record, the record has either been updated or created between the last execution and the current run.
·    But what happens if a record in the source is committed between the start of the data extract and its completion, with a commit timestamp that is less than the LAST_REFRESH_DATE?
Prune days to the rescue!!! The solution to the above problem is to set the PRUNE_DAYS parameter.
If the prune days parameter is set to 1, for example, the change capture process will use LAST_REFRESH_DATE – prune days as the cutoff timestamp for change capture.
Here is an example of such a scenario
·    Let us assume that there is a table called S_CUSTOMER in the transaction database, with 2 records created during business hours on the 1st of January.
·    On January 1st, extraction of customer data begins at 1700 hours and completes at 1800 hours, loading into a customer dimension called WC_CUSTOMER.
·    The LAST_REFRESH_DATE stored is January 1, 1800 hours.
·    Now, a new customer is created between 1700 and 1800, but it is committed to the database AFTER the extract is completed, with a timestamp of January 1st, 1730.
·    More customers are added to the transaction database during the next working day. As a result, the following is the state of the source table at the time of the next ETL extract, i.e. Jan 2, 12:00 am.

Now the change capture process in DAC will compare the stored LAST_REFRESH_DATE with the update timestamp of the records in the source data and pull into the target table only the records updated after this LAST_REFRESH_DATE, i.e. after Jan 1, 1800 hours. Unfortunately, for the 3rd record described above the LAST_UPD is Jan 1, 1730, i.e. before the LAST_REFRESH_DATE. So it does not qualify for the incremental extract, and we have a missing record in the target table.

·    But if we had set the PRUNE DAYS parameter to 1, the cutoff date used to identify changes in source data would be Dec 31, 1800 hours, i.e. LAST_REFRESH_DATE – 1 day. With this setting, all the records with LAST_UPD > Dec 31, 1800 hours qualify for the incremental extract, and we have a complete set of data in the target table.
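The scenario above can be simulated with a short Python sketch (purely illustrative; the customer names are invented and the timestamps mirror the example):

```python
from datetime import datetime, timedelta

# Source rows: (customer, LAST_UPD). The third row was committed at 17:30,
# i.e. earlier than the stored LAST_REFRESH_DATE of Jan 1, 1800 hours.
source = [
    ("CUST-1", datetime(2017, 1, 1, 10, 0)),
    ("CUST-2", datetime(2017, 1, 1, 11, 0)),
    ("CUST-3", datetime(2017, 1, 1, 17, 30)),  # committed after the extract ended
    ("CUST-4", datetime(2017, 1, 2, 9, 0)),    # created the next working day
]

last_refresh_date = datetime(2017, 1, 1, 18, 0)

def incremental_extract(rows, last_refresh, prune_days=0):
    """Select rows whose LAST_UPD is after LAST_REFRESH_DATE minus prune days."""
    cutoff = last_refresh - timedelta(days=prune_days)
    return [name for name, last_upd in rows if last_upd > cutoff]

# Without prune days, the 17:30 record is silently missed:
print(incremental_extract(source, last_refresh_date))                # ['CUST-4']
# With PRUNE_DAYS = 1 the cutoff moves back to Dec 31, 1800 hours,
# so all four records qualify:
print(incremental_extract(source, last_refresh_date, prune_days=1))
```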

The Prune days parameter can be used to ensure that DAC Change Capture does not miss out data created around the time of data extraction.
Set it with care to a reasonable value, to avoid increasing the volume of data selected by the incremental load process.
New Feature in DAC 7.9 for non-Siebel Sources
For source systems other than Siebel, the Prune Days setting is used in the same way except that the DAC subtracts the number of prune days from the LAST_REFRESH_DATE of a given source and supplies this as the value for the $$LAST_EXTRACT_DATE parameter.

DAC - Data Warehouse Administration Console



DAC stands for Data Warehouse Administration Console. It is an essential component of the Oracle BI Applications architecture.
DAC serves the following purposes:
1. A metadata-driven administration and deployment tool for ETL and data warehouse objects
2. Used by warehouse developers and ETL administrators
3. Application configuration
  • Manages metadata-driven task dependencies and relationships
  • Allows creating custom ETL execution plans
  • Allows for dry-run development and testing
4. Execution
  • Enables parallel loading for high-performance ETL
  • Facilitates index management and database statistics collection
  • Automates change capture for Siebel OLTP
  • Assists in capturing deleted records
  • Provides fine-grained restart ability
5. Monitoring
  • Enables remote administration and monitoring
  • Provides runtime metadata validation checks
  • Provides in-context documentation
DAC Task
In Oracle BI Application Data Warehouse Administration Console, a task is a unit of work that loads data to one or more tables. A task can extract the data from the source and populate the staging tables or load the data from the staging tables to the target warehouse tables.
Typically a task is mapped to an Informatica workflow. DAC does not include the actual mapping logic; it stores only the essential metadata needed to generate the ETL execution plan. A task is the minimum unit of work the DAC engine can include in an execution plan.
A task in DAC consists of the following information:
  • Source Tables
  • Target Tables
  • Phase
  • Execution Type
  • Truncate Option
  • Command for Incremental Load: The name of the Informatica workflow
  • Command for Full Load: The name of the Informatica workflow
Oracle BI Applications provides the predefined task seed data for the predefined Informatica mappings for different OLTP adapters (Oracle E-Business Suite, PeopleSoft, Siebel, and SAP). The DAC client allows you to define new tasks and configure the execution plan for your custom deployment. When a task is mapped to an Informatica workflow, DAC provides the capability to synchronize metadata between the DAC repository and the Informatica repository. If the workflow name is found in the Informatica repository, the source and target tables of the Informatica workflow are fetched and loaded into the DAC repository for the DAC task.
The source and target table information is essential for DAC to determine the tasks required for populating a target star schema, and to determine the sequence of tasks in the execution plan.
Tip: Always use the “Synchronize Tasks” option when you define a task in DAC.
You can also execute a SQL script to load the data instead of using an Informatica workflow. Informatica gives Oracle BI Applications development the advantage of platform-independent ETL programs, which can extract and load data from and into different database platforms. However, DAC also supports execution types other than Informatica workflows, such as SQL scripts. If you want to improve your data warehouse ETL performance, or you find it easier to write SQL scripts for your custom ETL jobs, you can create SQL scripts and register them in the DAC repository so they can be included in the execution plan and executed automatically by the DAC execution engine.
Tip: Create SQL scripts to improve warehouse ETL performance in your custom warehouse deployment.

 

Why use DAC?

The following are the reasons why we go for DAC:
·         Scheduling tasks: DAC helps in scheduling execution plan runs. An execution plan can be scheduled for different frequencies or recurrences by defining multiple schedules. (An execution plan is defined based on business requirements for when the data warehouse needs to be loaded. Execution plans are scheduled in sequence, not in parallel.)
·         Automated execution: It can be quite difficult and error-prone to run different workflows manually, and DAC helps us automate the execution of workflows according to our needs.
·         Easy flow management: DAC makes workflow execution easy to manage. Take a simple example: I have created 5 different workflows, and the 3rd should only be run after the 4th and 5th. Running them manually always leaves the possibility of error and a wrong data load. DAC lets us define the dependencies and the order in which the workflows should run.
DAC also manages performance by dropping indexes, truncating staging tables, rebuilding the indexes, and analyzing the tables during the process.
If you do not use DAC, you have to write your own custom change capture process and redesign from scratch an ETL method that allows restarting the ETL process from the point of failure at the record level. The biggest saving is that DAC survives upgrades, while your custom processes may not.
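The dependency requirement in the workflow example can be sketched as a topological ordering (a toy Python illustration, not DAC's actual engine; the workflow names are made up):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflows: wf3 must run only after wf4 and wf5 have finished.
dependencies = {
    "wf1": set(),
    "wf2": set(),
    "wf3": {"wf4", "wf5"},   # the constraint from the example above
    "wf4": set(),
    "wf5": set(),
}

# static_order() yields each workflow only after all of its predecessors.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)  # wf3 appears somewhere after both wf4 and wf5
assert run_order.index("wf3") > run_order.index("wf4")
assert run_order.index("wf3") > run_order.index("wf5")
```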

Components of DAC

DAC Client: User interface through which user performs the activities (configuration, administration, loading, monitoring) to manage the analytics schema.
DAC Server: Executes the instructions given from the DAC client for running the ETL routines.
DAC Repository: Stores the metadata used by DAC to manage this whole data warehouse process.

 

Difference between Full load & Incremental load

In most OBIA projects we create both incremental and full load mappings for each table in the DWH target. Below you can find the difference between the two.
 
Full load is the entire data dump, taking place the very first time. Here we leave the last extract date empty so that all the data gets loaded.
Incremental load is where the delta, or difference between target and source data, is loaded at regular intervals. Here we give the last extract date so that only records after this date are loaded.

Note:
·         Full load: $$Last_Extract_Date = either leave it blank or give some old date so that all the records are loaded
·         Incremental load: $$Last_Extract_Date = The date from which you need the data (to get latest record).

Why Incremental?
·         Speed. Opting to do a full load on larger datasets takes a great amount of time and other server resources. Ideally all data loads are performed overnight, with the expectation of completing them before users see the data the next day. The overnight window may not be enough time for a full load to complete.
·         Preserving history. When dealing with an OLTP source that is not designed to keep history, a full load removes history from the destination as well, since a full load removes all the records first. So a full load will not allow you to preserve history in the data warehouse.
Full Load vs. Incremental Load:

Full Load                                  | Incremental Load
-------------------------------------------|-----------------------------------------------------
Truncates all rows and loads from scratch. | Loads only new and updated records.
Requires more time.                        | Requires less time.
Correctness can easily be guaranteed.      | Difficult: the ETL must check for new/updated rows.
History can be lost.                       | History is retained.
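The two strategies can be sketched against an in-memory SQLite table (purely illustrative; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT, last_upd TEXT)")
cur.execute("CREATE TABLE tgt (id INTEGER PRIMARY KEY, name TEXT, last_upd TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?, ?)", [
    (1, "alpha", "2017-01-01"),
    (2, "beta",  "2017-01-02"),
])

def full_load():
    """Truncate the target and reload everything ($$Last_Extract_Date left empty)."""
    cur.execute("DELETE FROM tgt")
    cur.execute("INSERT INTO tgt SELECT * FROM src")

def incremental_load(last_extract_date):
    """Upsert only the rows changed after $$Last_Extract_Date."""
    cur.execute(
        "INSERT OR REPLACE INTO tgt SELECT * FROM src WHERE last_upd > ?",
        (last_extract_date,),
    )

full_load()
# A new row arrives in the source after the full load...
cur.execute("INSERT INTO src VALUES (3, 'gamma', '2017-01-03')")
incremental_load("2017-01-02")   # ...and only that row is moved
print(cur.execute("SELECT COUNT(*) FROM tgt").fetchone()[0])  # 3
```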

DAC Process Cycle

DAC is used to design, execute, monitor and diagnose execution plans
Setup: Database connections, ETL setup in Informatica
Design: Define Tasks, Groups and Execution plans
Execute: Define parameters, schedule and run the execution plans
Monitor: Monitoring the run time executions
Diagnose: In case of task failure, identifying the root cause and rerunning the task.

Execution Plan (DAC, OBIA, OBIEE)
An execution plan includes the ordered tasks to be executed together in order to populate one or more subject areas.
An execution plan is generated by DAC based on the logical dependencies among the tables and tasks required for populating the tables involved in a subject area.
You run the ETL jobs for Oracle BI Apps from DAC by running the execution plan. You can schedule the execution to run periodically in the DAC. You can also monitor execution plan process in the DAC.

An execution plan is the unit of work for you to organize, schedule, execute, and monitor your ETL processes.

How is change data capture reflected from Informatica to Siebel Analytics?

Each source table involved in the ETL process has two image tables, S_ETL_R_IMG_# and S_ETL_I_IMG_#. The # is an image suffix (in DAC) that is specific to the source table and is maintained in the DAC.

Siebel CRM stores, for each record, the number of times the record has been modified and the datetime it was last updated:
MODIFICATION_NUM and LAST_UPD, respectively. For new records MODIFICATION_NUM is 1.

The 'R' image table holds an image of all the records that have been loaded into the data warehouse since the last Full Load. The 'I' image table holds all records that will be loaded into the data warehouse for the current run for the source table the image table belongs to. The 'I' image table holds the operation in the corresponding column for each record. After the ETL process has been completed, the 'I' image table is truncated and the 'R' image is updated. 

During the comparison process at the beginning of the ETL run (Changed Data Capture), each source table is compared with the corresponding image table on the LAST_UPD and MODIFICATION_NUM fields. The 'I' image table is updated with the result of the comparison.

When the Changed Data Capture process has been completed, the SDE process (source dependent extract into staging area) takes off. 
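The comparison logic can be sketched roughly as follows (Python pseudologic of the image-table check; DAC actually performs this in SQL against the image tables, and the operation flag letters used here are assumptions):

```python
# 'R' image: ROW_ID -> (MODIFICATION_NUM, LAST_UPD) as of the last load.
r_image = {
    "1-AAA": (3, "2017-05-01"),
    "1-BBB": (1, "2017-05-02"),
}

# Current source rows (same structure; the values are invented).
source = {
    "1-AAA": (3, "2017-05-01"),   # unchanged -> not re-extracted
    "1-BBB": (2, "2017-05-20"),   # modified  -> update
    "1-CCC": (1, "2017-05-21"),   # not in R  -> insert
}

i_image = {}  # ROW_ID -> operation flag, like S_ETL_I_IMG_#.OPERATION
for row_id, (mod_num, last_upd) in source.items():
    if row_id not in r_image:
        i_image[row_id] = "I"            # new record
    elif r_image[row_id] != (mod_num, last_upd):
        i_image[row_id] = "U"            # changed record
    # identical MODIFICATION_NUM and LAST_UPD: skipped entirely

print(i_image)  # {'1-BBB': 'U', '1-CCC': 'I'}
```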
CHANGE DATA CAPTURE PROCESS SIEBEL ANALYTICS 7.8
Initial Tasks Performed on the Transaction Database (OLTP)
• SIF File and Image Tables
As part of the initial “installation” steps for Analytics 7.8.x, a SIF file is applied to the Siebel Transaction Database (OLTP). This creates many S_ETL_* tables including three types of image tables that are used for the change data capture process: 

 S_ETL_D_IMG_* tables: 
These are “delete” tables that are used to capture data for rows that have been deleted in the OLTP. Rows are inserted into these “D” image tables via database triggers.

 S_ETL_I_IMG_* tables: 
These tables are used for all incremental changes (inserts/updates/deletes). Data is loaded to these tables from their corresponding OLTP base tables and “D” image tables via the DAC at the beginning of a load.

 S_ETL_R_IMG_* tables: 
These are “reference” tables that reflect the data that has been loaded to the SRMW (Siebel Relationship Management Warehouse). For performance reasons, only rows with last_upd within the prune period are retained in this table. (“Prune days” is explained later in this document.) Data is loaded into the “R” tables via the DAC at the end of a load.

 The “D” and “R” image tables have the following structure:
Name Null? Type
------------------------------------ -------- ------------
ROW_ID NOT NULL VARCHAR2 (15 CHAR)
LAST_UPD NOT NULL DATE
MODIFICATION_NUM NOT NULL NUMBER(10)

 The “I” image table has the following structure:
Name Null? Type
------------------------------------ -------- ------------
ROW_ID NOT NULL VARCHAR2(15 CHAR)
LAST_UPD NOT NULL DATE
MODIFICATION_NUM NOT NULL NUMBER(10)
OPERATION NOT NULL VARCHAR2(1 CHAR)

• SIF File and Delete Triggers
o When the SIF file is applied, delete triggers are created in the OLTP. They are applied only to certain tables (e.g. S_CONTACT, S_ORG_EXT, etc.). When a record is deleted from one of these tables, a row is inserted into the corresponding S_ETL_D_IMG_* table.


Full Load
• Initially, a full load is performed to extract all required data and load all tables in the Siebel Relationship Management Warehouse (SRMW).
• The pre-load change data capture steps (DAC task ‘Change Capture For Siebel OLTP’) for a full load are described below (for all examples, please assume that the current run date is 2007-06-17): 

o Image tables (“D”, “I” and “R”) are truncated (e.g. the S_ETL_*_IMG_12 tables are for S_CONTACT)

o New records are inserted into the “R” table, i.e. S_ETL_R_IMG_12:

INSERT /*+APPEND*/ INTO S_ETL_R_IMG_12 (ROW_ID, MODIFICATION_NUM, LAST_UPD) 
SELECT ROW_ID, MODIFICATION_NUM, LAST_UPD FROM S_CONTACT 
WHERE 
LAST_UPD > TO_DATE('2007-05-18 01:00:26', 'YYYY-MM-DD HH24:MI:SS')

/* This is current_run date MINUS Prune Days (for example, 30 days) */
/* “Prune days” will be discussed later in this document */
Oddly, this step runs prior to the extraction or load of any data rather than after it. It is premature to have these rows inserted into the “R” table before the end of the load, but this is the way Siebel engineered it.
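The cutoff literal in the INSERT above is simply the current run date minus the prune period, which is easy to verify:

```python
from datetime import date, timedelta

current_run = date(2007, 6, 17)   # the run date assumed in these examples
prune_days = 30

cutoff = current_run - timedelta(days=prune_days)
print(cutoff)  # 2007-05-18, matching the TO_DATE literal above
```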
o Views are dropped and recreated as: 

CREATE VIEW V_CONTACT AS
SELECT
*
FROM
S_CONTACT

NOTE - During Full load this view is intentionally the same as the base S_% table so that all the rows in the base S_ table are extracted by the ETLs. For incremental load, this view has different SQL behind it (explained later in this document.) 

• At end of load, when the post-load change capture step is executed (DAC task ‘Change Capture Sync For Siebel OLTP’), the views are dropped and recreated using SQL that joins the base table to the “I” image table:

CREATE VIEW V_CONTACT AS
SELECT
*
FROM
S_CONTACT,
S_ETL_I_IMG_12
WHERE
S_CONTACT.ROW_ID = S_ETL_I_IMG_12.ROW_ID

This is done in preparation for future incremental loads. During an incremental load, the image tables are leveraged in order to limit the number of rows extracted.

Incremental Load 
• Once a full load has been run successfully, subsequent loads to the SRMW are “incremental” loads, meaning that only data that has changed in the source since the last run is loaded to the SRMW. During an incremental load, the ETL processes extract this changed data by using the views on the OLTP that join the base S_ tables with their corresponding “I” image tables.
• “Prune Days” refers to “how far back in time” the customer wants to go in order to extract the changed data. The setting ensures that OLTP rows that have a LAST_UPD date older than the start date/time of the prior load (a.k.a. LAST_REFRESH_DATE) are not missed. It is determined by the customer and set up in the DAC client. For the examples illustrated in this document, please assume the following:
 prune_days = 30
 last_refresh_date (prior load) = 2007-06-10
 current_load = 2007-06-17
• At the beginning of an incremental load the DAC executes a group of pre-load change data capture steps (DAC task ‘Change Capture for Siebel OLTP’). The steps are described below:

o The “I” image tables are truncated (e.g. S_ETL_I_IMG_12 is the one for S_CONTACT).
TRUNCATE TABLE S_ETL_I_IMG_12

o New rows are inserted into the “I” image table for rows that have a LAST_UPD more recent than LAST_REFRESH_DATE MINUS prune_days. But before these rows are inserted, they are compared to the data in the “R” image table, and if the MODIFICATION_NUM and LAST_UPD values are the same for a particular ROW_ID, the row is excluded from the insert. This prevents needlessly processing rows that haven’t changed since the last time they were loaded into the SRMW.

INSERT /*

Wednesday, May 24, 2017

How to select the max salary per department, including the employees that earn it


The subquery join yields the highest salary in each department; joining it back to Employee returns the employees in each department who earn that highest salary. This query also returns matches where there is a tie for highest-paid employee within a department.

SELECT emp.empid, emp.deptid, emp.empName, emp.salary, dpt.deptName
FROM Employee AS emp
INNER JOIN Department AS dpt ON emp.deptid = dpt.deptid
INNER JOIN
      (SELECT     deptid, MAX(salary) AS maxsalary
        FROM          Employee
        GROUP BY deptid) AS empMax ON emp.deptid = empMax.deptid
AND emp.salary = empMax.maxsalary;
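The query can be exercised against a small SQLite sample (invented data; note the tie in department 20 returns two rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Department (deptid INTEGER, deptName TEXT)")
cur.execute("CREATE TABLE Employee (empid INTEGER, deptid INTEGER, empName TEXT, salary INTEGER)")
cur.executemany("INSERT INTO Department VALUES (?, ?)", [(10, "Sales"), (20, "IT")])
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?)", [
    (1, 10, "Ann", 100),
    (2, 10, "Bob", 200),
    (3, 20, "Cat", 300),
    (4, 20, "Dan", 300),   # tie for the top salary in dept 20
])

# Same shape as the query above: join each employee to his department's max.
rows = cur.execute("""
    SELECT emp.empName, emp.deptid, emp.salary
    FROM Employee AS emp
    INNER JOIN (SELECT deptid, MAX(salary) AS maxsalary
                FROM Employee GROUP BY deptid) AS empMax
      ON emp.deptid = empMax.deptid AND emp.salary = empMax.maxsalary
    ORDER BY emp.deptid, emp.empName
""").fetchall()
print(rows)  # [('Bob', 10, 200), ('Cat', 20, 300), ('Dan', 20, 300)]
```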

row_number() function:-

If you want to find the maximum salary along with the employees, then use the ANSI standard row_number() function:

select d.deptno, e.ename, d.dname, e.sal, d.loc
from (select e.*, row_number() over (partition by deptno order by sal desc) as seqnum
      from employees e
     ) e join
     departments d
     on e.deptno = d.deptno
where seqnum = 1


dense_rank() 

select d.deptno, e.ename, d.dname, e.sal, d.loc
from 
(
  select e.ename,
    e.sal, 
    e.deptno,
    dense_rank() over (partition by deptno order by sal desc) as salRank
  from employees e
) e 
inner join departments d
  on e.deptno = d.deptno
where salRank = 1;
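The practical difference between the two ranking functions shows up on ties: row_number() keeps exactly one row per department even when salaries tie, while dense_rank() keeps every tied top earner. A small SQLite sketch (invented data; requires SQLite 3.25+ for window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (ename TEXT, deptno INTEGER, sal INTEGER)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Ann", 10, 100), ("Bob", 10, 100),   # tie for top salary in dept 10
    ("Cat", 20, 300),
])

def top_earners(rank_fn):
    """Return the names ranked first per department by the given window function."""
    return cur.execute(f"""
        SELECT ename FROM (
            SELECT ename,
                   {rank_fn}() OVER (PARTITION BY deptno ORDER BY sal DESC) AS rnk
            FROM employees
        ) WHERE rnk = 1 ORDER BY ename
    """).fetchall()

# row_number() breaks the dept-10 tie arbitrarily: 2 rows, one per department.
print(top_earners("row_number"))
# dense_rank() keeps both tied employees in dept 10: 3 rows in total.
print(top_earners("dense_rank"))
```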



Normal (max salary per department only, without employee names)


select 
   d.DeptID,
   max(e.Salary)
from
   department d
   inner join employees e on d.DeptID = e.DeptID
group by
  d.DeptID;


Another variant, joining back to an aggregated subquery:
select de.deptno,
  e.ename,
  de.dname,
  e.sal,
  de.loc
from employees e
inner join
(
  select max(e.sal) MaxSalary, d.deptno, d.loc, d.dname
  from employees e
  inner join departments d
     on e.deptno = d.deptno
  group by d.deptno, d.loc, d.dname
) de
  on e.sal = de.MaxSalary
  and e.deptno = de.deptno
order by de.deptno

Usage Tracking in OBIEE 11g (11.1.1.7.0)

Why Usage Tracking:-
Usage Tracking is used to monitor system and ad-hoc query performance and to provide input on usage trends, for use in billing users and departments for resources. Usage tracking is particularly useful in determining which user queries are creating performance bottlenecks, based on query frequency and response time. When usage tracking is enabled, the Oracle BI Server collects usage tracking data for each query and writes statistics to a usage tracking log file or inserts them directly into a database table. For the purposes of this tutorial, you will use a database table, which is the recommended leading practice.
Here we will separate the usage tracking process into 4 steps:
I. Creating the usage tracking table
II. Creating and customizing the rpd file
III. Changing the configuration in EM
IV. Testing usage tracking in Answers

Step I: Creating the Usage Tracking Table
If we have the latest OBIEE 11g installed on our machine, open sqlplus and connect to the DEV_BIPLATFORM schema.

Write the below query:

select table_name from tabs where table_name like 'S%';

The table named S_NQ_ACCT is related to usage tracking, so we do not have any manual work to do to create the usage tracking table. Before usage tracking is set up, the table will not contain any rows.


Step II: Setting up the Repository File
1. Create a 'New Database' by right-clicking on the empty space in the physical layer of the rpd and name it 'UsageTracking'.



2. Right-click on the physical layer object 'UsageTracking' -> New Object -> select 'Connection Pool' and enter the details as in the screen below:
Connection Pool Name: UsageTracking ConnectionPool
Data Source Name: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=Bose-PC)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=ORCL)))
Password: Admin123
Click 'OK' after you enter the password and you will be prompted to confirm it. Type the same password.


3. Import the 'S_NQ_ACCT' table from the DEV_BIPLATFORM schema into the 'UsageTracking' database in the repository.

4. Double-click the DEV_BIPLATFORM physical folder and rename it 'UsageTracking Schema', so that it looks like the screen below.



6. Drag the following three physical columns from UsageTracking Schema > S_NQ_ACCT to the Measures logical table in the UsageTracking business model. For each column, right-click the logical column, select Rename, and then apply an aggregation rule.


7. Drag the following three physical columns from UsageTracking Schema > S_NQ_ACCT to the Time logical table in the UsageTracking business model.

8. Here is an example screen of how to create a key for the Time logical table: under the Keys tab, type 'Time_key' as the key name and select 'Start Date' in the columns.

9. Drag the following two physical columns from UsageTracking Schema > S_NQ_ACCT to the Topic logical table in the UsageTracking business model and rename them.

Do not forget to apply 'Logical SQL' as a key column for this logical table.


10. Drag the USER_NAME physical column from UsageTracking Schema > S_NQ_ACCT to the User logical table in the UsageTracking business model. Rename it to User Name and apply this column as a key to this table.
Below is the complete BMM layer model screen for UsageTracking:


11. Right-click on the UsageTracking BMM layer folder -> select 'Business Model Diagram' -> Whole Diagram, and then create joins between the tables as shown in the diagram.
12. Just drag the UsageTracking folder to the Presentation layer of the repository.
Now save the repository and run the consistency check. The RPD work is done!!!


Step III: Setting up EM

1. Open the EM window using the link http://localhost:7001/em by entering the weblogic username and password.
Expand WebLogic Domain and then click 'bifoundation_domain' in the left pane. Now select 'System MBean Browser' from the WebLogic Domain menu in the right-side pane, as shown in the picture.

2. Expand 'Application Defined MBeans' and then expand 'oracle.biee.admin' in it.




3. Expand 'BIDomain' under oracle.biee.admin and click on the second 'BIDomain', as shown in the picture. Now select the 'Operations' tab in the right-side pane and then click 'lock'.

Click 'Invoke'. Once we get the screen like below, click 'Return'.

4. Now expand BIDomain.BIServer.ServerConfiguration under oracle.biee.admin in the left pane. Now select all the following options from the right-side pane and apply the change using the following steps:
UsageTrackingCentrallyManaged
UsageTrackingConnectionPool










Step IV: Checking results in Answers using the UsageTracking Subject Area
Now browse some analyses from the existing dashboards, or create some reports, and then create one report using the "UsageTracking" subject area to see the details.






Thanks
