Saturday, May 27, 2017

Prune days and Change Capture in DAC

Prune days and Change Capture in DAC

Here is a question I often get from customers regarding change capture by DAC when it is used for extracting Siebel data. The problem is equally valid for non-Siebel data sources if the default change capture mechanism provided in DAC is used for identifying changes in source data.
How can we ensure that data committed with an update timestamp between the start of an ETL run and the end of the extract is not excluded by the next ETL?
  DAC stores a value called LAST_REFRESH_DATE at the end of an ETL extract.
 During the next run, DAC captures changes to the source data by comparing this LAST_REFRESH_DATE with the LAST_UPD date of records in the source.
·   So if the LAST_REFRESH_DATE is less than LAST_UPD of a source record, then the record being processed has either been updated or created between the time of last execution and the current run
·    But what will happen if a record in the source is committed between the time of starting the data extract and completing the extract with a commit date that is less than the LAST_REFRESH_DATE
Prune days to the rescue!!! The solution to the above problem is to set the PRUNE_DAYS parameter.
If the prune days parameter is set to 1 for example, the change capture process will use LAST_REFRESH_DATE – Prune days as the cutoff timestamp for change capture
Here is an example of such a scenario
·    Let us assume that there is a table called S_CUSTOMER in transaction database with 2 records created during business hours on the 1st of January.
·    On the January 1st, Extraction of Customer Data begins at 1700 hours and completes at January 1st, 1800 hours into a customer Dimension called WC_CUSTOMER.
·    LAST_REFRESH_DATE stored is January 1, 1800 hours
   Now, a new customer is created between 1700 and 1800. But it is committed to the database AFTER the extract is completed with a timestamp of January 1st, 17.30
More customers are added to the transaction database during the next working day. As a result the following is the condition of the source table at the time of the next ETL Extract i.e. Jan 2, 12:00 am

Now the change capture process in DAC will compare the stored LAST_REFRESH_DATE with the update timestamp of the records in the source data and pull only the records updated after this LAST_REFRESH_DATE i.e. Jan 1, 1800 hours into the target table. Unfortunately for the 3rd record depicted below, the LAST_UPD is Jan 1, 1730 i.e. it is before the LAST_REFRESH_DATE. So, it does not qualify for the incremental extract and we have a missing record in the target table.

·    g) But if we had set the PRUNE DAYS parameter to 1, the cutoff date used to identify changes in source data would be .Dec 31, 1800 hours i.e LAST_REFRESH_DATE – 1 day . If this ploy is used, all the records with LAST_UPD > Dec 31, 1800 hours will qualify for the incremental extract and we will have a complete set of data in the target table

The Prune days parameter can be used to ensure that DAC Change Capture does not miss out data created around the time of data extraction.
Set it with care to a reasonable value to avoid increasing the volume of data selected by incremental load process.
New Feature in DAC 7.9 for non-Siebel Sources
For source systems other than Siebel, the Prune Days setting is used in the same way except that the DAC subtracts the number of prune days from the LAST_REFRESH_DATE of a given source and supplies this as the value for the $$LAST_EXTRACT_DATE parameter.

No comments:

Post a Comment

Data engineering Interview Questions

1)  What all challenges you have faced and how did you overcome from it? Ans:- Challenges Faced and Overcome As a hypothetical Spark develop...