Prune days and Change Capture in DAC
Here is a question I often get from customers
regarding change capture by DAC when it is used for extracting Siebel data. The
problem is equally valid for non-Siebel data sources if the default change
capture mechanism provided in DAC is used for identifying changes in source
data.
How can we ensure that
data committed with an update timestamp between the start of an ETL run and the
end of the extract is not excluded by the next ETL?
DAC stores a value called LAST_REFRESH_DATE at the end of an ETL
extract.
During the next run, DAC captures changes to the source data by
comparing this LAST_REFRESH_DATE with the LAST_UPD date of records in the
source.
· So if the LAST_REFRESH_DATE is less than LAST_UPD of a source
record, then the record being processed has either been updated or created
between the time of last execution and the current run
· But what will happen if a record in the source is committed
between the time of starting the data extract and completing the extract with a
commit date that is less than the LAST_REFRESH_DATE
Prune days to the
rescue!!! The
solution to the above problem is to set the PRUNE_DAYS parameter.
If the prune days parameter is set to 1 for
example, the change capture process will use LAST_REFRESH_DATE – Prune days as
the cutoff timestamp for change capture
Here is an example of
such a scenario
· Let us assume that there is a table called S_CUSTOMER in
transaction database with 2 records created during business hours on the 1st of
January.
· On the January 1st, Extraction of Customer Data begins at 1700
hours and completes at January 1st, 1800 hours into a customer Dimension called
WC_CUSTOMER.
· LAST_REFRESH_DATE stored is January 1, 1800 hours
Now, a new customer is created between 1700 and 1800. But it is
committed to the database AFTER the extract is completed with a timestamp of
January 1st, 17.30
More customers are added to the transaction database during the
next working day. As a result the following is the condition of the source
table at the time of the next ETL Extract i.e. Jan 2, 12:00 am
Now the change capture process in DAC will compare the stored LAST_REFRESH_DATE with the update timestamp of the records in the source data and pull only the records updated after this LAST_REFRESH_DATE i.e. Jan 1, 1800 hours into the target table. Unfortunately for the 3rd record depicted below, the LAST_UPD is Jan 1, 1730 i.e. it is before the LAST_REFRESH_DATE. So, it does not qualify for the incremental extract and we have a missing record in the target table.
· g) But if we had set the PRUNE DAYS parameter to 1, the cutoff
date used to identify changes in source data would be .Dec 31, 1800 hours i.e
LAST_REFRESH_DATE – 1 day . If this ploy is used, all the records with LAST_UPD
> Dec 31, 1800 hours will qualify for the incremental extract and we will
have a complete set of data in the target table
The Prune days parameter can be used to ensure that DAC Change Capture does not miss out data created around the time of data extraction.
Set it with care to a reasonable value to
avoid increasing the volume of data selected by incremental load process.
New Feature in DAC 7.9
for non-Siebel Sources
For source systems other than Siebel, the
Prune Days setting is used in the same way except that the DAC subtracts the
number of prune days from the LAST_REFRESH_DATE of a given source and supplies
this as the value for the $$LAST_EXTRACT_DATE parameter.
No comments:
Post a Comment