Performance Tuning the Siebel Change Capture Process in DAC
DAC performs the change capture process for Siebel source systems. This process has two components:
1) The change capture process occurs before any task in an ETL process runs.
2) The change capture sync process occurs after all of the tasks in an ETL process have completed successfully.
Supporting Source Tables:-
The source tables that support the change capture process are as follows:
S_ETL_I_IMG. Used to store the primary key (along with MODIFICATION_NUM and LAST_UPD) of the records that were either created or modified since the time of the last ETL.
S_ETL_R_IMG. Used to store the primary key (along with MODIFICATION_NUM and LAST_UPD) of the records that were loaded into the data warehouse for the prune time period.
S_ETL_D_IMG. Used to store the primary key of the records that are deleted in the source transactional system.
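All three image tables share essentially the same narrow layout, and the %SUFFIX placeholder that appears in the SQL later in this section refers to the numbered copy of the image tables that DAC assigns to each source table. The DDL below is only an illustrative sketch; the data types and the suffix value are assumptions, not the exact Siebel definitions.
CREATE TABLE S_ETL_I_IMG_1
(
ROW_ID VARCHAR2(15) NOT NULL, -- primary key of the source record
MODIFICATION_NUM NUMBER(10), -- version counter copied from the source row
OPERATION CHAR(1), -- 'I' for created/modified records, 'D' for deleted records
LAST_UPD DATE -- last update timestamp copied from the source row
);
-- S_ETL_R_IMG_n and S_ETL_D_IMG_n carry the same ROW_ID, MODIFICATION_NUM, and LAST_UPD columns.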
Full and Incremental Change Capture Processes:-
The full change capture process (for a first full load) does the following:
Inserts into the S_ETL_R_IMG table the records that were created or modified during the prune time period.
Creates a view on the base table. For example, a view V_CONTACT would be created for the base table S_CONTACT.
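During a full load the view amounts to a simple projection of the base table. The statement below is only a sketch based on the S_CONTACT example above; DAC generates the actual view definition.
CREATE VIEW V_CONTACT AS
SELECT *
FROM S_CONTACT;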
The incremental change capture process (for subsequent incremental loads) does the following:
Queries for all records that have changed in the transactional tables since the last ETL date, filters them against the records in the S_ETL_R_IMG table, and inserts them into the S_ETL_I_IMG table.
Queries the S_ETL_D_IMG table for the records that were deleted in the source transactional system and inserts them into the S_ETL_I_IMG table.
Removes the duplicates in the S_ETL_I_IMG table. This is essential for databases where "dirty reads" (queries returning uncommitted data from other transactions) are allowed.
Creates a view that joins the base table with the corresponding S_ETL_I_IMG table.
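Continuing the S_CONTACT example, the incremental view restricts the base table to the rows captured in its image table. This is only a sketch of the join described above, with an assumed image-table suffix of 1; DAC generates the actual view definition.
CREATE VIEW V_CONTACT AS
SELECT S_CONTACT.*
FROM S_CONTACT, S_ETL_I_IMG_1
WHERE S_CONTACT.ROW_ID = S_ETL_I_IMG_1.ROW_ID;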
Performance Tips for Siebel Sources:-
Performance Tip: Reduce Prune Time Period:-
Reducing the prune time period (in the Connectivity Parameters subtab of the Execution Plans tab) can improve performance because, with a lower prune time period, the S_ETL_R_IMG table will contain fewer rows. The default prune time period is 2 days. You can reduce it to a minimum of 1 day.
Note: If your organization has mobile users, when setting the prune time period, you must consider the lag time that may exist between the timestamp of the transactional system and the mobile users' local timestamp. You
should interview your business users to determine the potential lag time, and then set the prune time period accordingly.
Performance Tip: Eliminate S_ETL_R_IMG From the Change Capture Process
If your Siebel implementation does not have any mobile users (which can cause inaccuracies in the values of the "LAST_UPD" attribute), you can simplify the change capture process by doing the following:
Removing the S_ETL_R_IMG table.
Using the LAST_REFRESH_DATE rather than PRUNED_LAST_REFRESH_DATE.
To override the default DAC behavior, add the following SQL to the customsql.xml file, placing it before the last line in the file (the file's closing tag). The customsql.xml file is located in the dac\CustomSQLs directory.
TRUNCATE TABLE S_ETL_I_IMG_%SUFFIX
;
TRUNCATE TABLE S_ETL_D_IMG_%SUFFIX
;
TRUNCATE TABLE S_ETL_I_IMG_%SUFFIX
;
INSERT %APPEND INTO S_ETL_I_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, OPERATION, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,'I'
,LAST_UPD
FROM
%SRC_TABLE
WHERE
%SRC_TABLE.LAST_UPD > %LAST_REFRESH_TIME
%FILTER
;
INSERT %APPEND INTO S_ETL_I_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, OPERATION, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,'D'
,LAST_UPD
FROM
S_ETL_D_IMG_%SUFFIX
WHERE NOT EXISTS
(
SELECT
'X'
FROM
S_ETL_I_IMG_%SUFFIX
WHERE
S_ETL_I_IMG_%SUFFIX.ROW_ID = S_ETL_D_IMG_%SUFFIX.ROW_ID
)
;
DELETE
FROM S_ETL_D_IMG_%SUFFIX
WHERE
EXISTS
(SELECT
'X'
FROM
S_ETL_I_IMG_%SUFFIX
WHERE
S_ETL_D_IMG_%SUFFIX.ROW_ID = S_ETL_I_IMG_%SUFFIX.ROW_ID
AND S_ETL_I_IMG_%SUFFIX.OPERATION = 'D'
)
;
Performance Tip: Omit the Process to Eliminate Duplicate Records
When the Siebel change capture process runs on a live transactional system, it can run into deadlock issues when DAC queries for the records that changed since the last ETL process. To alleviate this problem, you need to enable "dirty reads" on the machine where the ETL is run. If the transactional system is on a database that requires "dirty reads" for change capture, such as MSSQL, DB2, or DB2-390, it is possible that the record identifier column (ROW_ID) inserted into the S_ETL_I_IMG table may contain duplicates. Before starting the ETL process, DAC eliminates such duplicate records so that only the record with the smallest MODIFICATION_NUM is kept. The SQL used by DAC is as follows:
SELECT
ROW_ID, LAST_UPD, MODIFICATION_NUM
FROM
S_ETL_I_IMG_%SUFFIX A
WHERE EXISTS
(
SELECT B.ROW_ID, COUNT(*) FROM S_ETL_I_IMG_%SUFFIX B
WHERE B.ROW_ID = A.ROW_ID
AND B.OPERATION = 'I'
AND A.OPERATION = 'I'
GROUP BY
B.ROW_ID
HAVING COUNT(*) > 1
)
AND A.OPERATION = 'I'
ORDER BY 1,2
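The query above only identifies the ROW_ID values that occur more than once; the cleanup that follows keeps the row with the smallest MODIFICATION_NUM. The DELETE below is one way to express that cleanup in plain SQL and is shown for illustration only; it is not necessarily the statement DAC itself issues.
DELETE
FROM S_ETL_I_IMG_%SUFFIX
WHERE OPERATION = 'I'
AND EXISTS
(
SELECT 'X'
FROM S_ETL_I_IMG_%SUFFIX B
WHERE B.ROW_ID = S_ETL_I_IMG_%SUFFIX.ROW_ID
AND B.OPERATION = 'I'
AND B.MODIFICATION_NUM < S_ETL_I_IMG_%SUFFIX.MODIFICATION_NUM -- keep only the smallest MODIFICATION_NUM per ROW_ID
)
;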
However, for situations where deadlocks and "dirty reads" are not an issue, you can omit the duplicate-detection step by overriding it with the following SQL block, whose WHERE 1=2 predicate never matches any rows. Copy the SQL block into the customsql.xml file, placing it before the last line in the file (the file's closing tag). The customsql.xml file is located in the dac\CustomSQLs directory.
SELECT
ROW_ID, LAST_UPD, MODIFICATION_NUM
FROM
S_ETL_I_IMG_%SUFFIX A
WHERE 1=2
Performance Tip: Manage Change Capture Views
DAC drops and re-creates the incremental views for every ETL process. This is done because DAC anticipates that the transactional system may add new columns to tables to track new attributes in the data warehouse. If you do not anticipate such changes in the production environment, you can set the DAC system property "Drop and Create Change Capture Views Always" to "false" so that DAC will not drop and re-create the incremental views. On DB2 and DB2-390 databases, dropping and creating views can cause deadlock issues on the system catalog tables. Therefore, if your transactional database type is DB2 or DB2-390, you may want to consider setting the DAC system property "Drop and Create Change Capture Views Always" to "false." For other database types, this action may not enhance performance.
Note: If new columns are added to the transactional system and the ETL process is modified to extract data from those columns, and if views are not dropped and created, you will not see the new column definitions in the
view, and the ETL process will fail.
Performance Tip: Determine Whether Informatica Filters on Additional Attributes
DAC populates the S_ETL_I_IMG tables by querying only for data that changed since the last ETL process. This may cause all of the records that were created or updated since the last refresh time to be extracted.
However, the extract processes in Informatica may be filtering on additional attributes. Therefore, for long-running change capture tasks, you should inspect the Informatica mapping to see if it has additional WHERE
clauses not present in the DAC change capture process. You can modify the DAC change capture process by adding a filter clause for a
is located in the dac\CustomSQLs directory.
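As an illustration of where such a filter belongs, the example below adds a predicate at the position of the %FILTER placeholder in the change capture INSERT. The S_CONTACT table and the CONTACT_FLG = 'Y' condition are assumptions standing in for whatever WHERE clause the Informatica mapping actually applies.
INSERT %APPEND INTO S_ETL_I_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, OPERATION, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,'I'
,LAST_UPD
FROM
S_CONTACT
WHERE
S_CONTACT.LAST_UPD > %PRUNED_LAST_REFRESH_TIME
AND S_CONTACT.CONTACT_FLG = 'Y' -- hypothetical filter mirroring the Informatica mapping
;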
SQL for Change Capture and Change Capture Sync Processes
The SQL blocks used for the change capture and change capture sync processes are as follows:
TRUNCATE TABLE S_ETL_I_IMG_%SUFFIX
;
TRUNCATE TABLE S_ETL_R_IMG_%SUFFIX
;
TRUNCATE TABLE S_ETL_D_IMG_%SUFFIX
;
INSERT %APPEND INTO S_ETL_R_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,LAST_UPD
FROM
%SRC_TABLE
WHERE
LAST_UPD > %PRUNED_ETL_START_TIME
%FILTER
;
TRUNCATE TABLE S_ETL_I_IMG_%SUFFIX
;
INSERT %APPEND INTO S_ETL_I_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, OPERATION, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,'I'
,LAST_UPD
FROM
%SRC_TABLE
WHERE
%SRC_TABLE.LAST_UPD > %PRUNED_LAST_REFRESH_TIME
%FILTER
AND NOT EXISTS
(
SELECT
ROW_ID
,MODIFICATION_NUM
,'I'
,LAST_UPD
FROM
S_ETL_R_IMG_%SUFFIX
WHERE
S_ETL_R_IMG_%SUFFIX.ROW_ID = %SRC_TABLE.ROW_ID
AND S_ETL_R_IMG_%SUFFIX.MODIFICATION_NUM = %SRC_TABLE.MODIFICATION_NUM
AND S_ETL_R_IMG_%SUFFIX.LAST_UPD = %SRC_TABLE.LAST_UPD
)
;
INSERT %APPEND INTO S_ETL_I_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, OPERATION, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,'D'
,LAST_UPD
FROM
S_ETL_D_IMG_%SUFFIX
WHERE NOT EXISTS
(
SELECT
'X'
FROM
S_ETL_I_IMG_%SUFFIX
WHERE
S_ETL_I_IMG_%SUFFIX.ROW_ID = S_ETL_D_IMG_%SUFFIX.ROW_ID
)
;
DELETE
FROM S_ETL_D_IMG_%SUFFIX
WHERE
EXISTS
(
SELECT
'X'
FROM
S_ETL_I_IMG_%SUFFIX
WHERE
S_ETL_D_IMG_%SUFFIX.ROW_ID = S_ETL_I_IMG_%SUFFIX.ROW_ID
AND S_ETL_I_IMG_%SUFFIX.OPERATION = 'D'
)
;
DELETE
FROM S_ETL_I_IMG_%SUFFIX
WHERE LAST_UPD < %PRUNED_ETL_START_TIME
;
DELETE
FROM S_ETL_I_IMG_%SUFFIX
WHERE LAST_UPD > %ETL_START_TIME
;
DELETE
FROM S_ETL_R_IMG_%SUFFIX
WHERE
EXISTS
(
SELECT
'X'
FROM
S_ETL_I_IMG_%SUFFIX
WHERE
S_ETL_R_IMG_%SUFFIX.ROW_ID = S_ETL_I_IMG_%SUFFIX.ROW_ID
)
;
INSERT %APPEND INTO S_ETL_R_IMG_%SUFFIX
(ROW_ID, MODIFICATION_NUM, LAST_UPD)
SELECT
ROW_ID
,MODIFICATION_NUM
,LAST_UPD
FROM
S_ETL_I_IMG_%SUFFIX
;
DELETE FROM S_ETL_R_IMG_%SUFFIX WHERE LAST_UPD < %PRUNED_ETL_START_TIME
;