DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE: October 2017

--------------------------------------------------------------------------
Components of Teradata
--------------------------------------------------------------------------
Teradata is made up of following components –
Processor Chip – The processor is the BRAIN of the Teradata system. It is responsible for all the processing done by the system. All task are done according to the direction of the processor.
Memory – The memory is known as the HAND of the Teradata system. Data is retrieved from the hard drives into memory, where processor manipulates, change or alter the data. Once changes are made in memory, the processor directs the information back to the hard drive for storage.
Hard Drives – This is known as the SPINE of the Teradata system. All the data of the Teradata system is stored in the hard drives. Size of hard drives reflects the size of the Teradata system.

Teradata has Linear Scalability
One of the most important asset of Teradata is that it has Linear Scalability. There is no limit on Teradata system. We can grow it to as many times as we want. Any time you want to double the speed of Teradata system, just double the numbers of AMPs and PE. This can be better explained with the help of an example –
Teradata takes every table in the system and spread evenly among different AMPs. Each Amp works on the portion of records which it holds.
Suppose a EMPLOYEE table has 8 different employee id’s. Now in a 2 AMP system each AMP will hold 4 rows in its DISK to accommodate total 8 rows.

At the time of data retrieval each AMP will work on its DISK and send 4 rows to PE for further processing. If we suppose, one AMP will take 1 microseconds (MS) to retrieve 1 rows, then the time taken to retrieve 4 rows is 4 MS. And as we know that AMPs work in parallel, so both the AMPs will retrieve all 8 records in 4 MS only (4 MS time for each AMP).
Now we double the AMP in our system, and we use total 4 AMP. As Teradata distribute the records evenly among all AMPs, so now each AMP will store 2 records of the table.

--------------------------------------------------------------------------
SET or MULTISET tables:-
--------------------------------------------------------------------------
SET tables – SET tables did not allow duplicate values in the table.
MULTISET tables – MULTISET tables allow duplicate values in table.

CREATE VOLATILE TABLE temp
(
ID INT,
Name VARCHAR(20)
)
PRIMARY INDEX(ID)
ON COMMIT PRESERVE ROWS;

or
CREATE VOLATILE TABLE dept_stat (
dept_no INTEGER,
avg_salary INTEGER,
max_salary INTEGER,
min_salary INTEGER
)
PRIMARY INDEX(dept_no)
ON COMMIT PRESERVE ROWS;

show test;

select * from temp;

insert into test values(1,'DINESH');
insert into test values(2,'LUCK');

---

create index (id) on test1

create unique index (id) on test1

HELP STATISTICS test1
-----------------------------------------------------------
*EXPLAIN command returns the execution plan of parsing engine in English*

EXPLAIN select * from test1
show table test

-----------------------------------------------------------------------------
Built-in Functions in Teradata
-----------------------------------------------------------------------------

select ACCOUNT;
select CURRENT_DATE;
select CURRENT_TIME;
select CURRENT_TIMESTAMP;
select DATABASE;
select date;
select session;
select time;
select user;

-----------------------------------------------------------------------------
How to get Case Sensitivity in Teradata
-----------------------------------------------------------------------------
Teradata by default is case insensitive. Whatever value you compare in the WHERE clause it does not match the exact case of that value.
So what if there is a scenario when you want to compare the value along with its specific case in the WHERE clause. Here come the CASESPECIFIC keyword for the rescue.Just by adding this keyword after the value in WHERE clause we can force the Teradata to impose case sensitivity on the checking condition

select current_date where upper('teradata')='teradata'(CASESPECIFIC);---not return any date.

select current_date where lower('teradata')='teradata'(CASESPECIFIC);---return date.

SELECT emp_name FROM emp_table WHERE emp_name (CASESPECIFIC) LIKE ‘John%’ —— This query will return all name starting with capital J only
So just by using CASESPECIFIC we can force Teradata to check for case sensitivity
--------------------------------------
Primary Index in Teradata-
--------------------------------------
Each table in Teradata is required to have a primary index.Even if you did not define any primary index in CREATE table statement, the Teradata system will automatically create the primary index based on the setting of DBScontrol setting field . The primary index defines where data will reside and which AMP receives the row.
The three most important roles the primary index does is the following –
Data Distribution
Fastest way to retrieve Data
Incredibly important for Joins
In short primary index provides the fastest physical path to retrieving dazta.

2 type:-
A unique primary index means that the value for the selected column must be unique.
A Non unique primary index means that the value for the selected column can be non unique.

The primary index is the key to determine where the ROW of the table will reside on which AMP. When a new row arrive for insert in Teradata the following steps occur –
Teradata Parsing Engine (PE) examines the primary index of the row.
Teradata takes the primary index of the rows and run it through HASHING ALGORITHM.
The output of the Hashing Algorithm is the 32 bit Row – Hash value.

-----------------------------------------------------------------------------
Secondary Index in Teradata:--
-----------------------------------------------------------------------------
Secondary Indexes provide an alternate path to the data, and should be used on queries that run many times.

Syntax of creating Secondary Index
Syntax of UNIQUE SI:
CREATE UNIQUE INDEX (Column/Columns) ON .;
Syntax of NON-UNIQUE SI:
CREATE INDEX (Column/Columns) ON .;

Whenever you create SI on the table, Teradata will create a subtable on all AMP. This subtable contains three columns given below –
Secondary Index Value
Secondary Index Row ID (this is the hashed value of SI value)
Base table Row ID (this is the actual base row id )

-----------------------------------------------------------------------------
Recursive Query in Teradata
-----------------------------------------------------------------------------
Using Recursive query data with in the single table can be scanned multiple times using a single SQL statement.
A recursive query is useful when relationships between data elements exist in multiple iterations. Recursive query consist of two parts a seeding query and a recursive query.
Seeding query provides the initial records for which data needs to be iterated whereas Recursive query part helps in data iteration.

Recursive query syntax :

WITH RECURSIVE < recursive_tablename >
(,,….)
AS
(

UNION ALL

)

1) Recursive_tablename is a derived recursive table defined using WITH RECURSIVE keyword.
2) Column names of recursive table is specified as . This column names will be finally displayed in the output.
3) Seeding query ( query which is used as the input for recursive logic ) is mentioned as the first select statement.
4) Second select statement is known as the recursive query and is joined with seed table for multiple iterations.
5) Third select statement finally gives the output of recursive logic.

CREATE VOLATILE TABLE Employee
(
EMP_ID INTEGER,
MGR_ID INTEGER,
EMP_NAME VARCHAR(100)
)
ON COMMIT PRESERVE ROWS;

-----------------------------------------------------------------------------------------------------------------------
ROLLUP Function:

-----------------------------------------------------------------------------------------------------------------------
ROLLUP is used to aggregate data along all the levels of hierarchy within a single dimension.
SELECT PRODUCT_ID
,SUM(QUANTITY) AS TOTAL_QUANTITY
FROM STORE_QTY
GROUP BY ROLLUP (PRODUCT_ID)
ORDER BY 1;
roll2
Here along with all Product_Id you get a row which has ? in PRODUCT_ID column.
This ? Isn’t null but instead indicates the grand total of all the product_id.

Top SQL Commands
--------------------------------
--------------------------------

SQL to changing the default Database
Profile Keywords

DATABASE EMP_DATA_BASE;

SQL to find Information about a Database
HELP DATABASE EMP_DATA_BASE;

SQL to get Sample number of rows
SELECT * FROM EMP_TBL SAMPLE 10;

SQL to get a sample Percentage of rows
SELECT * FROM EMP_TBL SAMPLE .50;

SQL to find information about a Table
SHOW TABLE EMP_TBL;

SQL to Use an Access Locked Table
LOCKING ROW FOR ACCESS SELECT * FROM EMP_TBL;

SQL Keywords that describe you
SELECT DATABASE, USER, SESSION,
ACCOUNT,
PROFILE,
ROLE;"
"SELECT DATE,
CURRENT_DATE,
TIME,
CURRENT_TIME,
CURRENT_TIMESTAMP;

SQL to Use Aggregates functions
SELECT TOP 10 STUDENT_NO, FIRST_NAME, LAST_NAME, CLASS_CODE
FROM STUDENT_TBL
ORDER BY GRADE DESC;"
"SELECT DEPT_NO
,MAX(SALARY) AS ""MAXIMUM""
,MIN(SALARY) AS ""MINIMUM""
,AVG(SALARY) AS ""AVERAGE""
,SUM(SALARY) AS ""SUM""
,COUNT(*) AS ""COUNT""
FROM EMP_TBL
GROUP BY DEPT_NO
ORDER BY DEPT_NO;

SQL to Select TOP Rows in a Rank Order
SELECT TOP 10 STUDENT_NO
,FIRST_NAME
,LAST_NAME
,CLASS_CODE
FROM STUDENT_TBL
ORDER BY GRADE DESC;

SQL Using Date, Time and Timestamp
SELECT
CALENDAR_DATE
,DAY_OF_WEEK
,DAY_OF_MONTH
,DAY_OF_YEAR
,DAY_OF_CALENDAR
,WEEKDAY_OF_MONTH
,WEEK_OF_MONTH
,WEEK_OF_YEAR
,WEEK_OF_CALENDAR
,MONTH_OF_QUARTER
,MONTH_OF_YEAR
,MONTH_OF_CALENDAR
,QUARTER_OF_YEAR
,QUARTER_OF_CALENDAR
,YEAR_OF_CALENDAR
FROM SYS_CALENDAR.CALENDAR;

SQL to Find out how much Space a USER have
SELECT
USERNAME
,CREATORNAME
,PERMSPACE
,SPOOLSPACE
,TEMPSPACE
,LASTALTERNAME
,LASTALTERTIMESTAMP
FROM DBC.USERS
WHERE USERNAME='USER';

SQL to find how much Space left Per AMP in database
SELECT
VPROC
,DATABASENAME
,ACCOUNTNAME
,MAXPERM
,MAXSPOOL
,MAXTEMP
FROM DBC.DISKSPACE
WHERE DATABASENAME='EMP_DB' ;

SQL to finding USER Space
SELECT
MAX(MAXPERM)
,MAX(MAXSPOOL)
,MAX(MAXTEMP)
FROM DBC.DISKSPACE
WHERE DATABASENAME='USER' ;

SQL to find Space Skew in Tables in a Database
SELECT VPROC
,CAST(TABLENAME AS CHAR(20))
,CURRENTPERM
,PEAKPERM
FROM DBC.TABLESIZEV
WHERE DATABASENAME='USER'
ORDER BY TABLENAME, VPROC;

SQL to Find Table Skew
SELECT
TABLENAME,
SUM(CURRENTPERM) /(1024*1024) AS CURRENTPERM,
(100 - (AVG(CURRENTPERM)/MAX(CURRENTPERM)*100)) AS SKEWFACTOR
FROM
DBC.TABLESIZE
WHERE DATABASENAME=
AND TABLENAME =
GROUP BY 1;

SQL to Find AMP Skew
SELECT DATABASENAME
,TABLENAME
,VPROC
,CURRENTPERM
,PEAKPERM
FROM DBC.TABLESIZE
WHERE
DATABASENAME=
AND
TABLENAME=
ORDER BY VPROC ;

SQL to find number of rows per AMP for a Column
SELECT HASHAMP(HASHBUCKET( HASHROW(EMP_NO))) AS ""AMP"" , COUNT(*)
FROM EMP_TABLE
GROUP BY 1
ORDER BY 1;

SQL to Identify duplicate records
SELECT COLUMN1, COLUMN2, COLUMN3, COUNT(*)
FROM
DATABASE.TABLE
GROUP BY COLUMN1, COLUMN2, COLUMN3
HAVING COUNT(*) >1;

SQL to Delete Duplicate records
CREATE TABLE TABLE1_BACKUP AS (SELECT * FROM TABLE1 QUALIFY ROW_NUMBER() OVER (PARTITION BY COLUMN1 ORDER BY COLUMN1 DESC )=1) WITH DATA;
DELETE FROM TABLE1;
INSERT INTO TABLE1 SELECT * FROM TABLE_BACKUP;

SQL below to find TOP Databases by space occupied
SELECT
DatabaseName
,MAX(CurrentPerm) * (HASHAMP()+1)/1024/1024 AS USEDSPACE_IN_MB
FROM DBC.DiskSpace
GROUP BY DatabaseName
ORDER BY USEDSPACE_IN_MB DESC;

SQL to find TOP Tables by space occupied
SELECT DATABASENAME
,TABLENAME
,SUM(CurrentPerm)/1024/1024 AS TABLESIZE_IN_MB
FROM DBC.TableSize
GROUP BY DATABASENAME,TABLENAME
ORDER BY TABLESIZE_IN_MB DESC;

SQL to find out list of nodes
SELECT DISTINCT NODEID FROM DBC.RESUSAGESPMA;

SQL to find Account Information
SELECT * FROM DBC.AccountInfoV ORDER BY 1;

DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE

Tuesday, October 17, 2017

TERA DATA PART1

ADF functions

DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE