DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE

Thursday, October 9, 2014

Data Warehousing Question

What is Data Warehousing?

A data warehouse is the main repository of an organization's historical data, it is corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems. Data warehousing collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. It is a repository of integrated information, available for queries and analysis.

What is Business Intelligence (BI)?

Business Intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of business information and sometimes to the information itself. The purpose of business intelligence is to support better business decision making. Thus, BI is also described as a decision support system (DSS).

BI systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a data warehouse or a data mart and occasionally working from operational data.

What is Dimension table?

Dimensional table contains textual attributes of measurements stored in the facts tables. Dimensional table is a collection of hierarchies, categories and logic which can be used for user to traverse in hierarchy nodes.

What is Dimensional Modeling?

Dimensional data model concept involves two types of tables and it is different from the 3rd normal form. This concept uses Facts table which contains the measurements of the business and Dimension table which contains the context (dimension of calculation) of the measurements.

What is Fact table?

Fact table contains measurements of business process. Fact table contains the foreign keys for the dimension tables. Example, if you are business process is "paper production", "average production of paper by one machine" or "weekly production of paper" will be considered as measurement of business process.

What are fundamental stages of Data Warehousing?

There are four different fundamental stages of Data Warehousing.

Offline Operational Databases:

Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off‐line server where the processing load of reporting does not impact on the operational system's performance.

Offline Data Warehouse:

Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting‐oriented data structure

Real Time Data Warehouse:

Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)

Integrated Data Warehouse:

Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

What are the Different methods of loading Dimension tables?

There are two different ways to load data in dimension tables.

Conventional (Slow):

All the constraints and keys are validated against the data before, it is loaded, this way data integrity is maintained.

Direct (Fast):

All the constraints and keys are disabled before the data is loaded. Once data is loaded, it is validated against all the constraints and keys. If data is found invalid or dirty it is not included in index and all future processes are skipped on this data.

Describes the foreign key columns in fact table and dimension table?

Foreign keys of dimension tables are primary keys of entity tables.

Foreign keys of facts tables are primary keys of Dimension tables.

What is Data Mining?

Data Mining is the process of analyzing data from different perspectives and summarizing it into useful information.

What is the difference between view and materialized view?

A view takes the output of a query and makes it appear like a virtual table and it can be used in place of tables.

A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.

What is OLTP?

OLTP is abbreviation of On‐Line Transaction Processing. This system is an application that modifies data the instance it receives and has a large number of concurrent users.

What is OLAP?

OLAP is abbreviation of Online Analytical Processing. This system is an application that collects, manages, processes and presents multidimensional data for analysis and management purposes.

What is the difference between OLTP and OLAP?

Data Source

OLTP: Operational data is from original data source of the data

OLAP: Consolidation data is from various sources.

Process Goal

OLTP: Snapshot of business processes which does fundamental business tasks

OLAP: Multi‐dimensional views of business activities of planning and decision making

Queries and Process Scripts

OLTP: Simple quick running queries ran by users.

OLAP: Complex long running queries by system to update the aggregated data.

Database Design

OLTP: Normalized small database. Speed will be not an issue due to smaller database and normalization will not degrade performance. This adopts entity relationship(ER) model and an application‐oriented database design.

OLAP: De‐normalized large database. Speed is issue due to larger database and de‐normalizing will improve performance as there will be lesser tables to scan while performing tasks. This adopts star, snowflake or fact constellation mode of subject‐oriented database design.

Back up and System Administration

OLTP: Regular Database backup and system administration can do the job.

OLAP: Reloading the OLTP data is good considered as good backup option.

What are normalization forms?

There are different types of normalization forms like,

1NF: Eliminate Repeating Groups

Make a separate table for each set of related attributes, and give each table a primary key. Each field contains at most one value from its attribute domain.

2NF: Eliminate Redundant Data

If an attribute depends on only part of a multi‐valued key, remove it to a separate table.

3NF: Eliminate Columns Not Dependent On Key

If attributes do not contribute to a description of the key, remove them to a separate table. All attributes must be directly dependent on the primary key

BCNF: Boyce‐Codd Normal Form

If there are non‐trivial dependencies between candidate key attributes, separate them out into distinct tables.

4NF: Isolate Independent Multiple Relationships

No table may contain two or more 1:n or n:m relationships that are not directly related.

5NF: Isolate Semantically Related Multiple Relationships

There may be practical constrains on information that justify separating logically related many‐to‐many relationships.

ONF: Optimal Normal Form

A model limited to only simple (elemental) facts, as expressed in Object Role Model notation.

DKNF: Domain‐Key Normal Form A model free from all modification anomalies.

Remember, these normalization guidelines are cumulative. For a database to be in

3NF, it must first fulfill all the criteria of a 2NF and 1NF database.

What is ODS?

ODS is abbreviation of Operational Data Store. A database structure that is a repository for near real‐time operational data rather than long term trend data. The

ODS may further become the enterprise shared operational database, allowing operational systems that are being re‐engineered to use the ODS as there operation databases.

What is ER Diagram?

Entity Relationship Diagrams are a major data modeling tool and will help organize

the data in your project into entities and define the relationships between the

entities. This process has proved to enable the analyst to produce a good database

structure so that the data can be stored and retrieved in a most efficient manner.

An entity‐relationship (ER) diagram is a specialized graphic that illustrates the

interrelationships between entities in a database. A type of diagram used in data

modeling for relational data bases. These diagrams show the structure of each table

and the links between tables.

What is ETL?

ETL is abbreviation of extract, transform, and load. ETL is software that enables

businesses to consolidate their disparate data while moving it from place to place,

and it doesn't really matter that that data is in different forms or formats. The data

can come from any source.ETL is powerful enough to handle such data disparities.

First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data using rules or lookup tables, or creating combinations with other data ‐ to convert it to the desired state. Finally, the load function is used to write the resulting data to a target database.

What is VLDB?

VLDB is abbreviation of Very Large Database. A one terabyte database would

Normally be considered to be a VLDB. Typically, these are decision support systems

Or transaction processing applications serving large numbers of users.

Is OLTP database is design optimal for Data Warehouse?

No. OLTP database tables are normalized and it will add additional time to queries to return results. Additionally OLTP database is smaller and it does not contain longerperiod (many years) data, which needs to be analyzed. A OLTP system is basically ER model and not Dimensional Model. If a complex query is executed on a OLTP system, it may cause a heavy overhead on the OLTP server that will affect the normal business processes.

If de‐normalized is improves data warehouse processes, why fact table is in normal form?

Foreign keys of facts tables are primary keys of Dimension tables. It is clear that fact table contains columns which are primary key to other table that itself make normal form table.

What are lookup tables?

A lookup table is the table placed on the target table based upon the primary key of

The target, it just updates the table by allowing only modified (new or updated)

Records based on the lookup condition.

What are Aggregate tables?

Aggregate table contains the summary of existing warehouse data which is grouped

to certain levels of dimensions. It is always easy to retrieve data from aggregated

tables than visiting original table which has million records. Aggregate tables reduce the load in the database server and increase the performance of the query and can retrieve the result quickly.

What is real time data‐warehousing?

Data warehousing captures business activity data. Real‐time data warehousing

captures business activity data as it occurs. As soon as the business activity is

Complete and there is data about it, the completed activity data flows into the data

Warehouse and becomes available instantly.

Factless fact table?

What is a Factless Fact table?

A Fact table is a table which consists of measurements, metrics or facts of a business process. A

Factless fact table is a fact table that doesn't contain any facts. They contain information to capture

Information but not for calculation

Example:

Tracking a student or employee attendance.

What is factless facts table?

A fact table which does not contain numeric fact columns it is called factless facts

Table.

What are conformed dimensions?

Conformed dimensions mean the exact same thing with every possible fact table to

which they are joined. They are common to the cubes.

A Dimension which exists in more than one fact table is known as conformed dimensions.

What is conformed fact?

Conformed dimensions are the dimensions which can be used across multiple Data

Marts in combination with multiple facts tables accordingly.

How do you load the time dimension?

Time dimensions are usually loaded by a program that loops through all possible

Dates that may appear in the data. 100 years may be represented in a time

Dimension, with one row per day.

What is a level of Granularity of a fact table?

Level of granularity means level of detail that you put into the fact table in a data

Warehouse. Level of granularity would mean what detail are you willing to put for

Each transactional fact.

What are non‐additive facts?

Non‐additive facts are facts that cannot be summed up for any of the dimensions

Present in the fact table. However they are not considered as useless. If there are

Changes in dimensions the same facts can be useful.

What are slowly changing dimensions (SCD)?

SCD is abbreviation of slowly changing dimensions. SCD applies to cases where the

Attribute for a record varies over time. There are three different types of SCD.

1) SCD1: The new record replaces the original record. Only one record exist in

Database ‐ current data.

2) SCD2: A new record is added into the customer dimension table. Two records

Exist in database ‐ current data and previous history data.

3) SCD3: The original data is modified to include new data. One record exist in

Database ‐ new information are attached with old information in same row.

What is hybrid slowly changing dimension?

Hybrid SCDs are combination of both SCD 1 and SCD 2. It may happen that in a table, some columns are important and we need to track changes for them i.e. capture the historical data for them whereas in some columns even if the data changes, we don't care.

What is BUS Schema?

BUS Schema is composed of a master suite of confirmed dimension and standardized definition if facts.

What is a Star Schema?

Star schema is a type of organizing the tables such that we can retrieve the result from the database quickly in the warehouse environment.

What Snow Flake Schema?

Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table.

Differences between star and snowflake schema?

Star schema: A single fact table with N number of Dimension, all dimensions will be linked directly with a fact table. This schema is denormalized and results in simple join and less complex query as well as faster results.

Snow schema: Any dimensions with extended dimensions are known as Snowflake schema, dimensions maybe interlinked or may have one to many relationships with other tables. This schema is normalized and results in complex join and very complex query as well as slower results.

What is Difference between ER Modeling and Dimensional Modeling?

ER modeling is used for normalizing the OLTP database design. Dimensional

Modeling is used for de‐normalizing the ROLAP/MOLAP design.

What is degenerate dimension table?

If a table contains the values, which is neither dimension nor measures is called

degenerate dimensions.

Why is Data Modeling Important?

Data modeling is probably the most labor intensive and time consuming part of the

development process. The goal of the data model is to make sure that the all data

objects required by the database are completely and accurately represented.

Because the data model uses easily understood notations and natural language, it

can be reviewed and verified as correct by the end‐users.

In computer science, data modeling is the process of creating a data model by

applying a data model theory to create a data model instance. A data model theory

is a formal data model description. When data modeling, we are structuring and

organizing data. These data structures are then typically implemented in a database management system. In addition to defining and organizing the data, data modelling will impose (implicitly or explicitly) constraints or limitations on the data placed within the structure.

Managing large quantities of structured and unstructured data is a primary function

of information systems. Data models describe structured data for storage in data

management systems such as relational databases. They typically do not describe

unstructured data, such as word processing documents, email messages, pictures,

digital audio, and video.

What is surrogate key?

Surrogate key is a substitution for the natural primary key. It is just a unique

Identifier or number for each row that can be used for the primary key to the table.

The only requirement for a surrogate primary key is that it is unique for each row in

The table. It is useful because the natural primary key can change and this makes

updates more difficult. Surrogated keys are always integer or numeric.

What is junk dimension?

A number of very small dimensions might be lumped together to form a single

Dimension, a junk dimension ‐ the attributes are not closely related. Grouping of

Random flags and text attributes in a dimension and moving them to a separate sub dimension is known as junk dimension.

What is Data Mart?

A data mart (DM) is a specialized version of a data warehouse (DW). Like data

Warehouses, data marts contain a snapshot of operational data that helps business

people to strategize based on analyses of past trends and experiences. The key

Difference is that the creation of a data mart is predicated on a specific, predefined

Need for a certain grouping and configuration of select data. A data mart

Configuration emphasizes easy access to relevant information (Reference : Wiki).

Data Marts are designed to help manager make strategic decisions about their

Business.

What is the difference between OLAP and data warehouse?

Datawarehouse is the place where the data is stored for analyzing where as OLAP is

the process of analyzing the data, managing aggregations, partitioning information

into cubes for in depth visualization.

What is a Cube and Linked Cube with reference to data warehouse?

Cubes are logical representation of multidimensional data. The edge of the cube

contains dimension members and the body of the cube contains data values. The

linking in cube ensures that the data in the cubes remain consistent.

What is snapshot with reference to data warehouse?

You can disconnect the report from the catalog to which it is attached by saving the

report with a snapshot of the data.

What is active data warehousing?

An active data warehouse provides information that enables decision‐makers within

an organization to manage customer relationships nimbly, efficiently and

proactively.

What is the difference between data warehousing and business intelligence?

Data warehousing deals with all aspects of managing the development,

Implementation and operation of a data warehouse or data mart including meta

data management, data acquisition, data cleansing, data transformation, storage

management, data distribution, data archiving, operational reporting, analytical

reporting, security management, backup/recovery planning, etc. Business

intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term “business intelligence” is used to encompass OLAP, data visualization, data mining and query/reporting tools. (Reference: Les Barbusinski)

Explain paradigm of Bill Inmon and Ralph Kimball.

Bill Inmon's paradigm: Data warehouse is one part of the overall business

intelligence system. An enterprise has one data warehouse, and data marts source

their information from the data warehouse. In the data warehouse, information is

stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts

within the enterprise. Information is always stored in the dimensional model.

Maximum of Connection pools

The maximum number of connections allowed for this connection pool. The default is 10. This value should be determined by the database make and model and the configuration of the hardware box on which the database runs as well as the number of concurrent users who require access.

NOTE: For deployments with Intelligence Dashboard pages, consider estimating this value at 10% to 20% of the number of simultaneous users multiplied by the number of requests on a dashboard. This number may be adjusted based on usage. The total number of all connections in the repository should be less than 800. To estimate the maximum connections needed for a connection pool dedicated to an initialization block, you might use the number of users concurrently logged on during initialization block execution.

Define pipeline. Did you use it in your projects?
Yes, pipelines are the stages in a particular transaction. Assessment, finance etc.

How do you create filter on repository?
Where condition on content tab.

How do you work in a multi user environment? What are the steps?

Create a shared directory on the network for Multi-user Development (MUD).
Open the RPD to use in MUD. From Tools->Options, setup the MUD directory to point to the above directory.
Define projects within the RPD to allow multiple users to develop within their subject area or Facts.
Save and move the RPD to the shared directory setup in point 1.
When users work in the MUD mode, they open the admin tool and start with
MUD ->Checkout to check out the project they need to work on (not use the File open as you would usually do).
After completely the development, user checking the changes back to the network and merge the changes.

Where are passwords for userid? LDAP, external table authentication stored respectively?
Passwords for userid are in OBIEE server repository LDAP authentication in LDAP server external database in a table in external database.

Can you bypass OBIEE server security? if so how?
Yes you can by-pass by setting authententication type in NQSCONFIG file in the security section as: authentication_type=bypass_nqs.instanceconfig.xml and nqsconfig.ini are the 2 places.

Where can you add new groups and set permissions?
You can add groups by going to manage>security>add new groups> You can give permissions to a group for query limitation and filter conditions.

What are the things you can do in the BMM layer?
Aggregation navigation, level base metrics, time series wizard, create new logical column, complex join.

What is ragged hierarchy and how do you manage it?
Ragged Hierarchy is one of the different kinds of hierarchy.
A hierarchy in which each level has a consistent meaning, but the branches have inconsistent depths because at least one member attributes in a branch level is unpopulated. A ragged hierarchy can represent a geographic hierarchy in which the meaning of each level such as city or country is used consistently, but the depth of the hierarchy varies.
For example, a geographic hierarchy that has Continent, Country, Province/State, and City levels defined. One branch has North America as the Continent, United States as the Country, California as the Province or State, and San Francisco as the City. However, the hierarchy becomes ragged when one member does not have an entry at all of the levels. For example, another branch has Europe as the Continent, Greece as the Country, and Athens as the City, but has no entry for the Province or State level because this level is not applicable to Greece for the business model in this example. In this example, the Greece and United States branches descend to different depths, creating a ragged hierarchy.

What is the difference between Single Logical Table Source and Multiple Logical Table Sources?
If a logical table in BMM layer has only one Table as the source table then it is Single LTS.
If the logical table in BMM layer has more than one table as the sources to it then it is called Multiple LTS.
Ex: Usually Fact table has Multiple LTS’, for which sources will be coming from different Physical tables.

Can you let me know how many aggregate tables you have in your project? On what basis have you created them?
As per resume justification document

How do you bring/relate the aggregate tables into the OBIEE Logical layer?
One way of bringing the Aggregate Tables into the BMM layer is by bringing them as Logical Table sources for the corresponding Fact table.
This is done by dragging and dropping the aggregate table into the corresponding fact table. After doing that establish the column mappings and the set the aggregation levels.

How do you know which report is hitting which table, either the fact table or the aggregate table?
After running the report, go to Administration tab and go to click on Manage Sessions. There you can find the queries that are run and in the View Log option in the Session Management you can find which report is hitting which table.

Suppose I have report which is running for about 3 minutes typically. What is the first step you take to improve the performance of the query?
o Find the sql query of the report in Admin->manage Session-> run the sql query on toad ->read the explain plan output ->modify the SQL based on the explain plan output

Suppose you have a report which has the option of running on aggregate table. How does the tool know to hit the Aggregate table and for that what the steps you follow to configure them?
o Explain the process of Aggregate navigation
Have you heard of Implicit Facts? If, so what are they?
O An implicit fact column is a column that will be added to a query when it contains columns from two or more dimension tables and no measures. You will not see the column in the results. It is used to specify a default join path between dimension tables when there are several possible alternatives.
o For example, there might be many star schemas in the database that have the Campaign dimension and the Customer dimension, such as the following stars:-
Campaign History star. Stores customers targeted in campaign.
Campaign Response star. Stores customer responses to a campaign.
Order star. Stores customers who placed orders as a result of a campaign.
In this example, because Campaign and Customer information might appear in many segmentation catalogs, users selecting to count customers from the targeted campaigns catalog would be expecting to count customers that have been targeted in specific campaigns.
To make sure that the join relationship between Customers and Campaigns is through the campaign history fact table, a campaign history implicit fact needs to be specified in Campaign History segmentation catalog. The following guidelines should be followed in creating
segmentation catalogs:
Each segmentation catalog should be created so that all columns come from only one physical star.
Because the Marketing module user interface has special features that allow users to specify their aggregations, level-based measures typically should not be exposed to segmentation users in a segmentation catalog.

What is aggregate navigation? How do you configure the Aggregate tables in OBIEE?
Aggregate tables store pre-computed results, which are measures that have been aggregated (typically summed) over a set of dimensional attributes. Using aggregate tables is a very popular technique for speeding up query response times in decision support systems.
If you are writing SQL queries or using a tool that only understands what physical tables exist (and not their meaning), taking advantage of aggregate tables and putting them to good use becomes more difficult as the number of aggregate tables increases. The aggregate navigation capability of the OBIEE Server, however, allows queries to use the information stored in aggregate tables automatically, without query authors or query tools having to specify aggregate tables in their queries. The OBIEE Server allows you to concentrate on asking the right business question; the server decides which tables provide the fastest answers.

(Assume you are in BMM layer) We have 4 dimension tables, in that, 2 tables need to have hierarchy, then in such a case is it mandatory to create hierarchies for all the dimension tables?
No, it’s not mandatory to define hierarchies to other Dimension tables.

Can you have multiple data sources in OBIEE?
Yes.

How do you deal with case statement and expressions in OBIEE?
Use expression builder to create case when…then... end statement

Do you know about Initialization Blocks? Can you give me an example where you used them?
Init blocks are used for instantiating a session when a user logs in.
To create dynamic variable you have to create IB to write sql statement.

What is query repository tool?
It is utility of Seibel/OBIEE Admin tool
allows you to examine the repository metadata tool
for example: search for objects based on name, type.
Examine relationship between metadata objects like which column in the presentation layer maps to which table in physical layer.

What is JDK and why do we need it?
Java Development Kit (JDK), a software package that contains the minimal set of tools needed to write, compile, debug, and run Java applets.

Oracle doesn’t recommend Opaque Views because of performance considerations, so why/when do we use them?
An opaque view is a physical layer table that consists of select statement. An opaque view should be used only if there is no other solution.

Can you migrate the presentation layer to a different server?
No we have to migrate the whole web & RPD files

How do you identify what are the dimension tables and how do you decide them during the Business/Data modeling?
Dimension tables contain descriptions that data analysts use as they query the database. For example, the Store table contains store names and addresses; the Product table contains product packaging information; and the Period table contains month, quarter, and year values. Every table contains a primary key that consists of one or more columns; each row in a table is uniquely identified by its primary-key value or values.

What is the full form of RPD?
There is no full form for RPD as such, it is just a repository file (Rapid file Database)

How do i disable cache for only 2 particular tables?

In the physical layer, right click on the table there we will have the option which says cacheable

How do you split a table in the RPD given the condition? (The condition given was Broker and customer in the same table) Split Broker and customer.

We need to make an alias table in the physical layer.

What type of protocol did you use in SAS (Statistical Analysis System)?
TCP/IP

Did you create any new logical column in BMM layer, how?
Yes. We can create new logical column in BMM layer.
Example: Right click on fact table -new logical column-give name for new logical column like Total cost.
Now in fact table source, we have one option column mapping, in that we can do all calculation for that new column.

Can you use physical join in BMM layer?
Yes we can use physical join in BMM layer. When there is SCD type 2 we need complex join in BMM layer.

Can you use outer join in BMM layer?

Yes we can. When we are doing complex join in BMM layer, there is one option type, outer join is there.

What are other ways of improving summary query reports other than Aggregate Navigation and Cache Management
Indexes
Join algorithm
Mat/view query rewrite
Web proper report design it’s optimal by making sure that it is not getting any addition column or rows

What is level-based metrics?
Level-based metrics means, having a measure pinned at a certain level of the dimension. For Example, if you have a measure called Dollars, you can create a Level Based Measure called Yearly Dollars which (you guessed it) is Dollars for a Year. This measure will always return the value for the year even if you drill down to a lower level like quarter, month… etc. To create a level based measure, create a new logical column based on the original measure (like Dollars in the example above). Drag and drop the new logical column to the appropriate level in the Dimension hierarchy (in the above example you will drag and drop it to Year in Time Dim
o A LBM is a metric that is defined for a specific level or intersection of levels.
o Monthly Total Sales or Quarterly Sales are the examples.
o You can compare monthly sales with quarterly sales. You can compare customer orders this quarter to orders this year.

What is logging level? Where can you set logging levels?
You can enable logging level for individual users; you cannot configure a logging level for a group.
Set the logging level based on the amount of logging you want to do. In normal operations, logging is generally disabled (the logging level is set to 0). If you decide to enable logging, choose a logging
level of 1 or 2. These two levels are designed for use by OBIEE Server administrators.

Set Logging Level
1. In the Administration Tool, select Manage > Security.
2. The Security Manager dialog box appears.
3. Double-click the user’s user ID.
4. The User dialog box appears.
5. Set the logging level by clicking the Up or down arrows next to the Logging Level field.

What is variable in Siebel?
You can use variables in a repository to streamline administrative tasks and modify metadata content dynamically to adjust to a changing data environment. The Administration Tool includes a Variable Manager for defining variables.

What is system variable and non-system variable?
System variables
System variables are session variables that the OBIEE Server and OBIEE Web use for specific purposes. System variables have reserved names, which cannot be used for other kinds of variables (such as static or dynamic repository variables, or for non-system session variables).
When using these variables in the Web, preface their names with NQ_SESSION. For example, to filter a column on the value of the variable LOGLEVEL set the filter to the Variable NQ_SESSION.LOGLEVEL.

Non-system variables.
A common use for non-system session variables is setting user filters. For example, you could define a non-system variable called SalesRegion that would be initialized to the name of the user’s sales region. You could then set a security filter for all members of a group that would allow them to see only data pertinent to their region.
o When using these variables in the Web, preface their names with NQ_SESSION. For example, to filter a column on the value of the variable SalesRegion set the filter to the Variable NQ_SESSION.SalesRegion.

What are different types of variables? Explain each.
There are two classes of variables:
1. Repository variables
2. Session variables.
Repository variables.

A repository variable has a single value at any point in time. There are two types of repository variables:
static: This value persists, and does not change until a OBIEE Server administrator decides to change it.
Dynamic: The values are refreshed by data returned from queries. When defining a dynamic repository variable, you will create an initialization block or use a preexisting one that contains a SQL query. You will also set up a schedule that the OBIEE Server will follow to execute the query and periodically refresh the value of the variable.
Session Variables

Session variables are created and assigned a value when each user logs on.

There are two types of session variables:
1.system
2.nonsystem.

What are the cache management? Name all of them and their uses. For Event polling table do u need the table in your physical layer?
Monitoring and managing the cache is cache management.There are three ways to do that.
Disable caching for the system. (INI NQ config file), Cashe persistence time for specified physical tables and Setting event polling table.
Disable caching for the system. (INI NQ config file :
You can disable caching for the whole system by setting the ENABLE parameter to NO in the NQSConfig.INI file and restarting the OBIEE Server. Disabling caching stops all new cache entries and stops any new queries from using the existing cache. Disabling caching allows you to enable it at a later time without losing any entries already stored in the cache.
Cache persistence time for specified physical tables:
You can specify a cacheable attribute for each physical table; that is, if queries involving the specified table can be added to the cache to answer future queries. To enable caching for a particular physical table, select the table in the Physical layer of the Administration Tool and select the option Make table cacheable in the General tab of the Physical Table properties dialog box. You can also use the Cache Persistence Time settings to specify how long the entries for this table should persist in the query cache. This is useful for OLTP data sources and other data sources that are updated frequently, potentially down to every few seconds.
Setting event polling table:

OBIEE Server event polling tables store information about updates in the underlying databases. An application (such as an application that loads data into a data mart) could be configured to add rows to an event polling table each time a database table is updated. The Analytics server polls this table at set intervals and invalidates any cache entries corresponding to the updated tables.
For event polling table, it is a standalone table and doesn’t require to be joined with other tables in the physical layer

What is Authentication? How many types of authentication.
Authentication is the process, by which a system verifies, through the use of a user ID and password, that a user has the necessary permissions and authorizations to log in and access data. The OBIEE Server authenticates each connection request it receives.
Operating system authentication
External table authentication
Database authentication
LDAP authentication.

What is object level security?
There are two types of object level security: Repository level and Web level
Repository level: In presentation layer we can set Repository level security by giving permission or deny permission to users/groups to see particular table or column.
Web level: this provides security for objects stored in the OBIEE web catalog, such as dashboards, dashboards pages, folder, and reports you can only view the objects for which you are authorized. For example, a mid-level manager may not be granted access to a dashboard containing summary information for an entire department.

What is data level security?
This controls the type an amount of data that you can see in a report. When multiple users run the same report the results that are returned to each depend on their access rights and roles in the organization. For example a sales vice president sees results for all regions, while a sales representative for a particular region sees only data for that region.

What is the difference between Data Level Security and Object Level Security?
Data level security controls the type and amount of data that you can see in a reports. Security provides security for objects stored in the OBIEE web catalog, like dashboards, dashboards pages, folder, and reports.

How do you implement security using External Tables and LDAP?
Instead of storing user IDs and passwords in a OBIEE Server repository, you can maintain lists of users and their passwords in an external database table and use this table for authentication purposes. The external database table contains user IDs and passwords, and could contain other information, including group membership and display names used for OBIEE Web users. The table could also contain the names of specific database catalogs or schemas to use for each user when querying data
Instead of storing user IDs and passwords in a OBIEE Server repository, you can have the OBIEE Server pass the user ID and password entered by the user to an LDAP(Lightweight Directory Access Protocol ) server for authentication. The server uses clear text passwords in LDAP authentication. Make sure your LDAP servers are set up to allow this.

If you have 2 fact and you want to do report on one with quarter level and the other with month level how do you do that with just one time dimension?
Using level base metrics.

Did you work on a standalone Siebel system or was it integrated to other platforms?
Deploying the OBIEE platform without other Siebel applications is called OBIEE Stand -Alone .If your deployment includes other OBIEE Application it called integrated analytics -You can say Stand-Alone OBIEE

How to sort columns in RPD and web?
Sorting on web column, sort in the RPD its sort order column

If you want to create new logical column where will you create (in repository or dashboard) why?
I will create new logical column in repository. Because if it is in repository, you can use for any report. If you create new logical column in dashboard then it is going to effect on those reports, which are on that dashboard. You cannot use that new logical column for other dashboard(or request)

What is complex join, and where it is used?
We can join dimension table and fact table in BMM layer using complex join. When there is SCD type 2 we have to use complex join in BMM layer.

If you have dimension table like customer, item, time and fact table like sale and if you want to find out how often a customer comes to store and buys a particular item, what will you do?

write a query as SELECT customer_name, item_name, sale_date, sum(qty) FROM customer_dim a, item_dim b, time_dim c, sale_fact d WHERE d.cust_key = a.cust_key AND d.item_key = b.item_key AND d.time_key = c.time_key GROUP BY customer_name, item_name, sale_date.

You worked on standalone or integrated system?
Standalone.

If you want to limit the users by the certain region to access only certain data, what would you do?
Using data level security.
OBIEE Administrator: go to Manage -> Security in left hand pane u will find the user, groups, LDAP server, Hierarchy
What you can do is select the user and right click and go to properties, you will find two tabs named as users and logon, go to user tab and click at permission button in front of user name you have selected as soon as u click at permission you will get a new window with user group permission having three tabs named as general ,query limits and filter and you can specify your condition at filter tab, in which you can select presentation table ,presentation columns ,logical table and logical columns where you can apply the condition according to your requirement for the selected user or groups.

If there are 100 users accessing data, and you want to know the logging details of all the users, where can you find that?
To set a user’s logging level
1. In the Administration Tool, select Manage > Security.
The Security Manager Dialog box appears.
2. Double-click the user’s user ID. The User dialog box appears.
3. Set the logging level by clicking the Up or Down arrows next to the Logging Level field.

How do implement event polling table?
OBIEE Server event polling tables store information about updates in the underlying databases. An application (such as an application that loads data into a data mart) could be configured to add rows to an event polling table each time a database table is updated. The Analytics server polls this table at set intervals and invalidates any cache entries corresponding to the updated tables.

Wednesday, October 8, 2014

DAC Interviews Questions.

1. Over all architecture of DAC?
DAC server and DAC Client. They must co-locate with Informatica Integration service, repository service and Informatica repository

2. Why we should use DAC and not control all execution through informatica?
For better performance management, such as creating index, dropping index, truncating before load. Without DAC a custom ETL process will be needed, which has to survive the upgrate.

3. Can we run multiple execution plan at the same time in DAC?
Yes. only if the execution plan are not loading into the same table or using the same phyiscal table source

4. Explain DAC export/import
A way to import or export DAC repository metadata for upgrade or backup. Logical, System, runtime objects can be import/export

5. Have you change any of the DAC parameters ? If so which one and why ?
You have to understand what are the DAC parameters and the purpose of each. For example, Initial_extract_date can be modified when configure for initial full load, so the value for initial extract date will be used to filter out records from the source that are older than this date.

6. How do you Determine the Informatica Server Maximum Sessions Parameter Setting in DAC?
One you register informatica server in Dac client

7. Can dac send an email just in case of any failures ?
In DAC Client, toolbar, click email recipient, then in Tools--> DAC Server setup, Email configuration

8. Can you execute the sql scrip through DAC ? If yes how ?

Yes, at task level, in execution type, select SQL file. As a bonus to this answer, this article explains how to run store procedures in DAC.

9. in DAC How you can disable table indexes before loading and enable the index once load is complete ?
Just drop and recreate index

10.Let say you are running the normal incremental load. But just for today you want to extract data from AP_INVOCIES_ALL from 12/12/2011? How you can achieve this ?

Modify the refresh date to be 12/12/2011

11.How DAC Determines the Order of Task Execution within an Execution Plan ?
Based on tasks source/target table, Task phase (extract dim, load fact etc) and 'truncate always' properties, to run them in particular order, create task group

12.What are Micro ETL Execution Plans ? How can you Build and run them ?

According to Oracle document:
Micro ETL execution plans are ETL processes that you schedule at very frequent intervals, such as hourly or half-hourly. They usually handle small subject areas or subsets of larger subject areas. The DAC tracks refresh dates for tables in micro ETL execution plans separately from other execution plans and uses these refresh dates in the change capture process.

in design -- subject areas, create copy of subject area, inactive the unwanted tasks and create new execution plan for this subject area

13.From you past experience – explain scenario where Micro ETL Execution Plans produced wrong results on reports?

According to Oracle Document:
CAUTION: Micro ETL processes can cause issues with data inconsistencies, data availability, and additional load on the transactional database. Therefore, you should consider the following factors before implementing a micro ETL process:

For related star schemas, if one schema is omitted from a micro ETL execution plan, the cross-star reports may be inaccurate. For example, if the Person fact table is refreshed more frequently than the Revenue fact table, a report that spans the Person and Revenue dimensional schemas may produce inconsistent results.
If you omit dimension tables from a micro ETL execution plan, the foreign keys for the fact tables will point to Unspecified rows for the new dimension records. The foreign key references will be resolved when the Complete ETL execution plan is run, but users of the reports should be aware of such inconsistencies.
If you do not include aggregate tables in micro ETL execution plans, the reports that use data from these tables will be inconsistent with the reports that use data from the detailed fact tables. However, if aggregate tables are included in the micro ETL execution plan, the aggregate calculations are performed for each ETL process, which will take a constant amount of time and may be inefficient to perform at such frequent intervals.
Hierarchy tables are rebuilt during every ETL execution plan by querying the base dimension tables. This operation takes a constant amount of time. If the base tables are big, this operation may take a long time and may be inefficient if the micro ETL execution plan runs several times a day. However, if you avoid populating the hierarchy tables during micro ETL processes, data inconsistencies will occur.
With micro ETL execution plans, caching will occur more frequently, which may have performance implications.
Micro ETL execution plans will put more load on the transactional database because of the frequent extracts.

14. Let say you can not use DAC scheduler to schedule you execution plan. What other options do you have ? How you can achieve this ?

Use Informatica scheduler.

15.Does DAC keeps track of refresh dates for all the source/target tables ?

According to Oracle Document:
Refresh dates are tracked only for tables that are either a primary source or a primary target on tasks in a completed run of an execution plan. The DAC runs the full load command for tasks on which a table is a primary source or target if the refresh date against the table is null. When there are multiple primary sources, the earliest of the refresh dates will trigger a full load or an incremental load. If any one of the primary source tables has no refresh date, then the DAC will run the full load command.

16.Consider the scenario as below for task T1
Primary Source has not null last refresh date
Primary Target has null last refresh date
Will task T1 executes in full or incremental ?

Based on answers provided from question 15, what do you think?

18. Using DAC command line – write a script to check weather informatica services are up or not ?
use dacCmdLine InformaticaStatus. Below is the list of all commend lines according to Oracle:

19.Can we have two DAC server on the same machine ?
You can run two DAC servers on the same machine as long as they are listening on different ports and pointing to two different repositories

20.Explain briefly What kind of DAC Repository Objects Held in Source System Containers ?

Subject Areas -- A logical grouping of tables related to a particular subject or application context. It also includes the tasks that are associated with the tables, as well as the tasks required to load the tables. Subject areas are assigned to execution plans, which can be scheduled for full or incremental loads

tables -- Phsyical tables in DB

Indexes -- Just like your physical DB indexes

Tasks -- Unit of work for loading tables

Task groups ---- Grouping of tasks that can be bundled to run as a group

Execution plans -- A data transformation plans defined on subject areas that needs to be transformed at certain frequencies of time

Schedules -- Determine how often execution plan runs.

21.What is Authentication file ? If you have dac client installed can you access DAC repository without Authentication file ?

According to Oracle Document:
When you configure a connection to the DAC Repository, the configuration process includes creating a new authentication file or selecting an existing authentication file. The authentication file authenticates the database in which the repository resides. If you create a new authentication file, you will specify the table owner and password for the database.

22.Explain Index, Table and Task Actions in DAC ?

According to Oracle Document:
Index action: Override the default behavior for dropping and creating indexes

Table action: Override the default behavior for truncating and analyzing tables

Task action: Can add new functionality of task behavior, such as precedinf action, success action, failure action, upon failure restart

23.How DAC Handles Parameters at Runtime ?

According to Oracle Document:
During an ETL execution, DAC reads and evaluates all parameters associated with that ETL run, including static and runtime parameters defined in DAC, parameters held in flat files, and parameters defined externally to DAC. DAC consolidates all the parameters for the ETL run, deduplicates any redundant parameters, and then creates an individual parameter file for each Informatica session. This file contains the evaluated name-value pairs for all parameters, both static and runtime, for each workflow that DAC executes. The parameter file contains a section for each session under a workflow. DAC determines the sessions under a workflow during runtime by using the Informatica pmrep function ListObjectDependencies.

The naming convention for the parameter file is

....txt

DAC writes this file to a location specified in the DAC system property InformaticaParameterFileLocation. The location specified by the property InformaticaParameterFileLocation must be the same as the location specified by the Informatica parameter property $PMSourcefileDir.

24. How DAC Determines Tasks Required for any given subject area ?

According to Oracle Document:
You define a subject area by specifying a fact table or set of fact tables to be the central table or tables in the subject area. When a subject area is defined, DAC performs the following process to determine the relevant tasks:

DAC identifies the dimension tables associated with the facts and adds these tables to the subject area.

DAC identifies the related tables, such as aggregates, associated with the fact or dimension tables and adds them to the subject area definition.

DAC identifies the tasks for which the dimension and fact tables listed in the two processes above are targets tables and adds these tasks into the subject area.

Tasks that DAC automatically assigns to a subject area are indicated with the Autogenerated flag (in the Tasks subtab of the Subject Areas tab).

You can inactivate a task from participating in a subject area by selecting the Inactive check box (in the Tasks subtab of the Subject Areas tab). When the Inactive check box is selected, the task remains inactive even if you reassemble the subject area.

You can also remove a task from a subject area using the Add/Remove command in the Tasks subtab of the subject Areas tab, but when you remove a task it is only removed from the subject area until you reassemble the subject area.

DAC identifies the source tables for the tasks identified in the previous process and adds these tables to the subject area.

DAC performs this process recursively until all necessary tasks have been added to the subject area. A task is added to the subject area only once, even if it is associated with several tables in the subject area. DAC then expands or trims the total number of tasks based on the configuration rules, which are defined as configuration tags. This process can be resource intensive because DAC loads all of the objects in the source system container into memory before parsing.

25.Difference between Homogeneous and Heterogeneous execution plans.

According to Oracle Document:

Homogeneous

This type of execution plan extracts data from multiple instances of the same source system. For example, a business might have an instance of Oracle EBS 11i in one location and time zone and another instance of Oracle EBS 11i in another location and time zone. In such cases, the timing of data extraction from the different instances can be staggered to meet your business requirements.

Heterogeneous

This type of execution plan extracts data from one or more instances of dissimilar source systems. For example, a business might have an instance of Siebel 7.8 in one location, an instance of Oracle EBS 11i in another location, and a second instance of Oracle EBS 11i in yet a third location. You can also stagger the timing of data extraction when you use this type of execution plan.

Thursday, October 9, 2014

Data Warehousing Question

Wednesday, October 8, 2014

DAC Interviews Questions.

ADF functions

DELTALAKE AZURE CLOUD DATABRICS -PYSPARK SNOWFLAKE