Tuesday, September 10, 2024

Azure Data Factory (ADF) Interview questions

Introduction

Azure Data Factory (ADF) is a cloud-based data integration service that enables you to create data-driven workflows for orchestrating and automating data movement and data transformation. If you're preparing for an Azure Data Factory interview, you'll want to be familiar with the core concepts and features of ADF. In this article, we'll cover some of the top Azure Data Factory interview questions and answers to help you prepare.

1. What is Azure Data Factory (ADF)?

Azure Data Factory (ADF) is a cloud-based data integration service that enables you to create data-driven workflows for orchestrating and automating data movement and data transformation. ADF provides a managed service that is continuously monitored and updated, and it provides built-in security features, such as data encryption, identity and access management, and data privacy.

2. What are the core components of Azure Data Factory (ADF)?

The core components of Azure Data Factory (ADF) are:

·       Pipelines: A pipeline is a logical grouping of activities that perform a specific task.

·       Activities: An activity is a unit of work within a pipeline.

·       Datasets: A dataset is a named view of data that is used by activities.

·       Linked services: A linked service is a connection to a data store or a compute service.

·       Triggers: A trigger is a mechanism that starts a pipeline run.

3. What are the different types of activities in Azure Data Factory (ADF)?

The different types of activities in Azure Data Factory (ADF) are:

·       Data movement activities: Data movement activities move data from one location to another.

·       Data transformation activities: Data transformation activities transform data from one format to another.

·       Control activities: Control activities control the flow of a pipeline.

4. What is a linked service in Azure Data Factory (ADF)?

A linked service in Azure Data Factory (ADF) is a connection to a data store or a compute service. Linked services are used to connect to various data stores, such as Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database.

5. What is a dataset in Azure Data Factory (ADF)?

A dataset in Azure Data Factory (ADF) is a named view of data that is used by activities. A dataset represents the input or output of an activity.

6. What is a pipeline in Azure Data Factory (ADF)?

A pipeline in Azure Data Factory (ADF) is a logical grouping of activities that perform a specific task. A pipeline can contain one or more activities, and it can be triggered manually or scheduled to run at a specific time.

7. What is a trigger in Azure Data Factory (ADF)?

A trigger in Azure Data Factory (ADF) is a mechanism that starts a pipeline run. Triggers can be scheduled to run at a specific time, or they can be triggered by an event, such as the arrival of a new file in a data store.

8. What is the difference between a tumbling window and sliding window trigger in Azure Data Factory (ADF)?

A tumbling window trigger in Azure Data Factory (ADF) fires a pipeline run for a series of fixed-size, contiguous, non-overlapping time intervals and retains state, so each interval (for example, each hour) is processed exactly once and missed windows can be re-run. A sliding window overlaps with the previous one, so consecutive runs cover overlapping time ranges; for example, a run every 30 minutes that processes the last 60 minutes of data. Note that ADF natively provides schedule, tumbling window, and event-based triggers, so sliding-window behavior is usually modeled by parameterizing the window passed to the pipeline.
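For illustration, here is a minimal sketch of a tumbling window trigger definition in the JSON format ADF uses; the trigger, pipeline, and parameter names are hypothetical, and the exact properties you need may differ.

{
  "name": "HourlyTumblingTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2024-09-01T00:00:00Z",
      "maxConcurrency": 4
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "CopyHourlySlice", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}

Each run receives the start and end of its own window, which is what makes tumbling windows convenient for slice-by-slice, re-runnable loads.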

9. What is the difference between a self-hosted and an Azure integration runtime in Azure Data Factory (ADF)?

A self-hosted integration runtime is software you install on your own machine (a single node, optionally scaled out to several nodes) or on a VM inside a virtual network, and it is used to reach data stores in private or on-premises networks. The Azure integration runtime is a fully managed, serverless compute that ADF provisions for moving data between cloud data stores and for dispatching activities. A third type, the Azure-SSIS integration runtime, runs SSIS packages in a managed environment and integrates with features such as Azure DevOps.

10. What is the difference between a tumbling window and a sliding window in Azure Data Factory (ADF)?

A tumbling window in Azure Data Factory (ADF) is a fixed-size, contiguous, non-overlapping time window, while a sliding window is an overlapping window: each new window includes part of the time range covered by the previous one.

For example, a tumbling window might move data every hour, while a sliding window might move data every 30 minutes, but also include the previous 30 minutes of data.

11. What is the difference between a dataset and a linked service in Azure Data Factory (ADF)?

A dataset in Azure Data Factory (ADF) is a named view of data that is used by activities, while a linked service is a connection to a data store or a compute service. A dataset represents the input or output of an activity, while a linked service is used to connect to various data stores or compute services.

12. What is the difference between a pipeline and a trigger in Azure Data Factory (ADF)?

A pipeline in Azure Data Factory (ADF) is a logical grouping of activities that perform a specific task, while a trigger is a mechanism that starts a pipeline run. A pipeline can contain one or more activities, and it can be triggered manually or scheduled to run at a specific time, while a trigger starts a pipeline run based on a specific event or schedule.

13. What is the difference between a data flow and a mapping data flow in Azure Data Factory (ADF)?

In Azure Data Factory (ADF), a mapping data flow is one type of data flow: a visually designed transformation that executes on managed Apache Spark clusters. The other type is the wrangling data flow (Power Query activity), which provides a spreadsheet-like Power Query experience for data preparation and also runs on Spark. In other words, "data flow" is the umbrella term, and the mapping data flow is its graph-based, code-free variant.

14. What is the difference between a tumbling window and a tumbling window trigger in Azure Data Factory (ADF)?

A tumbling window in Azure Data Factory (ADF) is the fixed-size, non-overlapping time interval itself, while a tumbling window trigger is the trigger type that fires one pipeline run per such interval, passes the window start and end times to the pipeline, and retains state so that failed or missed windows can be re-run. For example, a tumbling window trigger might start a pipeline run every hour, with each run processing exactly that hour of data.

15. What is the difference between a sliding window and a sliding window trigger in Azure Data Factory (ADF)?

A sliding window in Azure Data Factory (ADF) is a time window that overlaps with the previous one, while a sliding window trigger would be a trigger that starts a pipeline run for each such overlapping window; for example, a run every 30 minutes that also covers the previous 30 minutes of data. ADF does not ship a dedicated sliding window trigger, so this behavior is typically modeled by parameterizing a schedule or tumbling window trigger with a window that is larger than the trigger interval.

 

 

1. What is Azure Data Factory?

Azure Data Factory is a cloud-based, fully managed, serverless ETL and data integration service offered by Microsoft Azure for automating data movement from its native place to, say, a data lake or data warehouse, using extract-transform-load (ETL) or extract-load-transform (ELT). It lets you create and run data pipelines to help move and transform data and run scheduled pipelines.

2. Is Azure Data Factory an ETL or ELT tool?

It is a cloud-based Microsoft tool that provides a data integration service for data analytics at scale and supports both the ETL and ELT paradigms.

3. Why is ADF needed?

With an increasing amount of big data, there is a need for a service like ADF that can orchestrate and operationalize processes to refine the enormous stores of raw business data into actionable business insights. 

4. What sets Azure Data Factory apart from conventional ETL tools?

Azure Data Factory stands out from other ETL tools as it provides: -

·       Enterprise Readiness: Data integration at Cloud Scale for big data analytics!

·       Enterprise Data Readiness: There are 90+ connectors supported to get your data from any disparate sources to the Azure cloud!

·       Code-Free Transformation: UI-driven mapping dataflows.

·       Ability to run Code on Any Azure Compute: Hands-on data transformations

·       Ability to rehost on-prem SSIS workloads on the Azure cloud in 3 steps: existing SSIS packages can be lifted and shifted to run on the Azure cloud.

·       Making DataOps seamless: with Source control, automated deploy & simple templates.

·       Secure Data Integration: Managed virtual networks protect against data exfiltration, which, in turn, simplifies your networking.

Data Factory contains a series of interconnected systems that provide a complete end-to-end platform for data engineers. The snapshot below summarizes this.

Azure Data Factory Introduction

 

5. What are the major components of a Data Factory?

To work with Data Factory effectively, one must be aware of below concepts/components associated with it: -

·       Pipelines: A data factory can contain one or more pipelines, each of which is a logical grouping of activities that together perform a unit of work. For example, a pipeline can read data from Azure Blob storage, transform it according to business logic, and load it into Azure Cosmos DB or Azure Synapse Analytics. This way, one can work with a set of activities as one entity rather than dealing with several tasks individually.

·       Activities: Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data between data stores. Data Factory supports data movement, transformations, and control activities. 

·       Datasets: Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.

·       Linked Service: This is essentially a connection string that holds the information Data Factory needs to connect to various sources. In the case of reading from Azure Blob storage, the storage linked service specifies the connection string to connect to the blob, and the Azure Blob dataset selects the container and folder containing the data.

·       Integration Runtime: The integration runtime is the bridge between an activity and a linked service. It is referenced by the linked service or activity and provides the compute environment where the activity runs or gets dispatched. This way, the activity can be performed in the region closest to the target data stores or compute services in the most performant way while meeting security (no public exposure of data) and compliance needs.

·       Data Flows: These are objects you build visually in Data Factory, which transform data at scale on backend Spark services. You do not need to understand programming or Spark internals. Design your data transformation intent using graphs (Mapping) or spreadsheets (Power query activity).  

Refer to the documentation for more details: https://docs.microsoft.com/en-us/azure/data-factory/frequently-asked-questions

The below snapshot explains the relationship between pipeline, activity, dataset, and linked service.

Azure Data Factory Concepts
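As a rough sketch of how these pieces reference each other, the following three JSON definitions (a linked service, a dataset, and a pipeline with a Copy activity) use hypothetical names; they are illustrative only and omit many optional properties.

{
  "name": "AzureBlobStorageLS",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}

{
  "name": "RawSalesCsv",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "AzureBlobStorageLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "raw", "folderPath": "sales" },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}

{
  "name": "CopySalesPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySalesToSql",
        "type": "Copy",
        "inputs": [ { "referenceName": "RawSalesCsv", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SalesSqlTable", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AzureSqlSink" }
        }
      }
    ]
  }
}

The dataset points at the linked service, and the pipeline's Copy activity points at datasets, which mirrors the relationship shown in the snapshot.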

 

6. What are the different ways to execute pipelines in Azure Data Factory?

There are three ways in which we can execute a pipeline in Data Factory:

·       Debug mode can be helpful when trying out pipeline code and acts as a tool to test and troubleshoot our code.

·       Manual Execution is what we do by clicking on the ‘Trigger now’ option in a pipeline. This is useful if you want to run your pipelines on an ad-hoc basis.

·       We can schedule our pipelines at predefined times and intervals via a Trigger. As we will see later in this article, there are three types of triggers available in Data Factory. 

 

7. What is the purpose of Linked services in Azure Data Factory?

Linked services are used majorly for two purposes in Data Factory:

1.       For a data store representation, i.e., any storage system like an Azure Blob storage account, a file share, or an Oracle DB/SQL Server instance.

2.       For a compute representation, i.e., the underlying compute (VM or cluster) that will execute the activity defined in the pipeline.

8. Can you Elaborate more on Data Factory Integration Runtime?

The Integration Runtime, or IR, is the compute infrastructure for Azure Data Factory pipelines. It is the bridge between activities and linked services. The linked Service or Activity references it and provides the computing environment where the activity is run directly or dispatched. This allows the activity to be performed in the closest region to the target data stores or computing Services.

The following diagram shows the location settings for Data Factory and its integration runtimes:

Azure Data Factory Integration Runtime

Source: https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime  

Azure Data Factory supports three types of integration runtime, and one should choose based on their data integration capabilities and network environment requirements.

1.       Azure Integration Runtime: Used to copy data between cloud data stores and to dispatch activities to various computing services such as Azure HDInsight, SQL databases, etc.

2.       Self-Hosted Integration Runtime: Used for running copy activity between cloud data stores and data stores in private networks. The self-hosted integration runtime is software with the same code as the Azure Integration Runtime, but it is installed on a machine you manage, either on-premises or on a VM inside a virtual network.

3.       Azure SSIS Integration Runtime: You can run SSIS packages in a managed environment. So, when we lift and shift SSIS packages to the data factory, we use Azure SSIS Integration Runtime. 

9. What is required to execute an SSIS package in Data Factory?

We must create an SSIS integration runtime and an SSISDB catalog hosted in an Azure SQL Database server or an Azure SQL Managed Instance before executing an SSIS package.

10. What is the limit on the number of Integration Runtimes, if any?

Within a Data Factory, the default limit on any entities is set to 5000, including pipelines, data sets, triggers, linked services, Private Endpoints, and integration runtimes. If required, one can create an online support ticket to raise the limit to a higher number.

11.  What are ARM Templates in Azure Data Factory? What are they used for?

An ARM template is a JSON (JavaScript Object Notation) file that defines the infrastructure and configuration for the data factory pipeline, including pipeline activities, linked services, datasets, etc. The template will contain essentially the same code as our pipeline.

ARM templates are helpful when we want to migrate our pipeline code to higher environments, say Production or Staging from Development, after we are convinced that the code is working correctly.
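For orientation, a heavily trimmed ARM template that deploys a single pipeline might look like the sketch below; the factory and pipeline names are placeholders, and a template exported from ADF will contain far more detail.

{
  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": { "type": "string" }
  },
  "resources": [
    {
      "type": "Microsoft.DataFactory/factories/pipelines",
      "apiVersion": "2018-06-01",
      "name": "[concat(parameters('factoryName'), '/CopySalesPipeline')]",
      "properties": {
        "activities": [
          { "name": "WaitBriefly", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 5 } }
        ]
      }
    }
  ]
}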

 

12.  How can we deploy code to higher environments in Data Factory?

At a very high level, we can achieve this with the below set of steps:

·       Create a feature branch that will store our code base.

·       Create a pull request to merge the code into the Dev branch once we are confident in the changes.

·       Publish the code from the Dev branch to generate the ARM templates.

·       This can trigger an automated CI/CD DevOps pipeline to promote code to higher environments like Staging or Production.

13.  Which three activities can you run in Microsoft Azure Data Factory?

Azure Data Factory supports three activities: data movement, transformation, and control activities.

·       Data movement activities: As the name suggests, these activities help move data from one place to another.
e.g., Copy Activity in Data Factory copies data from a source to a sink data store.

·       Data transformation activities: These activities help transform the data as it is loaded into the target or destination data store.
e.g., Stored Procedure, U-SQL, Azure Functions, etc.

·       Control flow activities: Control (flow) activities help control the flow of any activity in a pipeline. 

e.g., wait activity makes the pipeline wait for a specified time. 

14.  What are the two types of compute environments supported by Data Factory to execute the transform activities?

Below are the types of computing environments that Data Factory supports for executing transformation activities: -

i) On-Demand Computing Environment: This is a fully managed environment provided by ADF. This type of compute creates a cluster to perform the transformation activity and automatically deletes it when the activity is complete.

ii) Bring Your Own Environment: In this case, you register a computing environment that you already manage (on-premises or in the cloud) with ADF, and ADF uses it to execute the transformation activities.

15.  What are the steps involved in an ETL process?

The ETL (Extract, Transform, Load) process follows four main steps:

i) Connect and Collect: Connect to the data source/s and move the data to a centralized location (cloud or local staging storage) for subsequent processing.

ii) Transform: Transform the data using computing services such as HDInsight, Hadoop, Spark, etc.

iii) Publish: To load data into Azure data lake storage, Azure SQL data warehouse, Azure SQL databases, Azure Cosmos DB, etc.

iv) Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.

16.  If you want to use the output by executing a query, which activity shall you use? 

Look-up activity can return the result of executing a query or stored procedure.

The output can be a singleton value or an array of attributes, which can be consumed in subsequent copy data activity, or any transformation or control flow activity like ForEach activity.
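A minimal sketch of that pattern, with hypothetical dataset, table, and activity names, could look like this; a real pipeline would replace the inner Wait placeholder with a Copy or Execute Pipeline activity that consumes @item().

{
  "name": "CopyTablesFromConfig",
  "properties": {
    "activities": [
      {
        "name": "LookupTableList",
        "type": "Lookup",
        "typeProperties": {
          "source": { "type": "AzureSqlSource", "sqlReaderQuery": "SELECT SchemaName, TableName FROM dbo.TablesToCopy" },
          "dataset": { "referenceName": "ConfigDataset", "type": "DatasetReference" },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "dependsOn": [ { "activity": "LookupTableList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('LookupTableList').output.value", "type": "Expression" },
          "activities": [
            { "name": "PerTablePlaceholder", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
          ]
        }
      }
    ]
  }
}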

 

17.  Can we pass parameters to a pipeline run? 

Yes, parameters are a first-class, top-level concept in Data Factory. We can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or using a trigger. 
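As a small, assumed sketch, the pipeline below declares a parameter and consumes it through an expression; the parameter and variable names are made up.

{
  "name": "ParameterizedPipeline",
  "properties": {
    "parameters": {
      "sourceFolder": { "type": "String", "defaultValue": "incoming" }
    },
    "variables": {
      "resolvedPath": { "type": "String" }
    },
    "activities": [
      {
        "name": "ResolvePath",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "resolvedPath",
          "value": { "value": "@concat('raw/', pipeline().parameters.sourceFolder)", "type": "Expression" }
        }
      }
    ]
  }
}

When the pipeline is run on demand or from a trigger, an argument such as { "sourceFolder": "sales" } can be supplied to override the default.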

18.  Have you used Execute Notebook activity in Data Factory? How to pass parameters to a notebook activity?

We can use the Notebook activity to run code on our Azure Databricks cluster. We can pass parameters to a notebook activity using the baseParameters property. If a parameter is not defined/specified in the activity, the default value from the notebook is used.
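A hedged sketch of a Databricks Notebook activity passing baseParameters (the notebook path, linked service, and parameter names are hypothetical):

{
  "name": "RunTransformationNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": { "referenceName": "AzureDatabricksLS", "type": "LinkedServiceReference" },
  "typeProperties": {
    "notebookPath": "/Shared/transform_sales",
    "baseParameters": {
      "inputFolder": "@pipeline().parameters.sourceFolder",
      "runDate": "@formatDateTime(utcNow(), 'yyyy-MM-dd')"
    }
  }
}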

19.  What are some useful constructs available in Data Factory?

·       parameter: Each activity within the pipeline can consume the parameter value passed to the pipeline and run with the @parameter construct.

·       coalesce: We can use the @coalesce construct in the expressions to handle null values gracefully.

·       activity: An activity output can be consumed in a subsequent activity with the @activity construct (a combined example is sketched below).
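The following assumed example combines all three constructs in one Set Variable activity (the activity, parameter, and variable names are hypothetical):

{
  "name": "SetEffectivePath",
  "type": "SetVariable",
  "dependsOn": [ { "activity": "LookupConfig", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "variableName": "effectivePath",
    "value": {
      "value": "@coalesce(activity('LookupConfig').output.firstRow.folderPath, pipeline().parameters.defaultFolder)",
      "type": "Expression"
    }
  }
}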

20.  Can we push code and have CI/CD (Continuous Integration and Continuous Delivery) in ADF?

Data Factory fully supports CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to develop and deliver your ETL processes incrementally before publishing the finished product. After the raw data has been refined into a business-ready, consumable form, load the data into Azure Synapse Analytics, Azure SQL Database, Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business intelligence tools can point to.

 

21.  What do you mean by variables in the Azure Data Factory?

Variables in the Azure Data Factory pipeline provide the functionality to hold the values. They are used for a similar reason as we use variables in any programming language and are available inside the pipeline.

The Set Variable and Append Variable activities are used to set or manipulate the values of the variables (a minimal JSON sketch follows the list below). There are two types of variables in a data factory:

i) System variables: These are fixed variables from the Azure pipeline run. For example, the pipeline name, pipeline ID, trigger name, etc. You need these to get the system information required in your use case.

ii) User variable: A user variable is declared manually in your code based on your pipeline logic.
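A minimal sketch of the two variable activities, using made-up variable names and a couple of system variables:

{
  "name": "VariablesDemoPipeline",
  "properties": {
    "variables": {
      "runLabel": { "type": "String" },
      "processedFiles": { "type": "Array" }
    },
    "activities": [
      {
        "name": "SetRunLabel",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "runLabel",
          "value": { "value": "@concat(pipeline().Pipeline, '-', pipeline().RunId)", "type": "Expression" }
        }
      },
      {
        "name": "RecordFile",
        "type": "AppendVariable",
        "dependsOn": [ { "activity": "SetRunLabel", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "variableName": "processedFiles",
          "value": "sales_2024_09.csv"
        }
      }
    ]
  }
}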

22.  What are mapping data flows?

Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop a graphical data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using Azure Data Factory scheduling, control flow, and monitoring capabilities.

Mapping data flows provides an entirely visual experience with no coding required. Data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory manages all the code translation, path optimization, and execution of the data flow jobs.

23.  What is copy activity in the Azure Data Factory?

Copy activity is one of the most popular and universally used activities in the Azure data factory. It is used for ETL or Lift and Shift, where you want to move the data from one data source to another. While you copy the data, you can also do the transformation; for example, you read the data from the TXT/CSV file, which contains 12 columns; however, while writing to your target data source, you want to keep only seven columns. You can transform it and send only the required columns to the destination data source.

24.  Can you elaborate more on the Copy activity?

The copy activity performs the following steps at high-level:

i) Read data from the source data store. (e.g., blob storage)

ii) Perform the following tasks on the data:

·       Serialization/deserialization

·       Compression/decompression

·       Column mapping

iii) Write data to the destination data store or sink. (e.g., Azure Data Lake Storage)

This is summarized in the below graphic:

Azure Data Factory: Copy Activity
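To make the column-reduction idea from question 23 concrete, a sketch of a Copy activity that maps only a few source columns to the sink might look like this (dataset, column, and activity names are assumptions):

{
  "name": "CopyCsvToSql",
  "type": "Copy",
  "inputs": [ { "referenceName": "RawSalesCsv", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SalesSqlTable", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "AzureSqlSink" },
    "translator": {
      "type": "TabularTranslator",
      "mappings": [
        { "source": { "name": "OrderId" }, "sink": { "name": "order_id" } },
        { "source": { "name": "OrderDate" }, "sink": { "name": "order_date" } },
        { "source": { "name": "Amount" }, "sink": { "name": "amount" } }
      ]
    }
  }
}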

 

 

25.  What are the different activities you have used in Azure Data Factory?

Here you can share some of the significant activities you have used in your career, whether in your work or a college project. Here are a few of the most used activities:

1.       Copy Data Activity to copy the data between datasets.

2.       ForEach Activity for looping.

3.       Get Metadata Activity that can provide metadata about any data source.

4.       Set Variable Activity to define and initiate variables within pipelines.

5.       Lookup Activity to do a lookup to get some values from a table/file.

6.       Wait Activity to wait for a specified amount of time before/in between the pipeline run.

7.       Validation Activity will validate the presence of files within the dataset.

8.       Web Activity to call a custom REST endpoint from an ADF pipeline.

26.  How can I schedule a pipeline?

You can use the time window or scheduler trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).

 Currently, the service supports three types of triggers:

·       Tumbling window trigger: A trigger that operates on a periodic interval while retaining a state.

·       Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.

·       Event-Based Trigger: A trigger that responds to an event. e.g., a file getting placed inside a blob.
Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger). Multiple triggers can kick off a single pipeline, or a single trigger can kick off numerous pipelines.
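As a rough sketch, a schedule trigger that runs a pipeline every Monday and Thursday at 6:00 PM could be defined like this (the names are hypothetical, and the recurrence can be tuned per day and hour as needed):

{
  "name": "WeeklyEveningTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Week",
        "interval": 1,
        "startTime": "2024-09-01T00:00:00Z",
        "timeZone": "UTC",
        "schedule": {
          "weekDays": [ "Monday", "Thursday" ],
          "hours": [ 18 ],
          "minutes": [ 0 ]
        }
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopySalesPipeline", "type": "PipelineReference" } }
    ]
  }
}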

27.   When should you choose Azure Data Factory?

One should consider using Data Factory-

·       When working with big data and a data warehouse needs to be implemented, you might require a cloud-based integration solution like ADF for the same.

·       Not all team members are experienced in coding and may prefer graphical tools to work with data.

·       When raw business data is stored at diverse data sources, which can be on-prem and on the cloud, we would like to have one analytics solution like ADF to integrate them all in one place.

·       We would like to use readily available data movement and processing solutions and be light regarding infrastructure management. So, a managed solution like ADF makes more sense in this case.

28.  How can you access data using the other 90 dataset types in Data Factory?

The mapping data flow feature natively supports Azure SQL Database, Azure Synapse Analytics, delimited text files from an Azure storage account or Azure Data Lake Storage Gen2, and Parquet files from blob storage or Data Lake Storage Gen2 as source and sink data sources.

Use the Copy activity to stage data from any other connectors and then execute a Data Flow activity to transform data after it's been staged.  

 

29.  What is the difference between mapping and wrangling data flow (Power query activity)? 

Mapping data flows transform data at scale without requiring coding. You can design a data transformation job in the data flow canvas by constructing a series of transformations. Start with any number of source transformations followed by data transformation steps. Complete your data flow with a sink to land your results in a destination. It is excellent at mapping and transforming data with known and unknown schemas in the sinks and sources.

Power Query Data Wrangling allows you to do agile data preparation and exploration using the Power Query Online mashup editor at scale via spark execution.

It supports 24 SQL data types, from char and nchar to int, bigint, timestamp, xml, etc.

Azure Data Factory Interview Questions For 4 Years Experience

If you’re a professional with 4 years of experience in Azure Data Factory, check out the list of these common ADF interview questions that you may encounter during your job interview.

30.   Can a value be calculated for a new column from the existing column from mapping in ADF?

Yes. We can use the derived column transformation in the mapping data flow to generate a new column based on our desired logic, or to update an existing column. Enter the name of the column you're creating in the Column textbox.

You can use the column dropdown to override an existing column in your schema. Click the Enter expression textbox to start creating the derived column’s expression. You can input or use the expression builder to build your logic.

31.  How is the lookup activity useful in the Azure Data Factory?

In the ADF pipeline, the Lookup activity is commonly used for configuration lookups, since the source dataset is available there. It retrieves the data from the source dataset and sends it as the activity output. Generally, the output of the lookup activity is used further in the pipeline for making decisions or passing configuration values to subsequent activities.

Simply put, lookup activity is used for data fetching in the ADF pipeline. How you use it entirely depends on your pipeline logic. You can retrieve only the first row, or all the rows, depending on your dataset or query.

32.  Elaborate more on the Get Metadata activity in Azure Data Factory.

The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data Factory or a Synapse pipeline. We can use the output from the Get Metadata activity in conditional expressions to perform validation or consume the metadata in subsequent activities.

It takes a dataset as input and returns metadata information as output. Currently, the following connectors and the corresponding retrievable metadata are supported. The maximum size of returned metadata is 4 MB.

Please refer to the snapshot below for supported metadata which can be retrieved using the Get Metadata activity.

Get Metadata Activity in Azure Data Factory

33.  How to debug an ADF pipeline?

Debugging is one of the crucial aspects of any coding-related activity, needed to test the code for any issues it might have. Azure Data Factory provides a Debug option that runs the pipeline from the authoring canvas without publishing the changes or attaching a trigger, so you can test and troubleshoot the pipeline end to end or only up to a chosen activity by using a breakpoint.

Azure Data Factory Interview Questions For 5 Years Experience

Here are some of the most likely asked Azure Data Factory interview questions for professionals with 5 years of experience to help you prepare for your next job interview and feel confident in showcasing your expertise.

34. What is meant by a breakpoint in the ADF pipeline?

To understand better, for example, you are using three activities in the pipeline, and now you want to debug up to the second activity only. You can do this by placing the breakpoint at the second activity. To add a breakpoint, click the circle present at the top of the activity.

35.  What is the use of the ADF Service?

ADF primarily orchestrates data copying between relational and non-relational data sources hosted on-premises in data centers or in the cloud. Moreover, you can use the ADF service to transform the ingested data to fulfill business requirements. In most Big Data solutions, the ADF service is used as an ETL or ELT tool for data ingestion.

36.  Explain the data source in the Azure data factory.

The data source is the source or destination system that comprises the data intended to be utilized or executed. The data type can be binary, text, CSV, JSON, image files, video, audio, or a proper database.

Examples of data sources include Azure Data Lake Storage, Azure Blob Storage, or any other database such as MySQL, Azure SQL Database, PostgreSQL, etc.

37.  Can you share any difficulties you faced while getting data from on-premises to Azure cloud using Data Factory?

One of the significant challenges we face while migrating from on-prem to the cloud is throughput and speed. When we try to copy the data using Copy activity from on-prem, the transfer can be slow, and hence we may not get the desired throughput.

There are some configuration options for a copy activity, which can help in tuning this process and can give desired results.

i) We should use the compression option to get the data in a compressed mode while loading from on-prem servers, which is then de-compressed while writing on the cloud storage.

ii) A staging area should be the first destination of our data after we have enabled compression. The copy activity can then decompress the data before writing it to the final cloud storage.

iii) Degree of Copy Parallelism is another option to help improve the migration process. This is identical to having multiple threads processing data and can speed up the data copy process.

There is no one-size-fits-all value here, so we must try different numbers like 8, 16, or 32 to see which performs best.

iv) Data Integration Units (DIUs) loosely represent the amount of compute (a combination of CPU, memory, and network) used, and increasing them may improve the performance of the copy process (a combined sketch of these copy settings follows below).
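A hedged sketch of how these knobs appear on a Copy activity (the linked service and type names are assumptions, and the input/output dataset references are omitted for brevity):

{
  "name": "CopyOnPremToLake",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "SqlServerSource" },
    "sink": { "type": "ParquetSink" },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16,
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": { "referenceName": "StagingBlobLS", "type": "LinkedServiceReference" },
      "path": "staging",
      "enableCompression": true
    }
  }
}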

38.  How to copy multiple sheet data from an Excel file?

When using an Excel connector within a data factory, we must provide a sheet name from which we must load data. This approach is manageable when we have to deal with a single sheet or a handful of sheets, but when we have lots of sheets (say 10+), it may become a tedious task as we have to change the hard-coded sheet name every time!

However, we can use a data factory binary format connector for this, point it to the Excel file, and skip providing the sheet name/s altogether. We'll be able to use copy activity to copy the data from all the sheets present in the file.

39.  Is it possible to have nested looping in Azure Data Factory?

There is no direct support for nested looping in the data factory for any looping activity (ForEach / Until). However, a ForEach/Until loop activity can contain an Execute Pipeline activity, and the invoked pipeline can itself have a loop activity. This way, when we call the looping activity, it indirectly calls another loop activity, and we are able to achieve nested looping.
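A minimal sketch of the workaround, with hypothetical pipeline and parameter names, where the outer ForEach calls an inner pipeline that can contain its own loop:

{
  "name": "OuterLoopPipeline",
  "properties": {
    "parameters": { "folders": { "type": "Array" } },
    "activities": [
      {
        "name": "ForEachFolder",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.folders", "type": "Expression" },
          "activities": [
            {
              "name": "RunInnerLoop",
              "type": "ExecutePipeline",
              "typeProperties": {
                "pipeline": { "referenceName": "InnerLoopPipeline", "type": "PipelineReference" },
                "waitOnCompletion": true,
                "parameters": { "folder": "@item()" }
              }
            }
          ]
        }
      }
    ]
  }
}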

 

 

40.  How to copy multiple tables from one datastore to another datastore?

An efficient approach to complete this task would be:

·       Maintain a lookup table/ file containing the list of tables and their source, which needs to be copied.

·       Then, we can use the Lookup activity and a ForEach loop activity to scan through the list.

·       Inside the for each loop activity, we can use a copy activity or a mapping dataflow to copy multiple tables to the destination datastore.

41.  What are some performance-tuning techniques for Mapping Data Flow activity?

We could consider the below set of parameters for tuning the performance of a Mapping Data Flow activity we have in a pipeline.

i) We should leverage partitioning in the source, sink, or transformation whenever possible. Microsoft, however, recommends using the default partition (size 128 MB) selected by the Data Factory as it intelligently chooses one based on our pipeline configuration.

Still, one should try out different partitions and see if they can have improved performance.

ii) We should not invoke a data flow activity once per file inside a ForEach loop. Instead, if we have multiple files that are similar in structure and processing needs, we should use a wildcard path inside a single data flow activity, enabling the processing of all the files within a folder.

iii) The recommended file format to use is Parquet (.parquet). The reason is that the pipeline will execute by spinning up Spark clusters, and Parquet is a native file format for Apache Spark; thus, it will generally give good performance.

iv) Multiple logging modes are available: Basic, Verbose, and None.

We should only use verbose mode if essential, as it will log all the details about each operation the activity performs. e.g., It will log all the details of the operations performed for all our partitions. This one is useful when troubleshooting issues with the data flow.

The basic mode will give out all the necessary basic details in the log, so try to use this one whenever possible.

v)  Try to break down a complex data flow activity into multiple data flow activities. Let’s say we have several transformations between source and sink, and by adding more, we think the design has become complex. In this case, try to have it in multiple such activities, which will give two advantages:

·       All activities will run on separate spark clusters, decreasing the run time for the whole task.

·       The whole pipeline will be easy to understand and maintain in the future. 

42.  What are some of the limitations of ADF?

Azure Data Factory provides great functionalities for data movement and transformations. However, there are some limitations as well.

i) We can't have nested iteration or conditional activities in the data factory, and we must use a workaround (such as the Execute Pipeline pattern described earlier) if we have that sort of structure in our pipeline. This applies to the If Condition, ForEach, Switch, and Until activities.

ii) The lookup activity can retrieve only 5000 rows at a time and not more than that. To go beyond this, we need to paginate, combining another loop activity with a SQL query that limits and offsets the rows returned, to achieve this sort of structure in the pipeline.

iii) We can have at most 40 activities in a single pipeline, including inner activities, containers, etc. To overcome this, we should modularize the pipelines regarding the number of datasets, activities, and so on.

44. How are all the components of Azure Data Factory combined to complete an ADF task?

The below diagram depicts how all these components can be clubbed together to fulfill Azure Data Factory ADF tasks.

Azure Data Factory Tasks

Source: https://docs.microsoft.com/en-us/learn/modules/intro-to-azure-data-factory/3-how-azure-data-factory-works                 

45. How do you send email notifications on pipeline failure?

There are multiple ways to do this:

1.       Using Logic Apps with Web/Webhook activity (see the sketch after this list).
Configure a logic app that, upon receiving an HTTP request, sends an email to the required set of people on failure. In the pipeline, configure the failure path to call the URL generated by the logic app.

2.       Using Alerts and Metrics from pipeline options.
We can set up this from the pipeline itself, where we get numerous options for email on any activity failure within the pipeline.
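For the Logic App approach, a sketch of a Web activity wired to run only when a previous activity fails might look like the following; the URL is a placeholder for the HTTP endpoint generated by your logic app, and the activity names are made up. Dynamic details such as the pipeline name, run ID, or error message can be added to the body with expressions.

{
  "name": "NotifyFailureByEmail",
  "type": "WebActivity",
  "dependsOn": [ { "activity": "CopySalesToSql", "dependencyConditions": [ "Failed" ] } ],
  "typeProperties": {
    "url": "https://<your-logic-app-endpoint>",
    "method": "POST",
    "body": { "message": "ADF pipeline failed" }
  }
}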

46. Can we integrate Data Factory with Machine learning data?

Yes, an ADF pipeline can invoke Azure Machine Learning to train and retrain models on the data coming from the pipelines, and the resulting model can then be published as a web service.

47. What is an Azure SQL database? Can you integrate it with Data Factory?

Part of the Azure SQL family, Azure SQL Database is an always up-to-date, fully managed relational database service built for the cloud for storing data. Using the Azure data factory, we can easily design data pipelines to read and write to SQL DB.

48. Can you host SQL Server instances on Azure?

Yes. Azure SQL Managed Instance, part of the Azure SQL family, is the intelligent, scalable cloud database service that combines the broadest SQL Server database engine compatibility with all the benefits of a fully managed and evergreen platform as a service.

50. What is Azure Data Lake Analytics?

Azure Data Lake Analytics is an on-demand analytics job service that simplifies storing and processing big data.

 

51. What are the various types of loops in ADF?

Loops in Azure Data Factory are used to iterate over a collection of items to perform a specific action repeatedly. ADF provides two looping activities, and a third, while-style pattern is commonly emulated with them:

·       For Each Loop: This loop is used to iterate over a collection of items and perform a specific action for each item in the collection. For example, if you have a list of files in a folder and want to copy each file to another location, you can use a For Each loop to iterate over the list of files and copy each file to the target location.

·       Until Loop: This loop repeats a set of activities until a specific condition is met. For example, you could use an Until loop to keep retrying an operation until it succeeds or until a certain number of attempts have been made.

·       While-style Loop: ADF has no dedicated While activity, but repeating an action while a condition is true can be modeled with an Until loop whose exit condition is the negation of the while condition.

52. Can you list all the activities that can be performed in ADF?

Here are some of the key activities that can be performed in ADF:

·       Data Ingestion

·       Data Transformation

·       Data Integration 

·       Data Migration

·       Data Orchestration 

·       Data Enrichment 

 

 

 

1. In the context of Azure Data Factory, what does the term "variables" mean?

Variables in an Azure Data Factory pipeline are used to hold values during a pipeline run. They can be accessed within the pipeline in the same way variables are used in any programming language, because they are likewise available there.

Setting or changing the values of variables is accomplished through the Set Variable and Append Variable activities, respectively. A data factory has two kinds of variables:

1.       System variables: these are the fixed variables exposed by the Azure pipeline run, such as the pipeline name, the pipeline ID, and the name of the trigger that started the run. You need these to access the system information that might be relevant to the use case you are working on.

2.       User-defined variables: the second kind of variable is defined explicitly in your pipeline and is driven by the pipeline logic.

2. What is a "mapping data flow"?

Mapping data flows are visually designed data transformations in Azure Data Factory. With data flows, data engineers can construct logic for altering data without having to write any code at all. Once created, data flows are executed as activities within Azure Data Factory pipelines on scaled-out Apache Spark clusters, and they can be operationalized using the scheduling, control flow, and monitoring elements that are already available in Azure Data Factory.

The mapping data flow experience is highly visual and does away with the need for any form of scripting. The execution clusters in which the data flows run are managed by ADF, which enables the data to be processed in a massively parallel manner. Azure Data Factory is responsible for all of the coding tasks, including the interpretation of code, the optimization of pathways, and the execution of the data flow operations.

3. In the context of the Azure Data Factory, just what does it mean when it's referred to as "copy activity"?

The copy activity is one of the most extensively used and generally popular operations in the Azure data factory. It is used for ETL and for "lift and shift" scenarios in which it is necessary to copy information from one data store to another. You can also make modifications to the data as you copy it. For instance, before you transmit it to your target data source, you can decide to lower the number of columns in the source txt/csv file from 12 to 7, so that the target receives only the required columns after the transfer.

4. Could you explain to me how I should go about planning a pipeline?

You can set up a pipeline's schedule by utilizing either the time window trigger or the scheduler trigger. Pipelines can be programmed to run periodically or in calendar-based recurrent patterns according to the wall-clock schedule of the trigger (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).

There are now three different kinds of triggers that can be utilized with the service, and they are as follows:

1.       Tumbling window trigger: a state-preserving trigger that fires on a periodic interval.

2.       Schedule trigger: a time-based trigger that invokes a specified pipeline at predetermined times on a wall-clock schedule.

3.       Event-based trigger: a trigger that responds to an occurrence, such as a file being copied into a blob container. Pipelines and triggers are mapped to one another in a many-to-many way (except for the tumbling window trigger): a single trigger can launch many pipelines, and numerous triggers can initiate a single pipeline.

5. In which situations does Azure Data Factory seem the best option?

Consider using Data Factory in the following situations:

1.       When dealing with massive amounts of data and a data warehouse needs to be created, it is likely necessary to implement a cloud-based integration solution such as ADF.

2.       Not everyone in the team is a coder, and some members may discover that graphical interfaces make it simpler to analyze and manipulate data.

3.       When raw business data is located in many places, both on-premises and in the cloud, we need a unified analytics solution such as ADF to analyze it all in one place.

4.       We would like to minimize the management of our infrastructure to a minimum by making use of methods that are widely utilized for the transportation of data and the processing of it. Because of this, going with a managed solution such as ADF is the choice that makes the most sense.

6. Do you have any tips on how to access the information you require by taking advantage of the other ninety dataset types that are accessible in the Data Factory?

Mapping data flows natively support data from a range of Azure services, including Azure SQL Database, Azure Synapse Analytics, delimited text files stored in an Azure storage account or Azure Data Lake Storage Gen2, and Parquet files stored in blob storage or Data Lake Storage Gen2, as both sources and sinks. In order to transform data coming from any of the other connectors, first stage the data using the Copy activity, and then run an activity in the Data Flow category on the staged data.

7. Can the value of a new column be derived from an existing column in a mapping data flow?

Yes. Using the derived column transformation within the mapping data flow, we can generate a new column from the logic that we specify. When developing a derived column, we have the option of creating a brand-new one from scratch or making changes to an existing one. You give the new column a name in the textbox labelled Column.

Use the column dropdown to override an existing column in your schema. Click the Enter expression textbox to start crafting the derived column's expression; you can either enter your logic manually or make use of the expression builder to construct it.

8. Where can I find more information on the benefits of using lookup operations in the Azure Data Factory?

In the ADF pipeline, the Lookup activity is typically utilized for configuration lookups, since the source dataset is readily available. The activity retrieves data from the source dataset and returns it as its output, and in most cases the outcome of a lookup operation is sent back down the pipeline to be used as input for later phases.

In other words, the ADF pipeline makes heavy use of lookup operations to retrieve data, and you use the result in whatever manner is appropriate for your pipeline logic. You have the option of retrieving either the first row only or all of the rows, depending on the dataset or query you choose.

9. Please provide any more information that you have on the Azure Data Factory Get Metadata operation.

The Get Metadata operation can be used to access the metadata associated with any piece of data that is contained within an Azure Data Factory or a Synapse pipeline. We can use the results from the Get Metadata activity in conditional expressions to validate or utilize the metadata in subsequent actions. This can be done by using the Get Metadata activity.

It takes a dataset as an input and then generates metadata describing that dataset as an output. The supported connectors, and the metadata that may be retrieved for each one, are outlined in the product documentation. Metadata returns of up to 4 MB in size are accepted.

10. Where did you experience the most difficulty while attempting to migrate data from on-premises to the Azure cloud via Data Factory?

Within the context of our ongoing transition from on-premises to cloud storage, the problems of throughput and speed have emerged as important obstacles. When we attempt to replicate the data from on-premises using the Copy activity, we do not always achieve the throughput that we require.

The configuration variables that are available for a copy activity make it possible to fine-tune the process and achieve the desired results.

1.       If we load data from on-premises servers, we should first compress it using the available compression option; the data is then decompressed when it is written to cloud storage.

2.       After compression has been enabled, our data should be moved quickly to a staging area; the copy activity can decompress the staged data before writing it to the target cloud storage buckets.

3.       Degree of copy parallelism is another option that can smooth the transfer process. It is the equivalent of employing several threads to process the data and can speed up the rate at which data is copied.

4.       Because there is no one-size-fits-all value, we will need to try a variety of settings, such as 8, 16, and 32, to see which performs best.

5.       It may be possible to speed up the copy by increasing the Data Integration Units, which are roughly comparable to the amount of compute used.

11. Do I have the ability to copy information simultaneously from many Excel sheets?

When using an Excel connector within a data factory, it is necessary to specify the sheet name from which the data is going to be loaded. This method is manageable when dealing with data from just one or a few sheets, but when dealing with tens of sheets or more, it can become tiresome because the sheet name needs to be updated in the code each time.

By utilizing a data factory binary format connector and directing it to the Excel file, we can avoid having to specify the sheet names at all. Using the copy activity, we can copy the data on all of the sheets contained within the file in one go.

12. Nesting of loops within loops in Azure Data Factory: yes or no?

Nested loops are not directly supported by the looping activities (ForEach or Until) in the data factory. On the other hand, we have the option of placing an Execute Pipeline activity inside a ForEach/Until loop, where the invoked pipeline itself contains a loop activity. By carrying out the aforementioned steps, one loop activity calls another indirectly, and we are able to implement nested looping successfully.

13. Are there any particular limitations placed on ADF members?

Azure Data Factory provides superior tools for transmitting and manipulating data, and these tools can be found in its feature set. However, you should be aware that there are certain limitations as well.

1.       Because the data factory does not allow the nesting of iteration and conditional activities, any pipeline that needs such a structure requires a workaround in order to function properly. This covers the If Condition, ForEach, Switch, and Until activities.

2.       The lookup activity is capable of retrieving a maximum of 5000 rows in a single operation. To go beyond this, we need to paginate, combining an additional loop activity with a SQL query that limits and offsets the rows returned.

3.       It is not possible for a pipeline to have more than forty activities in total, and this number includes any inner activities as well as any containers. To find a solution to this problem, pipelines ought to be modularized with regard to the number of datasets, activities, and so on.

14. What is Data Flow Debug?

Data Flow Debug in Azure Data Factory and Synapse Analytics lets you troubleshoot a data flow while simultaneously watching the transformation of the data shape in real time. The versatility of the debug session is beneficial both to Data Flow design sessions and to pipeline debug execution.

 

15. Is it possible to use ADF to implement CI/CD, which stands for continuous integration and delivery?

The Data Factory provides full support for CI and CD for your data pipelines by utilizing Azure DevOps and GitHub. As a consequence of this, you are able to construct and roll out new versions of your ETL procedures in stages before delivering the completed product. After the raw data has been converted into a form that a firm can utilize, it can be loaded into Azure Synapse Analytics, Azure SQL Database, Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your organization's BI tools reference.

16. Which components of Data Factory's building blocks are considered to be the most useful ones?

1.       Each activity inside the pipeline has the ability to use the @parameter construct in order to make use of the parameter value that has been provided to the pipeline.

2.       By making use of the @coalesce construct within the expressions, we are able to deal with null values in a pleasant manner.

3.       The @activity construct makes it possible to make use of the results obtained from one activity in another.

17. Do you have any prior experience with the Execute Notebook activity in Data Factory? How do you pass parameters to a notebook activity?

Through the use of the Notebook activity, we can run a notebook on our Azure Databricks cluster from a pipeline. We are able to send parameters to the notebook by utilizing the baseParameters attribute of that activity. In the event that a parameter is not explicitly defined or specified in the activity, the default value defined in the notebook is utilized.

18. Is it possible to communicate with a pipeline run by passing information along in the form of parameters?

In Data Factory, a parameter is handled just like any other fully-fledged top-level notion would be. The defining of parameters at the pipeline level enables the passage of arguments during on-demand or triggered execution of the pipeline.

19. Which activity should be performed if the goal is to make use of the results that were acquired by performing a query?

A lookup operation can be used to acquire the result of a query or a stored procedure. The result might be a single value or an array of attributes, either of which can be utilized in a ForEach activity or in some other control flow or transformation activity.

20. How many individual steps are there in an ETL procedure?

The ETL (Extract, Transform, Load) technique consists of carrying out these four stages in the correct order.

1.       Establishing a link to the data source (or sources) is the initial stage. After that, the information is collected and transferred to a centralized data store, either local or in the cloud, for subsequent processing.

2.       Making use of computational services includes activities such as transforming data by utilizing HDInsight, Hadoop, Spark, and similar tools.

3.       Publish: load the information into an Azure service such as Azure Data Lake Storage, a data warehouse (Azure Synapse Analytics), Azure SQL Database, or Azure Cosmos DB.

4.       Monitor: to facilitate pipeline monitoring, Azure Data Factory makes use of Azure Monitor, the API, PowerShell, Azure Monitor logs, and health panels in the Azure portal.

21. How well does Data Factory support the Hadoop and Spark computing environments when it comes to carrying out transform operations?

The following types of computer environments are able to carry out transformation operations with the assistance of a Data Factory:

1.       The On-Demand Computing Environment provided by ADF is a ready-to-use, fully managed solution. A cluster is created to carry out the transformation, and it is removed automatically once the transformation has completed.

2.       Bring Your Own Environment: if you already manage the compute infrastructure needed to run the workload, whether on-premises or in the cloud, you can register it in ADF and have ADF use it as the computing environment.

22. How about discussing the three most important tasks that you can complete with Microsoft Azure Data Factory?

As was discussed in the previous section's third question, Data Factory makes it easier to carry out three processes: moving data, transforming data, and exercising control.

1.       The operations known as data movement activities do precisely what their name suggests: they move data from one point to another. For example, information can be moved from one data store to another using Data Factory's Copy activity.

2.       "Data transformation activities" are any operations that modify data as it is being loaded into its final destination system. Stored Procedures, U-SQL, Azure Functions, and so on are just a few examples.

3.       Control (flow) activities, as their name suggests, are designed to help regulate the speed of any process that is going through a pipeline. For example, selecting the Wait action will result in the pipeline pausing for the amount of time that was specified.

23. What is meant by the term "ARM Templates" when referring to Azure Data Factory? Where do we plan to use them?

An ARM template is a file that uses JavaScript Object Notation (JSON), and it is where all of the definitions for the data factory pipeline activities, linked services, and datasets are stored. The template contains code that is essentially the same as that of our pipeline.

Once we have determined that the code for our pipeline is operating as it should, we will be able to use ARM templates to migrate it to higher environments, such as Production or Staging, from the Development setting.

24. Is there a limit to the number of Integration Runtimes that may be built or is it unlimited?

The default maximum for any entity that may be contained within a Data Factory is 5000, and this includes pipelines, datasets, triggers, linked services, Private Endpoints, and integration runtimes. You can file a request through the online support desk to have this limit raised if you find that you require more.

25. What are the prerequisites that need to be met before an SSIS package can be executed in Data Factory?

Before an SSIS package can be executed, you must provision an Azure-SSIS integration runtime and an SSISDB catalog hosted in either an Azure SQL Database server or an Azure SQL Managed Instance.

 

 

Scenario-Based Questions for Azure Data Factory

1. Scenario: Your company needs to move data from an on-premises SQL Server database to an Azure SQL Database daily. How would you set up this data movement in Azure Data Factory?

Answer: To set up this data movement:

1.       Create a Self-hosted Integration Runtime (IR) to securely connect to the on-premises SQL Server.

2.       Create linked services for both the on-premises SQL Server and Azure SQL Database.

3.       Create datasets for the source and destination tables.

4.       Create a pipeline with a Copy Data activity to move the data.

5.       Schedule the pipeline using a schedule trigger to run daily.

2. Scenario: You need to transform data from a CSV file in Azure Blob Storage and load it into an Azure SQL Database. Describe how you would accomplish this using Azure Data Factory.

Answer: To accomplish this:

1.       Create linked services for Azure Blob Storage and Azure SQL Database.

2.       Create datasets for the source CSV file and the destination SQL table.

3.       Create a pipeline with a Data Flow activity.

4.       In the Data Flow, read the data from the CSV file, apply the required transformations, and write the transformed data to the SQL table.

5.       Trigger the pipeline as needed.

3. Scenario: Your data pipeline fails intermittently due to network issues. How would you handle this in Azure Data Factory?

Answer: To handle intermittent pipeline failures:

1.       Configure retry policies for the affected activities, specifying the maximum retry count and the retry interval.

2.       Use the Set Variable activity to capture and log error details.

3.       Implement conditional activities like If Condition to retry or reroute the process based on error types.

4. Scenario: You need to copy data from multiple CSV files stored in an Azure Data Lake Storage Gen2 account to an Azure SQL Database. How would you configure this in Azure Data Factory?

Answer: To configure this data movement:

1.       Create linked services for Azure Data Lake Storage Gen2 and Azure SQL Database.

2.       Create datasets for the source CSV files and the destination SQL table.

3.       Use a wildcard in the source dataset to specify multiple CSV files.

4.       Create a pipeline with a Copy Data activity to move the data from the CSV files to the SQL table.

5. Scenario: You have a pipeline that must run only after another pipeline completes successfully. How would you implement this in Azure Data Factory?

Answer: To implement this dependency:

1.       Use Execute Pipeline activity to call the dependent pipeline.

2.       Set up an activity dependency to ensure that the subsequent pipeline runs only if the previous pipeline completes successfully.

6. Scenario: Your data transformation logic involves multiple steps, including filtering, aggregation, and joining data from two different sources. How would you implement this in Azure Data Factory?

Answer: To implement complex data transformations:

1.       Create linked services for the data sources.

2.       Create datasets for the input and output data.

3.       Create a pipeline with a Mapping Data Flow activity.

4.       In the Data Flow, add transformations to filter, aggregate, and join the data from the two sources.

5.       Write the transformed data to the desired output destination.

7. Scenario: You need to incrementally load data from an on-premises SQL Server to an Azure SQL Database. Explain how you would achieve this in Azure Data Factory.

Answer: To achieve incremental data loading:

1.       Identify a watermark column (e.g., last modified date) in the source table.

2.       Store the last processed value of the watermark column.

3.       Create a pipeline with a Copy Data activity.

4.       Use a dynamic query in the source dataset to filter data based on the stored watermark value (see the sketch after this list).

5.       Update the watermark value after each successful load.
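A minimal sketch of the watermark pattern, assuming a hypothetical WatermarkTable and a LastModifiedDate column (all names are placeholders):

-- Source query used by the Copy Data activity: pull only new or changed rows
SELECT *
FROM dbo.SourceOrders
WHERE LastModifiedDate >
      (SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'SourceOrders');

-- After a successful load, advance the watermark (e.g., via a Stored Procedure activity)
UPDATE dbo.WatermarkTable
SET WatermarkValue = (SELECT MAX(LastModifiedDate) FROM dbo.SourceOrders)
WHERE TableName = 'SourceOrders';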

8. Scenario: You are tasked with integrating data from various formats (CSV, JSON, Parquet) stored in an Azure Data Lake Storage Gen2 into a single Azure SQL Database table. Describe your approach.

Answer: To integrate data from various formats:

1.       Create linked services for Azure Data Lake Storage Gen2 and Azure SQL Database.

2.       Create datasets for each file format and the destination SQL table.

3.       Create a pipeline with multiple Copy Data activities, each handling a different file format.

4.       Use Data Flow activities to apply necessary transformations and merge the data into a single table.

9. Scenario: You need to implement a solution that dynamically chooses the source and destination based on input parameters. How would you configure this in Azure Data Factory?

Answer: To configure dynamic source and destination selection:

1.       Create parameters in the pipeline for the source and destination.

2.       Use parameterized linked services and datasets to reference the source and destination based on input parameters.

3.       Pass the parameter values at runtime when triggering the pipeline.

10. Scenario: Your company requires a data pipeline to process and analyze streaming data in near real-time. Explain how you would implement this using Azure Data Factory.

Answer: To implement near real-time data processing:

1.       Use Azure Event Hubs or Azure IoT Hub to ingest streaming data.

2.       Set up an Azure Stream Analytics job to process the streaming data and write the output to a data store like Azure Blob Storage or Azure SQL Database (a sample query follows this list).

3.       Use Azure Data Factory to orchestrate the process, periodically running pipelines to load and transform the processed data for further analysis.
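For illustration, the Stream Analytics step might use a query like the sketch below; the input/output aliases and column names are assumptions, not from this article:

-- Aggregate events from an Event Hubs input into 5-minute tumbling windows
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp() AS WindowEnd
INTO
    [sql-output]
FROM
    [eventhub-input] TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
    DeviceId,
    TumblingWindow(minute, 5)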

 

Monday, September 9, 2024

SNOWFLAKE INTERVIEW QUESTIONS

 What is Snowflake and how does it differ from traditional data warehouses?

Can you explain the architecture of Snowflake and its key components?

How does Snowflake handle data storage and compression?

What is the concept of virtual warehouses in Snowflake and how do they work?

Can you describe the process of loading data into Snowflake?

How does Snowflake ensure security and access control for data?

Have you worked with Snowflake's SQL interface or its API? Can you give an example?

How can you optimize performance in Snowflake, and what factors affect query execution time?

Have you used Snowflake for real-time data processing or stream processing? If so, can you describe your experience?

Can you discuss any challenges you have faced while working with Snowflake and how you overcame them?



============================================================================================

Here are some potential Snowflake interview questions:

============================================================================================

Types of tables, views, and stages.

How does Snowflake handle concurrency and scalability?

What are materialized views in Snowflake, and when would you use them?

How do you load data into Snowflake and what are the different methods available for data loading?

Can you explain the concept of clustering in Snowflake and how it can improve query performance?

How does Snowflake secure data at rest and in transit?

Have you worked with Snowflake's SQL interface or one of its connectors (e.g., JDBC, Python, Spark)? If so, can you provide an example of a query you have written against Snowflake?

How do you optimize queries in Snowflake, and what tools or techniques do you use to identify bottlenecks?

Can you describe your experience working with large datasets in Snowflake? How did you ensure efficient processing and querying?

How do you manage users, roles, and permissions in Snowflake?

=============================================================================================

Have you integrated Snowflake with other technologies such as BI tools, ETL tools, or data pipelines? Can you give an example?

How do you monitor and troubleshoot issues in Snowflake? Are there any specific metrics or logs that you look at?

Can you discuss a challenging problem you encountered while using Snowflake and how you solved it?

How familiar are you with Snowflake's roadmap and upcoming features? Which ones are you most excited about?



=============================================================================================


Here are some scenario-based interview questions related to Snowflake:


You are tasked with designing a new data warehouse using Snowflake. 

The data includes structured and semi-structured data from various sources such as CSV files, relational databases, and APIs.

Walk us through the steps you would take to load this data into Snowflake and make it available for analysis.


Your company has been experiencing slow query performance during peak usage times in Snowflake. As the database administrator, what steps would you take to identify the root cause and improve performance without increasing costs?


A business user requests access to a table containing sensitive customer information in Snowflake. However, they only need read-only access to perform analyses. Describe the steps you would take to grant the necessary permissions while ensuring data security and compliance with regulations.


Your team wants to implement real-time data streaming into Snowflake. Which tools or services would you consider using to achieve this goal, and why?


You are building a dashboard using Tableau that connects to a Snowflake database. Suddenly, the connection fails due to high concurrency. Explain how you would troubleshoot the issue and prevent similar occurrences in the future.


Your organization is planning to migrate its existing data warehouse to Snowflake. Identify the critical success factors for such a migration project and outline a plan for minimizing downtime and maintaining data integrity throughout the transition.


You notice unusual activity patterns in Snowflake, indicating possible unauthorized access attempts. Outline the steps you would take to investigate and mitigate the situation while preserving forensic evidence.


Your analytics team requires near-real-time insights from Snowflake but doesn't want to wait for nightly batch jobs. Propose a solution for addressing this challenge while balancing cost, complexity, and system performance.


A developer accidentally dropped a large table in Snowflake, causing significant disruption to ongoing queries. Discuss the recovery options available and the tradeoffs between each approach.


Your company uses multiple clouds (e.g., AWS, Azure, GCP) for different applications and needs to consolidate data into a single source for reporting purposes. Evaluate whether Snowflake would be a suitable choice for this requirement and propose a strategy for implementing multi-cloud data integration.



==================================================================================================

==================================================================================================

1.Tell me about your roles in your current project and in previous projects.

2.Have you worked on the performance tuning side of Snowflake?

3.Have you worked on COPY commands?

4.What types of transformations have you done in a COPY command?

5.If I have already loaded a file into Snowflake, is it possible to load the same file again? If yes, how?

6.How is this possible with Snowpipe?

7.Have you worked on clustering part? What is that?

8.What are the table types available in Snowflake?

9.How do you identify from the execution plan which query is taking the most time to execute?

10.What are the services you worked on AWS?

11.What are the common issues you faced while loading the data?

12.What is the difference between scale up and scale out in Snowflake?

13.What is memory spillage in Snowflake?

14.Do you know about RBAC?

15.How are privileges inherited in Snowflake?

16.Do you know what cloning is?

17.What is the caching mechanism in Snowflake?

18.I want to load a file that contains some error records. How can you load it?

19.In which languages have you written stored procedures?

20.Have you written any stored Procedure by using Python?

21.Types of cluster keys

22.Differences between Timestamp, Offset, and Query_id (Time Travel)

23.What is the purpose of the COALESCE function?

24.What is the difference between views and materialized views?

25.How do you connect to cloud storage (e.g., AWS S3)?

26.What are the uses of CDC (Streams & Tasks)?



How do we share data with a client?

Benefits of Time Travel

Benefits of zero-copy cloning

Benefits of the Snowflake data warehouse

Data modeling

Micro partition

Stream

Task

=============================================

Snowflake Interview Questions ** PART-1 **

 

Difficulty Level Easy to Medium

 

1- How is Snowflake different from on-premises data warehouses?

 

2- What is Time Travel, and how can we retrieve our dropped/updated data?

 

3- If we clone an entire database, which objects will not be cloned?

 

4- If I have mistakenly recreated a table with the same name, how would I restore the previous table in Snowflake?

 

5- What are scale up and scale out? Which factors should we keep in mind when enabling them?

 

6- If I have one named stage "COMMON_STAGE" and the client uploads 10 files every day:

 

a- I want to list all faulty records and insert them into a separate table.

 

b- If I want to know which files have been copied and which have not, how would you implement this pipeline?

 

7- How does Time Travel work under the hood?

 

8- Explain the Complete Query execution flow.

 

9- Which COPY INTO parameters are not supported in Snowpipe?

 

10- How would you compute the table-wise storage cost?



=====

ANS:-

1)Snowflake is a cloud-based data warehousing platform that differs from traditional on-premises data warehouses in several ways. Here are some key differences:


Cloud-based vs. On-premises: The most obvious difference is that Snowflake is a cloud-based platform, 

whereas on-premises data warehouses are deployed and managed on-premises, within an organization's own data center.

Scalability: Snowflake is designed to scale automatically, allowing users to easily increase or decrease computing power and storage capacity as needed. 

             In contrast, on-premises data warehouses require manual scaling, which can be time-consuming and costly.

Cost: Snowflake offers a pay-as-you-go pricing model, which means users only pay for the resources they use. 

      On-premises data warehouses require significant upfront capital expenditures for hardware, software, and maintenance.

Maintenance: Snowflake handles maintenance tasks, such as patching, upgrades, and backups, freeing up IT resources for more strategic activities.

              On-premises data warehouses require manual maintenance, which can be time-consuming and resource-intensive.

Data Storage: Snowflake uses columnar storage, which is optimized for querying and analytics workloads. 

              On-premises data warehouses often use row-based storage, which can lead to slower query performance.

Data Sharing: Snowflake enables secure and governed data sharing across multiple organizations, making it easier to collaborate and share data with partners, suppliers, or customers. 

              On-premises data warehouses often require manual data extraction, transformation, and loading (ETL) processes to share data.

Security: Snowflake provides enterprise-grade security features, including encryption, access controls, and auditing.

          On-premises data warehouses require manual security configuration and management.

Performance: Snowflake's architecture is optimized for high-performance querying and analytics, with features like multi-cluster architecture, auto-scaling, and caching. 

             On-premises data warehouses may require manual performance tuning and optimization.

Data Integration: Snowflake provides native integration with various data sources, including cloud storage, on-premises databases, and SaaS applications. 

                  On-premises data warehouses often require manual integration using ETL tools.

Upgrades: Snowflake automatically upgrades its platform, ensuring users have access to the latest features and security patches. 

          On-premises data warehouses require manual upgrades, which can be complex and time-consuming.

These differences make Snowflake an attractive option for organizations looking for a scalable, flexible, and cost-effective data warehousing solution 

that can handle large volumes of data and support advanced analytics workloads.



2)

Time Travel in Snowflake:


Time Travel is a feature in Snowflake that allows you to recover data that has been accidentally deleted, updated, or altered. 

It's like having an "undo" button for your data warehouse!


How to retrieve dropped/updated data:


Check the retention period: Time Travel is enabled by default (1 day of retention on the Standard edition; up to 90 days on Enterprise and higher).

Set a retention period: Specify how long historical data should be kept by setting the DATA_RETENTION_TIME_IN_DAYS parameter on the account, database, schema, or table (e.g., 1 day, 7 days, etc.).

Query historical data: Use the AT or BEFORE clause in your SQL query to access historical data, like this: SELECT * FROM table_name AT (TIMESTAMP => '2023-02-15 14:30:00'::timestamp);

Restore data: Recreate the data from the historical version, like this: CREATE TABLE new_table AS SELECT * FROM table_name AT (TIMESTAMP => '2023-02-15 14:30:00'::timestamp); (COPY INTO loads from stages and cannot copy a historical table version). For a dropped table, simply run UNDROP TABLE table_name;

That's it! You can now retrieve your dropped or updated data using Time Travel.
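A minimal sketch of the commands involved, using a placeholder table name:

-- Set (or raise) the Time Travel retention period for a table
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 7;

-- Bring back a dropped table while it is still within the retention period
UNDROP TABLE orders;

-- Restore a point-in-time copy without physically duplicating the storage
CREATE TABLE orders_restored CLONE orders AT (TIMESTAMP => '2023-02-15 14:30:00'::timestamp);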



3)


When you clone an entire database in Snowflake, the clone includes the schemas, tables, views, and most other child objects, but the following are not cloned:


External tables

Internal (Snowflake) named stages

Pipes that reference an internal (Snowflake) stage

Privileges granted on the source database itself (privileges on child objects are carried over to the clone)

Note also that a cloned table does not inherit the load history of its source, so files already loaded into the source table can be loaded again into the clone.
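Cloning itself is a metadata-only (zero-copy) operation; a quick sketch with placeholder names:

-- Zero-copy clone of an entire database
CREATE DATABASE analytics_dev CLONE analytics_prod;

-- Zero-copy clone of a single schema
CREATE SCHEMA analytics_prod.sales_backup CLONE analytics_prod.sales;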




4)Restoring a dropped table in Snowflake:


If you've mistakenly recreated a table with the same name, you can restore the previous table using Time Travel. Here's how:


Check the dropped versions: Run SHOW TABLES HISTORY LIKE 'my_table' IN SCHEMA my_db.my_schema; the DROPPED_ON column shows any dropped versions that are still within the Time Travel retention period.

Move the new table out of the way: The original name must be free before you can undrop, so rename the newly created table, like this: ALTER TABLE my_table RENAME TO my_table_new;

Restore the previous table: Run UNDROP TABLE my_table; which brings back the most recently dropped version under its original name. (If several dropped versions exist, rename and repeat UNDROP to reach older ones.)

That's it! Your previous table is now restored with its original name.



5)Scale Up vs. Scale Out:


Scale Up:


Increase the power of a single node (e.g., add more CPU, memory, or storage)

Suitable for small to medium-sized workloads

Advantages: Easy to manage, less complex

Limitations: Limited scalability, potential bottlenecks


Scale Out:


Add more nodes to a cluster (e.g., add more servers or virtual machines)

Suitable for large-scale workloads, high concurrency, and big data

Advantages: Highly scalable, flexible, and fault-tolerant

Limitations: More complex to manage, higher costs

Factors to consider for Scale Up:


Workload size: Small to medium-sized workloads

Resource utilization: CPU, memory, and storage utilization are within limits

Data size: Small to medium-sized data sets

Query complexity: Simple to moderate query complexity

Factors to consider for Scale Out:


Workload size: Large-scale workloads

Resource utilization: High CPU, memory, and storage utilization

Data size: Large data sets

Query complexity: Complex queries, high concurrency

Fault tolerance: High availability and redundancy requirements

Scalability: Anticipated growth in workload, data, or users

In Snowflake, you can scale up by increasing the warehouse size or scale out by adding more warehouses. Consider your workload requirements, data size, and query complexity when deciding between scaling up or out.
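In Snowflake both options are single ALTER WAREHOUSE statements; a minimal sketch with a placeholder warehouse name (the multi-cluster settings assume Enterprise edition or higher):

-- Scale up: more compute per cluster, helps large or complex queries
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: more clusters for the same warehouse, helps high concurrency
ALTER WAREHOUSE my_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';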



6)If I have one named stage "COMMON_STAGE" and the client uploads 10 files every day:


a- To capture faulty records: load with COPY INTO ... ON_ERROR = 'CONTINUE' so that the good rows still load, then use the VALIDATE table function on that load job to list the rejected records and insert them into a separate reject table (see the sketch below).

b- To know which files have and have not been copied: COPY INTO keeps load metadata for the target table for 64 days, so already-loaded files are skipped automatically on re-runs. To report on it, query the COPY_HISTORY table function (or the LOAD_HISTORY view) and compare the results against LIST @COMMON_STAGE.
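A minimal sketch of both parts, assuming a placeholder target table RAW_ORDERS and CSV files on the stage:

-- a) Load, skipping bad rows instead of failing the whole file
COPY INTO raw_orders
FROM @COMMON_STAGE
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
ON_ERROR = 'CONTINUE';

-- Capture the rejected records from that load into a separate reject table
CREATE OR REPLACE TABLE raw_orders_rejects AS
SELECT * FROM TABLE(VALIDATE(raw_orders, JOB_ID => '_last'));

-- b) Which files have been loaded in the last 24 hours, and with what status
SELECT file_name, status, row_count, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'RAW_ORDERS',
       START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())));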



7)

Time Travel Under the Hood:


Snowflake's Time Travel feature uses a combination of techniques to store and manage historical data:


Micro-partitions: Snowflake divides data into small, immutable micro-partitions (each typically holding 50 MB to 500 MB of uncompressed data).

Change Tracking: Snowflake tracks changes to data at the micro-partition level, creating a new version of the micro-partition each time data is updated or deleted.

Versioning: Snowflake stores multiple versions of each micro-partition, allowing for easy retrieval of historical data.

Retention Period: Snowflake stores historical versions for a specified retention period (e.g., 1 day, 7 days, etc.).

Metadata Management: Snowflake maintains metadata about each micro-partition, including its version, timestamp, and location.

When you query historical data using Time Travel:


Query Routing: Snowflake's query router directs the query to the correct micro-partition version.

Version Retrieval: Snowflake retrieves the requested version of the micro-partition from storage.

Data Reconstruction: Snowflake reconstructs the historical data by combining the retrieved micro-partition version with other necessary data.

By using micro-partitions, change tracking, and versioning, Snowflake's Time Travel feature provides efficient and scalable storage and retrieval of historical data.
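For reference, historical versions are queried with the AT and BEFORE clauses; a small sketch with a placeholder table name (the query ID is a placeholder):

-- State of the table one hour ago (offset is in seconds)
SELECT * FROM orders AT (OFFSET => -60*60);

-- State of the table immediately before a specific statement ran
SELECT * FROM orders BEFORE (STATEMENT => '<query_id>');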



8- Explain the Complete Query execution flow.


Here is the complete query execution flow:


Parsing: The query is parsed to check syntax and semantics.

Binding: The parsed query is bound to the database objects (tables, views, etc.) to resolve object references.

Optimization: The bound query is optimized to create an efficient execution plan.

Compilation: The optimized query is compiled into a machine-readable format.

Execution: The compiled query is executed by the database engine.

Fetching: The results are fetched from the database storage.

Processing: The fetched data is processed according to the query operations (e.g., sorting, aggregating).

Result Set Generation: The processed data is formatted into a result set.

Returning Results: The result set is returned to the client.

Alternatively, this can be broken down into the following phases:


Parse-Bind-Optimize-Compile-Execute (PBOCE)


Parse: Parse the query

Bind: Bind the query to database objects

Optimize: Optimize the query

Compile: Compile the query

Execute: Execute the query

Fetch-Process-Return (FPR)


Fetch: Fetch the results from storage

Process: Process the fetched data

Return: Return the result set to the client



9- Which COPY INTO parameters are not supported in Snowpipe?


COPY INTO Parameters Not Supported in Snowpipe:


Snowpipe is a serverless data loading service in Snowflake that allows for continuous data ingestion from external sources.

While Snowpipe supports most COPY INTO options, there are some limitations. The copy options most commonly cited as NOT supported in a pipe's COPY statement include:


ON_ERROR = ABORT_STATEMENT: Snowpipe cannot abort the whole load on an error; its default and supported behavior is to skip the problem file (SKIP_FILE).

VALIDATION_MODE: pre-load validation of files is not supported.

PURGE: Snowpipe does not delete files from the stage after loading.

FORCE: files cannot be force-reloaded through a pipe.

RETURN_FAILED_ONLY: not supported.

FILES and SIZE_LIMIT: naming explicit files or capping the load size in the pipe's COPY statement is not supported.

These limitations are due to the serverless nature of Snowpipe, which is designed for high-performance and scalability. 

However, most other COPY INTO parameters are supported in Snowpipe, allowing for flexible and efficient data loading.
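For context, a pipe is just a COPY INTO statement wrapped in a CREATE PIPE definition; a minimal sketch with placeholder names (AUTO_INGEST assumes an external stage with event notifications configured):

CREATE OR REPLACE PIPE raw_orders_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_orders
  FROM @COMMON_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);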


10- How would you compute the Table wise Storage cost?


To compute the table-wise storage cost, you can use the following steps:


Calculate the total row count: Estimate the number of rows in the table.

Calculate the average row size: Calculate the average size of each row in bytes.

Calculate the total data size: Multiply the total row count by the average row size.

Add index size: Calculate the size of any indexes on the table and add it to the total data size.

Add overhead: Add a percentage (usually 10-20%) to account for database overhead, such as metadata, pointers, and padding.

Formula:


Total Storage Cost = (Total Row Count x Average Row Size) + Index Size + Overhead


Where:


Total Row Count is the estimated number of rows in the table.

Average Row Size is the average size of each row in bytes.

Index Size is the estimated size of any indexes on the table.

Overhead is a percentage (usually 10-20%) to account for database overhead.

Example:


Total Row Count: 1,000,000

Average Row Size: 200 bytes

Index Size: 100 MB

Overhead: 15%

Total Storage Cost = (1,000,000 x 200 bytes) + 100 MB + (15% x 200 MB) = 200 MB + 100 MB + 30 MB = 330 MB


Therefore, the estimated storage cost for this table is approximately 330 MB.
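In Snowflake specifically, you do not have to estimate row sizes: the ACCOUNT_USAGE.TABLE_STORAGE_METRICS view reports the actual active, Time Travel, and fail-safe bytes per table. A minimal sketch (multiply the result by your contracted price per TB per month):

SELECT table_catalog,
       table_schema,
       table_name,
       (active_bytes + time_travel_bytes + failsafe_bytes) / POWER(1024, 4) AS total_tb
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted = FALSE
ORDER BY total_tb DESC;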





Wednesday, April 3, 2024

 BEST PYSPARK LEARNING SITES


https://www.youtube.com/watch?v=s3B8HXLlLTM&list=PL2IsFZBGM_IHCl9zhRVC1EXTomkEp_1zm&index=5



https://www.youtube.com/watch?v=N2Mzs-nJosQ&list=PL2IsFZBGM_IExqZ5nHg0wbTeiWVd8F06b



https://github.com/tomaztk/Spark-for-data-engineers/blob/main/SparkEngineers/01_what-is-apache-spark.md

Thursday, March 28, 2024

DATABRICKS Pyspark Important

 ================DUPLICATE===============

# Imports needed: from pyspark.sql import functions as F; from pyspark.sql.functions import col; from pyspark.sql.window import Window

# Define a Window specification
window_spec = Window.partitionBy(*columns_to_check).orderBy("SLS_HUB_INTRNL_ACCT_ID")

# Use row_number() to mark duplicates (rank() would assign ties the same value, so exact duplicates would all get rank 1 and be missed)
df_final_prism_repartition_reject = df_final_prism_repartition.withColumn("is_duplicate", F.row_number().over(window_spec) > 1).filter(col("is_duplicate"))




===========READING FILE ===========


df = spark.read.format("csv")\

  .option("header",True)\

    .option("delimiter", "~")\

      .option("quote", "\"")\

        .option("encoding","ISO-8859-1")\

          .option("dateFormat","MM/dd/yyyy")\

            .schema(input_file_read_schema)\

              .load(work_file_path)

  

=================#Current CST Time===============

#Current CST Time
import pytz
from datetime import datetime

CST = pytz.timezone('US/Central')

Load_TMS =datetime.now().astimezone(CST).strftime('%Y-%m-%d %H:%M:%S')   

===========================================

from pyspark.sql.types import StructType, StructField, StringType

input_file_read_schema = StructType([

    StructField("ACCOUNT_ID", StringType(), nullable=True),

    StructField("OW_ID", StringType(), nullable=True),

    StructField("HQ_ID", StringType(), nullable=True),

    StructField("JB_ID", StringType(), nullable=True)

])



===========WRITE FILE===========================

df_final_prism_repartition.write\

  .format("delta")\

  .mode("overwrite")\

  .save(struct_path)

====================================


BATCH_ID = datetime.now().astimezone(CST).strftime('%Y%m%d%H%M%S')

# Example value: 20240123021120

====================================


=============Clean unprintable characters from columns=====

'''df = df.withColumn("CART_ID", regexp_replace("CART_ID", r"[^\x01-\xFF]", ""))\

  .withColumn("CORP_NM", regexp_replace("CORP_NM", r"[^\x01-\xFF]", ""))\

  .withColumn("SLS_CONTRCT_AUTHRZ_NBR", regexp_replace("SLS_CONTRCT_AUTHRZ_NBR", r"[^\x01-\xFF]", ""))\

    .withColumn("CONTRCT_CATG_NM", regexp_replace("CONTRCT_CATG_NM", r"[^\x01-\xFF]", ""))

'''


===============Collapse repeated whitespace into a single space in every column=======

for columnName in df.columns :

  df = df.withColumn(columnName, regexp_replace(col(columnName),"\\s+"," "))

  

======= is null otherwise===============================


df = df.withColumn("stgdate", when(trim(col("EDRA_ACTION_DATE")).isNotNull() == True, substring_index(trim(col("EDRA_ACTION_DATE")),' ',1)).otherwise(""))\

  .withColumn("stgtime", when(trim(col("EDRA_ACTION_DATE")).isNotNull() == True, substring_index(trim(col("EDRA_ACTION_DATE")),' ',-1)).otherwise(""))\

  .withColumn("stgdateValidty", when((col("stgdate") != '') & to_date(col("stgdate"),"MM/dd/yyyy").isNotNull(),lit(1)).otherwise(lit(0)))

==withColumnRenamed===================================


df = df.withColumn("OW_ID", trim(df.OW_ID)).withColumnRenamed("OW_ID","SLS_ACCT_ONEWRLD_ID")\

        .withColumn("HQ_ID", trim(df.HQ_ID)).withColumnRenamed("HQ_ID","SLS_ACCT_HDQ_ID")\

        .withColumn("PRISM_ID", when(trim(col("PRISM_ID")).isNotNull() == True, trim(col("PRISM_ID"))).otherwise("")).withColumnRenamed("PRISM_ID","CORP_ID")\

        .withColumn("COMMON_CORP_ID", when(trim(col("COMMON_CORP_ID")).isNotNull() == True, trim(col("COMMON_CORP_ID"))).otherwise("")).withColumnRenamed("COMMON_CORP_ID","COMMON_CORP_ACCT_ID")\

        .withColumn("ACCOUNT_ID", trim(df.ACCOUNT_ID)).withColumnRenamed("ACCOUNT_ID","SLS_HUB_INTRNL_ACCT_ID")\

        .withColumn("EDRA_REQUEST_TYPE", trim(df.EDRA_REQUEST_TYPE)).withColumnRenamed("EDRA_REQUEST_TYPE","EDRA_RQST_TYPE_TXT")\

        .withColumn("JB_ID", trim(df.JB_ID)).withColumnRenamed("JB_ID","JB_CORP_ID")\

        .withColumn("ROW_EFF_DT", current_date())\

        .withColumn("ROW_EXP_DT", lit('9999-12-31'))\

        .withColumn("ROW_STATUS_IND", lit('I'))\

 ===================widgets ===================== 

 

dbutils.widgets.text("adb_par", "","")

dbutils.widgets.text('env', "","")


PipelineName = dbutils.widgets.get('PipelineName')

env = dbutils.widgets.get('env')



====call other notebook============================

%run "../../config/corpd_parameterSetup"



============Run a query and get=========

status_chk = f"""select count(*) as v_not_empty from corpd_{env}_struct.corpd_job_status where STATUS = 'R' and JOB_NAME = '{PipelineName}'"""

print(status_chk)

last_status_chk = spark.sql(status_chk).collect()[0][0]


=========HOW TO GET NOTEBOOK NAME===========================

notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().getOrElse(None)


# Extract the notebook name from the path

notebook_name = notebook_path.split("/")[-1]

print(notebook_name)


================================================

=======================TABLE to load in VIEW==================

##Mosaic JDBC info###

driver = 'com.teradata.jdbc.TeraDriver'

jdbcURL = f"jdbc:teradata://{mosaic_cpd_server}"


sql_query_text = f"""Select * from cert_MOSAIC_DB.PRISM_CONTRCT_INFO"""


df = spark.read \

      .format('jdbc') \

      .option('driver', driver) \

      .option('url', jdbcURL) \

      .option('query', sql_query_text) \

      .option('user', mosaic_cpd_user) \

      .option('password', mosaic_cpd_passwd)\

      .load()


df = df.drop_duplicates()


print(df.count())

view_name = "PRISM_CONTRCT_INFO"

print(view_name)

df.createOrReplaceTempView(view_name)

============================================================

==================dropDuplicate()===========================


df_accept = df_accept.dropDuplicates(['AccountAnalystName',

                                      'AccountMgrEmployeeID',

                                      'AccountMgrName',

                                      'AccountType',

                                      'Website',

                                      'TaxNumber'])



===========================Count nulls/values in all columns=================================   

# Find Count of Null, None, NaN of All DataFrame Columns

from pyspark.sql.functions import col,isnan, when, count

df.select([count(when(col(CONTRCT_ID).isNull(), 

    CONTRCT_ID)).alias(CONTRCT_ID) for CONTRCT_ID in df.columns]).show()


df.select([count(col(CORP_ID)).alias(CORP_ID) for CORP_ID in df.columns]).show()



=====================String to date conversion=========


df=df.withColumn("CONTRCT_EFF_DT",to_date(to_date("CONTRCT_EFF_DT","MM/dd/yyyy"),'yyyy-MM-dd'))

df=df.withColumn("CONTRCT_EXP_DT",to_date(to_date("CONTRCT_EXP_DT","MM/dd/yyyy"),'yyyy-MM-dd'))

display(df)



======================================================


AZURE AAD ServicePrincipalCredentials and ClientSecretCredential classes

Both ServicePrincipalCredentials and ClientSecretCredential are classes in the Azure SDK for Python that authenticate against Azure Active Directory (AAD). ServicePrincipalCredentials comes from the older azure.common.credentials module (used with the legacy management libraries), while ClientSecretCredential belongs to the newer azure-identity package.

Here are the main differences between these two classes:


Authentication Method:-

Both classes perform service-principal (client credential) authentication with a client secret. For other authentication methods, the newer azure-identity package provides separate classes, such as CertificateCredential for certificate-based authentication and InteractiveBrowserCredential for interactive browser login.


Constructor Parameters:-

To construct a ServicePrincipalCredentials object, you pass the client ID, the client secret, and the tenant ID (as client_id, secret, and tenant). ClientSecretCredential likewise takes three constructor arguments, but with different names: tenant_id, client_id, and client_secret.



print("Setting ServicePrincipalCredentials")

from azure.common.credentials import ServicePrincipalCredentials

spCredentials = ServicePrincipalCredentials(client_id=service_principle_clientid, secret=spSecret, tenant=service_principle_directory_id)


or 

 from azure.identity import ClientSecretCredential

  print("Setting spCredentials using ClientSecretCredential")

 spCredentials = ClientSecretCredential(tenant_id=service_principle_directory_id, client_id=service_principle_clientid, client_secret=spSecret)


-


Data engineering Interview Questions

1)  What all challenges you have faced and how did you overcome from it? Ans:- Challenges Faced and Overcome As a hypothetical Spark develop...