Introduction
Azure Data Factory (ADF) is a cloud-based data
integration service that enables you to create data-driven workflows for
orchestrating and automating data movement and data transformation. If you're
preparing for an Azure Data Factory interview, you'll want to be familiar with
the core concepts and features of ADF. In this article, we'll cover some of the
top Azure Data Factory interview questions and answers to help you prepare.
1. What is Azure Data Factory (ADF)?
Azure Data Factory (ADF) is a cloud-based data integration
service that enables you to create data-driven workflows for orchestrating and
automating data movement and data transformation. ADF provides a managed
service that is continuously monitored and updated, and it provides built-in
security features, such as data encryption, identity and access management, and
data privacy.
2. What are the core components of Azure Data Factory
(ADF)?
The core components of Azure Data Factory (ADF) are:
· Pipelines:
A pipeline is a logical grouping of activities that perform a specific task.
· Activities:
An activity is a unit of work within a pipeline.
· Datasets:
A dataset is a named view of data that is used by activities.
· Linked
services: A linked service is a connection to a data store or a compute
service.
· Triggers:
A trigger is a mechanism that starts a pipeline run.
3. What are the different types of activities in Azure
Data Factory (ADF)?
The different types of activities in Azure Data Factory
(ADF) are:
· Data
movement activities: Data movement activities move data from one location to
another.
· Data
transformation activities: Data transformation activities transform data from
one format to another.
· Control
activities: Control activities control the flow of a pipeline.
4. What is a linked service in Azure Data Factory (ADF)?
A linked service in Azure Data Factory (ADF) is a connection
to a data store or a compute service. Linked services are used to connect to
various data stores, such as Azure Blob Storage, Azure Data Lake Storage, and
Azure SQL Database.
5. What is a dataset in Azure Data Factory (ADF)?
A dataset in Azure Data Factory (ADF) is a named view of
data that is used by activities. A dataset represents the input or output of an
activity.
6. What is a pipeline in Azure Data Factory (ADF)?
A pipeline in Azure Data Factory (ADF) is a logical grouping
of activities that perform a specific task. A pipeline can contain one or more
activities, and it can be triggered manually or scheduled to run at a specific
time.
7. What is a trigger in Azure Data Factory (ADF)?
A trigger in Azure Data Factory (ADF) is a mechanism that
starts a pipeline run. Triggers can be scheduled to run at a specific time, or
they can be triggered by an event, such as the arrival of a new file in a data
store.
8. What is the difference between a tumbling window and
sliding window trigger in Azure Data Factory (ADF)?
A tumbling window trigger in Azure Data Factory (ADF)
triggers a pipeline run at a fixed interval, while a sliding window
trigger triggers a pipeline run at a sliding interval. For example, a
tumbling window trigger might trigger a pipeline run every hour, while a
sliding window trigger might trigger a pipeline run every 30 minutes.
9. What is the difference between a single-node and an
integrated runtime in Azure Data Factory (ADF)?
A single-node runtime in Azure Data Factory (ADF) is a
standalone runtime that is used for data integration tasks, while an integrated
runtime is a runtime that is integrated with Azure Data Factory. An integrated
runtime provides additional features, such as support for custom activities and
integration with Azure DevOps.
10. What is the difference between a tumbling window and
a sliding window in Azure Data Factory (ADF)?
A tumbling window in Azure Data Factory (ADF) is a
fixed-size window that moves data at regular intervals, while a sliding window
is a moving window that moves data based on a specific time interval.
For example, a tumbling window might move data every hour,
while a sliding window might move data every 30 minutes, but also include the
previous 30 minutes of data.
11. What is the difference between a dataset and a linked
service in Azure Data Factory (ADF)?
A dataset in Azure Data Factory (ADF) is a named view of
data that is used by activities, while a linked service is a connection to a
data store or a compute service. A dataset represents the input or output of an
activity, while a linked service is used to connect to various data stores or
compute services.
12. What is the difference between a pipeline and a
trigger in Azure Data Factory (ADF)?
A pipeline in Azure Data Factory (ADF) is a logical grouping
of activities that perform a specific task, while a trigger is a mechanism that
starts a pipeline run. A pipeline can contain one or more activities, and it
can be triggered manually or scheduled to run at a specific time, while a
trigger starts a pipeline run based on a specific event or schedule.
13. What is the difference between a data flow and a
mapping data flow in Azure Data Factory (ADF)?
A data flow in Azure Data Factory (ADF) is a data
transformation that is executed in a managed runtime, while a mapping data flow
is a data transformation that is executed in a Spark runtime. A data flow
provides a visual interface for data transformation, while a mapping data flow
provides a code-first interface for data transformation.
14. What is the difference between a tumbling window and
a tumbling window trigger in Azure Data Factory (ADF)?
A tumbling window in Azure Data Factory (ADF) is a
fixed-size window that moves data at regular intervals, while a tumbling window
trigger is a trigger that starts a pipeline run at regular intervals. A
tumbling window trigger might start a pipeline run every hour, while a tumbling
window moves data every hour.
15. What is the difference between a sliding window and a
sliding window trigger in Azure Data Factory (ADF)?
A sliding window in Azure Data Factory (ADF) is a moving
window that moves data based on a specific time interval, while a sliding
window trigger is a trigger that starts a pipeline run based on a specific time
interval. A sliding window trigger might start a pipeline run every 30 minutes,
while a sliding window moves data every 30 minutes, but also include the
previous 30 minutes of data.
1. What is Azure Data Factory?
2. Is Azure Data Factory ETL or ELT tool?
It is a cloud-based Microsoft tool that provides a
cloud-based integration service for data analytics at scale and supports ETL and ELT paradigms.
3. Why is ADF needed?Azure Data Factory is a
cloud-based, fully managed, serverless ETL and data integration service
offered by Microsoft Azure for automating data movement from its native place
to, say, a data lake or data
warehouse using ETL (extract-transform-load) OR extract-load-transform
(ELT). It lets you create and run data pipelines to help move and transform
data and run scheduled pipelines.
With an increasing amount of big data, there is a need for a
service like ADF that can orchestrate and operationalize processes to refine
the enormous stores of raw business data into actionable business
insights.
4. What sets Azure Data Factory apart from conventional
ETL tools?
Azure Data Factory stands out from other ETL tools as it
provides: -
· Enterprise
Readiness: Data integration at Cloud
Scale for big data analytics!
· Enterprise
Data Readiness: There are 90+ connectors supported to get your data from any
disparate sources to the Azure cloud!
· Code-Free
Transformation: UI-driven mapping dataflows.
· Ability
to run Code on Any Azure Compute: Hands-on data transformations
· Ability
to rehost on-prem services on Azure Cloud in 3 Steps: Many SSIS packages run on
Azure cloud.
· Making
DataOps seamless: with Source control, automated deploy & simple templates.
· Secure
Data Integration: Managed virtual networks protect against data exfiltration,
which, in turn, simplifies your networking.
Data Factory contains a series of interconnected systems
that provide a complete end-to-end platform for data engineers. The below
snippet summarizes the same.
5. What are the major components of a Data Factory?
To work with Data Factory effectively, one must be aware of
below concepts/components associated with it: -
· Pipelines:
Data Factory can contain one or more pipelines, which is a logical grouping of
tasks/activities to perform a task. An activity can read data from Azure blob
storage and load it into Cosmos DB or Synapse DB for analytics while
transforming the data according to business logic. This way, one can work with
a set of activities using one entity rather than dealing with several tasks
individually.
· Activities:
Activities represent a processing step in a pipeline. For example, you might
use a copy activity to copy data between data stores. Data Factory supports
data movement, transformations, and control activities.
· Datasets: Datasets represent data structures
within the data stores, which simply point to or reference the data you want to
use in your activities as inputs or outputs.
· Linked
Service: This is more like a connection string, which will hold the information
that Data Factory can connect to various sources. In the case of reading from
Azure Blob storage, the storage-linked service will specify the connection
string to connect to the blob, and the Azure blob dataset will select the
container and folder containing the data.
· Integration
Runtime: Integration runtime instances bridged the activity and linked
Service. The linked Service or activity references it and provides the
computing environment where the activity runs or gets dispatched. This way, the
activity can be performed in the region closest to the target data stores or
compute Services in the most performant way while meeting security (no publicly
exposing data) and compliance needs.
· Data
Flows: These are objects you build visually in Data Factory, which transform
data at scale on backend Spark services. You do not need to understand
programming or Spark internals. Design your data transformation intent using
graphs (Mapping) or spreadsheets (Power query activity).
Refer to the documentation for more details: https://docs.microsoft.com/en-us/azure/data-factory/frequently-asked-questions
The below snapshot explains the relationship between
pipeline, activity, dataset, and linked service.
6. What are the different ways to execute pipelines in
Azure Data Factory?
There are three ways in which we can execute a pipeline in
Data Factory:
· Debug
mode can be helpful when trying out pipeline code and acts as a tool to test
and troubleshoot our code.
· Manual
Execution is what we do by clicking on the ‘Trigger now’ option in a pipeline.
This is useful if you want to run your pipelines on an ad-hoc basis.
· We
can schedule our pipelines at predefined times and intervals via a Trigger. As
we will see later in this article, there are three types of triggers available
in Data Factory.
7. What is the purpose of Linked services in Azure
Data Factory?
Linked services are used majorly for two purposes in Data
Factory:
1. For
a Data Store representation, i.e., any storage system like Azure Blob storage
account, a file share, or an Oracle DB/ SQL Server instance.
2. For
Compute representation, i.e., the underlying VM will execute the activity
defined in the pipeline.
8. Can you Elaborate more on Data Factory Integration
Runtime?
The Integration Runtime, or IR, is the compute
infrastructure for Azure Data Factory
pipelines. It is the bridge between activities and linked services. The
linked Service or Activity references it and provides the computing environment
where the activity is run directly or dispatched. This allows the activity to
be performed in the closest region to the target data stores or computing
Services.
The following diagram shows the location settings for Data
Factory and its integration runtimes:
Source: https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
Azure Data Factory supports three types of integration
runtime, and one should choose based on their data integration capabilities and
network environment requirements.
1. Azure
Integration Runtime: To copy data between cloud data stores and send
activity to various computing services such as SQL Server, Azure HDInsight,
etc.
2. Self-Hosted
Integration Runtime: Used for running copy activity between cloud data
stores and data stores in private networks. Self-hosted integration runtime is
software with the same code as the Azure Integration Runtime but installed on
your local system or machine over a virtual network.
3. Azure
SSIS Integration Runtime: You can run SSIS packages in a managed
environment. So, when we lift and shift SSIS packages to the data factory, we
use Azure SSIS Integration Runtime.
9. What is required to execute an SSIS package in Data
Factory?
We must create an SSIS integration runtime and an SSISDB
catalog hosted in the Azure SQL server database or Azure SQL-managed instance
before executing an SSIS package.
10. What is the limit on the number of Integration
Runtimes, if any?
Within a Data Factory, the default limit on any entities is
set to 5000, including pipelines, data sets, triggers, linked
services, Private Endpoints, and integration runtimes. If required, one can
create an online support ticket to raise the limit to a higher number.
11. What are ARM Templates in Azure Data Factory?
What are they used for?
An ARM template is a JSON (JavaScript Object Notation) file
that defines the infrastructure and configuration for the data factory
pipeline, including pipeline activities, linked services, datasets, etc. The
template will contain essentially the same code as our pipeline.
ARM templates are helpful when we want to migrate our
pipeline code to higher environments, say Production or Staging from
Development, after we are convinced that the code is working correctly.
12. How can we deploy code to higher environments
in Data Factory?
At a very high level, we can achieve this with the below set
of steps:
· Create
a feature branch that will store our code base.
· Create
a pull request to merge the code after we’re sure to the Dev branch.
· Publish
the code from the dev to generate ARM templates.
· This
can trigger an automated CI/CD DevOps pipeline to promote code to higher
environments like Staging or Production.
13. Which three activities can you run in Microsoft
Azure Data Factory?
Azure Data Factory supports three activities: data movement,
transformation, and control activities.
· Data
movement activities: As the name suggests, these activities help move data
from one place to another.
e.g., Copy Activity in Data Factory copies data from a source to a sink data
store.
· Data
transformation activities: These activities help transform the data while
we load it into the data's target or destination.
e.g., Stored Procedure, U-SQL, Azure Functions, etc.
· Control
flow activities: Control (flow) activities help control the flow of any
activity in a pipeline.
e.g., wait activity makes the pipeline wait for a specified
time.
14. What are the two types of compute environments
supported by Data Factory to execute the transform activities?
Below are the types of computing environments that Data
Factory supports for executing transformation activities: -
i) On-Demand Computing Environment: This is a fully
managed environment provided by ADF. This type of calculation creates a cluster
to perform the transformation activity and automatically deletes it when the
activity is complete.
ii) Bring Your Environment: In this environment, you
can use ADF to manage your computing environment if you already have the
infrastructure for on-premises services.
15. What are the steps involved in an ETL process?
The ETL (Extract, Transform, Load) process follows four main
steps:
i) Connect and Collect: Connect to the data source/s and
move data to local and crowdsource data storage.
ii) Data transformation using computing services such as
HDInsight, Hadoop, Spark, etc.
iii) Publish: To load data into Azure data lake storage,
Azure SQL data warehouse, Azure SQL databases, Azure Cosmos DB, etc.
iv)Monitor: Azure Data Factory has built-in support for
pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and
health panels on the Azure portal.
16. If you want to use the output by executing a
query, which activity shall you use?
Look-up activity can return the result of executing a query
or stored procedure.
The output can be a singleton value or an array of
attributes, which can be consumed in subsequent copy data activity, or any
transformation or control flow activity like ForEach activity.
17. Can we pass parameters to a pipeline run?
Yes, parameters are a first-class, top-level concept in Data
Factory. We can define parameters at the pipeline level and pass arguments as
you execute the pipeline run on demand or using a trigger.
18. Have you used Execute Notebook activity in Data
Factory? How to pass parameters to a notebook activity?
We can execute notebook activity to pass code to our
databricks cluster. We can pass parameters to a notebook activity using
the baseParameters property. If the parameters are not
defined/ specified in the activity, default values from the notebook are
executed.
19. What are some useful constructs available in
Data Factory?
· parameter:
Each activity within the pipeline can consume the parameter value
passed to the pipeline and run with the @parameter construct.
· coalesce:
We can use the @coalesce construct in the expressions to
handle null values gracefully.
· activity:
An activity output can be consumed in a subsequent activity with the
@activity construct.
20. Can we push code and have CI/CD (Continuous
Integration and Continuous Delivery) in ADF?
Data Factory fully supports CI/CD of your data pipelines
using Azure DevOps and GitHub. This allows you to develop and deliver your ETL
processes incrementally before publishing the finished product. After the raw
data has been refined into a business-ready consumable form, load the data
into Azure Data Warehouse or
Azure SQL Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your
business uses can point to from their business intelligence tools.
21. What do you mean by variables in the Azure Data
Factory?
Variables in the Azure Data Factory pipeline provide the
functionality to hold the values. They are used for a similar reason as we use
variables in any programming language and are available inside the pipeline.
Set variables and append variables are two activities used
for setting or manipulating the values of the variables. There are two types of
variables in a data factory: -
I) System variables: These are fixed variables
from the Azure pipeline. For example, pipeline name, pipeline id, trigger name,
etc. You need these to get the system information required in your use case.
ii) User variable: A user variable is declared
manually in your code based on your pipeline logic.
22. What are mapping data flows?
Mapping data flows are visually designed data
transformations in Azure Data Factory. Data flows allow data engineers to
develop a graphical data transformation logic without writing code. The
resulting data flows are executed as activities within Azure Data Factory
pipelines that use scaled-out Apache Spark clusters. Data flow activities can
be operationalized using Azure Data Factory scheduling, control flow, and
monitoring capabilities.
Mapping data flows provides an entirely visual experience
with no coding required. Data flows run on ADF-managed execution clusters for
scaled-out data processing. Azure Data Factory manages all the code
translation, path optimization, and execution of the data flow jobs.
23. What is copy activity in the Azure Data
Factory?
Copy activity is one of the most popular and universally
used activities in the Azure data factory. It is used for ETL or Lift and
Shift, where you want to move the data from one data source to another. While
you copy the data, you can also do the transformation; for example, you read
the data from the TXT/CSV file, which contains 12 columns; however, while
writing to your target data source, you want to keep only seven columns. You
can transform it and send only the required columns to the destination data
source.
24. Can you elaborate more on the Copy activity?
The copy activity performs the following steps at
high-level:
i) Read data from the source data store. (e.g., blob
storage)
ii) Perform the following tasks on the data:
· Serialization/deserialization
· Compression/decompression
· Column
mapping
iii) Write data to the destination data store or sink.
(e.g., azure data lake)
This is summarized in the below graphic:
25. What are the different activities you have used
in Azure Data Factory?
Here you can share some of the significant activities if you
have used them in your career, whether your work or college project. Here are a
few of the most used activities :
1. Copy
Data Activity to copy the data between datasets.
2. ForEach
Activity for looping.
3. Get
Metadata Activity that can provide metadata about any data source.
4. Set
Variable Activity to define and initiate variables within pipelines.
5. Lookup
Activity to do a lookup to get some values from a table/file.
6. Wait
Activity to wait for a specified amount of time before/in between the pipeline
run.
7. Validation
Activity will validate the presence of files within the dataset.
8. Web
Activity to call a custom REST endpoint from an ADF pipeline.
26. How can I schedule a pipeline?
You can use the time window or scheduler trigger to schedule
a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule
pipelines periodically or in calendar-based recurrent patterns (for example, on
Mondays at 6:00 PM and Thursdays at 9:00 PM).
Currently, the service supports three types of
triggers:
· Tumbling
window trigger: A trigger that operates on a periodic interval while retaining
a state.
· Schedule
Trigger: A trigger that invokes a pipeline on a wall-clock schedule.
· Event-Based
Trigger: A trigger that responds to an event. e.g., a file getting placed
inside a blob.
Pipelines and triggers have a many-to-many relationship (except for the
tumbling window trigger). Multiple triggers can kick off a single pipeline, or
a single trigger can kick off numerous pipelines.
27. When should you choose Azure Data Factory?
One should consider using Data Factory-
· When
working with big data, there is a need for a data warehouse to be implemented;
you might require a cloud-based integration solution like ADF for the same.
· Not
all team members are experienced in coding and may prefer graphical tools to
work with data.
· When
raw business data is stored at diverse data sources, which can be on-prem and
on the cloud, we would like to have one analytics solution like ADF to
integrate them all in one place.
· We
would like to use readily available data movement and processing solutions and
be light regarding infrastructure management. So, a managed solution like ADF
makes more sense in this case.
28. How can you access data using the other
90 dataset types in Data Factory?
The mapping data flow feature allows Azure SQL
Database, Azure Synapse Analytics,
delimited text files from Azure storage account or Azure Data Lake Storage
Gen2, and Parquet files from blob storage or Data Lake Storage Gen2 natively
for source and sink data source.
Use the Copy activity to stage data from any other
connectors and then execute a Data Flow activity to transform data after it's
been staged.
29. What is the difference between mapping and
wrangling data flow (Power query activity)?
Mapping data flows transform data at scale without requiring
coding. You can design a data transformation job in the data flow canvas by
constructing a series of transformations. Start with any number of source
transformations followed by data transformation steps. Complete your data flow
with a sink to land your results in a destination. It is excellent at mapping
and transforming data with known and unknown schemas in the sinks and sources.
Power Query Data Wrangling allows you to do agile data
preparation and exploration using the Power Query Online mashup editor at scale
via spark execution.
It supports 24 SQL data types from char, nchar to int,
bigint and timestamp, xml, etc.
Azure Data Factory Interview Questions For 4 Years
Experience
If you’re a professional with 4 years of experience in Azure
Data Factory, check out the list of these common ADF interview questions that
you may encounter during your job interview.
30. Can a value be calculated for a new
column from the existing column from mapping in ADF?
We can derive transformations in the mapping data flow to
generate a new column based on our desired logic. We can create a new derived
column or update an existing one when developing a derived one. Enter the name
of the column you're making in the Column textbox.
You can use the column dropdown to override an existing
column in your schema. Click the Enter expression textbox to start creating the
derived column’s expression. You can input or use the expression builder to
build your logic.
31. How is the lookup activity useful in the Azure
Data Factory?
In the ADF pipeline, the Lookup activity is commonly used
for configuration lookup purposes, and the source dataset is available.
Moreover, it retrieves the data from the source dataset and then sends it as
the activity output. Generally, the output of the lookup activity is further
used in the pipeline for making decisions or presenting any configuration as a
result.
Simply put, lookup activity is used for data fetching in the
ADF pipeline. The way you would use it entirely relies on your pipeline logic.
Obtaining only the first row is possible, or you can retrieve the complete rows
depending on your dataset or query.
32. Elaborate more on the Get Metadata activity in
Azure Data Factory.
The Get Metadata activity is used to retrieve the metadata
of any data in the Azure Data Factory or a Synapse pipeline. We can use the
output from the Get Metadata activity in conditional expressions to perform
validation or consume the metadata in subsequent activities.
It takes a dataset as input and returns metadata information
as output. Currently, the following connectors and the corresponding
retrievable metadata are supported. The maximum size of returned metadata
is 4 MB.
Please refer to the snapshot below for supported metadata
which can be retrieved using the Get Metadata activity.
33. How to debug an ADF pipeline?
Debugging is one of the crucial aspects of any
coding-related activity needed to test the code for any issues it might have.
It also provides an option to debug the pipeline without executing it.
Azure Data Factory Interview Questions For 5 Years
Experience
Here are some of the most likely asked Azure Data Factory
interview questions for professionals with 5 years of experience to help you
prepare for your next job interview and feel confident in showcasing your
expertise.
34. What does it mean by the breakpoint
in the ADF pipeline?
To understand better, for example, you are using three
activities in the pipeline, and now you want to debug up to the second activity
only. You can do this by placing the breakpoint at the second activity. To add
a breakpoint, click the circle present at the top of the activity.
35. What is the use of the ADF Service?
ADF primarily organizes the data copying between relational
and non-relational data sources hosted locally in data centers or the cloud.
Moreover, you can use ADF Service to transform the ingested data to fulfill
business requirements. In most Big Data solutions, ADF Service is used as an
ETL or ELT tool for data ingestion.
36. Explain the data source in the Azure data
factory.
The data source is the source or destination system that
comprises the data intended to be utilized or executed. The data type can be
binary, text, CSV, JSON, image files, video, audio, or a proper database.
Examples of data sources include Azure data lake storage, azure blob storage,
or any other database such as MySQL DB, Azure SQL database, Postgres, etc.
37. Can you share any difficulties you faced while
getting data from on-premises to Azure cloud using Data Factory?
One of the significant challenges we face while migrating
from on-prem to the cloud is throughput and speed. When we try to copy the data
using Copy activity from on-prem, the process rate could be faster, and hence
we need to get the desired throughput.
There are some configuration options for a copy activity,
which can help in tuning this process and can give desired results.
i) We should use the compression option to get the data in a
compressed mode while loading from on-prem servers, which is then de-compressed
while writing on the cloud storage.
ii) Staging area should be the first destination of our data
after we have enabled the compression. The copy activity can decompress before
writing it to the final cloud storage buckets.
iii) Degree of Copy Parallelism is another option to help
improve the migration process. This is identical to having multiple threads
processing data and can speed up the data copy process.
There is no right fit-for-all here, so we must try different
numbers like 8, 16, or 32 to see which performs well.
iv) Data Integration Unit is loosely the number of CPUs
used, and increasing it may improve the performance of the copy process.
38. How to copy multiple sheet data from an Excel
file?
When using an Excel connector within a data factory, we must
provide a sheet name from which we must load data. This approach is nuanced
when we have to deal with a single or a handful of sheets of data, but when we
have lots of sheets (say 10+), this may become a tedious task as we have to
change the hard-coded sheet name every time!
However, we can use a data factory binary data format
connector for this and point it to the Excel file and need not provide the
sheet name/s. We’ll be able to use copy activity to copy the data from all the
sheets present in the file.
39. Is it possible to have nested looping in Azure
Data Factory?
There is no direct support for nested looping in the data
factory for any looping activity (for each / until). However, we can use one
for each/until loop activity which will contain an execute pipeline activity
that can have a loop activity. This way, when we call the looping activity, it
will indirectly call another loop activity, and we'll be able to achieve nested
looping.
40. How to copy multiple tables from one datastore
to another datastore?
An efficient approach to complete this task would be:
· Maintain
a lookup table/ file containing the list of tables and their source, which
needs to be copied.
· Then,
we can use the lookup activity and each loop activity to scan through the list.
· Inside
the for each loop activity, we can use a copy activity or a mapping dataflow to copy multiple tables to the destination
datastore.
41. What are some performance-tuning techniques for
Mapping Data Flow activity?
We could consider the below set of parameters for tuning the
performance of a Mapping Data Flow activity we have in a pipeline.
i) We should leverage partitioning in the source, sink, or
transformation whenever possible. Microsoft, however, recommends using the
default partition (size 128 MB) selected by the Data Factory as it
intelligently chooses one based on our pipeline configuration.
Still, one should try out different partitions and see if
they can have improved performance.
ii) We should not use a data flow activity for each loop
activity. Instead, we have multiple files similar in structure and processing
needs. In that case, we should use a wildcard path inside the data flow
activity, enabling the processing of all the files within a folder.
iii) The recommended file format to use is ‘. parquet’. The
reason being the pipeline will execute by spinning up spark clusters, and
Parquet is the native file format for Apache Spark; thus, it will generally
give good performance.
iv) Multiple logging modes are available: Basic, Verbose,
and None.
We should only use verbose mode if essential, as it will log
all the details about each operation the activity performs. e.g., It will log
all the details of the operations performed for all our partitions. This one is
useful when troubleshooting issues with the data flow.
The basic mode will give out all the necessary basic details
in the log, so try to use this one whenever possible.
v) Try to break down a complex data flow activity into
multiple data flow activities. Let’s say we have several transformations
between source and sink, and by adding more, we think the design has become
complex. In this case, try to have it in multiple such activities, which will
give two advantages:
· All
activities will run on separate spark clusters, decreasing the run time for the
whole task.
· The
whole pipeline will be easy to understand and maintain in the future.
42. What are some of the limitations of ADF?
Azure Data Factory provides great functionalities for data
movement and transformations. However, there are some limitations as well.
i) We can’t have nested looping activities in the data
factory, and we must use some workaround if we have that sort of structure in
our pipeline. All the looping activities come under this: If, Foreach, switch,
and until activities.
ii) The lookup activity can retrieve only 5000 rows at a
time and not more than that. Again, we need to use some other loop activity
along with SQL with the limit to achieve this sort of structure in the
pipeline.
iii) We can have 40 activities in a single pipeline,
including inner activity, containers, etc. To overcome this, we should
modularize the pipelines regarding the number of datasets, activities, etc.
44. How are all the components of Azure Data Factory
combined to complete an ADF task?
The below diagram depicts how all these components can be
clubbed together to fulfill Azure Data Factory ADF tasks.
45. How do you send email notifications on pipeline
failure?
There are multiple ways to do this:
1. Using
Logic Apps with Web/Webhook activity.
Configure a logic app that, upon getting an HTTP request, can send an email to
the required set of people for failure. In the pipeline, configure the failure
option to hit the URL generated by the logic app.
2. Using
Alerts and Metrics from pipeline options.
We can set up this from the pipeline itself, where we get numerous options for
email on any activity failure within the pipeline.
46. Can we integrate Data Factory with Machine learning
data?
Yes, we can train and retrain the model on machine learning data
from the pipelines and publish it as a web service.
47. What is an Azure SQL database? Can you integrate it
with Data Factory?
Part of the Azure SQL family, Azure SQL Database is an
always up-to-date, fully managed relational database service built for the
cloud for storing data. Using the Azure data factory, we can easily design data
pipelines to read and write to SQL DB.
48. Can you host SQL Server instances on Azure?
Azure SQL Managed Instance is the intelligent, scalable
cloud database service that combines the broadest SQL Server instance or SQL
Server database engine compatibility with all the benefits of a fully managed
and evergreen platform as a service.
50. What is Azure Data Lake Analytics?
Azure Data Lake Analytics is an
on-demand analytics job service that simplifies storing data and processing big
data.
51.What are the various types of loops in ADF?
Loops in Azure Data Factory are used to iterate over a
collection of items to perform a specific action repeatedly. There are three
major types of loops in Azure Data Factory:
· For
Each Loop: This loop is used to iterate over a collection of items and
perform a specific action for each item in the collection. For example, if you
have a list of files in a folder and want to copy each file to another
location, you can use a For Each Loop to iterate over the list of files and
copy each file to the target location.
· Until
Loop: This loop repeats a set of activities until a specific condition
is met. For example, you could use an Until Loop to keep retrying an operation
until it succeeds or until a certain number of attempts have been made.
· While
Loop: This loop repeats a specific action while a condition is true.
For example, if you want to keep processing data until a specific condition is
met, you can use a While Loop to repeat the processing until the condition is
no longer true.
52. Can you list all the activities that can be performed
in ADF?
Here are some of the key activities that can be performed in
ADF:
· Data
Ingestion
· Data
Transformation
· Data
Integration
· Data
Integration
· Data
Orchestration
· Data
Enrichment
1. In the context of Azure Data Factory, what does the
term "variables" mean?
The variables that are used in the Azure Data Factory pipeline
serve this storing function. Variables can be accessed within the pipeline in
the same way that they can be used in any programming language because they are
likewise available there.
Changing or setting the values of variables can be
accomplished through the use of the Set Variable and Add Variable actions,
respectively. A data factory typically has both continuous and discrete
variables in its database.
1. These
Azure Pipeline system variables are grouped together under the heading of
System variables. The name, ID, and name of any Triggers that are used in
Pipelines, etc. These are the things that you need in order to access the
system information that might be relevant to the use case that you are working
on.
2. The
user variable is the second kind of variable, and it is the sort of variable
that is defined explicitly in your code and is driven by the pipeline logic.
2. What is a "data flow map"?
Visual data transformations are referred to as mapping data
flows when working in Azure Data Factory. Because of data flows, data engineers
can construct logic for altering data without having to write any code at all.
After the data flows have been generated, they are then implemented as
activities inside of the scaled-out Apache Spark clusters that are contained
within Azure Data Factory pipelines. The scheduling, control flow, and
monitoring elements that are currently available in Azure Data Factory can be
utilized to operationalize data flow operations.
The method of data flow mapping is highly immersive visually
and does away with the necessity for any form of scripting. The execution
clusters in which the data flows are carried out are managed by the ADF, which
enables the data to be processed in a manner that is massively parallel. Azure
Data Factory is responsible for all of the coding tasks, including the
interpretation of code, the optimization of pathways, and the execution of data
flow operations.
3. In the context of the Azure Data Factory, just what
does it mean when it's referred to as "copy activity"?
The copy operation is one of the most extensively used and
generally popular operations in the Azure data factory. The procedure that is
known as "lift and shift" is useful in situations in which it is
necessary to copy information from one database to another. You can make
modifications to the data as you copy it. For instance, before you transmit it
to your target data source, you can decide to lower the number of columns in
the source txt/csv file from 12 to 7. You can change it in such a way that the
target database receives only the required number of columns after the
transfer.
4. Could you explain to me how I should go about planning
a pipeline?
You can set up a pipeline's schedule by utilizing either the
time window trigger or the scheduler trigger. Pipelines can be programmed to
run on a timed basis on a periodic basis or in cyclical patterns according to
the wall-clock calendar schedule of the trigger (for example, on Mondays at
6:00 PM and Thursdays at 9:00 PM).
There are now three different kinds of triggers that can be
utilized with the service, and they are as follows:
1. A
state-preserving periodic trigger, the tumbling window trigger is used in this
game.
2. A
time-based trigger that triggers a specified pipeline at a time that has been
predetermined is referred to as a Schedule Trigger.
3. One
category of triggers is known as "event-based," which indicates that
they respond in some way to any occurrence that takes place, such as when a
file is copied into a blob. Pipelines and triggers are mapped to one another in
a way that is many-to-many (except for the tumbling window trigger). It is
conceivable for a single trigger to launch many pipelines, or it is possible
for numerous triggers to initiate a single pipeline. Both scenarios are viable.
5. In which situations does Azure Data Factory seem the
best option?
The utilization of the Data Factory is essential at this
time.
1. When
dealing with massive amounts of data, it is likely required to implement a
cloud-based integration solution such as ADF. This is because a data warehouse
needs to be created.
2. Not
everyone in the team is a coder, and some members may discover that graphical
interfaces make it simpler to analyze and manipulate data.
3. When
raw business data is located in many places, both on-premises and in the cloud,
we need a unified analytics solution such as ADF to analyze it all in one
place.
4. We
would like to minimize the management of our infrastructure to a minimum by
making use of methods that are widely utilized for the transportation of data
and the processing of it. Because of this, going with a managed solution such
as ADF is the choice that makes the most sense.
6. Do you have any tips on how to access the information
you require by taking advantage of the other ninety dataset types that are
accessible in the Data Factory?
Data can originate from a wide variety of Azure services,
including Azure SQL Database, Azure Synapse Analytics, delimited text files
stored in an Azure storage account or Azure Data Lake Storage Gen2, and Parquet
files stored in blob storage or Data Lake Storage Gen2. Source and sink data
can be combined. In order to transform data coming from external connectors,
first stage the data using the Copy action, and then perform an activity in the
Data Flow category.
7. Can the value of a new column in an ADF table be
determined by using an existing mapping column?
The logic that we specify can be used to generate a new
column, and this is done by deriving transformations within the mapping data
flow. When developing a derived column, we have the option of creating a
brand-new one from scratch or making changes to an existing one. You can
recognise the new column by giving it a name in the textbox labelled Column.
If you choose a different column from the menu, the one you
currently have selected will be removed from your schema. Select the textbox
that corresponds to the derived column, and then press the Enter key on your
keyboard to get started crafting an expression for it. You can either manually
enter your reasoning or make use of the expression builder to construct it.
8. Where can I find more information on the benefits of
using lookup operations in the Azure Data Factory?
In the ADF pipeline, the Lookup activity is typically
utilized for configuration lookup most of the time due to the ready
availability of the source dataset. In addition to this, the output of the
activity can be used to retrieve the data from the dataset that served as the
source. In most cases, the outcomes of a lookup operation are sent back down
the pipeline to be used as input for later phases.
In order to retrieve data, the ADF pipeline makes heavy use
of lookup operations. You may only utilize it in a manner that is appropriate
for the process you are going through. You have the option of retrieving either
the first row or all of the rows, depending on the dataset or query you choose.
9. Please provide any more information that you have on
the Azure Data Factory Get Metadata operation.
The Get Metadata operation can be used to access the
metadata associated with any piece of data that is contained within an Azure
Data Factory or a Synapse pipeline. We can use the results from the Get
Metadata activity in conditional expressions to validate or utilize the
metadata in subsequent actions. This can be done by using the Get Metadata
activity.
It takes a dataset as an input and then generates
descriptive data based on that dataset as an output. The supported connectors
and the metadata that may be retrieved for each one are outlined in the table
that can be found below. It is possible to accept metadata returns that are up
to 4 MB in size.
10. Where did you experience the most difficulty while
attempting to migrate data from on-premises to the Azure cloud via Data
Factory?
Within the context of our ongoing transition from
on-premises to cloud storage, the problems of throughput and speed have emerged
as important obstacles. When we attempt to replicate the data from on-premises
using the Copy activity, we do not achieve the throughput that we require.
The configuration variables that are available for a copy
activity make it possible to fine-tune the process and achieve the desired
results.
1. If
we load data from servers located on-premises, we should first compress it
using the available compression option before writing it to cloud storage,
where the compression will afterwards be erased.
2. After
the compression has been activated, ii) it is imperative that all of our data
be quickly sent to the staging area. Before being stored in the target cloud
storage buckets, the data that was transferred might be uncompressed for
convenience.
3. Copying
Proportion, The use of parallelism is yet another alternative that offers the
potential to make the process of transfer more seamless. This accomplishes the
same thing as employing a number of different threads to process the data and
can speed up the rate at which data is copied.
4. Because
there is no one size those fits all, we will need to try out a variety of
different values, such as 8, 16, and 32, to see which one functions the most
effective.
5. It
may be possible to hasten the duplication process by increasing the Data
Integration Unit, which is roughly comparable to the number of central
processing units.
11. Do I have the ability to copy information
simultaneously from many Excel sheets?
When using an Excel connector within a data factory, it is
necessary to specify the sheet name from which the data is going to be loaded.
This method is unobtrusive when dealing with data from just one or a few
sheets, but when dealing with data from tens of sheets or more, it can become
tiresome because the sheet name needs to be updated in the code each time.
By utilizing a data factory binary data format connector and
directing it to the Excel file, we can avoid having to manually insert the
sheet names into the spreadsheet. Using the copy action, you will be able to
simultaneously copy the information that is located on each of the sheets that
are contained within the file.
12. Nesting of loops within loops in Azure Data Factory:
yes or no?
Nesting loops are not directly supported by any of the
activities that use the for each or till looping structures in the data
factory. On the other hand, we have the option of utilizing an execute pipeline
activity that incorporates a for each/until loop activity. By carrying out the
aforementioned steps, we will be able to successfully implement nested looping
by having one loop activity call another loop activity.
13. Are there any particular limitations placed on ADF
members?
Azure Data Factory provides superior tools for transmitting
and manipulating data, and these tools can be found in its feature set.
However, you should be aware that there are certain limitations as well.
1. Because
the data factory does not allow the use of nested looping activities, any
pipeline that has such a structure will require a workaround in order to
function properly. Here is where we classify everything that has a looping
structure: actions involving the conditions if, for, and till respectively.
2. The
lookup activity is capable of retrieving a maximum of 5000 rows in a single
operation at its maximum capacity. To reiterate, in order to implement this
kind of pipeline design, we are going to need to use some additional loop
activity in conjunction with SQL with the limit.
3. It
is not possible for a pipeline to have more than forty activities in total, and
this number includes any inner activities as well as any containers. To find a
solution to this problem, pipelines ought to be modularized with regard to the
number of datasets, activities, and so on.
14. What is Data Flow Debug?
It is possible to do data flow troubleshooting in Azure Data
Factory and Synapse Analytics while simultaneously monitoring the real-time
transformation of the data shape. The versatility of the debug session is
beneficial to both the Data Flow design sessions as well as the pipeline debug
execution.
15. Is it possible to use ADF to implement CI/CD, which
stands for continuous integration and delivery?
The Data Factory provides full support for CI and CD for
your data pipelines by utilizing Azure DevOps and GitHub. As a consequence of
this, you are able to construct and roll out new versions of your ETL
procedures in stages before delivering the completed product. After the raw
data has been converted into a form that a firm can utilize, it should be
imported into an Azure Data Warehouse, Azure SQL Azure Data Lake, Azure Cosmos
DB, or another analytics engine that your organization's BI tools can reference.
This step should take place as soon as possible.
16. Which components of Data Factory's building blocks
are considered to be the most useful ones?
1. Each
activity inside the pipeline has the ability to use the @parameter construct in
order to make use of the parameter value that has been provided to the
pipeline.
2. By
making use of the @coalesce construct within the expressions, we are able to
deal with null values in a pleasant manner.
3. The
@activity construct makes it possible to make use of the results obtained from
one activity in another.
17. Do you have any prior experience with the Execute
Notebook activity in Data Factory? Does anybody have any idea how to configure
the settings for a laptop task?
Through the use of the execute notebook activity, we can
communicate with our data bricks cluster from within a notebook. We are able to
send parameters to an activity within a notebook by utilizing the
baseParameters attribute of that activity. In the event that the parameters are
not explicitly defined or specified in the activity, the notebook's default
settings are utilized.
18. Is it possible to communicate with a pipeline run by
passing information along in the form of parameters?
In Data Factory, a parameter is handled just like any other
fully-fledged top-level notion would be. The defining of parameters at the
pipeline level enables the passage of arguments during on-demand or triggered
execution of the pipeline.
19. Which activity should be performed if the goal is to
make use of the results that were acquired by performing a query?
A lookup operation can be used to acquire the result of a
query or a stored process. The end result might be a single value or an array
of attributes that can be utilized in a ForEach activity or some other control
flow or transformation function. Either way, it could be a single value.
20. How many individual steps are there in an ETL
procedure?
The ETL (Extract, Transform, Load) technique consists of
carrying out these four stages in the correct order.
1. Establishing
a link to the data source (or sources) is the initial stage. After that,
collecting the information and transferring it to either a local database or a
crowdsourcing database is the next step in the process.
2. Making
use of computational services includes activities such as transforming data by
utilizing HDInsight, Hadoop, Spark, and similar tools.
3. Send
information to an Azure service, such as a data lake, a data warehouse, a
database, Cosmos DB, or a SQL database. This step can also be accomplished by
using the Publish API.
4. To
facilitate pipeline monitoring, Azure Data Factory makes use of Azure Monitor,
API, PowerShell, Azure Monitor logs, and health panels on the Azure site.
21. How well does Data Factory support the Hadoop and
Spark computing environments when it comes to carrying out transform
operations?
The following types of computer environments are able to
carry out transformation operations with the assistance of a Data Factory:
1. The
On-Demand Computing Environment provided by I ADF is a solution that is ready
to use and includes full management. A cluster is created for the computation
to carry out the transformation, and this cluster is afterwards removed once
the transformation has been carried out.
2. Bring
Your Own Equipment: If you already possess the computer gear and software
required to deliver services on-premises, you can use ADF to manage your
computing environment in this situation.
22. How about discussing the three most important tasks
that you can complete with Microsoft Azure Data Factory?
As was discussed in the previous section's third question,
Data Factory makes it easier to carry out three processes: moving data,
transforming data, and exercising control.
1. The
operations known as data movement do precisely what their name suggests, which
is to say that they facilitate the flow of data from one point to another. For
example, information can be moved from one data store to another using Data
Factory's Copy Activity. Other data stores may also be used.
2. "Data
transformation activities" are any operations that modify data as it is
being loaded into its final destination system. Stored Procedures, U-SQL, Azure
Functions, and so on are just a few examples.
3. Control
(flow) activities, as their name suggests, are designed to help regulate the
speed of any process that is going through a pipeline. For example, selecting
the Wait action will result in the pipeline pausing for the amount of time that
was specified.
23. What is meant by the term "ARM Templates"
when referring to Azure Data Factory? Where do we plan to use them?
An ARM template is a file that uses JavaScript Object
Notation (JSON), and it is where all of the definitions for the data factory
pipeline operations, associated services, and datasets are stored. Code that is
analogous to the one used in our pipeline will be incorporated into the
template.
Once we have determined that the code for our pipeline is
operating as it should, we will be able to use ARM templates to migrate it to
higher environments, such as Production or Staging, from the Development
setting.
24. Is there a limit to the number of Integration
Runtimes that may be built or is it unlimited?
The default maximum for anything that may be contained
within a Data Factory is 5000, and this includes a pipeline, data set, trigger,
connected service, Private Endpoint, and integration runtime. You can file a
request to increase this amount through the online help desk if you find that
you require more.
25. What are the prerequisites that need to be met before
an SSIS package can be executed in Data Factory?
Setting up an SSIS integration runtime and an SSISDB
catalogue in an Azure SQL server database or an Azure SQL-managed instance is
required before an SSIS package can be executed. This can be done in either of
these locations.
Scenario-Based Questions for Azure Data Factory
1. Scenario: Your company needs to move data from an
on-premises SQL Server database to an Azure SQL Database daily. How would you
set up this data movement in Azure Data Factory?
Answer: To set up this data movement:
1. Create
a Self-hosted Integration Runtime (IR) to securely connect to the on-premises
SQL Server.
2. Create
linked services for both the on-premises SQL Server and Azure SQL Database.
3. Create
datasets for the source and destination tables.
4. Create
a pipeline with a Copy Data activity to move the data.
5. Schedule
the pipeline using a schedule trigger to run daily.
2. Scenario: You need to transform data from a CSV file
in Azure Blob Storage and load it into an Azure SQL Database. Describe how you
would accomplish this using Azure Data Factory.
Answer: To accomplish this:
1. Create
linked services for Azure Blob Storage and Azure SQL Database.
2. Create
datasets for the source CSV file and the destination SQL table.
3. Create
a pipeline with a Data Flow activity.
4. In
the Data Flow, read the data from the CSV file, apply the required
transformations, and write the transformed data to the SQL table.
5. Trigger
the pipeline as needed.
3. Scenario: Your data pipeline fails intermittently due
to network issues. How would you handle this in Azure Data Factory?
Answer: To handle intermittent pipeline
failures:
1. Configure
retry policies for the affected activities, specifying the maximum retry count
and the retry interval.
2. Use
the Set Variable activity to capture and log error details.
3. Implement
conditional activities like If Condition to retry or reroute the process based
on error types.
4. Scenario: You need to copy data from multiple CSV
files stored in an Azure Data Lake Storage Gen2 account to an Azure SQL
Database. How would you configure this in Azure Data Factory?
Answer: To configure this data movement:
1. Create
linked services for Azure Data Lake Storage Gen2 and Azure SQL Database.
2. Create
datasets for the source CSV files and the destination SQL table.
3. Use
a wildcard in the source dataset to specify multiple CSV files.
4. Create
a pipeline with a Copy Data activity to move the data from the CSV files to the
SQL table.
5. Scenario: You have a pipeline that must run only after
another pipeline completes successfully. How would you implement this in Azure
Data Factory?
Answer: To implement this dependency:
1. Use
Execute Pipeline activity to call the dependent pipeline.
2. Set
up an activity dependency to ensure that the subsequent pipeline runs only if
the previous pipeline completes successfully.
6. Scenario: Your data transformation logic involves
multiple steps, including filtering, aggregation, and joining data from two
different sources. How would you implement this in Azure Data Factory?
Answer: To implement complex data
transformations:
1. Create
linked services for the data sources.
2. Create
datasets for the input and output data.
3. Create
a pipeline with a Mapping Data Flow activity.
4. In
the Data Flow, add transformations to filter, aggregate, and join the data from
the two sources.
5. Write
the transformed data to the desired output destination.
7. Scenario: You need to incrementally load data from an
on-premises SQL Server to an Azure SQL Database. Explain how you would achieve
this in Azure Data Factory.
Answer: To achieve incremental data loading:
1. Identify
a watermark column (e.g., last modified date) in the source table.
2. Store
the last processed value of the watermark column.
3. Create
a pipeline with a Copy Data activity.
4. Use
a dynamic query in the source dataset to filter data based on the stored
watermark value.
5. Update
the watermark value after each successful load.
8. Scenario: You are tasked with integrating data from
various formats (CSV, JSON, Parquet) stored in an Azure Data Lake Storage Gen2
into a single Azure SQL Database table. Describe your approach.
Answer: To integrate data from various formats:
1. Create
linked services for Azure Data Lake Storage Gen2 and Azure SQL Database.
2. Create
datasets for each file format and the destination SQL table.
3. Create
a pipeline with multiple Copy Data activities, each handling a different file
format.
4. Use
Data Flow activities to apply necessary transformations and merge the data into
a single table.
9. Scenario: You need to implement a solution that
dynamically chooses the source and destination based on input parameters. How
would you configure this in Azure Data Factory?
Answer: To configure dynamic source and
destination selection:
1. Create
parameters in the pipeline for the source and destination.
2. Use
parameterized linked services and datasets to reference the source and
destination based on input parameters.
3. Pass
the parameter values at runtime when triggering the pipeline.
10. Scenario: Your company requires a data pipeline to
process and analyze streaming data in near real-time. Explain how you would
implement this using Azure Data Factory.
Answer: To implement near real-time data
processing:
1. Use
Azure Event Hubs or Azure IoT Hub to ingest streaming data.
2. Set
up an Azure Stream Analytics job to process the streaming data and write the
output to a data store like Azure Blob Storage or Azure SQL Database.
3. Use
Azure Data Factory to orchestrate the process, periodically running pipelines
to load and transform the processed data for further analysis.