Azure Data Factory

Author : MD TAREQ HASSAN | Updated : 2022/01/30

Understanding ETL

ETL stands for Extract-Transform-Load
Wikipedia - “In computing, extract, transform, load is a three-phase process where data is extracted, transformed and loaded into an output data container. The data can be collated from one or more sources and it can also be outputted to one or more destinations”
E (Extract):
- Connect to data source and pull the data (stuctured, semi-stuctured, unstuctured)
- Ingesting data from multiple, different types of data sources
T (Transform):
- Process the data:
  - Clean - remove unnecessary data
  - Filter - keep only the data that make sense for a particular context
  - Refine - making data more accurate
  - Enrich - add extra stuff if required
  - Etc.
- Make data suitable/prepared for further analysis (i.e. BI, ML)
L (Load):
- Move (load) refined/processed data to target sink (SQL Database, Data Warehouse, Analytics Engine, Analysis Services or ML Services)
- Keep data to a place from where other service can consume it

Capability of ADF

ADF primarily does following two types of data integration operations:

Data movement from source to sink
- Source can be on-premise (i.e., SQL server running in on-premise machine) or cloud service (i.e., storage account)
- Sink is normally a cloud service (i.e., storage account, Azure SQL etc.)
Data transformation (so that the result can be comsumed by other services)

Data transformation (i.e. mapping data flow) can be performed on ADF-managed Spark cluster (Databricks) or dispatched to following using dispatch activity:

Spark cluster i.e. Databricks, HDInsights cluster
ML service i.e. Azure Machine learning

Notes:

Azure Data Factory “Mapping Data Flows” uses Apache Spark in the backend
To dispatch an activity to Spark cluster or ML service, a linked service to that Spark cluster or ML service must be created beforehand

Key components of ADF

Azure Data Factory is composed of following key components:

Integration Runtimes
Linked services
Datasets
Pipelines
Activities
Data Flows

Integration Runtime

The integration runtime (IR) is the compute infrastructure that Azure Data Factory and Synapse pipelines use to provide data-integration capabilities across different network environments.

See details: ADF Integration Runtime

Linked Services

A linked service defines the connection information that’s needed for ADF to connect to the external resources (data source, data sink, compute resource etc.). There are two types of linked service:

Data store linked service (source or sink)
Compute linked service (compute environemnt or underlaying compute resource for transformation activity to run on i.e. running “mapping data flow” activity on Databricks)

See details: ADF Linked Services

Datasets

Datasets represent data structures within the data stores (data source or data sink), which simply point to or reference the data we want to use in our activities as inputs or outputs. A dataset represents the structure of data in a source data store or sink data store.

See details: ADF Datasets

Pipelines and Activities

In Data Factory and Synapse pipelines, an activity defines the action to be performed on the data. It represents a processing step in a pipeline.
A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task
See details: ADF Pipelines and Activities

Managed virtual networks and managed private endpoint

The Integration Runtime is a software component that runs on a conpute infrastructure (i.e. VM). Therefore Integration Runtime requires Virtual Network to which underlaying VMs will be deployed. That virtual network can be either ADF-managed VNet (Azure IR) or self provisioned VNet (Self-hosted IR). See details: ADF Virtual Network Integration and Private Link

Data Flow

Data Flow is basically a “visually designed data transformation logic”
Data Flow allows us to design/develop data transformation logic without writing code
There are dedicated “Data Flow Activity” and that Data Flow activity can be used to execute already designed “Data Flow” in pipeline
- Data Flow activity is not supported if self-hosted integration runtime is used
- ADF runs Data Flow activity in a managed Spark pool

Mapping data flows

Create and manage graphs of data transformation logic that you can use to transform any-sized data
Data Factory will execute “Mapping Data flow” logic on a Spark cluster that spins-up and spins-down when we need it (we won’t ever have to manage or maintain clusters)

Triggers

Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off
There are different types of triggers for different types of events.

Parameters

Parameters are key-value pairs of read-only configuration
Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or a pipeline that was executed manually
Activities within the pipeline consume the parameter values
Dataset and linked service are strongly typed parameters

Control flow

Control flow is an orchestration of pipeline activities that includes:

Chaining activities in a sequence
Branching
Defining parameters at the pipeline level
Passing arguments while invoking the pipeline on-demand or from a trigger
Custom-state passing
Looping containers (For-each iterators)