Author : MD TAREQ HASSAN | Updated : 2023/01/18
Understanding ETL
- ETL stands for Extract-Transform-Load
- Wikipedia - “In computing, extract, transform, load is a three-phase process where data is extracted, transformed and loaded into an output data container. The data can be collated from one or more sources and it can also be outputted to one or more destinations”
- E (Extract):
- Connect to the data source and pull the data (structured, semi-structured, unstructured)
- Ingest data from multiple data sources of different types
- T (Transform):
- Process the data:
- Clean - remove unnecessary data
- Filter - keep only the data that makes sense for a particular context
- Refine - make the data more accurate
- Enrich - add supplementary data if required
- Etc.
- Prepare the data for further analysis (e.g. BI, ML)
- L (Load):
- Move (load) the refined/processed data to the target sink (SQL database, data warehouse, analytics engine, Analysis Services, or ML services)
- Keep the data in a place from which other services can consume it (a minimal sketch of the three phases follows this list)
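To make the three phases concrete, here is a minimal, ADF-independent sketch in Python; the CSV file, column names, and filtering rule are made-up examples:

```python
import csv
import sqlite3

# Extract: pull raw rows from a source (a hypothetical CSV export)
with open("sales_raw.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: clean (drop rows missing an amount), filter (keep one region),
# and refine (cast the amount to a number)
transformed = [
    {"order_id": r["order_id"], "amount": float(r["amount"])}
    for r in raw_rows
    if r.get("amount") and r.get("region") == "EU"
]

# Load: write the prepared rows into a target sink (here a local SQLite table)
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (order_id, amount) VALUES (:order_id, :amount)", transformed
)
conn.commit()
conn.close()
```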
See: ETL VS ELT
Capability of ADF
ADF primarily performs the following two types of data integration operations:
- Data movement from source to sink
- The source can be on-premises (e.g., SQL Server running on an on-premises machine) or a cloud service (e.g., a storage account)
- The sink is normally a cloud service (e.g., a storage account, Azure SQL, etc.)
- Data transformation (so that the result can be consumed by other services)
Data transformation (e.g., a mapping data flow) can be performed on an ADF-managed Spark cluster or dispatched to one of the following using a dispatch activity:
- Spark cluster, e.g., Azure Databricks or an HDInsight cluster
- ML service, e.g., Azure Machine Learning
Notes:
- Azure Data Factory “Mapping Data Flows” uses Apache Spark in the backend
- To dispatch an activity to a Spark cluster or ML service, a linked service to that Spark cluster or ML service must be created beforehand (see the sketch after these notes)
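As a sketch of the dispatch pattern (assuming a recent `azure-mgmt-datafactory` and `azure-identity`; the workspace URL, access token, cluster ID, notebook path, and all resource names below are placeholders), a Databricks linked service is created first and a notebook activity is then dispatched to that cluster:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    DatabricksNotebookActivity,
    LinkedServiceReference,
    LinkedServiceResource,
    PipelineResource,
    SecureString,
)

# Placeholder identifiers - replace with real values
subscription_id = "<subscription-id>"
rg_name = "<resource-group>"
df_name = "<data-factory-name>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# 1. Linked service to the Databricks workspace (must exist before dispatching)
dbx_ls = LinkedServiceResource(
    properties=AzureDatabricksLinkedService(
        domain="https://<region>.azuredatabricks.net",
        access_token=SecureString(value="<databricks-access-token>"),
        existing_cluster_id="<cluster-id>",
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "DatabricksLS", dbx_ls)

# 2. Notebook activity dispatched to that cluster via the linked service
notebook_activity = DatabricksNotebookActivity(
    name="TransformOnDatabricks",
    notebook_path="/Shared/transform-notebook",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="DatabricksLS"
    ),
)

pipeline = PipelineResource(activities=[notebook_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "DispatchPipeline", pipeline)
```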
Key components of ADF
Azure Data Factory is composed of the following key components:
- Integration Runtimes
- Linked services
- Datasets
- Pipelines
- Activities
- Data Flows
Integration Runtime
The integration runtime (IR) is the compute infrastructure that Azure Data Factory and Synapse pipelines use to provide data-integration capabilities across different network environments.
See details: ADF Integration Runtime
Linked Services
A linked service defines the connection information that's needed for ADF to connect to external resources (data source, data sink, compute resource, etc.). There are two types of linked services:
- Data store linked service (source or sink); a sketch follows this list
- Compute linked service (the compute environment or underlying compute resource for a transformation activity to run on, e.g., running a "mapping data flow" activity on Databricks)
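For example, a data store linked service for an Azure Storage account could be registered as sketched below (assuming the `azure-mgmt-datafactory` Python SDK; the subscription, resource group, factory name, and connection string are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService,
    LinkedServiceResource,
    SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Data store linked service: connection information for a storage account (placeholders)
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "StorageLinkedService", storage_ls
)
```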
See details: ADF Linked Services
Datasets
A dataset represents the data structure within a data store (source or sink); it simply points to or references the data we want to use in our activities as inputs or outputs.
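As a sketch (same SDK and placeholder names as in the linked service example above), a blob dataset that points to a file through that linked service could be defined like this:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset,
    DatasetResource,
    LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Dataset pointing at a blob folder/file through the linked service sketched above
# (container, folder, and file names are placeholders)
blob_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="StorageLinkedService"
        ),
        folder_path="input-container/raw",
        file_name="orders.csv",
    )
)
adf_client.datasets.create_or_update(
    "<resource-group>", "<data-factory-name>", "InputOrdersDataset", blob_dataset
)
```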
See details: ADF Datasets
Pipelines and Activities
- In Data Factory and Synapse pipelines, an activity defines the action to be performed on the data. It represents a processing step in a pipeline.
- A pipeline is a logical grouping of activities that performs a unit of work. Together, the activities in a pipeline perform a task (a minimal copy pipeline is sketched after this list)
- See details: ADF Pipelines and Activities
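A minimal sketch of a pipeline with a single copy activity, run on demand (assuming the `azure-mgmt-datafactory` Python SDK and that the referenced input/output datasets already exist; all names are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "<resource-group>", "<data-factory-name>"

# One activity: copy data from an input blob dataset to an output blob dataset
copy_activity = CopyActivity(
    name="CopyOrders",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputOrdersDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputOrdersDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# A pipeline is a logical grouping of activities
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update(rg_name, df_name, "CopyOrdersPipeline", pipeline)

# Kick off a run on demand and check its status
run = adf_client.pipelines.create_run(rg_name, df_name, "CopyOrdersPipeline", parameters={})
print(adf_client.pipeline_runs.get(rg_name, df_name, run.run_id).status)
```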
Managed virtual networks and managed private endpoint
The Integration Runtime is a software component that runs on compute infrastructure (i.e., VMs). Therefore, the Integration Runtime requires a virtual network into which the underlying VMs are deployed. That virtual network can be either an ADF-managed VNet (Azure IR) or a self-provisioned VNet (self-hosted IR). See details: ADF Virtual Network Integration and Private Link
Data Flow
- Data Flow is basically a “visually designed data transformation logic”
- Data Flow allows us to design/develop data transformation logic without writing code
- There is a dedicated "Data Flow" activity that can be used to execute an already designed Data Flow in a pipeline (see the sketch after this list)
- The Data Flow activity is not supported when a self-hosted integration runtime is used
- ADF runs the Data Flow activity on a managed Spark cluster
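Assuming a mapping data flow named CleanOrdersDataFlow has already been designed in the factory (a placeholder name), a pipeline could execute it via the Data Flow activity, sketched here with the `azure-mgmt-datafactory` Python SDK:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DataFlowReference,
    ExecuteDataFlowActivity,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Execute an already designed data flow named "CleanOrdersDataFlow" (placeholder)
dataflow_activity = ExecuteDataFlowActivity(
    name="RunCleanOrders",
    data_flow=DataFlowReference(
        type="DataFlowReference", reference_name="CleanOrdersDataFlow"
    ),
)

pipeline = PipelineResource(activities=[dataflow_activity])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "DataFlowPipeline", pipeline
)
```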
Mapping data flows
- Create and manage graphs of data transformation logic that you can use to transform any-sized data
- Data Factory executes "Mapping Data Flow" logic on a Spark cluster that spins up and down when we need it (we never have to manage or maintain clusters)
Triggers
- Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off
- There are different types of triggers for different types of events (a schedule trigger is sketched after this list)
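For example, a schedule trigger that kicks off a pipeline every 15 minutes could be sketched as follows (assuming the `azure-mgmt-datafactory` Python SDK; the pipeline and trigger names are placeholders):

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "<resource-group>", "<data-factory-name>"

# Kick off the (placeholder) pipeline every 15 minutes for one day
recurrence = ScheduleTriggerRecurrence(
    frequency="Minute",
    interval=15,
    start_time=datetime.now(timezone.utc),
    end_time=datetime.now(timezone.utc) + timedelta(days=1),
    time_zone="UTC",
)
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=recurrence,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    type="PipelineReference", reference_name="CopyOrdersPipeline"
                )
            )
        ],
    )
)
adf_client.triggers.create_or_update(rg_name, df_name, "Every15MinTrigger", trigger)

# The trigger must be started before it fires pipeline runs
adf_client.triggers.begin_start(rg_name, df_name, "Every15MinTrigger").result()
```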
Parameters
- Parameters are key-value pairs of read-only configuration
- Parameters are defined in the pipeline. The arguments for the defined parameters are passed during execution from the run context that was created by a trigger or a pipeline that was executed manually
- Activities within the pipeline consume the parameter values (see the sketch after this list)
- Datasets and linked services are strongly typed parameters
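A sketch of a parameterized pipeline (assuming the `azure-mgmt-datafactory` Python SDK; the Databricks linked service, notebook path, and all names are placeholders): the pipeline declares a `runDate` parameter, the activity consumes it through an expression, and the argument is supplied when the run is created:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity,
    LinkedServiceReference,
    ParameterSpecification,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg_name, df_name = "<resource-group>", "<data-factory-name>"

# The activity consumes the pipeline parameter via an expression
notebook_activity = DatabricksNotebookActivity(
    name="TransformForDate",
    notebook_path="/Shared/transform-notebook",
    base_parameters={"runDate": "@pipeline().parameters.runDate"},
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="DatabricksLS"
    ),
)

# The pipeline declares the read-only parameter
pipeline = PipelineResource(
    parameters={"runDate": ParameterSpecification(type="String")},
    activities=[notebook_activity],
)
adf_client.pipelines.create_or_update(rg_name, df_name, "ParameterizedPipeline", pipeline)

# The argument for the parameter is supplied by the run context (manual run here)
adf_client.pipelines.create_run(
    rg_name, df_name, "ParameterizedPipeline", parameters={"runDate": "2023-01-18"}
)
```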
Control flow
Control flow is an orchestration of pipeline activities that includes:
- Chaining activities in a sequence (see the sketch after this list)
- Branching
- Defining parameters at the pipeline level
- Passing arguments while invoking the pipeline on-demand or from a trigger
- Custom-state passing
- Looping containers (For-each iterators)
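A sketch of chaining (assuming the `azure-mgmt-datafactory` Python SDK and placeholder dataset/linked service names): the second activity declares a dependency on the first and runs only when it succeeds:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    BlobSink,
    BlobSource,
    CopyActivity,
    DatabricksNotebookActivity,
    DatasetReference,
    LinkedServiceReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# First activity: copy raw data into a staging dataset (placeholder datasets)
copy_raw = CopyActivity(
    name="CopyRawData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputOrdersDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagingDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Second activity: runs only after CopyRawData succeeds (chaining via depends_on)
transform = DatabricksNotebookActivity(
    name="TransformStagedData",
    notebook_path="/Shared/transform-notebook",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="DatabricksLS"
    ),
    depends_on=[
        ActivityDependency(activity="CopyRawData", dependency_conditions=["Succeeded"])
    ],
)

pipeline = PipelineResource(activities=[copy_raw, transform])
adf_client.pipelines.create_or_update(
    "<resource-group>", "<data-factory-name>", "ChainedPipeline", pipeline
)
```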