Author : MD TAREQ HASSAN | Updated : 2022/08/09
What is Integration Runtime?
- The Integration Runtime (IR) is the underlying compute infrastructure that Azure Data Factory pipelines (and Azure Synapse pipelines) use to provide data integration capabilities
- An integration runtime provides the bridge between activities and linked services
- An integration runtime is referenced by a linked service or an activity, and provides the compute environment where the activity either runs directly or is dispatched from
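To illustrate how a linked service references an integration runtime, here is a minimal sketch that builds a linked service definition as a Python dictionary and prints it as JSON. The `connectVia` block with type `IntegrationRuntimeReference` follows the shape used in Data Factory JSON definitions. The names `ExampleSqlLinkedService` and `ExampleSelfHostedIR`, the helper function, and the connection string placeholder are hypothetical and only for illustration.

```python
import json


def linked_service_with_ir(name, service_type, type_properties, ir_name):
    """Build a linked service definition that points at an integration runtime.

    All argument values are supplied by the caller; nothing here talks to Azure.
    """
    return {
        "name": name,
        "properties": {
            "type": service_type,
            "typeProperties": type_properties,
            # The linked service names the integration runtime it connects through
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


# Hypothetical names and a placeholder connection string, for illustration only
definition = linked_service_with_ir(
    name="ExampleSqlLinkedService",
    service_type="AzureSqlDatabase",
    type_properties={"connectionString": "<connection-string-placeholder>"},
    ir_name="ExampleSelfHostedIR",
)
print(json.dumps(definition, indent=2))
```

If `connectVia` is omitted, Data Factory falls back to the default Azure integration runtime, so the reference is mainly needed for self-hosted or custom Azure integration runtimes.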
IR provides fully managed:
- Data movement (Azure IR or Self-hosted IR)
- Data flow (Execute a Data Flow in a managed Azure compute environment)
- Activity dispatch (Dispatch and monitor transformation activities running on a variety of compute services, such as Azure Databricks and Azure HDInsight)
- SSIS package execution
Integration Runtime Types
- Azure Integration Runtime
- Self-hosted Integration Runtime
- Azure-SSIS Integration Runtime (Note: Synapse pipelines currently only support Azure or self-hosted integration runtimes)
Capability
An Azure integration runtime can do:
- Data Flow (visually designed data transformation logic that runs on managed Apache Spark compute)
- Data movement (copy activities between data stores)
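- Activity dispatch (dispatch transformation activities to the target compute resource over a public network or Private Link; compute in a customer Virtual Network or on-premises network is typically reached through a self-hosted IR)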
Details: Capabilities and network support for integration runtime types:
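The capability split between the runtime types can be captured in a small lookup table. This is only a sketch: the dictionary and the helper function are hypothetical, and the "activity dispatch" entry for the self-hosted runtime is taken from Microsoft's documentation rather than from the lists above.

```python
# Capabilities per integration runtime type (illustrative lookup table)
IR_CAPABILITIES = {
    "Azure": {"data flow", "data movement", "activity dispatch"},
    # Assumption: activity dispatch for self-hosted IR comes from Microsoft's docs
    "Self-hosted": {"data movement", "activity dispatch"},
    "Azure-SSIS": {"SSIS package execution"},
}


def runtimes_supporting(capability):
    """Return the runtime types that offer the given capability, sorted by name."""
    return sorted(
        name for name, caps in IR_CAPABILITIES.items() if capability in caps
    )


print(runtimes_supporting("data movement"))  # ['Azure', 'Self-hosted']
print(runtimes_supporting("data flow"))      # ['Azure']
```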
Controlling Outbound Traffic
- Azure Integration Runtime and Self-hosted Integration Runtime:
  - In Data Factory, all ports are opened for outbound communications when utilizing Azure Integration Runtime
  - In Synapse, workspaces have options to limit outbound traffic from the managed virtual network when utilizing Azure Integration Runtime
- Azure-SSIS Integration Runtime: can be integrated with your Virtual Network to provide outbound communications controls
Virtual Network Integration and Private Link
The Integration Runtime is a software component that runs on compute infrastructure (i.e. a VM). It therefore needs a Virtual Network into which the underlying VMs are deployed
- Azure Integration Runtime
  - Compute resource from the global pool of ADF infrastructure
  - Compute resource in a dedicated ADF-managed Virtual Network
  - Provisioning the underlying compute (VM), installing the Integration Runtime, etc. is done by Microsoft
  - A managed private endpoint is created
  - A relay VM and an internal load balancer are needed (a kind of custom “Private Link Service” on the Virtual Network side, which the “managed private endpoint” connects to) so that data can be pulled from:
    - Data sources in the customer Virtual Network
    - On-premises data sources (provided that the on-premises network is connected to the customer Virtual Network via VPN or ExpressRoute)
- Self-hosted Integration Runtime
  - A VM is provisioned in a subnet of a Virtual Network (you need to provision the Virtual Network, subnet, and VM yourself)
  - The Integration Runtime is installed on the VM
    - Download the configuration script to the VM
    - Run the configuration script. It downloads the Integration Runtime software, installs it, and registers it with ADF
  - The Integration Runtime is created in the ADF management hub using the self-hosted Integration Runtime option
  - A private endpoint is created so that some of the communication (i.e. command communication) between the data factory and the Virtual Network is private
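For the self-hosted case, the runtime entry created in the ADF management hub is a small JSON definition of type `SelfHosted`. The sketch below builds that definition as a Python dictionary. The name `ExampleSelfHostedIR`, the description text, and the helper function are hypothetical, and the snippet does not create anything in Azure or register a node.

```python
import json


def self_hosted_ir_definition(name, description=""):
    """Build the definition of a self-hosted integration runtime entry.

    The VM, subnet, and the runtime installation on the VM are separate steps
    that this definition does not cover.
    """
    return {
        "name": name,
        "properties": {
            "type": "SelfHosted",
            "description": description,
        },
    }


# Hypothetical name and description, for illustration only
ir_definition = self_hosted_ir_definition(
    "ExampleSelfHostedIR",
    description="Runs on a VM in a customer-managed subnet",
)
print(json.dumps(ir_definition, indent=2))
```

Once such an entry exists, the runtime software installed on the VM is registered against it, which is what the configuration script in the steps above takes care of.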
Private Link
- Traffic that can go through Private Link (a single private endpoint can serve either of the following):
  - Command communications between the self-hosted Integration Runtime and Data Factory
  - Authoring and monitoring the data factory from your virtual network
- Traffic that cannot go through Private Link (it goes over the public internet, so firewall rules and NSG rules must allow the outbound traffic):
  - Interactive authoring that uses a self-hosted Integration Runtime, such as test connection, browse folder list and table list, get schema, etc.
  - New versions of the self-hosted Integration Runtime, which are downloaded automatically from the Microsoft Download Center if you enable auto-update
Managed virtual networks and managed private endpoints
An Integration Runtime can be provisioned within an ADF-managed virtual network. There are two ways to enable a managed virtual network:
- Enable managed virtual network during the creation of data factory
- Enable managed virtual network in integration runtime setting after creation of data factory
With a managed virtual network:
- The burden of managing the virtual network, planning the network infrastructure, etc. is offloaded to Data Factory
- Creating an integration runtime within a managed virtual network ensures the data integration process is isolated and secure
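As a rough illustration, an Azure integration runtime placed in a managed virtual network carries a reference to that network in its definition. The sketch below assumes the commonly documented JSON shape (type `Managed`, an `AutoResolve` location, and a `ManagedVirtualNetworkReference` to the network named `default`). Treat those field values as assumptions to verify against your own factory's exported JSON. The runtime name and the helper function are hypothetical.

```python
import json


def azure_ir_in_managed_vnet(name, vnet_name="default", location="AutoResolve"):
    """Build an Azure IR definition that is bound to a managed virtual network.

    Assumption: field names follow the JSON that Data Factory exports for an
    Azure integration runtime with managed virtual network enabled.
    """
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {"location": location},
            },
            # Binds the runtime to the ADF-managed virtual network
            "managedVirtualNetwork": {
                "type": "ManagedVirtualNetworkReference",
                "referenceName": vnet_name,
            },
        },
    }


# Hypothetical runtime name, for illustration only
managed_ir = azure_ir_in_managed_vnet("ExampleManagedVnetIR")
print(json.dumps(managed_ir, indent=2))
```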
Managed private endpoints
- Managed private endpoints are private endpoints created in the Data Factory managed virtual network that establish a private link to Azure resources
- Data Factory manages these private endpoints
- If the data source (or sink) is in a customer VNet or on-premises, a relay VM and a load balancer are needed on the VNet side
- https://docs.microsoft.com/en-us/azure/data-factory/tutorial-managed-virtual-network-sql-managed-instance
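A managed private endpoint is essentially a pointer from the managed virtual network to a target resource and one of its sub-resources. The sketch below builds such a definition as a Python dictionary. The field names `privateLinkResourceId` and `groupId` are assumed to match the ARM schema for Data Factory managed private endpoints, and the endpoint name, resource ID placeholders, and helper function are hypothetical.

```python
import json


def managed_private_endpoint(name, target_resource_id, group_id):
    """Build a managed private endpoint definition.

    Assumption: property names mirror the ARM resource schema; verify them
    against the template exported from your own data factory.
    """
    return {
        "name": name,
        "properties": {
            # Resource the endpoint connects to
            "privateLinkResourceId": target_resource_id,
            # Sub-resource of the target, e.g. "blob" for Blob storage
            "groupId": group_id,
        },
    }


# Placeholder resource ID and hypothetical endpoint name, for illustration only
endpoint = managed_private_endpoint(
    name="ExampleBlobEndpoint",
    target_resource_id=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
    ),
    group_id="blob",
)
print(json.dumps(endpoint, indent=2))
```

A newly created managed private endpoint normally has to be approved on the target resource before traffic can flow through it.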
Create Integration Runtime
Interactive Authoring
- Interactive authoring capabilities are used for functionalities like test connection, browse folder list and table list, get schema, and preview data
- When interactive authoring is enabled, the backend service pre-allocates compute for interactive authoring functionalities. Otherwise, compute is allocated each time an interactive operation is performed, which takes more time
- The time to live (TTL) for interactive authoring is 60 minutes by default, which means it is automatically disabled 60 minutes after the last interactive authoring operation
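The TTL rule is a sliding window measured from the last interactive operation. A minimal sketch of that arithmetic, using hypothetical helper names and example timestamps (it models the rule only and does not query Data Factory):

```python
from datetime import datetime, timedelta

# Default TTL for interactive authoring, as stated above
DEFAULT_TTL = timedelta(minutes=60)


def interactive_authoring_expiry(last_operation, ttl=DEFAULT_TTL):
    """Return the moment interactive authoring would be disabled."""
    return last_operation + ttl


def is_interactive_authoring_active(last_operation, now, ttl=DEFAULT_TTL):
    """True while `now` is still inside the TTL window of the last operation."""
    return now < interactive_authoring_expiry(last_operation, ttl)


# Example timestamps, for illustration only
last_op = datetime(2022, 8, 9, 10, 0)
print(interactive_authoring_expiry(last_op))  # 2022-08-09 11:00:00
print(is_interactive_authoring_active(last_op, datetime(2022, 8, 9, 10, 59)))  # True
print(is_interactive_authoring_active(last_op, datetime(2022, 8, 9, 11, 1)))   # False
```

Each new interactive operation resets `last_operation`, so the window keeps sliding forward while someone is actively authoring.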