Overview of Azure Data Factory and Apache Airflow
Data-driven decision-making allows organizations to make strategic decisions and act on them at the right time, in line with their objectives and goals. Organizations generate petabytes of data, yet many still struggle to collect it, build pipelines, and manage or monitor them before they can extract the patterns and insights needed to make decisions. Azure Data Factory and Apache Airflow automate these jobs and make them easy to monitor.
What is Azure Data Factory?
Azure Data Factory (ADF) is a data integration and migration service. It is a fully managed, serverless data ingestion solution for ingesting, preparing, and transforming data at scale. Microsoft offers ADF within Azure for constructing ETL and ELT pipelines. It creates automated data pipelines that perform these processes without manual intervention, thus reducing manual tasks.
What are the advantages of Azure Data Factory?
The advantages of Azure Data Factory are given below:
- Easy to use: It rehosts and extends SSIS in a few clicks, which helps modernize SSIS and makes it easy to move SSIS packages to the cloud. Moreover, it builds code-free ETL and ELT pipelines with built-in Git and CI/CD support.
- Cost-effective: ADF uses pay-as-you-go pricing. It is a fully managed serverless cloud service that scales on demand.
- Powerful: It has 90 built-in connectors for ingesting data from on-premises and software as a service (SaaS) sources. Data pipelines can be prepared and monitored code-free at scale.
- Intelligent: Autonomous ETL unlocks operational efficiencies and enables citizen integrators.
What is Apache Airflow?
Apache Airflow is a platform for building, running, and managing workflows. It represents a workflow as a directed acyclic graph (DAG) of operations called tasks, where each edge represents a logical dependency between two tasks.
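To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library rather than Airflow itself. The task names (`extract`, `clean`, `validate`, `load`) and the helper functions are hypothetical; the point is only that edges encode dependencies, that any valid run order respects them, and that a graph with a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each key is a task, and its value is the set of
# tasks that must finish first (the incoming edges of the DAG).
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}


def run_order(deps):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency graph has no cycle, i.e. it is a valid DAG."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return False
    return True


order = run_order(dependencies)
print(order)                                   # "extract" first, "load" last
print(is_acyclic(dependencies))                # the workflow above is a DAG
print(is_acyclic({"a": {"b"}, "b": {"a"}}))    # a two-task cycle is not
```

The relative order of `clean` and `validate` is not fixed, because neither depends on the other; a scheduler is free to run such tasks in parallel.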
An Airflow installation consists of the following components:
- Scheduler: It handles triggering scheduled workflows and submitting tasks to the executor to run.
- Executor: It handles the running of tasks. By default it runs everything inside the scheduler, but most production-suitable executors push task execution out to workers.
- Web server: It presents a handy user interface to inspect, trigger, and debug the behavior of DAGs and tasks.
- DAG files: A folder of DAG files that is read by the scheduler and executor.
- Metadata database: It is used by the scheduler, executor, and web server to store state.
What are the advantages of Apache Airflow?
The advantages of Apache Airflow are described below:
- Open source: Apache Airflow is an open-source project, so improvements can be made quickly, without barriers or prolonged procedures.
- Easy to use: Anyone with Python knowledge can deploy a workflow. It can be used to transfer data, manage infrastructure, build ML models, and more.
- Robust Integrations: It offers plug-and-play operators that can be used to execute tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure, and other third-party services. This capability makes Airflow easy to apply to current infrastructure and extends to next-generation technologies.
Why do we need Apache Airflow and Azure Data Factory?
As organizations move to the cloud and adopt big data, data integration and migration will remain essential across industries. ADF addresses both efficiently, which lets teams focus on their data while scheduling, monitoring, and managing ETL/ELT pipelines from a single view.
Let’s discuss some reasons why the adoption of Azure Data Factory is on the rise:
- Drive more value
- Improve business process outcomes
- Reduce overhead expenses
- Enable better decision-making
- Increase business process agility
- Keep processes cost-effective
How do Apache Airflow and Azure Data Factory help businesses?
The following customer stories illustrate how ADF and Airflow changed these organizations' businesses and helped them reach their goals:
Apache Airflow
Case 1:
Problem: The organization needed workflow orchestration for several game development tasks. It had no suitable tool with built-in functions, so it had to orchestrate each process manually and from scratch every time. As a result, managing dependencies and monitoring complex workflows became increasingly difficult. The team needed a centralized tool that showed logs, retries, and run times in one place. Moreover, it had no way to backfill historical data or restart failed tasks.
Solution: Airflow provides built-in solutions as well as integrations. With its wide range of features, Apache Airflow simplifies building complex workflows. Modeling workflows as DAGs helps avoid errors and encourages common patterns. It allows the organization to run game development processes efficiently, such as processing messages to support the team, working with churn rate, sorting bank offers, and other similar tasks.
Case 2:
Problem: Big data systems require sophisticated data pipelines that connect to a variety of backend services in order to support complex operations. These workflows must be deployed, monitored, and executed regularly or in response to external events. The organization's Experience Platform component services designed and developed an orchestration service that allows users to author, schedule, and monitor complex hierarchical workflows for Apache Spark and non-Spark jobs. Working with and managing this many applications created several issues because of the complexity involved.
Solution: Apache Airflow allows the organization's Experience Platform to provide a smooth orchestration service that meets customer requirements. The service is built on the guiding principle of leveraging an off-the-shelf, open-source orchestration engine that is abstracted from other services via an API and extendable to any application via a pluggable framework. The platform uses the Apache Airflow execution engine to schedule and execute various workflows. Moreover, it provides insight into those workflows.
ADF
Case 3:
Problem: The organization creates a SaaS data solution that other organizations can use to make transformative, data-driven decisions. As the data warehouse grew, maintaining existing data increasingly required updates to accommodate changes to the data feeds. Continually updating ETL processes and data models was a big maintenance effort, so a more intelligent approach was needed. To solve this problem, the organization uses Microsoft technologies that automatically generate data warehouses and perform ETL processes to customer specifications. This has drastically reduced development cost and time.
What are the key features of Apache Airflow and Azure Data Factory?
Feature | Azure Data Factory | Apache Airflow |
Focus | ETL | Orchestration, scheduling, workflows |
Database replication | Full table; Incremental via custom “SELECT” query | Only via plugins |
SaaS sources | About 20, with several more in preview | Only via plugins |
Ability to add new data sources | No | Yes |
Connects to data warehouses / Data lakes? | Yes/Yes | Yes/Yes |
Support SLAs | Yes | No |
Compliance, governance, and security certifications | HIPAA, GDPR, ISO 27001, others | None |
Data sharing | No | Yes, via plugins |
Developer tools | REST API, .NET and Python SDKs | REST API (stable since Airflow 2.0; experimental in earlier versions) |
Apache Airflow Vs. Azure Data Factory: Comparison
Let’s take a deep dive and compare ADF and Airflow on several features:
Transformations
- Azure Data Factory: It supports both pre- and post-load transformations with a wide range of transformation functions. Transformations can be applied through the GUI or through Power Query Online, which requires coding.
- Apache Airflow: It is a tool for authoring, scheduling, and monitoring workflows as directed acyclic graphs (DAGs) of tasks. A DAG is a topological representation of how data flows within a system. Apache Airflow manages the execution dependencies among jobs in a DAG and supports job failures, retries, and alerts. Data can be transformed as an action in the workflow using Python.
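As an illustration of a transformation expressed in Python, together with the retry behavior mentioned above, here is a plain-Python sketch that does not import Airflow. The `transform` function, the sample rows, and the `run_with_retries` helper are all hypothetical. In an actual Airflow deployment the function would typically be wrapped as a task, and retries would be configured on the task (for example through its `retries` argument) instead of being hand-written like this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], retries: int = 2, delay_seconds: float = 0.0) -> T:
    """Call task(); on failure, retry up to `retries` more times before re-raising."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            time.sleep(delay_seconds)


def transform(rows):
    """Hypothetical transformation: drop rows without an id and normalize names."""
    return [
        {"id": row["id"], "name": row["name"].strip().title()}
        for row in rows
        if row.get("id") is not None
    ]


raw_rows = [
    {"id": 1, "name": "  alice "},
    {"id": None, "name": "ghost"},
    {"id": 2, "name": "BOB"},
]

clean_rows = run_with_retries(lambda: transform(raw_rows))
print(clean_rows)
```

Keeping the transformation a pure function of its input makes it easy to test outside the orchestrator and safe to re-run when a retry happens.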
Connectors: Data sources and Destinations
These tools support a variety of data sources and destinations.
- Azure Data Factory: ADF can integrate with about 80 data sources, including SaaS platforms, SQL and NoSQL databases, generic protocols, and several file types. Moreover, it supports approximately 20 cloud and on-premises data warehouses and database destinations.
- Apache Airflow: Apache Airflow orchestrates workflows that extract, transform, load, and store data. It runs tasks, which are sets of activities, via operators, which are templates for tasks that can be Python functions or scripts. Operators can be created for any source or destination. Moreover, it supports plugins that implement operators and hooks (interfaces to external platforms). It has some built-in plugins for databases and SaaS platforms.
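The split between hooks (interfaces to an external platform) and operators (reusable task templates) can be sketched in plain Python. The classes below are hypothetical stand-ins and are not Airflow classes; a real custom operator would subclass Airflow's `BaseOperator` and a real hook would wrap an actual connection. The in-memory dictionary plays the role of the external system.

```python
class InMemoryHook:
    """Hypothetical hook: hides how the external system is reached (here, a dict)."""

    def __init__(self, store):
        self.store = store

    def read(self, table):
        return list(self.store.get(table, []))

    def write(self, table, rows):
        self.store.setdefault(table, []).extend(rows)


class CopyTableOperator:
    """Hypothetical operator: a task template that is parameterized per task."""

    def __init__(self, task_id, hook, source, destination):
        self.task_id = task_id
        self.hook = hook
        self.source = source
        self.destination = destination

    def execute(self):
        # The operator holds the task logic; the hook holds the I/O details.
        rows = self.hook.read(self.source)
        self.hook.write(self.destination, rows)
        return len(rows)


store = {"staging_orders": [{"order_id": 1}, {"order_id": 2}]}
copy_orders = CopyTableOperator(
    task_id="copy_orders",
    hook=InMemoryHook(store),
    source="staging_orders",
    destination="orders",
)
copied = copy_orders.execute()
print(copied, store["orders"])
```

Because the operator only talks to the hook, the same template could be pointed at a different source or destination by swapping the hook, which is the idea behind writing operators for any platform.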
Support, documentation, and training
Working with data integration services can be complex, so both tools offer support through documentation, forums, and training.
- Azure Data Factory: ADF provides support through an online request form and forums, and customers can also reach support by phone and email. It offers comprehensive official documentation and digital training materials.
- Apache Airflow: Apache Airflow offers documentation with a quick start and how-to guides. It also has a Slack community and provides tutorials on its official website.
Pricing
Azure Data Factory
Azure Data Factory v1: The pricing for Data Factory usage is calculated based on the following factors:
- Frequency of activities: Activities are classed as low or high frequency. A low-frequency activity executes no more than once a day, whereas a high-frequency activity executes more than once a day.
- Pipeline activity: Whether the pipeline is active or inactive.
- Place where the activity runs: Whether the activity runs in the cloud or on-premises.
- Re-running activities: Activities can be re-run, and the cost of a re-run depends on where the activity runs.
Azure Data Factory v2: The pricing of data pipelines is calculated based on the following factors:
- Pipeline orchestration and execution
- Data flow execution and debugging
- Number of Data Factory operations, such as creating and monitoring pipelines
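The three v2 factors combine additively, which a small estimate function can show. Everything numeric here is an assumption: the rates are made-up placeholders and NOT real Azure prices, and the billing units (per 1,000 activity runs, per vCore-hour, per 50,000 operations) are assumed for illustration. Check the official pricing page before relying on any figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceholderRates:
    """Made-up unit prices for illustration only; NOT real Azure prices."""
    per_1000_activity_runs: float = 1.00     # pipeline orchestration and execution
    per_data_flow_vcore_hour: float = 0.25   # data flow execution and debugging
    per_50000_operations: float = 0.50       # Data Factory operations


def estimate_monthly_cost(activity_runs, data_flow_vcore_hours, operations,
                          rates=PlaceholderRates()):
    """Sum the three cost factors and return each part plus the total."""
    orchestration = activity_runs / 1000 * rates.per_1000_activity_runs
    data_flow = data_flow_vcore_hours * rates.per_data_flow_vcore_hour
    ops = operations / 50000 * rates.per_50000_operations
    return {
        "orchestration": orchestration,
        "data_flow": data_flow,
        "operations": ops,
        "total": orchestration + data_flow + ops,
    }


# Hypothetical monthly usage figures.
estimate = estimate_monthly_cost(
    activity_runs=12_000,
    data_flow_vcore_hours=40,
    operations=100_000,
)
print(estimate)
```

With these placeholder numbers the parts are 12.0, 10.0, and 1.0, so the total is 23.0 in whatever currency the rates are quoted in.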
Apache Airflow
Apache Airflow is free and open source, licensed under the Apache License 2.0. However, deploying Airflow to a robust and secure production environment has always been challenging. Therefore, several companies, consultants, and cloud services, such as AWS, Google, and Astronomer, offer enterprise support for deploying and managing Airflow environments, and pricing varies by provider. The pricing table of AWS is shown below.
Azure Data Factory and Airflow Together
ADF is commonly used for constructing pipelines and jobs without writing much code. It integrates easily with on-premises data sources and Azure services. However, it has some limitations when used alone:
- Custom tools are hard to build and integrate.
- Integration with services outside Azure is limited.
- Orchestration capabilities are limited.
- Custom packages and dependencies are complex to manage.
Conclusion
This is where Airflow helps overcome these limitations. ADF and Airflow can be used together to leverage the best of both tools. ADF jobs can be run from an Airflow DAG, which brings the full capabilities of Airflow orchestration to ADF. Organizations can thus write their jobs comfortably in ADF and use Airflow as the control plane for orchestration.
Airflow's main building blocks, hooks and operators, can easily interact with ADF and execute its pipelines.
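As a rough sketch of what running an ADF pipeline from an Airflow DAG can look like, the example below uses the `AzureDataFactoryRunPipelineOperator` from the Microsoft Azure provider package. It assumes Airflow 2.4 or later with `apache-airflow-providers-microsoft-azure` installed and an Airflow connection already configured. The resource group, factory, pipeline, and DAG names are hypothetical placeholders. The Airflow imports sit inside the function so that the snippet can be read and loaded without Airflow present.

```python
from datetime import datetime

# Hypothetical identifiers: replace with values from your own Azure setup.
ADF_SETTINGS = {
    "azure_data_factory_conn_id": "azure_data_factory_default",
    "resource_group_name": "example-resource-group",
    "factory_name": "example-data-factory",
    "pipeline_name": "example_copy_pipeline",
}


def build_dag():
    """Build a DAG with a single task that triggers an ADF pipeline run."""
    # Local imports: only needed when the DAG is actually built.
    from airflow import DAG
    from airflow.providers.microsoft.azure.operators.data_factory import (
        AzureDataFactoryRunPipelineOperator,
    )

    with DAG(
        dag_id="run_adf_pipeline_example",
        start_date=datetime(2024, 1, 1),
        schedule=None,   # no schedule: trigger manually (Airflow 2.4+ argument)
        catchup=False,
    ) as dag:
        AzureDataFactoryRunPipelineOperator(
            task_id="run_adf_pipeline",
            wait_for_termination=True,  # block until the ADF run finishes
            **ADF_SETTINGS,
        )
    return dag
```

In a DAG file, assigning `dag = build_dag()` at module level lets the scheduler discover it. Other Airflow tasks can then be chained before or after the ADF run, which is how Airflow acts as the control plane while ADF does the data movement.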