Scheduling with Airflow: the first DAG

I went back and forth for a long time on whether to write anything about Airflow, and how to write it. The official documentation already gives a fairly detailed introduction, and there are all kinds of blog posts out there, so do I really need a set of notes of my own?

The answer starts with this article.

This article approaches Airflow from the perspective of a complete newcomer, and along the way sketches, step by step, how to build up our data scheduling system.

It is early September 2019, and the latest version of Airflow is 1.10.5.

PS: while digging through material I found that many of my articles had been scraped by other sites without credit, so I will occasionally drop small anti-plagiarism markers into the text; they can be ignored.

What is a data scheduling system?

A concept that has been quite hot lately is the "data middle platform", and there are articles dedicated to pinning down what it actually is.
My rough understanding is this: gather the scattered data from various sources, standardize it, and wrap it as services, so that a unified data service can be offered. Organizing and processing data inevitably involves scheduling, which requires a scheduling system. [This article comes from Ryan Miao]
A data scheduling system can synchronize heterogeneous data between different systems and run data processing tasks according to a plan. Airflow is exactly such a task scheduling platform.

In the previous post, the Airflow 1.10.4 introduction and installation, we already
installed Airflow, so it is ready to use. Now for the first DAG task chain.

Creating a Hello World task

Goal: run a task at 8:00 every morning that prints Hello World.

On Linux, we could add an entry to the crontab:
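A minimal entry might look like the following sketch; the script path is only a placeholder, the point is the cron expression that fires every day at 08:00.

    0 8 * * * /path/to/print_hello.sh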

With Spring Boot, we could use @Scheduled(cron = "0 0 8 * * ?") to run a method on a schedule.

With Quartz, we could create a CronTrigger and then fire the corresponding JobDetail:

 CronTrigger trigger = (CronTrigger)TriggerBuilder.newTrigger()
            .withIdentity("trigger1", "group1")
            .withSchedule(CronScheduleBuilder.cronSchedule("0 0 8 * * ?"))
            .build();

With Airflow, it is much the same.

In docker-airflow, we mount the dags directory as a volume, so a new DAG only needs to be written into that directory:

 volumes:
            - ./dags:/usr/local/airflow/dags

Create hello.py:

"""
Airflow的第一个DAG
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime


default_args = {
    "owner": "ryan.miao",
    "start_date": datetime(2019, 9, 1)
}

dag = DAG("Hello-World", 
        description="第一个DAG",
        default_args=default_args, 
        schedule_interval='0 8 * * *')

t1 = BashOperator(task_id="hello", bash_command="echo 'Hello World, today is {{ ds }}'", dag=dag)

This is an ordinary Python script that defines two main objects: a DAG and a task.

DAG

A DAG represents a directed acyclic graph, i.e. a chain of tasks, and its id must be globally unique. The DAG is the core concept of Airflow: tasks are loaded into a DAG, which packages them into a dependency chain. The DAG also determines the rules for running those tasks, such as the schedule set here: starting from September 1, run at 8:00 every day.

TASK

A task represents a concrete piece of work, and its id must be unique within the DAG. A DAG can hold different types of tasks, distinguished by different Operator plugins. Here we use a BashOperator, one of Airflow's built-in operators; Airflow ships with many operators that can be used out of the box.
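As a small illustration of another operator type, here is a sketch (not from the original post; the function name say_hello and the task id hello_python are made up) of adding a second task to the same dag object with a PythonOperator, using the Airflow 1.10 import path:

from airflow.operators.python_operator import PythonOperator  # 1.10.x import path


def say_hello(ds, **kwargs):
    # ds is the execution date string that Airflow passes in via the context
    print("Hello from Python, today is " + ds)


# hypothetical second task attached to the same dag object as above
t2 = PythonOperator(
    task_id="hello_python",
    python_callable=say_hello,
    provide_context=True,  # in 1.10.x this passes ds and other context variables to the callable
    dag=dag,
)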

ds

ds is a built-in template variable in Airflow. When the operator is rendered, the current execution date is injected into the string. A later post will be dedicated to the execution date.

[This article comes from Ryan Miao]

Deploying the DAG

Upload the hello.py above to the dags directory. Airflow automatically detects and parses .py files and registers the DAG definitions they contain in its database.

Open the Airflow web UI and refresh; our DAG shows up.

Switch the DAG on and click into its detail page; you can see that yesterday's task has already been executed.

Click the task instance, then click View Log to see the log.

Our task ran on this machine and printed hello; note the date that was printed.

This task is the basic unit of Airflow, and it will keep running at 8:00 every day.

Understanding the concepts behind a scheduling system

Task Definition

The task definition describes what the task actually does; here it prints Hello World, today is {{ ds }}.

Task instances

A task runs on a schedule, and each run produces a task instance: the triple dag - task - execution date identifies one instance. A task instance is the binding of a task to a concrete execution time. In this demo, one task instance is generated per day.

Execution Date

Today is 2019-09-07, yet the execution date printed in our task log is 2019-09-06.

The execution date is the business time represented by a task instance; we usually call it the execution date or bizdate, similar to a partition in a Hive table.

Why does a task that runs today carry yesterday as its task time?

Because a task instance covers a period of time. Take computing daily page views: only once the 6th has completely passed can we total up the visits for the 6th. The earliest this task can run is after 00:00 on the 7th, computing the visits between 00:00 on the 6th and 00:00 on the 7th. So the task time stands for the time of the data being processed, which is the 6th. The actual run time is not fixed; it could be the 7th or even the 8th, as long as the interval the task computes over is the 6th.

So in a scheduling system, ds (the execution date) is usually a past period: the current run processes the previous cycle's data.

Task dependencies

The most typical task model is ETL (Extract, Transform, Load: extracting, transforming, and loading data), which splits the work into at least three steps. For the daily page view goal, I have to extract the access logs, pick out the fields that count as a visit, and then accumulate them. These three tasks have an order: the earlier one must finish before the later one can run. This is a task dependency. In Airflow, dependencies between different tasks are expressed by wiring the related tasks together.
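As a sketch only (the task ids extract, transform and load are invented for illustration), such a chain can be wired up with the >> operator on the same dag object:

# hypothetical three-step ETL chain: extract must succeed before transform, and so on
extract = BashOperator(task_id="extract", bash_command="echo extract logs", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo pick visit fields", dag=dag)
load = BashOperator(task_id="load", bash_command="echo accumulate counts", dag=dag)

extract >> transform >> load  # equivalent to calling set_downstream() pairwise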

A task can also depend on itself across time. For example, to compute the number of new users, I need both the day before yesterday's data and yesterday's data in order to work out the increment. So today's run must depend on the state of yesterday's run; in Airflow this is controlled by setting depends_on_past.
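A sketch of what that might look like on a single task (the task id new_users is hypothetical); depends_on_past can also be set for every task at once via default_args:

# hypothetical task that waits for its own previous run to succeed before starting
new_users = BashOperator(
    task_id="new_users",
    bash_command="echo compute new users for {{ ds }}",
    depends_on_past=True,  # today's instance only runs if yesterday's instance succeeded
    dag=dag,
)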

Backfilling tasks

Airflow has a feature called backfill, which runs tasks for past time periods. We usually call this a catch-up or backfill run: computing data that was not computed before.

Our tasks run forward in time: if I create a task today that computes daily users, then tomorrow it will produce today's numbers. But what if I want the daily user growth for the past month?

If you were writing your own code, you would simply query the data over that date range and compute it. But a scheduled task is fixed: we can only create task instances for different dates and run them date by date. Backfill is exactly what implements this.
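In Airflow 1.10 a backfill can also be triggered from the command line with the backfill command; a rough sketch, with an arbitrarily chosen date range for our Hello-World DAG:

    airflow backfill -s 2019-08-01 -e 2019-08-31 Hello-World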

Re-running a task

That is, making a task that has already run execute again.

Sometimes we need to re-run a task. For example, with ETL tasks, today we suddenly discover that yesterday's extraction task had a problem: one app's data was not extracted, so the user counts computed afterwards are inaccurate. We need to re-extract and recompute.

In Airflow, click the Clear button on the task instance to delete it; the scheduler will then create and execute that instance again.
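The same can be done from the command line with the clear command; a sketch, assuming we want to redo the 2019-09-06 instance of our Hello-World DAG:

    airflow clear -s 2019-09-06 -e 2019-09-06 Hello-World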

As for how the scheduler implements this logic, we will get a chance to dig into the source code later.

Postscript

This article does not go very deep into the task itself; it just introduces Hello World and gets it running first. Next we will keep improving our DAG.


Origin www.cnblogs.com/woshimrf/p/airflow-first-dag.html