The Linux Kernel sync Flow

When a process writes to a file, the data is not written straight to the storage device; it first lands in the page cache. The filesystem periodically flushes dirty pages to the storage device, and a process can also force dirty pages back to storage with a call such as sync.
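Before diving into the kernel side, a minimal user-space sketch (the file name and contents are illustrative) shows both flush paths: write() only dirties the page cache, fsync() flushes one file, and sync() flushes everything:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, "hello\n", 6) != 6)   /* data lands in the page cache */
        perror("write");
    fsync(fd);    /* block until this file's dirty pages reach storage */
    close(fd);
    sync();       /* ask the kernel to flush all dirty data system-wide */
    return 0;
}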

Data Structures

backing_dev_info

To understand this structure, start with the problem it solves. Early versions of Linux used pdflush threads to write "dirty" pages back to disk. The pdflush threads were "generic": every request queue in the system shared this limited pool of threads, so with multiple queues there could be contention for them, degrading I/O throughput.

Ideally, each request queue would have its own flush thread; that is what backing_dev_info provides. Each request queue manages its own writeback thread through this structure. Its main fields are:

include/linux/backing-dev-defs.h
struct backing_dev_info {
........
    struct list_head bdi_list;
.......
    struct bdi_writeback wb;  /* the root writeback info for this bdi */
......
};

bdi_list: links this backing_dev_info into the global bdi_list

wb: the core member that drives writeback behavior; described in detail below

bdi_writeback

bdi_writeback is defined as follows:

struct bdi_writeback {
...
    struct list_head b_dirty;   /* dirty inodes */
    struct list_head b_io;      /* parked for writeback */
...
    struct delayed_work dwork;  /* work item used for writeback */
    struct list_head work_list;
...
};

b_dirty: holds all of the filesystem's dirty inodes

b_io: holds the inodes that are about to be written back to the storage device

dwork: the delayed work item that writes dirty pages back to storage; its handler is wb_workfn

work_list: each writeback request is packaged as a work item and linked onto this list; a trimmed sketch of how that happens follows
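For context, queueing a writeback request onto work_list is done by wb_queue_work() in fs/fs-writeback.c. A trimmed sketch (details vary across kernel versions): it links the work onto work_list and kicks dwork so that wb_workfn runs as soon as possible.

static void wb_queue_work(struct bdi_writeback *wb,
                          struct wb_writeback_work *work)
{
    spin_lock_bh(&wb->work_lock);
    if (test_bit(WB_registered, &wb->state)) {
        list_add_tail(&work->list, &wb->work_list); /* onto work_list */
        mod_delayed_work(bdi_wq, &wb->dwork, 0);    /* run wb_workfn now */
    } else {
        finish_writeback_work(wb, work);            /* wb shutting down */
    }
    spin_unlock_bh(&wb->work_lock);
}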

mark inode dirty

When an inode's attributes or data change, the inode must be marked dirty and placed on the bdi_writeback's b_dirty list. The main function is __mark_inode_dirty:

fs/fs-writeback.c
void __mark_inode_dirty(struct inode *inode, int flags)
{
    struct super_block *sb = inode->i_sb;
     ...
    /*
     * Paired with smp_mb() in __writeback_single_inode() for the
     * following lockless i_state test.  See there for details.
     */
    smp_mb();

    if (((inode->i_state & flags) == flags) ||
        (dirtytime && (inode->i_state & I_DIRTY_INODE)))   /*       1      */
        return;

    if ((inode->i_state & flags) != flags) {
        const int was_dirty = inode->i_state & I_DIRTY;

        inode_attach_wb(inode, NULL);

        ....
        inode->i_state |= flags;                        /*       2      */

        .....

        /*
         * If the inode was already on b_dirty/b_io/b_more_io, don't
         * reposition it (that would break b_dirty time-ordering).
         */
        if (!was_dirty) {
            struct bdi_writeback *wb;
            struct list_head *dirty_list;
            bool wakeup_bdi = false;

            wb = locked_inode_to_wb_and_lock_list(inode);

            WARN((wb->bdi->capabilities & BDI_CAP_WRITEBACK) &&
                 !test_bit(WB_registered, &wb->state),
                 "bdi-%s not registered\n", bdi_dev_name(wb->bdi));

            inode->dirtied_when = jiffies;
            if (dirtytime)
                inode->dirtied_time_when = jiffies;

            if (inode->i_state & I_DIRTY)
                dirty_list = &wb->b_dirty;
            else
                dirty_list = &wb->b_dirty_time;

            wakeup_bdi = inode_io_list_move_locked(inode, wb,
                                   dirty_list);             /*       3      */

            spin_unlock(&wb->list_lock);
            trace_writeback_dirty_inode_enqueue(inode);

            /*
             * If this is the first dirty inode for this bdi,
             * we have to wake-up the corresponding bdi thread
             * to make sure background write-back happens
             * later.
             */
            if (wakeup_bdi &&
                (wb->bdi->capabilities & BDI_CAP_WRITEBACK))
                wb_wakeup_delayed(wb);               /*       4      */
            return;
        }
    }
out_unlock_inode:
    spin_unlock(&inode->i_lock);
}

(1) Check whether flags is already set; if so, return immediately. Here flags is I_DIRTY or I_DIRTY_SYNC.

(2) Set the inode's i_state.

(3) Link the inode onto the appropriate bdi_writeback list.

(4) If the b_dirty list was empty, inode_io_list_move_locked in step 3 returns true; in that case, wake up the bdi_writeback's delayed work to start writeback.
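Callers rarely invoke __mark_inode_dirty directly; the usual entry points are the thin wrappers in include/linux/fs.h:

include/linux/fs.h
static inline void mark_inode_dirty(struct inode *inode)
{
    __mark_inode_dirty(inode, I_DIRTY);
}

static inline void mark_inode_dirty_sync(struct inode *inode)
{
    __mark_inode_dirty(inode, I_DIRTY_SYNC);
}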

Periodic Writeback

The system periodically writes dirty pages back. This section walks through that flow.

The entry point for writeback is wb_workfn; both marking an inode dirty and triggering sync eventually lead here.

fs/fs-writeback.c
void wb_workfn(struct work_struct *work)
{
    struct bdi_writeback *wb = container_of(to_delayed_work(work),
                        struct bdi_writeback, dwork);
    long pages_written;

    set_worker_desc("flush-%s", bdi_dev_name(wb->bdi));
    current->flags |= PF_SWAPWRITE;

    if (likely(!current_is_workqueue_rescuer() ||
           !test_bit(WB_registered, &wb->state))) {
        /*
         * The normal path.  Keep writing back @wb until its
         * work_list is empty.  Note that this path is also taken
         * if @wb is shutting down even when we're running off the
         * rescuer as work_list needs to be drained.
         */
        do {
            pages_written = wb_do_writeback(wb);            /*          1         */
            trace_writeback_pages_written(pages_written);
        } while (!list_empty(&wb->work_list));
    } else {
        /*
         * bdi_wq can't get enough workers and we're running off
         * the emergency worker.  Don't hog it.  Hopefully, 1024 is
         * enough for efficient IO.
         */
        pages_written = writeback_inodes_wb(wb, 1024,
                            WB_REASON_FORKER_THREAD);
        trace_writeback_pages_written(pages_written);
    }

    if (!list_empty(&wb->work_list))                /*          2         */
        wb_wakeup(wb);
    else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
        wb_wakeup_delayed(wb);

    current->flags &= ~PF_SWAPWRITE;
}

(1) Call wb_do_writeback to carry out the actual work.

(2) If there are more work items (wb->work_list non-empty), wake the worker again immediately; otherwise, if dirty I/O remains (wb_has_dirty_io(wb)) and periodic writeback is enabled, re-arm the delayed work, as shown below.
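The re-arming uses wb_wakeup_delayed(), the same helper that step 4 of __mark_inode_dirty used to schedule the first run. From mm/backing-dev.c (lightly trimmed), it arms dwork one dirty_writeback_interval into the future:

mm/backing-dev.c
void wb_wakeup_delayed(struct bdi_writeback *wb)
{
    unsigned long timeout;

    /* dirty_writeback_interval is in centiseconds (units of 10 ms) */
    timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
    spin_lock_bh(&wb->work_lock);
    if (test_bit(WB_registered, &wb->state))
        queue_delayed_work(bdi_wq, &wb->dwork, timeout);
    spin_unlock_bh(&wb->work_lock);
}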

wb_do_writeback is the key function for processing writeback; it is defined as follows:

static long wb_do_writeback(struct bdi_writeback *wb)
{
    struct wb_writeback_work *work;
    long wrote = 0;

    set_bit(WB_writeback_running, &wb->state);
    while ((work = get_next_work_item(wb)) != NULL) {    /*           1          */
        trace_writeback_exec(wb, work);
        wrote += wb_writeback(wb, work);
        finish_writeback_work(wb, work);
    }

    /*
     * Check for a flush-everything request
     */
    wrote += wb_check_start_all(wb);

    /*
     * Check for periodic writeback, kupdated() style
     */
    wrote += wb_check_old_data_flush(wb);               /*           2          */
    wrote += wb_check_background_flush(wb);             /*           3          */
    clear_bit(WB_writeback_running, &wb->state);

    return wrote;
}

(1) Process the pending work items. A synchronous call such as sync adds a wb_writeback_work to the wb, and it is executed here; see the get_next_work_item excerpt after these notes.

(2) Entry point for periodic writeback: check whether any dirty pages have expired.

(3) Entry point for background writeback: if the number of dirty pages exceeds the system limit, write them back.
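get_next_work_item() in step 1 simply pops the head of wb->work_list; lightly trimmed from fs/fs-writeback.c:

static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
{
    struct wb_writeback_work *work = NULL;

    spin_lock_bh(&wb->work_lock);
    if (!list_empty(&wb->work_list)) {
        work = list_entry(wb->work_list.next,
                          struct wb_writeback_work, list);
        list_del_init(&work->list);     /* dequeue the oldest work item */
    }
    spin_unlock_bh(&wb->work_lock);
    return work;
}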

Next we focus on step 2, the periodic-writeback logic.

fs/fs-writeback.c
wb_workfn->wb_do_writeback->wb_check_old_data_flush
static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
    unsigned long expired;
    long nr_pages;

    /*
     * When set to zero, disable periodic writeback
     */
    if (!dirty_writeback_interval)
        return 0;

    expired = wb->last_old_flush +
            msecs_to_jiffies(dirty_writeback_interval * 10);
    if (time_before(jiffies, expired))             /*             1           */
        return 0;                     

    wb->last_old_flush = jiffies;
    nr_pages = get_nr_dirty_pages();

    if (nr_pages) {
        struct wb_writeback_work work = {
            .nr_pages   = nr_pages,
            .sync_mode  = WB_SYNC_NONE,
            .for_kupdate    = 1,
            .range_cyclic   = 1,
            .reason     = WB_REASON_PERIODIC,
        };

        return wb_writeback(wb, &work);        /*                2              */
    }

    return 0;
}

(1) Check whether the writeback period has expired. The interval is configurable: dirty_writeback_interval is exposed as /proc/sys/vm/dirty_writeback_centisecs. Its unit is 10 ms (centiseconds), and the default of 500 means a 5-second period. A user-space sketch that reads this knob follows these notes.

(2) Build a wb_writeback_work and call wb_writeback to do the rest.
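A small user-space sketch (illustrative, minimal error handling) that reads this knob and converts it to milliseconds:

#include <stdio.h>

int main(void)
{
    unsigned int centisecs;
    FILE *f = fopen("/proc/sys/vm/dirty_writeback_centisecs", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    if (fscanf(f, "%u", &centisecs) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("periodic writeback every %u ms\n", centisecs * 10);
    return 0;
}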

wb_writeback

Whether it is periodic writeback, background writeback, or a sync call, each essentially builds a wb_writeback_work and then calls wb_writeback to execute it.

First, the definition of wb_writeback_work:

fs/fs-writeback.c
struct wb_writeback_work {
    long nr_pages;
    struct super_block *sb;
    enum writeback_sync_modes sync_mode;
    unsigned int tagged_writepages:1;
    unsigned int for_kupdate:1;
    unsigned int range_cyclic:1;
    unsigned int for_background:1;
    unsigned int for_sync:1;    /* sync(2) WB_SYNC_ALL writeback */
    unsigned int auto_free:1;   /* free on completion */
    enum wb_reason reason;      /* why was writeback initiated? */

    struct list_head list;      /* pending work list */
    struct wb_completion *done; /* set if the caller waits */
};

nr_pages: total number of pages this work should write back

sync_mode: whether the writeback is synchronous (WB_SYNC_ALL) or not (WB_SYNC_NONE)

for_kupdate: set to 1 for periodic (kupdate-style) writeback

for_background: set to 1 for background writeback

for_sync: set to 1 when the work comes from the sync(2) call
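As a concrete example of how these fields combine, the sync(2) path (sync_inodes_sb() in fs/fs-writeback.c, lightly trimmed) fills the struct roughly like this; done is the completion the caller waits on:

struct wb_writeback_work work = {
    .sb             = sb,
    .sync_mode      = WB_SYNC_ALL,
    .nr_pages       = LONG_MAX,
    .range_cyclic   = 0,
    .done           = &done,
    .reason         = WB_REASON_SYNC,
    .for_sync       = 1,
};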

The flow of wb_writeback is as follows:

fs/fs-writeback.c
wb_workfn->wb_do_writeback->wb_check_old_data_flush->wb_writeback
static long wb_writeback(struct bdi_writeback *wb,
             struct wb_writeback_work *work)
{
    unsigned long wb_start = jiffies;
    long nr_pages = work->nr_pages;
    unsigned long dirtied_before = jiffies;
    struct inode *inode;
    long progress;
    struct blk_plug plug;

    blk_start_plug(&plug);
    spin_lock(&wb->list_lock);
    for (;;) {
        /*
         * Stop writeback when nr_pages has been consumed
         */
        if (work->nr_pages <= 0)    /*         1        */
            break;

        /*
         * Background writeout and kupdate-style writeback may
         * run forever. Stop them if there is other work to do
         * so that e.g. sync can proceed. They'll be restarted
         * after the other works are all done.
         */
        if ((work->for_background || work->for_kupdate) &&
            !list_empty(&wb->work_list))         /*         2        */
            break;

        /*
         * For background writeout, stop when we are below the
         * background dirty threshold
         */
        if (work->for_background && !wb_over_bg_thresh(wb))       /*         3        */
            break;

        /*
         * Kupdate and background works are special and we want to
         * include all inodes that need writing. Livelock avoidance is
         * handled by these works yielding to any other work so we are
         * safe.
         */
        if (work->for_kupdate) {                     /*         4        */
            dirtied_before = jiffies -
                msecs_to_jiffies(dirty_expire_interval * 10);
        } else if (work->for_background)
            dirtied_before = jiffies;

        trace_writeback_start(wb, work);
        if (list_empty(&wb->b_io))
            queue_io(wb, work, dirtied_before);          /*         5        */
        if (work->sb)
            progress = writeback_sb_inodes(work->sb, wb, work);  /*         6        */
        else
            progress = __writeback_inodes_wb(wb, work);
        trace_writeback_written(wb, work);

    }
    spin_unlock(&wb->list_lock);
    blk_finish_plug(&plug);

    return nr_pages - work->nr_pages;
}

(1) If the work's page budget has been consumed, exit the loop.

(2) For periodic or background writeback, stop if other work has arrived, so that higher-priority work (e.g. sync) is not blocked; the periodic/background work is restarted once the others are done.

(3) For background writeback, stop once we drop below the background dirty threshold.

(4) For periodic writeback, compute the expiry cutoff; in all other cases dirtied_before is the current time, so every inode counts as expired.

(5) Move the expired inodes onto the b_io list; a trimmed sketch follows these notes.

(6) Call writeback_sb_inodes or __writeback_inodes_wb to do the rest of the work.
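For step 5, queue_io() delegates to move_expired_inodes(). Its core loop, heavily trimmed here (the real function also groups the moved inodes by superblock before splicing), moves every inode dirtied before dirtied_before from b_dirty onto b_io:

/* heavily trimmed from move_expired_inodes() in fs/fs-writeback.c */
while (!list_empty(delaying_queue)) {
    inode = wb_inode(delaying_queue->prev);     /* oldest dirtied inode */
    if (inode_dirtied_after(inode, dirtied_before))
        break;                                  /* the rest are newer */
    list_move(&inode->i_io_list, &tmp);         /* expired: stage it */
    moved++;
}
list_splice(&tmp, dispatch_queue);              /* tmp -> b_io */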

writeback_sb_inodes and __writeback_inodes_wb follow similar logic: both iterate over wb->b_io and call __writeback_single_inode.

fs/fs-writeback.c
static int
__writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
    struct address_space *mapping = inode->i_mapping;
    long nr_to_write = wbc->nr_to_write;
    unsigned dirty;
    int ret;

    WARN_ON(!(inode->i_state & I_SYNC));

    trace_writeback_single_inode_start(inode, wbc, nr_to_write);

    ret = do_writepages(mapping, wbc);                                                  /*        1       */

    /*
     * Make sure to wait on the data before writing out the metadata.
     * This is important for filesystems that modify metadata on data
     * I/O completion. We don't do it for sync(2) writeback because it has a
     * separate, external IO completion path and ->sync_fs for guaranteeing
     * inode metadata is written back correctly.
     */
    if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {           /*        2       */
        int err = filemap_fdatawait(mapping);
        if (ret == 0)
            ret = err;
    }

  .....
        int err = write_inode(inode, wbc);                                                  /*        3       */
        if (ret == 0)
            ret = err;

    trace_writeback_single_inode(inode, wbc, nr_to_write);
    return ret;
}

(1) Call do_writepages to write back the file's data.

(2) Wait for the data writes to complete.

(3) Call write_inode to write the file's metadata. Note how this guarantees that data is written before metadata. This ends up calling s_op->write_inode, as the wrapper below shows.
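The write_inode() in step 3 is a thin wrapper that dispatches to the filesystem's superblock operation; from fs/fs-writeback.c (lightly trimmed):

static int write_inode(struct inode *inode, struct writeback_control *wbc)
{
    int ret;

    if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode)) {
        trace_writeback_write_inode_start(inode, wbc);
        /* dispatch to the filesystem, e.g. ext4_write_inode */
        ret = inode->i_sb->s_op->write_inode(inode, wbc);
        trace_writeback_write_inode(inode, wbc);
        return ret;
    }
    return 0;
}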

Background Writeback

When the number of dirty pages exceeds the background-writeback threshold, the system starts background writeback.

The threshold is /proc/sys/vm/dirty_background_ratio, the ratio of dirty pages to total available memory, or alternatively /proc/sys/vm/dirty_background_bytes, an absolute cap on dirty bytes; the two settings are mutually exclusive.

In wb_do_writeback, wb_check_background_flush performs the background writeback:

static long wb_check_background_flush(struct bdi_writeback *wb)
{
    if (wb_over_bg_thresh(wb)) {

        struct wb_writeback_work work = {
            .nr_pages   = LONG_MAX,
            .sync_mode  = WB_SYNC_NONE,
            .for_background = 1,
            .range_cyclic   = 1,
            .reason     = WB_REASON_BACKGROUND,
        };

        return wb_writeback(wb, &work);
    }

    return 0;
}

The logic mirrors periodic writeback: build a wb_writeback_work and hand it to wb_writeback.

wb_over_bg_thresh decides whether the background threshold has been exceeded.
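Conceptually, wb_over_bg_thresh() compares the current number of dirty pages against a background threshold derived from one of the two knobs above. The real function (in mm/page-writeback.c) also accounts for cgroup writeback and per-wb shares; the helper below is only an illustration, not a kernel API:

/* Illustrative only: over_bg_thresh_sketch is not a kernel function. */
static bool over_bg_thresh_sketch(unsigned long dirty_pages,
                                  unsigned long dirtyable_pages)
{
    unsigned long bg_thresh;

    if (dirty_background_bytes)     /* absolute byte limit configured */
        bg_thresh = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
    else                            /* else a percentage of dirtyable memory */
        bg_thresh = dirtyable_pages * dirty_background_ratio / 100;

    return dirty_pages > bg_thresh;
}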

The sync System Call

Userspace can explicitly call sync to write back all dirty data in the system. The entry point is:

fs/sync.c
SYSCALL_DEFINE0(sync)
{
    ksys_sync();
    return 0;
}

void ksys_sync(void)
{
    int nowait = 0, wait = 1;

    wakeup_flusher_threads(WB_REASON_SYNC);             /*    1    */
    iterate_supers(sync_inodes_one_sb, NULL);               /*    2    */
    iterate_supers(sync_fs_one_sb, &nowait);              /*    3    */
    iterate_supers(sync_fs_one_sb, &wait);                /*    4    */
    iterate_bdevs(fdatawrite_one_bdev, NULL);             /*    5    */
    iterate_bdevs(fdatawait_one_bdev, NULL);            /*    6    */
    if (unlikely(laptop_mode))
        laptop_sync_completion();
}

(1) Wake up all bdis: if a bdi has dirty pages, wake it so it starts writing back.

(2) Iterate over all superblocks, running sync_inodes_one_sb; this builds a wb_writeback_work, adds it to the corresponding wb's work_list, and waits for it to finish.

(3) Iterate over all superblocks, running sync_fs without waiting for completion.

(4) Iterate over all superblocks, running sync_fs and waiting for completion.

(5)(6) Iterate over all block devices, writing out their caches and then waiting for the writes to finish.
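The per-superblock callbacks used in steps 2 through 4 are small; from fs/sync.c (the exact guards differ slightly across kernel versions):

fs/sync.c
static void sync_inodes_one_sb(struct super_block *sb, void *arg)
{
    if (!sb_rdonly(sb))
        sync_inodes_sb(sb);     /* WB_SYNC_ALL writeback of data pages */
}

static void sync_fs_one_sb(struct super_block *sb, void *arg)
{
    if (!sb_rdonly(sb) && sb->s_op->sync_fs)
        sb->s_op->sync_fs(sb, *(int *)arg);  /* arg: 0 = nowait, 1 = wait */
}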

Reposted from blog.csdn.net/m0_50662680/article/details/131072663