ZooKeeper: Persistence

Note: in this article "request" and "transaction" mean the same thing: a write request from a client.

Background

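Although ZooKeeper is an in-memory database, it also provides persistence for the sake of reliability: data is saved to disk through snapshots and a transaction log.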

Transaction log

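Every executed transaction is written to the transaction log. Its location is set by dataLogDir; when dataLogDir is not configured, dataDir is used instead. Because transaction-log write latency has a significant impact on ZooKeeper's performance, dataLogDir is best placed on a dedicated disk.
Because the transaction log is appended to continuously, the file system has to keep allocating new disk blocks for the file. To reduce the impact of block allocation on writes, ZooKeeper preallocates space: by default 64MB at a time, both when a new file is created and when an existing file is extended.
When a log file is extended: a new log file is given 64MB when it is initialized. While writing, if less than 4KB of writable space remains, the file is extended by another 64MB.
When a new log file is created: running low on space in the current file only ever triggers an extension, never a new file. A snapshot is taken after roughly snapCount transactions (the threshold is randomized, as the SyncRequestProcessor code later in this article shows), and at the same time the current log's output stream is set to null, so the next log write automatically creates a new log file.
Log file names are meaningful, so that the file holding a given zxid can be found quickly: a file is named log.{zxid}, where the suffix is the zxid of the first transaction stored in that file.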
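
The naming rule above is what makes lookup cheap: the file that may contain a zxid is the one whose suffix is the largest value not exceeding it. The following is a minimal sketch of that idea, not ZooKeeper's own code. The class LogFileLocator and its methods are hypothetical, and the sketch assumes the suffix is written in hexadecimal.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not ZooKeeper code: picks the log file that may hold a zxid. */
class LogFileLocator {
    private static final String PREFIX = "log.";

    /** Parses the zxid suffix of a name such as "log.2a" (assumed to be hexadecimal). */
    static long zxidOf(String fileName) {
        return Long.parseLong(fileName.substring(PREFIX.length()), 16);
    }

    /**
     * Returns the file whose first zxid is the largest one that is <= target,
     * or null when every file starts after the target.
     */
    static String locate(List<String> fileNames, long target) {
        String best = null;
        long bestZxid = -1L;
        for (String name : fileNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // not a transaction log file
            }
            long first = zxidOf(name);
            if (first <= target && first > bestZxid) {
                best = name;
                bestZxid = first;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList("log.1", "log.2a", "log.5f");
        System.out.println(locate(logs, 0x30L)); // prints log.2a
    }
}
```

The file returned is only a candidate: the transaction may still be absent if the log ends before the target zxid.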

Snapshots

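When a snapshot file is generated: after roughly snapCount transactions.
Like log files, snapshot files have meaningful names: snapshot.{zxid}, where the suffix is the zxid of the latest transaction that had been executed when the snapshot was generated, i.e. all transactions in [1, zxid] have been applied to DataTree.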

Data recovery

Overall flow

While QuorumPeerMain starts ZooKeeperServer, data has to be restored from disk. Recovery takes two steps:

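1. Restore DataTree from a snapshot and return the largest zxid covered by the restored data.
2. Read every transaction with a larger zxid from the transaction log and apply it to the DataTree restored in step 1.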

    /**
     * this function restores the server database after reading from the snapshots and transaction logs
     *
     * @param dt       the datatree to be restored
     * @param sessions the sessions to be restored
     * @param listener the playback listener to run on the
     *                 database restoration
     * @return the highest zxid restored
     * @throws IOException
     */
    public long restore(DataTree dt, Map<Long, Integer> sessions,
                        PlayBackListener listener) throws IOException {
        //1. parse the snapshot file; this also updates dt.lastProcessedZxid
        long deserializeResult = snapLog.deserialize(dt, sessions);
        //2. process the transaction log
        FileTxnLog txnLog = new FileTxnLog(dataDir);
        boolean trustEmptyDB;
        File initFile = new File(dataDir.getParent(), "initialize");
        if (Files.deleteIfExists(initFile.toPath())) {
            LOG.info("Initialize file found, an empty database will not block voting participation");
            trustEmptyDB = true;
        } else {
            trustEmptyDB = autoCreateDB;
        }
        if (-1L == deserializeResult) {
            /* this means that we couldn't find any snapshot, so we need to
             * initialize an empty database (reported in ZOOKEEPER-2325) */
            if (txnLog.getLastLoggedZxid() != -1) {
                throw new IOException(
                        "No snapshot found, but there are log entries. " +
                                "Something is broken!");
            }

            if (trustEmptyDB) {
                /* TODO: (br33d) we should either put a ConcurrentHashMap on restore()
                 *       or use Map on save() */
                save(dt, (ConcurrentHashMap<Long, Integer>) sessions, false);

                /* return a zxid of 0, since we know the database is empty */
                return 0L;
            } else {
                /* return a zxid of -1, since we are possibly missing data */
                LOG.warn("Unexpected empty data tree, setting zxid to -1");
                dt.lastProcessedZxid = -1L;
                return -1L;
            }
        }
        return fastForwardFromEdits(dt, sessions, listener);
    }

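The code above is the overall procedure for restoring DataTree. It includes some error handling, but since it is not yet clear to me when those errors occur or how the handling code deals with them, only the normal-case recovery steps are covered here.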

Restoring from a snapshot

    /**
     * deserialize a data tree from the most recent snapshot
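     * (deserializes the snapshot file)
     * <p>
     * Side effect: modifies {@link DataTree#lastProcessedZxid}
     * <p>
     * If the newest valid snapshot file is named snapshot.n, the results of all transactions in [1,n] are in the snapshot, and n is returned
     *
     * @return the zxid of the snapshot (the last processed zxid stored in the snapshot data)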
     */
    @Override
    public long deserialize(DataTree dt, Map<Long, Integer> sessions)
            throws IOException {
        // we run through 100 snapshots (not all of them)
        // if we cannot get it running within 100 snapshots
        // we should  give up
        //fetch at most 100 snapshot files (sorted by zxid in descending order, newest first)
        List<File> snapList = findNValidSnapshots(100);
        if (snapList.size() == 0) {
            return -1L;
        }
        File snap = null;
        boolean foundValid = false;
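        //if the newest snapshot file passes validation, only that one file is parsed;
        //if all 100 snapshot files are invalid, recovery from snapshots is considered impossible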
        for (File aSnapList : snapList) {
            snap = aSnapList;
            LOG.info("Reading snapshot " + snap);
            try (InputStream snapIS = new BufferedInputStream(new FileInputStream(snap));
                 CheckedInputStream crcIn = new CheckedInputStream(snapIS, new Adler32())) {
                InputArchive ia = BinaryInputArchive.getArchive(crcIn);
                //deserialize
                deserialize(dt, sessions, ia);
                long checkSum = crcIn.getChecksum().getValue();
                long val = ia.readLong("val");
                //verify the checksum
                if (val != checkSum) {
                    throw new IOException("CRC corruption in snapshot :  " + snap);
                }
                foundValid = true;
                break;
            } catch (IOException e) {
                LOG.warn("problem reading snap file " + snap, e);
            }
        }
        if (!foundValid) {
            throw new IOException("Not able to find valid snapshots in " + snapDir);
        }
        dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX);
        return dt.lastProcessedZxid;
    }

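The code above restores data from a snapshot. It fetches at most the 100 most recent snapshot files, but if the newest one passes validation, only that one is parsed; an older file is tried only when every newer one has failed. If all 100 are invalid, recovery from snapshots is considered impossible.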
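
The validation step relies on a trailing checksum: the reader recomputes an Adler-32 over the bytes it has consumed and compares it with the value stored at the end of the file. Below is a minimal, self-contained sketch of that pattern using only java.util.zip. The class ChecksumDemo and its length-prefixed layout are assumptions made for illustration and are not ZooKeeper's snapshot format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

/** Illustrative only: a payload followed by the Adler-32 of everything before it. */
class ChecksumDemo {
    /** Serializes [length][payload][checksum] into a byte array (hypothetical layout). */
    static byte[] write(byte[] payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            CheckedOutputStream checked = new CheckedOutputStream(bytes, new Adler32());
            DataOutputStream out = new DataOutputStream(checked);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
            // Checksum of everything written so far; it becomes the trailer.
            out.writeLong(checked.getChecksum().getValue());
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Re-reads the data and compares the recomputed checksum with the stored one. */
    static boolean isValid(byte[] file) {
        try (CheckedInputStream checked =
                     new CheckedInputStream(new ByteArrayInputStream(file), new Adler32());
             DataInputStream in = new DataInputStream(checked)) {
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            // Must be captured before the trailer is read, as in deserialize() above.
            long computed = checked.getChecksum().getValue();
            long stored = in.readLong();
            return computed == stored;
        } catch (IOException e) {
            return false; // truncated or unreadable data counts as invalid
        }
    }
}
```

A caller that holds several candidate files can try them newest first and stop at the first one for which isValid returns true, which mirrors the loop in deserialize().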

Restoring from the transaction log

    /**
     * Restores data from the transaction log. Because {@link DataTree#lastProcessedZxid} has already been updated in {@link #restore(DataTree, Map, PlayBackListener)}, it does not need to be passed in
     *
     * @param dt       the datatree to write transactions to.
     * @param sessions the sessions to be restored.
     * @param listener the playback listener to run on the
     *                 database transactions.
     * @return the highest zxid restored.
     * @throws IOException
     */
    public long fastForwardFromEdits(DataTree dt, Map<Long, Integer> sessions,
                                     PlayBackListener listener) throws IOException {
        //read every transaction whose zxid is greater than lastProcessedZxid
        TxnIterator itr = txnLog.read(dt.lastProcessedZxid + 1);
        long highestZxid = dt.lastProcessedZxid;
        TxnHeader hdr;
        try {
            do {
                hdr = itr.getHeader();
                if (hdr == null) {
                    return dt.lastProcessedZxid;
                }
                if (hdr.getZxid() < highestZxid && highestZxid != 0) {
                    LOG.error("{}(highestZxid) > {}(next log) for type {}",
                            highestZxid, hdr.getZxid(), hdr.getType());
                } else {
                    highestZxid = hdr.getZxid();
                }
                try {
                    //apply the transaction
                    processTransaction(hdr, dt, sessions, itr.getTxn());
                } catch (KeeperException.NoNodeException e) {
                    throw new IOException("Failed to process transaction type: " +
                            hdr.getType() + " error: " + e.getMessage(), e);
                }
                //notify the listener
                listener.onTxnLoaded(hdr, itr.getTxn());
            } while (itr.next());
        } finally {
            if (itr != null) {
                itr.close();
            }
        }
        return highestZxid;
    }

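The code above fetches, in order, the transactions that are not covered by the snapshot and applies them to DataTree. It also notifies the listener, which converts each transaction record into a Proposal and stores it in ZKDatabase.committedLog so that Followers can synchronize quickly.
processTransaction() contains the following comment: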

        /**
         * Snapshots are lazily created. So when a snapshot is in progress,
         * there is a chance for later transactions to make into the
         * snapshot. Then when the snapshot is restored, NONODE/NODEEXISTS
         * errors could occur. It should be safe to ignore these.
         */

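Snapshot files are created lazily (the snapshot creation process is described in the persistence section below). While a snapshot is being written, the results of later transactions may therefore end up in it as well. When such a snapshot is restored and the log is replayed on top of it, NONODE/NODEEXISTS errors can occur, and they can safely be ignored.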
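
To make the situation concrete, here is a toy model of replaying a log over such a "fuzzy" snapshot. It is a sketch under heavy simplification: the tree is a plain set of paths, and the class ReplayDemo with its Txn type is hypothetical rather than anything from ZooKeeper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model, not ZooKeeper code: replaying a log over a "fuzzy" snapshot. */
class ReplayDemo {
    /** Hypothetical stand-in for a logged transaction. */
    static final class Txn {
        final long zxid;
        final boolean create; // true = create the path, false = delete it
        final String path;

        Txn(long zxid, boolean create, String path) {
            this.zxid = zxid;
            this.create = create;
            this.path = path;
        }
    }

    /** Returns false where a real server would report NODEEXISTS or NONODE. */
    static boolean apply(Set<String> tree, Txn txn) {
        return txn.create ? tree.add(txn.path) : tree.remove(txn.path);
    }

    /** Replays every transaction newer than the snapshot; returns how many errors were ignored. */
    static int replay(Set<String> tree, List<Txn> log, long snapshotZxid) {
        int ignored = 0;
        for (Txn txn : log) {
            if (txn.zxid <= snapshotZxid) {
                continue; // covered by the snapshot according to its name
            }
            if (!apply(tree, txn)) {
                ignored++; // the effect had already leaked into the snapshot
            }
        }
        return ignored;
    }

    public static void main(String[] args) {
        // The snapshot is named after zxid 10, but it was written while
        // transaction 11 (delete /test) ran, so /test is already gone.
        Set<String> tree = new TreeSet<>(Arrays.asList("/a"));
        List<Txn> log = Arrays.asList(
                new Txn(10, true, "/test"),
                new Txn(11, false, "/test"),
                new Txn(12, true, "/b"));
        int ignored = replay(tree, log, 10);
        System.out.println(tree + " ignored=" + ignored); // prints [/a, /b] ignored=1
    }
}
```

The final tree is the same as it would have been with a clean snapshot, which is why ignoring the error is harmless here.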

Persistence

At startup ZooKeeper builds a request processor chain to handle client requests. In standalone mode the chain is PrepRequestProcessor -> SyncRequestProcessor -> FinalRequestProcessor. SyncRequestProcessor has two main jobs:

Recording transaction requests in the transaction log file

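To improve the throughput of log persistence, ZooKeeper batches writes: not every request is forced to disk immediately, and flushing has low priority. flush() runs only when no request is waiting in the queue or when more than 1000 requests are waiting to be flushed.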

Triggering ZooKeeper data snapshots

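To keep all machines in an ensemble from snapshotting at the same moment, the decision to snapshot includes a random factor.
When a snapshot is taken, the current log's output stream is set to null, so the next log write creates a new log file.
The snapshot runs in a separate thread and does not block normal request processing.
If the previous snapshot task has not finished, the new one is skipped.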
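
The random factor can be isolated into a few lines. The sketch below follows the same arithmetic as the run() method shown later in this section (a threshold of snapCount / 2 plus a random value below snapCount / 2), but the class SnapshotTrigger and its method names are hypothetical and it omits the log roll and the snapshot thread.

```java
import java.util.Random;

/** Hypothetical helper: decides when a snapshot should start, with jitter. */
class SnapshotTrigger {
    private final int snapCount;
    private final Random random;
    private int randRoll;  // random offset in [0, snapCount / 2)
    private int logCount;  // transactions logged since the last trigger

    SnapshotTrigger(int snapCount, Random random) {
        if (snapCount < 2) {
            throw new IllegalArgumentException("snapCount must be at least 2");
        }
        this.snapCount = snapCount;
        this.random = random;
        this.randRoll = random.nextInt(snapCount / 2);
    }

    /** Call once per logged transaction; returns true when a snapshot should start. */
    boolean onTransactionLogged() {
        logCount++;
        if (logCount > snapCount / 2 + randRoll) {
            randRoll = random.nextInt(snapCount / 2); // new offset for the next round
            logCount = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SnapshotTrigger trigger = new SnapshotTrigger(100, new Random());
        int n = 0;
        do {
            n++;
        } while (!trigger.onTransactionLogged());
        System.out.println("snapshot after " + n + " transactions");
    }
}
```

With snapCount = 100 the trigger fires somewhere between the 51st and the 100th transaction, so servers that started together drift apart instead of snapshotting in lockstep.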

SyncRequestProcessor extends Thread, so it is itself a thread. Let's look at what this thread does:

    @Override
    public void run() {
        try {
            //number of transactions since the last snapshot and log roll
            int logCount = 0;
            //random factor that keeps all machines in the ensemble from snapshotting at the same moment
            int randRoll = r.nextInt(snapCount / 2);
            while (true) {
                Request si;
                if (toFlush.isEmpty()) {
                    //nothing is waiting to be flushed to disk
                    //consume the request queue (this call blocks)
                    si = queuedRequests.take();
                } else {
                    //some requests are waiting to be flushed
                    si = queuedRequests.poll();
                    if (si == null) {
                        //the request queue is currently empty, so flush to disk
                        //flushing has low priority: here it only happens when queuedRequests is empty
                        flush(toFlush);
                        continue;
                    }
                }
                //shutdown() puts requestOfDeath into queuedRequests
                if (si == requestOfDeath) {
                    break;
                }
                if (si != null) {
                    //append the request to the log file; note that it is not yet persisted to disk
                    if (zks.getZKDatabase().append(si)) {
                        logCount++;
                        //1. decide whether a snapshot is needed
                        if (logCount > (snapCount / 2 + randRoll)) {
                            randRoll = r.nextInt(snapCount / 2);
                            // roll the log
                            //2. roll the transaction log to a new file (i.e. set the current log's output stream to null)
                            zks.getZKDatabase().rollLog();
                            if (snapInProcess != null && snapInProcess.isAlive()) {
                                //the previous snapshot task has not finished, so this one is skipped
                                LOG.warn("Too busy to snap, skipping");
                            } else {
                                //3. create the asynchronous snapshot thread
                                snapInProcess = new ZooKeeperThread("Snapshot Thread") {
                                    @Override
                                    public void run() {
                                        try {
                                            zks.takeSnapshot();
                                        } catch (Exception e) {
                                            LOG.warn("Unexpected exception", e);
                                        }
                                    }
                                };
                                snapInProcess.start();
                            }
                            logCount = 0;
                        }
                    }
                    //append() returns false when the request has no transaction header (hdr == null, see FileTxnLog.append()), i.e. for read requests; this branch handles that case
                    else if (toFlush.isEmpty()) {
                        // optimization for read heavy workloads
                        // iff this is a read, and there are no pending
                        // flushes (writes), then just pass this to the next
                        // processor
                        if (nextProcessor != null) {
                            nextProcessor.processRequest(si);
                            if (nextProcessor instanceof Flushable) {
                                ((Flushable) nextProcessor).flush();
                            }
                        }
                        continue;
                    }
                    //add to the flush queue
                    toFlush.add(si);
                    //too many requests are waiting to be flushed, so flush right away
                    if (toFlush.size() > 1000) {
                        flush(toFlush);
                    }
                }
            }
        } catch (Throwable t) {
            handleException(this.getName(), t);
        } finally {
            running = false;
        }
        LOG.info("SyncRequestProcessor exited!");
    }

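The snapshot logic is fairly clear, so let's look at how the batching of the transaction log is implemented.
First, the transaction is appended to the output stream via FileTxnLog.append() (it is not yet persisted to disk at this point):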

    /**
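     * 1. Check whether there is a transaction log file to write to
     * 2. Decide whether the log file needs to be extended
     * 3. Serialize the transaction
     * 4. Compute the checksum
     * 5. Write to the log file stream (a BufferedOutputStream, so the data has not really reached the disk yet)
     *
     * @param hdr the header of the transaction
     * @param txn the transaction part of the entry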
     *            returns true iff something appended, otw false
     */
    @Override
    public synchronized boolean append(TxnHeader hdr, Record txn)
            throws IOException {
        if (hdr == null) {
            return false;
        }
        if (hdr.getZxid() <= lastZxidSeen) {
            LOG.warn("Current zxid " + hdr.getZxid()
                    + " is <= " + lastZxidSeen + " for "
                    + hdr.getType());
        } else {
            lastZxidSeen = hdr.getZxid();
        }
        //check whether there is a log file to write to
        if (logStream == null) {
            if (LOG.isInfoEnabled()) {
                LOG.info("Creating new log file: " + Util.makeLogName(hdr.getZxid()));
            }
            //create a new file to write to
            logFileWrite = new File(logDir, Util.makeLogName(hdr.getZxid()));
            fos = new FileOutputStream(logFileWrite);
            logStream = new BufferedOutputStream(fos);
            oa = BinaryOutputArchive.getArchive(logStream);
            FileHeader fhdr = new FileHeader(TXNLOG_MAGIC, VERSION, dbId);
            fhdr.serialize(oa, "fileheader");
            logStream.flush();
            //record how many bytes have been written to the file so far
            filePadding.setCurrentSize(fos.getChannel().position());
            streamsToFlush.add(fos);
        }
        filePadding.padFile(fos.getChannel());
        //serialize the transaction
        byte[] buf = Util.marshallTxnEntry(hdr, txn);
        if (buf == null || buf.length == 0) {
            throw new IOException("Faulty serialization for header " +
                    "and txn");
        }
        //compute the checksum
        Checksum crc = makeChecksumAlgorithm();
        crc.update(buf, 0, buf.length);
        oa.writeLong(crc.getValue(), "txnEntryCRC");
        //write to the log file stream
        Util.writeTxnBytes(oa, buf);
        return true;
    }
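
The filePadding.padFile() call in append() is where the pre-allocation rule from the transaction log section takes effect. As a rough sketch of that rule only: the class PaddingSketch below is hypothetical, it uses the 64MB and 4KB figures quoted in this article, and the exact boundary comparison inside ZooKeeper may differ.

```java
/** Hypothetical sketch of the pre-allocation rule; not ZooKeeper's FilePadding. */
class PaddingSketch {
    static final long PRE_ALLOC_BYTES = 64L * 1024 * 1024; // extend by 64MB at a time
    static final long MARGIN_BYTES = 4L * 1024;            // extend when under 4KB remains

    /**
     * Returns the size the file should have before the next write.
     *
     * @param position    current write position in the file
     * @param currentSize size the file has been padded to so far
     */
    static long sizeAfterPadding(long position, long currentSize) {
        long remaining = currentSize - position;
        if (remaining < MARGIN_BYTES) {
            return currentSize + PRE_ALLOC_BYTES; // grow in one large step
        }
        return currentSize; // enough room left, nothing to do
    }

    public static void main(String[] args) {
        long size = sizeAfterPadding(0, 0);            // a fresh file gets its first 64MB
        System.out.println(size);
        System.out.println(sizeAfterPadding(size - 100, size)); // nearly full: grows again
    }
}
```

Growing in large steps keeps block allocation off the hot path: most appends land in space the file system has already reserved.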

When either of the two conditions mentioned above is met (no request is waiting, or more than 1000 requests are waiting to be flushed), SyncRequestProcessor.flush() is called:

    /**
     * Batching: flush the transaction log to disk, then hand the requests to the next processor
     *
     * @param toFlush the requests waiting to be flushed
     * @throws IOException
     * @throws RequestProcessorException
     */
    private void flush(LinkedList<Request> toFlush)
            throws IOException, RequestProcessorException {
        if (toFlush.isEmpty()) {
            return;
        }
        //first persist the transaction log to disk
        zks.getZKDatabase().commit();
        while (!toFlush.isEmpty()) {
            Request i = toFlush.remove();
            if (nextProcessor != null) {
                //hand over to the next RequestProcessor
                nextProcessor.processRequest(i);
            }
        }
        if (nextProcessor instanceof Flushable) {
            ((Flushable) nextProcessor).flush();
        }
    }

First, FileTxnLog.commit() persists the transaction log to disk:

    /**
     * Because {@link #logStream} is a {@link BufferedOutputStream}, data is not really written to disk after {@link #append(TxnHeader, Record)}; this method forces it to disk
     * commit the logs. make sure that everything hits the disk
     */
    @Override
    public synchronized void commit() throws IOException {
        if (logStream != null) {
            logStream.flush();
        }
        for (FileOutputStream log : streamsToFlush) {
            //flush the bytes written through FileOutputStream to the operating system; if the OS has its own cache, they are not on disk yet
            log.flush();
            //force the data onto disk
            if (forceSync) {
                ...
                FileChannel channel = log.getChannel();
                channel.force(false);
                ...
            }
        }
        //keep only one FileOutputStream pending flush and close the older ones
        while (streamsToFlush.size() > 1) {
            streamsToFlush.removeFirst().close();
        }
    }

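As the code shows, there are two layers of caching on the write path: one provided by Java (BufferedOutputStream) and one provided by the operating system. Even after FileOutputStream.flush() the data has only reached the operating system's cache; to make it truly durable, channel.force(false) must be called as well (my guess is that this ends up in an fsync-style system call).
Once the transaction log has been persisted, the next RequestProcessor, FinalRequestProcessor, is called to handle the request.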
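
The two cache layers can be seen in a small standalone example that uses only the JDK. This is an illustrative sketch, not ZooKeeper code: the class DurableWriteDemo, its methods, and the temporary file name are all made up for the example.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;

/** Illustrative only: write through both cache layers and force the data to disk. */
class DurableWriteDemo {
    /** Writes data to the file and returns the file size once it has been forced. */
    static long writeDurably(File file, byte[] data) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(file);
             BufferedOutputStream buffered = new BufferedOutputStream(fos)) {
            buffered.write(data);  // layer 1: bytes sit in the Java-side buffer
            buffered.flush();      // handed to the OS, possibly still only in its cache
            FileChannel channel = fos.getChannel();
            channel.force(false);  // layer 2: ask the OS to write the file content to storage
            return channel.size();
        }
    }

    /** Convenience wrapper that uses a temporary file (hypothetical name). */
    static long demo(byte[] data) {
        try {
            File file = File.createTempFile("txnlog-demo", ".bin");
            file.deleteOnExit();
            return writeDurably(file, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(new byte[] {1, 2, 3})); // prints 3
    }
}
```

Passing false to force() asks only for the file content to be written, not metadata such as timestamps, which is the same choice the commit() code above makes.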

Log truncation

Summary

Having covered ZooKeeper's data storage, here are a few questions to think about:

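When handling a request, does ZooKeeper persist the transaction log first, or apply the request to DataTree first?
As shown above, SyncRequestProcessor.flush() persists the transaction log first and only then calls the next RequestProcessor, FinalRequestProcessor. Only in FinalRequestProcessor is the request applied to DataTree and the response sent to the client. So the log is persisted before the request is applied to DataTree, which is what ensures that a transaction applied to DataTree cannot be lost (provided forceSync is enabled, as in the commit() code above).
If ZooKeeper receives a client request while a snapshot is in progress, is that request applied to DataTree? If so, what problems can this cause, and how are they solved?
ZooKeeper generates the snapshot file by calling zks.takeSnapshot(). Neither this method nor the methods beneath it lock the DataTree as a whole, so taking a snapshot is not atomic: transactions that occur between the start and the end of the snapshot are applied to DataTree and may be persisted into the snapshot file too. In other words, even if the snapshot suffix is n, the file may contain the results of transactions n+1, n+2, and so on.
Consider this scenario: a snapshot file has suffix n, but while it was being written ZooKeeper processed transaction n+1, which deleted node /test, and the result of that transaction is also in the snapshot. During recovery at startup, DataTree is first restored from the snapshot; replaying transaction n+1 then fails with a NONODE error, because /test has already been deleted. This does not affect data integrity or consistency, and the error can simply be ignored.
The scenario above has the snapshot containing one extra whole transaction, which does not affect recovery. But what if the snapshot contains half a transaction? For example, the data of /test has been modified but its mzxid has not. Would that affect recovery?
To prevent this, ZooKeeper locks the DataNode while modifying its data and while serializing it, which avoids such inconsistency.
During testing I ran into a puzzling case: the suffixes of a new snapshot file and the new transaction log file differed by more than 1, and the snapshot suffix was always the smaller one. Why?
Normally, since the new log file and the snapshot are triggered in the same branch, the snapshot suffix should be exactly 1 less than the log file suffix (the snapshot suffix is the zxid of the last transaction applied to DataTree, and the log file suffix is the zxid of the next transaction). Even allowing for the snapshot running in a separate thread and therefore starting a little later than the log roll, one would expect the snapshot suffix to be larger than the log suffix (a few more transactions may have been processed after the roll, so the snapshot would see a larger zxid). So why is it the other way around?
The reason is that writing the transaction log and applying the transaction to DataTree are not synchronized: a transaction is written to the log first and applied to DataTree later. The log file suffix is the zxid of the first transaction written to that log, while the snapshot suffix is the largest zxid applied to DataTree, which naturally lags behind the log.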
