LevelDB Source Code Walkthrough: MemTable and sstable

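Copyright notice: this is an original article by the blog author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original article and this notice.
Article link: https://blog.csdn.net/puliao4167/article/details/102757221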

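In the previous chapters we covered LevelDB's basic operations: creating a database, reading data and writing data. Now it is time to look closely at the structures that actually store the data. We began the series with the skiplist implementation, and MemTable is built almost entirely on top of it. MemTable is the in-memory storage structure, and basic reads consult it first; sstable is the storage structure on disk. This part is also where much of the essence of LevelDB lies.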

MemTable

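MemTable has a fairly simple structure: its get and put operations translate directly into operations on the skip list. One thing to note is that MemTable carries a built-in reference count, similar in purpose to a smart pointer: the object may be destroyed only when the count reaches zero. The difference is that the count here is maintained by hand. The count starts at zero, so whoever creates a MemTable must call Ref() explicitly, and must call Unref() instead of deleting the object; Unref() performs the delete once no references remain.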
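To make the manual counting concrete, here is a minimal sketch of the same pattern outside LevelDB. The class name RefCounted and the destroyed flag are hypothetical and exist only so the deletion can be observed; the Ref()/Unref() bodies mirror the ones in MemTable.

```cpp
#include <cassert>

// Hypothetical class illustrating MemTable-style manual reference counting.
class RefCounted {
 public:
  // `destroyed` is an illustration-only flag that lets callers observe deletion.
  explicit RefCounted(bool* destroyed) : refs_(0), destroyed_(destroyed) {}
  RefCounted(const RefCounted&) = delete;
  RefCounted& operator=(const RefCounted&) = delete;

  // The count starts at zero, so every holder must call Ref() itself.
  void Ref() { ++refs_; }

  // Drops one reference and deletes the object when none remain.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }

  int refs() const { return refs_; }

 private:
  // Private so that only Unref() can destroy the object.
  ~RefCounted() { *destroyed_ = true; }

  int refs_;
  bool* destroyed_;
};
```

Because the destructor is private, the object cannot live on the stack or be deleted directly, which forces every owner through Ref() and Unref().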

class MemTable {
 public:
  // MemTables are reference counted.  The initial reference count
  // is zero and the caller must call Ref() at least once.
  explicit MemTable(const InternalKeyComparator& comparator);
  MemTable(const MemTable&) = delete;
  MemTable& operator=(const MemTable&) = delete;
  // Increase reference count.
  void Ref() { ++refs_; }
  // Drop reference count.  Delete if no more references exist.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }
  // Add an entry into memtable that maps key to value at the
  // specified sequence number and with the specified type.
  // Typically value will be empty if type==kTypeDeletion.
  void Add(SequenceNumber seq, ValueType type, const Slice& key,
           const Slice& value);
  // If memtable contains a value for key, store it in *value and return true.
  // If memtable contains a deletion for key, store a NotFound() error
  // in *status and return true.
  // Else, return false.
  bool Get(const LookupKey& key, std::string* value, Status* s);
  ...
  
 private:
  typedef SkipList<const char*, KeyComparator> Table;
  ~MemTable();  // Private since only Unref() should be used to delete it
  KeyComparator comparator_;
  int refs_;
  Arena arena_;
  // MemTable stores its data in a skip list.
  // Note that the skip list is held by value, not by pointer, so constructing a
  // MemTable simply invokes the SkipList constructor.
  Table table_; 
  ...
};

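The interface comments above are self-explanatory, so let's look at how Add is implemented. Roughly: first compute the encoded length of the key-value entry (the layout is listed in the comment at the top of the function), then allocate that much memory from the arena, fill the fields in according to the layout, and finally insert the buffer into the skip list.
[Figure: layout of an entry in MemTable]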

void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
                   const Slice& value) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  // Total size of the entry is encoded_len, following the layout above
  const size_t encoded_len = VarintLength(internal_key_size) +
                             internal_key_size + VarintLength(val_size) +
                             val_size;
  // Allocate encoded_len bytes from the arena
  char* buf = arena_.Allocate(encoded_len);
  // Write the internal key length into the buffer
  char* p = EncodeVarint32(buf, internal_key_size);
  // Write the key: first the user key bytes, then the 8-byte tag,
  // which packs the sequence number and the type as (s << 8) | type
  memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (s << 8) | type);
  p += 8;
  // Write the value size, then the value bytes
  p = EncodeVarint32(p, val_size);
  memcpy(p, value.data(), val_size);
  assert(p + val_size == buf + encoded_len);
  // Finally insert the buffer into the skip list
  table_.Insert(buf);
}
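The layout can be checked by hand with a small standalone encoder. This is only a sketch: it writes into a std::string instead of an Arena, and the helper names PutVarint32, PutFixed64 and EncodeEntry are made up for the example rather than taken from LevelDB.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends v as a varint32: 7 payload bits per byte, high bit set if more bytes follow.
inline void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends v as 8 little-endian bytes.
inline void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Hypothetical helper: builds one memtable entry with the layout
//   varint32(user_key.size() + 8) | user_key | fixed64 tag | varint32(value.size()) | value
inline std::string EncodeEntry(uint64_t seq, uint8_t type,
                               const std::string& user_key,
                               const std::string& value) {
  std::string out;
  PutVarint32(&out, static_cast<uint32_t>(user_key.size() + 8));
  out.append(user_key);
  PutFixed64(&out, (seq << 8) | type);  // sequence in the high 56 bits, type in the low 8
  PutVarint32(&out, static_cast<uint32_t>(value.size()));
  out.append(value);
  return out;
}
```

For example, EncodeEntry(5, 1, "abc", "xy") yields 15 bytes: one length byte (11), the three key bytes, the eight tag bytes starting with 0x01 0x05, one length byte (2), and the two value bytes.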

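Next, the lookup path. An Iterator is created over the skip list and positioned with Seek() at the first entry at or after the lookup key. If the iterator is not valid, or the entry it lands on belongs to a different user key, Get returns false. Otherwise the entry is parsed according to the layout above. One more detail: the type is extracted from the tag and checked. If it is kTypeDeletion the key has been deleted, so *s is set to NotFound and Get returns true; if it is kTypeValue the value is copied into the value parameter and Get returns true.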

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();
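  // Use an iterator to seek to the first entry at or after memkey. Get returns
  // true only if that entry has the same user key; otherwise it returns false.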
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key.  We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {
        // kTypeValue marks an entry written by a put
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        // kTypeDeletion marks a deleted key (a tombstone)
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}
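Why a single Seek() is enough becomes clearer with a toy model. The sketch below replaces the skip list with a std::map ordered by user key ascending and tag descending, so the newest visible version of a key is the first entry at or after the probe. All Toy* names are hypothetical, and the explicit snapshot_seq parameter stands in for the sequence number that LookupKey carries.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

enum ToyValueType : uint8_t { kToyDeletion = 0, kToyValue = 1 };

struct ToyInternalKey {
  std::string user_key;
  uint64_t tag;  // (sequence << 8) | type
};

// Orders by user key ascending, then by tag descending (newer entries first).
struct ToyInternalKeyLess {
  bool operator()(const ToyInternalKey& a, const ToyInternalKey& b) const {
    if (a.user_key != b.user_key) return a.user_key < b.user_key;
    return a.tag > b.tag;
  }
};

class ToyMemTable {
 public:
  void Add(uint64_t seq, ToyValueType type, const std::string& key,
           const std::string& value) {
    table_[ToyInternalKey{key, (seq << 8) | type}] = value;
  }

  // Returns true if an entry for `key` is visible at `snapshot_seq`.
  // *found is false when that entry is a deletion marker.
  bool Get(const std::string& key, uint64_t snapshot_seq, std::string* value,
           bool* found) const {
    // Probe with the largest tag allowed for this snapshot; lower_bound then
    // skips every entry of `key` with a larger sequence number.
    ToyInternalKey probe{key, (snapshot_seq << 8) | kToyValue};
    auto it = table_.lower_bound(probe);
    if (it == table_.end() || it->first.user_key != key) return false;
    if ((it->first.tag & 0xff) == kToyDeletion) {
      *found = false;
      return true;
    }
    *value = it->second;
    *found = true;
    return true;
  }

 private:
  std::map<ToyInternalKey, std::string, ToyInternalKeyLess> table_;
};
```

With entries for key "k" at sequence 1 (value), 2 (value) and 3 (deletion), a lookup at snapshot 2 lands on the sequence-2 value, while a lookup at snapshot 3 lands on the deletion marker and reports the key as deleted.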

sstable

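An sstable is a file on disk that persistently stores the data of an Immutable MemTable. The TableBuilder class is what writes it, and its key piece is the internal struct Rep.
The data in an sstable falls into the following kinds by purpose: DataBlock, FilterBlock, MetaIndexBlock, IndexBlock and Footer. A DataBlock stores key-value pairs sorted by increasing key; on disk each block is written as three fields: data, type and crc. FilterBlock is left aside for now. MetaIndexBlock stores the index of the filter, IndexBlock stores the index of the data, and Footer is the index of the indexes: it holds the location of the MetaIndexBlock and the location of the IndexBlock.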
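As a rough sketch of what "index of the indexes" means at the byte level: each location is a handle (offset, size) encoded as two varint64 values, and the footer is two such handles padded to a fixed length and followed by a magic number. The 48-byte length and the magic constant below are recalled from table/format.h and should be treated as assumptions to verify against the source; the Toy* names and EncodeFooter are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

struct ToyBlockHandle {
  uint64_t offset;  // where the block starts in the file
  uint64_t size;    // block size, excluding the type/crc trailer
};

// Appends v as a varint64 (at most 10 bytes).
inline void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

inline void EncodeHandle(std::string* dst, const ToyBlockHandle& h) {
  PutVarint64(dst, h.offset);
  PutVarint64(dst, h.size);
}

// Assumed constants: two maximal handles (2 * 20 bytes) plus an 8-byte magic.
constexpr size_t kToyMaxHandleLength = 10 + 10;
constexpr size_t kToyFooterLength = 2 * kToyMaxHandleLength + 8;
constexpr uint64_t kToyMagic = 0xdb4775248b80fb57ull;

// Hypothetical helper: metaindex handle, index handle, zero padding, magic.
inline std::string EncodeFooter(const ToyBlockHandle& metaindex,
                                const ToyBlockHandle& index) {
  std::string out;
  EncodeHandle(&out, metaindex);
  EncodeHandle(&out, index);
  out.resize(2 * kToyMaxHandleLength);  // pad with zero bytes to a fixed length
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<char>((kToyMagic >> (8 * i)) & 0xff));
  }
  return out;
}
```

A fixed-length footer is what lets a reader open a file by reading only its last few bytes, then follow the two handles to the index blocks.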

struct TableBuilder::Rep {
  Rep(const Options& opt, WritableFile* f)
      : options(opt),
        index_block_options(opt),
        file(f),
        offset(0),
        data_block(&options),
        index_block(&index_block_options),
        num_entries(0),
        closed(false),
        filter_block(opt.filter_policy == nullptr
                         ? nullptr
                         : new FilterBlockBuilder(opt.filter_policy)),
        pending_index_entry(false) {
    index_block_options.block_restart_interval = 1;
  }
  Options options;
  Options index_block_options;
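  WritableFile* file;           // the .sst file being written
  uint64_t offset;              // current write offset in the file
  Status status;
  BlockBuilder data_block;      // data block under construction
  BlockBuilder index_block;     // index block
  std::string last_key;         // last key added; used to check that keys arrive in sorted order
  int64_t num_entries;
  bool closed;                  // true once Finish() or Abandon() has been called
  FilterBlockBuilder* filter_block;
  bool pending_index_entry;     // true when a data block has just been flushed and its index entry
                                // is not yet written; data_block is empty whenever this is true
  BlockHandle pending_handle;   // offset and size of that flushed data block, to be added to the index block
  std::string compressed_output;// scratch buffer holding the compressed block contents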
};

Writing data into an sstable follows the call chain Add() -> Flush() -> WriteBlock() -> WriteRawBlock(). The work done by each function is summarized below.

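  1. Add() first checks that the key of the new key-value pair is greater than last_key, which guarantees that keys in the sstable are in increasing order. If an index entry is pending (the previous datablock has just been flushed and the current one is still empty), it first adds that entry to the indexblock. It then adds the pair to the datablock and updates last_key and num_entries. If the datablock's size has not reached the threshold (4 KB by default) it stops there; otherwise it flushes the block to disk.
  2. Flush() calls WriteBlock() to write the datablock's contents to the underlying file.
  3. WriteBlock() takes the contents of the datablock, compresses them if configured, and calls WriteRawBlock(), which writes them in the following format: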

block_data: uint8[n] (the data)
type: uint8 (compression type)
crc: uint32 (checksum)

  4. WriteRawBlock() is the function that actually appends the data to the file.
void TableBuilder::Add(const Slice& key, const Slice& value) {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->num_entries > 0) {
    // Once entries have been added, each new key must be greater than last_key
    // so that keys in the table stay sorted
    assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
  }

  if (r->pending_index_entry) {
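    // A data block has just been flushed and the current one is still empty, so
    // emit the index entry for the flushed block now: its key is a short separator
    // between last_key and the new key, and its value is the encoded
    // pending_handle (the flushed block's offset and size)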
    assert(r->data_block.empty());
    r->options.comparator->FindShortestSeparator(&r->last_key, key);
    std::string handle_encoding;
    r->pending_handle.EncodeTo(&handle_encoding);
    r->index_block.Add(r->last_key, Slice(handle_encoding));
    r->pending_index_entry = false;
  }

  if (r->filter_block != nullptr) {
    r->filter_block->AddKey(key);
  }
  // Add the entry to the data block and update last_key and num_entries
  r->last_key.assign(key.data(), key.size());
  r->num_entries++;
  r->data_block.Add(key, value);

  // Once the data block's estimated size reaches the threshold (4 KB by default), flush it to disk
  const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
  if (estimated_block_size >= r->options.block_size) {
    Flush();
  }
}
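The call to FindShortestSeparator in Add() keeps index keys short: the index only needs some key that is at or after every key in the flushed block and before the first key of the next block. Below is a sketch of that idea for plain bytewise ordering. The function name ShortestSeparator is hypothetical, and the real call goes through the configured comparator, which also has to deal with the 8-byte tag of internal keys; that part is omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// If *start < limit, tries to replace *start with a shorter string s such that
// *start <= s < limit in bytewise order. Leaves *start unchanged when no easy
// shortening exists.
inline void ShortestSeparator(std::string* start, const std::string& limit) {
  const size_t min_len = std::min(start->size(), limit.size());
  size_t diff = 0;
  while (diff < min_len && (*start)[diff] == limit[diff]) {
    ++diff;
  }
  if (diff >= min_len) {
    return;  // one string is a prefix of the other: do not shorten
  }
  const uint8_t b = static_cast<uint8_t>((*start)[diff]);
  if (b < 0xff && b + 1 < static_cast<uint8_t>(limit[diff])) {
    // Bump the first differing byte and drop everything after it.
    (*start)[diff] = static_cast<char>(b + 1);
    start->resize(diff + 1);
  }
}
```

For a block ending at "abcdefg" followed by a block starting at "abzzz", the index key becomes "abd" rather than the full seven-byte key.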

void TableBuilder::Flush() {
  Rep* r = rep_;
  assert(!r->closed);
  if (!ok()) return;
  if (r->data_block.empty()) return;
  assert(!r->pending_index_entry);
  WriteBlock(&r->data_block, &r->pending_handle);
  if (ok()) {
    r->pending_index_entry = true;
    r->status = r->file->Flush();
  }
  if (r->filter_block != nullptr) {
    r->filter_block->StartBlock(r->offset);
  }
}
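Stripped of compression, filters and error handling, the interplay between Add() and Flush() can be sketched as below. ToyTableBuilder is a hypothetical stand-in: the "file" is a std::string, a block is a flat "key=value;" buffer, the index keeps full keys instead of shortened separators, and Finish() only flushes the last block and emits its index entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class ToyTableBuilder {
 public:
  struct Handle {
    size_t offset;
    size_t size;
  };

  explicit ToyTableBuilder(size_t block_size) : block_size_(block_size) {}

  void Add(const std::string& key, const std::string& value) {
    assert(num_entries_ == 0 || key > last_key_);  // keys must be increasing
    if (pending_index_entry_) {
      // The previous block is on disk; record where it is, keyed by its last key.
      index_.push_back({last_key_, pending_handle_});
      pending_index_entry_ = false;
    }
    last_key_ = key;
    ++num_entries_;
    data_block_ += key + "=" + value + ";";
    if (data_block_.size() >= block_size_) {
      Flush();
    }
  }

  void Flush() {
    if (data_block_.empty()) return;
    pending_handle_ = Handle{file_.size(), data_block_.size()};
    file_ += data_block_;  // stands in for WriteBlock() / WriteRawBlock()
    data_block_.clear();
    pending_index_entry_ = true;
  }

  // Simplified: flush the last block and emit its index entry.
  void Finish() {
    Flush();
    if (pending_index_entry_) {
      index_.push_back({last_key_, pending_handle_});
      pending_index_entry_ = false;
    }
  }

  const std::string& file() const { return file_; }
  const std::vector<std::pair<std::string, Handle>>& index() const {
    return index_;
  }

 private:
  size_t block_size_;
  std::string file_;
  std::string data_block_;
  std::string last_key_;
  size_t num_entries_ = 0;
  bool pending_index_entry_ = false;
  Handle pending_handle_{0, 0};
  std::vector<std::pair<std::string, Handle>> index_;
};
```

The index entry is deliberately delayed until the next Add(): only then is the first key of the following block known, which is what allows the real builder to pick a short separator key.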

void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {
  // File format contains a sequence of blocks where each block has:
  //    block_data: uint8[n]
  //    type: uint8
  //    crc: uint32
  assert(ok());
  Rep* r = rep_;
  Slice raw = block->Finish();  // finalize the block and get its raw contents

  Slice block_contents;
  CompressionType type = r->options.compression;
  // TODO(postrelease): Support more compression options: zlib?
  switch (type) {
    case kNoCompression:
      block_contents = raw;
      break;

    case kSnappyCompression: {
      std::string* compressed = &r->compressed_output;
      if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
          compressed->size() < raw.size() - (raw.size() / 8u)) {
        block_contents = *compressed;
      } else {
        // Snappy not supported, or compressed less than 12.5%, so just
        // store uncompressed form
        block_contents = raw;
        type = kNoCompression;
      }
      break;
    }
  }
  WriteRawBlock(block_contents, type, handle);
  // The block has been written; clear the buffers for the next block
  r->compressed_output.clear();
  block->Reset();
}
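The condition in the Snappy branch is easy to misread, so here it is as a standalone predicate with worked numbers. KeepCompressed is a hypothetical name; the expression is the one from WriteBlock().

```cpp
#include <cassert>
#include <cstddef>

// Returns true when the compressed form saves at least 12.5% (one eighth)
// of the raw size, using the same integer arithmetic as WriteBlock().
inline bool KeepCompressed(size_t raw_size, size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}
```

For a 1000-byte block the cutoff is 1000 - 125 = 875, so a compressed size of 874 bytes is kept while 875 bytes falls back to storing the block uncompressed.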

void TableBuilder::WriteRawBlock(const Slice& block_contents,
                                 CompressionType type, BlockHandle* handle) {
  Rep* r = rep_;
  handle->set_offset(r->offset);
  handle->set_size(block_contents.size());
  // Append to the underlying file: first the block contents, then the trailer (type + crc)
  r->status = r->file->Append(block_contents);
  if (r->status.ok()) {
    char trailer[kBlockTrailerSize];
    trailer[0] = type;    // first byte is the type, followed by the crc
    uint32_t crc = crc32c::Value(block_contents.data(), block_contents.size());
    crc = crc32c::Extend(crc, trailer, 1);  // Extend crc to cover block type
    EncodeFixed32(trailer + 1, crc32c::Mask(crc));
    r->status = r->file->Append(Slice(trailer, kBlockTrailerSize)); // append type and crc
    if (r->status.ok()) {
      r->offset += block_contents.size() + kBlockTrailerSize;
    }
  }
}
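The 5-byte trailer can be reproduced outside LevelDB with a small sketch. It uses a slow bitwise CRC-32C in place of LevelDB's optimized crc32c module, and the masking formula and delta constant are written from memory of util/crc32c.h, so treat them as assumptions to check. Crc32cExtend, MaskCrc and AppendRawBlock are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
// Passing the CRC of a prefix as `crc` continues the computation over more data.
inline uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; ++i) {
    crc ^= static_cast<uint8_t>(data[i]);
    for (int k = 0; k < 8; ++k) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Assumed masking step: rotate right by 15 bits and add a constant, so that a
// CRC stored inside data that is itself checksummed does not cause trouble.
inline uint32_t MaskCrc(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + 0xa282ead8u;
}

constexpr size_t kToyBlockTrailerSize = 5;  // 1-byte type + 4-byte crc

// Hypothetical helper: appends block contents, then the type byte and the
// masked CRC of (contents + type) in little-endian order.
inline void AppendRawBlock(std::string* file, const std::string& contents,
                           uint8_t type) {
  file->append(contents);
  char trailer[kToyBlockTrailerSize];
  trailer[0] = static_cast<char>(type);
  uint32_t crc = Crc32cExtend(0, contents.data(), contents.size());
  crc = Crc32cExtend(crc, trailer, 1);  // extend the crc to cover the type byte
  const uint32_t masked = MaskCrc(crc);
  for (int i = 0; i < 4; ++i) {
    trailer[1 + i] = static_cast<char>((masked >> (8 * i)) & 0xff);
  }
  file->append(trailer, kToyBlockTrailerSize);
}
```

Note that the handle records only the size of the contents, while the file offset advances by the contents plus the trailer, exactly as in the last lines of WriteRawBlock().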

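Finally there is the Finish() function, called when the TableBuilder is done. It writes everything that remains, that is the datablock, filterblock, metaindexblock, indexblock and footer, to the file on disk.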

Reference blog:

  1. https://www.cnblogs.com/ym65536/p/7751229.html
