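In the previous chapters we covered LevelDB's basic operations: creating a database, reading data, and writing data. Now it is time to look closely at the structures that actually store the data. We started the series with the skiplist implementation, and MemTable is built almost entirely on top of it. MemTable is the in-memory storage structure, and basic reads and writes go to it first, while sstable is the on-disk storage structure. This section is also where much of the essence of LevelDB lies.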
MemTable
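MemTable's structure is fairly simple: its get and put operations (Get() and Add() in the class below) translate directly into operations on the skip list. One more point to note is that MemTable carries a built-in reference count. It plays a role similar to a smart pointer, in that the object is destroyed only when the count drops to 0. The difference is that the count has to be managed by hand: every time you create a MemTable object you must call Ref() yourself, and you release it by calling Unref() rather than deleting it directly.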
class MemTable {
public:
// MemTables are reference counted. The initial reference count
// is zero and the caller must call Ref() at least once.
explicit MemTable(const InternalKeyComparator& comparator);
MemTable(const MemTable&) = delete;
MemTable& operator=(const MemTable&) = delete;
// Increase reference count.
void Ref() { ++refs_; }
// Drop reference count. Delete if no more references exist.
void Unref() {
--refs_;
assert(refs_ >= 0);
if (refs_ <= 0) {
delete this;
}
}
// Add an entry into memtable that maps key to value at the
// specified sequence number and with the specified type.
// Typically value will be empty if type==kTypeDeletion.
void Add(SequenceNumber seq, ValueType type, const Slice& key,
const Slice& value);
// If memtable contains a value for key, store it in *value and return true.
// If memtable contains a deletion for key, store a NotFound() error
// in *status and return true.
// Else, return false.
bool Get(const LookupKey& key, std::string* value, Status* s);
...
private:
typedef SkipList<const char*, KeyComparator> Table;
~MemTable(); // Private since only Unref() should be used to delete it
KeyComparator comparator_;
int refs_;
Arena arena_;
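// As shown here, MemTable stores its data in a skip list.
// Note that table_ is held by value rather than through a pointer, so the
// MemTable constructor simply invokes the SkipList constructor.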
Table table_;
...
};
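The manual reference counting described above can be sketched in isolation. The class below is a hypothetical stand-in, not LevelDB code: `RefCountedTable`, `live_objects`, and `DemoLifecycle` are names invented for illustration, and only the Ref()/Unref() pattern and the private destructor mirror MemTable.

```cpp
#include <cassert>

// Hypothetical stand-in for MemTable that keeps only the reference-counting
// part. RefCountedTable and live_objects are illustrative names that do not
// exist in LevelDB.
class RefCountedTable {
 public:
  RefCountedTable() : refs_(0) { ++live_objects; }
  RefCountedTable(const RefCountedTable&) = delete;
  RefCountedTable& operator=(const RefCountedTable&) = delete;

  // The count starts at zero, so the creator must call Ref() once.
  void Ref() { ++refs_; }

  // Drops one reference and destroys the object when none remain.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }

  static int live_objects;  // number of objects currently alive

 private:
  // Private so that Unref() is the only way to destroy the object.
  ~RefCountedTable() { --live_objects; }

  int refs_;
};

int RefCountedTable::live_objects = 0;

// Walks through the lifecycle: create, share, release.
// Returns the number of live objects after the last Unref().
int DemoLifecycle() {
  RefCountedTable* table = new RefCountedTable();
  table->Ref();    // owner's reference, count is 1
  table->Ref();    // a second user borrows it, count is 2
  table->Unref();  // second user done, count is 1, object still alive
  assert(RefCountedTable::live_objects == 1);
  table->Unref();  // owner done, count is 0, object deletes itself
  return RefCountedTable::live_objects;
}
```

Because the destructor is private, writing `delete table;` outside the class does not compile, which is how the pattern forces every owner to go through Unref().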
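The interface comments above are self-explanatory, so let us look at the implementation of Add. Roughly: it first computes the encoded length of the key-value entry (the format is listed in the comment at the top of the function), then allocates memory, fills the buffer according to that format, and finally inserts it into the skip list.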
void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
const Slice& value) {
// Format of an entry is concatenation of:
// key_size : varint32 of internal_key.size()
// key bytes : char[internal_key.size()]
// value_size : varint32 of value.size()
// value bytes : char[value.size()]
size_t key_size = key.size();
size_t val_size = value.size();
size_t internal_key_size = key_size + 8;
// The total size is encoded_len, laid out in the format shown above
const size_t encoded_len = VarintLength(internal_key_size) +
internal_key_size + VarintLength(val_size) +
val_size;
// Allocate encoded_len bytes from the arena
char* buf = arena_.Allocate(encoded_len);
// Write the internal key length at the start of the buffer
char* p = EncodeVarint32(buf, internal_key_size);
// Copy the user key bytes, then append the 8-byte tag:
// the sequence number shifted left by 8 bits, with the type in the low 8 bits
memcpy(p, key.data(), key_size);
p += key_size;
EncodeFixed64(p, (s << 8) | type);
p += 8;
// Write the value size, followed by the value bytes
p = EncodeVarint32(p, val_size);
memcpy(p, value.data(), val_size);
assert(p + val_size == buf + encoded_len);
// Finally, insert the buffer into the skip list
table_.Insert(buf);
}
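To make the entry layout concrete, here is a minimal sketch that builds the same byte sequence in a std::string instead of an Arena buffer. The helper names reuse LevelDB's where a counterpart exists (VarintLength), but the bodies are simplified re-implementations written for this example, and `EncodeEntry` is a hypothetical name. It assumes little-endian fixed-width encoding and the type values 0 for a deletion and 1 for a value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Number of bytes needed to store v as a varint (7 payload bits per byte).
int VarintLength(uint64_t v) {
  int len = 1;
  while (v >= 128) {
    v >>= 7;
    ++len;
  }
  return len;
}

// Appends v as a varint32: low 7 bits first, with the high bit set on every
// byte except the last.
void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends v as 8 little-endian bytes.
void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Builds one memtable entry: key_size, key bytes + tag, value_size, value.
// EncodeEntry is a hypothetical helper, not part of LevelDB.
std::string EncodeEntry(uint64_t seq, uint8_t type, const std::string& key,
                        const std::string& value) {
  std::string out;
  const size_t internal_key_size = key.size() + 8;  // user key + 8-byte tag
  PutVarint32(&out, static_cast<uint32_t>(internal_key_size));
  out.append(key);
  PutFixed64(&out, (seq << 8) | type);
  PutVarint32(&out, static_cast<uint32_t>(value.size()));
  out.append(value);
  // Same length check as in MemTable::Add.
  assert(out.size() == VarintLength(internal_key_size) + internal_key_size +
                           VarintLength(value.size()) + value.size());
  return out;
}
```

For the key "foo", the value "bar", sequence number 5, and type 1, the entry is 16 bytes long: 1 byte of key length (11), 3 key bytes, 8 tag bytes, 1 byte of value length, and 3 value bytes.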
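Next, the lookup path. Get builds an Iterator over the skip list and seeks to the lookup key. If nothing is found, that is, the iterator is invalid or the entry it lands on belongs to a different user key, Get returns false right away. Otherwise the entry is decoded using the format above. One more detail: the type, stored in the low 8 bits of the tag, tells whether the key has been deleted. For kTypeDeletion, Get sets *s to NotFound and returns true; for kTypeValue, it copies the value into the value parameter and returns true.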
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
Slice memkey = key.memtable_key();
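// Use an iterator to look up the key. Seek() positions it at the first entry
// >= memkey; the user key comparison below decides whether that entry matches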
Table::Iterator iter(&table_);
iter.Seek(memkey.data());
if (iter.Valid()) {
// entry format is:
// klength varint32
// userkey char[klength]
// tag uint64
// vlength varint32
// value char[vlength]
// Check that it belongs to same user key. We do not check the
// sequence number since the Seek() call above should have skipped
// all entries with overly large sequence numbers.
const char* entry = iter.key();
uint32_t key_length;
const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
if (comparator_.comparator.user_comparator()->Compare(
Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
// Correct user key
const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
switch (static_cast<ValueType>(tag & 0xff)) {
// kTypeValue: the entry was written by a put
case kTypeValue: {
Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
value->assign(v.data(), v.size());
return true;
}
// kTypeDeletion: the entry marks the key as deleted
case kTypeDeletion:
*s = Status::NotFound(Slice());
return true;
}
}
}
return false;
}
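The decoding half of Get can be sketched the same way. This is an illustrative parser over a std::string, not LevelDB's pointer-based code: `ParsedEntry` and `ParseEntry` are hypothetical names, and the helpers are simplified re-implementations that assume little-endian fixed-width encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical result type for this sketch.
struct ParsedEntry {
  std::string user_key;
  uint64_t sequence;
  uint8_t type;  // 0 = deletion, 1 = value
  std::string value;
};

// Reads a varint32 starting at *pos and advances *pos past it. Returns false
// if the input ends early or the varint runs past 5 bytes.
bool GetVarint32(const std::string& in, size_t* pos, uint32_t* result) {
  uint32_t v = 0;
  for (int shift = 0; shift <= 28 && *pos < in.size(); shift += 7) {
    const uint32_t byte = static_cast<unsigned char>(in[(*pos)++]);
    v |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *result = v;
      return true;
    }
  }
  return false;
}

// Decodes 8 little-endian bytes starting at pos.
uint64_t DecodeFixed64(const std::string& in, size_t pos) {
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) {
    v |= static_cast<uint64_t>(static_cast<unsigned char>(in[pos + i]))
         << (8 * i);
  }
  return v;
}

// Splits one memtable entry into user key, sequence number, type and value.
bool ParseEntry(const std::string& entry, ParsedEntry* out) {
  size_t pos = 0;
  uint32_t key_length = 0;
  if (!GetVarint32(entry, &pos, &key_length)) return false;
  if (key_length < 8 || entry.size() - pos < key_length) return false;
  out->user_key = entry.substr(pos, key_length - 8);
  // The tag occupies the last 8 bytes of the internal key.
  const uint64_t tag = DecodeFixed64(entry, pos + key_length - 8);
  out->sequence = tag >> 8;
  out->type = static_cast<uint8_t>(tag & 0xff);
  pos += key_length;
  uint32_t value_length = 0;
  if (!GetVarint32(entry, &pos, &value_length)) return false;
  if (entry.size() - pos < value_length) return false;
  out->value = entry.substr(pos, value_length);
  return true;
}
```

A caller in the spirit of Get would then branch on `type`: report NotFound for 0 and return `value` for 1.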
sstable
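An sstable is a file on disk that persists the data of an Immutable MemTable. The TableBuilder class is what writes it, and its key internal structure is Rep.
The data in an sstable falls into the following categories by purpose: DataBlock, FilterBlock, MetaIndexBlock, IndexBlock, and Footer. A DataBlock stores key-value pairs sorted in ascending key order, and it is written to disk as three fields: data, type, and crc. FilterBlock is set aside for now. MetaIndexBlock stores the index of the filter, IndexBlock stores the index of the data blocks, and Footer is the index of the indexes: it holds the location of the MetaIndexBlock and the location of the IndexBlock.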
struct TableBuilder::Rep {
Rep(const Options& opt, WritableFile* f)
: options(opt),
index_block_options(opt),
file(f),
offset(0),
data_block(&options),
index_block(&index_block_options),
num_entries(0),
closed(false),
filter_block(opt.filter_policy == nullptr
? nullptr
: new FilterBlockBuilder(opt.filter_policy)),
pending_index_entry(false) {
index_block_options.block_restart_interval = 1;
}
Options options;
Options index_block_options;
WritableFile* file; // the .sst file being written
uint64_t offset;
Status status;
BlockBuilder data_block; // the data block currently being built
BlockBuilder index_block; // the index block
std::string last_key; // the last key added, used to check that keys in the sstable stay ordered
int64_t num_entries;
bool closed; // whether the builder has been closed
FilterBlockBuilder* filter_block;
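bool pending_index_entry; // true only while data_block is empty after a flush: the flushed block's index entry has not been added yet
BlockHandle pending_handle; // offset and size in the file of the flushed data block, to be added to the index block
std::string compressed_output; // scratch buffer that holds the compressed block contents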
};
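Data is written into the sstable through the call chain Add() -> Flush() -> WriteBlock() -> WriteRawBlock(). The work done by each function is summarized below.
- Add() first checks that the key of the new key-value pair is greater than last_key, which guarantees that keys in the sstable are strictly increasing. If pending_index_entry is set, meaning a data block was just flushed and the current one is still empty, it first adds an index entry for the flushed block to the index block. It then inserts the pair into the data block and updates last_key and num_entries. If the data block's estimated size is still below the threshold (block_size, 4 KB by default), it returns; otherwise it calls Flush() to write the block to disk.
- Flush() calls WriteBlock() to write the contents of the data block to the underlying disk file.
- WriteBlock() takes the contents of the data block, compresses them if configured to, and then calls WriteRawBlock(), so that each block ends up on disk in the following format:
block_data: uint8[n] (the data)
type: uint8 (compression type)
crc: uint32 (checksum)
- WriteRawBlock() is what actually appends the data to the file.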
void TableBuilder::Add(const Slice& key, const Slice& value) {
Rep* r = rep_;
assert(!r->closed);
if (!ok()) return;
if (r->num_entries > 0) {
// Once entries have been added, each new key must be greater than last_key so that keys stay ordered
assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
}
if (r->pending_index_entry) {
// A data block was just flushed and the current one is still empty.
// pending_handle holds the offset and size of the flushed block, so its entry is added to the index block now
assert(r->data_block.empty());
r->options.comparator->FindShortestSeparator(&r->last_key, key);
std::string handle_encoding;
r->pending_handle.EncodeTo(&handle_encoding);
r->index_block.Add(r->last_key, Slice(handle_encoding));
r->pending_index_entry = false;
}
if (r->filter_block != nullptr) {
r->filter_block->AddKey(key);
}
// Update last_key and num_entries, then insert into the data block
r->last_key.assign(key.data(), key.size());
r->num_entries++;
r->data_block.Add(key, value);
// When the data block's estimated size reaches the threshold (4 KB by default), flush it to disk
const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
if (estimated_block_size >= r->options.block_size) {
Flush();
}
}
void TableBuilder::Flush() {
Rep* r = rep_;
assert(!r->closed);
if (!ok()) return;
if (r->data_block.empty()) return;
assert(!r->pending_index_entry);
WriteBlock(&r->data_block, &r->pending_handle);
if (ok()) {
r->pending_index_entry = true;
r->status = r->file->Flush();
}
if (r->filter_block != nullptr) {
r->filter_block->StartBlock(r->offset);
}
}
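The interplay between Add(), Flush(), and pending_index_entry is easier to see in a stripped-down model. The sketch below is a toy, not LevelDB code: `ToyBuilder` and `ToyIndexEntry` are hypothetical, a block is reduced to a byte counter, the index key is simply the last key of the flushed block rather than a shortened separator, and the per-block trailer is ignored when advancing the offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical index entry: a key that is >= every key in the block, plus
// the block's position in the "file".
struct ToyIndexEntry {
  std::string key;
  uint64_t offset;
  uint64_t size;
};

// Toy model of the Add()/Flush() flow with a deferred index entry.
class ToyBuilder {
 public:
  explicit ToyBuilder(size_t block_size) : block_size_(block_size) {}

  void Add(const std::string& key, const std::string& value) {
    assert(num_entries_ == 0 || key > last_key_);  // keys must increase
    if (pending_index_entry_) {
      // The previous block is already written; its index entry is emitted
      // only now, when the first key of the next block arrives.
      EmitPendingIndexEntry();
    }
    last_key_ = key;
    ++num_entries_;
    current_block_bytes_ += key.size() + value.size();
    if (current_block_bytes_ >= block_size_) {
      Flush();
    }
  }

  void Flush() {
    if (current_block_bytes_ == 0) return;  // nothing buffered
    pending_offset_ = offset_;
    pending_size_ = current_block_bytes_;
    offset_ += current_block_bytes_;  // trailer bytes ignored in this toy
    current_block_bytes_ = 0;
    pending_index_entry_ = true;
  }

  // Simplified end-of-table step: flush the last block and emit its entry.
  void Finish() {
    Flush();
    if (pending_index_entry_) EmitPendingIndexEntry();
  }

  const std::vector<ToyIndexEntry>& index() const { return index_; }

 private:
  void EmitPendingIndexEntry() {
    index_.push_back(ToyIndexEntry{last_key_, pending_offset_, pending_size_});
    pending_index_entry_ = false;
  }

  size_t block_size_;
  size_t current_block_bytes_ = 0;
  uint64_t offset_ = 0;
  uint64_t pending_offset_ = 0;
  uint64_t pending_size_ = 0;
  bool pending_index_entry_ = false;
  int64_t num_entries_ = 0;
  std::string last_key_;
  std::vector<ToyIndexEntry> index_;
};
```

With a block size of 10, adding ("a", "12345") and ("b", "12345") triggers a flush on the second call, but the index stays empty until a third key arrives or Finish() is called. Deferring the entry is what lets the real code pick a separator between last_key and the next key.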
void TableBuilder::WriteBlock(BlockBuilder* block, BlockHandle* handle) {
// File format contains a sequence of blocks where each block has:
// block_data: uint8[n]
// type: uint8
// crc: uint32
assert(ok());
Rep* r = rep_;
Slice raw = block->Finish(); // finish building the block and get its contents
Slice block_contents;
CompressionType type = r->options.compression;
// TODO(postrelease): Support more compression options: zlib?
switch (type) {
case kNoCompression:
block_contents = raw;
break;
case kSnappyCompression: {
std::string* compressed = &r->compressed_output;
if (port::Snappy_Compress(raw.data(), raw.size(), compressed) &&
compressed->size() < raw.size() - (raw.size() / 8u)) {
block_contents = *compressed;
} else {
// Snappy not supported, or compressed less than 12.5%, so just
// store uncompressed form
block_contents = raw;
type = kNoCompression;
}
break;
}
}
WriteRawBlock(block_contents, type, handle);
// After the block is written to disk, clear the compression buffer and reset the data block
r->compressed_output.clear();
block->Reset();
}
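The Snappy branch keeps the compressed form only when it saves more than 12.5% of the raw size. The condition is easy to misread, so here it is pulled out as a standalone predicate with a few boundary cases. `WorthStoringCompressed` and `CheckThresholdExamples` are hypothetical names used for this sketch only.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the size check in WriteBlock: the compressed form is kept only if
// it is smaller than 7/8 of the raw size (integer arithmetic, as in the
// original expression).
bool WorthStoringCompressed(size_t raw_size, size_t compressed_size) {
  return compressed_size < raw_size - (raw_size / 8u);
}

// A few boundary cases for a 4096-byte block: 4096 / 8 = 512, so the
// compressed form must be at most 3583 bytes to be kept.
void CheckThresholdExamples() {
  assert(WorthStoringCompressed(4096, 3583));
  assert(!WorthStoringCompressed(4096, 3584));
  assert(!WorthStoringCompressed(4096, 4096));
}
```

When the predicate is false, WriteBlock stores the raw bytes and sets the type to kNoCompression, so a reader never pays decompression cost for a block that barely shrank.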
void TableBuilder::WriteRawBlock(const Slice& block_contents,
CompressionType type, BlockHandle* handle) {
Rep* r = rep_;
handle->set_offset(r->offset);
handle->set_size(block_contents.size());
// Append to the underlying file: the block contents first, then the trailer (type and crc)
r->status = r->file->Append(block_contents);
if (r->status.ok()) {
char trailer[kBlockTrailerSize];
trailer[0] = type; // the first byte is the type, followed by the crc
uint32_t crc = crc32c::Value(block_contents.data(), block_contents.size());
crc = crc32c::Extend(crc, trailer, 1); // Extend crc to cover block type
EncodeFixed32(trailer + 1, crc32c::Mask(crc));
r->status = r->file->Append(Slice(trailer, kBlockTrailerSize)); // append the type and crc
if (r->status.ok()) {
r->offset += block_contents.size() + kBlockTrailerSize;
}
}
}
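The 5-byte trailer can be reproduced outside LevelDB with a small, slow CRC32C. The sketch below is a self-contained illustration under stated assumptions: the bitwise CRC loop replaces LevelDB's table-driven crc32c implementation, the function names (`Crc32cExtend`, `Crc32cValue`, `BuildTrailer`) are hypothetical, and the masking step follows the rotate-and-add scheme of crc32c::Mask with the constant 0xa282ead8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82f63b78). Extending a
// previous crc over more bytes gives the crc of the concatenated input.
uint32_t Crc32cExtend(uint32_t crc, const char* data, size_t n) {
  crc = ~crc;
  for (size_t i = 0; i < n; ++i) {
    crc ^= static_cast<unsigned char>(data[i]);
    for (int k = 0; k < 8; ++k) {
      crc = (crc >> 1) ^ (0x82f63b78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

uint32_t Crc32cValue(const char* data, size_t n) {
  return Crc32cExtend(0, data, n);
}

// Rotate right by 15 bits and add a constant, so that a crc stored inside
// data that is itself checksummed does not behave badly.
uint32_t Mask(uint32_t crc) {
  return ((crc >> 15) | (crc << 17)) + 0xa282ead8u;
}

uint32_t Unmask(uint32_t masked) {
  const uint32_t rot = masked - 0xa282ead8u;
  return (rot >> 17) | (rot << 15);
}

// Builds the trailer written after each block: 1 byte of compression type,
// then the masked crc of (block contents + type byte) as 4 little-endian
// bytes.
std::string BuildTrailer(const std::string& block_contents, uint8_t type) {
  char trailer[5];
  trailer[0] = static_cast<char>(type);
  uint32_t crc = Crc32cValue(block_contents.data(), block_contents.size());
  crc = Crc32cExtend(crc, trailer, 1);  // cover the type byte as well
  const uint32_t masked = Mask(crc);
  for (int i = 0; i < 4; ++i) {
    trailer[1 + i] = static_cast<char>((masked >> (8 * i)) & 0xff);
  }
  return std::string(trailer, sizeof(trailer));
}
```

A reader verifying a block would do the reverse: recompute the crc over the block contents plus the type byte, unmask the stored value, and compare the two.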
Finally, there is a Finish() function, called when the TableBuilder is done, which writes the data block, filter block, metaindex block, index block, and footer to the disk file.
Reference blog:
- https://www.cnblogs.com/ym65536/p/7751229.html