3.1.2.1 Constructing and maintaining SSTables
Maintaining a sorted structure in memory is straightforward: we can use a balanced tree such as a red-black or AVL tree.
To make our storage engine work:
- For writes -- add to an in-memory balanced tree ==> the memtable
- When the memtable grows beyond a few MB, write it out to disk as an SSTable file (the most recent segment)
- While the SSTable is being written to disk, writes continue to a new memtable instance
- For reads -- first look in the memtable, then in the most recent on-disk segment, then the next-older segment, etc.
- Run a merging/compaction process in the background from time to time
Problem : if the DB crashes, the most recent writes (still in the memtable) are lost.
Solution : keep a separate unsorted append-only log on disk for crash recovery; every time the memtable is flushed to an SSTable, discard the corresponding log.
3.1.2.2 Making an LSM-tree out of SSTables
Storage engines that are based on this principle of merging and compacting sorted files are often called LSM (Log-Structured Merge) storage engines.
Lucene -- an indexing engine for full-text search used by Elasticsearch and Solr -- uses a similar method for storing its term dictionary.
Idea of a full-text index -- given a word in a search query, find all the documents that mention that word. It is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
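That term-to-postings-list structure can be illustrated with a toy helper (`build_inverted_index` is a made-up name, and real engines tokenize and normalize far more carefully than `lower().split()`):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy full-text index: maps each term to the sorted list of IDs of
    the documents containing it (the postings list).
    `docs` maps document ID -> document text."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index({1: "sweet fruit", 2: "sour fruit", 3: "sweet tooth"})
index["fruit"]   # -> [1, 2]
index["sweet"]   # -> [1, 3]
```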
3.1.2.3 Performance optimizations
Problem 1
The LSM-tree algorithm can be slow when looking up keys that do not exist (you must check the memtable and every segment before concluding the key is absent)
Solution
Use Bloom filters -- a memory-efficient data structure for approximating the contents of a set. It can tell you when a key definitely does not appear in the DB, and thus saves many unnecessary disk reads for nonexistent keys.
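A minimal Bloom filter sketch, with illustrative parameters (real engines size the bit array `m` and hash count `k` from the expected number of keys and a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions in an m-bit array.
    May report false positives, but never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # derive k positions by salting the key; a real filter would use
        # faster non-cryptographic hashes
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => the key is definitely absent, so skip the disk reads
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

Before touching disk, the engine asks `might_contain(key)`; only a `True` answer (possibly a false positive) requires checking the segments.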
Problem 2
How to determine the order and timing of how SSTables are compacted and merged?
Solution
size-tiered compaction -- newer & smaller SSTables are successively merged into older & larger ones (HBase; Cassandra supports both)
leveled compaction -- the key range is split up into smaller SSTables and older data is moved into separate "levels" ==> compaction proceeds more incrementally and uses less disk space (LevelDB, RocksDB; Cassandra supports both)
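The merge step shared by both strategies can be sketched as follows (`compact` is a hypothetical helper that merges whole in-memory segments; real compaction streams from disk):

```python
import heapq

def compact(segments):
    """Merge sorted SSTable segments into one sorted segment.
    `segments` is ordered oldest first; each segment is a list of
    (key, value) pairs sorted by key. When a key appears in several
    segments, the value from the newest segment wins."""
    streams = [
        [(key, -age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    result = []
    for key, _neg_age, value in heapq.merge(*streams):
        # for equal keys, the tuple's -age field sorts the newest
        # segment's entry first, so we keep only the first per key
        if not result or result[-1][0] != key:
            result.append((key, value))
    return result
```

Because every input segment is already sorted, the merge is a sequential pass, like the merge step of mergesort.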
- Basic idea of LSM-tree -- keeping a cascade of SSTables that are merged in the background
- Even when the dataset is much bigger than the available memory, LSM-trees continue to work well
- READ -- you can efficiently perform range queries since data is stored in sorted order
- WRITE -- LSM-tree can support remarkably high write throughput because the disk writes are sequential
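The READ point can be made concrete: within one sorted segment, a range query is a binary search for the start key plus a sequential scan. A sketch over an in-memory segment (`range_query` is an illustrative helper):

```python
import bisect

def range_query(segment, lo, hi):
    """Return all (key, value) pairs with lo <= key <= hi from one
    segment sorted by key: binary-search for the start of the range,
    then read sequentially until its end."""
    keys = [k for k, _ in segment]
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return segment[start:end]
```

A full range query would run this against the memtable and every segment and merge the results, newest value winning per key.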
3.1.3 B-Trees
B-tree -- the most widely used indexing structure, standard in almost all relational DB and many nonrelational DB.
| Index structure | Similarity | Difference |
| --- | --- | --- |
| B-tree | keeps key-value pairs sorted by key ==> efficient lookups & range queries | breaks the data down into fixed-size blocks/pages (traditionally 4 KB); reads/writes 1 page at a time |
| LSM-tree | keeps key-value pairs sorted by key ==> efficient lookups & range queries | breaks the data down into variable-size segments (several MB), written sequentially |
Each page can be identified using an address or location, which allows 1 page to refer to another (pointer on disk).
branching factor -- the number of references to child pages in one page of the B-tree (typically several hundred in practice)
- To look up a key in the index, you start from the root page and go down to leaf pages.
- To update the value for an existing key, you search for the leaf page containing that key, change the value in that page, and write the page back to disk
- To add a new key, you find the page whose range includes the key and add it there. If there isn't enough free space in the page, split it into two half-full pages and update the parent page to account for the new key ranges
- this algorithm keeps the tree balanced -- a B-tree with n keys always has a depth of O(log n)
- most databases can fit into a B-tree that is 3 or 4 levels deep
- a 4-level tree of 4KB pages with a branching factor of 500 can store up to 256TB
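The lookup descent and the capacity arithmetic can be sketched together. The page encoding below is a made-up in-memory stand-in (real pages live on disk and are loaded by address):

```python
import bisect

# Toy B-tree pages: a leaf is ("leaf", keys, values); an interior page is
# ("node", separator_keys, children) with len(children) == len(separator_keys) + 1.

def btree_lookup(page, key):
    """Descend from the root to a leaf, binary-searching within each page."""
    kind, keys, rest = page
    if kind == "leaf":
        i = bisect.bisect_left(keys, key)
        return rest[i] if i < len(keys) and keys[i] == key else None
    # interior page: follow the child whose key range contains `key`
    return btree_lookup(rest[bisect.bisect_right(keys, key)], key)

# Capacity check from the text: 4 levels of branching factor 500
# give 500**4 leaf pages of 4 KB each.
capacity = 500 ** 4 * 4 * 1024   # bytes
assert capacity == 256 * 10 ** 12  # = 256 TB
```

Each page visited costs one disk read, so the O(log n) depth -- 3 or 4 levels in practice -- bounds the number of reads per lookup.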
Reference
Designing Data-Intensive Applications by Martin Kleppmann
Reposted from: https://www.jianshu.com/p/3fc426e4fa9c