3.1.2.1 Constructing and maintaining SSTables
Maintaining a sorted structure in memory is straightforward: we can use a balanced tree such as a red-black or AVL tree.
To make our storage engine work:
- For writes -- add to an in-memory balanced tree ==> the memtable
- When the memtable grows beyond a few MB, write it out to disk as an SSTable file (the most recent segment)
- While the SSTable is being written to disk, writes continue to a new memtable instance
- For reads -- first look in the memtable, then in the most recent on-disk segment, then the next-older segment, etc.
- Run a merging/compaction process in the background from time to time
Problem : if the DB crashes, the most recent writes (still in the memtable) are lost.
Solution : keep a separate unsorted append-only log on disk for crash recovery; every time the memtable is flushed to an SSTable, discard the corresponding log.
3.1.2.2 Making an LSM-tree out of SSTables
Storage engines that are based on this principle of merging and compacting sorted files are often called LSM (Log-Structured Merge) storage engines.
Lucene -- an indexing engine for full-text search used by Elasticsearch and Solr -- uses a similar method for storing its term dictionary.
Idea of a full-text index -- given a word in a search query, find all the documents that mention that word. It is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
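That term-to-postings-list structure can be illustrated with a toy helper (`build_inverted_index` is a made-up name, and real engines tokenize and normalize far more carefully than `lower().split()`):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy full-text index: maps each term to the sorted list of IDs of
    the documents containing it (the postings list).
    `docs` maps document ID -> document text."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_inverted_index({1: "sweet fruit", 2: "sour fruit", 3: "sweet tooth"})
index["fruit"]   # -> [1, 2]
index["sweet"]   # -> [1, 3]
```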
3.1.2.3 Performance optimizations
Problem 1
The LSM-tree algorithm can be slow when looking up keys that do not exist (you must check the memtable and every segment before concluding the key is absent)
Solution
Use Bloom filters -- a memory-efficient data structure for approximating the contents of a set. It can tell you when a key definitely does not appear in the DB, and thus saves many unnecessary disk reads for nonexistent keys.
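A minimal Bloom filter sketch, with illustrative parameters (real engines size the bit array `m` and hash count `k` from the expected number of keys and a target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions in an m-bit array.
    May report false positives, but never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # derive k positions by salting the key; a real filter would use
        # faster non-cryptographic hashes
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => the key is definitely absent, so skip the disk reads
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

Before touching disk, the engine asks `might_contain(key)`; only a `True` answer (possibly a false positive) requires checking the segments.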
Problem 2
How to determine the order and timing of how SSTables are compacted and merged?
Solution
size-tiered compaction -- newer & smaller SSTables are successively merged into older & larger ones (HBase; Cassandra supports both)
leveled compaction -- the key range is split up into smaller SSTables and older data is moved into separate "levels" ==> compaction proceeds more incrementally and uses less disk space (LevelDB, RocksDB; Cassandra supports both)
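The merge step shared by both strategies can be sketched as follows (`compact` is a hypothetical helper that merges whole in-memory segments; real compaction streams from disk):

```python
import heapq

def compact(segments):
    """Merge sorted SSTable segments into one sorted segment.
    `segments` is ordered oldest first; each segment is a list of
    (key, value) pairs sorted by key. When a key appears in several
    segments, the value from the newest segment wins."""
    streams = [
        [(key, -age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    result = []
    for key, _neg_age, value in heapq.merge(*streams):
        # for equal keys, the tuple's -age field sorts the newest
        # segment's entry first, so we keep only the first per key
        if not result or result[-1][0] != key:
            result.append((key, value))
    return result
```

Because every input segment is already sorted, the merge is a sequential pass, like the merge step of mergesort.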
- Basic idea of LSM-tree -- keeping a cascade of SSTables that are merged in the background
- Even when the dataset is much bigger than the available memory, LSM-trees continue to work well
- READ -- you can efficiently perform range queries since data is stored in sorted order
- WRITE -- LSM-tree can support remarkably high write throughput because the disk writes are sequential
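The READ point can be made concrete: within one sorted segment, a range query is a binary search for the start key plus a sequential scan. A sketch over an in-memory segment (`range_query` is an illustrative helper):

```python
import bisect

def range_query(segment, lo, hi):
    """Return all (key, value) pairs with lo <= key <= hi from one
    segment sorted by key: binary-search for the start of the range,
    then read sequentially until its end."""
    keys = [k for k, _ in segment]
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return segment[start:end]
```

A full range query would run this against the memtable and every segment and merge the results, newest value winning per key.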
3.1.3 B-Trees
B-tree -- the most widely used indexing structure, standard in almost all relational DB and many nonrelational DB.
| Index structure | Similarity | Difference |
| --- | --- | --- |
| B-tree | keeps key-value pairs sorted by key ==> efficient lookups & range queries | breaks the data down into fixed-size blocks/pages (traditionally 4 KB); reads/writes 1 page at a time |
| LSM-tree | keeps key-value pairs sorted by key ==> efficient lookups & range queries | breaks the data down into variable-size segments (several MB), written sequentially |
Each page can be identified using an address or location, which allows 1 page to refer to another (pointer on disk).
branching factor -- the number of references to child pages in one page of the B-tree (typically several hundred in practice)
- To look up a key in the index, you start from the root page and go down to leaf pages.
- To update the value for an existing key, you search for the leaf page containing that key, change the value in that page, and write the page back to disk
- To add a new key, you find the page whose range includes the key and add it there. If there isn't enough free space in the page, split it into two half-full pages and update the parent page to account for the new key ranges
- this algorithm keeps the tree balanced -- a B-tree with n keys always has a depth of O(log n)
- most databases can fit into a B-tree that is 3 or 4 levels deep
- a 4-level tree of 4KB pages with a branching factor of 500 can store up to 256TB
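The lookup descent and the capacity arithmetic can be sketched together. The page encoding below is a made-up in-memory stand-in (real pages live on disk and are loaded by address):

```python
import bisect

# Toy B-tree pages: a leaf is ("leaf", keys, values); an interior page is
# ("node", separator_keys, children) with len(children) == len(separator_keys) + 1.

def btree_lookup(page, key):
    """Descend from the root to a leaf, binary-searching within each page."""
    kind, keys, rest = page
    if kind == "leaf":
        i = bisect.bisect_left(keys, key)
        return rest[i] if i < len(keys) and keys[i] == key else None
    # interior page: follow the child whose key range contains `key`
    return btree_lookup(rest[bisect.bisect_right(keys, key)], key)

# Capacity check from the text: 4 levels of branching factor 500
# give 500**4 leaf pages of 4 KB each.
capacity = 500 ** 4 * 4 * 1024   # bytes
assert capacity == 256 * 10 ** 12  # = 256 TB
```

Each page visited costs one disk read, so the O(log n) depth -- 3 or 4 levels in practice -- bounds the number of reads per lookup.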
Reference
Designing Data-Intensive Applications by Martin Kleppmann
Reposted from: https://www.jianshu.com/p/3fc426e4fa9c