YunTable Development Diary (11) - BigTable performance optimization (reprint)

Source Address: http://peopleyun.com/?p=910

 

Although the legendary Donald Knuth once said that "premature optimization is the root of all evil," doing some optimization once the product code is basically stable is very helpful. For example, I once used multi-threading to cut a process that originally took 30 minutes down to just 30 seconds. Similarly, although the codebases of Windows 7 and Vista are very similar, Windows 7 made many optimizations on top of Vista, so it keeps Vista's brilliant visual effects while delivering excellent performance, which is what allowed it to successfully replace Windows XP. Coming back to BigTable: Google's engineers also applied a number of performance optimization techniques to it. Based on the BigTable paper, this article analyzes those techniques in detail. Before getting to the point: if you are not yet familiar with BigTable or YunTable, you can click here to read the previous articles in this series.

Locality Groups

Clients can combine multiple column families into a locality group, and the system generates a separate SSTable for each locality group in each Tablet. This mechanism lets similar column families be stored together, which brings two benefits. First, it reduces the amount of data read: for a large table with hundreds of column families, an application that only needs addresses can read just the locality group containing the address-related columns, without reading the other column families; and because this data is gathered together within a Tablet, fewer machines need to participate in a query. Second, it improves processing speed: for example, App Engine's Datastore implements its transaction mechanism on top of such grouping, thereby avoiding the performance weakness of traditional 2PC transactions, which are not well suited to a system like BigTable. In addition, some useful tuning parameters can be set per locality group. For example, a locality group can be declared to be stored entirely in memory. The Tablet server loads the SSTables of an in-memory locality group into memory lazily. Once loading is complete, reads of column families belonging to that locality group no longer need to touch the disk. This feature is especially useful for small pieces of data that are accessed frequently: inside BigTable, it is used to speed up access to the location-related column family of the METADATA table.
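The idea can be illustrated with a toy model (all class and column-family names here are illustrative, not BigTable's actual API): each locality group gets its own store, standing in for a separate SSTable, so a read only has to consult the group it needs.

```python
# Toy model of locality groups: one store per group stands in for a
# separate per-group SSTable, so a read touches only the relevant group.
class Tablet:
    def __init__(self, locality_groups):
        # e.g. {"address": ["addr:street", "addr:city"], "misc": ["meta:notes"]}
        self.groups = {g: {} for g in locality_groups}
        self.family_to_group = {
            fam: g for g, fams in locality_groups.items() for fam in fams
        }

    def write(self, row, family, value):
        group = self.family_to_group[family]
        self.groups[group][(row, family)] = value

    def read(self, row, family):
        # Only the store of one locality group is consulted.
        group = self.family_to_group[family]
        return self.groups[group].get((row, family))

t = Tablet({"address": ["addr:street", "addr:city"], "misc": ["meta:notes"]})
t.write("user1", "addr:city", "Mountain View")
t.write("user1", "meta:notes", "hello")
print(t.read("user1", "addr:city"))  # served entirely from the "address" group
```

A query for `addr:city` never looks inside the `misc` group's store, mirroring how an address-only application avoids reading unrelated column families.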

Compression

In the era of relational databases, compression ratios were low and the resulting performance gains were small, so for relational databases compression could only be regarded as a finishing touch. For column-oriented databases, however, the data of one or a few columns is stored together and tends to be similar, so the compression ratio can be surprisingly high, even reaching 1:9. Especially today, when CPU speed and memory capacity have far outpaced I/O transfer speed, spending a small amount of extra CPU time on decompression in exchange for a significant reduction in the time spent reading and transferring data yields a very noticeable performance improvement.

BigTable also uses a compression mechanism. For example, clients can control whether the SSTables of a locality group are compressed and, if so, which format is used. Each SSTable block is compressed with the user-specified format. If only part of the data in an SSTable is read, only those blocks need to be decompressed, not the entire file. BigTable adopts a customizable two-pass compression scheme. The first pass uses Bentley and McIlroy's approach, which compresses long common strings across a large scanning window; the second pass uses a fast compression algorithm that looks for repeated data within a small 16 KB scanning window. Both passes are very fast: on circa-2006 x86 hardware, the compression rate reached 100-200 MB/s and the decompression rate reached 400-1000 MB/s.

Although Google's engineers chose these compression algorithms primarily for speed rather than for the space saved, the space performance of this two-pass compression is amazing. For example, the space compression ratio can reach as high as 10:1, much better than the 3:1 or 4:1 typical of conventional Gzip compression. The two-pass scheme is so efficient because similar data is clustered together, which yields a higher compression rate; and when multiple versions of the same data are stored in BigTable, the compression efficiency is higher still.
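The "multiple versions compress well" point is easy to demonstrate. The sketch below uses plain zlib rather than BigTable's actual Bentley-McIlroy plus fast-algorithm pair, and the data is made up, but it shows the same effect: clustering near-identical versions of a value multiplies the compression ratio.

```python
import zlib

# One version of a value, and ten lightly edited versions of it, standing in
# for the multiple timestamped versions BigTable may keep for a cell.
base = "BigTable stores data as immutable SSTables. " * 20
versions = [base + f"(edit {i})" for i in range(10)]

single = versions[0].encode()
clustered = "".join(versions).encode()   # similar data stored together

ratio = lambda data: len(data) / len(zlib.compress(data))
print(f"single version ratio: {ratio(single):.1f}:1")
print(f"ten versions   ratio: {ratio(clustered):.1f}:1")
```

The clustered layout compresses far better per byte, because every version after the first is almost entirely back-references to earlier data within the window.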

Caching to improve read performance

Caching is almost always a silver bullet for performance. Although caching is not one of BigTable's core mechanisms, Tablet servers use a two-level caching strategy to improve the performance of read operations. The Scan Cache is the first level: it mainly caches the key-value pairs that the Tablet server obtains through the SSTable interface. The Block Cache is the second level: it caches SSTable blocks read from GFS. The Scan Cache is very effective for applications that repeatedly read the same data; the Block Cache is useful for applications that tend to read data near the data they just read, for example sequential reads, or random reads of different columns within a hot row of a locality group.
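A minimal sketch of the two levels, assuming a fake "GFS" read so we can count disk accesses (the function names and block layout are invented for illustration):

```python
from functools import lru_cache

# Toy two-level read path: a Scan Cache over key/value results and a
# Block Cache over whole blocks fetched from a stand-in for GFS.
DISK_READS = {"count": 0}
BLOCK_SIZE = 2
SSTABLE = {"a": "1", "b": "2", "c": "3", "d": "4"}
KEYS = sorted(SSTABLE)

@lru_cache(maxsize=128)           # Block Cache: blocks read from "GFS"
def read_block_from_gfs(block_no):
    DISK_READS["count"] += 1
    keys = KEYS[block_no * BLOCK_SIZE:(block_no + 1) * BLOCK_SIZE]
    return tuple((k, SSTABLE[k]) for k in keys)

@lru_cache(maxsize=128)           # Scan Cache: individual key/value pairs
def read_value(key):
    block = read_block_from_gfs(KEYS.index(key) // BLOCK_SIZE)
    return dict(block)[key]

read_value("a"); read_value("a")  # repeat read: served by the Scan Cache
read_value("b")                   # nearby key, same block: Block Cache hit
print(DISK_READS["count"])        # only one "disk" read for all three
```

Repeating the same key hits the first level; reading a neighboring key hits the second level, which is exactly the sequential-read pattern the Block Cache is meant to serve.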

Bloom filter

A Bloom filter is a mechanism for quickly determining that a piece of data does not exist in a set. A read operation must examine every SSTable that makes up the state of a Tablet. If those SSTables are not in memory, this means repeated disk accesses. The number of disk accesses can be reduced by consulting each SSTable's Bloom filter: for example, a Bloom filter can be queried to check whether an SSTable contains data for a specific row/column pair. For certain applications, a small amount of memory spent storing Bloom filters buys a significant reduction in the number of disk accesses required by reads. Using Bloom filters also means that when an application accesses a row or column that does not exist, most of the time no disk access is needed at all.
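A minimal Bloom filter sketch (sizes and hash construction chosen for readability, not tuned): a `False` answer means the key is certainly absent and the SSTable read can be skipped; a `True` answer means "possibly present," so the SSTable must still be consulted.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                     # bit array packed into one int

    def _positions(self, key):
        # Derive k independent positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => definitely not in the SSTable: skip the disk access.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row1:col1")
print(bf.might_contain("row1:col1"))   # True
print(bf.might_contain("rowX:colY"))   # almost certainly False
```

In BigTable's setting the filter is built per SSTable over row/column keys, so a lookup for a nonexistent row usually gets `False` from every filter and never touches the disk.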

Commit log implementation

If the commit log of each Tablet were kept in a separate file, a very large number of files would be generated and written to GFS in parallel. Depending on the file-system implementation underlying the GFS servers, writing these log files to different disks would cause a large number of inefficient random disk operations, which would greatly reduce the performance of the whole BigTable system. To avoid these problems, each Tablet server keeps a single commit log file: mutation records for all of its Tablets are appended to that one log file, so a single physical log file actually mixes the mutation records of many Tablets.

Although using a single log significantly improves performance, it complicates recovery. When a Tablet server goes down, the Tablets it was serving are moved to many other Tablet servers, each of which typically loads only a few of the original server's Tablets. To restore a Tablet's state, the new Tablet server needs to extract that Tablet's mutation records from the original server's commit log and re-apply them. However, the mutation records of all these Tablets are mixed together in the same log file. One approach would be for each new Tablet server to read the entire commit log file and replay only the entries relevant to the Tablets it needs to recover. With this approach, if 100 Tablet servers each load one Tablet from the failed server, the log file would be read 100 times (once per server).

To avoid reading the log file multiple times, the system first sorts the log entries by the key (table, row name, log sequence number). After sorting, the mutation records of the same Tablet are stored contiguously, so they can be read with a single disk seek followed by a sequential read. To parallelize the sort, the log is first split into 64 MB segments, and different Tablet servers sort the segments in parallel. The Master server coordinates this sorting work, which begins when a Tablet server indicates that it needs to recover Tablets from a commit log file.
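The key step is just a sort by that composite key; the entry format below is invented for illustration, and the real system additionally splits the log into 64 MB segments sorted in parallel.

```python
# Sketch of the recovery-time sort: entries keyed by
# (table, row name, log sequence number), so each Tablet's mutations
# end up contiguous and in order, ready for one seek + sequential read.
log = [
    ("users",  "rowB", 3, "set x=2"),
    ("orders", "rowA", 1, "set y=9"),
    ("users",  "rowA", 2, "set x=1"),
    ("users",  "rowB", 4, "set x=3"),
]

log.sort(key=lambda e: (e[0], e[1], e[2]))

# After sorting, both ("users", "rowB") entries sit next to each other,
# in sequence-number order, so a recovering server replays them in turn.
for entry in log:
    print(entry)
```

A server recovering the Tablet containing `("users", "rowB")` now reads only that contiguous slice instead of scanning the whole mixed log.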

Exploiting immutability

In BigTable, the SSTables the system generates are immutable (apart from the SSTable caches), and this fact can be used to simplify the system. For example, when reading data from SSTables, there is no need to synchronize file-system accesses, which makes highly efficient concurrent row-level operations possible. The memtable is the only mutable data structure accessed by both reads and writes. To reduce contention on reads, the memtable can use a copy-on-write (COW) mechanism, which allows reads and writes to proceed in parallel.
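A minimal copy-on-write memtable sketch (the class and method names are illustrative): writers copy the map and swap the reference atomically, so readers holding a snapshot never see a partial update and never block.

```python
import threading

class Memtable:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()   # serializes writers only

    def put(self, key, value):
        with self._lock:
            new = dict(self._data)      # copy on write
            new[key] = value
            self._data = new            # atomic reference swap

    def snapshot(self):
        return self._data               # readers get an immutable view

mt = Memtable()
mt.put("row1", "v1")
snap = mt.snapshot()                    # reader takes a snapshot
mt.put("row2", "v2")                    # concurrent write does not disturb it
print(snap)                             # {'row1': 'v1'}
print(mt.snapshot())                    # {'row1': 'v1', 'row2': 'v2'}
```

The price is copying on each write, which is why this pattern suits a small in-memory structure like a memtable that is periodically flushed to an immutable SSTable.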

Because SSTables are immutable, the problem of permanently deleting data that has been marked "deleted" turns into the problem of garbage-collecting obsolete SSTables: the Master server removes obsolete SSTables using a "mark and delete" form of garbage collection.
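The shape of that garbage collection can be sketched as a simple mark-and-sweep over SSTable references (the data layout here is hypothetical; in BigTable the live set comes from the METADATA table):

```python
# Mark-and-sweep over SSTables: anything still referenced by some Tablet's
# metadata is live; everything else is obsolete and can be deleted.
def collect_garbage(all_sstables, metadata_refs):
    live = set()
    for refs in metadata_refs.values():                # mark phase
        live.update(refs)
    return [s for s in all_sstables if s not in live]  # sweep: to delete

all_sstables = ["sst-001", "sst-002", "sst-003"]
metadata_refs = {"tablet-A": ["sst-001"], "tablet-B": ["sst-003"]}
print(collect_garbage(all_sstables, metadata_refs))    # ['sst-002']
```

Because SSTables are never modified in place, an unreferenced file can be deleted safely: no reader can be in the middle of writing to it.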

Reproduced in: https://www.cnblogs.com/licheng/archive/2010/09/09/1821918.html


Origin: blog.csdn.net/weixin_34085658/article/details/92626841