JIMDB data persistence practice

 Jingdong Technology www.toutiao.im

 

 

Background

JIMDB is a distributed cache and high-speed key-value storage service developed by JD.com on top of Redis. It supports large-capacity caching, high data availability, multiple I/O strategies, automatic failover, and dynamic expansion. It serves business systems across the company, including many critical ones: front-end product detail pages, the trading platform, advertising, search, and instant messaging, as well as back-end order fulfillment, inventory management, delivery and logistics, and more.

 

Redis is entirely memory-resident, and memory is often insufficient. At startup, Redis must load all data into memory, so startup is slow when the data volume is large. Capacity planning never keeps up with business growth: total memory is constantly exceeded, trapping us in an endless nightmare of expansion after expansion. Because of this, JIMDB introduces two-level RAM + SSD storage: hot data stays in memory while cold data is automatically swapped to disk, solving the problem of insufficient memory. At startup, data is no longer loaded into memory in full; instead it is loaded on demand at runtime, solving the slow-startup problem. And since the second-level storage capacity is usually much larger, frequent expansion is no longer necessary.

 

We chose LevelDB as the persistent storage engine. LevelDB is Google's open-source single-node persistent KV database. It offers high random-write and sequential read/write performance and is well suited to storing small values, persistent indexes, and persistent asynchronous tasks. Many open-source projects use LevelDB as their underlying storage engine, such as Chromium, Taobao's Tair, and SSDB.

 

 

Technical solutions

Synchronous write or asynchronous write

Synchronous writing requires Redis to touch the disk every time a key is modified. Although LevelDB's write performance is high, frequent disk operations still hurt overall performance significantly. We therefore adopted asynchronous writing. Compared with synchronous writing, asynchronous writing achieves higher write throughput by using LevelDB's batch-write facility, and it also reduces the number of disk operations, further improving performance.

 

Disk-memory data exchange

When Redis looks up a key, it first searches memory. Only if the key is not found in memory does it search the disk; if it is still not found there, the key does not exist. If the key is found on disk, the KV pair is added to Redis's dictionary, i.e. loaded into memory.
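A minimal sketch of this two-level lookup path, using plain Python dicts to stand in for Redis's in-memory dict and the LevelDB store (all names here are illustrative, not the actual JIMDB code):

```python
def lookup(key, mem_dict, disk_db):
    """Two-level lookup: memory first, then disk; promote to memory on a disk hit."""
    if key in mem_dict:            # hot path: in-memory hit
        return mem_dict[key]
    value = disk_db.get(key)       # cold path: consult the disk store
    if value is None:
        return None                # the key does not exist at all
    mem_dict[key] = value          # load the KV pair into memory
    return value
```

The promotion step is what makes frequently accessed cold keys hot again, so repeated reads of the same key only pay the disk cost once.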

 

After Redis modifies a key, we mark it as a dirty key by adding the key to a dictionary dedicated to dirty keys, the dirty dict. To save space, only the key itself is stored (the value is NULL). A scheduled task periodically runs a "flush dirty keys to disk" job, which writes the keys recorded in the dirty dict to disk; when writing, each key's value is serialized and encoded into a binary stream. Once all the keys in the dirty dict have been synchronized to disk, the dirty dict is emptied.
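The dirty-dict lifecycle described above can be sketched as follows. This is an illustrative model, not the real implementation: `pickle` stands in for the binary serialization, a plain dict stands in for LevelDB, and the single `update` call stands in for LevelDB's batched write:

```python
import pickle

def mark_dirty(dirty_dict, key):
    """Record a modified key; only the key is kept (value is None)."""
    dirty_dict[key] = None

def flush_dirty_keys(dirty_dict, mem_dict, disk_db):
    """Scheduled task: write every dirty key's current value to disk
    as a serialized binary stream, then empty the dirty dict."""
    batch = {}
    for key in dirty_dict:
        batch[key] = pickle.dumps(mem_dict[key])  # serialize value to bytes
    disk_db.update(batch)                         # one batched disk write
    dirty_dict.clear()
```

Accumulating changes in the dirty dict and writing them in one batch is exactly what makes the asynchronous path cheaper than a disk write per modification.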

 

 

Master-slave synchronization

In the two-level storage scenario we stay compatible with Redis's synchronization process, but the database snapshot must be generated by traversing the disk, which requires that all dirty keys in memory be synchronized to disk before the snapshot is generated. In addition, after the slave receives the snapshot, it no longer loads it into memory; instead it loads it to disk by calling LevelDB's write interface.

 

Note that when Redis generates a database snapshot it normally starts a separate background process to do the work, but the LevelDB storage engine allows only one process to access the database at a time, so we create a thread rather than a process to generate the snapshot.

 

Cluster node split/merge

When Jimdb cluster nodes are split or merged, the KV pairs on the specified slots are transmitted directly to the target Redis server over the network (unlike Redis master-slave full synchronization, which must generate a database snapshot), and the target Redis builds its database and loads it into memory. In two-level storage mode, the disk must be scanned and the KV pairs on the specified slots sent to the target Redis over the network; before scanning, the dirty keys in memory must be synchronized to disk. The target Redis persists the received data to disk through LevelDB's write interface.

 

Problems encountered and solutions

In the process of project practice, we have encountered many problems. Here we will introduce in detail some of the core problems and our solutions.

 

Insufficient memory

Disk space is generally an order of magnitude larger than memory, and every read and write operation loads data from disk into memory, so memory runs short after the server has run for a while. Redis does check memory usage before processing each command and, when the configured maximum memory is exceeded, evicts some keys according to the configured eviction policy, but this does not meet our needs. We therefore need our own policy for evicting some in-memory keys when memory is found to be insufficient (dirty keys are never evicted), to free up valuable memory resources.

 

We adopt different memory eviction strategies depending on the interval in which the current memory occupancy (mem_used / max_mem) falls:

 

When memory occupancy exceeds 85%, a random eviction strategy is used: keys are evicted at random until occupancy drops below 75%.

When memory occupancy exceeds 75% (but not 85%), an LRU eviction strategy is used; following Redis's eviction approach, we randomly sample keys and then evict them in LRU order.

 

Both eviction strategies may take a long time, however. If run directly at the point where Redis checks memory usage, they would inevitably add latency to individual requests. We therefore run the strategies above in a scheduled task, and for Redis's original memory-occupancy check we use a quick-check scheme: when memory occupancy exceeds 90%, a few keys are evicted at random.
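The tiered policy can be sketched like this. It is a simplified model under stated assumptions: every key is treated as costing one memory unit, the sample size of 5 is arbitrary, and `lru_clock` is a hypothetical map from key to last-access time:

```python
import random

def evict(mem_dict, dirty_dict, mem_used, max_mem, lru_clock):
    """Tiered eviction sketch.  mem_used/max_mem is the occupancy ratio.
    Dirty keys are never evicted (they have not been flushed to disk yet)."""
    candidates = [k for k in mem_dict if k not in dirty_dict]
    ratio = mem_used / max_mem
    if ratio > 0.85:
        # random eviction until occupancy would fall to 75%
        target = int(0.75 * max_mem)
        while candidates and mem_used > target:
            key = random.choice(candidates)
            candidates.remove(key)
            del mem_dict[key]
            mem_used -= 1                  # assume each key costs one unit
    elif ratio > 0.75:
        # Redis-style approximate LRU: sample a few keys, evict the oldest
        sample = random.sample(candidates, min(5, len(candidates)))
        if sample:
            victim = min(sample, key=lambda k: lru_clock[k])
            del mem_dict[victim]
            mem_used -= 1
    return mem_used
```

Sampling instead of maintaining a full LRU list is the same trade-off Redis itself makes: approximate recency at a fraction of the bookkeeping cost.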

 

Key's DB attribute storage

To facilitate splitting and merging keys across the cluster, Jimdb modified Redis to divide the keyspace into 16384 slots and to use only db 0. To make it easy to traverse the keys of a specified slot on disk, we encode keys stored in LevelDB in the following format:

 



 

 

Each stored key has the form {flag}{slot_id}{key}. The first byte, flag, is described later; {slot_id} is the slot ID (0~16383) encoded as a fixed-width four-character hexadecimal string; and because only db 0 is used, the db_id does not need to be stored, so the real key follows directly. With this layout, traversing the keys of a specified slot is simply a prefix scan: LevelDB stores KV pairs in key order, so all keys sharing a fixed prefix can be read sequentially.
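A sketch of this encoding (function names are illustrative; the fixed-width hex slot ID is what makes the prefix scan work):

```python
def encode_key(flag, slot_id, key):
    """Encode a stored key as: 1-byte flag + 4 hex chars of slot ID + raw key.
    flag '0' marks a normal KV; db_id is omitted because only db 0 is used."""
    assert 0 <= slot_id <= 16383
    return ("%s%04x" % (flag, slot_id)).encode() + key

def decode_key(stored):
    """Split a stored key back into (flag, slot_id, key)."""
    flag = chr(stored[0])
    slot_id = int(stored[1:5], 16)
    return flag, slot_id, stored[5:]

def slot_prefix(flag, slot_id):
    """Prefix for a sequential scan of one slot (LevelDB keeps keys sorted)."""
    return ("%s%04x" % (flag, slot_id)).encode()
```

Because the prefix is fixed-width, all keys of one slot are contiguous in LevelDB's sorted key space, so a slot can be exported for a split/merge with a single range scan.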

 

Key expiration time storage

For keys with an expiration time set, because LevelDB can store binary-safe data, we simply splice the expiration time (type long long) onto the end of the serialized value. After fetching a KV pair from LevelDB, the expiration time is parsed out while the value is deserialized; if an expiration time is present, it is added to Redis's expire dict.
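A sketch of that splice using Python's `struct` for the 8-byte long long. One assumption is made explicit here: the article does not say how the absence of an expiration is encoded, so this sketch uses a sentinel value of 0:

```python
import struct

NO_EXPIRE = 0  # sentinel for "no expiration" (an assumption, not from the article)

def pack_value(serialized, expire_at=NO_EXPIRE):
    """Splice the expiration time (long long, 8 bytes) after the value bytes."""
    return serialized + struct.pack("<q", expire_at)

def unpack_value(stored):
    """Recover (value_bytes, expire_at); LevelDB values are binary-safe."""
    (expire_at,) = struct.unpack("<q", stored[-8:])
    return stored[:-8], expire_at
```

Appending a fixed-width field means no extra lookup is needed: the value and its TTL come back from disk in one read.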

 

Flushing dirty keys blocks Redis

If every run of the scheduled task synchronized all dirty keys to disk, a single run could seriously hurt Redis performance when there are many dirty keys. We therefore process them gradually in time slices, with one small optimization: the size of the time slice is adjusted dynamically according to the ratio of dirty keys to the total number of keys in Redis's dict. A high ratio means many dirty keys await synchronization, so a larger time slice is allocated. If the ratio is very high, all dirty keys are forcibly synchronized to disk at once: if time-slicing were still used in that scenario, memory occupancy could climb while the large number of dirty keys (which cannot be evicted) prevents memory from being released, affecting the availability of the Redis server.
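A sketch of the time-sliced flush. The slice-growth formula and the thresholds (`base_slice_ms`, `force_ratio`) are illustrative assumptions, not values from the article:

```python
import time

def flush_time_slice(dirty_dict, total_keys, write_to_disk,
                     base_slice_ms=2, force_ratio=0.5):
    """Flush dirty keys for one time slice.  The slice grows with the
    dirty ratio; above force_ratio, everything is flushed at once."""
    ratio = len(dirty_dict) / max(total_keys, 1)
    if ratio >= force_ratio:
        # too many dirty keys: flush everything so memory can be freed
        for key in list(dirty_dict):
            write_to_disk(key)
            del dirty_dict[key]
        return
    slice_ms = base_slice_ms * (1 + ratio * 10)   # larger ratio -> larger slice
    deadline = time.monotonic() + slice_ms / 1000.0
    for key in list(dirty_dict):
        if time.monotonic() >= deadline:
            break                                  # resume on the next run
        write_to_disk(key)
        del dirty_dict[key]
```

Keys not flushed within the slice simply remain in the dirty dict for the next scheduled run, so progress is incremental but guaranteed.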

 

Note also that whenever a database snapshot must be generated, a cluster split/merge occurs, or the Redis server exits, all dirty keys must be forcibly synchronized to disk to avoid data loss.

 

Disk scans block Redis

Some Redis operations trigger a scan of the disk. When the amount of data on disk is large, such operations consume a lot of CPU and delay responses to other clients; if a scan takes long enough, the cluster's sentinel system may falsely report the instance as dead and trigger failover. For disk-scan operations we therefore yield the CPU periodically during the scan, i.e. call aeProcessEvents(), so that Redis can handle other events.
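The yielding pattern looks roughly like this sketch, where `process_events` stands in for a call to aeProcessEvents() and `yield_every` is an illustrative batch size:

```python
def scan_with_yield(disk_db, process_events, yield_every=1000):
    """Walk the on-disk keys, handing control back to the event loop
    every `yield_every` keys so other clients can be served."""
    for i, key in enumerate(sorted(disk_db)):   # LevelDB iterates in key order
        if i and i % yield_every == 0:
            process_events()                    # let Redis handle other events
        yield key
```

The trade-off is a slightly slower scan in exchange for bounded request latency and no spurious sentinel failovers.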

 

Expired key scan

If a key with an expiration time is stored only on disk, it continues to occupy disk space even after it expires, wasting disk. We therefore periodically scan the keys on disk that have an expiration time set and delete the expired ones on the fly to reclaim disk space.

 

As mentioned in the previous section, the expiration time is spliced onto the end of the serialized value. Scanning all keys and deserializing each value just to see whether an expiration time is set would obviously be very inefficient, so we additionally store every key with an expiration time as a separate { key, expiretime } entry. This is where the first byte of the key-storage format comes in handy: we agree that "0" marks a normal KV pair and "1" marks an expiration entry that stores only the expiration time. We can then scan just the keys with prefix "1" to quickly determine which keys have timed out.
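A sketch of the flag-"1" index and the prefix scan over it, reusing the key layout from the encoding section (names illustrative; a sorted dict stands in for LevelDB's ordered key space):

```python
import struct

def make_expire_entry(slot_id, key, expire_at):
    """Index entry under flag '1': stored key -> 8-byte expiration time."""
    idx_key = ("1%04x" % slot_id).encode() + key
    return idx_key, struct.pack("<q", expire_at)

def scan_expired(disk_db, now):
    """Scan only flag-'1' entries (a prefix scan in sorted key order)
    and report expired data keys as (slot_id, key) pairs."""
    expired = []
    for stored_key in sorted(disk_db):          # LevelDB iterates in key order
        if not stored_key.startswith(b"1"):
            continue                            # skip normal KV entries
        (expire_at,) = struct.unpack("<q", disk_db[stored_key])
        if expire_at <= now:
            expired.append((int(stored_key[1:5], 16), stored_key[5:]))
    return expired
```

Because all flag-"1" entries sort together and each holds only 8 bytes, the expiration scan touches a tiny, contiguous fraction of the database.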

 

We added a periodic expired-key scan to the scheduled tasks. Because this task is not urgent, the interval between scans is relatively large (currently 5 seconds), and each scan runs for a fixed time slice, exiting when the time is up.

 

Test results show that 20 GB of expired keys (stored only on disk) can be deleted from disk within a few minutes.

 

Performance

Because we flush to disk asynchronously, most operations complete purely in memory, so performance should not drop much. Actual tests show performance basically on par with the original memory-only JIMDB; TP99 rises slightly but stays under 10 ms, well within the acceptable range (the business side requires under 20 ms). The project as a whole meets its design expectations.

 


Summary

The two-level storage solution greatly expands JIMDB's capacity, saves valuable memory resources, and remains fully compatible with Redis's communication protocol and data types, meeting the needs of the business side well.

 

 

The above content is original to the WeChat public account IPDCHAT; please credit the source when reprinting.

 

 
