MySQL crc32 and crc64 functions to improve string query efficiency

1. Concept: CRC stands for Cyclic Redundancy Check. CRC32 is one CRC algorithm; it produces a 32-bit checksum and is often used to verify files transmitted over a network.

2. How can CRC32 be used in MySQL to speed up queries?

Basic characteristics of CRC32:

# 1. The CRC32 function returns a value in the range 0-4294967295 (2 to the 32nd power minus 1)

# 2. Compared to MD5, CRC32 values collide far more easily, because the result has only 32 bits while an MD5 digest has 128
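To make these two characteristics concrete, here is a small Python sketch. It assumes that Python's `zlib.crc32` computes the same standard CRC-32 as MySQL's CRC32() and that CRC values are spread roughly uniformly; the helper `expected_colliding_pairs` is a hypothetical name, and its result is a birthday-style estimate rather than a measurement.

```python
import zlib

# Largest value a 32-bit CRC can take: 2**32 - 1 = 4294967295.
CRC32_MAX = 2**32 - 1


def expected_colliding_pairs(n: int, bits: int) -> float:
    """Rough expected number of colliding pairs among n values,
    assuming hashes are uniform over 2**bits possible outputs."""
    return n * (n - 1) / 2 / 2**bits


# CRC-32 of the string 'hello', as an unsigned integer.
hello_crc = zlib.crc32(b"hello")

# Estimate for a table holding one million distinct strings.
pairs_1m = expected_colliding_pairs(1_000_000, 32)

print(hello_crc, CRC32_MAX, round(pairs_1m, 1))
```

Under these assumptions, a table with a million rows is expected to contain on the order of a hundred colliding pairs, so collisions are a normal case to handle, not a rare accident.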

CRC32 usage scenarios:

It follows from these two characteristics that a MySQL CRC32 result is an integer and can be stored in an integer column (it fits in INT UNSIGNED; BIGINT also works), while an MD5 digest needs a char or varchar column. But since CRC32 collides easily, is it suitable for indexing?

Scenario: we are building a crawler. For each URL, we first check whether it already exists in the database and insert it only if it does not. As everyone knows, a table in this kind of application grows very quickly. Consider a simple query such as

SELECT * FROM urls WHERE url = 'http://wwww.shopperplus.com';

Without an index on url, this scans the whole table every time, which is very inefficient. Adding an index on the url column makes it faster, but because url is a varchar, both the column itself and its index take up a relatively large amount of storage. Instead, add an integer column crc_url that stores CRC32(url), index that column, and query on both columns:

SELECT * FROM urls WHERE crc_url = 907060870 AND url = 'hello';

In this way, most queries still need to examine only one row to get the result. For the few values whose CRC collides, only a few extra rows are examined, and the additional comparison on url still returns the correct row. For URLs, the gain from replacing a varchar index with an integer one is actually not particularly large. It is larger for long text: if we have a text-type column (article content, comments, Weibo posts, etc.) and must determine whether the content already exists before each insert, the crc32 technique leaves much more room for improvement.

3. The shortcoming of crc32 is that it is prone to collisions. Is there a better solution? The answer is yes: crc64.

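 crc64(): this function complements MySQL's crc32() function, whose results are poorly distributed over a large number of values. The crc64() algorithm relies on MD5 as its underlying mechanism. Note that crc64() is not built into MySQL, so it has to be supplied as a stored (user-defined) function.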
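One simple way to build a 64-bit value on top of MD5 is to keep the first 64 bits (16 hex digits) of the digest. The Python sketch below does that; whether the crc64() described above uses exactly this truncation is an assumption, and `crc64_md5` is a hypothetical name. The two estimates reuse the rough birthday-style approximation for one million distinct values.

```python
import hashlib


def crc64_md5(data: bytes) -> int:
    """Hypothetical helper: first 16 hex digits (64 bits) of the MD5
    digest, interpreted as an unsigned integer."""
    return int(hashlib.md5(data).hexdigest()[:16], 16)


value = crc64_md5(b"hello")

# Rough expected number of colliding pairs among one million values,
# assuming uniformly distributed outputs.
n = 1_000_000
pairs_32 = n * (n - 1) / 2 / 2**32  # 32-bit result
pairs_64 = n * (n - 1) / 2 / 2**64  # 64-bit result

print(value, pairs_32, pairs_64)
```

A value of this size can reach 2 to the 64th power minus 1, so in MySQL it needs a BIGINT UNSIGNED column. The lookup pattern stays the same: filter on the indexed integer column and confirm with the original string.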

 

 

 


Origin www.cnblogs.com/Mr-Echo/p/12730797.html