There are now 100,000 words. Please find the ten most repeated words.

Each string can be up to 4 GB long; multiplied by 100,000, that is up to about 400 TB in the worst case.
Idea 1
The data cannot be processed directly in memory, so it can be divided into multiple files:
distinguish by length, placing strings of different lengths in different folders;
distinguish by first letter, placing different first letters in different subfolders;
distinguish by last letter, placing different last letters in different files.
In this way, every string in a given file has the same length, the same first letter, and the same last letter.
Then start from the file with the largest number of elements. If the count of its most repeated string is larger than the number of elements in each remaining file (the element count can be stored in the file name), that string can be returned directly, since no string in a smaller file can be repeated more often.
If not, the most repeated string of each file is recorded in a separate file A, and file A is finally searched to find the most repeated strings.
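The partitioning and early exit described above can be sketched roughly as follows. This is only a minimal illustration: in-memory lists stand in for the files on disk, and the names `partition_key`, `partition`, and `top_repeated` are hypothetical. The sketch also keeps up to `k` candidates per partition rather than one, because keeping a single string per file could miss some of the ten answers.

```python
from collections import Counter, defaultdict
import heapq


def partition_key(word):
    # Same key as in the text: length, first letter, last letter.
    return (len(word), word[0], word[-1])


def partition(words):
    # Assumption: each list stands in for one file on disk.
    parts = defaultdict(list)
    for w in words:
        if w:  # skip empty strings, which have no first/last letter
            parts[partition_key(w)].append(w)
    return parts


def top_repeated(words, k=10):
    parts = partition(words)
    best = []  # at most k (word, count) pairs, highest count first
    # Visit partitions from largest to smallest.
    for bucket in sorted(parts.values(), key=len, reverse=True):
        # No word in this or any later partition can occur more often
        # than the partition's size, so we may stop early.
        if len(best) == k and best[-1][1] >= len(bucket):
            break
        best.extend(Counter(bucket).most_common(k))
        best = heapq.nlargest(k, best, key=lambda wc: wc[1])
    return best
```

For real data each partition would be a file, and the element count used in the early-exit check could be read from the file name, as the text suggests.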
Within each file, the strings are first screened position by position, from left to right.
At each position, the letter with the largest number of occurrences is kept and the words with any other letter are discarded.
This repeats until the end character is reached, and then the key is recorded: the string and its number of occurrences are stored in another file as a key-value pair.
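A minimal sketch of this screening step might look like the following, assuming every string in the file has the same length (which the partitioning guarantees). The name `screen_most_repeated` is hypothetical. Note that the screening is greedy: the most common letter at a position does not always lead to the most repeated string, so an exact count per file is the safe fallback.

```python
from collections import Counter


def screen_most_repeated(words):
    """Greedy left-to-right screening over equal-length strings.

    Returns (string, occurrences) for the surviving string, or None
    when the input is empty.
    """
    candidates = list(words)
    if not candidates:
        return None
    length = len(candidates[0])
    for pos in range(length):
        # Count the letters at this position among surviving words.
        counts = Counter(w[pos] for w in candidates)
        best_char, _ = counts.most_common(1)[0]
        # Discard every word with a different letter here.
        candidates = [w for w in candidates if w[pos] == best_char]
    # All survivors are identical, so their number is the count.
    return candidates[0], len(candidates)
```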
Finally, find the top 10 strings with the most occurrences.
Idea 2
Use the bucket idea. For 100,000 strings, it may require 100,000 buckets in the worst case, so the buckets need to be spread across different directories; otherwise a single directory holding 100,000 buckets could make the computer hang. For example, each directory has 26 subdirectories a-z, and directories are created from left to right following the characters of each string, which gives fewer than 100,000 folders. At about 3 bytes each, the amount of data is less than 300,000 bytes. Each file is named in the form string|count, and each time the string is encountered during traversal its count is increased by 1. Finally, a list of 10 nodes is built that stores up to 10 strings and their numbers of occurrences.
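The directory layout can be sketched as below, under a few assumptions that are not in the original: words are non-empty and lowercase a-z, the count is kept inside a file named `count` rather than in the file name (the `|` character is not allowed in file names on some systems), and a temporary directory stands in for the real root. The helper names are hypothetical.

```python
import os
import tempfile


def bucket_dir(root, word):
    # One directory level per character: "cat" -> root/c/a/t
    return os.path.join(root, *word)


def add_word(root, word):
    # Create the bucket if needed and increase its stored count by 1.
    d = bucket_dir(root, word)
    os.makedirs(d, exist_ok=True)
    path = os.path.join(d, "count")
    count = 0
    if os.path.exists(path):
        with open(path) as f:
            count = int(f.read())
    with open(path, "w") as f:
        f.write(str(count + 1))
    return count + 1


def read_counts(root):
    # Walk the tree; every directory holding a "count" file is a word.
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        if "count" in files:
            word = "".join(os.path.relpath(dirpath, root).split(os.sep))
            with open(os.path.join(dirpath, "count")) as f:
                result[word] = int(f.read())
    return result


def count_words(words):
    # Assumption: a temporary directory stands in for the real root.
    with tempfile.TemporaryDirectory() as root:
        for w in words:
            add_word(root, w)
        return read_counts(root)
```

Because every directory name is a single character, the `count` file can never collide with a subdirectory, and no directory ever has more than 26 subdirectories.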
Use ordered singly linked list filtering (see my previous post on filtering 100 million numbers with 100 nodes, https://editor.csdn.net/md/?articleId=113199734) to filter out the 10 most repeated strings.
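A rough sketch of such a filter is shown below. It is not the code from the linked post; the class names `Node` and `TopK` are hypothetical. The list is kept in ascending order of count, so the head is always the weakest entry and can be dropped cheaply when a better one arrives.

```python
class Node:
    def __init__(self, word, count, next=None):
        self.word = word
        self.count = count
        self.next = next


class TopK:
    """Ordered singly linked list that keeps at most k entries."""

    def __init__(self, k=10):
        self.k = k
        self.head = None  # entry with the smallest count
        self.size = 0

    def offer(self, word, count):
        if self.size == self.k:
            if count <= self.head.count:
                return  # not better than the weakest kept entry
            self.head = self.head.next  # drop the weakest entry
            self.size -= 1
        # Find the insertion point that keeps counts ascending.
        prev, cur = None, self.head
        while cur is not None and cur.count < count:
            prev, cur = cur, cur.next
        node = Node(word, count, cur)
        if prev is None:
            self.head = node
        else:
            prev.next = node
        self.size += 1

    def items(self):
        # Return (word, count) pairs, highest count first.
        out = []
        cur = self.head
        while cur is not None:
            out.append((cur.word, cur.count))
            cur = cur.next
        return out[::-1]
```

Each string and its count is offered once; with k fixed at 10, every offer touches at most 10 nodes, so the whole pass is linear in the number of distinct strings.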

Origin blog.csdn.net/weixin_43158695/article/details/113663365