[Go version] Algorithm challenge, level 15 (silver): popular algorithm problems on massive data


With massive data, ordinary structures such as arrays, linked lists, hash tables, and trees break down because they cannot fit in memory, and the conventional techniques of recursion, sorting, backtracking, greedy algorithms, and dynamic programming break down because they would time out. We must find another way. How do we handle such problems? Here are three very typical ideas:

  1. Bit storage. Its biggest advantage is that each value takes a single bit instead of a 32-bit integer, i.e. 1/32 of the space. For example, an array of 4 billion integers needs about 15GB when stored as plain integers, but only about 0.5GB when stored as a bitmap, which by itself solves many problems.

  2. If the file is too large to fit in memory, consider splitting the large file into several small blocks, processing each block on its own, and then assembling the partial results into the final answer. This is the idea behind external sorting. It requires traversing the entire input at least twice and is a typical case of trading time for space.

  3. If you are looking for the Kth largest, Kth smallest, K largest, or K smallest elements in very large data, a heap is especially suitable; and when the data arrives as a stream rather than a file, it is practically the only option. The mantra: to find the largest elements use a min-heap, to find the smallest use a max-heap (see the sketch after this list).
    Common-sense reference: 1 billion ≈ 1G, 1 million ≈ 1M
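
To make idea 3 concrete, here is a minimal Go sketch of the "min-heap for Top-K largest" pattern (the names minHeap and topK are illustrative, not from the original article):

package main

import (
	"container/heap"
	"fmt"
)

// minHeap implements heap.Interface over ints; the smallest element is on top.
type minHeap []int

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(int)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// topK streams values and keeps only the k largest seen so far: the heap top
// is the smallest current candidate, so any larger new value evicts it.
func topK(stream []int, k int) []int {
	h := &minHeap{}
	for _, v := range stream {
		if h.Len() < k {
			heap.Push(h, v)
		} else if v > (*h)[0] {
			(*h)[0] = v
			heap.Fix(h, 0)
		}
	}
	return *h
}

func main() {
	fmt.Println(topK([]int{5, 1, 9, 3, 7, 8, 2}, 3)) // [7 8 9] (heap order)
}

Because only k elements are ever held in memory, this works just as well when the input arrives as a stream instead of a slice.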

Title: Find an integer that does not appear among 4 billion integers

Requirement: Assume you have only 1GB of RAM for this task.

Memory analysis for storing the 4 billion numbers:

//Direct integer storage needs ~15GB, which exceeds the limit
4000000000*4/1024/1024/1024 ≈ 15GB

//Bitmap storage needs ~0.5GB, which fits
4000000000/8/1024/1024/1024 ≈ 0.5GB

So this is exactly the bit-storage scheme from the previous section. Without further ado, here is the code.

Go code

Source code address: GitHub-golang version (including unit test code)

func FindNoExistNumBy1G(arr []int) int {
	N := 4000000000
	// One bit per value in 1..N. With []uint32 the bitmap takes N/8 bytes ≈ 0.5GB
	// (a []int would take twice that on 64-bit platforms).
	bitmap := make([]uint32, N/32+1)
	for _, num := range arr {
		num0 := num - 1 // values are assumed to start at 1
		index := num0 / 32
		offset := num0 % 32
		bitmap[index] |= 1 << offset
	}
	for index, v := range bitmap {
		for i := 0; i < 32; i++ {
			if v&(1<<i) == 0 {
				// This bit is still 0, so the corresponding value never appeared.
				return index*32 + i + 1
			}
		}
	}
	return -1
}

Advanced: What to do if only 10MB of memory is available?

1. With only 10MB of memory, how many values can one bitmap cover?

10*1024*1024*8 = 83886080 (bits)

2. For 4 billion numbers, with 10MB of memory per block, how many blocks are needed?

Processing the 4 billion numbers in blocks, with each block covered by a 10MB bitmap, requires only 4000000000/83886080 ≈ 48 blocks.

Generally speaking, we split along powers of 2, so it is more reasonable to round up to 64 blocks here.

2^2=4
2^3=8
2^4=16
2^5=32
2^6=64
2^26=67108864
2^32=4294967296

Supplement: why the number of blocks should be a power of 2

  1. Fault tolerance: with a power-of-2 block count you can cope more gracefully with growth and change in data volume. If larger data sets must be processed later, you don't have to redesign the block size; just adjust the number of blocks.
  2. Adaptability: a power-of-2 block count maps well onto the computer's memory and storage hardware. Power-of-2 sizes are usually easier for the machine to handle because they match the units in which memory, caches, and disks operate.
  3. Performance balance: in external sorting there is a trade-off between block size and block count. More blocks means less data per block, but too many blocks increases the overhead of sorting and merging. A moderate block count strikes a balance between time and space efficiency.

3. Once the number of blocks is fixed, what comes next?

Continuing from the above, we divide the 4 billion numbers into 64 blocks, with one counter per block: countArr := make([]int, 64).

1. Find a block that must contain a missing integer

Traverse the 4 billion numbers, determine which block each number belongs to, and count how many numbers fall into each block. Then look for a block whose count is below the average; a missing integer must lie inside it. (There must be at least one such block, because 4 billion numbers cannot fill all 64 blocks to capacity.)

What is the average per block?

  • Method 1: divide the total data count, 4 billion, by the number of blocks: 4000000000 / 64 = 62500000
  • Method 2: divide the full range 2^32 = 4294967296 (which exceeds 4 billion) by the number of blocks: 4294967296 / 64 = 67108864

Using the second method: 4294967296 / 64 = 67108864 is each block's capacity, giving the following value range per block:

Block 0:  67108864 * 0  ~ (67108864 * (0 + 1) - 1)   <==> 0          ~ 67108863
Block 1:  67108864 * 1  ~ (67108864 * (1 + 1) - 1)   <==> 67108864   ~ 134217727
Block x:  67108864 * x  ~ (67108864 * (x + 1) - 1)
Block 63: 67108864 * 63 ~ (67108864 * (63 + 1) - 1)  <==> 4227858432 ~ 4294967295

How to find the block for a given number: suppose the current number is 3422552090.

Index of its block: 3422552090 / 67108864 = 51

Then increment that block's counter: countArr[51]++.
Finally, traverse countArr and find a block whose count < 67108864; a missing integer lies in that block.

2. Within that block, find a value that does not occur

Suppose the count of block 37 turns out to be < 67108864. Then make a second pass over the 4 billion numbers, skipping every number that does not belong to block 37.
For the numbers in block 37, reuse the bit-storage scheme from the previous section: project each number onto its position in the bitmap and set that bit to 1. Finally, traverse the bitmap; a bit that is still 0 marks a missing value, which can be computed from the bit offset and the bitmap index:

Example: missing value = bit offset + bitmap index * 32 + 37 * 67108864

Go code

Source code address: GitHub-golang version (including unit test code)

func FindTargetIndex(arr []int) int {
	N := 67108864               // capacity of each block: 2^32 / 64
	countArr := make([]int, 64) // 64 counters, well under 1KB of memory
	// Traverse the 4 billion numbers and count how many fall into each block.
	for _, num := range arr {
		countArr[num/N]++
	}
	for i, count := range countArr {
		if count < N {
			// This block is not full, so it must be missing at least one value.
			return i
		}
	}
	return -1
}
func FindNoExistNumBy10M(arr []int) int {
	N := 67108864
	targetIndex := FindTargetIndex(arr)
	if targetIndex < 0 {
		return -1
	}
	// One bit per value in the target block: N/8 bytes = 8MB with []uint32.
	bitmap := make([]uint32, N/32)
	for _, num := range arr {
		// Only numbers belonging to the target block are projected onto the bitmap.
		if num/N == targetIndex {
			val := num - targetIndex*N // offset within the block: 0..N-1
			index := val / 32
			offset := val % 32
			bitmap[index] |= 1 << offset
		}
	}
	for index, v := range bitmap {
		for i := 0; i < 32; i++ {
			// A bit that is still 0 marks the missing value.
			if v&(1<<i) == 0 {
				return i + index*32 + targetIndex*N
			}
		}
	}
	return -1
}

Title: Use 2GB of memory to find the most frequent number among 2 billion integers

Requirement: A large file contains 2 billion 32-bit integers. Find the number that occurs most often, under a memory limit of 2GB.

Hash Table Statistics

To find the most frequent number, the usual approach is a hash table whose key is the integer itself and whose value is that integer's occurrence count.

Determine the data types of the storage structure (watch for value overflow)

The maximum value of a 32-bit int is 2^31 - 1 = 2147483647.
2147483647 > 2 billion, so a 32-bit int is enough for both key and value: even in the extreme cases where all 2 billion integers are identical (one key with a count of 2 billion) or all are distinct (2 billion keys), nothing overflows.

Therefore the shape of the hash map is map[int]int.
Since each integer is 32 bits, a key takes 4 bytes and a value takes 4 bytes, so each key-value pair needs 8 bytes.

Split the file with a hash function (how many entries fit in 2GB of memory?)

2GB = 2*1024*1024*1024 = 2147483648 (bytes)
2147483648 / 8 = 268435456
200 million < 268435456 < 300 million

So split the 2 billion numbers into blocks of 200 million each, 10 blocks in total, so that each block's hash table stays within the 2GB memory limit, as the problem requires.

But generally speaking, we split along powers of 2, so it is more reasonable to use 16 blocks here.

Make sure identical numbers land in the same small file

The role of the hash function is to map the input data into a small range, such as 0 to 15. Identical integers always map to the same value, so the large file is split into 16 small files with all copies of any given number in the same file.
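
As a minimal sketch of this step (the function name shardOf and the use of FNV-1a are illustrative assumptions, not from the original article), each number can be hashed into one of 16 buckets, each bucket corresponding to one small file:

package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// shardOf maps an integer to one of `shards` buckets. A hash (rather than a
// plain num % shards) spreads skewed inputs more evenly, and identical
// numbers always land in the same bucket, hence the same small file.
func shardOf(num uint32, shards uint32) uint32 {
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], num)
	h := fnv.New32a()
	h.Write(buf[:])
	return h.Sum32() % shards
}

func main() {
	for _, n := range []uint32{42, 42, 7, 4294967295} {
		fmt.Printf("%d -> file_%d\n", n, shardOf(n, 16))
	}
}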

Count within each small file

For each small file, use a hash table to count the occurrences of each number. Each small file thus yields its own key-value pairs, where the key is an integer and the value is that integer's occurrence count.

Find the winner of each small file

For each small file, find the number with the most occurrences and record its count. This yields the local winner and its count for each of the 16 small files.

Select the global winner from the 16 small files

Finally, among the 16 local winners, pick the one with the highest count: that is the most frequent number overall.
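
A compact sketch of the per-file steps, under the assumption that one small file's numbers fit in memory once split (countTopOfFile is an illustrative name):

// countTopOfFile builds the per-file hash table and returns the number with
// the highest count in that file, together with the count itself.
func countTopOfFile(nums []int32) (best int32, bestCount int) {
	counts := make(map[int32]int, len(nums))
	for _, n := range nums {
		counts[n]++
	}
	for n, c := range counts {
		if c > bestCount {
			best, bestCount = n, c
		}
	}
	return best, bestCount
}

Running countTopOfFile over each of the 16 small files and keeping the (number, count) pair with the largest count gives the global winner.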

Title: The problem of finding duplicates among 10 billion URLs

Requirement: A large file contains 10 billion URLs, each occupying 64B. Find all duplicate URLs in it.

Answer: The solution uses the standard recipe for big-data problems: distribute the large file across machines with a hash function, or split it into small files with a hash function, and keep subdividing until each piece satisfies the resource constraints. First, ask the interviewer what the resource limits are, including memory, computing time, and so on. Once the limits are clear, the hash function can route each URL to one of several machines or one of several small files, where "several" is calculated from the specific limits.

For example, a 10-billion-URL file can be distributed across 100 machines via a hash function; each machine then checks for duplicates among the URLs assigned to it, and the nature of the hash function guarantees that copies of the same URL can never land on different machines. Alternatively, on a single machine, split the large file into 1000 small files via a hash function, then scan each small file with a hash table to find the duplicates. After distributing across machines or splitting into files, you can also sort each piece and look for duplicates among adjacent entries. In short, keep in mind that most big-data problems come down to partitioning: use a hash function to distribute a large file's contents across machines, or to split it into small files, and then process each small piece on its own.
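
A minimal sketch of the single-machine variant, assuming each shard's URLs fit in memory (shardOfURL and duplicatesInShard are illustrative names):

package main

import (
	"fmt"
	"hash/fnv"
)

// shardOfURL routes a URL to one of n small files; identical URLs always hash
// to the same shard, so a duplicate can never straddle two files.
func shardOfURL(url string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(url))
	return h.Sum32() % n
}

// duplicatesInShard scans one shard with a hash table and reports every URL
// that occurs at least twice, each reported once.
func duplicatesInShard(urls []string) []string {
	count := make(map[string]int, len(urls))
	var dups []string
	for _, u := range urls {
		count[u]++
		if count[u] == 2 {
			dups = append(dups, u)
		}
	}
	return dups
}

func main() {
	urls := []string{"a.com/x", "b.com/y", "a.com/x"}
	fmt.Println(shardOfURL("a.com/x", 1000), duplicatesInShard(urls))
}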

Supplementary question: A search company logs a huge number of user search terms in a day (tens of billions of entries). Design a feasible way to find the day's Top 100 hot terms.

Start with hash partitioning: distribute the file of tens of billions of terms across multiple machines, the exact number depending on the interviewer's constraints. If the share assigned to one machine is still too large, for example it does not fit in memory, apply the hash function again to split that machine's share into smaller files. For each small file, use a hash table to count each term's frequency. Then traverse the hash table while maintaining a min-heap of size 100 to select that small file's Top 100 (unsorted); sorting the heap's contents by frequency gives the file's sorted Top 100. Next, merge the per-file Top 100s (by sorting, or with another min-heap) to get each machine's Top 100, and merge the per-machine Top 100s the same way to get the global Top 100 over the entire data set. For Top-K problems in general, beyond hash partitioning for the data and hash tables for frequency counting, heaps and external sorting are the usual tools.
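
The per-file selection step could look like the following sketch, reusing the min-heap pattern from the first section but keyed on frequency (wordCount and topWords are illustrative names):

package main

import (
	"container/heap"
	"fmt"
)

// wordCount pairs a term with its frequency.
type wordCount struct {
	word  string
	count int
}

// wcHeap is a min-heap ordered by count: the least frequent of the current
// Top-K candidates sits on top, ready to be evicted.
type wcHeap []wordCount

func (h wcHeap) Len() int            { return len(h) }
func (h wcHeap) Less(i, j int) bool  { return h[i].count < h[j].count }
func (h wcHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *wcHeap) Push(x interface{}) { *h = append(*h, x.(wordCount)) }
func (h *wcHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// topWords selects the k most frequent terms from one small file's frequency
// table; the result comes back in heap order, not sorted.
func topWords(freq map[string]int, k int) []wordCount {
	h := &wcHeap{}
	for w, c := range freq {
		if h.Len() < k {
			heap.Push(h, wordCount{w, c})
		} else if c > (*h)[0].count {
			(*h)[0] = wordCount{w, c}
			heap.Fix(h, 0)
		}
	}
	return *h
}

func main() {
	freq := map[string]int{"go": 9, "heap": 3, "hash": 7, "sort": 5}
	fmt.Println(topWords(freq, 2)) // the two most frequent terms: [{hash 7} {go 9}]
}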

To summarize, the techniques for processing large-scale data: hash partitioning, hash tables, heaps and (external) sorting, and bitmaps.

Title: Find the numbers that appear twice among 4 billion non-negative integers

Requirement: 32-bit unsigned integers range from 0 to 4294967295. Given 4 billion unsigned integers and at most 1GB of memory, find all numbers that appear twice.

This can be seen as an advanced variant of the first problem, with the occurrence count pinned to two.

  1. First we need a bit array bitArr with two bits per possible value, i.e. 2^32 × 2 bits in total, since every number needs two bits to record its occurrence state. At 8 bits per byte, the array takes 2^32 × 2 / 8 bytes = 1GB of memory, just within the limit.
  2. Iterate over the 4 billion unsigned numbers. For each number num, inspect its two bits, bitArr[num * 2] and bitArr[num * 2 + 1]:
    • If the two bits are 00, this is the first encounter with the number: set them to 01 (01 means it has appeared once).
    • If the two bits are 01, this is the second encounter: set them to 10 (10 means it has appeared twice).
    • If the two bits are already 10, this is the third or later encounter: leave them as 10.
  3. After the traversal, iterate over bitArr again. For each value i whose two bits bitArr[i * 2] and bitArr[i * 2 + 1] read 10, the number i appeared (at least) twice, so add i to the result set.

Go code

func findDuplicates(arr []uint32) []uint32 {
	// Two bits per possible value in 0..2^32-1: the first bit means "seen once",
	// the second means "seen twice". Total: 2^33 bits = 1GB.
	bitArrLen := uint64(4294967296) * 2
	bitArr := make([]byte, bitArrLen/8)

	var duplicates []uint32
	for _, num := range arr {
		// Compute the bit index in uint64 so num*2 cannot overflow 32 bits.
		index := uint64(num) * 2
		if bitArr[index/8]&(1<<(index%8)) == 0 {
			// First occurrence: set the "seen once" bit.
			bitArr[index/8] |= 1 << (index % 8)
		} else if bitArr[(index+1)/8]&(1<<((index+1)%8)) == 0 {
			// Second occurrence: set the "seen twice" bit and record the number
			// right away, instead of a separate second scan over the bit array.
			bitArr[(index+1)/8] |= 1 << ((index + 1) % 8)
			duplicates = append(duplicates, num)
		}
		// Third and later occurrences: both bits are already set; do nothing.
	}

	return duplicates
}


Source: blog.csdn.net/trinityleo5/article/details/132528924