The Top K Problem by Occurrence Count

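【Problem】

Given an array of strings strArr and an integer k, print the k most frequently occurring strings, strictly in rank order.

【Requirements】

If the length of strArr is N, the time complexity should be O(N log k).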

【Test Cases】

strArr = ["1", "2", "3", "4"], k = 2

Output:

No.1: 1, times: 1

No.2: 2, times: 1

In this case every string occurs equally often, so printing any two of them is acceptable.

strArr = ["1", "1", "2", "3"], k = 2

Output:

No.1: 1, times: 2

No.2: 2, times: 1

Or, equally valid:

No.1: 1, times: 2

No.2: 3, times: 1

【Difficulty】
☆☆

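【Solution】

First traverse strArr and count how often each string occurs. For example, with strArr = ["a", "b", "b", "a", "c"], the traversal produces the following hash table, which maps each distinct string to its frequency: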

key (string)	value (frequency)
"a"	2
"b"	2
"c"	1
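
As a minimal sketch of this counting step, the table above could be produced as follows. The class name WordFrequency and the method name count are illustrative assumptions, not names from the original text.

```java
import java.util.HashMap;
import java.util.Map;

final class WordFrequency {
    // Returns a map from each distinct string in arr to its number of occurrences.
    static Map<String, Integer> count(String[] arr) {
        Map<String, Integer> freq = new HashMap<>();
        for (String s : arr) {
            // merge stores 1 for an unseen key, otherwise adds 1 to the stored count.
            freq.merge(s, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] strArr = {"a", "b", "b", "a", "c"};
        // HashMap does not guarantee iteration order, so the printed order may vary.
        System.out.println(count(strArr));
    }
}
```

The single pass over strArr is what gives the O(N) cost for building the hash table.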

Each record in the hash table can be turned into an instance of the Node class, defined as follows:

public class Node {
    public String str;   // the string itself
    public int times;    // how many times it occurs

    public Node(String s, int t) {
        str = s;
        times = t;
    }
}

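One Node instance is created for each record in the hash table, and the instances are offered to a heap one by one. The procedure is:

1. Create a min-heap of capacity k whose elements are Node instances.

2. Traverse every record in the hash table. Suppose a record is (s, t), where s is a string and t is its frequency. Create a Node instance from it, written as (str, times).

(1) If the min-heap is not full, add (str, times) to the heap and restore the heap order by sifting the new node up (the heapInsert adjustment). Nodes in the heap are compared by frequency (times): the lower the frequency, the closer to the top.

(2) If the min-heap is full, it currently holds the k most frequent strings seen so far, so the top of the heap is the least frequent of those k. Call the top element (headStr, minTimes). If minTimes < times, str qualifies for the current top k and headStr must leave it, so replace the top (headStr, minTimes) with (str, times) and sift down from the top (heapify). If minTimes >= times, str does not qualify, because its frequency is no higher than the lowest frequency among the current top k, so do nothing.

3. After the whole hash table has been traversed, the min-heap holds the k most frequent strings. Because the output must be strictly in rank order, the k elements still need to be sorted by frequency in descending order.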
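
The steps above can also be sketched with the JDK's java.util.PriorityQueue in place of a hand-written heap. This is only an illustration under that assumption, not the implementation this write-up goes on to give, and the names TopKSketch and topK are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopKSketch {
    // Returns at most k (string, frequency) entries, most frequent first.
    static List<Map.Entry<String, Integer>> topK(String[] arr, int k) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        if (arr == null || k < 1) {
            return result;
        }
        // Count the frequency of every string.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : arr) {
            freq.merge(s, 1, Integer::sum);
        }
        // Step 1: a min-heap ordered by frequency, kept at size k or less.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
                (a, b) -> Integer.compare(a.getValue(), b.getValue()));
        // Step 2: offer each record to the heap.
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (heap.size() < k) {
                heap.add(e);                       // case (1): heap not full
            } else if (heap.peek().getValue() < e.getValue()) {
                heap.poll();                       // case (2): evict the current minimum
                heap.add(e);
            }
        }
        // Step 3: polling yields ascending frequency, so reverse for rank order.
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        int rank = 1;
        for (Map.Entry<String, Integer> e : topK(new String[] {"1", "1", "2", "3"}, 2)) {
            System.out.println("No." + rank++ + ": " + e.getKey() + ", times: " + e.getValue());
        }
    }
}
```

Each add or poll on a heap of at most k entries costs O(log k), which matches the per-record cost assumed in the analysis.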

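Traversing strArr to build the hash table takes O(N). The hash table holds at most N records, and each record costs O(log k) of heap adjustment when it is offered to the heap, so updating the min-heap takes O(N log k) overall. Sorting the k records takes O(k log k), which is no more than O(N log k) because the heap never holds more than min(k, N) nodes. The total time complexity is therefore O(N) + O(N log k) + O(k log k), that is, O(N log k). The details are in the printTopKAndRank method below.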

import java.util.HashMap;
import java.util.Map.Entry;

public void printTopKAndRank(String[] arr, int topK) {
	if (arr == null || topK < 1) {
		return;
	}
	HashMap<String, Integer> map = new HashMap<String, Integer>();
	// Build the hash table (string -> frequency)
	for (int i = 0; i != arr.length; i++) {
		String cur = arr[i];
		if (!map.containsKey(cur)) {
			map.put(cur, 1);
		} else {
			map.put(cur, map.get(cur) + 1);
		}
	}

	Node[] heap = new Node[topK];
	int index = 0; // number of nodes currently in the heap
	// Traverse the hash table and decide whether each record enters the heap
	for (Entry<String, Integer> entry : map.entrySet()) {
		String str = entry.getKey();
		int times = entry.getValue();
		Node node = new Node(str, times);
		if (index != topK) {
			// Heap not full: append and sift up
			heap[index] = node;
			heapInsert(heap, index++);
		} else {
			// Heap full: replace the top only if the new node is more frequent
			if (heap[0].times < node.times) {
				heap[0] = node;
				heapify(heap, 0, topK);
			}
		}
	}
	// Sort the min-heap's nodes by frequency in descending order.
	// i > 0 (rather than i != 0) also terminates when the heap is empty.
	for (int i = index - 1; i > 0; i--) {
		swap(heap, 0, i);
		heapify(heap, 0, i);
	}
	// Print the records strictly in rank order (fewer than topK if
	// there are fewer than topK distinct strings)
	for (int i = 0; i < index; i++) {
		System.out.println("No." + (i + 1) + ": " + heap[i].str
				+ ", times: " + heap[i].times);
	}
}

// Sift the node at index up until its parent is not more frequent than it
public void heapInsert(Node[] heap, int index) {
	while (index != 0) {
		int parent = (index - 1) / 2;
		if (heap[index].times < heap[parent].times) {
			swap(heap, parent, index);
			index = parent;
		} else {
			break;
		}
	}
}

// Sift the node at index down within heap[0..heapSize)
public void heapify(Node[] heap, int index, int heapSize) {
	int left = index * 2 + 1;
	int right = index * 2 + 2;
	int smallest = index;
	while (left < heapSize) {
		if (heap[left].times < heap[index].times) {
			smallest = left;
		}
		if (right < heapSize && heap[right].times < heap[smallest].times) {
			smallest = right;
		}
		if (smallest != index) {
			swap(heap, smallest, index);
		} else {
			break;
		}
		index = smallest;
		left = index * 2 + 1;
		right = index * 2 + 2;
	}
}

public void swap(Node[] heap, int index1, int index2) {
	Node tmp = heap[index1];
	heap[index1] = heap[index2];
	heap[index2] = tmp;
}
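
It may not be obvious why the sorting loop in printTopKAndRank produces descending order. The sketch below isolates that step on a plain int array: repeatedly swapping the root of a min-heap (the current minimum) to the end of the shrinking heap leaves the largest values at the front. The names MinHeapSortDemo, sortDescending and siftDown are illustrative assumptions, and siftDown plays the role of heapify above.

```java
import java.util.Arrays;

final class MinHeapSortDemo {
    // Sorts a in descending order using a min-heap laid out in the array itself.
    static void sortDescending(int[] a) {
        // Build a min-heap bottom-up.
        for (int i = a.length / 2 - 1; i >= 0; i--) {
            siftDown(a, i, a.length);
        }
        // Move the current minimum to the end, then shrink the heap by one.
        for (int end = a.length - 1; end > 0; end--) {
            swap(a, 0, end);
            siftDown(a, 0, end);
        }
    }

    // Restores the min-heap property for the subtree rooted at index within a[0..heapSize).
    static void siftDown(int[] a, int index, int heapSize) {
        while (true) {
            int left = 2 * index + 1;
            int right = left + 1;
            int smallest = index;
            if (left < heapSize && a[left] < a[smallest]) {
                smallest = left;
            }
            if (right < heapSize && a[right] < a[smallest]) {
                smallest = right;
            }
            if (smallest == index) {
                return;
            }
            swap(a, index, smallest);
            index = smallest;
        }
    }

    static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    public static void main(String[] args) {
        int[] times = {3, 1, 2, 5, 4};
        sortDescending(times);
        System.out.println(Arrays.toString(times)); // prints [5, 4, 3, 2, 1]
    }
}
```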
