Data structures and algorithms (Huffman tree, Huffman coding, compression software)

One: Thinking

        1. Telegraph sending: Telegraphs were in common use during World War II. If you were asked to design an encoding for sending telegrams, how would you design it?

        2. Compression algorithm: Given a file of 10,000 characters (each character is 1 byte, i.e. 8 bits), how can you store it in as little space as possible?

        One idea almost everyone comes up with is substitution (mapping): replace each character with a shorter code. For the compression problem we can use short binary codes. Suppose the file contains only four distinct characters, a, b, c and d, and we assign a=000, b=001, c=010, d=100. Each character then takes 3 bits, so 10,000 characters take 30,000 bits. Compared with the original 80,000 bits, that saves a lot of space.

        The file shrinks to 3/8 of its original size, which is about 2.7 times smaller.
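        The arithmetic above can be checked with a short sketch. The class and method names (`FixedLengthSize`, `bitsPerSymbol`) are made up for illustration, and the sketch assumes every character gets the same number of bits.

```java
// Illustrative sketch: size of a text under a fixed-length code.
class FixedLengthSize {

    // Smallest number of bits that can tell `symbols` different characters apart.
    static int bitsPerSymbol(int symbols) {
        int bits = 1;
        while ((1 << bits) < symbols) {
            bits++;
        }
        return bits;
    }

    public static void main(String[] args) {
        int chars = 10_000;
        System.out.println("8 bits per character: " + chars * 8 + " bits"); // 80000 bits
        System.out.println("3 bits per character: " + chars * 3 + " bits"); // 30000 bits
        // Four distinct characters could be told apart with only 2 bits each.
        System.out.println("minimum fixed length for 4 symbols: " + bitsPerSymbol(4));
    }
}
```

        Note that a fixed-length code for four characters could go down to 2 bits each; the 3-bit table in the text is just one possible mapping.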

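        With this fixed-length code, 100000001 decodes to dab, and a string of n characters such as abcdaaaaaaaa always takes n*3 bits.

        Now give the most frequent character a shorter code:

        a:0

        b:101

        c:110

        d:100

        abcdaaaaaaaa: 010111010000000000 => abcdaaaaaaaa. The 12 characters take only 18 bits instead of 36, because the frequent a costs a single bit and no code is a prefix of another. But what is the problem with doing this? Is there a better way?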

        Question: Can decoding fail? How do we know when the bits read so far form a complete character?

        Solution: Huffman coding, which is a prefix code (no code word is a prefix of another).
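        To make the prefix property concrete, here is a minimal sketch that checks whether a code table is prefix-free and decodes a bit string with it. The class `PrefixCodeDemo` and its methods are hypothetical helpers, not part of the implementation later in this article; the table is the a/b/c/d example above.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: checking the prefix property and decoding with a code table.
class PrefixCodeDemo {

    // The variable-length table from the text: a=0, b=101, c=110, d=100.
    static Map<Character, String> sampleCodes() {
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('a', "0");
        codes.put('b', "101");
        codes.put('c', "110");
        codes.put('d', "100");
        return codes;
    }

    // True if no character's code is a prefix of (or equal to) another character's code.
    static boolean isPrefixFree(Map<Character, String> codes) {
        for (Map.Entry<Character, String> x : codes.entrySet()) {
            for (Map.Entry<Character, String> y : codes.entrySet()) {
                if (!x.getKey().equals(y.getKey()) && y.getValue().startsWith(x.getValue())) {
                    return false;
                }
            }
        }
        return true;
    }

    // Reads bits left to right and emits a character as soon as the buffer equals a code word.
    static String decode(Map<Character, String> codes, String bits) {
        Map<String, Character> reverse = new HashMap<>();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            reverse.put(e.getValue(), e.getKey());
        }
        StringBuilder out = new StringBuilder();
        StringBuilder buffer = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            buffer.append(bit);
            Character ch = reverse.get(buffer.toString());
            if (ch != null) {
                out.append(ch);
                buffer.setLength(0);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(isPrefixFree(sampleCodes()));         // true
        System.out.println(decode(sampleCodes(), "0101110100")); // abcd
    }
}
```

        With a table such as a=0, b=01 the check returns false: after reading 0 the decoder cannot tell whether it has a complete a or the start of b.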

Two: The optimal binary tree (the binary tree whose sum of weighted path lengths is the smallest)

        Full binary tree: every node other than the leaves has two children, and all leaves are on the same level. The levels hold 1, 2, 4, 8, ... nodes, so level n (counting the root as level 0) has 2^n nodes.

        Complete binary tree: every level except the bottom one is completely filled, and the nodes on the bottom level are packed to the left with no gaps.

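        2.1 Calculate the weighted path length (WPL) of three binary trees, (a), (b) and (c), built on the same four leaves. Each term below is a leaf's weight multiplied by its depth (the number of edges from the root).

        The weight of each leaf is: a:7 b:5 c:2 d:4

        WPL(a): 7*2+5*2+2*2+4*2=36

        WPL(b): 7*3+5*3+2*1+4*2=46

        WPL(c): 7*1+5*2+2*3+4*3=35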
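        The three sums can be reproduced with a few lines of code. In this sketch the helper `wpl` and the class name `WplDemo` are made up for illustration, and the depth arrays are simply read off the multipliers in the formulas above.

```java
// Illustrative sketch: weighted path length as the sum of weight * depth over the leaves.
class WplDemo {

    static int wpl(int[] weights, int[] depths) {
        int total = 0;
        for (int i = 0; i < weights.length; i++) {
            total += weights[i] * depths[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] weights = {7, 5, 2, 4}; // a, b, c, d
        System.out.println(wpl(weights, new int[] {2, 2, 2, 2})); // tree (a): 36
        System.out.println(wpl(weights, new int[] {3, 3, 1, 2})); // tree (b): 46
        System.out.println(wpl(weights, new int[] {1, 2, 3, 3})); // tree (c): 35
    }
}
```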

        Given N weights as N leaf nodes, construct a binary tree. If its weighted path length is the minimum possible, the tree is called the optimal binary tree, also known as a Huffman tree. In a Huffman tree, leaves with larger weights are closer to the root. So what does the Huffman tree have to do with compression?

        Binary tree: every node forks two ways, left and right, and a two-way choice should make you think of binary.

        Label the edge to each left child 0 and the edge to each right child 1. The code of a leaf is then the sequence of labels on the path from the root to that leaf.
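        As a small sketch of this labeling, the code below walks a hand-built tree and collects the 0/1 labels on the way down. The `Node` class and the tree shape are assumptions for illustration: the shape is one tree with the depths of WPL(c) above (a at depth 1, b at depth 2, c and d at depth 3), with the left/right placement chosen arbitrarily.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: reading codes off a tree, left edge = 0, right edge = 1.
class EdgeLabelDemo {

    static final class Node {
        final Character symbol; // null for internal nodes
        final Node left;
        final Node right;

        Node(char symbol) {
            this.symbol = symbol;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.symbol = null;
            this.left = left;
            this.right = right;
        }
    }

    // Depth-first walk: append 0 when going left and 1 when going right.
    static void assign(Node node, String path, Map<Character, String> out) {
        if (node.symbol != null) {
            out.put(node.symbol, path);
            return;
        }
        assign(node.left, path + "0", out);
        assign(node.right, path + "1", out);
    }

    static Map<Character, String> codesForSampleTree() {
        Node tree = new Node(
                new Node('a'),
                new Node(new Node('b'), new Node(new Node('c'), new Node('d'))));
        Map<Character, String> codes = new TreeMap<>();
        assign(tree, "", codes);
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(codesForSampleTree()); // {a=0, b=10, c=110, d=111}
    }
}
```

        Because every character sits at a leaf, the path to one character can never continue on to another, which is exactly why the resulting codes are prefix-free.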

Three: How to implement it? (Greedy algorithm: always take the locally best choice, which here comes down to sorting)

        Core idea: a greedy algorithm uses locally optimal choices to reach the global optimum. The most frequent characters get short codes, and less frequent characters get longer codes. Moreover, the code of any character is never a prefix of another character's code. When decompressing, we read bits until they match a code word; because no code is a prefix of another, that match is the only possible one, so there is no ambiguity during decompression.

        Specific implementation ideas:

                1. Each time, take the two nodes with the smallest weights and combine them into a subtree whose root weight is their sum.

                2. Remove those two nodes from the sequence.

                3. Put the new subtree back into the sequence.

                4. Repeat steps 1 to 3 until only one node is left; that node is the root.
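        The four steps can be sketched on bare weights before building real tree nodes. This is a simplified illustration, not the implementation that follows: the names `GreedyMergeDemo` and `minimalWpl` are made up, and instead of a tree it only tracks the total. It relies on the fact that the weighted path length equals the sum of the weights of all merged (internal) nodes, since every merge pushes the leaves below it one level deeper.

```java
import java.util.PriorityQueue;

// Illustrative sketch: the greedy merge loop on plain integers.
class GreedyMergeDemo {

    static int minimalWpl(int[] weights) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int w : weights) {
            queue.add(w);
        }
        int total = 0;
        while (queue.size() > 1) {
            int merged = queue.poll() + queue.poll(); // steps 1 and 2: take and remove the two smallest
            total += merged;
            queue.add(merged);                        // step 3: put the subtree weight back
        }
        return total;
    }

    public static void main(String[] args) {
        // The weights a:7 b:5 c:2 d:4 from section 2.1: the greedy result matches WPL(c).
        System.out.println(minimalWpl(new int[] {7, 5, 2, 4})); // 35
    }
}
```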

        Example: a:3 b:24 c:6 d:20 e:34 f:4 g:12. Build a Huffman tree from these weights (code implementation).

        Node data structure:

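package tree.哈夫曼;

public class HfmNode implements Comparable<HfmNode> { // ordered by weight, so the priority queue serves the smallest node first

	String chars; // characters stored in this node
	int fre; // frequency, which also serves as the weight
	HfmNode left;
	HfmNode right;
	HfmNode parent; // used to walk up toward the root

	/**
	 * Compares nodes by weight for use in the priority queue.
	 *
	 * @param o the node to compare with
	 * @return a negative, zero, or positive value if this weight is less than,
	 *         equal to, or greater than the other weight
	 */
	@Override
	public int compareTo(HfmNode o) {
		return Integer.compare(this.fre, o.fre); // avoids the overflow risk of this.fre - o.fre
	}

}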

        Huffman tree: (tree construction, code generation, and decoding)

               The JDK's PriorityQueue is used here. This class is part of the Java Collections Framework and implements a priority queue, a specialized queue whose elements are ordered by priority. In Java the head of the queue is the smallest element according to the ordering (here, compareTo on HfmNode), so poll() always removes the node with the smallest weight first.
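        A quick way to see this ordering is to add a few weights out of order and poll them back. The class `PriorityQueueDemo` and its helper `drainOrder` are illustrative names only; `PriorityQueue`, `poll()` and `Comparator.reverseOrder()` are standard JDK APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch: java.util.PriorityQueue serves the smallest element first.
class PriorityQueueDemo {

    // Adds the values in the given order, then polls until the queue is empty.
    static List<Integer> drainOrder(int[] values) {
        PriorityQueue<Integer> queue = new PriorityQueue<>();
        for (int v : values) {
            queue.add(v);
        }
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll());
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(drainOrder(new int[] {24, 3, 34, 6})); // [3, 6, 24, 34]

        // A comparator can flip the order if the largest element should come first.
        PriorityQueue<Integer> maxFirst = new PriorityQueue<>(Comparator.reverseOrder());
        maxFirst.add(24);
        maxFirst.add(3);
        maxFirst.add(34);
        System.out.println(maxFirst.poll()); // 34
    }
}
```

        Smallest-first is exactly what the Huffman construction needs, so the natural ordering is enough and no comparator is required there.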

package tree.哈夫曼;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmenTree {

	HfmNode root;
	List<HfmNode> leafs; // leaf nodes
	Map<Character, Integer> weights; // weight of each leaf: a, b, c, d, e, ...

	public HuffmenTree(Map<Character, Integer> weights) {
		this.weights = weights;
		leafs = new ArrayList<HfmNode>();
	}

	// Decoding: walk down from the root (0 = left, 1 = right) and emit the character at each leaf.
	// The code table parameter is not used, because the tree itself drives the decoding.
	public String decode(Map<Character, String> code, String encodedStr) {
		StringBuilder decodedStr = new StringBuilder();
		HfmNode currentNode = root;
		for (int i = 0; i < encodedStr.length(); i++) {
			char c = encodedStr.charAt(i);
			if (c == '0') {
				currentNode = currentNode.left;
			} else if (c == '1') {
				currentNode = currentNode.right;
			} // any other character is ignored
			if (currentNode.left == null && currentNode.right == null) { // reached a leaf
				decodedStr.append(currentNode.chars);
				currentNode = root; // start again from the root for the next character
			}
		}
		return decodedStr.toString();
	}

	public void encode() { // encoding is not written here; it is left as an after-class exercise

	}

	// Build the code of every leaf by walking from the leaf up to the root
	public Map<Character, String> code() {

		Map<Character, String> map = new HashMap<Character, String>();
		for (HfmNode node : leafs) {
			String code = "";
			Character c = node.chars.charAt(0); // a leaf always holds exactly one character
			HfmNode current = node;
			while (current.parent != null) { // parent == null means we have reached the root
				if (current == current.parent.left) { // current is a left child
					code = "0" + code;
				} else { // current is a right child
					code = "1" + code;
				}
				current = current.parent;
			}
			map.put(c, code);
			System.out.println(c + ":" + code);
		}
		return map;

	}

	public void creatTree() {
		Character keys[] = weights.keySet().toArray(new Character[0]); // take out all the characters
		PriorityQueue<HfmNode> priorityQueue = new PriorityQueue<HfmNode>(); // the JDK's built-in priority queue
		for (Character c : keys) {
			HfmNode hfmNode = new HfmNode();
			hfmNode.chars = c.toString();
			hfmNode.fre = weights.get(c); // weight
			priorityQueue.add(hfmNode); // first fill the priority queue with every leaf
			leafs.add(hfmNode);
		}

		int len = priorityQueue.size();
		for (int i = 1; i <= len - 1; i++) { // each round merges the two smallest nodes
			HfmNode n1 = priorityQueue.poll();
			HfmNode n2 = priorityQueue.poll(); // the first two polled are guaranteed to be the two smallest

			HfmNode newNode = new HfmNode();
			newNode.chars = n1.chars + n2.chars; // record the combined characters (optional)
			newNode.fre = n1.fre + n2.fre; // add the weights

			// maintain the tree structure
			newNode.left = n1;
			newNode.right = n2;
			n1.parent = newNode;
			n2.parent = newNode;

			priorityQueue.add(newNode);
		}
		root = priorityQueue.poll(); // the last remaining node is the root
		System.out.println("Tree built");
	}

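	public static void main(String[] args) {
		// a:3 b:24 c:6 d:20 e:34 f:4 g:12
		Map<Character, Integer> weights = new HashMap<Character, Integer>();
		// In general, with dynamic encoding we do not know the content in advance, so we need a
		// code book, which is usually some dictionary. For English text, use an English dictionary
		// and count occurrences.
		// The code book can be swapped.
		// For a static file, build the code specifically for that file. Image encryption has no
		// such characteristics. Hashing (MD5).
		weights.put('a', 3);
		weights.put('b', 24);
		weights.put('c', 6);
		weights.put('d', 20); // was 1, which did not match the example weights or the codes listed below
		weights.put('e', 34);
		weights.put('f', 4);
		weights.put('g', 12);

		HuffmenTree huffmenTree = new HuffmenTree(weights);
		huffmenTree.creatTree();
		Map<Character, String> code = huffmenTree.code();
		String str = "aceg";
		char s[] = str.toCharArray();
		StringBuilder encoded = new StringBuilder();
		for (char ch : s) { // look up each character in the code table
			encoded.append(code.get(ch));
		}
		System.out.println("After encoding: " + encoded);
		String decode = huffmenTree.decode(code, encoded.toString());
		System.out.println("After decoding: " + decode);
	}

	/*
	 * Codes printed by code() for the weights above:
	 * a:10110
	 * b:01
	 * c:1010
	 * d:00
	 * e:11
	 * f:10111
	 * g:100
	 */
}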

Four: Summary

        After learning the Huffman tree, we can return to the two thinking questions. Can both of them now be solved easily?

        Telegraph design:

                1. The shorter the encoded telegram, the better, so that it can be sent quickly.

                2. Difficult to crack.

                3. Easy to decode.

                4. The coding tree can be changed quickly.

                5. Reversible. An example of something irreversible is a hash such as MD5. Plain MD5 has been cracked in practice by exhaustive search against large lookup libraries. A typical use is md5(password): when an interface sends a password from the front end to the back end, the password is hashed with a hash function first. What to do when the data has to be recovered after transmission is a topic for later.

        Therefore, much of today's digital communication uses Huffman coding. Compression should also be easier to understand now: once you understand the idea behind Huffman coding it is actually quite simple, and you should be able to implement it yourself soon.
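        To connect this back to the compression question, the sketch below estimates how many bits the example text would take. The class `CompressedSizeDemo` and its method names are made up for illustration; the frequencies are the example weights a:3 b:24 c:6 d:20 e:34 f:4 g:12 and the codes are the ones listed in the comment at the end of the implementation above. It counts only the encoded bits and ignores the cost of storing the code table itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: encoded size = sum of frequency * code length.
class CompressedSizeDemo {

    static int encodedBits(Map<Character, Integer> freq, Map<Character, String> codes) {
        int bits = 0;
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            bits += e.getValue() * codes.get(e.getKey()).length();
        }
        return bits;
    }

    static Map<Character, Integer> sampleFrequencies() {
        Map<Character, Integer> freq = new LinkedHashMap<>();
        freq.put('a', 3);
        freq.put('b', 24);
        freq.put('c', 6);
        freq.put('d', 20);
        freq.put('e', 34);
        freq.put('f', 4);
        freq.put('g', 12);
        return freq;
    }

    static Map<Character, String> sampleCodes() {
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('a', "10110");
        codes.put('b', "01");
        codes.put('c', "1010");
        codes.put('d', "00");
        codes.put('e', "11");
        codes.put('f', "10111");
        codes.put('g', "100");
        return codes;
    }

    public static void main(String[] args) {
        int chars = 0;
        for (int f : sampleFrequencies().values()) {
            chars += f;
        }
        System.out.println("8 bits per character: " + chars * 8);                        // 824
        System.out.println("3 bits per character: " + chars * 3);                        // 309
        System.out.println("Huffman codes: " + encodedBits(sampleFrequencies(), sampleCodes())); // 251
    }
}
```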


Origin blog.csdn.net/qq_67801847/article/details/132803223