Text filtering with the DFA algorithm

I. Introduction to the DFA algorithm

Among text filtering algorithms, DFA is one of the few that performs really well.

DFA stands for Deterministic Finite Automaton. A DFA consists of a finite set of states and a set of transitions leading from one state to another, each transition labeled with a symbol; one state is the initial state, and certain states are final (accepting) states. Unlike a nondeterministic finite automaton, in a DFA no two transitions leaving the same state carry the same label.

Put simply, the next state is determined by the current state plus an event, i.e. event + state = nextstate. You can picture the system as a set of nodes: each incoming event decides which edge to follow to the next node, and the number of nodes is finite.
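The event + state = nextstate idea can be sketched with a tiny transition table. This is a minimal illustration, not from the original post; the state names S0/S1/S2 and the accepted word "ab" are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of "event + state = nextstate": states S0..S2,
// where the only accepting path spells out "ab".
public class DfaSketch {
    // transition table: current state -> (input char -> next state)
    private static final Map<String, Map<Character, String>> TRANSITIONS = new HashMap<>();
    static {
        TRANSITIONS.put("S0", Map.of('a', "S1"));
        TRANSITIONS.put("S1", Map.of('b', "S2"));
    }

    // Returns true if consuming the whole input ends in the accepting state S2.
    public static boolean accepts(String input) {
        String state = "S0";
        for (char event : input.toCharArray()) {
            Map<Character, String> edges = TRANSITIONS.get(state);
            String next = (edges == null) ? null : edges.get(event);
            if (next == null) {
                return false; // no transition defined for this event: reject
            }
            state = next; // event + state = nextstate
        }
        return "S2".equals(state);
    }

    public static void main(String[] args) {
        System.out.println(accepts("ab")); // true
        System.out.println(accepts("ax")); // false
    }
}
```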

II. Sensitive word filtering with DFA in practice

1. Building the sensitive word dictionary

Take the two sensitive words 王八蛋 ("son of a bitch") and 王八羔子 ("turtle's bastard") as examples. First we build the sensitive word dictionary, named SensitiveMap; as a tree (a trie), the two words look like this:

Represented as nested hash tables, the structure is:

{
    "王":{
        "isEnd":"0",
        "八":{
            "羔":{
                "子":{
                    "isEnd":"1"
                },
                "isEnd":"0"
            },
            "isEnd":"0",
            "蛋":{
                "isEnd":"1"
            }
        }
    }
}

How do we build this data structure in code?

    /**
     * Reads the sensitive word set and builds the DFA model as nested HashMaps.
     *
     * @param keyWordSet the set of sensitive words
     */
    public Map<String, Object> addSensitiveWordToHashMap(Set<String> keyWordSet) {
        // size the container up front to avoid rehashing
        Map<String, Object> map = new HashMap<>(Math.max((int) (keyWordSet.size() / .75f) + 1, 16));
        // iterate over keyWordSet
        for (String keyWord : keyWordSet) {
            Map nowMap = map;
            for (int i = 0; i < keyWord.length(); i++) {
                // take the current character
                char keyChar = keyWord.charAt(i);
                // look up the child node for this character
                Object wordMap = nowMap.get(keyChar);
                if (wordMap != null) {
                    // the node already exists: descend into it
                    nowMap = (Map) wordMap;
                } else {
                    // otherwise create a new node with isEnd set to "0"
                    Map<String, String> newWordMap = new HashMap<>(3);
                    newWordMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWordMap);
                    nowMap = newWordMap;
                }
                // mark the last character of the word as terminal
                if (i == keyWord.length() - 1) {
                    nowMap.put("isEnd", "1");
                }
            }
        }
        return map;
    }
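As a quick sanity check, here is a self-contained sketch that exercises the builder on the two example words. The class name `SensitiveMapDemo` is illustrative, and the `build` method mirrors the builder above; note that at runtime the character keys are autoboxed `Character` objects alongside the `String` key "isEnd".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Self-contained sketch of building and inspecting the nested map
// (mirrors addSensitiveWordToHashMap; the class name is illustrative).
public class SensitiveMapDemo {
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static Map<String, Object> build(Set<String> keyWordSet) {
        Map<String, Object> map = new HashMap<>(Math.max((int) (keyWordSet.size() / .75f) + 1, 16));
        for (String word : keyWordSet) {
            Map nowMap = map;
            for (int i = 0; i < word.length(); i++) {
                char keyChar = word.charAt(i); // stored as a Character key at runtime
                Object wordMap = nowMap.get(keyChar);
                if (wordMap != null) {
                    nowMap = (Map) wordMap;
                } else {
                    Map<String, String> newWordMap = new HashMap<>(3);
                    newWordMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWordMap);
                    nowMap = newWordMap;
                }
                if (i == word.length() - 1) {
                    nowMap.put("isEnd", "1");
                }
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("王八蛋", "王八羔子"));
        Map<String, Object> map = build(words);
        // the two words share the prefix 王八, so the root has a single entry
        System.out.println(map.size()); // 1
    }
}
```

Because both words share the prefix 王八, they share the 王 and 八 nodes and only branch at 蛋/羔, which is exactly why this trie layout saves both memory and comparisons.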

2. Sensitive word filtering

The example above builds the SensitiveMap sensitive word dictionary. Suppose the input text contains 王八蛋: matching starts at 王 and follows the edges 王 → 八 → 蛋 through the map, succeeding when it reaches a node whose isEnd is "1".

How do we implement this matching logic in code?

    /**
     * Finds the first sensitive word contained in a string.
     *
     * @param txt the input string
     * @return the matched sensitive word if one exists, otherwise the empty string
     */
    public static String findSensitiveWord(String txt) {
        SensitiveWordInit sensitiveWordInit = SpringContextHolder.getBean(SensitiveWordInit.class);
        Map<String, Object> rootMap = sensitiveWordInit.getSensitiveWordMap();
        // try every position in the text as a potential start of a sensitive word
        for (int begin = 0; begin < txt.length(); begin++) {
            Map nowMap = rootMap;
            StringBuilder sensitiveWord = new StringBuilder();
            for (int i = begin; i < txt.length(); i++) {
                char word = txt.charAt(i);
                // follow the transition for this character
                nowMap = (Map) nowMap.get(word);
                // no transition: abandon this start position and try the next one
                if (nowMap == null) {
                    break;
                }
                // record the matched character and check for a terminal state
                sensitiveWord.append(word);
                if ("1".equals(nowMap.get("isEnd"))) {
                    // a complete sensitive word was matched
                    return sensitiveWord.toString();
                }
            }
        }
        return "";
    }
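The method above depends on a Spring bean, so as a runnable end-to-end sketch the lookup is replaced here by a locally built map. The class name and the inlined `build` helper are illustrative; the matching loop is the same walk through the nested map.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of the matching loop, with the Spring bean lookup
// replaced by a locally built dictionary (the class name is illustrative).
public class SensitiveFinderDemo {
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static Map<String, Object> build(String... words) {
        Map<String, Object> root = new HashMap<>();
        for (String w : words) {
            Map nowMap = root;
            for (int i = 0; i < w.length(); i++) {
                Object next = nowMap.get(w.charAt(i));
                if (next == null) {
                    Map<String, String> node = new HashMap<>(3);
                    node.put("isEnd", "0");
                    nowMap.put(w.charAt(i), node);
                    next = node;
                }
                nowMap = (Map) next;
                if (i == w.length() - 1) {
                    nowMap.put("isEnd", "1");
                }
            }
        }
        return root;
    }

    @SuppressWarnings("rawtypes")
    public static String findSensitiveWord(Map<String, Object> root, String txt) {
        // try every position as a potential start of a sensitive word
        for (int begin = 0; begin < txt.length(); begin++) {
            Map nowMap = root;
            StringBuilder matched = new StringBuilder();
            for (int i = begin; i < txt.length(); i++) {
                char word = txt.charAt(i);
                nowMap = (Map) nowMap.get(word);
                if (nowMap == null) {
                    break; // no transition: try the next start position
                }
                matched.append(word);
                if ("1".equals(nowMap.get("isEnd"))) {
                    return matched.toString(); // reached a terminal state
                }
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Map<String, Object> map = build("王八蛋", "王八羔子");
        System.out.println(findSensitiveWord(map, "你是王八蛋")); // 王八蛋
        System.out.println(findSensitiveWord(map, "你好"));       // (empty)
    }
}
```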

III. Optimization ideas

Words like "王&&*八蛋" pad meaningless characters in the middle to evade detection. When searching for sensitive words, we should filter out these meaningless characters as well: skip over them in the matching loop so they cannot interfere.
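One simple realization of this idea is to strip the padding characters before running the DFA scan. The stop-character set below is illustrative, not from the original post; an alternative design is to skip these characters inside the inner matching loop, which preserves the original indices if you later need to mask the match in place.

```java
import java.util.Set;

// Sketch of the optimization: remove meaningless padding characters so
// "王&&*八蛋" is matched as "王八蛋". The stop-character set is illustrative.
public class StopCharFilter {
    private static final Set<Character> STOP_CHARS = Set.of('&', '*', '#', ' ', '-');

    // Strip stop characters before handing the text to the DFA scan.
    public static String stripStopChars(String txt) {
        StringBuilder sb = new StringBuilder(txt.length());
        for (char c : txt.toCharArray()) {
            if (!STOP_CHARS.contains(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripStopChars("王&&*八蛋")); // 王八蛋
    }
}
```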


Origin www.cnblogs.com/jmcui/p/11925777.html