Java Mini-Exercise: Word Frequency Counting for English Text (Part 3)

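Preface: this exercise is mostly about sorting a dictionary (a Java Map). I'm not very familiar with Java, so there are parts I can't fully explain, and I looked up some material online.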

Word Frequency Counting for English Text

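Task goal: find the 5 most frequent words in an English text, handling prepositions, tense, and plurals.
Task notes: the text first has to be split into words; then prepositions, tense, and plurals have to be handled, the words counted in a dictionary, and the result sorted and printed. For a beginner like me that was genuinely hard, but after grinding through it today and reading up on quite a few small functions, the whole thing runs. It could still be optimized, especially the sorting.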
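To make the first two steps concrete, here is a minimal sketch of splitting the text and dropping the unwanted words. It assumes an exact-match lookup in a `HashSet`, so only the listed words are removed; a substring test such as `"the".contains(token)` would also drop tokens like "he". The class name `TokenizeSketch` and the method `tokenize` are made up for illustration, and the word list is the one used in the full program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenizeSketch {
    // Words to drop; the same list the full program uses (not a complete stop-word list).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "of", "the", "and", "in", "a", "to", "as", "s", "for", "on", "with", "out"));

    // Lower-case the text, split on runs of punctuation/whitespace,
    // and keep every non-empty token that is not exactly a stop word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[\\p{Punct}\\s]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The students visited SWUFE, and he spoke."));
    }
}
```

For the sample sentence in `main`, this keeps students, visited, swufe, he, and spoke.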
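Implementation:
I won't walk through each piece separately. Here is the complete program, with comments and explanations added inline. Some lines are quite wide, so you may need to scroll sideways.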


package test4;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NewClass {
    public static void main(String[] args) {
        // Load the text
        String words = "The Delegation of Sichuan Tourism University Visited SWUFE\n"
                + "On May 31st, the delegation of Sichuan Tourism University, including YAN Qipeng, Secretary of the Party Committee, BAI Jie, Vice President, and representatives of related departments visited SWUFE. ZHAO Dewu, Chairman of the University Council, LI Yongqiang, Vice President as well as representatives of President's Office, Office of Organization and Personnel, Office of Academic Affairs, School of Accounting, School of Marxism, and School of Economic Information Engineering had a discussion in the meeting room 604 of Tengxiang Building.\n"
                + " ZHAO Dewu extended a sincere welcome to YAN's visit. He introduced the general situation, historical development, development status, strategic objectives and development ideas of SWUFE. ZHAO Dewu pointed out that with Chinese Socialism's entrance into a new era, financial and economic higher education has also entered the \"new financial and economic\" era. In this context, profound changes have taken place in the discipline, organizational members and university functions. Colleges and universities should actively adapt to the changes and enhance their subjective initiatives. He hopes that both sides will further strengthen exchanges and cooperation and jointly promote development in the future.\n"
                + "YAN Qipeng expressed thanks for SWUFE's warm reception. He introduced the development history and faculty of Sichuan Tourism University, hoping to further learn from SWUFE's good practices and experience in discipline construction and talent training and further establish a long-term mechanism to strengthen in-depth exchanges and cooperation.\n"
                + "At the meeting, both sides had in-depth exchanges on talent training, grass-roots party building and comprehensive management etc.\n"
                + "After the meeting, the delegation visited the History Museum of SWUFE and the Museum of Money and Finance. (Office of the University Council)\n"
                + "Senior Diplomat of Ministry of Foreign Affairs ZHANG Limin Delivered A Lecture in SWUFE\n"
                + "In the afternoon of On May 28th, ZHANG Limin, a senior diplomat and former ambassador of the Ministry of Foreign Affairs, gave a lecture on the theme of \"China's diplomacy and the International Situation\" in the Conference Room 101 of Hongyuan Building in SWUFE. OU Bing, Vice Chairman of the University Council presided over the lecture. Student representatives from the fifth training course for college student cadres as well as Head Start classrooms of talent training for international organizations presented the lecture.\n"
                + "Before the lecture, OU Bing gave a brief introduction ZHANG. On behalf of SWUFE, he expressed a warm welcome to ZHANG Limin. OU sincerely hoped that students would cherish the opportunity of face-to-face communication with ZHANG. SWUFE students are expected to understand the national conditions, have a global view and strive to become innovative talents who are familiar with international rules.\n"
                + "In the lecture, on the basis of his nearly four decades of diplomatic experience, ZHANG introduced some basic knowledge, diplomatic etiquette in the international diplomacy, the main tasks and work contents of embassies and consulates abroad. He further expounded the significance of diplomacy and the basic requirements of personal quality for a diplomat. He also combed the history of China's diplomatic development, the current international situations and China's foreign policy. He then shared his insights on recent hot topics. He stressed that students should pay attention to the changes in world economy as well as science and technology and to analyze problems by using professional knowledge. Meanwhile, he encouraged students to participate in international organizations' internship and employment to make China heard throughout the world.\n"
                + "In the Q&A session, ZHANG patiently responded to the questions from students. Students said that the lecture has greatly extended their horizon. They have been keenly aware of the patriotic feelings, and realized the glory of working as a diplomat for the country.\n"
                + "ZHANG Limin is a senior diplomat of the Ministry of Foreign Affairs, a former Ambassador of the Chinese Embassy in Italy, Consul General of the Consulate General in Milan, former Chinese Ambassador to Guyana, and now the member of the Council for Promoting South-South Cooperation as well as the special consultant to the Research Center for Latin America, SWUFE. (Student Affairs Department, School of International Business)";
        
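        String wordsL = words.toLowerCase();    // convert the text to lower case
        String regex = "[\\p{Punct}\\s]+";     // regex for runs of punctuation and whitespace, used as the delimiter
        String[] s = wordsL.split(regex);      // split the text into words

        // Remove prepositions
        // Idea: replace each one with an empty string
        // Note: str[i].contains(s[j]) tests whether the token is a substring of a list entry,
        // so fragments such as "he" (inside "the") are blanked too; equals() would remove only the listed words
        String[] str = {"of","the","and","in","a","to","as","s","for","on","with","out"}; // word list: prepositions plus a few articles and conjunctions
        for(int i = 0; i < str.length; i++){
            for(int j = 0; j < s.length;  j++){
                if(str[i].contains(s[j])){
                    s[j] = "";   // blank out the matched word
                }
            }
        }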
        
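        // Handle plurals and tense (strip the suffixes es and s for plurals, ed and ing for tense)
        // Known problem: irregular words are not recognized, and once a suffix is stripped the original form is lost, though the stem is usually still recognizable
        // Idea: strip every trailing es, s, ed, ing
        String[] end = {"es","s","ed","ing"}; // plural and tense suffixes
        for(int i = 0; i < s.length; i++){
            for(int j = 0; j < end.length; j++){
                if(s[i].endsWith(end[j])){
                    s[i] = s[i].substring(0, s[i].length() - end[j].length()); // keep the word without the suffix
                }
            }
        }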
        
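        // Dictionary holding the number of occurrences of each word
        // TreeMap is used here because it is a tree-based map that keeps its keys in sorted order, whereas HashMap has no defined order and is harder to work with
        Map<String, Integer> map = new TreeMap<>(); // <> denotes generics: it declares and restricts the key and value types

        // Count the occurrences of each word
        for (String s1 : s) {
            if (s1.isEmpty()) { // skip empty strings, which come from the preposition removal above
                continue;
            }
            int value = map.getOrDefault(s1, 0); // returns the stored value if the key exists, otherwise 0
            map.put(s1, value + 1); // write the new count back to the dictionary
        }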
        
        // Sort the dictionary by count: copy its entries into a List, then sort the list with Collections.sort
        List<Map.Entry<String, Integer>> list = new ArrayList<>(map.entrySet()); // entrySet() returns a Set of Map.Entry objects

        Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> v1, Map.Entry<String, Integer> v2) {
                return v2.getValue().compareTo(v1.getValue()); // comparing v2 with v1 sorts from largest to smallest
            }
        });
        
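        // Print the 5 most frequent words
        System.out.println("The 5 most frequent words are:");
        for (Map.Entry<String, Integer> entry : list.subList(0, Math.min(5, list.size()))) { // loop over the first entries of the sorted list
            System.out.println(entry.getKey()+"====>"+entry.getValue());
        }
    }
}

run:
The 5 most frequent words are: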
swufe====>11
student====>9
zhang====>8
international====>7
lecture====>7
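The keys printed here are stems rather than the original words, because the program strips suffixes before counting ("students" is counted under "student"). Its inner loop tries es, s, ed, and ing in turn on the same word, so more than one suffix can come off ("classes" becomes "class" and then "clas"), and short words shrink too ("his" becomes "hi"). A possible refinement is sketched here, assuming that at most one suffix should be removed and that the remaining stem must keep at least 3 letters. The name `stripSuffix` and the threshold are my own assumptions, not part of the program above, and this is still a crude heuristic rather than a real stemmer.

```java
class SuffixSketch {
    // Suffixes are tried in this order and the first match wins,
    // so "es" is checked before the shorter "s".
    static final String[] SUFFIXES = {"ing", "es", "ed", "s"};

    // Assumed minimum stem length; protects short words such as "his".
    static final int MIN_STEM = 3;

    // Remove at most one suffix from the word; return the word unchanged otherwise.
    static String stripSuffix(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= MIN_STEM) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"students", "classes", "his", "visited", "training"}) {
            System.out.println(w + " -> " + stripSuffix(w));
        }
    }
}
```

Irregular forms are still missed, and words such as "status" would still lose their final s, so the counts remain approximate.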

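With that, the five most frequent words in the document have been found. For the sorting, the only approach I have found so far is to copy the TreeMap's entries into a list and call Collections.sort; I'll keep improving this later.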
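One alternative for the sorting step, as a sketch rather than a definitive answer, is the Java 8 stream API with the ready-made `Map.Entry` comparators. The version here sorts by count in descending order and breaks ties alphabetically by key, which makes the tie order explicit instead of relying on the TreeMap's key order plus a stable sort. The class name `TopWordsSketch`, the method `topN`, and the sample words are made up for illustration.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class TopWordsSketch {
    // Return the n entries with the highest counts; ties are ordered by key.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n) // safe even when the map has fewer than n entries
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : new String[]{"lecture", "swufe", "lecture", "zhang", "swufe", "lecture"}) {
            counts.merge(w, 1, Integer::sum); // add 1 to the existing count, or start at 1
        }
        for (Map.Entry<String, Integer> e : topN(counts, 2)) {
            System.out.println(e.getKey() + "====>" + e.getValue());
        }
    }
}
```

Because the comparator handles ties itself, a plain HashMap is enough for counting in this variant.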


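Closing
That wraps up the word frequency exercise for English text. If you have other questions, send me a private message and I'll reply as soon as I see it. Thanks also to the authors of the reference material I consulted. Remember to change the package name to suit your own project.
Thanks for reading.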

Reposted from blog.csdn.net/qq_35149632/article/details/104784617