Solving OJ Problems in Python: UVa 10815 Andy's First Dictionary

Problem statement:

Andy, 8, has a dream - he wants to produce his very own dictionary. This is not an easy task for him, as the number of words that he knows is, well, not quite enough. Instead of thinking up all the words himself, he has a brilliant idea. From his bookshelf he would pick one of his favourite story books, from which he would copy out all the distinct words. By arranging the words in alphabetical order, he is done! Of course, it is a really time-consuming job, and this is where a computer program is helpful. You are asked to write a program that lists all the different words in the input text. In this problem, a word is defined as a consecutive sequence of alphabets, in upper and/or lower case. Words with only one letter are also to be considered. Furthermore, your program must be CaSe InSeNsItIvE. For example, words like “Apple”, “apple” or “APPLE” must be considered the same.

Input

The input file is a text with no more than 5000 lines. An input line has at most 200 characters. Input
is terminated by EOF.

Output

Your output should give a list of different words that appear in the input text, one in a line. The
words should all be in lower case, sorted in alphabetical order. You can be sure that the number of
distinct words in the text does not exceed 5000.
Sample Input
Adventures in Disneyland
Two blondes were going to Disneyland when they came to a fork in the
road. The sign read: “Disneyland Left.”
So they went home.
Sample Output
a
adventures
blondes
came
disneyland
fork
going
home
in
left
read
road
sign
so
the
they
to
two
went
were
when
————————————————————————————————————————————
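The gist of the problem: you are given an English story as input. Ignoring case, extract every word that appears in it and print the words in alphabetical order, printing each repeated word only once.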

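The approach:
Here is what we need to do to the text:

(1) Remove the influence of punctuation
(2) Convert the whole text to lowercase
(3) Extract the words, keeping only one copy of each duplicate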
——————————————————

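Method 1

(1) The problem says the text arrives line by line, so we can first collect each line as an element of a list,
write a function that processes a single line, and then use map to apply it to every line. map processes the lines one at a time (it is not concurrent), but it keeps the code compact.

(2) For each line, we can use the punctuation attribute of the string module (the ASCII punctuation characters) and loop over it to clean out the punctuation.
There are actually two ways to do this: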

import string

a = "fddasfads,asdfas.fasdfasdfw ?"
b = a.translate(str.maketrans('', '', string.punctuation))

print(b)  # fddasfadsasdfasfasdfasdfw

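This approach deletes the punctuation outright, which glues the neighboring words together. Since the problem defines a word as a consecutive sequence of letters, this obviously produces wrong results.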

The loop-and-replace approach:

import string

a = "fddasfads,asdfas.fasdfasdfw ?"
for i in string.punctuation:
    a = a.replace(i, ' ')


print(a)  # fddasfads asdfas fasdfasdfw

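This is clearly better, because every punctuation character becomes a space and still separates the words around it.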
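The translate approach can also be made to work if each punctuation character is mapped to a space instead of being deleted. Below is a minimal sketch of that idea. The names PUNCT_TO_SPACE and punctuation_to_spaces are made up for illustration, and it only covers the ASCII punctuation in string.punctuation.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
# (Hypothetical helper names, not part of the original solution.)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def punctuation_to_spaces(line):
    """Replace each ASCII punctuation character in `line` with a space."""
    return line.translate(PUNCT_TO_SPACE)


a = "fddasfads,asdfas.fasdfasdfw ?"
print(punctuation_to_spaces(a).split())  # ['fddasfads', 'asdfas', 'fasdfasdfw']
```

This does the replacement in a single pass over the string rather than one replace call per punctuation character.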

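(3) After handling the punctuation, call lower() to lowercase everything and then split the line on spaces to get a list of words. Calling split() with no arguments is the safest form, since it treats a run of whitespace as one separator and never produces empty strings.
A set removes duplicates automatically, so wrapping the list in set() gives the set of words in this line, and the line is done.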

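(4) Finally, map yields one set per line (in Python 3 it returns a lazy iterator rather than a list). We loop over the sets and merge them all into one, which gives the set of target words. sorted() can be applied to that set directly.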
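Step (4) can be sketched in isolation like this. The function line_words is a simplified, hypothetical stand-in for the per-line cleaning function (it only handles the period), and set().union(*...) is just one possible way to merge the sets.

```python
lines = ["So they went home.", "They came to a fork"]


def line_words(line):
    # Simplified stand-in for the real cleaning function: handles "." only.
    return set(line.lower().replace(".", " ").split())


per_line_sets = map(line_words, lines)   # lazy iterator: nothing has run yet
all_words = set().union(*per_line_sets)  # consumes the iterator and merges the sets
print(sorted(all_words))
# ['a', 'came', 'fork', 'home', 'so', 'they', 'to', 'went']
```

Note that "They" and "they" collapse into a single entry because each line is lowercased before it is split.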

The full code:

import string


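def wash_data(s):
    # Replace ASCII punctuation and digits with spaces so they act as separators
    for k in string.punctuation:
        s = s.replace(k, " ")
    for n in range(10):
        s = s.replace(str(n), " ")
    # split() with no argument collapses runs of whitespace, so no empty strings
    list_1 = s.lower().split()
    set_1 = set(list_1)
    return set_1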


list_str = []  # collects the input lines
set_all = set()  # start with an empty set; used when merging later
while True:
    try:
        str_ = input()
        list_str.append(str_)
    except EOFError:
        break
list_set = map(wash_data, list_str)
for i in list_set:
    set_all = set_all | i
for j in sorted(set_all):
    print(j)
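One caveat: string.punctuation only contains ASCII punctuation, so any other non-letter character (for example the curly quotes in the sample text as shown above) would stay attached to a word. Here is a rough sketch of a variant that keeps only the letters a-z and turns everything else into a space. The function names are made up for illustration, and the sample lines are hard-coded instead of being read from input.

```python
def words_in_line(line):
    """Return the set of lowercase words in one line, keeping only a-z."""
    cleaned = "".join(c if "a" <= c <= "z" else " " for c in line.lower())
    return set(cleaned.split())


def build_dictionary(lines):
    """Return all distinct words across `lines`, sorted alphabetically."""
    return sorted(set().union(*map(words_in_line, lines)))


sample = [
    "Adventures in Disneyland",
    "Two blondes were going to Disneyland when they came to a fork in the",
    "road. The sign read: “Disneyland Left.”",
    "So they went home.",
]
for word in build_dictionary(sample):
    print(word)
```

On the sample input this prints the 21 words of the sample output, from "a" to "when".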

————————————————————————————————————————————

Method 2

More advanced than Method 1: this one uses Python's powerful regular expressions.
Code first:

from re import split
from sys import stdin


def Dict():
    str_1 = stdin.read()
    str_2 = str_1.lower()
    str_3 = set(split(r'[^a-z]', str_2))
    for i in sorted(str_3):
        if i:
            print(i)


Dict()

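We import split() from the re module. It is quite different from the built-in str.split and much more powerful. (Don't ask me why the OJ allows the re module; I only found out by asking someone more experienced.)
(1) We read all the text from standard input in one go.

(2) About the regular expression: [^a-z] matches any single character that is not a lowercase letter, and we split the string at those characters. Regular expression syntax and features are far too extensive to cover here; look them up if you want to learn more.

(3) When printing, we first have to check whether the string is empty. str.split() with no arguments treats a whole run of whitespace as one separator and never returns empty strings. re.split with [^a-z], on the other hand, splits at every single non-letter character, so two adjacent separators (for example a period followed by a space) leave an empty string between them.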
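The empty strings are easy to see in a small experiment. The snippet below also shows re.findall with the pattern [a-z]+ as one possible alternative that collects the words directly, so no empty-string check is needed. This is only a sketch on a made-up test string, not a replacement for the solution above.

```python
import re

text = "road. The sign read: left."
lowered = text.lower()

# Splitting at every non-letter leaves '' between adjacent separators
pieces = re.split(r"[^a-z]", lowered)
print(pieces)
# ['road', '', 'the', 'sign', 'read', '', 'left', '']

# Matching runs of letters instead returns only the words
words = re.findall(r"[a-z]+", lowered)
print(sorted(set(words)))
# ['left', 'read', 'road', 'sign', 'the']
```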

If you have any suggestions for improving the methods above, please let me know in the comments.

CxsGhost
