Natural Language Processing Assignment A1

Task 1: Convert the HTML file to JSON data, then use Python's json package to load the JSON into data structures Python can work with (dicts, lists, …) (chaos2json.py)

Your implementation should have at least one regular expression (to extract the textual content of each line), and use NLTK’s word_tokenize function as the tokenizer. You may also use built-in string methods/operations and write your own helper functions.
The word_tokenize function does not separate hyphens, but this text uses hyphens in place of dashes, so your code should separate them.
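
One way to meet the hyphen requirement above is to pad every hyphen with spaces before tokenizing, so that the tokenizer emits it as its own token. The sketch below is only an illustration: separate_hyphens is a hypothetical helper name, and str.split stands in for NLTK's word_tokenize so that the snippet runs without NLTK or its data files.

```python
import re

def separate_hyphens(text):
    """Pad each run of hyphens with spaces so it becomes a separate token."""
    # 'studding-sail' -> 'studding - sail'; spaces already around the hyphen are absorbed
    padded = re.sub(r'\s*(-+)\s*', r' \1 ', text)
    return padded.strip()

line = 'Recipe, pipe, studding-sail, choir'
# str.split is a stand-in here; in the assignment the padded text would go to word_tokenize
print(separate_hyphens(line).split())
```

Padding keeps the hyphen in the token list, whereas replacing it with a space discards it; which of the two is wanted depends on how the later tasks use the tokens.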

Hint 1: The HTML contains (nonstandard) tags like <xxx1> at the beginning of each line. The number is the line within the stanza (between 1 and 4). Ignore ellipsis lines indicating removed stanzas.
Hint 2: When converting to JSON, use the indent argument to make it more human-readable.
(This script should not take extremely long to implement, but it will probably take you longer than you expect.)
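
Hint 1 can be sketched with a regular expression that captures both the line number and the line text. Everything about the markup here is an assumption: the tag shape <xxx1> to <xxx4> is taken from the regex used in the script below, the sample HTML is invented, and treating "a line with no letters" as an ellipsis line is a guess about how removed stanzas are marked. extract_lines is a hypothetical helper name.

```python
import re

# Hypothetical sample in the shape the hints describe; the real chaos.html may differ.
SAMPLE_HTML = (
    '<p><xxx1>first line<br>\n'
    '<xxx2>second line<br>\n'
    '<xxx3>third line<br>\n'
    '<xxx4>fourth line</p>\n'
    '<p><xxx1>. . .</p>\n'
)

# Group 1: line number within the stanza; group 2: the content of the line.
LINE_RE = re.compile(r'<xxx([1-4])>(.*?)(?:<br>|</p>)')

def extract_lines(html):
    """Return (line_number, text) pairs, skipping lines that contain no letters."""
    lines = []
    for num, text in LINE_RE.findall(html):
        if not any(ch.isalpha() for ch in text):
            continue  # assumed shape of an ellipsis line marking removed stanzas
        lines.append((int(num), text.strip()))
    return lines

print(extract_lines(SAMPLE_HTML))
```

Reading the number from the tag avoids keeping a separate counter, and a line number of 1 marks the start of a new stanza.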

from urllib import request
from bs4 import BeautifulSoup
from nltk import word_tokenize
import re
import json

url = 'file:///E:/学习文档/数据集/a1/chaos.html'

# Open the URL and return the raw HTML bytes
def open_url(url):
    # Build a request object for the URL
    req = request.Request(url)
    # Add a browser-like User-Agent header (only relevant for http(s) URLs, harmless for file://)
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36')
    # Send the request
    response = request.urlopen(req)
    # Return the response body
    return response.read()

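# Extract the content of each line with a regular expression
def find_tag(url, regex = '<xxx.>(.*?)(<br>|</p>)'):
    # How (.*?) works:
    # .  matches any character except \n
    # *  repeats the preceding item zero or more times
    # ?  placed after * makes the repetition non-greedy (match as little as possible)
    # () is a capturing group: the text it matches is returned separately,
    #    e.g. a(b)c matches only 'abc' and captures 'b'
    # So (.*?) captures the shortest run of non-newline characters before <br> or </p>.
    # The pattern has two groups, so re.findall returns (content, closing tag) tuples.
    html = open_url(url).decode('utf-8')
    # Hyphen filter: replace every hyphen with a space (note that this drops the hyphen),
    # e.g. 'Recipe, pipe, studding-sail, choir' becomes 'Recipe, pipe, studding sail, choir'
    html = html.replace('-', ' ')
    result = re.findall(regex, html)
    return result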

# Find the rhyme word: the last token of the line that is not punctuation
def find_rhymeWord(tokens):
    for token in reversed(tokens):
        # Skip tokens without any letter, such as '.', '!' or the '' quote token
        if any(ch.isalpha() for ch in token):
            return token
    # No word token found in this line
    return None



tag = find_tag(url)
setDict = []
numStanza = 1
count = 1  # line number within the stanza; it should really be read from the <xxx.> tag, but I was lazy
switch = 0  # becomes 1 once the current stanza has more than one line
for i in range(len(tag)):
    # Start of a stanza
    if tag[i][0][0:3] == '<p>':
        # Store the previous stanza only if it had more than one line
        if switch == 1:
            setDict.append(dictionary)
            numStanza += 1
            switch = 0
        count = 1
        dictionary = dict()
        dictionary['stanza'] = numStanza
        # Strip the markup (such as <tt>...</tt>) with BeautifulSoup, then remove the
        # non-breaking spaces (\xa0) left over from HTML entities
        text = BeautifulSoup(tag[i][0], "lxml").get_text().replace('\xa0', '')
        tokens = word_tokenize(text)

        dictionary["lines"] = [{"lineId": '{}-{}'.format(numStanza, count), "lineNum": count, "text": text,
                                "tokens": tokens, "rhymeWord": find_rhymeWord(tokens)}]
    else:
        switch = 1
        count += 1
        text = BeautifulSoup(tag[i][0], "lxml").get_text().replace('\xa0', '')
        tokens = word_tokenize(text)
        dictionary["lines"].append({"lineId": '{}-{}'.format(numStanza, count), "lineNum": count, "text": text,
                                    "tokens": tokens, "rhymeWord": find_rhymeWord(tokens)})

# The loop only stores a stanza when the next one starts, so store the last stanza here
if switch == 1:
    setDict.append(dictionary)

js = json.dumps(setDict, indent=4)
print(js)
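
Hint 2 and the conversion back into Python data can be checked in isolation with a small round trip. This is a minimal sketch: the record below only mimics the shape of the script's output, and its values are invented.

```python
import json

# Toy record in the same shape as the script's output (values are invented).
stanzas = [{
    'stanza': 1,
    'lines': [{'lineId': '1-1', 'lineNum': 1, 'text': 'an example line',
               'tokens': ['an', 'example', 'line'], 'rhymeWord': 'line'}],
}]

# indent makes the JSON human-readable; ensure_ascii=False keeps non-ASCII text as-is.
js_text = json.dumps(stanzas, indent=4, ensure_ascii=False)
print(js_text)

# json.loads turns the string back into Python lists and dicts.
restored = json.loads(js_text)
print(restored[0]['lines'][0]['rhymeWord'])
```

Because the structure contains only lists, dicts, strings and integers, the restored object compares equal to the original.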

Task 2: Look up each rhyming word in cmudict and add its possible pronunciations to the JSON data (allpron.py)

How many rhyming words are NOT found in cmudict (they are “out-of-vocabulary”, or “OOV”)? In your code, leave a comment indicating how many and give a few examples.

import cmudict
import json
# cmudict.entries() gives (word, phoneme list) tuples, one per pronunciation, e.g.
# ('apple', ['AE1', 'P', 'AH0', 'L'])
# Group them into a dict so that each word maps to all of its pronunciations

pron = {}
for word, phones in cmudict.entries():
    pron.setdefault(word, []).append(phones)


# js is the output of the previous task
setDict = json.loads(js)
list_OOV = []
for i in setDict:
    for j in i['lines']:
        # The word may be missing from cmudict (or the line may have no rhyme word)
        word = j['rhymeWord']
        if word is not None and word.lower() in pron:
            j['rhymeProns'] = pron[word.lower()]
        else:
            j['rhymeProns'] = []
            list_OOV.append(word)

print(list_OOV)

['Terpsichore', 'reviles', 'endeavoured', 'tortious', 'clangour', 'hygienic', 'inveigle',
 'mezzotint', 'Cholmondeley', 'obsequies', 'dumbly', 'vapour', 'fivers', 'gunwale']
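
To answer the "how many" question, it can help to count each OOV word once, regardless of case or repetition. The sketch below assumes a plain dict from lowercase word to a list of pronunciations; find_oov is a hypothetical helper and toy_prons is a tiny stand-in for cmudict, so the numbers it prints say nothing about the real data.

```python
def find_oov(rhyme_words, pron_dict):
    """Return the distinct words (case-insensitive) that have no entry in pron_dict."""
    seen = set()
    oov = []
    for word in rhyme_words:
        key = word.lower()
        if key not in pron_dict and key not in seen:
            seen.add(key)
            oov.append(word)
    return oov

# Toy stand-in for cmudict: word -> list of pronunciations.
toy_prons = {'apple': [['AE1', 'P', 'AH0', 'L']]}
words = ['Apple', 'vapour', 'Vapour', 'gunwale']
oov = find_oov(words, toy_prons)
print(len(oov), oov)
```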

Task 3: Use a heuristic to decide whether two pronunciations rhyme; near rhymes do not count (exact_rhymes.py)

How many pairs of lines that are supposed to rhyme actually have rhyming pronunciations according to your heuristic? For how many lines does having the rhyming line help you disambiguate between multiple possible pronunciations? What are some reasons that your heuristic is imperfect?

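I did not much feel like doing this one. A possible approach is to compare the final sounds of each line's rhyme word, but how many final phonemes should be compared? One rule could be: walk backwards from the end while the phonemes are identical, and at the first mismatch check whether it is a consonant. (A pair of final consonants such as s and z might also be allowed to rhyme, which could be treated as a separate rule.)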
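
One possible heuristic, sketched here under stated assumptions and not as the assignment's reference solution, follows a common definition of an exact rhyme: two pronunciations rhyme when their phonemes match from the last primary-stressed vowel to the end. It assumes cmudict-style phoneme lists in which vowels carry a stress digit (0, 1 or 2). All function names are hypothetical, and the example pronunciations are typed by hand for illustration.

```python
def rhyme_part(phones):
    """Return the phonemes from the last primary-stressed vowel to the end.

    Falls back to the last vowel of any stress, then to the whole pronunciation.
    """
    for wanted in ('1', '012'):
        for i in range(len(phones) - 1, -1, -1):
            if phones[i][-1] in wanted:
                return phones[i:]
    return list(phones)

def strip_stress(phones):
    """Drop the stress digits so that 'AH0' and 'AH2' compare equal."""
    return [p.rstrip('012') for p in phones]

def is_exact_rhyme(phones_a, phones_b):
    """True if both pronunciations share the same rhyme part."""
    return strip_stress(rhyme_part(phones_a)) == strip_stress(rhyme_part(phones_b))

def rhyming_pairs(prons_a, prons_b):
    """All (pron_a, pron_b) combinations of two words that rhyme exactly."""
    return [(a, b) for a in prons_a for b in prons_b if is_exact_rhyme(a, b)]

cat, hat, cut = ['K', 'AE1', 'T'], ['HH', 'AE1', 'T'], ['K', 'AH1', 'T']
print(is_exact_rhyme(cat, hat), is_exact_rhyme(cat, cut))

# A word with two pronunciations, disambiguated by the line it should rhyme with.
read = [['R', 'EH1', 'D'], ['R', 'IY1', 'D']]
bed = [['B', 'EH1', 'D']]
print(rhyming_pairs(read, bed))
```

When rhyming_pairs returns fewer combinations than the two words have between them, the rhyming line has helped to narrow down the pronunciation. The heuristic is imperfect in several ways: it rejects near rhymes by design, it also counts a word as rhyming with itself, it can say nothing about OOV words, and the dictionary's pronunciations may differ from the ones the poem intends.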
