jieba Source Code Study Notes (18) - Keyword Extraction: TF-IDF Usage Examples

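Preface

Besides the TF-IDF implementation itself, jieba also provides usage examples for it.
The examples are in the test directory, whose structure is shown below: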

│  demo.py
│  extract_tags.py
│  extract_tags_idfpath.py
│  extract_tags_stop_words.py
│  extract_tags_with_weight.py
...
│
├─parallel
...
│
└─tmp
        MAIN_j1c2zuvtphwbv2n2.seg
        MAIN_WRITELOCK
        _MAIN_1.toc

test/extract_tags.py

extract_tags.py is the usage example for keyword extraction.

import sys
sys.path.append('../')

import jieba
import jieba.analyse
from optparse import OptionParser

"""
See [Python 2 documentation - 15.5. optparse — Parser for command line options]
(https://docs.python.org/2/library/optparse.html#creating-the-parser)
The USAGE argument here is the usage string that optparse prints in the help message
"""
USAGE = "usage:    python extract_tags.py [file name] -k [top k]"
parser = OptionParser(USAGE)

"""
from https://docs.python.org/2/library/optparse.html#optparse.Option.dest:
If the option’s action implies writing or modifying a value somewhere, 
this tells optparse where to write it: 
dest names an attribute of the options object that optparse builds 
as it parses the command line.

In short: when the user passes -k xxx, the topK attribute of the opt object returned by parser.parse_args() is set to xxx
"""
parser.add_option("-k", dest="topK")

"""
from https://docs.python.org/2/library/optparse.html#module-optparse:
parse_args() returns two values:
options, an object containing values for all of your options—e.g. if --file takes a single string argument, then options.file will be the filename supplied by the user, or None if the user did not supply that option
args, the list of positional arguments leftover after parsing options
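parser.parse_args() returns two objects, options and args
options holds the optional arguments, args holds the positional arguments
In this example [file name] is a positional argument and -k [top k] is an optional argument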

As it parses the command line, optparse sets attributes of the options 
object returned by parse_args() based on user-supplied command-line values.
parser.parse_args() returns an opt object whose topK attribute is determined by the arguments the user passes
"""
opt, args = parser.parse_args()


if len(args) < 1: # the user did not supply the positional argument [file name]
    print(USAGE)
    sys.exit(1)

file_name = args[0]

# Read the optional argument topK via opt.topK
if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

content = open(file_name, 'rb').read()

tags = jieba.analyse.extract_tags(content, topK=topK)

print(",".join(tags))
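
To make the parse_args() behaviour described in the comments above concrete, here is a minimal sketch that feeds an explicit argument list to the same parser setup. The `parse` helper is a hypothetical wrapper added for illustration; it is not part of jieba or of extract_tags.py.

```python
from optparse import OptionParser


def parse(argv):
    """Parse an explicit argument list with the same options as extract_tags.py."""
    parser = OptionParser("usage:    python extract_tags.py [file name] -k [top k]")
    parser.add_option("-k", dest="topK")
    # Passing a list instead of relying on sys.argv makes the result easy to inspect.
    return parser.parse_args(argv)


# -k is given: opt.topK holds the raw string, args holds the positional argument.
opt, args = parse(["kongyiji_utf8.txt", "-k", "50"])
print(repr(opt.topK), args)

# -k is omitted: opt.topK is None, which is why the script falls back to 10.
opt, args = parse(["kongyiji_utf8.txt"])
print(repr(opt.topK), args)
```

Note that optparse stores the value as a string by default, which is why the script converts it with int() before passing it on.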

Note: optparse is deprecated; argparse should be used instead.
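
As a rough sketch of what the same command line could look like with argparse: the `build_parser` function and the help texts below are assumptions made for illustration, not code from jieba. Here `type=int` and `default=10` replace the manual None check and int() conversion.

```python
import argparse


def build_parser():
    """Build an argparse parser mirroring the options of extract_tags.py (sketch)."""
    parser = argparse.ArgumentParser(
        description="Extract keywords from a text file with TF-IDF.")
    # Positional argument, equivalent to args[0] in the optparse version.
    parser.add_argument("file_name", help="path to the input text file")
    # Optional argument; argparse converts it to int and supplies the default.
    parser.add_argument("-k", dest="topK", type=int, default=10,
                        help="number of keywords to return (default: 10)")
    return parser


# An explicit list is parsed here; call parse_args() with no arguments to read sys.argv.
opt = build_parser().parse_args(["./kongyiji_utf8.txt", "-k", "50"])
print(opt.file_name, opt.topK)
```

With argparse, omitting the file name produces a usage error automatically, so the explicit len(args) check is no longer needed.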

Let's see which keywords are extracted from Lu Xun's Kong Yiji:

python extract_tags.py ./kongyiji_utf8.txt -k 50
"""
孔乙己,掌柜,茴香豆,没有,粉板,十九个,说道,长衫,打折,柜台,显出,一碗,主顾,四文,中秋,慢慢,快活,
半懂不懂,这手,壶子,回字,窃书,之乎者也,样子,长久,温酒,喝酒,一天,虽然,颓唐,不多,短衣,孩子,现钱,
一回,后来,一碟,一个,可是,讨饭,哄笑,碟子,大钱,还清,掌柜的,年关,专管,取笑,读过,一样
"""

test/extract_tags_with_weight.py

A usage example that returns the keywords together with their weights:

import sys
sys.path.append('../')

import jieba
import jieba.analyse
from optparse import OptionParser

USAGE = "usage:    python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]"

parser = OptionParser(USAGE)
parser.add_option("-k", dest="topK")
parser.add_option("-w", dest="withWeight")
opt, args = parser.parse_args()


if len(args) < 1:
    print(USAGE)
    sys.exit(1)

file_name = args[0]

if opt.topK is None:
    topK = 10
else:
    topK = int(opt.topK)

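# withWeight stays False if the user omits -w or passes a value other than 1;
# otherwise it is set to True
if opt.withWeight is None:
    withWeight = False
else:
    # compare by value with ==; the original `is 1` only works because CPython caches small ints
    if int(opt.withWeight) == 1:
        withWeight = True
    else:
        withWeight = False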

content = open(file_name, 'rb').read()

tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)

if withWeight is True:
    for tag in tags:
        print("tag: %s\t\t weight: %f" % (tag[0],tag[1]))
else:
    print(",".join(tags))
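
The two branches above rely on the result having a different shape depending on withWeight: (word, weight) pairs when it is True, plain strings otherwise. A small sketch of that difference, using two values taken from the Kong Yiji run in this post; `format_tags` is a hypothetical helper for illustration and does not call jieba.

```python
def format_tags(tags, with_weight):
    """Render extract_tags-style results as lines of text (illustrative helper)."""
    if with_weight:
        # Each element is a (word, weight) pair.
        return ["tag: %s\t\t weight: %f" % (word, weight) for word, weight in tags]
    # Each element is a plain string, so the list can be joined directly.
    return [",".join(tags)]


weighted = [("孔乙己", 0.537404), ("掌柜", 0.169361)]
plain = ["孔乙己", "掌柜"]

for line in format_tags(weighted, True) + format_tags(plain, False):
    print(line)
```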

Testing on Kong Yiji again, this time returning each word's TF-IDF value:

python extract_tags_with_weight.py ./kongyiji_utf8.txt -k 50 -w 1
"""
tag: 孔乙己              weight: 0.537404
tag: 掌柜                weight: 0.169361
tag: 茴香豆              weight: 0.089261
tag: 没有                weight: 0.084370
tag: 粉板                weight: 0.068216
tag: 十九个              weight: 0.063498
tag: 说道                weight: 0.051551
tag: 长衫                weight: 0.050578
tag: 打折                weight: 0.050289
tag: 柜台                weight: 0.049562
tag: 显出                weight: 0.044785
tag: 一碗                weight: 0.042612
tag: 主顾                weight: 0.042340
tag: 四文                weight: 0.039660
tag: 中秋                weight: 0.036606
tag: 慢慢                weight: 0.035596
tag: 快活                weight: 0.035562
tag: 半懂不懂            weight: 0.034548
tag: 壶子                weight: 0.034108
tag: 这手                weight: 0.034108
tag: 窃书                weight: 0.034108
tag: 回字                weight: 0.034108
tag: 之乎者也            weight: 0.033090
tag: 样子                weight: 0.032712
tag: 长久                weight: 0.032709
tag: 温酒                weight: 0.032570
tag: 喝酒                weight: 0.031496
tag: 一天                weight: 0.030704
tag: 虽然                weight: 0.030486
tag: 颓唐                weight: 0.030476
tag: 不多                weight: 0.030210
tag: 短衣                weight: 0.030052
tag: 孩子                weight: 0.029566
tag: 现钱                weight: 0.028799
tag: 一回                weight: 0.028752
tag: 后来                weight: 0.028486
tag: 一碟                weight: 0.028332
tag: 一个                weight: 0.028135
tag: 可是                weight: 0.027498
tag: 讨饭                weight: 0.027419
tag: 哄笑                weight: 0.027304
tag: 碟子                weight: 0.027230
tag: 大钱                weight: 0.027122
tag: 还清                weight: 0.027087
tag: 掌柜的              weight: 0.026821
tag: 年关                weight: 0.026436
tag: 专管                weight: 0.026147
tag: 取笑                weight: 0.025750
tag: 读过                weight: 0.025270
tag: 一样                weight: 0.025264
"""
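
To give a feel for where weights like these come from, here is a simplified sketch of TF-IDF ranking: each word's count is multiplied by its IDF and divided by the total number of counted words. As far as I understand jieba's implementation, it follows the same idea, skips single-character tokens, and falls back to the median IDF for words missing from its table. The function name `tfidf_rank`, the toy word list, and the IDF values below are all made up for illustration.

```python
from collections import Counter


def tfidf_rank(words, idf, default_idf, top_k=10):
    """Rank pre-segmented words by (count * IDF / total count); simplified sketch."""
    # Skip tokens shorter than two characters.
    freq = Counter(w for w in words if len(w) >= 2)
    total = sum(freq.values())
    weights = {w: n * idf.get(w, default_idf) / total for w, n in freq.items()}
    # Highest weight first, keep only the top_k entries.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy input: already segmented text and invented IDF values (assumptions).
words = ["孔乙己", "掌柜", "孔乙己", "没有", "的", "孔乙己", "掌柜"]
idf = {"孔乙己": 12.0, "掌柜": 9.0, "没有": 3.0}

for word, weight in tfidf_rank(words, idf, default_idf=10.0, top_k=3):
    print("tag: %s\t\t weight: %f" % (word, weight))
```

A frequent word with a low IDF, such as 没有 here, can still rank fairly high, which matches its appearance near the top of the real output above.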

References

Python 2 documentation - 15.5. optparse — Parser for command line options
Python 2 documentation - 15.5. optparse — Option.dest
Python 2 documentation - 15.5. optparse — parse_args
Lu Xun's Kong Yiji

Reposted from blog.csdn.net/keineahnung2345/article/details/88125594