Generating a Word Cloud with a Python Crawler

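Generating the word cloud itself is fairly simple, and there are plenty of tutorials online. As a crawler beginner I'm just sharing my modest attempt here, so please bear with me.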

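1. First, the libraries we need are jieba, matplotlib, and wordcloud.

jieba is a word segmentation library implemented in Python, with strong support for Chinese text.

For more, see https://www.cnblogs.com/jiayongji/p/7119065.html

Matplotlib is one of the most commonly used visualization tools in Python. It makes it easy to create a wide variety of 2D charts and some basic 3D charts.

For more, see https://www.cnblogs.com/TensorSense/p/6802280.html

wordcloud is a Python library for generating word clouds.

For more, see https://blog.csdn.net/heyuexianzi/article/details/76851377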

2. The code (adapted from https://www.cnblogs.com/franklv/p/6995150.html)

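import jieba.analyse
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Read the source text (assumed to be UTF-8 encoded)
with open('name.txt', encoding='utf-8') as f:
    text = f.read()

# Extract the top 100 keywords together with their TextRank weights
result = jieba.analyse.textrank(text, topK=100, withWeight=True)
# print(result)
keywords = dict()
for word, weight in result:
    keywords[word] = weight
# print(keywords)


# Mask image: pure white areas of the picture are left empty
color_mask = plt.imread("a.jpg")
cloud = WordCloud(
    font_path=r"C:\Windows\Fonts\simfang.ttf",  # raw string; a font that contains Chinese glyphs
    background_color='white',
    mask=color_mask,
    max_words=1000,
    stopwords=STOPWORDS,          # ignored by generate_from_frequencies()
    random_state=30,              # seed for the random layout and colors, so the result is reproducible
    scale=.5
    # max_font_size=40
)
word_cloud = cloud.generate_from_frequencies(keywords)
word_cloud.to_file("a2.png")
plt.imshow(word_cloud)
plt.axis('off')
plt.show()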
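
The keyword step above turns a list of (word, weight) pairs into the dict that generate_from_frequencies() expects. Here is a minimal sketch of that conversion on its own, with an optional weight cutoff. The sample data, the helper name to_frequencies, and the min_weight parameter are all made up for illustration and are not part of jieba or wordcloud.

```python
# Hypothetical sample shaped like the output of
# jieba.analyse.textrank(text, withWeight=True): a list of (word, weight) pairs.
sample_result = [("weather", 1.0), ("weekend", 0.62), ("coffee", 0.35), ("exam", 0.08)]


def to_frequencies(pairs, min_weight=0.0):
    """Build a {word: weight} dict, dropping words whose weight is below min_weight."""
    return {word: weight for word, weight in pairs if weight >= min_weight}


keywords = to_frequencies(sample_result, min_weight=0.1)
print(keywords)
```

Raising the cutoff is one way to keep very weak keywords out of the picture. Leaving it at 0.0 keeps every pair.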

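3. Notes

Prepare a background image first. It should look something like this one, ideally with a blank (white) background, so that the word cloud takes on the outline of the subject. If you are good with Photoshop you have nothing to worry about, just swap out the background; otherwise you will have to hunt around for a suitable image. The image on the right is the generated word cloud.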

         

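4. One more thing: the words in this word cloud come from my best friend's Qzone posts (the code is borrowed from someone else).

Here is the code: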

# -*- coding:utf-8 -*-
# Written for Python 3; find_element(By.ID, ...) works in Selenium 3 and 4.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import etree

driver = webdriver.Firefox()
driver.get("http://i.qq.com")
driver.maximize_window()


# The login form lives inside an iframe, so switch into it first
driver.switch_to.frame("login_frame")
driver.find_element(By.ID, "switcher_plogin").click()   # switch to account/password login
time.sleep(2)
driver.find_element(By.ID, "u").send_keys('your account')
driver.find_element(By.ID, "p").send_keys("your password")
driver.find_element(By.ID, "login_button").click()
time.sleep(2)
driver.switch_to.default_content()
driver.get("http://user.qzone.qq.com/" + "friend's QQ number" + "/311")

next_num = 0
while True:
    # Scroll down several times so the lazily loaded posts are rendered
    for i in range(1, 6):
        height = 20000 * i
        strWord = "window.scrollBy(0," + str(height) + ")"
        driver.execute_script(strWord)
        time.sleep(4)

    # The post list is inside another iframe
    driver.switch_to.frame("app_canvas_frame")
    selector = etree.HTML(driver.page_source)
    divs = selector.xpath('//*[@id="msgList"]/li/div[3]')

    # Append the text of each post to the output file
    with open('qq_word.txt', 'a', encoding='utf-8') as f:
        for div in divs:
            qq_name = div.xpath('./div[2]/a/text()')
            qq_content = div.xpath('./div[2]/pre/text()')
            qq_time = div.xpath('./div[4]/div[1]/span/a/text()')
            qq_name = qq_name[0] if len(qq_name) > 0 else ''
            qq_content = qq_content[0] if len(qq_content) > 0 else ''
            qq_time = qq_time[0] if len(qq_time) > 0 else ''
            print(qq_name, qq_time, qq_content)
            f.write(qq_content + "\n")

    # Stop when there is no "next page" link left
    if driver.page_source.find('pager_next_' + str(next_num)) == -1:
        break

    driver.find_element(By.ID, 'pager_next_' + str(next_num)).click()
    next_num += 1
    # Go back to the parent frame before scrolling the next page
    driver.switch_to.parent_frame()
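
The crawler writes one line per post, and that line is empty whenever a post has no text, so the output file can contain blank lines (and possibly repeated lines if a page gets parsed twice). Below is a minimal sketch of a cleanup pass that could run before the text is handed to the word cloud step. The helper name clean_lines and the sample lines are assumptions for illustration, not part of the original code.

```python
def clean_lines(lines):
    """Strip whitespace, drop empty lines, and remove exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Hypothetical sample standing in for the lines of qq_word.txt.
sample = ["Nice weather today\n", "\n", "  Exam week again \n", "Nice weather today\n"]
print(clean_lines(sample))
```

With a real file, the same helper could be applied to the lines returned by open('qq_word.txt', encoding='utf-8').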

Note: switching frames can be a bit tricky. You can try the following approaches:

driver.switch_to.frame(0)  # 1. Locate by frame index; the first frame is 0
driver.switch_to.frame("frame1")  # 2. Locate by id
driver.switch_to.frame("myframe")  # 3. Locate by name
# 4. Locate by WebElement (needs: from selenium.webdriver.common.by import By)
driver.switch_to.frame(driver.find_element(By.TAG_NAME, "iframe"))
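
Since it is not always obvious which of these locators will work on a given page, one option is to try several in order and keep the first that succeeds. The sketch below is only an illustration: switch_to_first_frame is a made-up helper, not a Selenium API, and it catches Exception broadly to stay free of imports (with Selenium you would more likely catch NoSuchFrameException from selenium.common.exceptions).

```python
def switch_to_first_frame(driver, locators):
    """Try each frame locator in turn and return the first one that works.

    `driver` is any object exposing driver.switch_to.frame(locator).
    Raises LookupError if none of the locators matches a frame.
    """
    for locator in locators:
        try:
            driver.switch_to.frame(locator)
            return locator
        except Exception:  # assumption: with Selenium, catch NoSuchFrameException instead
            continue
    raise LookupError("no frame matched: %r" % (locators,))
```

For the crawler above, a call might look like switch_to_first_frame(driver, ["app_canvas_frame", 0]), which falls back to the first frame on the page if the id is not found.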

 
 


Reposted from blog.csdn.net/qq_40024605/article/details/80512392