Crawler Application: Designing a Campus Network Search Engine

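I. Building a site search engine generally takes four steps:

(1) A web crawler crawls the site and collects the links to all of its pages.

(2) The source of each page is fetched and parsed to strip out the content of interest.

(3) The content is turned into a term index, usually an inverted index.

(4) At search time, the query terms are looked up in the term index, and the results are returned in order or ranked by page score.

Accordingly, the system is built mainly from the following modules:

. Information collection module: uses a web crawler to fetch content from the campus network.

. Index module: segments the crawled title, content, and author into words and builds the inverted word table.

. User search interface module: accepts the user's keywords and returns the results.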
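Steps (3) and (4) can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the documents are short English strings split on whitespace (the real system segments Chinese text with jieba), and the names build_inverted_index, lookup, and docs are hypothetical, not part of the original code.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the ids of the documents it occurs in (one entry per occurrence)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization stands in for real word segmentation.
        for term in text.lower().split():
            index[term].append(doc_id)
    return index

def lookup(index, query):
    """Return the ids of documents that contain every query term, in ascending order."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample documents keyed by document id.
docs = {
    1: "campus news library opens",
    2: "library hours change",
    3: "campus sports news",
}
index = build_inverted_index(docs)
```

With this sample data, lookup(index, "library") returns the ids of the two documents that mention the library. The index stores a document id once per occurrence, which is the same idea the word table in 1.py relies on when it later counts term frequency.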

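II. Key concepts

1. Regular expressions: for more precise matching, use [] to denote a range of characters.

\w matches a letter, digit, or underscore

\s matches any whitespace character

\d matches any digit

\b matches a word boundary (the start or end of a word)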
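A short sketch of these metacharacters in use with Python's re module. The sample sentence and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text.
text = "Room 101 opens at 9 am"

digits = re.findall(r"\d+", text)        # runs of one or more digits
words = re.findall(r"\b\w+\b", text)     # whole words between word boundaries
parts = re.split(r"\s+", text)           # split on runs of whitespace
lower = re.findall(r"\b[a-z]+\b", text)  # [] range: words made only of lowercase letters
```

Here digits holds the two numbers as strings, while lower skips "Room" because its first letter falls outside the [a-z] range.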

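2. The re module: provides regular expression support

(1) The match() method: tests whether a pattern matches at the start of a string. On success it returns a match object; otherwise it returns None.

Attributes of the match object:

.string  the text passed in for matching

.re  the pattern object used for the match

.pos  the index in the text at which the search started

.endpos  the index in the text at which the search ended

.lastindex  the index of the last captured group

.lastgroup  the name of the last captured group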
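The attributes above can be inspected directly on a match object. The sketch below uses named groups (area and number are hypothetical names chosen for this illustration) so that lastgroup has something to report.

```python
import re

# Assumption: a phone-number-like pattern with two named groups, for illustration only.
pattern = re.compile(r"(?P<area>\d{3})-(?P<number>\d{3,8})$")
m = pattern.match("010-12345")

attrs = {
    "string": m.string,                # the text that was matched against
    "re_is_pattern": m.re is pattern,  # the compiled pattern that produced the match
    "pos": m.pos,                      # index where the search started
    "endpos": m.endpos,                # index where the search was allowed to end
    "lastindex": m.lastindex,          # number of the last group that matched
    "lastgroup": m.lastgroup,          # name of the last group that matched
}
```

When no explicit bounds are passed, pos is 0 and endpos equals the length of the text.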

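(2) Methods of the match object: groups

group()  returns a captured substring

group(0)  is the entire matched text

group(1)  is the first captured substring

For example: m=re.match(r'^(\d{3})-(\d{3,8})$','010-12345')

m.group(0)  returns '010-12345'

m.group(1)  returns '010'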

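(3) Splitting strings: re.split() is very powerful and can split on spaces, commas, and semicolons in a single pass.

For example: re.split(r'[\s,;]+','a,b;c d')

The result is ['a','b','c','d']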

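(4) The search() and findall() methods

re.match() always matches from the beginning of the string, so it returns None when the pattern occurs anywhere other than the start.

For example: str1='Hello world'

print(re.match(r'world',str1))  prints None

So, to match at any position in the string, use re.search() or re.findall().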
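The difference between the three functions is easiest to see side by side. The sample string below is invented for this sketch.

```python
import re

# Hypothetical sample string with the word "world" away from the start.
str1 = "Hello world, wide world"

at_start = re.match(r"world", str1)   # None: "world" is not at index 0
first = re.search(r"world", str1)     # match object for the first occurrence anywhere
every = re.findall(r"world", str1)    # list of all non-overlapping occurrences
```

re.search() stops at the first hit and returns a match object (so first.start() gives its position), while re.findall() returns plain strings for every hit.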

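3. Chinese word segmentation: the jieba component

Segmentation regroups a continuous sequence of characters into a sequence of words.

jieba has three modes: accurate mode, full mode, and search engine mode.

(1) The jieba.cut() method: performs segmentation

It returns an iterable generator, so a for loop can be used to get each word produced by the segmentation.

For example:

seg_list=jieba.cut("我来到北京清华大学")

for word in seg_list:
    print(word,end='')  # output: 我来到北京清华大学 (the words are printed back to back because end='')

Search engine mode:

seg_list=jieba.cut_for_search('我来到北京清华大学')

print('Search engine mode:', '/'.join(seg_list))

The segmented result is: 我/来到/北京/清华/华大/大学/清华大学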

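(2) A custom dictionary can also be added to jieba: jieba.load_userdict("path to your custom dictionary file")

(3) Keyword extraction from text: jieba.analyse.extract_tags() returns the keywords

For example:

import jieba.analyse

text="故宫的著名景点包括乾清宫，太和殿和午门等，午门是紫禁城的正门，午门居中"

tags=jieba.analyse.extract_tags(text,topK=5)  # get 5 keywords

print('Keywords:', ''.join(tags))

Result: Keywords: 午门乾清宫著名景点太和殿正门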

Code implementation example:

File 1.py: crawls the site to collect information and builds the word table for the term index

# Crawl the site and build the word table for the term index
import sys
from collections import deque
import urllib
from urllib import request
import re
from bs4 import BeautifulSoup  # BeautifulSoup parses the page content
import xml
import sqlite3
import jieba

url='http://www.zut.edu.cn/index/xwdt.htm'

unvisited=deque()  # double-ended queue of URLs waiting to be crawled
visited=set()  # URLs already visited
unvisited.append(url)

# 1. Create the two tables in the database
conn=sqlite3.connect('viewsdu.db')
c=conn.cursor()
c.execute('create table doc (id int primary key,link text)')
c.execute('create table word (term varchar(25) primary key,list text)')
conn.commit()
conn.close()

# Step 1: collect the links of <a> tags with two specific classes and add them to the crawl queue
print('........start........')
cnt=0
while unvisited:
    url=unvisited.popleft()  # next URL to crawl
    visited.add(url)
    cnt+=1  # sequence number
    print('Fetching link',cnt,':',url)
    # Fetch the page content
    try:
        response=request.urlopen(url)
        content=response.read().decode('utf-8')
    except Exception:
        continue
    # Find crawlable links. The search is limited to this site, so links must follow a fixed format;
    # the patterns below depend on how this particular site writes its URLs.
    soup=BeautifulSoup(content,'lxml')  # the soup object mirrors what the browser's element inspector shows
    all_a=soup.find_all('a',{'class':"c67214"})  # every <a> tag with class c67214; find_all searches all descendants
    # Queue the links of these <a> tags when they match one of the forms below
    for a in all_a:
        x=a.attrs['href']  # the href attribute, i.e. the link target
        if re.match(r'http.+',x):
            if not re.match(r'http://www\.zut\.edu\.cn/.+',x):
                continue  # of the absolute links, keep only those starting with http://www.zut.edu.cn/
        if re.match(r'/info/.+',x):  # "/info/1046/20314.html"
            x='http://www.zut.edu.cn'+x
        elif re.match(r'info/.+',x):  # "info/1046/20314.html"
            x='http://www.zut.edu.cn/'+x
        elif re.match(r'\.\./info/.+',x):  # "../info/1046/20314.html"
            x='http://www.zut.edu.cn'+x[2:]
        elif re.match(r'\.\./\.\./info/.+',x):  # "../../info/1046/20314.html"
            x='http://www.zut.edu.cn'+x[5:]
        if (x not in visited) and (x not in unvisited):  # queue links that have not been seen yet
            unvisited.append(x)
    # Queue the "next page" link: the <a> tag with class Next
    a=soup.find('a',{'class':"Next"})
    if a is not None:
        x=a.attrs['href']  # the href attribute, i.e. the link target
        if re.match(r'xwdt/.+',x):
            x='http://www.zut.edu.cn/index/'+x
        else:
            x='http://www.zut.edu.cn/index/xwdt/'+x
        if (x not in visited) and (x not in unvisited):
            unvisited.append(x)

    # Step 2: extract the content of the fetched page
    # The extracted content is stored in title, article, author and time
    title=soup.title
    article=soup.find('div',class_='c67215_content',id='vsb_newscontent')
    author=soup.find('span',class_='authorstyle67215')  # author
    time=soup.find('span',class_='timestyle67215')
    if title is None and article is None and author is None:
        print('Page has no content.')
        continue
    elif article is None and author is None:
        print('Title only.')
        title=title.text
        title=''.join(title.split())
        article=''
        author=''
    elif article is None:
        print('Title and author present, content missing.')
        title=title.text
        title=''.join(title.split())
        article=''
        author=author.get_text("",strip=True)
        author=''.join(author.split())
    elif author is None:
        print('Title and content present, author missing.')
        title=title.text
        title=''.join(title.split())
        article=article.get_text("",strip=True)
        article=''.join(article.split())
        author=''
    else:
        title=title.text
        title=''.join(title.split())
        article=article.get_text("",strip=True)
        article=''.join(article.split())
        author=author.get_text("",strip=True)
        author=''.join(author.split())
    print('Page title:',title)

    # Step 3: segment the three extracted fields into Chinese words
    seggen = jieba.cut_for_search(title)  # segment in search engine mode
    seglist = list(seggen)  # convert the segmentation result to a list
    seggen = jieba.cut_for_search(article)
    seglist += list(seggen)
    seggen = jieba.cut_for_search(author)
    seglist += list(seggen)
    # Step 4: store the sequence number and URL in the doc table
    conn = sqlite3.connect("viewsdu.db")  # open the connection
    c = conn.cursor()  # create a cursor
    c.execute('insert into doc values(?,?)', (cnt, url))
    # Step 5: build the word table from every segmented word
    for word in seglist:
        # Check whether the word is already in the database
        c.execute('select list from word where term=?', (word,))
        result = c.fetchall()
        if len(result) == 0:
            # Not present yet: insert the word with this document's number
            docliststr = str(cnt)
            c.execute('insert into word values(?,?)', (word, docliststr))
        else:
            # Already present: append this document's number to its list
            docliststr = result[0][0]
            docliststr += ' ' + str(cnt)
            c.execute('update word set list=? where term=?', (docliststr, word))
    conn.commit()
    conn.close()
print('Word table complete=======================================================')
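The link handling in 1.py rewrites each relative href by hand with one regular expression per form. As an alternative sketch (not what the original code does), the standard library's urllib.parse.urljoin can resolve an href against the page it was found on, the way a browser would. The function name normalize_link and the constant SITE_HOST are hypothetical names introduced here.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the crawl is limited to this one host, as in 1.py.
SITE_HOST = "www.zut.edu.cn"

def normalize_link(page_url, href):
    """Resolve href against the page it appeared on; return None for off-site links."""
    absolute = urljoin(page_url, href)
    if urlparse(absolute).netloc != SITE_HOST:
        return None
    return absolute

# The start page used by 1.py.
page = "http://www.zut.edu.cn/index/xwdt.htm"
```

For example, normalize_link(page, "../info/1046/20314.html") resolves to the article URL under /info/, and normalize_link(page, "xwdt/1.htm") resolves to the next listing page under /index/xwdt/. Note that urljoin resolves a bare relative path such as "info/..." against the directory of the current page, so its result depends on where the link was found, whereas the hand-written rules always prepend the site root.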

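File 2.py: mainly implements page ranking and search

import re  # regular expressions
import urllib  # URL handling package
from urllib import request  # opens and reads URLs
from collections import deque  # double-ended queue
from bs4 import BeautifulSoup  # parser for extracting data from pages
import lxml  # library used to parse the pages
import sqlite3  # database
import jieba  # Chinese word segmentation
import math  # math functions (floating-point arithmetic)

# Step 1: segment the query
conn=sqlite3.connect("viewsdu.db")
c=conn.cursor()
c.execute('select count(*) from doc')
N=1+c.fetchall()[0][0]  # total number of documents, plus 1
target=input('Enter search terms: ')
seggen=jieba.cut_for_search(target)  # segment the input in search engine mode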

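# Step 2: look up each query word in the word table and accumulate the scores
score={}  # document number -> document score
for word in seggen:  # iterate over the words of the segmented query
    print('Query word:',word)
    tf={}  # document number -> number of times the word occurs in that document
    c.execute('select list from word where term=?',(word,))  # look the word up in the word table
    result=c.fetchall()  # all matching rows
    if len(result)>0:
        doclist=result[0][0]  # first matching row, e.g. '12 35 35 35 88 88'
        doclist=doclist.split(' ')  # e.g. ['12','35','35','35','88','88']
        doclist=[int(x) for x in doclist]  # convert to a list of ints: [12,35,35,35,88,88]
        df=len(set(doclist))  # document frequency: the number of documents that contain the word
        idf=math.log(N/df)  # inverse document frequency
        print('idf:',idf)
        # Count the term frequency per document
        for num in doclist:
            if num in tf:
                tf[num]=tf[num]+1
            else:
                tf[num]=1
        # Accumulate the document scores
        for num in tf:
            if num in score:
                # This document already has a score, so add to it
                score[num]=score[num]+tf[num]*idf
            else:
                score[num]=tf[num]*idf
sortedlist=sorted(score.items(),key=lambda d:d[1],reverse=True)  # (document number, score) pairs, highest score first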

# Step 3: print the matching results
cnt=0
for num,docscore in sortedlist:
    cnt=cnt+1
    c.execute('select link from doc where id=?',(num,))
    url=c.fetchall()[0][0]
    print(url,'score:',docscore)
    try:
        response=request.urlopen(url)
        content=response.read().decode('utf-8')
    except Exception:
        print('oops... failed to read the page')
        continue
    soup = BeautifulSoup(content, 'lxml')
    title = soup.title
    if title is None:
        print('No title.')
    else:
        title = title.text
        print(title)
    if cnt > 20:
        break
if cnt == 0:
    print('No search results')
conn.close()
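The ranking in 2.py is a TF-IDF sum: for each query word t and each document d that contains it, the score grows by tf(t,d) * log(N/df(t)), where tf is the number of occurrences of t in d, df is the number of documents containing t, and N is the document count. The sketch below reproduces that calculation without a database. The function name score_documents and the posting lists are hypothetical, and total_docs is passed in directly rather than computed as 1 plus the row count as 2.py does.

```python
import math
from collections import Counter

def score_documents(query_terms, postings, total_docs):
    """Sum tf * idf over the query terms for every document that contains them."""
    scores = {}
    for term in query_terms:
        doc_ids = postings.get(term)
        if not doc_ids:
            continue  # the term is not in the index
        tf = Counter(doc_ids)                 # document id -> occurrences of the term
        idf = math.log(total_docs / len(tf))  # len(tf) is the document frequency
        for doc_id, count in tf.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + count * idf
    # Highest score first, as in 2.py.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical posting lists: one document id per occurrence of the term.
postings = {
    "campus": [12, 35, 35, 35, 88, 88],
    "news": [35, 40],
}
ranked = score_documents(["campus", "news"], postings, total_docs=100)
```

In this sample, document 35 ranks first because it contains "campus" three times and also matches "news", so its score accumulates across both query words.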
