Dianping comment crawling: scraping the complete text of encrypted comments

First, let's look at how the site encrypts the text.

As the screenshot shows, some characters are wrapped in <span> tags, and each of them is really an SVG background image whose display is controlled by a CSS style. You can see the CSS background property on the span and change it yourself to watch the effect. Note the font width of 14px; it will help with decryption later.

 


Approach:

1. Take the complete HTML of the comment section and keep its entire contents in a list;

2. Get the CSS positioning styles. The stylesheet is linked from the page source; we need the class attribute value of every span tag, because it maps to the background coordinates.

3. From the CSS styles, get the link to the SVG image and build a dictionary (the character library). Then use the CSS coordinates from step 2 to look up the real character for each span, and return the final, real comment.
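Steps 1 and 2 of the approach can be sketched as follows. This is only an illustration under assumptions: the sample fragment and the class names are made up, and it assumes every encrypted character appears as an empty <span class="..."> element (serialised either as <span class="x"></span> or <span class="x"/>).

```python
import re

# Hypothetical comment fragment; the class names are invented for illustration.
SAMPLE_HTML = 'Tastes <span class="gqi4j"></span>ood, will <span class="gqa7k"/>ome back'

# Matches an empty span in either serialisation.
SPAN_RE = re.compile(r'<span class="([^"]+)"\s*(?:/>|>\s*</span>)')


def tokenize(comment_html):
    """Return ('text', s) and ('span', class_name) tokens in document order."""
    tokens = []
    pos = 0
    for m in SPAN_RE.finditer(comment_html):
        if m.start() > pos:
            tokens.append(("text", comment_html[pos:m.start()]))
        tokens.append(("span", m.group(1)))
        pos = m.end()
    if pos < len(comment_html):
        tokens.append(("text", comment_html[pos:]))
    return tokens


if __name__ == "__main__":
    for kind, value in tokenize(SAMPLE_HTML):
        print(kind, repr(value))
```

Keeping the tokens in order means each span can later be swapped for its decoded character without disturbing the plain text around it.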

Steps:

As in figure 1, you still need to know where each tag sits.

First, view the page source, find the link to the CSS stylesheet, and save it.

 

Second, the stylesheet fetched from the link saved in step one contains a background-image rule whose URL is the path to the encrypted SVG font; save that path. At the same time, collect every rule of the form .***** {background: -*px -*px;} from the CSS file into a dictionary.
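A minimal sketch of this step, run on a made-up stylesheet excerpt rather than the real file. The URL, the class names and the exact rule layout are assumptions; the real stylesheet may be formatted differently.

```python
import re

# Hypothetical stylesheet excerpt; the real file comes from the link saved in step one.
SAMPLE_CSS = (
    'span[class^="gq"]{width:14px;height:30px;'
    'background-image: url(//example.invalid/font.svg);}'
    '.gqi4j{background:-98.0px -130.0px;}'
    '.gqa7k{background:-0.0px -45.0px;}'
)

SVG_URL_RE = re.compile(r'background-image:\s*url\((.*?)\)')
RULE_RE = re.compile(r'\.([\w-]+)\s*\{background:\s*(-?[\d.]+)px\s+(-?[\d.]+)px;\}')


def parse_css(css_text):
    """Return (svg_url, {class_name: (x_offset, y_offset)}) with positive offsets."""
    svg_match = SVG_URL_RE.search(css_text)
    # The URL in the stylesheet is protocol-relative, so a scheme is prepended.
    svg_url = "http:" + svg_match.group(1) if svg_match else None
    offsets = {
        name: (abs(float(x)), abs(float(y)))
        for name, x, y in RULE_RE.findall(css_text)
    }
    return svg_url, offsets


if __name__ == "__main__":
    url, offsets = parse_css(SAMPLE_CSS)
    print(url)
    print(offsets)
```

Escaping the leading dot and the braces in the rule pattern keeps it from matching unrelated parts of the stylesheet.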

Key point: as mentioned above, the font width is 14px, so the x-coordinate of the background divided by 14 gives the position of the encrypted character within its line of the SVG. Each line of text in the SVG can be stored as an array, so we can establish the correspondence. The y-coordinate of the background is compared against the values in the SVG's <path> tags, which is explained further on.
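The x-coordinate arithmetic can be checked with a few lines. The helper name glyph_index is hypothetical, and the sketch assumes offsets are written like "-98.0px" with a 14px character width.

```python
FONT_WIDTH = 14  # character width in pixels, as noted in the article


def glyph_index(offset, font_width=FONT_WIDTH):
    """Convert a CSS x offset such as '-98.0px' into a 0-based character index."""
    pixels = abs(float(offset.replace("px", "")))
    return int(pixels // font_width)


if __name__ == "__main__":
    row = list("abcdefghij")  # stand-in for one line of SVG text
    print(glyph_index("-98.0px"), row[glyph_index("-98.0px")])
```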

 

Third, look at the encrypted SVG font file. Note the <path> tags: each id corresponds to the href of a later <textPath> tag. The interesting part is the value of d, which is the key to decryption.
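To illustrate the structure just described, here is a sketch that parses a made-up SVG excerpt. The markup shape (path ids, "M0 y H600" values, textPath rows) follows the article's description, but the concrete numbers and characters are invented.

```python
import re

# Hypothetical SVG excerpt; a real font file has many more rows.
SAMPLE_SVG = (
    '<defs>'
    '<path id="1" d="M0 44 H600"/>'
    '<path id="4" d="M0 174 H600"/>'
    '</defs>'
    '<text>'
    '<textPath xlink:href="#1" textLength="140">ABCDEFGHIJ</textPath>'
    '<textPath xlink:href="#4" textLength="140">abcdefghij</textPath>'
    '</text>'
)

PATH_RE = re.compile(r'<path id="(.*?)" d="M0 (\d+) H\d+"/>')
TEXT_RE = re.compile(r'<textPath xlink:href="#(.*?)" textLength=".*?">(.*?)</textPath>')


def parse_svg(svg_text):
    """Return ({path_id: y}, {path_id: [characters]}) for the SVG font."""
    path_y = {pid: int(y) for pid, y in PATH_RE.findall(svg_text)}
    rows = {pid: list(text) for pid, text in TEXT_RE.findall(svg_text)}
    return path_y, rows


if __name__ == "__main__":
    path_y, rows = parse_svg(SAMPLE_SVG)
    print(path_y)
    print(rows["4"])
```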

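Key point: step two mentioned the y value of background; comparing it with the value after M0 in the d attribute is where it gets interesting. For example, in
.gqi4j {background: -98.0px -130.0px;} the y value is -130. Take the positive value 130; it is less than 174, the second field of the d attribute of one of the <path> tags, so the actual y-axis in the encrypted character library is 174, which belongs to id=4. That id is the href of the <textPath>, i.e. the row of the encrypted character. Its x is 98/14 = 7, so the character is the one at position 98/14 in the row with href="#4". At this point one encrypted character has been extracted.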
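The worked example above can be reproduced with a small lookup. This is a sketch under assumptions: the path y values and the rows are invented, and bisect_right stands in for the linear "first path whose y is greater" scan, which it matches only because the y values are sorted in ascending order.

```python
from bisect import bisect_right

# Hypothetical data in the shape described above.
PATH_YS = [44, 87, 128, 174]      # y values from the <path> d attributes, ascending
PATH_IDS = ["1", "2", "3", "4"]   # matching <path> ids / <textPath> hrefs
ROWS = {
    "1": "ABCDEFGHIJ",
    "2": "KLMNOPQRST",
    "3": "0123456789",
    "4": "abcdefghij",
}


def decode_char(x_px, y_px, font_width=14):
    """Map a CSS background offset (x_px, y_px) to one character, or None."""
    x, y = abs(x_px), abs(y_px)
    # First row whose y is strictly greater than the background y.
    i = bisect_right(PATH_YS, y)
    if i == len(PATH_YS):
        return None
    row = ROWS[PATH_IDS[i]]
    index = int(x // font_width)
    return row[index] if index < len(row) else None


if __name__ == "__main__":
    # .gqi4j {background: -98.0px -130.0px;} -> row id 4, index 7
    print(decode_char(-98.0, -130.0))
```

With this sample data, the offset from the example selects the row with id 4 and index 7.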

 

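Final result:

As the figure below shows, the complete comment text is recovered. Where the right-hand side of a comment is not fully displayed on the page, you have to click "more comments"; the original page contains two tags for each comment, one with the complete text and one with the partial text.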

 

Code:

#!/usr/bin/env python
# encoding: utf-8
"""
@version: v1.0
@author: W_H_J
@license: Apache Licence
@contact: [email protected]
@software: PyCharm
@file: dazhongdianping.py
@time: 2018/12/19 17:45
@describe: Dianping comment crawling and parsing
"""
import re

import requests
from pyquery import PyQuery as pq


header_pinlun = {
    'Host': 'www.dianping.com',
    'Accept-Encoding': 'gzip',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}

header_css = {
    'Host': 's3plus.meituan.net',
    'Accept-Encoding': 'gzip',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}


# 0 - review page
def get_msg():
    """
    url: http://www.dianping.com/shop/ + shop ID + /review_all
    :return:
    """
    # url = "http://www.dianping.com/shop/110620927/review_all"
    url = "http://www.dianping.com/shop/96658933/review_all"
    # url = "https://www.dianping.com/shop/77307732/review_all"
    html = requests.get(url, headers=header_pinlun)
    print("1 ===> STATUS", html.status_code)
    doc = pq(html.text)
    # Parse each comment
    pinglunLi = doc("div.reviews-items > ul > li").items()
    # css_get() handles the CSS styles and the encrypted SVG font used in the comments:
    #   dict_svg_text: the whole encrypted SVG character library, as a dictionary
    #   list_svg_y:    [id, y] pairs taken from the <path> tags of the SVG background
    #   dict_css_x_y:  background offsets of every encrypted <span> class, as a dictionary;
    #                  used to find the matching entry in dict_svg_text
    dict_svg_text, list_svg_y, dict_css_x_y = css_get(doc)

    for data in pinglunLi:
        # User name
        userName = data("div.main-review > div.dper-info > a").text()
        # Link to the user's profile
        userID = "http://www.dianping.com" + str(data("div.main-review > div.dper-info > a").attr("href") or "")
        # Star rating given by the user [10-50]
        startShop = str(data("div.review-rank > span").attr("class")).split(" ")[1].replace("sml-str", "")
        # User's ratings, e.g. facilities: very good, environment: very good, service: very good, per person: 0 yuan
        describeShop = data("div.review-rank > span.score").text()
        # Key part: the comment HTML, still to be decoded. A comment has a hidden (full) part and a
        # directly displayed part; read the hidden part first and fall back to the displayed one.
        pinglun = data("div.review-words.Hide").html()
        if pinglun is None:
            pinglun = data("div.review-words").html() or ""
        # Dishes the user recommends
        loveFood = data("div.main-review > div.review-recommend").text()
        # Time the comment was posted
        pinglunTime = data("div.main-review > div.misc-info.clearfix > span.time").text()
        print("userName:", userName)
        print("userID:", userID)
        print("startShop:", startShop)
        print("describeShop:", describeShop)
        print("loveFood:", loveFood)
        print("pinglunTime:", pinglunTime)
        print("pinglun:", css_decode(dict_css_x_y, dict_svg_text, list_svg_y, pinglun))
        print("*" * 100)


# 1 - CSS styles of the hidden characters: get the SVG link and the background offsets
def css_get(doc):
    # The position of the stylesheet <link> inside <head> is page-specific and may change
    css_link = "http:" + doc("head > link:nth-child(11)").attr("href")
    background_link = requests.get(css_link, headers=header_css)
    # Capture the protocol-relative URL inside background-image: url(...)
    svg_r = r'background-image:\s*url\((.*?)\)'
    svg_link = "http:" + re.findall(svg_r, background_link.text, re.I)[0]
    # svg_text() requests the SVG character library and extracts the encrypted characters:
    #   dict_svg_text: the whole encrypted SVG character library, as a dictionary
    #   list_svg_y:    [id, y] pairs taken from the <path> tags of the SVG background
    dict_svg_text, list_svg_y = svg_text(svg_link)
    # css_dict() builds the dictionary of background offsets from the CSS
    dict_css = css_dict(background_link.text)
    return dict_svg_text, list_svg_y, dict_css


# 2 - request the font library
def svg_text(url):
    html = requests.get(url)
    dict_svg, list_y = svg_dict(html.text)
    return dict_svg, list_y


# 3 - build the SVG character library dictionary
def svg_dict(svg_html):
    svg_text_r = r'<textPath xlink:href="(.*?)" textLength="(.*?)">(.*?)</textPath>'
    svg_text_re = re.findall(svg_text_r, svg_html)
    dict_svg = {}
    # Build the dictionary: {href without '#': list of characters in that row}
    for data in svg_text_re:
        dict_svg[data[0].replace("#", "")] = list(data[2])
    """
    Important: http://s3plus.meituan.net/v1/mss_0a06a471f9514fc79c981b5466f56b91/svgtextcss/74d63812e5b327d850ab4a8782833d47.svg
        The content of the svg <path> tags corresponds to the y value of background in the CSS,
        through a "less than" relation: take the first <path> whose y is greater than the
        background y, e.g. ('18', 'M0', '748', 'H600').
        In .gqi4j {background: -98.0px -745.0px;} the y is -745; take the positive value 745,
        which is less than 748, so the actual y-axis in the library is 748 and the matching 18
        is the href of the <textPath>.
    """
    svg_y_r = r'<path id="(.*?)" d="(.*?) (.*?) (.*?)"/>'
    svg_y_re = re.findall(svg_y_r, svg_html)
    list_y = []
    # Each match looks like ('18', 'M0', '748', 'H600'), i.e. (id, unknown, y, unknown)
    for data in svg_y_re:
        list_y.append([data[0], data[2]])
    return dict_svg, list_y


# 4 - build the CSS offset dictionary
def css_dict(html):
    # The leading dot and the braces are escaped so that only class rules match
    css_text_r = r'\.([\w-]+)\s*\{background:\s*(-?[\d.]+)px\s+(-?[\d.]+)px;\}'
    css_text_re = re.findall(css_text_r, html)
    dict_css = {}
    for data in css_text_re:
        # For a rule such as .gqi4j {background: -98.0px -745.0px;}, x/14 is the index of the
        # encrypted character within its row of the SVG file; y is returned unchanged and is
        # handled in css_decode()
        x = int(float(data[1]) / -14)
        # Dictionary entry: {class name: (background-x, background-y, background-x/14, background-y)}
        dict_css[data[0]] = (data[1], data[2], x, data[2])
    return dict_css


# 5 - assemble the final comment
def css_decode(css_dict_text, svg_dict_text, svg_list, pinglun_html):
    """
    :param css_dict_text: dictionary of CSS background offsets from css_dict()
    :param svg_dict_text: dictionary of the encrypted SVG character library
    :param svg_list: [id, y] pairs from the SVG <path> tags
    :param pinglun_html: HTML source of one comment, from step 0
    :return: the final decoded comment
    """
    def decode_span(match):
        css_class = match.group(1)
        # Not an encrypted character: leave the tag alone, it is stripped below
        if css_class not in css_dict_text:
            return match.group(0)
        # [x, y] is [background-x/14, background-y] from the CSS
        x = int(css_dict_text[css_class][2])
        y = -float(css_dict_text[css_class][3])
        # The first <path> whose y is greater than the background y gives the <textPath> href
        for path_id, path_y in svg_list:
            if y < int(path_y):
                return svg_dict_text[path_id][x]
        return ""

    # Replace every empty <span class="..."> with its decoded character. Substituting in place
    # (instead of splitting the HTML on commas) keeps commas that occur in the comment text.
    span_r = r'<span class="([^"]+)"\s*(?:/>|>\s*</span>)'
    str_pinglun = re.sub(span_r, decode_span, pinglun_html)
    # Strip the remaining tags and HTML entities
    dr = re.compile(r'</?\w+[^>]*>', re.S)
    dr2 = re.compile(r'&#?\w+;')
    pinglun_str = dr2.sub('', dr.sub('', str_pinglun))
    return pinglun_str.replace("\n", "")


if __name__ == '__main__':
    get_msg()


Origin www.cnblogs.com/crawl-data/p/11665909.html