Introduction to the Splash Rendering Engine

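Splash documentation: http://splash.readthedocs.io/en/latest/api.html
The Splash rendering engine provides the following features:
1. Returns the rendered HTML of a page, or a screenshot of it
2. Renders multiple pages concurrently
3. Can turn off image loading to speed up rendering
4. Executes user-defined JS code in the page
5. Executes user-defined rendering scripts (Lua)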

Splash exposes several service endpoints
1. The render.html endpoint

Request URL: http://localhost:8050/render.html
Request method: GET/POST
Response type: html

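Parameter	Description
url	URL of the page to render
timeout	Timeout for rendering the page, in seconds
proxy	Address of the proxy server
wait	Time to wait for the page to render, in seconds
images	Whether to download images; defaults to 1
js_source	User-defined JS code, executed in the page context before the page is rendered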

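Example code

import requests
from scrapy.selector import Selector

splash_url = 'http://localhost:8050/render.html'
# 'images': 0 turns off image loading; 'timeout' is in seconds
args = {'url': 'http://quotes.toscrape.com/js', 'timeout': 5, 'images': 0}
response = requests.get(splash_url, params=args)
# Selector needs the page text; a requests response is not a Scrapy response
sel = Selector(text=response.text)
quotes = sel.css('div.quote span.text').extract()
print(quotes)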
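
Since the endpoint accepts POST as well as GET, the same parameters can also be sent as a JSON body, which is handy when js_source is too long for a query string. The sketch below is one way to build such a request with only the Python standard library. The helper name build_render_request and its default values are made up for this illustration, and it assumes Splash is listening on localhost:8050.

```python
import json
import urllib.request

# Assumption: a local Splash instance on its usual port 8050
SPLASH_RENDER_URL = "http://localhost:8050/render.html"


def build_render_request(page_url, js_source=None, wait=0.5, images=0, timeout=10):
    """Build (but do not send) a JSON POST request for the render.html endpoint.

    Hypothetical helper: the defaults here are illustrative, not Splash's own.
    """
    payload = {"url": page_url, "wait": wait, "images": images, "timeout": timeout}
    if js_source is not None:
        # Custom JS that Splash runs in the page context
        payload["js_source"] = js_source
    return urllib.request.Request(
        SPLASH_RENDER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_render_request(
    "http://quotes.toscrape.com/js",
    js_source="document.title = 'rendered by Splash';",
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen sends it. That step needs a running Splash instance, so it is left out here.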

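2. The execute endpoint
When crawling a page you may need to interact with it, for example by scrolling down. In that case you can use the execute endpoint to run a user-defined Lua script, which can in turn execute JS code in the page.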

Parameter	Description
lua_source	User-defined Lua script
timeout	Timeout for rendering the page, in seconds
proxy	Address of the proxy server

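You can think of the execute endpoint as a browser that is programmable in Lua. To use it, you pass Splash a user-defined Lua script that describes the browser behavior you want to simulate, for example:
* Open a page
* Wait for the page to load and render
* Execute JS code
* Get the HTTP response headers
* Get cookies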

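Example code

import json
import requests

# Lua comments start with "--"; main receives the splash object and must be closed with "end"
lua_script = '''
function main(splash)
	splash:go('http://example.com')	-- open the page
	splash:wait(0.5)		-- wait for it to load
	local title = splash:evaljs('document.title')	-- run JS and get the result
	return {title = title}	-- returned to the client as JSON
end
'''
splash_url = 'http://localhost:8050/execute'
headers = {'content-type': 'application/json'}
data = json.dumps({'lua_source': lua_script})
# A JSON body has to be sent with POST
response = requests.post(splash_url, headers=headers, data=data)
print(response.content)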
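
The scrolling case mentioned earlier can be handled in the same way. The following is a minimal sketch, not a definitive implementation: the script reads the page URL and wait time from splash.args instead of hard-coding them, scrolls to the bottom with splash:runjs, and returns the resulting HTML. SCROLL_SCRIPT and build_execute_payload are names invented for this example, and the one-second default wait is an arbitrary choice.

```python
import json

# Lua script for the execute endpoint. Its inputs come from splash.args,
# so the same script can be reused for different pages.
SCROLL_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))    -- open the page passed in by the caller
    assert(splash:wait(splash.args.wait)) -- let the first batch of content render
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)") -- scroll to the bottom
    assert(splash:wait(splash.args.wait)) -- wait for content triggered by the scroll
    return {url = splash:url(), html = splash:html()}
end
"""


def build_execute_payload(page_url, wait=1.0):
    """Return the JSON body for a POST to the execute endpoint (hypothetical helper).

    Keys other than lua_source become available to the script via splash.args.
    """
    return json.dumps({"lua_source": SCROLL_SCRIPT, "url": page_url, "wait": wait})


payload = build_execute_payload("http://quotes.toscrape.com/scroll")
print(sorted(json.loads(payload).keys()))
```

The returned string can be sent as the body of a POST to http://localhost:8050/execute with the content-type: application/json header, using the same URL and header as the example above.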

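Commonly used attributes and methods of the splash object
1. splash.args attribute: a table of the arguments passed in by the user; individual arguments can be read from it, e.g. splash.args.wait
2. splash.js_enabled attribute: enables/disables JS execution; enabled by default
3. splash.images_enabled attribute: enables/disables image loading; enabled by default
4. splash:go method: opens a page, e.g. splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}
5. splash:wait method: waits for the page to render
6. splash:evaljs method: executes a piece of JS code and returns the value of its last expression
7. splash:runjs method: runs JS code without returning its result
8. splash:url method: returns the current URL of the page
9. splash:html method: returns the HTML of the current page
10. splash:get_cookies() method: gets the cookie information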
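
To show how several of these fit together, here is a short sketch of a script that turns off image loading, opens the page given in splash.args.url, and returns the title, the final URL and the cookies. The Python helper summarize_result is hypothetical and only picks fields out of the JSON that this particular script returns. It assumes each cookie is an object with a name field.

```python
import json

# Lua script combining several attributes and methods of the splash object
COOKIE_SCRIPT = """
function main(splash)
    splash.images_enabled = false      -- skip image downloads to speed up rendering
    assert(splash:go(splash.args.url)) -- url is supplied by the caller
    assert(splash:wait(0.5))
    return {
        title = splash:evaljs("document.title"),
        url = splash:url(),
        cookies = splash:get_cookies(),
    }
end
"""


def summarize_result(response_text):
    """Extract a few fields from the JSON returned by COOKIE_SCRIPT (hypothetical helper)."""
    result = json.loads(response_text)
    # Assumption: each cookie is a JSON object that carries a "name" key
    cookie_names = [c["name"] for c in result.get("cookies", [])]
    return {
        "title": result.get("title"),
        "url": result.get("url"),
        "cookie_names": cookie_names,
    }


# Made-up response text, used only to exercise the helper without a Splash server
sample = '{"title": "Example Domain", "url": "http://example.com/", "cookies": [{"name": "sid", "value": "abc"}]}'
print(summarize_result(sample))
```

In real use, response_text would be the body returned by a POST to the execute endpoint with COOKIE_SCRIPT as lua_source and a url argument alongside it.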
