requests-html


1. Introduction

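  Python has a very well-known HTTP library, requests, which most of you have probably heard of, and people who have used it rave about it. The author of requests has now released a new library called requests-html. As the name suggests, it is a library for parsing HTML; on top of what requests already offers, it adds some more powerful features, which makes it even more pleasant to use. Let's take a look at it.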

# From the official documentation
'''
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

If you’re interested in financially supporting Kenneth Reitz open source, consider visiting this link. Your support helps tremendously with sustainability of motivation, as Open Source is no longer part of my day job.

When using this library you automatically get:

- Full JavaScript support!
- CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint at heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection–pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support
'''

  

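  The official documentation tells us that it is more powerful than the original requests module and provides some new features:

  • JavaScript support
  • CSS selectors (a.k.a. jQuery-style, thanks to PyQuery)
  • XPath selectors
  • A mocked User-Agent (so requests look like they come from a real web browser)
  • Automatic following of redirects
  • Connection pooling and cookie persistence
  • Async request support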

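2. Installation

  Installing requests-html is simple; a single command does it: pip install requests-html. One thing to note is that requests-html only supports Python 3.6 or later, so anyone still on an older Python needs to upgrade first.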
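Because of the version requirement, a script can check the interpreter before importing the library. The following is a minimal sketch, not part of requests-html: `is_supported` and `MIN_VERSION` are hypothetical names, and the (3, 6) floor is simply the requirement stated above.

```python
import sys

MIN_VERSION = (3, 6)  # assumption: the minimum stated in this article


def is_supported(version_info=None):
    """Return True if the given (or the running) interpreter meets MIN_VERSION."""
    info = sys.version_info if version_info is None else version_info
    # Compare only (major, minor), e.g. (3, 5) < (3, 6).
    return tuple(info[:2]) >= MIN_VERSION


if not is_supported():
    raise SystemExit("requests-html needs Python 3.6 or later; please upgrade.")
```

Passing an explicit tuple such as `is_supported((3, 5, 9))` makes the helper easy to exercise without switching interpreters.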

 

3. How to use it


 

Tutorial and usage

Make a GET request to 'python.org', using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

Try async and fetch several sites at the same time:

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
...     return r
...
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
...     return r
...
>>> async def get_google():
...     r = await asession.get('https://google.com/')
...     return r
...
>>> results = asession.run(get_pythonorg, get_reddit, get_google)
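The idea of running several coroutines concurrently and collecting their results can be sketched with the standard library alone. This is plain `asyncio`, not the library's own implementation: `fake_get` is a hypothetical stand-in for `await asession.get(url)` that sleeps instead of touching the network, and `asyncio.gather` is used here as one common way to wait on several coroutines.

```python
import asyncio


async def fake_get(url, delay):
    """Hypothetical stand-in for `await asession.get(url)`; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"response from {url}"


async def fetch_all():
    # gather() schedules all three coroutines at once and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(
        fake_get("https://python.org/", 0.03),
        fake_get("https://reddit.com/", 0.01),
        fake_get("https://google.com/", 0.02),
    )


results = asyncio.run(fetch_all())
print(results)
```

The total wait is roughly the longest single delay rather than the sum of all three, which is the point of fetching concurrently.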

Grab a list of all links on the page, as-is (anchors excluded):

>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/',
 '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives',
 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html',
 '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/',
 '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org',
 '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html',
 '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator',
 '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks',
 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/',
 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/',
 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/',
 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/',
 '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/',
 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss',
 '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/',
 'http://brochure.getpython.info/', 'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org',
 '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/',
 '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government',
 'http://python.org/dev/peps/', 'https://docs.python.org',
 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/',
 '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/',
 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/',
 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/',
 '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/',
 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/',
 '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/',
 '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html',
 '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org',
 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business',
 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html',
 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}
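To make "as-is, anchors excluded" concrete, here is a rough standard-library sketch of collecting `href` values while skipping in-page anchors. It is an illustration under stated assumptions, not how requests-html is implemented: `LinkCollector` and the `sample` markup are made up for this example, and only hrefs that start with `#` are treated as anchors.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping in-page anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing or empty hrefs and pure anchors such as "#content".
        if href and not href.startswith("#"):
            self.links.add(href)


# Hypothetical markup, just enough to exercise each case.
sample = """
<a href="/about/">About</a>
<a href="#content">Skip to content</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a>No href</a>
"""

collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The links come back exactly as written in the markup, so relative and protocol-relative forms such as `/about/` and `//docs.python.org/...` are left untouched.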

Grab a list of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/',
 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html',
 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/',
 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/',
 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org',
 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists',
 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/',
 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/',
 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps',
 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/',
 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages',
 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development',
 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av',
 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject',
 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/',
 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions',
 'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html',
 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/',
 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives',
 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/',
 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/',
 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/',
 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/',
 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/',
 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/',
 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org',
 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/',
 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html',
 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts',
 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/',
 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html',
 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/',
 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/',
 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org',
 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 'https://www.python.org/accounts/signup/',
 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/',
 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/',
 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/',
 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/',
 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/',
 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/',
 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/',
 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/',
 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/',
 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/',
 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html',
 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
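The difference between the two sets comes down to resolving each link against the page's URL. A small sketch of that step with the standard library's `urljoin` follows; `to_absolute` and `BASE_URL` are hypothetical names for this example, and the base is assumed to be `https://www.python.org/`, the host that appears in the resolved links above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.python.org/"  # assumed base: the page the links came from


def to_absolute(links, base=BASE_URL):
    """Resolve each link against the base URL; already-absolute links are unchanged."""
    return {urljoin(base, link) for link in links}


relative = {
    "/about/apps/",                   # path-only link
    "//docs.python.org/3/tutorial/",  # protocol-relative link
    "http://pyfound.blogspot.com/",   # already absolute
}
absolute = to_absolute(relative)
print(absolute)
```

Protocol-relative links pick up the base URL's scheme, which is why `//docs.python.org/3/tutorial/` shows up as an `https://` URL in the absolute set.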

Select an element with a CSS selector:

>>> about = r.html.find('#about', first=True)

Grab the element's text content:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element's attributes:

>>> about.attrs
{'id':'about','class':('tier-1','element-1'),'aria-haspopup':'true'}

Render out an Element's HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1" id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>

Select Elements within an Element:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]

Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
programming
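The `{}` in the template acts as a placeholder for whatever text sits between the fixed parts. A rough regex-based approximation of that behaviour is sketched below; it is not the library's implementation, the `search` function here is a hypothetical standalone helper, and `page_text` is a made-up sentence rather than real page content.

```python
import re


def search(template, text):
    """Return the captured groups for the first match of `template` in `text`, or None."""
    # Escape the literal parts, then turn each "{}" into a non-greedy capture group.
    parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.+?)".join(parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


page_text = "Python is a programming language that lets you work quickly."
result = search("Python is a {} language", page_text)
print(result[0])
```

Indexing with `[0]` picks the first placeholder, mirroring the `[0]` in the call above; a template with no match yields `None` in this sketch.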

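A more complex CSS selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.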

XPath is also supported:

>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
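To see what a path like `/html/body/div[1]/a` selects, the same query can be tried on a tiny document with the standard library. This sketch uses `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup, so it is an illustration of the path expression rather than a substitute for the library; the `doc` string is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup with two <div> children under <body>.
doc = """
<html>
  <body>
    <div><a href="#start-of-content" tabindex="1">Skip to content</a></div>
    <div><a href="/features">Features</a></div>
  </body>
</html>
"""

root = ET.fromstring(doc)  # root is the <html> element

# Relative to <html>, this mirrors the absolute path /html/body/div[1]/a:
# the <a> children of the first <div> under <body>.
matches = root.findall("./body/div[1]/a")
for a in matches:
    print(a.tag, a.attrib)
```

XPath positions are 1-based, so `div[1]` is the first `<div>`, and the link inside the second `<div>` is not selected.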


Reposted from www.cnblogs.com/kermitjam/p/10885923.html