Web Scraping Study Notes: Simple Use of the urllib Library to Crawl Douban, Youdao, and Renren

July 28, 2018, 18:04:41 · 无悔_一叶扁舟 · Tags: Python web scraping

Category: Python web scraping

1. Crawling Youdao Translate

```python
"""
Submit a POST request to Youdao Translate and receive the translated data.
author: 一叶扁舟
Note: written for Python 3.7
"""

import urllib.parse
import urllib.request

# This URL was captured with a packet sniffer (or Chrome DevTools), and is
# not the URL shown in the browser's address bar.
# url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=null"
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc"

# Complete request headers
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
}

# Read the text to translate from the user
key = input("Enter the text to translate: ")

# Form data sent to the web server
formdata = {
    "type": "AUTO",
    "i": key,
    "doctype": "json",
    "xmlVersion": "1.8",
    "keyfrom": "fanyi.web",
    "ue": "UTF-8",
    "action": "FY_BY_CLICKBUTTON",
    "typoResult": "true",
}

# URL-encode the form data. Without the trailing .encode("utf-8") you get:
# "POST data should be bytes, an iterable of bytes, or a file object.
#  It cannot be of type str."
data = urllib.parse.urlencode(formdata).encode("utf-8")

# If the data argument of Request() is set, the request is a POST;
# otherwise it is a GET.
request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
```
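The script above prints the raw JSON response; in practice you would parse it with the standard `json` module and pull out the translated text. A minimal offline sketch, using an illustrative sample string in the shape Youdao's `/translate` endpoint returned at the time (the exact fields may have changed since):

```python
import json

# Illustrative sample of the endpoint's JSON response (not a live capture)
sample = '{"type":"ZH_CN2EN","errorCode":0,"translateResult":[[{"src":"你好","tgt":"hello"}]]}'

result = json.loads(sample)
if result.get("errorCode") == 0:
    # translateResult is a list of lists of {src, tgt} segments
    translation = result["translateResult"][0][0]["tgt"]
    print(translation)  # hello
```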

2. Crawling Douban: fetching data loaded via Ajax

```python
"""
Crawl Ajax-loaded data from Douban.
author: 一叶扁舟
Note: written for Python 3.7
"""

import urllib.parse
import urllib.request

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# Pagination parameters of the Ajax endpoint: start offset and page size
formdata = {
    "start": "0",
    "limit": "100",
}

data = urllib.parse.urlencode(formdata).encode("utf-8")

request = urllib.request.Request(url, data=data, headers=headers)

print(urllib.request.urlopen(request).read().decode("utf-8"))
```
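Because the Ajax endpoint pages through results with `start` and `limit`, the whole chart can be walked by increasing `start` step by step. A small sketch of just the body-building part (the `paged_bodies` helper is my own illustration, not from the original post); each yielded value is what you would pass as `data` to `urllib.request.Request`:

```python
import urllib.parse

def paged_bodies(page_size=20, pages=3):
    """Yield URL-encoded POST bodies for successive pages of the Ajax endpoint."""
    for page in range(pages):
        formdata = {"start": str(page * page_size), "limit": str(page_size)}
        yield urllib.parse.urlencode(formdata).encode("utf-8")

for body in paged_bodies():
    print(body)
# b'start=0&limit=20'
# b'start=20&limit=20'
# b'start=40&limit=20'
```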

3. Using a cookie: crawling a Renren profile page that requires login

```python
"""
Using a cookie with Renren.
author: 一叶扁舟
Note: written for Python 3.7
"""

import urllib.request

url = "http://www.renren.com/410043129/profile"

headers = {
    "Accept": "*/*",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": "xxxxxxxxxxxxxxxx",  # paste the cookie from a logged-in Renren session here
    "Host": "www.renren.com",
    "Referer": "http://www.renren.com/496371843/profile",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
}

request = urllib.request.Request(url, headers=headers)

response = urllib.request.urlopen(request)

print(response.read().decode("utf-8"))
```
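Copying the Cookie header out of the browser works, but the session expires and the header must be refreshed by hand. The standard library can also manage cookies automatically via `http.cookiejar`: a minimal sketch of the setup (no login flow shown; that would depend on Renren's actual login form):

```python
import http.cookiejar
import urllib.request

# A CookieJar stores cookies set by the server across requests made
# through the same opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# After a successful opener.open(login_request), the session cookie lives in
# cookie_jar and is sent automatically on later opener.open() calls, so there
# is no need to copy a Cookie header from the browser.
print(type(opener).__name__)  # OpenerDirector
print(len(cookie_jar))        # 0 cookies before any request is made
```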

Reposted from blog.csdn.net/weixin_42858906/article/details/83017519