3.11 Python and Databases (3)
Insert operation:
# Omit the id column, since it is auto-incremented
In [ ]:
sql = "insert into `class`(`name`) values('高一四班')"
cursor = db.cursor()
cursor.execute(sql)
cursor.execute(sql)  # can be run twice; each run inserts another row
db.commit()
Delete operation:
In [ ]:
sql = "delete from `class` where `name` = '高一五班'"
cursor = db.cursor()  # get a cursor
cursor.execute(sql)   # execute the statement
db.commit()
Update operation:
In [ ]:
sql = "update `class` set `name` = '高一十四班' where `id` = 4;"
cursor = db.cursor()
cursor.execute(sql)
db.commit()
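All three operations follow the same cursor/commit pattern. Here is a minimal, self-contained sketch using the standard library's sqlite3 module in place of the course's MySQL connection `db` (note the quoting difference: MySQL uses backticks around identifiers, sqlite3 needs none here):

```python
import sqlite3

# In-memory database standing in for the course's MySQL connection `db`
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("create table class (id integer primary key autoincrement, name text)")
db.commit()

# Insert: omit id, it is auto-incremented
cursor.execute("insert into class (name) values ('高一四班')")
db.commit()

# Update: rename the row that received id = 1
cursor.execute("update class set name = '高一十四班' where id = 1")
db.commit()

# Delete: remove the row by its new name
cursor.execute("delete from class where name = '高一十四班'")
db.commit()

print(cursor.execute("select count(*) from class").fetchone()[0])  # 0
```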
3.12 Python and Databases (4)
Catching program exceptions
In [3]:
a = 10
b = a + 'hello'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-ce8b6729d737> in <module>()
1 a = 10
----> 2 b = a + 'hello'
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Use a try statement and catch the error with Exception:
In [4]:
try:
    a = 10
    b = a + 'hello'
except Exception as e:
    print(e)
# There is output but no traceback, because Exception caught the error
unsupported operand type(s) for +: 'int' and 'str'
# When writing code, catch the most specific exception you can; here it is TypeError
In [ ]:
try:
    a = 10
    b = a + 'hello'
except TypeError as e:
    print(e)
# Don't catch unknown exceptions: replace the blanket `except Exception as e:`
# clause with the specific exception you expect
Database rollback: rollback. A rollback means: several statements in a transaction have already been executed and one or more follow; if a later one fails, the whole batch should fail, and you no longer want the earlier statements to take effect, so you undo them instead of committing.
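The pattern can be sketched with the standard library's sqlite3 (the DB-API commit/rollback interface is the same as MySQL's): run the statements inside try, commit only if all succeed, and roll back on failure so none of the earlier statements take effect:

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("create table class (id integer primary key, name text)")
db.commit()

try:
    cursor.execute("insert into class (id, name) values (1, '高一四班')")
    # The second insert reuses id 1, violating the primary key, and raises an error
    cursor.execute("insert into class (id, name) values (1, '高一五班')")
    db.commit()
except Exception as e:
    db.rollback()  # undo the first insert as well
    print(e)

print(cursor.execute("select count(*) from class").fetchone()[0])  # 0: nothing was committed
```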
3.13 Python Web Scraping (1)
Scrape the Lianjia rental listings at https://bj.lianjia.com/zufang/ and store the results in a database.
Some sites, like Toutiao (今日头条), expose an API: opening the link returns well-structured JSON, so with an HTTP library and a JSON parser it is easy to pull out the information you want.
Scraping: extracting information from external pages (when no such API is available, you write a scraper to fetch the page and then extract the information you want from its HTML).
Main Python libraries used:
requests —— fetch page content
requests quickstart (Chinese): http://docs.python-requests.org/zh_CN/latest/user/quickstart.html
BeautifulSoup —— extract content from pages (parse the page and pull out what you want)
BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Goals:
Collect every listing link on the Lianjia results page;
Follow each link to the listing's detail page and extract the corresponding information there.
# Define a variable url and assign it the rental-listings link
In [11]: url = 'https://bj.lianjia.com/zufang/'
https://bj.lianjia.com/zufang/
What follows zufang/ is a parameter used to filter listings on Lianjia; with no parameter set, results are drawn from all rental listings in Beijing.
Install the third-party libraries:
pip install requests
pip install bs4
(BeautifulSoup is distributed inside the bs4 package)
3.14 Python Web Scraping (2)
In [7]: pip install requests  # pip is a shell command, so running it in a cell fails:
File "<ipython-input-7-74dcce72a708>", line 1
pip install requests
^
SyntaxError: invalid syntax
Import the third-party libraries:
In [221]:
import requests
from bs4 import BeautifulSoup
# BeautifulSoup lives inside the bs4 package, hence the `from bs4 import` form
In [222]:
url = 'https://bj.lianjia.com/zufang/'
response = requests.get(url)
# get is what a browser does when it simply opens a page
# requests.post is used for things like submitting a registration form
# response holds the result of the request
soup = BeautifulSoup(response.text, 'lxml')
In [223]: url = 'https://bj.lianjia.com/zufang/'
response = requests.get(url)
response
Out[223]: <Response [200]>  # status 200 means the page was fetched successfully
# where the data lives
In [15]: response.text
# returns all of the page's HTML — a lot of it, and messy
Out[15]:
'<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta http-equiv="Cache-Control" content="no-transform" /><meta http-equiv="Cache-Control" content="no-siteapp" /><meta http-equiv="Content-language" content="zh-CN" /><meta name="format-detection" content="telephone=no" /><meta name="applicable-device" content="pc"><link
网)</title>\n<meta name="description" content="链家北京租房
网,现有真实房屋租赁10765套
………………
1. In Chrome: right-click → Inspect
2. In 360 Browser: right-click → Inspect Element (审查元素)
(If the menu item is missing, it may be a system issue: right-click → More tools → Developer tools opens a panel for analyzing the page)
In [34]:
url = 'https://bj.lianjia.com/zufang/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# 'lxml' names the parser; it can be omitted (a warning is shown instead), but the
# default parser may differ across systems, so it is better to be explicit
In [35]:
url = 'https://bj.lianjia.com/zufang/'
response = requests.get(url)
soup = BeautifulSoup(response.text)
# the raw text is now structured:
In [18]:
soup
Out[18]:
<!DOCTYPE html>
<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="no-transform" http-equiv="Cache-
……………………
# (soup prints more readably, but these are not yet the elements we want) the link elements we need are inside divs (each div contains a link)
In [19]:
soup.find_all('div',class_="")
class is used to define classes and is a Python keyword, so BeautifulSoup uses class_ for this argument to avoid the clash.
The goal is the links: clicking a picture opens a listing, so each picture is wrapped in an <a> tag carrying the link.
# a div is simply a box in the page layout
Out[19]:
[<div class="wrapper "><div class="fl"><a class="logo" href="//www.lianjia.com/" title="链家房产网"><!-- <img src="https://s1.ljcdn.com/feroot/pc/asset/img/new-version/logo.png?_v=20180319195424"> --></a></div><div class="fr nav "><div class="fl"><ul>
<li>
<a class="" href="https://bj.lianjia.com/ershoufang/">二手房</a>
</li>
……………………
value="1"/></span>我已阅读并同意</label><a class="toprotocol" href="//www.lianjia.com/zhuanti/protocol" target="_blank">《链家用户使用协议》</a></li><li class="li_btn"><a class="register-user-btn"></a>注册</li></ul></form></div>]
Find the <a> under each pic-panel div (the <a> carries the link).
# links_div holds all the boxes that contain the links
In [20]:
links_div = soup.find_all('div', class_="pic-panel")  # note the hyphen in pic-panel
In [21]:
links_div
# find_all returns many matches, so this behaves like a list
Out[21]:
[<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102663605.html" target="_blank"><img alt="西城马甸 双朝南精装干净两居室 采光充足无遮挡" data-apart-layout="https://image1.ljcdn.com/x-se/hdic-frame/a21770d9-d29b-4797-b732-
………………
version/default_block.png?_v=20180319195424"/></a></div>]
# look at the first element
In [22]:
links_div[0]
Out[22]:
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102663605.html" target="_blank"><img alt="西城马甸 双朝南精装干净两居室 采光充足无遮挡" data-apart-layout="https://image1.ljcdn.com/x-se/hdic-frame/a21770d9-d29b-4797-b732-3164e700a48b.png.280x210.jpg" data-img="https://image1.ljcdn.com/110000-inspection/rsp_pic_uploadb536ab4f-1083-4dd7-8d41-0a3d7ae16722.jpg.280x210.jpg" src="https://s1.ljcdn.com/feroot/pc/asset/img/new-version/default_block.png?_v=20180319195424"/></a></div>
links_div[0] is one box with a lot inside; from it we only need to extract the link https://bj.lianjia.com/zufang/101102663605.html — the rest we don't need.
Turn the list of boxes into a list of links
using a list comprehension (a for-loop form that builds one list (the links) from another (the boxes)).
In [ ]:
links_div = soup.find_all('div', class_="pic-panel")
links = [for div in links_div]
# raises a SyntaxError: the comprehension is missing the expression before `for`
# what we want from each div is its a tag
In [25]:
links_div[1].a
Out[25]:
<a href="https://bj.lianjia.com/zufang/101102657620.html" target="_blank"><img alt="花家地西里一区可随时拎包入住一居室" data-apart-layout="https://image1.ljcdn.com/x-se/hdic-frame/aaeb8786-0810-4531-8383-bf11d713f256.png.280x210.jpg" data-img="https://image1.ljcdn.com/110000-inspection/661a4358-b111-4fe7-9fa2-f573c2662251.jpg.280x210.jpg" src="https://s1.ljcdn.com/feroot/pc/asset/img/new-version/default_block.png?_v=20180319195424"/></a>
In [26]:
links_div[1].a.get('href')
# href is the attribute that holds the URL we need
# the link is now extracted
Out[26]:
'https://bj.lianjia.com/zufang/101102657620.html'
# build the new list
In [29]:
links_div = soup.find_all('div', class_="pic-panel")
links = [div.a.get('href') for div in links_div]
Inspect the contents of the links list:
In [30]: links
Out[30]:
['https://bj.lianjia.com/zufang/101102663605.html',
'https://bj.lianjia.com/zufang/101102657620.html',
'https://bj.lianjia.com/zufang/101102627382.html',
'https://bj.lianjia.com/zufang/101102562541.html',
'https://bj.lianjia.com/zufang/101102601891.html',
'https://bj.lianjia.com/zufang/101102612472.html',
'https://bj.lianjia.com/zufang/101102560877.html',
'https://bj.lianjia.com/zufang/101102563962.html',
'https://bj.lianjia.com/zufang/101102565535.html',
'https://bj.lianjia.com/zufang/101102567140.html',
'https://bj.lianjia.com/zufang/101102569458.html',
'https://bj.lianjia.com/zufang/101102595781.html',
'https://bj.lianjia.com/zufang/101102583860.html',
'https://bj.lianjia.com/zufang/101102586630.html',
'https://bj.lianjia.com/zufang/101102589206.html',
'https://bj.lianjia.com/zufang/101102590254.html',
'https://bj.lianjia.com/zufang/101102577318.html',
'https://bj.lianjia.com/zufang/101102577961.html',
'https://bj.lianjia.com/zufang/101102593955.html',
'https://bj.lianjia.com/zufang/101102615116.html',
'https://bj.lianjia.com/zufang/101102601605.html',
'https://bj.lianjia.com/zufang/101102573573.html',
'https://bj.lianjia.com/zufang/101102616903.html',
'https://bj.lianjia.com/zufang/101102402567.html',
'https://bj.lianjia.com/zufang/101102424075.html',
'https://bj.lianjia.com/zufang/101102666239.html',
'https://bj.lianjia.com/zufang/101102627365.html',
'https://bj.lianjia.com/zufang/101102630521.html',
'https://bj.lianjia.com/zufang/101102634527.html',
'https://bj.lianjia.com/zufang/101102658371.html']
In [31]:
# show the list together with its length
links,len(links)
Out[31]:
(['https://bj.lianjia.com/zufang/101102663605.html',
'https://bj.lianjia.com/zufang/101102657620.html',
………………
'https://bj.lianjia.com/zufang/101102658371.html'],
30)
30 links are printed, so we have captured all 30 rental listings on the first page; visiting any one of them gives a specific listing page.
The steps above collect the links for one whole page (30 listings).
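The link-extraction step above can be sketched without network access. This illustrative version uses only the standard library's html.parser instead of requests + BeautifulSoup, and the sample markup below is a hypothetical stand-in for the real page that requests.get(url).text would return:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the real page markup fetched from Lianjia
HTML = '''
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102663605.html"><img/></a></div>
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102657620.html"><img/></a></div>
<div class="other"><a href="https://bj.lianjia.com/ershoufang/"><img/></a></div>
'''

class LinkExtractor(HTMLParser):
    """Collect the href of the first <a> inside each <div class="pic-panel">."""
    def __init__(self):
        super().__init__()
        self.in_panel = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            self.in_panel = attrs.get('class') == 'pic-panel'
        elif tag == 'a' and self.in_panel and 'href' in attrs:
            self.links.append(attrs['href'])
            self.in_panel = False  # only the first link per panel

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_panel = False

parser = LinkExtractor()
parser.feed(HTML)
print(parser.links)
# ['https://bj.lianjia.com/zufang/101102663605.html',
#  'https://bj.lianjia.com/zufang/101102657620.html']
```

With BeautifulSoup, as in the course, the entire class collapses to the one-line comprehension `[div.a.get('href') for div in soup.find_all('div', class_="pic-panel")]`.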