Introduction to Python (16)

3.11 Working with Databases in Python (3)

Insert operation:

# Omit the id column, since it auto-increments

In [ ]:

sql = "insert into `class`(`name`) values('高一四班')"

cursor = db.cursor()

cursor.execute(sql)

cursor.execute(sql)  # the same insert can be executed a second time

db.commit()

Delete operation:

In [ ]:

sql = "delete from `class` where `name` = '高一五班'"

cursor = db.cursor()   # get a cursor

cursor.execute(sql)    # run the statement

db.commit()

 

Update operation:

In [ ]:

sql = "update `class` set `name` = '高一十四班' where `id` = 4;"

cursor = db.cursor()

cursor.execute(sql)

db.commit()
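The statements above interpolate values straight into the SQL string. A safer habit is to let the driver substitute parameters. A minimal runnable sketch, using the standard-library sqlite3 module in place of MySQL (sqlite3 uses `?` placeholders; pymysql would use `%s`):

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs anywhere.
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute("create table class (id integer primary key autoincrement, name text)")

# Parameterized queries let the driver handle quoting, avoiding the
# SQL-injection bugs that hand-built strings invite.
cursor.execute("insert into class (name) values (?)", ('高一四班',))
cursor.execute("update class set name = ? where id = ?", ('高一十四班', 1))
db.commit()

cursor.execute("select name from class where id = 1")
print(cursor.fetchone()[0])  # 高一十四班
```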

3.12 Working with Databases in Python (4)

Catching exceptions

In [3]:

a = 10

b = a + 'hello'

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-3-ce8b6729d737> in <module>()

1 a = 10

----> 2 b = a + 'hello'

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Catch it with a try statement and Exception

In [4]:

try:
    a = 10
    b = a + 'hello'
except Exception as e:
    print(e)

# There is output but no traceback, because Exception caught the error

unsupported operand type(s) for +: 'int' and 'str'

# When programming, catch the exception as specifically as you can; here it is a TypeError

In [ ]:

try:
    a = 10
    b = a + 'hello'
except TypeError as e:
    print(e)

# Don't try to catch exceptions you can't anticipate: delete the catch-all `except Exception as e: print(e)` branch and let unexpected errors propagate.
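To illustrate why catching the specific exception matters, a small sketch (the helper name `safe_add` is made up for this example): the TypeError we expect is handled, while any other error would still propagate and surface as a real bug.

```python
def safe_add(a, b):
    # Handle only the exception we expect; anything else propagates,
    # so genuine bugs are not silently swallowed by a catch-all.
    try:
        return a + b
    except TypeError as e:
        return 'type error: {}'.format(e)

print(safe_add(1, 2))         # 3
print(safe_add(10, 'hello'))  # type error: unsupported operand type(s) for +: 'int' and 'str'
```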

Database rollback: rollback undoes the statements executed earlier in a transaction. If several statements have run and a later one fails, and you no longer want the earlier ones to take effect, roll back so that none of the uncommitted changes are applied.
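The commit/rollback pattern can be sketched with the standard-library sqlite3 module (the `db.commit()` / `db.rollback()` calls are the same with pymysql); the duplicate primary key below is a deliberate failure:

```python
import sqlite3

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute("create table class (id integer primary key, name text)")
db.commit()

try:
    cursor.execute("insert into class (id, name) values (1, '高一一班')")
    # Duplicate primary key: this statement fails...
    cursor.execute("insert into class (id, name) values (1, '高一二班')")
    db.commit()
except Exception:
    # ...so roll back the first insert too: all or nothing.
    db.rollback()

cursor.execute("select count(*) from class")
print(cursor.fetchone()[0])  # 0 -- the first insert was undone as well
```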

 

3.13 Web Scraping with Python (1)

Goal: write a scraper for the Lianjia rental listings at https://bj.lianjia.com/zufang/ and store the results in a database.

Some sites, such as Jinri Toutiao, expose clean API endpoints that return JSON; with an HTTP library and a JSON parser it is easy to pull out the information you want.

A scraper fetches external pages directly. (When the data cannot be obtained through an API, we scrape the page itself and then extract the information we want from its HTML.)

The main Python libraries used:

requests: fetches page content

requests quickstart (Chinese): http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

BeautifulSoup: extracts page content (parses the page and pulls out the parts you want)

BeautifulSoup docs (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

 

Goals:

get every listing link on the Lianjia page;

open each link's rental detail page and extract the listing information there.

 

# Assign the rental listings URL to a variable named url

In [11]: url = 'https://bj.lianjia.com/zufang/'

Filter parameters can follow zufang/ in the URL; they narrow the search on Lianjia. Without any parameters, listings from all of Beijing are returned.
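Appending such a filter segment can be sketched with the standard library; the segment `l2/` below is hypothetical, since the real filter codes are defined by the site itself:

```python
from urllib.parse import urljoin

base = 'https://bj.lianjia.com/zufang/'

# 'l2/' is a made-up example segment, not a documented Lianjia filter.
url = urljoin(base, 'l2/')
print(url)  # https://bj.lianjia.com/zufang/l2/
```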

Install the third-party libraries:

pip install requests

pip install bs4

(BeautifulSoup is shipped inside the bs4 package)

 

3.14 Web Scraping with Python (2)

In [7]: pip install requests    # pip is a shell command, not Python code

File "<ipython-input-7-74dcce72a708>", line 1

pip install requests

^

SyntaxError: invalid syntax

# Run pip from a terminal prompt, not inside IPython.

Import the third-party libraries:

In [221]:  

import requests

from bs4 import BeautifulSoup

# BeautifulSoup lives in the bs4 package, which is why it is imported from bs4

In [222]: 

url = 'https://bj.lianjia.com/zufang/'

response = requests.get(url)

# get fetches the page the way a browser does

# requests.post is used when submitting data, e.g. a registration form

# response holds the result of the request

soup = BeautifulSoup(response.text, 'lxml')

In [223]: url = 'https://bj.lianjia.com/zufang/'

response = requests.get(url)

response

Out[223]:   <Response [200]>   # status 200 means the page was fetched successfully

 

# where the data lives

In [15]: response.text

# the page's full HTML: long and messy

Out[15]:

'<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta http-equiv="Cache-Control" content="no-transform" /><meta http-equiv="Cache-Control" content="no-siteapp" /><meta http-equiv="Content-language" content="zh-CN" /><meta name="format-detection" content="telephone=no" /><meta name="applicable-device" content="pc"><link

)</title>\n<meta name="description" content="链家北京租房

,现有真实房屋租赁10765套

………………

1. In Chrome: right-click and choose Inspect

2. In 360 Browser: right-click and choose Inspect Element

(If the menu item is missing, it may depend on the system; right-click, then More tools, then Developer tools opens a panel for analyzing the page)

In [34]:

url = 'https://bj.lianjia.com/zufang/'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')

# lxml names the parser explicitly; leave it out and BeautifulSoup warns that it chose one for you, and the choice may differ on other systems

In [35]:

url = 'https://bj.lianjia.com/zufang/'

response = requests.get(url)

soup = BeautifulSoup(response.text)

# this turns the raw text into a structured tree:

In [18]:

soup

Out[18]:

<!DOCTYPE html>

<html><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="no-transform" http-equiv="Cache-

……………………

 

# (soup is easier to read, but still not the element we want) The links we are after live inside div elements (the divs contain the links)

In [19]:

soup.find_all('div',class_="")

class is a Python keyword (used when defining classes), so BeautifulSoup uses class_ to tell the HTML attribute apart from it

We want the links: clicking the image opens the listing, which means the image sits inside an a tag carrying the link

# a div is a box-like container

Out[19]:

[<div class="wrapper "><div class="fl"><a class="logo" href="//www.lianjia.com/" title="链家房产网"><!-- <img src="https://s1.ljcdn.com/feroot/pc/asset/img/new-version/logo.png?_v=20180319195424"> --></a></div><div class="fr nav "><div class="fl"><ul>

<li>

<a class="" href="https://bj.lianjia.com/ershoufang/">二手房</a>

</li>

……………………

value="1"/></span>我已阅读并同意</label><a class="toprotocol" href="//www.lianjia.com/zhuanti/protocol" target="_blank">《链家用户使用协议》</a></li><li class="li_btn"><a class="register-user-btn"></a>注册</li></ul></form></div>]

The a tag under the div with class pic-panel (the a tag holds the link)

# define links_div: the box list holding all the links

In [20]:

links_div = soup.find_all('div', class_="pic-panel")   # note the hyphen in pic-panel

In [21]:

links_div

# find_all returns every match, so links_div behaves like a list

Out[21]:

[<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102663605.html" target="_blank"><img alt="西城马甸  双朝南精装干净两居室  采光充足无遮挡" data-apart-layout="https://image1.ljcdn.com/x-se/hdic-frame/a21770d9-d29b-4797-b732-

………………

version/default_block.png?_v=20180319195424"/></a></div>]

# look at the first element

In [22]:

links_div[0]

Out[22]:

<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102663605.html" target="_blank"><img alt="西城马甸  双朝南精装干净两居室  采光充足无遮挡" data-apart-layout="https://image1.ljcdn.com/x-se/hdic-frame/a21770d9-d29b-4797-b732-3164e700a48b.png.280x210.jpg" data-img="https://image1.ljcdn.com/110000-inspection/rsp_pic_uploadb536ab4f-1083-4dd7-8d41-0a3d7ae16722.jpg.280x210.jpg" src="https://s1.ljcdn.com/feroot/pc/asset/img/new-version/default_block.png?_v=20180319195424"/></a></div>

links_div[0] gives back one box with many things inside; all we need from it is the single link https://bj.lianjia.com/zufang/101102663605.html, and the rest can be discarded.

 

Generate a list of links from the list of boxes

Use a list comprehension, a compact for loop that generates one list (the link list) from another (the box list)
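The pattern of generating one list from another looks like this in miniature (the hand-rolled string slicing below merely stands in for div.a.get('href')):

```python
# Toy stand-in for the box list: each string plays the role of a div.
boxes = ['<div><a href="/a.html"></a></div>',
         '<div><a href="/b.html"></a></div>']

# [expression for item in source]: the expression maps each box to its link.
links = [box.split('href="')[1].split('"')[0] for box in boxes]
print(links)  # ['/a.html', '/b.html']
```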

In [ ]:

links_div = soup.find_all('div', class_="pic-panel")

links = [for div in links_div]

# raises a SyntaxError: the expression before `for` is missing

# we want the a tag

In [25]:

links_div[1].a

Out[25]:

<a href="https://bj.lianjia.com/zufang/101102657620.html" target="_blank"><img alt="花家地西里一区可随时拎包入住一居室" data-apart-layout="https://image1.ljcdn.com/x-se/hdic-frame/aaeb8786-0810-4531-8383-bf11d713f256.png.280x210.jpg" data-img="https://image1.ljcdn.com/110000-inspection/661a4358-b111-4fe7-9fa2-f573c2662251.jpg.280x210.jpg" src="https://s1.ljcdn.com/feroot/pc/asset/img/new-version/default_block.png?_v=20180319195424"/></a>

In [26]:

links_div[1].a.get('href')

# href is the attribute holding the URL we need

# the link is now extracted

Out[26]:

'https://bj.lianjia.com/zufang/101102657620.html'

 

# build the new list

In [29]:

links_div = soup.find_all('div',class_="pic-panel")

links = [div.a.get('href') for div in links_div]

 

Look at what the links list now stores:

In [30]: links

Out[30]:

['https://bj.lianjia.com/zufang/101102663605.html',

'https://bj.lianjia.com/zufang/101102657620.html',

'https://bj.lianjia.com/zufang/101102627382.html',

'https://bj.lianjia.com/zufang/101102562541.html',

'https://bj.lianjia.com/zufang/101102601891.html',

'https://bj.lianjia.com/zufang/101102612472.html',

'https://bj.lianjia.com/zufang/101102560877.html',

'https://bj.lianjia.com/zufang/101102563962.html',

'https://bj.lianjia.com/zufang/101102565535.html',

'https://bj.lianjia.com/zufang/101102567140.html',

'https://bj.lianjia.com/zufang/101102569458.html',

'https://bj.lianjia.com/zufang/101102595781.html',

'https://bj.lianjia.com/zufang/101102583860.html',

'https://bj.lianjia.com/zufang/101102586630.html',

'https://bj.lianjia.com/zufang/101102589206.html',

'https://bj.lianjia.com/zufang/101102590254.html',

'https://bj.lianjia.com/zufang/101102577318.html',

'https://bj.lianjia.com/zufang/101102577961.html',

'https://bj.lianjia.com/zufang/101102593955.html',

'https://bj.lianjia.com/zufang/101102615116.html',

'https://bj.lianjia.com/zufang/101102601605.html',

'https://bj.lianjia.com/zufang/101102573573.html',

'https://bj.lianjia.com/zufang/101102616903.html',

'https://bj.lianjia.com/zufang/101102402567.html',

'https://bj.lianjia.com/zufang/101102424075.html',

'https://bj.lianjia.com/zufang/101102666239.html',

'https://bj.lianjia.com/zufang/101102627365.html',

'https://bj.lianjia.com/zufang/101102630521.html',

'https://bj.lianjia.com/zufang/101102634527.html',

'https://bj.lianjia.com/zufang/101102658371.html']

In [31]:

# print the list together with its length

links,len(links)

Out[31]:

(['https://bj.lianjia.com/zufang/101102663605.html',

'https://bj.lianjia.com/zufang/101102657620.html',

'https://bj.lianjia.com/zufang/101102627382.html',

'https://bj.lianjia.com/zufang/101102562541.html',

'https://bj.lianjia.com/zufang/101102601891.html',

'https://bj.lianjia.com/zufang/101102612472.html',

'https://bj.lianjia.com/zufang/101102560877.html',

'https://bj.lianjia.com/zufang/101102563962.html',

'https://bj.lianjia.com/zufang/101102565535.html',

'https://bj.lianjia.com/zufang/101102567140.html',

'https://bj.lianjia.com/zufang/101102569458.html',

'https://bj.lianjia.com/zufang/101102595781.html',

'https://bj.lianjia.com/zufang/101102583860.html',

'https://bj.lianjia.com/zufang/101102586630.html',

'https://bj.lianjia.com/zufang/101102589206.html',

'https://bj.lianjia.com/zufang/101102590254.html',

'https://bj.lianjia.com/zufang/101102577318.html',

'https://bj.lianjia.com/zufang/101102577961.html',

'https://bj.lianjia.com/zufang/101102593955.html',

'https://bj.lianjia.com/zufang/101102615116.html',

'https://bj.lianjia.com/zufang/101102601605.html',

'https://bj.lianjia.com/zufang/101102573573.html',

'https://bj.lianjia.com/zufang/101102616903.html',

'https://bj.lianjia.com/zufang/101102402567.html',

'https://bj.lianjia.com/zufang/101102424075.html',

'https://bj.lianjia.com/zufang/101102666239.html',

'https://bj.lianjia.com/zufang/101102627365.html',

'https://bj.lianjia.com/zufang/101102630521.html',

'https://bj.lianjia.com/zufang/101102634527.html',

'https://bj.lianjia.com/zufang/101102658371.html'],

30)

Thirty links are printed, so we have collected the 30 rental listings from the first page; visiting any one of these links opens a specific rental listing.

 

The steps above collect all 30 rental listings from one full page.
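The whole procedure folds into one small function; `get_links` is a name invented here, and it is demonstrated on an inline HTML snippet with the built-in html.parser so the sketch runs without a network connection (the real call would pass requests.get(url).text):

```python
from bs4 import BeautifulSoup

def get_links(html):
    """Return the listing link from every div with class pic-panel."""
    soup = BeautifulSoup(html, 'html.parser')
    links_div = soup.find_all('div', class_='pic-panel')
    return [div.a.get('href') for div in links_div]

# Inline sample standing in for requests.get(url).text
sample = '''
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102663605.html"></a></div>
<div class="pic-panel"><a href="https://bj.lianjia.com/zufang/101102657620.html"></a></div>
'''
print(get_links(sample))
```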

 


Reposted from blog.csdn.net/zxqjinhu/article/details/80847168