Python数据分析NumPy和pandas（十八、从Web APIs 和数据库中获取数据）

一、与Web APIs 进行数据交互

很多Web网站都提供公共的API，并通过 JSON 或其他格式提供数据。那Python也有很多种方法可以访问网站提供的API，其中一种常用的方法是通过使用 requests 库，使用之前需要先安装它，这里通过pip安装：

pip install requests

下面我通过GitHub网站提供的API来学习使用requests。例如，要在 GitHub 上找到 pandas 的最后 30 个 GitHub 问题列表数据，我们可以使用附加组件requests库发出 GET HTTP 请求：

import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)
resp.raise_for_status()
print(resp)

以上代码首先导入了requests库，url定义了要访问的API接口，然后通过requests.get()方法访问该接口，并用resp变量获取访问的结果，最后通过结果对象的raise_for_status()方法检查是否有HTTP错误，我们最好始终在使用 requests.get 后调用 raise_for_status 来检查 HTTP 错误。打印resp的输出结果是：<Response [200]> 200表示访问成功。

响应对象resp的 json() 方法会返回一个 Python 对象，其中包含从API接口获取的并已经解析的 JSON 数据（如字典或列表所示）（取决于返回的 JSON内的数据格式）：

import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)
resp.raise_for_status()

#用响应对象resp的json方法获取python对象
data = resp.json()
#获取第一个元素的title
title = data[0]["title"]
print(title)

输出结果是：

DOC: Update contributing docs for Windows build tools instructions

因为通过这个API检索到的结果是实时数据，因此我们在不同的时候运行此代码会得到不同的第一个问题的标题。data 中的每个元素都是一个字典，包含在 GitHub 问题页面上找到的所有数据（评论除外）。我们可以将data直接传递给 pandas.DataFrame构造一个DataFrame对象并提取我们感兴趣的字段。如下代码：

import pandas as pd
import requests

url = "https://api.github.com/repos/pandas-dev/pandas/issues"
resp = requests.get(url)
resp.raise_for_status()

#用响应对象resp的json方法获取python对象
data = resp.json()
issues = pd.DataFrame(data, columns=["number", "title", "labels", "state"])
print(issues)

输出结果：

	number	title	labels	state
0	60170	DOC: Update contributing docs for Windows build tools instructions	[]	open
1	60169	BUG: build_table_schema (AttributeError: 'datetime.timezone' object has...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
2	60166	BUG: fix #60128 BUG: Series.combine_first loss of precision	[]	open
3	60163	QST: Why Doesn't Pandas Have an Insert Index Function or Method?	[{'id': 34444536, 'node_id': 'MDU6TGFiZWwzNDQ0NDUzNg==', 'url': 'https:...	open
4	60162	ENH: "stripna" for dropping leading or trailing NaNs	[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg==', 'url': 'https://api.g...	open
5	60161	ENH: Add `area_limit` to `fillna	[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg==', 'url': 'https://api.g...	open
6	60159	BUG/API: preserve dtype in Index `append()`	[{'id': 31404521, 'node_id': 'MDU6TGFiZWwzMTQwNDUyMQ==', 'url': 'https:...	open
7	60154	CI: add test build with numpy 1.26 (<2)	[{'id': 48070600, 'node_id': 'MDU6TGFiZWw0ODA3MDYwMA==', 'url': 'https:...	open
8	60149	BUG: Docs won't build (Unexpected exception in `doc\source\user_guide\e...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
9	60148	DOC: pandas.DataFrame.to_html additional description for the border par...	[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OTk=', 'url': 'https://api....	open
10	60146	CI: Add Windows wheels for the free-threaded build	[{'id': 129350, 'node_id': 'MDU6TGFiZWwxMjkzNTA=', 'url': 'https://api....	open
11	60136	[backport 2.3.x] TST (string dtype): add explicit object vs str dtype t...	[]	open
12	60134	TST (string dtype): remove xfails in extension tests + fix categorical/...	[{'id': 57522093, 'node_id': 'MDU6TGFiZWw1NzUyMjA5Mw==', 'url': 'https:...	open
13	60129	ENH: allow complex type inference in convert_dtypes	[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg==', 'url': 'https://api.g...	open
14	60128	BUG: Series.combine_first loss of precision	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
15	60127	BUG: Assigning boolean series with logical indexer	[{'id': 2822098, 'node_id': 'MDU6TGFiZWwyODIyMDk4', 'url': 'https://api...	open
16	60126	VOTE: Voting issue for PDEP-17: Backwards compatibility and deprecation...	[{'id': 5732441949, 'node_id': 'LA_kwDOAA0YD88AAAABVa4fXQ', 'url': 'htt...	open
17	60124	BUG: 'Engine' object has no attribute 'cursor'	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
18	60120	BUG: isin check incorrect with uint64 dtype	[]	open
19	60119	TST (string dtype): fix xfails in test_algos.py	[{'id': 57522093, 'node_id': 'MDU6TGFiZWw1NzUyMjA5Mw==', 'url': 'https:...	open
20	60115	Fixes: [#60084] Added a new file test_timestamp_hash.py	[{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2ODU=', 'url': 'https://api....	open
21	60111	ENH: A .chi2() method on the DataFrame and Series class that will resem...	[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg==', 'url': 'https://api.g...	open
22	60108	BUG: many dataframe operations broken when a column contains numpy stru...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
23	60105	Fix BUG: read_sql tries to convert blob/varbinary to string with pyarro...	[{'id': 47232590, 'node_id': 'MDU6TGFiZWw0NzIzMjU5MA==', 'url': 'https:...	open
24	60104	BUG: Unknown slicing behavior for Multiindexing when passing through st...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
25	60102	BUG: Frequency shift on empty DataFrame doesn't shift index	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
26	60100	BUG: `read_csv` with chained fsspec TAR file and `compression="infer"` ...	[]	open
27	60099	BUG: Inconsistent output type in Excel for PeriodIndex in Index or Mult...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ==', 'url': 'https://api.g...	open
28	60098	PERF: Slowdowns with .isin() on columns typed as np.uint64	[{'id': 8935311, 'node_id': 'MDU6TGFiZWw4OTM1MzEx', 'url': 'https://api...	open
29	60097	[TST] Parse arrow_dtype as datetime index	[{'id': 127685, 'node_id': 'MDU6TGFiZWwxMjc2ODU=', 'url': 'https://api....	open

在我们的实际开发或进行数据分析的时候，我们可以创建一些更方便的Web API接口，让这些接口返回 DataFrame 对象，从而可以更方便的应用pandas进行数据分析。

二、与数据库进行数据交互

在正式业务环境中，数据一般不会存储在文本文件或 Excel 文件中。这个大家都知道，当前主流方式还是将数据存储在基于 SQL 的关系数据库中（如 SQL Server、PostgreSQL 和 MySQL）。至于选择用那种数据库，取决于应用程序的性能、数据完整性和可伸缩性的需求。

pandas 具有一些好的功能，可以非常方便的将 SQL 查询结果加载到 DataFrame 中。下面我将使用 Python 内置 sqlite3 驱动程序创建一个 SQLite3 数据库进行学习。我还是通过代码来学习更直观：

import pandas as pd
import sqlite3

#创建一个表名为test的表，该表有a,b,c,d四个字段。
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
#连接python内置的sqllite3数据库
con = sqlite3.connect("mydata.sqlite")
#执行建表语句
con.execute(query)
#提交事务
con.commit()

上面这部分代码定义了一个在sqlite中创建表test的语句，test表有a,b,c,d四个字段。接下来我们在test表中插入数据。

#要插入test表中的数据
data = [("Atlanta", "Georgia", 1.25, 6),
        ("Tallahassee", "Florida", 2.6, 3),
        ("Sacramento", "California", 1.7, 5)]
#插入语句
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
#执行插入数据操作
con.executemany(stmt, data)
#提交事务
con.commit()

以上定义了要在test表中插入的数据data和插入语句，并执行提交。

#查询test全表，并返回一个游标
cursor = con.execute("SELECT * FROM test")
#获取返回的所有行数据
rows = cursor.fetchall()
#将数据打印出来看看
print(rows)

以上执行查询test全表数据，并通过获取的游标将返回的数据赋值给rows变量，然后打印输出rows。rows的输出结果是一个元组列表：

[('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3), ('Sacramento', 'California', 1.7, 5)]

接下来我们可以将元组列表传递给 DataFrame 构造函数构造一个DataFrame对象，但是还需要获取列名来作为DataFrame对象的列标签（索引），列名我们可以通过游标cursor的 description 属性获得。请注意，对于 SQLite3，游标描述仅提供列名（其他字段是 Python 的数据库 API 规范的一部分，都是 None），但对于其他一些数据库驱动程序，description提供了更多的列信息。

cursor.description

这一行代码通过游标cursor的属性description获取表的列信息，对于SQLite3来说这个列信息只有列名，我们输出cursor.description来看下：

(('a', None, None, None, None, None, None), ('b', None, None, None, None, None, None), ('c', None, None, None, None, None, None), ('d', None, None, None, None, None, None))

#用返回的元组列表和列名构造一个DataFrame对象，
#列名通过推导式获得
pd_rows = pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

上面用返回的元组列表rows和表test的列名构造了一个DataFrame对象pd_rows。下面我们将上面的代码完整的整合下，并输出最后pd_rows内容：

import pandas as pd
import sqlite3

#创建一个表名为test的表，该表有a,b,c,d四个字段。
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
#连接python内置的sqllite3数据库
con = sqlite3.connect("mydata.sqlite")
#执行建表语句
con.execute(query)
#提交事务
con.commit()

#要插入test表中的数据
data = [("Atlanta", "Georgia", 1.25, 6),
        ("Tallahassee", "Florida", 2.6, 3),
        ("Sacramento", "California", 1.7, 5)]
#插入语句
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
#执行插入数据操作
con.executemany(stmt, data)
#提交事务
con.commit()

#查询test全表，并返回一个游标
cursor = con.execute("SELECT * FROM test")
#获取返回的所有行数据
rows = cursor.fetchall()
#将数据打印出来看看
print(rows)
#通过游标查看列信息,SQLite3的列信息中只包含列名
print(cursor.description)
#用返回的元组列表和列名构造一个DataFrame对象
pd_rows = pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
print(pd_rows)

输出pd_rows结果：

a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

上面的代码从连接数据库、创建表、插入表数据、查询表数据到通过返回的数据和列名构造DataFrame，这些工作还是比较繁琐的，我们不希望每次查询数据库时都重复一些操作。所以我再对 SQLAlchemy 项目进行一个简要的学习，SQLAlchemy是一个流行的 Python SQL 工具包，它抽象出 SQL 数据库之间的许多常见差异，同时，pandas 具有 read_sql()函数，可让我们很方便的从通用 SQLAlchemy 连接中读取数据。先用pip安装SQLAlchemy。

pip install sqlalchemy

现在，我将使用 SQLAlchemy 连接到同一个 SQLite 数据库，并从之前创建的表中读取数据：

import sqlalchemy as sqla

#连接到数据库
db = sqla.create_engine("sqlite:///mydata.sqlite")
#读取数据
data = pd.read_sql("SELECT * FROM test", db)
print(data)

输出结果：

a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

这相比于之前使用python内置的sqlite中的方法的先连接到数据库、再执行查询、然后通过游标获取查询后的数据，最后用返回的数据和列名构造DataFrame对象简单了很多。

三、总结

在前面写的学习笔记过程中，我学习了许多有用的工具，通过这些学习算是入了数据分析的门了。但是访问数据通常是数据分析过程的第一步，在后面的学习过程中，还得更深入的学习数据整理、数据可视化、时间序列分析等。

猜你喜欢

目录

热门文章