项目：从电子表格中读取数据

假定你有一张电子表格的数据，来自于 2010 年美国人口普查。你有一个无聊的任务，要遍历表中的几千行，计算总的人口，以及每个县的普查区的数目（普查区就是一个地理区域，是为人口普查而定义的）。每行表示一个人口普查区。我们将这个电子表格文件命名为 censuspopdata.xlsx，可以从 http://nostarch.com/automatestuff/下载它。它的内容如图 12-2 所示。尽管 Excel 是要能够计算多个选中单元格的和，你仍然需要选中 3000 个以上县的单元格。即使手工计算一个县的人口只需要几秒钟，整张电子表格也需要几个小时时间。

在这个项目中，你要编写一个脚本，从人口普查电子表格文件中读取数据，并在几秒钟内计算出每个县的统计值。下面是程序要做的事：

• 从 Excel 电子表格中读取数据。

• 计算每个县中普查区的数目。

• 计算每个县的总人口。

• 打印结果。

这意味着代码需要完成下列任务：

• 用 openpyxl 模块打开 Excel 文档并读取单元格。

• 计算所有普查区和人口数据，将它保存到一个数据结构中。

• 利用 pprint 模块，将该数据结构写入一个扩展名为.py的文本文件。

读取电子表格数据

censuspopdata.xlsx 电子表格中只有一张表，名为'Population by Census Tract'。每一行都保存了一个普查区的数据。列分别是普查区的编号（A），州的简称（B），县的名称（C），普查区的人口（D）。打开一个新的文件编辑器窗口，输入以下代码。将文件保存为readCensusExcel.py。

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.

import openpyxl, pprint
print('Opening workbook...')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb['Population by Census Tract']
county = {}

# TODO: Fill in countyData with each county's population and tracts.
print('Reading rows...')
for row in range(2, sheet.max_row + 1):
	# Each row in the spreadsheet has data for one census tract.
	State = sheet['B' + str(row)].value
	county = sheet['C' + str(row)].value
	pop = sheet['D' + str(row)].value

# TODO: Open a new text file and write the contents of countyData to it.

这段代码导入了 openpyxl 模块，也导入了 pprint 模块，你用后者来打印最终的县的数据。然后代码打开了 censuspopdata.xlsx 文件，取得了包含人口普查数据的工作表，开始迭代它的行。请注意，你也创建了一个 countyData 变量，它将包含你计算的每个县的人口和普查区数目。但在它里面存储任何东西之前，你应该确定它内部的数据结构。

填充数据结构

保存在 countyData 中的数据结构将是一个字典，以州的简称作为键。每个州的简称将映射到另一个字典，其中的键是该州的县的名称。每个县的名称又映射到一个字典，该字典只有两个键，'tracts'和'pop'。这些键映射到普查区数目和该县的人口。例如，该字典可能类似于：

{'AK': {'Aleutians East': {'pop': 3141, 'tracts': 1},        
        'Aleutians West': {'pop': 5561, 'tracts': 2},         
        'Anchorage': {'pop': 291826, 'tracts': 55},         
        'Bethel': {'pop': 17013, 'tracts': 3},         
        'Bristol Bay': {'pop': 997, 'tracts': 1},         
        --snip--

如果前面的字典保存在 countyData 中，下面的表达式求值结果如下：

>>> countyData['AK']['Anchorage']['pop'] 
291826 
>>> countyData['AK']['Anchorage']['tracts'] 
55

一般来说，countyData 字典中的键看起来像这样：

countyData[state abbrev][county]['tracts'] 
countyData[state abbrev][county]['pop']

既然知道了 countyData 的结构，就可以编写代码，用县的数据填充它。将下面的代码添加到程序的末尾：

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.

--snip--
# TODO: Fill in countyData with each county's population and tracts.
print('Reading rows...')
for row in range(2, sheet.max_row + 1):
	# Each row in the spreadsheet has data for one census tract.
	state = sheet['B' + str(row)].value
	county = sheet['C' + str(row)].value
	pop = sheet['D' + str(row)].value

	# Make sure the key for this state exists.
	countyData.setdefault(state, {})
	# Make sure the key for this county in this state exists.
	countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})

	# Each row represents one census tract, so increment by one.
	countyData[state][county]['tracts'] += 1
	# Increase the county pop by the pop in this census tract.
	countyData[state][county]['pop'] += int(pop)

# TODO: Open a new text file and write the contents of countyData to it.

最后的两行代码执行实际的计算工作，在 for 循环的每次迭代中，针对当前的县，增加 tracts 的值，并增加 pop 的值。其他代码存在是因为，只有 countyData 中存在的键，你才能引用它的值。（也就是说，如果'AK'键不存在，countyData['AK']['Anchorage']['tracts'] += 1 将导致一个错误）。为了确保州简称的键存在，你需要调用 setdefault()方法，在 state 还不存在时设置一个默认值。正如 countyData 字典需要一个字典作为每个州缩写的值，这样的字典又需要一个字典，作为每个县的键的值。这样的每个字典又需要键'tracts'和'pop'，它们的初始值为整数 0（如果这个字典的结构令你混淆，回去看看本节开始处字典的例子）。如果键已经存在，setdefault()不会做任何事情，因此在 for 循环的每次迭代中调用它不会有问题。

将结果写入文件

for 循环结束后，countyData 字典将包含所有的人口和普查区信息，以县和州为键。这时，你可以编写更多代码，将数据写入文本文件或另一个 Excel 电子表格。目前，我们只是使用 pprint.pformat()函数，将变量字典的值作为一个巨大的字符串，写入文件 census2010.py。在程序的末尾加上以下代码（确保它没有缩进，这样它就在 for 循环之外）：

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.
--snip--

for row in range(2, sheet.max_row + 1):
--sinp--

# Open a new text file and write the contents of countyData to it.
print('Writing results...')
resultFile = open('census2010.py', 'w')
resultFile.write('allData = ' + pprint.pformat(countyData))
resultFile.close()
print('Done.')

pprint.pformat()函数产生一个字符串，它本身就是格式化好的、有效的 Python 代码。将它输出到文本文件 census2010.py，你就通过 Python 程序生成了一个 Python 程序！这可能看起来有点复杂，但好处是你现在可以导入 census2010.py，就像任何其他 Python 模块一样。在交互式环境中，将当前工作目录变更到新创建的文件所在的文件夹（在我的笔记本上，就是 C:\Python34），然后导入它：

>>> import os 
>>> os.chdir('C:\\Python34') 
>>> import census2010 
>>> census2010.allData['AK']['Anchorage'] 
{'pop': 291826, 'tracts': 55} 
>>> anchoragePop = census2010.allData['AK']['Anchorage']['pop'] 
>>> print('The 2010 population of Anchorage was ' + str(anchoragePop)) 
The 2010 population of Anchorage was 291826

readCensusExcel.py程序是可以扔掉的代码：当你把它的结果保存为census2010.py 之后，就不需要再次运行该程序了。任何时候，只要需要县的数据，就可以执行 import census2010。手工计算这些数据可能需要数小时，这个程序只要几秒钟。利用 OpenPyXL，可以毫无困难地提取保存在 Excel 电子表格中的信息，并对它进行计算。从 http://nostarch. com/automatestuff/可以下载这个完整的程序。

类似程序的思想

许多公司和组织机构使用 Excel 来保存各种类型的数据，电子表格会变得庞大，这并不少见。解析 Excel 电子表格的程序都有类似的结构：它加载电子表格文件，准备一些变量或数据结构，然后循环遍历电子表格中的每一行。这样的程序可以做下列事情： • 比较一个电子表格中多行的数据。

• 打开多个 Excel 文件，跨电子表格比较数据。

• 检查电子表格是否有空行或无效的数据，如果有就警告。

• 从电子表格中读取数据，将它作为 Python 程序的输入。

完整代码：

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.

import openpyxl, pprint
print('Opening workbook...')
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb['Population by Census Tract']
countyData = {}

# TODO: Fill in countyData with each county's population and tracts.
print('Reading rows...')
for row in range(2, sheet.max_row + 1):
	# Each row in the spreadsheet has data for one census tract.
	state = sheet['B' + str(row)].value
	county = sheet['C' + str(row)].value
	pop = sheet['D' + str(row)].value

	# Make sure the key for this state exists.
	countyData.setdefault(state, {})
	# Make sure the key for this county in this state exists.
	countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})

	# Each row represents one census tract, so increment by one.
	countyData[state][county]['tracts'] += 1
	# Increase the county pop by the pop in this census tract.
	countyData[state][county]['pop'] += int(pop)

# Open a new text file and write the contents of countyData to it.
print('Writing results...')
resultFile = open('census2010.py', 'w')
resultFile.write('allData = ' + pprint.pformat(countyData))
resultFile.close()
print('Done.')

项目：从电子表格中读取数据

猜你喜欢