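Python Data Analysis 4: Data Structures, Functions, and Files

4. Python Data Structures, Functions, and Files

Data Structures

Tuples

A tuple is a fixed-length, immutable sequence of Python objects.

Creating tuples: the simplest way is a comma-separated sequence of values; when defining tuples inside more complicated expressions, it is best to enclose the values in parentheses; tuple converts any sequence or iterator into a tuple; elements are accessed with square brackets.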

tup = 4, 5, 6
tup # (4, 5, 6)

nested_tup = (4, 5, 6), (7, 8)
nested_tup # ((4, 5, 6), (7, 8))

tuple([4, 0, 2]) # (4, 0, 2)

tup = tuple('string')
tup # ('s', 't', 'r', 'i', 'n', 'g')

tup[0] # 's'

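Modifying tuples: once a tuple is created, you cannot change which object is stored in each slot. However, if an object inside the tuple is mutable, such as a list, it can be modified in place.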

tup = tuple(['foo', [1, 2], True])
tup[1].append(3)
tup # ('foo', [1, 2, 3], True)
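
The distinction above can be sketched as follows: rebinding a slot of the tuple raises TypeError, while mutating the list stored in that slot succeeds. The variable rebind_error is introduced here only for illustration.

```python
tup = ('foo', [1, 2], True)

# Rebinding a slot is not allowed: tuples do not support item assignment.
try:
    tup[2] = False
    rebind_error = None
except TypeError as exc:
    rebind_error = exc

# Mutating an object stored inside the tuple is fine.
tup[1].append(3)

print(type(rebind_error).__name__)  # TypeError
print(tup)                          # ('foo', [1, 2, 3], True)
```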

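Concatenating tuples: + concatenates tuples; * concatenates that many copies of the tuple (the objects themselves are not copied, only the references to them).

(4, None, 'foo') + (6, 0) + ('bar',) # (4, None, 'foo', 6, 0, 'bar')
('foo', 'bar') * 4 # ('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')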

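Unpacking tuples: assigning a tuple to a tuple-like expression of variables unpacks its values, and nested tuples can be unpacked too. This makes it easy to swap variable names (other languages need a temporary variable) and to iterate over sequences of tuples or lists. The extended form uses the special syntax *rest to pluck a few elements from the start of a tuple and collect the remainder in a list.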

tup = (4, 5, 6)
a, b, c = tup
b # 5

tup = 4, 5, (6, 7)
a, b, (c, d) = tup
d # 7

# swap variable names
a, b = 1, 2 # a = 1, b = 2
b, a = a, b # a = 2, b = 1

# iterate over a sequence of tuples or lists
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for a, b, c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9

# extended tuple unpacking
values = 1, 2, 3, 4, 5
a, b, *rest = values
a, b # (1, 2)
rest # [3, 4, 5]

# the name rest is arbitrary; unwanted variables are conventionally written as an underscore
a, b, *_ = values
_ # [3, 4, 5]

Tuple methods: count (also available on lists) counts the occurrences of a value.

a = (1, 2, 2, 2, 3, 4, 2)
a.count(2) # 4

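Lists

In contrast to tuples, lists are variable-length and their contents can be modified.

Definition: use square brackets or the list function (list is often used in data processing to materialize an iterator or generator).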

a_list = [2, 3, 7, None]

tup = ('foo', 'bar', 'baz')
b_list = list(tup)

# materialize a lazy iterable (here a range object)
gen = range(10)
list(gen) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

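Adding and removing elements:

append: add to the end
insert: insert at a given position
pop: remove and return the element at a given position
remove: find the first occurrence of a value and remove it
in and not in: check whether a list contains a value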

b_list[1] = 'peekaboo' # b_list = ['foo', 'peekaboo', 'baz']
b_list.append('dwarf') # b_list = ['foo', 'peekaboo', 'baz', 'dwarf']

b_list.insert(1, 'red') # b_list = ['foo', 'red', 'peekaboo', 'baz', 'dwarf']

b_list.pop(2) # returns 'peekaboo'; b_list = ['foo', 'red', 'baz', 'dwarf']

b_list.append('foo') # b_list = ['foo', 'red', 'baz', 'dwarf', 'foo']
b_list.remove('foo') # b_list = ['red', 'baz', 'dwarf', 'foo']

'dwarf' in b_list # True
'dwarf' not in b_list # False

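Concatenating and combining lists: use + or extend. + builds a new list and copies the elements over, while extend appends to the existing list in place, so extend is usually preferable when building up a large list.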

[4, None, 'foo'] + [7, 8, (2, 3)] # [4, None, 'foo', 7, 8, (2, 3)]

x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
x # [4, None, 'foo', 7, 8, (2, 3)]


everything = []
for chunk in list_of_lists:
    everything.extend(chunk)

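Sorting: the sort method sorts a list in place without creating a new object. It has a few options; the most common is key, a function that produces the value to sort by.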

a = [7, 2, 5, 1, 3]
a.sort()
a # [1, 2, 3, 5, 7]

b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b # ['He', 'saw', 'six', 'small', 'foxes']
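
As a small sketch of the in-place behavior described above: sort returns None, so its result should not be assigned, and it also accepts reverse alongside key. The variable names here are illustrative.

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# sort works in place and returns None.
result = words.sort(key=len, reverse=True)
print(result)  # None
print(words)   # ['small', 'foxes', 'saw', 'six', 'He']

# Case-insensitive alphabetical order, using str.lower as the sort key.
words.sort(key=str.lower)
print(words)   # ['foxes', 'He', 'saw', 'six', 'small']
```

The sort is stable, so elements with equal keys keep their previous relative order.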

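Binary search: the bisect module implements binary search and insertion into a sorted list. It does not check whether the list is actually sorted, so using it on an unsorted list succeeds but may give incorrect results. bisect.bisect finds the position where a value should be inserted to keep the list sorted, and bisect.insort inserts the value at that position.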

import bisect

c = [1, 2, 2, 2, 3, 4, 7]
bisect.bisect(c, 5) # 6
bisect.insort(c, 6)
c # [1, 2, 2, 2, 3, 4, 6, 7]
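
One common use of bisect is mapping a number to a bin. The sketch below assumes hypothetical bin edges and labels chosen only for illustration; they are not part of the original notes.

```python
import bisect

# Hypothetical bin edges and labels, for illustration only.
breakpoints = [60, 70, 80, 90]
labels = ['F', 'D', 'C', 'B', 'A']

def grade(score):
    # bisect returns how many breakpoints are <= score,
    # which is exactly the index of the matching label.
    return labels[bisect.bisect(breakpoints, score)]

grades = [grade(s) for s in [33, 60, 77, 89, 90, 100]]
print(grades)  # ['F', 'D', 'C', 'B', 'A', 'A']
```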

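Slicing: selects a section of most sequence types. The basic form is start:stop; the element at start is included and the element at stop is not.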

seq = [7, 2, 3, 7, 5, 6, 0, 1]

seq[1:5] # [2, 3, 7, 5]

# assignment
seq[3:4] = [6, 3] 
seq # [7, 2, 3, 6, 3, 5, 6, 0, 1]

# start or stop can be omitted, defaulting to the start or end of the sequence
seq[:5] # [7, 2, 3, 6, 3]
seq[3:] # [6, 3, 5, 6, 0, 1]

# negative indices slice relative to the end
seq[-4:] # [5, 6, 0, 1]
seq[-6:-2] # [6, 3, 5, 6]

# a step can be given after a second colon
seq[::2] # every other element: [7, 3, 3, 6, 1]
seq[::-1] # reverse a list or tuple: [1, 0, 6, 5, 3, 6, 3, 2, 7]

Sequence functions:

enumerate: keeps track of the index while iterating over a sequence

# tracking the index by hand (collection stands for any iterable)
i = 0
for value in collection:
    # do something with value
    i += 1

# with enumerate
for i, value in enumerate(collection):
    # do something with value
    pass

# for example, map each value to its position
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping # {'foo': 0, 'bar': 1, 'baz': 2}

sorted: returns a new sorted list from the elements of any sequence (it accepts the same arguments as sort)

sorted([7, 1, 2, 6, 0, 3, 2]) # [0, 1, 2, 2, 3, 6, 7]
sorted('horse race') # [' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']

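zip: pairs up the elements of several lists, tuples, or other sequences; it returns an iterator of tuples, which list turns into a list of tuples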

seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2)
list(zipped) # [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

# zip accepts any number of sequences; the number of tuples is determined by the shortest one
seq3 = [False, True]
list(zip(seq1, seq2, seq3)) # [('foo', 'one', False), ('bar', 'two', True)]


# combined with enumerate to iterate over several sequences at once
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))
0: foo, one
1: bar, two
2: baz, three

# unzip a sequence of pairs
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ('Schilling', 'Curt')]
first_names, last_names = zip(*pitchers)
first_names # ('Nolan', 'Roger', 'Schilling')
last_names # ('Ryan', 'Clemens', 'Curt')

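reversed: iterates over a sequence in reverse order. It returns an iterator, so the reversed sequence is not created until it is materialized (for example with list or a for loop).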

list(reversed(range(10))) # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

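Dictionaries

Also called a hash map or associative array. A dict is a variable-size collection of key-value pairs in which both keys and values are Python objects.

Definition: use curly braces, with colons separating keys and values; elements can be accessed, inserted, or set with square brackets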

empty_dict = {}

d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}

d1[7] = 'an integer'
d1 # {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

d1['b'] # [1, 2, 3, 4]

Methods:

in checks whether the dict contains a key

'b' in d1 # True

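del and pop (which returns the value and deletes the key) remove entries. Deleting the same key twice raises KeyError, so the example works on a copy and leaves d1 intact:

d2 = d1.copy()
del d2[7] # d2 = {'a': 'some value', 'b': [1, 2, 3, 4]}

d2.pop('b') # [1, 2, 3, 4]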

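keys and values return views of the dict's keys and values. Since Python 3.7 a dict preserves insertion order, and the two methods always return keys and values in the same order as each other.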

list(d1.keys()) # ['a', 'b', 7]
list(d1.values()) # ['some value', [1, 2, 3, 4], 'an integer']

update: merges one dict into another in place; values for existing keys are overwritten

# d1 =  {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
d1.update({'b' : 'foo', 'c' : 12})
d1 # {'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

Creating dicts from sequences: pair up two sequences element-wise into a dict

mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value

# or
mapping = dict(zip(range(5), reversed(range(5))))
mapping # {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

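Default values: the get and setdefault methods and the defaultdict class

# common logic
if key in some_dict:
    value = some_dict[key]
else:
    value = default_value
# can be shortened to
value = some_dict.get(key, default_value) # without a default, get returns None for a missing key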

# categorize words by their first letter
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}

for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)

by_letter # {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

# the same with setdefault (start again from an empty dict)
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)

# the defaultdict class from the collections module
from collections import defaultdict

by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
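
A brief sketch comparing the two approaches above, with one behavior of defaultdict that is easy to miss: merely looking up a missing key inserts it. The names by_letter_plain and by_letter_dd are introduced here only for illustration.

```python
from collections import defaultdict

words = ['apple', 'bat', 'bar', 'atom', 'book']

by_letter_plain = {}
for word in words:
    by_letter_plain.setdefault(word[0], []).append(word)

by_letter_dd = defaultdict(list)
for word in words:
    by_letter_dd[word[0]].append(word)

# Both approaches build the same mapping.
same_content = dict(by_letter_dd) == by_letter_plain
print(same_content)  # True

# Looking up a missing key in a defaultdict creates it with the default value.
missing = by_letter_dd['z']
print(missing)               # []
print('z' in by_letter_dd)   # True

# get on a plain dict does not insert anything.
print(by_letter_plain.get('z'))   # None
print('z' in by_letter_plain)     # False
```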

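Valid key types: values can be any Python object, but keys generally have to be immutable objects such as scalar types (int, float, string) or tuples (all the objects in the tuple must be immutable too). The technical term is hashability. The hash function checks whether an object is hashable (and can therefore be used as a dict key).

hash('string') # an integer such as 5023931463650008331; string hashes vary between interpreter runs
hash((1, 2, [2, 3]))  # raises TypeError because lists are mutable and unhashable

# to use a list as a key, convert it to a tuple, which is hashable as long as its elements are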
d = {}
d[tuple([1, 2, 3])] = 5
d # {(1, 2, 3): 5}
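
The hashability check described above can be wrapped in a small helper. is_hashable is a hypothetical name introduced for this sketch, not a built-in function.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('string'))         # True
print(is_hashable((1, 2, (3, 4))))   # True
print(is_hashable((1, 2, [3, 4])))   # False, the tuple contains a list
print(is_hashable([1, 2, 3]))        # False
```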

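Sets

A set is an unordered collection of unique elements, comparable to a dict with keys but no values.
Definition: use the set function or a set literal with curly braces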

set([2, 2, 2, 1, 3, 3]) # {1, 2, 3}
{2, 2, 2, 1, 3, 3} # {1, 2, 3}

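Operations: sets support mathematical set operations such as union, intersection, difference, and symmetric difference, both as methods and as operators.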

a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

a.union(b) # {1, 2, 3, 4, 5, 6, 7, 8}
a | b # {1, 2, 3, 4, 5, 6, 7, 8}

a.intersection(b) # {3, 4, 5}
a & b # {3, 4, 5}

c = a.copy()
c |= b
c # {1, 2, 3, 4, 5, 6, 7, 8}
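
For completeness, a short sketch of difference and symmetric difference, which the examples above do not cover, together with an in-place intersection. It reuses the same two example sets.

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

only_in_a = a - b        # same as a.difference(b)
in_exactly_one = a ^ b   # same as a.symmetric_difference(b)

c = a.copy()
c &= b                   # in-place intersection, same as c.intersection_update(b)

print(only_in_a)         # {1, 2}
print(in_exactly_one)    # {1, 2, 6, 7, 8}
print(c)                 # {3, 4, 5}
```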

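Element immutability: like dict keys, set elements generally must be immutable (see the valid key types for dicts). To store list-like elements, convert them to tuples.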

my_data = [1, 2, 3, 4]
my_set = {tuple(my_data)}
my_set # {(1, 2, 3, 4)}

Subset, superset, and equality: issubset, issuperset, ==

a_set = {1, 2, 3, 4, 5}

{1, 2, 3}.issubset(a_set) # True
a_set.issuperset({1, 2, 3}) # True
{1, 2, 3} == {3, 2, 1} # True

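List, Dict, and Set Comprehensions

List comprehension: builds a new list by filtering the elements of a collection and transforming the elements that pass the filter.

list_comp = [expr for val in collection if condition]

# equivalent to
result = []
for val in collection:
    if condition:
        result.append(expr)

# for example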
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
list_comp = [x.upper() for x in strings if len(x) > 2] 
list_comp # ['BAT', 'CAR', 'DOVE', 'PYTHON']

Dict comprehension

dict_comp = {key_expr: value_expr for value in collection if condition}

# for example, build a lookup map from each string to its position in the list
loc_mapping = {val: index for index, val in enumerate(strings)}
loc_mapping # {'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

Set comprehension: like a list comprehension, but with curly braces

set_comp = {expr for value in collection if condition}

# for example, collect just the lengths of the strings
unique_lengths = {len(x) for x in strings}
unique_lengths # {1, 2, 3, 4, 6}

# the map function expresses this even more compactly
set(map(len, strings)) # {1, 2, 3, 4, 6}

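Nested list comprehensions: useful for lists of lists. The for parts are written in the order of nesting, and any filter condition goes at the end.

Example 1: build a single list of all names that contain two or more e's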

all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'], ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

# with a for loop
names_of_interest = []
for names in all_data:  
    enough_es = [name for name in names if name.count('e') >= 2]
    names_of_interest.extend(enough_es)

# the whole operation as a single nested list comprehension
result = [name for names in all_data for name in names if name.count('e') >= 2]
result # ['Steven']

Example 2: flatten a list of integer tuples into a flat list of integers

some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# with a for loop
flattened = []
for tup in some_tuples:
    for x in tup:
        flattened.append(x)

# nested list comprehension
flattened = [x for tup in some_tuples for x in tup]
flattened # [1, 2, 3, 4, 5, 6, 7, 8, 9]

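Any number of levels of nesting is allowed, but with more than two or three levels, consider whether the code is still readable. A list comprehension inside a list comprehension is a different construct: it produces a list of lists instead of a flattened list.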

[[x for x in tup] for tup in some_tuples] # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

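Functions

Functions are the primary and most important means of organizing and reusing code in Python.

A function is declared with def and returns a value with return; if execution reaches the end without a return statement, None is returned.

Arguments

A function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments. In the function below, x and y are positional arguments and z is a keyword argument.

Keyword arguments must follow the positional arguments (if any).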

def my_function(x, y, z=1.5): 
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# ways to call it
my_function(5, 6, z=0.7)
my_function(3.14, 7, 3.5)
my_function(10, 20)

# keyword arguments can be given in any order
my_function(x=5, y=6, z=7)
my_function(y=6, x=5, z=7)

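Namespaces and scope

Functions can access variables in two different scopes: global and local.

Any variable assigned within a function is by default assigned to the local namespace, which is created when the function is called and destroyed when the function finishes.

A function can read a global variable directly, but to assign to one it must declare the name with the global keyword.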
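
A minimal sketch of the scoping rules just described. The function names local_only and bind_global are illustrative and not from the original notes.

```python
a = None  # module-level (global) name

def local_only():
    a = []          # creates a new local name; the global a is untouched
    a.append(1)
    return a

def bind_global():
    global a        # assignments to a now target the global namespace
    a = []
    a.append(1)

local_result = local_only()
print(local_result)  # [1]
print(a)             # None, the global was not changed

bind_global()
print(a)             # [1]
```

Heavy use of global is generally discouraged; global variables are best reserved for state that genuinely belongs to the whole module.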

Returning multiple values

A Python function can return multiple values (packed into a tuple), or return a dict instead.

def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f() # a = 5, b = 6, c = 7
return_value = f() # (5, 6, 7)

def f():
    a = 5
    b = 6
    c = 7
    return {'a' : a, 'b' : b, 'c' : c}

a, b, c = f() # unpacking a dict yields its keys: a = 'a', b = 'b', c = 'c'
return_value = f() # {'a': 5, 'b': 6, 'c': 7}

Functions as objects

Take data cleaning as an example.

states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south carolina##', 'West virginia?']

import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result
clean_strings(states)

Alternatively, put the operations (functions) to apply into a list, as list elements

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)
clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

clean_strings(states, clean_ops)

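clean_strings is now more reusable.

Functions can be passed as arguments to other functions, such as the built-in map function, which applies a function to a collection of data.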

for x in map(remove_punctuation, states):
    print(x)

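Anonymous functions

Python supports anonymous, or lambda, functions.

A lambda function consists of a single expression whose result is the return value, and is defined with the lambda keyword.

Unlike a function declared with def, a lambda function is not given an explicit name: its __name__ attribute is the generic '<lambda>'.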

def short_function(x):
    return x * 2
    
# equivalent to
equiv_anon = lambda x: x * 2
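
As a quick check of how the two definitions differ, the sketch below inspects the __name__ attribute of each function object; both compute the same result.

```python
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2

print(short_function(21), equiv_anon(21))  # 42 42
print(short_function.__name__)             # short_function
print(equiv_anon.__name__)                 # <lambda>
```

The generic name is one reason lambdas are harder to identify in tracebacks than functions declared with def.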

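Many data transformation functions take functions as arguments. Passing a lambda directly is often less typing (and clearer) than writing a full function declaration, or even than assigning the lambda to a local variable.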

def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)

Sort a collection of strings by the number of distinct letters in each string.

strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

strings.sort(key=lambda x: len(set(x)))

strings # ['aaaa', 'foo', 'abab', 'bar', 'card']

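Currying: partial argument application

Currying, as the term is used here, means deriving new functions from existing ones by partial argument application.

Given a function that adds two numbers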

def add_numbers(x, y):
    return x + y

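From this function we can derive a new function of one variable, add_five, which adds 5 to its argument; the second argument of add_numbers is said to be curried.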

add_five =  lambda y: add_numbers(5, y)

This simply defines a new function that calls an existing function. The partial function from the built-in functools module simplifies this process.

from functools import partial
add_five = partial(add_numbers, 5)
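
A short usage sketch for partial. Besides fixing a positional argument, it can also fix a keyword argument; int_from_binary is a name made up for this example.

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# Fix the first positional argument to 5.
add_five = partial(add_numbers, 5)
print(add_five(10))  # 15

# Fix a keyword argument of the built-in int.
int_from_binary = partial(int, base=2)
print(int_from_binary('1010'))  # 10
```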

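Generators and generator expressions

Having a consistent way to iterate over sequences, such as the objects in a list or the lines in a file, is an important Python feature. This is accomplished by means of the iterator protocol, a generic way to make objects iterable.

Iteration can be done with iterators, generators, and generator expressions.

An iterator is any object that yields objects to the Python interpreter when used in a context such as a for loop. Most methods that expect a list or list-like object also accept any iterable object, including built-in functions such as min, max, and sum, and type constructors such as list and tuple: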

some_dict = {'a': 1, 'b': 2, 'c': 3}
dict_iterator = iter(some_dict)
list(dict_iterator) # ['a', 'b', 'c']
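
To make the iterator protocol a little more concrete, the sketch below drives an iterator by hand with next, which is roughly what a for loop does internally. The variable exhausted is introduced only for illustration.

```python
some_dict = {'a': 1, 'b': 2, 'c': 3}
it = iter(some_dict)

first = next(it)
second = next(it)
third = next(it)
print(first, second, third)  # a b c

# Once the iterator is exhausted, next raises StopIteration;
# a for loop catches this exception to know when to stop.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)  # True
```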

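Generators

A generator is a concise way to construct a new iterable object. Whereas a normal function executes and returns a single result, a generator returns a sequence of values lazily, pausing after each one until the next one is requested. To create a generator, use the yield keyword instead of return in a function.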

def square(n=10):
    for i in range(1,n+1):
        yield i ** 2
gen = square()

When the generator function is called, none of its code is executed immediately:

gen # <generator object square at 0x0000017FF9893CA8>

It is not until you request elements from the generator that it begins executing its code:

for i in gen:
    print(i)
1
4
9
16
25
36
49
64
81
100

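Generator expressions

Another, more concise way to make a generator is with a generator expression, the generator analogue of list, dict, and set comprehensions. To create one, enclose what would otherwise be a list comprehension in parentheses instead of brackets:

gen = (x ** 2 for x in range(100))

# equivalent to
def _make_gen():
    for x in range(100):
        yield x ** 2
gen = _make_gen()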

Generator expressions can be used instead of list comprehensions as function arguments

sum(x ** 2 for x in range(100)) # 328350

dict((i, i **2) for i in range(5)) # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
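
One consequence of lazy evaluation that is worth seeing once: a generator can be consumed only a single time. A small sketch, with illustrative variable names:

```python
squares = (x ** 2 for x in range(5))

first_pass = list(squares)    # consumes the generator
second_pass = list(squares)   # nothing is left to yield

print(first_pass)   # [0, 1, 4, 9, 16]
print(second_pass)  # []
```

If the values are needed more than once, materialize them into a list first or recreate the generator.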

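The itertools module

The standard library itertools module has a collection of generators for many common data algorithms.

For example, groupby takes any sequence and a function, and groups consecutive elements of the sequence by the function's return value.

# group names by first letter
import itertools
first_letter = lambda x: x[0]

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group)) # group is an iterator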
A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
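
Because groupby only groups consecutive elements, 'A' appears twice in the output above. If one group per key is wanted, a common approach is to sort by the same key first. A sketch, where grouped is a name introduced for illustration:

```python
import itertools

def first_letter(name):
    return name[0]

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# Sorting by the grouping key makes equal keys adjacent.
grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted(names, key=first_letter),
                                           key=first_letter)
}
print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```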

Other commonly used itertools functions include combinations, permutations, and product.
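
A brief sketch of three other generators from itertools: combinations, permutations, and product. The inputs are arbitrary small examples.

```python
import itertools

# All 2-element combinations, ignoring order.
combos = list(itertools.combinations([1, 2, 3], 2))
print(combos)  # [(1, 2), (1, 3), (2, 3)]

# All 2-element permutations, where order matters.
perms = list(itertools.permutations([1, 2, 3], 2))
print(perms)   # [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]

# Cartesian product of two iterables.
pairs = list(itertools.product('ab', [0, 1]))
print(pairs)   # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
```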

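Errors and exception handling

Handling Python errors and exceptions gracefully is an important part of building robust programs.

Handle exceptions with try and except:

# float converts a string to a floating-point number, but raises ValueError on improper input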
float('1.2345') # 1.2345
float('something') # ValueError 
float((1, 2)) # TypeError 

# suppress any exception raised by float and return the argument unchanged
def attempt_float(x):
    try:
        return float(x)
    except Exception:
        return x

attempt_float('1.2345') # 1.2345
attempt_float('something') # 'something'
attempt_float((1, 2)) # (1, 2)

# handle only ValueError
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x
        
attempt_float((1, 2)) # TypeError

Catch multiple exception types with a tuple:

def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

To run some code regardless of whether the try block succeeds, use finally; an else block runs only when the try block succeeds:

f = open(path, 'w')
try:
    write_to_file(f)
except:
    print('Failed')
else:
    print('Succeeded')
finally:
    f.close()
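
The snippet above relies on path and write_to_file being defined elsewhere. A self-contained sketch of the same try/except/else/finally flow follows; write_report and the temporary file name are hypothetical, chosen only to make the example runnable.

```python
import os
import tempfile

def write_report(path, lines):
    """Write lines to path; return (succeeded, file_was_closed)."""
    f = open(path, 'w')
    try:
        f.writelines(lines)
    except TypeError:
        # writelines raises TypeError if an item is not a string
        ok = False
    else:
        # runs only if the try block raised nothing
        ok = True
    finally:
        # runs in both cases, so the file is always closed
        f.close()
    return ok, f.closed

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'report.txt')
    good = write_report(path, ['a\n', 'b\n'])
    bad = write_report(path, [1, 2])

print(good)  # (True, True)
print(bad)   # (False, True)
```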

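Files and the operating system

Python has built-in file handling; the most commonly used file methods include read, readlines, write, writelines, close, flush, seek, and tell.

Opening files: open, close, EOL, and the with statement

Opening a file: use the open function; the default mode is read-only ('r')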

f = open(path) # f = open(path, 'r')

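The available modes are r (read-only), w (write-only, truncating an existing file), x (write-only, failing if the file already exists), a (append), r+ (read and write), b (binary, combined with another mode as in rb), and t (text, the default).

The file handle f can be treated like a list, for example by iterating over its lines: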

for line in f:
    pass

Lines read from a file keep their end-of-line (EOL) marker; stripping it gives a list of lines without EOL:

lines = [x.rstrip() for x in open(path)]

If you create a file object with open, be sure to close it explicitly when you are done.

f.close()

With a with statement, the file is closed automatically on leaving the block.

with open(path) as f:
    lines = [x.rstrip() for x in f]

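Reading files: read, seek, and tell

For readable files, some of the most commonly used methods are read, seek, and tell.

read returns a given number of characters from the file; what counts as a character is determined by the file's encoding, or is simply a raw byte in binary mode. read advances the file handle's position by the number of bytes read.
tell gives the current position.
seek changes the file position to the indicated byte in the file.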

f = open(path)
f2 = open(path, 'rb') 

# check the default encoding with the sys module
import sys
sys.getdefaultencoding() # 'utf-8'

# the characters read depend on the file encoding; a file opened in binary mode yields raw bytes
f.read(10) # 'Sueña el r'
f2.read(10) # b'Sue\xc3\xb1a el '

# tell
f.tell() # 11: with the default encoding, it took 11 bytes to decode 10 characters
f2.tell() # 10

# seek
f.seek(3) # 3

f.close()
f2.close()
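
The example above depends on an existing file at path. The sketch below reproduces the binary-mode behavior with a temporary file; the sample text and file name are assumptions made for illustration. In text mode, tell returns an opaque position, so its value is printed here without a claimed result.

```python
import os
import tempfile

text = 'Sueña el rico en su riqueza\n'  # assumed sample content

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Text mode: read counts characters.
    with open(path, encoding='utf-8') as f:
        chars = f.read(10)
        text_pos = f.tell()      # opaque position in text mode

    # Binary mode: read, tell, and seek all count bytes.
    with open(path, 'rb') as f2:
        raw = f2.read(10)
        byte_pos = f2.tell()
        f2.seek(3)               # jump to the first byte of the encoded ñ
        after_seek = f2.read(2)

print(chars)       # Sueña el r
print(text_pos)
print(raw)         # b'Sue\xc3\xb1a el '
print(byte_pos)    # 10
print(after_seek.decode('utf-8'))  # ñ
```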

Writing files: write and writelines

with open('tmp.txt', 'w') as handle:
    handle.writelines(x for x in open(path) if len(x) > 1)

with open('tmp.txt') as f:
    lines = f.readlines()
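
A self-contained variant of the example above, assuming a small temporary source file whose contents are made up for illustration. It opens both files in one with statement so that the source file is also closed, and copies only the non-blank lines.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'source.txt')   # hypothetical input file
    dst = os.path.join(tmp_dir, 'tmp.txt')

    # write takes a single string.
    with open(src, 'w') as f:
        f.write('first\n\nsecond\n\nthird\n')

    # writelines takes an iterable of strings and adds no line endings itself.
    with open(src) as source, open(dst, 'w') as handle:
        handle.writelines(x for x in source if len(x) > 1)

    with open(dst) as f:
        lines = f.readlines()

print(lines)  # ['first\n', 'second\n', 'third\n']
```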

catOneTwo
