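1. Problem background
Suppose you are the production supervisor of a factory and you have the results of two tests for a batch of microchips. Based on these two results, you want to decide whether each chip should be accepted or rejected. To support that decision, you have a dataset of test results from past chips, from which you can build a logistic regression model.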
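1 Data loading and visualization
- Load the data and visualize it:
## Load the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
print('df.head()')
print(df.head())
## Scatter plot
positive = df[df['accepted'].isin([1])]
negative = df[df['accepted'].isin([0])]
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['test1'], positive['test2'], s=50, c='b', marker='o', label='Accepted')
ax.scatter(negative['test1'], negative['test2'], s=50, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
plt.show()
- This produces the distribution plot shown in Figure 1.
- The plot shows that the two classes cannot be separated by a straight line, so the decision boundary is nonlinear. One way to handle this is to construct polynomial features.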
2 Feature mapping
- Feature mapping scheme:
for i in 0..power:
    for p in 0..i:
        output x^(i-p) * y^p
- Implementation of the mapping:
## Feature mapping function
def feature_mapping(x, y, power, as_ndarray=False):
    """Return the mapped features as an ndarray or a DataFrame."""
    # Equivalent explicit loop (both ranges are inclusive):
    # data = {}
    # for i in np.arange(power + 1):
    #     for p in np.arange(i + 1):
    #         data["f{}{}".format(i - p, p)] = np.power(x, i - p) * np.power(y, p)
    data = {
        "f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
        for i in np.arange(power + 1)
        for p in np.arange(i + 1)
    }
    if as_ndarray:
        return pd.DataFrame(data).values
    else:
        return pd.DataFrame(data)
- This yields the data shown in Figure 2.
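To make the mapping scheme concrete, here is a small sketch that lists the generated feature names for a low degree and counts the features for degree 6. The helper name `polynomial_terms` and the sample inputs are made up for illustration; it uses the same naming rule as `feature_mapping` but returns a plain dict.

```python
import numpy as np

def polynomial_terms(x, y, power):
    """Return {name: values} for every monomial x^(i-p) * y^p with 0 <= p <= i <= power."""
    return {
        "f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
        for i in range(power + 1)
        for p in range(i + 1)
    }

# Hypothetical sample inputs, chosen only to make the products easy to check by hand
x = np.array([2.0, -1.0])
y = np.array([3.0, 0.5])

terms = polynomial_terms(x, y, power=2)
print(list(terms))   # f00, f10, f01, f20, f11, f02
print(terms["f11"])  # x * y, i.e. 6.0 and -0.5

# Degree 6 gives 1 + 2 + ... + 7 = 28 features, including the constant f00
print(len(polynomial_terms(x, y, power=6)))
```

The constant feature f00 equals 1 for every sample, so it plays the role of the intercept column.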
3 Preparing the training features and labels
- Note that the label is now in column 0
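The code in this and the following sections calls `get_y` and `sigmoid`, neither of which is defined in this write-up. A minimal sketch of both follows, assuming that `get_y` returns the last column of `df` (the `accepted` labels) as an ndarray and that `sigmoid` is the standard logistic function. The original helpers may differ.

```python
import numpy as np

def get_y(df):
    """Assumed helper: return the last column of a DataFrame as a 1-D ndarray of labels."""
    return np.array(df.iloc[:, -1])

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1 / (1 + np.exp(-z))

# sigmoid(0) is exactly 0.5; large positive inputs approach 1, large negative inputs approach 0
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```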
## Data preparation
x1 = np.array(df.test1)
x2 = np.array(df.test2)
data = feature_mapping(x1, x2, power=6)
print('data.shape')
print(data.shape)
print('data.head()')
print(data.head())
theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print('X.shape', end=':')
print(X.shape)
# get_y is not defined in this section; it is assumed to return the labels of df as an ndarray
y = get_y(df)
print('y.shape', end=':')
print(y.shape)
4 Cost function with a regularization term
- The regularized cost function is:
$$
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \tag{1}
$$
- Implementation:
## Regularized cost function
# sigmoid is not defined in this section; it is assumed to be 1 / (1 + exp(-z))
def cost(theta, X, y):
    '''The cost is -l(theta), averaged over the samples; this is the quantity to minimize.'''
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))
def regularized_cost(theta, X, y, l=1):
    '''theta_0 is not penalized.'''
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()
    return cost(theta, X, y) + regularized_term
print('Regularized cost at the initial parameters:', regularized_cost(theta, X, y, l=1))
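A quick sanity check for the cost: with all parameters at zero, the model predicts 0.5 for every sample, so each sample contributes -log(0.5) and the cost is ln 2 ≈ 0.693 whatever the data, while the penalty term is zero. The sketch below checks this on synthetic data (the random `X` and `y` are placeholders, not the chip dataset) and also confirms that changing theta_0 alone does not change the penalty.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def regularized_cost(theta, X, y, l=1):
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    penalty = (l / (2 * len(X))) * np.sum(theta[1:] ** 2)  # theta_0 is excluded
    return data_term + penalty

# Synthetic placeholder data: an intercept column plus three random features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y = rng.integers(0, 2, size=20).astype(float)

theta0 = np.zeros(X.shape[1])
print(regularized_cost(theta0, X, y), np.log(2))  # the two values agree

# Only theta_0 is non-zero here, so the value does not depend on lambda
only_bias = np.array([0.7, 0.0, 0.0, 0.0])
print(regularized_cost(only_bias, X, y, l=0), regularized_cost(only_bias, X, y, l=10))
```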
5 Regularized gradient descent
- The partial derivatives of the regularized cost are:
$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \quad \text{for } j \ge 1 \tag{2}
$$
- For $j = 0$ the $\frac{\lambda}{m}\theta_j$ term is omitted, because $\theta_0$ is not penalized.
- Implementation:
## Gradient functions
def gradient(theta, X, y):
    '''Batch gradient of the unregularized cost.'''
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)
def regularized_gradient(theta, X, y, l=1):
    '''As in the cost, theta_0 is left alone.'''
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n
    # prepend a 0 so that no penalty is applied to theta_0
    regularized_term = np.concatenate([np.array([0]), regularized_theta])
    return gradient(theta, X, y) + regularized_term
print('Regularized gradient at the initial parameters:', regularized_gradient(theta, X, y))
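One way to gain confidence in an analytic gradient is to compare it with central finite differences of the cost. The sketch below does this on synthetic data. The helper `numerical_gradient` and the random inputs are assumptions introduced for the check, and the cost and gradient are restated compactly so that the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def regularized_cost(theta, X, y, l=1):
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return data_term + (l / (2 * len(X))) * np.sum(theta[1:] ** 2)

def regularized_gradient(theta, X, y, l=1):
    grad = X.T @ (sigmoid(X @ theta) - y) / len(X)
    grad[1:] += (l / len(X)) * theta[1:]  # no penalty on theta_0
    return grad

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Synthetic placeholder data and a random, non-zero parameter vector
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 3))])
y = rng.integers(0, 2, size=30).astype(float)
theta = rng.normal(scale=0.5, size=X.shape[1])

analytic = regularized_gradient(theta, X, y, l=1)
numeric = numerical_gradient(lambda t: regularized_cost(t, X, y, l=1), theta)
print(np.max(np.abs(analytic - numeric)))  # should be very small
```

If the $x_j^{(i)}$ factor or the $\frac{\lambda}{m}\theta_j$ term were missing from the analytic gradient, the two vectors would differ visibly.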
6 Fitting the parameters
- The fitting procedure is similar to the unregularized case. Implementation:
## Fit the parameters
import scipy.optimize as opt
print('init cost = {}'.format(regularized_cost(theta, X, y)))
res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
print('Fit result:', res, end='\n')
7 Prediction
- Implementation:
## Prediction
from sklearn.metrics import classification_report  # produces the evaluation report
final_theta = res.x
def predict(x, theta):
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)
y_pred = predict(X, final_theta)
print(classification_report(y, y_pred))
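Because the sigmoid is increasing and equals 0.5 at zero, thresholding the probability at 0.5 is the same as checking the sign of $X\theta$, which is why the decision boundary is the set of points where $X\theta = 0$. A small sketch with hand-picked values (the matrix, parameters and labels below are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, theta):
    return (sigmoid(X @ theta) >= 0.5).astype(int)

# Hypothetical design matrix (first column is the intercept) and parameters
X = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, -1.0],
              [1.0, 0.5, 0.0],
              [1.0, -1.0, 1.0]])
theta = np.array([-0.5, 1.0, 2.0])

z = X @ theta                      # 1.5, -2.5, 0.0, 0.5
by_probability = predict(X, theta)
by_sign = (z >= 0).astype(int)
print(by_probability, by_sign)     # both are 1, 0, 1, 1

# Plain accuracy against made-up labels, as a lightweight complement to classification_report
y_true = np.array([1, 0, 0, 1])
print(np.mean(by_probability == y_true))  # 0.75
```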
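8 Finding the decision boundary
- Because the boundary for this dataset is nonlinear and the features have been mapped, it cannot be obtained the way a linear boundary is, by rearranging the equation into a linear relation between the features.
- The decision boundary is therefore located as follows:
  - Fit the parameter vector on the feature-mapped data
  - Generate a large number of points covering the range of the data, compute $X\theta$, and keep the points whose value is close enough to 0 (selected with a threshold)
  - Use the selected points to trace the decision boundary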
- Implementation:
# Logistic regression on mapped features; power sets the polynomial degree
def feature_mapped_logistic_regression(power, l):
    """For drawing purposes only, not a well-generalized logistic regression.
    power: int
        highest polynomial power that x1, x2 are raised to
    l: int
        lambda constant for the regularization term
    """
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)
    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])
    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x
    return final_theta
# Find points on the decision boundary
def find_decision_boundary(density, power, theta, threshold):
    t1 = np.linspace(-1, 1.5, density)
    t2 = np.linspace(-1, 1.5, density)
    coordinates = [(x, y) for x in t1 for y in t2]
    x_cord, y_cord = zip(*coordinates)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a DataFrame
    inner_product = mapped_cord.values @ theta
    decision = mapped_cord[np.abs(inner_product) < threshold]
    # By construction of the feature mapping, f10 and f01 are x and y themselves
    return decision.f10, decision.f01
# Visualize the decision boundary
import seaborn as sns
def draw_boundary(power, l):
    """
    power: polynomial power for the mapped features
    l: lambda constant
    """
    density = 1000
    threshold = 2 * 10 ** -3
    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshold)
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    # Recent seaborn versions require x and y as keywords and use height instead of size
    sns.lmplot(x='test1', y='test2', hue='accepted', data=df, height=6, fit_reg=False,
               scatter_kws={"s": 100})
    plt.scatter(x, y, c='red', s=10)
    plt.title('Decision boundary')
    plt.show()
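The thresholding idea used in `find_decision_boundary` can be sanity-checked on a boundary that is known in closed form. The sketch below assumes a made-up score function $z(x, y) = 1 - x^2 - y^2$, whose zero set is the unit circle, and confirms that the grid points kept by the threshold all lie close to that circle. The grid range and the function are illustrative choices, not part of the chip example.

```python
import numpy as np

def z(x, y):
    """Hypothetical score function; it is zero exactly on the unit circle."""
    return 1.0 - x ** 2 - y ** 2

density = 1000
threshold = 2e-3

# Dense grid over a square that contains the circle
t = np.linspace(-1.5, 1.5, density)
xx, yy = np.meshgrid(t, t)

# Keep only the grid points whose score is within the threshold of zero
near_zero = np.abs(z(xx, yy)) < threshold
bx, by = xx[near_zero], yy[near_zero]

radius = np.sqrt(bx ** 2 + by ** 2)
print(bx.size, radius.min(), radius.max())  # selected points sit in a thin ring around radius 1
```

A smaller threshold gives a thinner ring but fewer points, and a coarser grid can miss the ring entirely, so `density` and `threshold` have to be tuned together.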
- With power=6, this gives the decision boundary shown in Figure 3:
draw_boundary(power=6, l=1)  # lambda=1
- Without regularization, the model overfits:
draw_boundary(power=6, l=0)  # lambda=0: no regularization, overfitting
- If lambda is too large, the penalty is too strong and the model underfits:
draw_boundary(power=6, l=100)  # lambda=100: underfitting
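The effect of lambda can also be seen numerically: a larger lambda pulls the penalized parameters toward zero, which is what flattens the boundary. The sketch below fits synthetic, noisy data (not the chip dataset) for several values of lambda and prints the norm of the penalized parameters. To avoid log(0) when lambda is 0, it writes the same logistic loss in a numerically stable form with `np.logaddexp`, and it uses the BFGS method rather than the Newton-CG or TNC methods used above; both are choices made for this illustration.

```python
import numpy as np
import scipy.optimize as opt

def cost(theta, X, y, l):
    z = X @ theta
    # logaddexp(0, z) - y * z equals -y*log(h) - (1-y)*log(1-h) with h = sigmoid(z)
    data_term = np.mean(np.logaddexp(0, z) - y * z)
    return data_term + (l / (2 * len(X))) * np.sum(theta[1:] ** 2)

def gradient(theta, X, y, l):
    h = 1 / (1 + np.exp(-(X @ theta)))
    grad = X.T @ (h - y) / len(X)
    grad[1:] += (l / len(X)) * theta[1:]  # theta_0 is not penalized
    return grad

# Synthetic placeholder data: noisy labels drawn from a known linear model
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
true_weights = np.array([2.0, -1.5, 1.0, 0.0, 0.5])
prob = 1 / (1 + np.exp(-(features @ true_weights)))
y = (rng.random(200) < prob).astype(float)
X = np.column_stack([np.ones(200), features])

norms = {}
for l in (0, 1, 100):
    res = opt.minimize(cost, np.zeros(X.shape[1]), args=(X, y, l),
                       jac=gradient, method='BFGS')
    norms[l] = np.linalg.norm(res.x[1:])
    print('lambda = {:>3}: norm of theta[1:] = {:.4f}'.format(l, norms[l]))
```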