4.2 Logistic Regression with Regularization

1. Problem Background

Suppose you are the production manager of a factory and you have the results of two different tests for a batch of microchips. Based on these two tests, you want to decide whether each chip should be accepted or rejected. To help you make this decision, you have a dataset of test results for past chips, from which you can build a logistic regression model.

1 Data Loading and Visualization

  1. Load the data and visualize it:
## Load the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
print('df.head()')
print(df.head())
## Scatter plot of the two classes
positive = df[df['accepted'].isin([1])]
negative = df[df['accepted'].isin([0])]

fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(positive['test1'], positive['test2'], s=50, c='b', marker='o', label='Accepted')
ax.scatter(negative['test1'], negative['test2'], s=50, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
plt.show()
  2. This produces the scatter plot shown in Figure 1:

[Figure 1: scatter plot of Test 1 vs. Test 2 scores, accepted vs. rejected chips]
  3. As the plot shows, the decision boundary for this data is non-linear, so we can handle it by constructing polynomial features.

2 Feature Mapping

  1. Feature mapping scheme (map x and y to all polynomial terms up to the given power):
for i in 0..power:
  for p in 0..i:
    output x^(i-p) * y^p
  2. Implementation of the mapping:
## Feature mapping function
def feature_mapping(x, y, power, as_ndarray=False):
    """Return the mapped polynomial features as an ndarray or a DataFrame."""
    # Build every term x^(i-p) * y^p for 0 <= p <= i <= power
    data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
            for i in np.arange(power + 1)
            for p in np.arange(i + 1)}

    if as_ndarray:
        return pd.DataFrame(data).values
    else:
        return pd.DataFrame(data)
  3. This produces the data shown in Figure 2.

[Figure 2: first rows of the mapped feature DataFrame]
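As a quick sanity check (my own example, not part of the original exercise), mapping two features up to power 6 yields (6+1)(6+2)/2 = 28 polynomial terms per sample:

# Hypothetical check of feature_mapping on made-up inputs
x_demo = np.array([0.5, -0.5])
y_demo = np.array([0.25, 0.75])
demo = feature_mapping(x_demo, y_demo, power=6)
print(demo.shape)              # (2, 28): 28 = (6+1)*(6+2)/2 terms
print(list(demo.columns[:6]))  # ['f00', 'f10', 'f01', 'f20', 'f11', 'f02']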

3 Preparing the Training Features and Labels

  1. Note that the label ('accepted') is the last column of the DataFrame and is extracted with get_y:
## Prepare the features and labels
x1 = np.array(df.test1)
x2 = np.array(df.test2)
data = feature_mapping(x1, x2, power=6)
print('data.shape')
print(data.shape)
print('data.head()')
print(data.head())

theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print('X.shape', end=':')
print(X.shape)

y = get_y(df)
print('y.shape', end=':')
print(y.shape)
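The helpers sigmoid and get_y used above and below are carried over from the earlier, non-regularized logistic regression section; a minimal sketch of what they are assumed to do:

## Assumed helper functions from the previous section (sketch, not the original code)
def sigmoid(z):
    # Logistic function 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

def get_y(df):
    # Assumption: the label ('accepted') is the last column of the DataFrame
    return np.array(df.iloc[:, -1])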

4 Regularized Cost Function

  1. The regularized cost function is:

$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta\left(x^{(i)}\right)\right)-\left(1-y^{(i)}\right)\log\left(1-h_\theta\left(x^{(i)}\right)\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \tag{1}$$

  2. Implementation:
## Regularized cost function
def cost(theta, X, y):
    """Cost is -l(theta), the quantity to minimize."""
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))


def regularized_cost(theta, X, y, l=1):
    """Regularized cost; theta_0 is not penalized."""
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()

    return cost(theta, X, y) + regularized_term


print('Regularized cost at the initial parameters:', regularized_cost(theta, X, y, l=1))
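With theta initialized to zeros, the hypothesis is 0.5 for every sample, so the unregularized part of the cost is exactly ln 2 ≈ 0.693 and the penalty term is zero; a quick check (my own addition):

# Sanity check: with theta = 0, sigmoid(X @ theta) = 0.5 everywhere,
# so the cost reduces to -log(0.5) = ln(2) and the penalty vanishes.
assert np.isclose(regularized_cost(theta, X, y, l=1), np.log(2))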

5 Regularized Gradient Descent

  1. The partial derivatives for regularized gradient descent are:

$$\frac{\partial J(\theta)}{\partial \theta_j}=\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta\left(x^{(i)}\right)-y^{(i)}\right)x_j^{(i)}\right)+\frac{\lambda}{m}\theta_j \quad \text{for } j \ge 1 \tag{2}$$

  2. Implementation:
## Gradient functions
def gradient(theta, X, y):
    """Batch gradient of the unregularized cost."""
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)


def regularized_gradient(theta, X, y, l=1):
    """Regularized gradient; theta_0 is left unpenalized."""
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n

    # Prepend a 0 so that no penalty is applied to theta_0
    regularized_term = np.concatenate([np.array([0]), regularized_theta])

    return gradient(theta, X, y) + regularized_term


print('Gradient at the initial parameters:', regularized_gradient(theta, X, y))
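To confirm that the analytic gradient is consistent with the cost function, a small finite-difference check can be run (my own addition; numerical_gradient is a hypothetical helper):

# Finite-difference check of the analytic gradient (illustrative)
def numerical_gradient(f, theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

num_grad = numerical_gradient(lambda t: regularized_cost(t, X, y), theta)
print('max |analytic - numerical| =', np.abs(regularized_gradient(theta, X, y) - num_grad).max())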

6 Fitting the Parameters

  1. Fitting proceeds just like in the unregularized case; implementation:
## Fit the parameters
import scipy.optimize as opt

print('init cost = {}'.format(regularized_cost(theta, X, y)))

res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
print('Optimization result:', res)
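It is also worth confirming that the optimized parameters bring the cost well below the initial ln 2 (an extra check of mine):

# Cost after optimization should be noticeably lower than the initial cost
print('final cost = {}'.format(regularized_cost(res.x, X, y)))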

7 Prediction

  1. Implementation:
## Prediction
from sklearn.metrics import classification_report  # produces a per-class evaluation report

final_theta = res.x

def predict(x, theta):
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)

y_pred = predict(X, final_theta)

print(classification_report(y, y_pred))
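Besides the per-class report, the overall training accuracy can be read off directly (an extra line of mine):

# Overall accuracy on the training set
print('accuracy = {:.2%}'.format((y_pred == y).mean()))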

8 Finding the Decision Boundary

  1. Because this dataset's boundary is non-linear and the features have been mapped, we cannot, as with a linear boundary, obtain a linear relation between the features directly by rearranging the equation.

  2. The approach for finding the decision boundary is therefore:

    1. Use the mapped training data to obtain the optimized parameter vector.
    2. Generate a dense grid of points over the data range, compute $X\theta$, and keep the points whose value is sufficiently close to 0 (selected with a threshold).
    3. The selected points form the decision boundary.
  3. Implementation

# Logistic regression helper; power controls how many polynomial features are generated
def feature_mapped_logistic_regression(power, l):
    #     """for drawing purpose only.. not a well generealize logistic regression
    #     power: int
    #         raise x1, x2 to polynomial power
    #     l: int
    #         lambda constant for regularization term
    #     """
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)

    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])

    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x

    return final_theta
# Find points lying on the decision boundary
def find_decision_boundary(density, power, theta, threshhold):
    t1 = np.linspace(-1, 1.5, density)
    t2 = np.linspace(-1, 1.5, density)

    cordinates = [(x, y) for x in t1 for y in t2]
    x_cord, y_cord = zip(*cordinates)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a dataframe

    inner_product = mapped_cord.values @ theta

    decision = mapped_cord[np.abs(inner_product) < threshhold]
    # By construction of the feature mapping, f10 and f01 are exactly x and y
    return decision.f10, decision.f01
  
# Visualize the decision boundary
import seaborn as sns
def draw_boundary(power, l):
    #     """
    #     power: polynomial power for mapped feature
    #     l: lambda constant
    #     """
    density = 1000
    threshhold = 2 * 10 ** -3

    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshhold)

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    # Note: newer seaborn versions require keyword arguments,
    # e.g. sns.lmplot(x='test1', y='test2', data=df, hue='accepted', height=6, ...)
    sns.lmplot('test1', 'test2', hue='accepted', data=df, size=6, fit_reg=False, scatter_kws={"s": 100})

    plt.scatter(x, y, c='red', s=10)
    plt.title('Decision boundary')
    plt.show()
  4. With power=6 and lambda=1, we obtain the decision boundary shown in Figure 3:
draw_boundary(power=6, l=1)  # lambda=1

[Figure 3: decision boundary with power=6, lambda=1]
  5. Without regularization, the model overfits:
draw_boundary(power=6, l=0)  # lambda=0: no regularization, the boundary overfits

[Figure 4: decision boundary with lambda=0, overfitting]
  6. If lambda is too large, the penalty dominates and the model underfits:
draw_boundary(power=6, l=100)  # lambda=100: underfitting

[Figure 5: decision boundary with lambda=100, underfitting]
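As an alternative to thresholding |Xθ| on a scattered grid, the boundary can also be drawn as the zero-level contour of Xθ evaluated on a mesh; a sketch of that variant (my own addition, reusing feature_mapping and feature_mapped_logistic_regression defined above):

# Alternative: draw the boundary as the zero contour of X @ theta on a grid
def draw_boundary_contour(power, l, density=200):
    final_theta = feature_mapped_logistic_regression(power, l)
    t = np.linspace(-1, 1.5, density)
    xx, yy = np.meshgrid(t, t)
    mapped = feature_mapping(xx.ravel(), yy.ravel(), power, as_ndarray=True)
    zz = (mapped @ final_theta).reshape(xx.shape)

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    plt.scatter(df.test1, df.test2, c=df.accepted, cmap='bwr', s=30)
    plt.contour(xx, yy, zz, levels=[0], colors='green')
    plt.xlabel('Test 1 Score')
    plt.ylabel('Test 2 Score')
    plt.title('Decision boundary (contour)')
    plt.show()

draw_boundary_contour(power=6, l=1)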
