Artificial Intelligence: Day 01

Machine Learning

I. Overview

1. What is machine learning?

  • Artificial intelligence: solving, exactly or approximately, problems that would otherwise require human intelligence, by artificial means; any such approach can be called artificial intelligence.
  • Machine learning: a computer program is called a machine learning system if it gains experience E from performing task T, the effect of that experience can be measured by performance measure P, and its performance at T, as measured by P, improves with experience E.
  • Self-improving, self-correcting, self-reinforcing.

2. Why do we need machine learning?

  1. It simplifies or replaces hand-crafted pattern recognition, making systems easier to develop, maintain, and upgrade.
  2. For problems whose algorithms are too complex, or which have no explicit solution, machine learning systems have a unique advantage.
  3. The learning process can also be run in reverse to uncover the rules hidden behind business data: this is data mining.

3. Types of machine learning

  1. Supervised, unsupervised, semi-supervised, and reinforcement learning
  2. Batch learning and incremental learning
  3. Instance-based learning and model-based learning

4. The machine learning workflow

  • Data collection
  • Data cleaning              (data)
    ...........................
  • Data preprocessing
  • Model selection
  • Model training
  • Model validation           (machine model)
    ..................................
  • Model deployment           (business)
  • Maintenance and upgrades

II. Data Preprocessing

import sklearn.preprocessing as sp

Sample matrix
                 input data (features)             output data
              height   weight   age   gender
 sample 1      1.7       60     25    male     ->     8000
 sample 2      1.5       50     20    female   ->     6000
 ...

  1. Mean removal (standardization)
    Feature A: 10 +- 5
    Feature B: 10000 +- 5000
    On raw scales like these, feature A is swamped by feature B.
    Standardization adjusts each column (feature) of the sample matrix so that its mean is 0 and its standard deviation is 1. Every feature then contributes about equally to the model's prediction, and the model treats the features more evenly.
    Take a column [a b c]:
    m = (a+b+c)/3,  s = sqrt(((a-m)^2 + (b-m)^2 + (c-m)^2)/3)
    Subtract the mean: [a' b' c'] with
    a' = a-m, b' = b-m, c' = c-m
    The new mean is
    m' = (a'+b'+c')/3
       = (a-m + b-m + c-m)/3
       = (a+b+c-3m)/3
       = (a+b+c)/3 - m
       = m - m
       = 0
    Divide by the standard deviation: [a" b" c"] with
    a" = a'/s, b" = b'/s, c" = c'/s
    The mean stays m" = 0, and the new standard deviation is
    s" = sqrt((a"^2 + b"^2 + c"^2)/3)
       = sqrt((a'^2 + b'^2 + c'^2)/(3s^2))
       = sqrt(((a-m)^2 + (b-m)^2 + (c-m)^2)/(3s^2))
       = sqrt(3s^2/(3s^2))
       = 1
    sp.scale(raw sample matrix) -> sample matrix after mean removal
    Code: std.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.preprocessing as sp
    raw_samples = np.array([
        [3, -1.5,  2,   -5.4],
        [0,  4,   -0.3,  2.1],
        [1,  3.3, -1.9, -4.3]])
    print(raw_samples)
    print(raw_samples.mean(axis=0))
    print(raw_samples.std(axis=0))
    # Standardize manually, one column at a time
    std_samples = raw_samples.copy()
    for col in std_samples.T:
        col_mean = col.mean()
        col_std = col.std()
        col -= col_mean
        col /= col_std
    print(std_samples)
    print(std_samples.mean(axis=0))
    print(std_samples.std(axis=0))
    # The same result with sklearn
    std_samples = sp.scale(raw_samples)
    print(std_samples)
    print(std_samples.mean(axis=0))
    print(std_samples.std(axis=0))
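    As a quick numerical check, the standardized columns should have mean 0 and standard deviation 1 up to floating-point error. A minimal sketch, reusing std_samples from above:
    # Verify the guarantees of mean removal
    assert np.allclose(std_samples.mean(axis=0), 0)
    assert np.allclose(std_samples.std(axis=0), 1)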
  2. Range scaling
    Think of exam scores on different scales, 90/150, 80/100, 5/5: they only become comparable after being mapped onto a common range.
    Apply a linear transformation to each column of the sample matrix so that the elements of every column land in the same target interval.
    For each column, find k and b such that y = kx + b maps the column extremes onto the target extremes:
    k * col_min + b = min
    k * col_max + b = max
    As a matrix equation:
    / col_min  1 \   / k \   / min \
    \ col_max  1 / x \ b / = \ max /
    ------a------    --x--    --b--
    x = np.linalg.solve(a, b)
      = np.linalg.lstsq(a, b)[0]
    scaler = sp.MinMaxScaler(
        feature_range=(min, max))
    scaler.fit_transform(raw sample matrix)
        -> sample matrix after range scaling
    Range scaling onto the [0, 1] interval is sometimes also called "normalization".
    Code: mms.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.preprocessing as sp
    raw_samples = np.array([
        [3, -1.5,  2,   -5.4],
        [0,  4,   -0.3,  2.1],
        [1,  3.3, -1.9, -4.3]])
    print(raw_samples)
    # Scale manually: solve for k and b per column, then apply the map
    mms_samples = raw_samples.copy()
    for col in mms_samples.T:
        col_min = col.min()
        col_max = col.max()
        a = np.array([
            [col_min, 1],
            [col_max, 1]])
        b = np.array([0, 1])
        x = np.linalg.solve(a, b)
        col *= x[0]
        col += x[1]
    print(mms_samples)
    # The same result with sklearn
    mms = sp.MinMaxScaler(feature_range=(0, 1))
    mms_samples = mms.fit_transform(raw_samples)
    print(mms_samples)
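    The per-column linear system above also has a closed form, so the same result can be computed directly. A minimal sketch, reusing raw_samples:
    # (x - col_min) / (col_max - col_min) maps each column onto [0, 1]
    span = raw_samples.max(axis=0) - raw_samples.min(axis=0)
    print((raw_samples - raw_samples.min(axis=0)) / span)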
  3. Normalization
    For example, yearly language popularity counts become comparable shares once each row is divided by its total:
              Python   C/C++   Java   PHP
    2016        20       30     40     10    / 100
    2017        30       20     30     10    / 90
    2018        10        5      1      0    / 16
    Divide each feature value of a sample by the sum of the absolute values of all that sample's features, so every feature is expressed as a share of the whole.
    sp.normalize(raw sample matrix, norm='l1')
        -> sample matrix after normalization
    l1 - the l1 norm: the sum of the absolute values of the vector's elements
    l2 - the l2 norm: the square root of the sum of the squared elements
    In general, the lp norm is the p-th root of the sum of the p-th powers of the absolute values.
    Code: nor.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.preprocessing as sp
    raw_samples = np.array([
        [3, -1.5,  2,   -5.4],
        [0,  4,   -0.3,  2.1],
        [1,  3.3, -1.9, -4.3]])
    print(raw_samples)
    # Normalize manually: divide each row by its l1 norm
    nor_samples = raw_samples.copy()
    for row in nor_samples:
        row_absum = abs(row).sum()
        row /= row_absum
    print(nor_samples)
    print(abs(nor_samples).sum(axis=1))
    # The same result with sklearn
    nor_samples = sp.normalize(raw_samples, norm='l1')
    print(nor_samples)
    print(abs(nor_samples).sum(axis=1))
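    The l2 variant works the same way; each row then has unit Euclidean length. A minimal sketch, reusing raw_samples:
    l2_samples = sp.normalize(raw_samples, norm='l2')
    print((l2_samples ** 2).sum(axis=1))  # every row is ~1.0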
  4. Binarization
    Given a threshold, set every element of the sample matrix above it to 1 and every other element to 0, yielding a matrix made up entirely of 1s and 0s.
    binarizer = sp.Binarizer(threshold=threshold)
    binarizer.transform(raw sample matrix)
        -> sample matrix after binarization
    Code: bin.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.preprocessing as sp
    raw_samples = np.array([
        [3, -1.5,  2,   -5.4],
        [0,  4,   -0.3,  2.1],
        [1,  3.3, -1.9, -4.3]])
    print(raw_samples)
    # Binarize manually; zero out the low values first, then set the rest to 1
    bin_samples = raw_samples.copy()
    bin_samples[bin_samples <= 1.4] = 0
    bin_samples[bin_samples > 1.4] = 1
    print(bin_samples)
    # The same result with sklearn
    bin = sp.Binarizer(threshold=1.4)
    bin_samples = bin.transform(raw_samples)
    print(bin_samples)
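    One edge case worth remembering: Binarizer uses a strict greater-than test, so a value exactly at the threshold maps to 0, and transform returns a new array rather than modifying its input. A minimal sketch:
    eq = np.array([[1.4, 1.5]])
    print(sp.Binarizer(threshold=1.4).transform(eq))  # [[0. 1.]]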
    
  5. One-hot encoding
    Encode each feature value as a sequence containing exactly one 1 and some number of 0s. This keeps every detail of the sample matrix while producing a sparse matrix of only 1s and 0s, which improves the model's fault tolerance and saves memory.
    1   3   2
    7   5   4
    1   8   6
    7   3   9
    ----------------------
    per-column code tables:
    1: 10    3: 100    2: 1000
    7: 01    5: 010    4: 0100
             8: 001    6: 0010
                       9: 0001
    ----------------------
    encoded rows:
    101001000
    010100100
    100010010
    011000001
    encoder = sp.OneHotEncoder(
        sparse=whether to return a sparse matrix (default True), dtype=dtype)
    encoder.fit_transform(raw sample matrix)
        -> sample matrix after one-hot encoding
    Code: ohe.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.preprocessing as sp
    raw_samples = np.array([
        [1, 3, 2],
        [7, 5, 4],
        [1, 8, 6],
        [7, 3, 9]])
    print(raw_samples)
    # Build one code table per column
    code_tables = []
    for col in raw_samples.T:
        # code table for a single column
        code_table = {}
        for val in col:
            code_table[val] = None
        code_tables.append(code_table)
    # Assign a one-hot vector to every key in every code table
    for code_table in code_tables:
        size = len(code_table)
        for one, key in enumerate(sorted(
                code_table.keys())):
            code_table[key] = np.zeros(
                shape=size, dtype=int)
            code_table[key][one] = 1
    # Encode the raw sample matrix using the code tables
    ohe_samples = []
    for raw_sample in raw_samples:
        ohe_sample = np.array([], dtype=int)
        for i, key in enumerate(raw_sample):
            ohe_sample = np.hstack(
                (ohe_sample, code_tables[i][key]))
        ohe_samples.append(ohe_sample)
    ohe_samples = np.array(ohe_samples)
    print(ohe_samples)
    # The same result with sklearn
    ohe = sp.OneHotEncoder(sparse=False, dtype=int)
    ohe_samples = ohe.fit_transform(raw_samples)
    print(ohe_samples)
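    A fitted encoder can also transform new rows, as long as every value was seen during fitting. A minimal sketch with a hypothetical sample:
    new_sample = np.array([[7, 8, 2]])
    print(ohe.transform(new_sample))  # [[0 1 0 0 1 1 0 0 0]]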
  6. Label encoding
    Converts text-valued features into numeric ones.
    The numeric codes come from the lexicographic ordering of the label strings and carry no meaning of their own.
    For example, job titles (staff, team lead, manager, boss) or car brands:
    audi   - 0
    bmw    - 1
    ford   - 2
    toyota - 3
    encoder = sp.LabelEncoder()
    encoder.fit_transform(raw sample matrix)
        -> sample matrix after label encoding
    encoder.inverse_transform(label-encoded sample matrix)
        -> raw sample matrix
    Code: lab.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.preprocessing as sp
    raw_samples = np.array([
        'audi', 'ford', 'audi', 'toyota',
        'ford', 'bmw', 'toyota', 'bmw'])
    print(raw_samples)
    lbe = sp.LabelEncoder()
    lbe_samples = lbe.fit_transform(raw_samples)
    print(lbe_samples)
    raw_samples = lbe.inverse_transform(lbe_samples)
    print(raw_samples)
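    The fitted encoder keeps the sorted class list, so new labels can be encoded consistently later. A minimal sketch, reusing lbe from above:
    print(lbe.classes_)                    # ['audi' 'bmw' 'ford' 'toyota']
    print(lbe.transform(['bmw', 'audi']))  # [1 0]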
    

III. Basic Machine Learning Problems

  1. Regression: given known inputs and outputs distributed over a continuous domain, train the model repeatedly until it captures the relationship between them. That relationship can usually be formalized as a function, e.g. y = w0 + w1x + w2x^2 + ...; given an input with unknown output, the function predicts the corresponding output in the continuous domain.
  2. Classification: if the output of a regression problem moves from a continuous domain to a discrete one, the problem becomes a classification problem.
  3. Clustering: find some pattern, such as similarity, in the known inputs, partition the inputs into clusters according to that pattern, and apply the same partitioning rule to new inputs to decide which cluster they belong to.
  4. Dimensionality reduction: from a large number of features, keep the few that matter most to the model's predictions, lowering the dimensionality of the input samples and improving the model's performance.

IV. Univariate Linear Regression

  1. Prediction function
    input  output
    0      1
    1      3
    2      5
    3      7
    4      9
    ...
    y = 1 + 2x
    10 -> 21
    y = w0 + w1x
    The task is to find the model parameters w0 and w1 that capture the relationship between input and output.
  2. Single-sample error
    x -> [w0 + w1x] -> y' (prediction), compared with the true y:
    e = 1/2 (y - y')^2
  3. Total sample error
    E = SIGMA[1/2 (y - y')^2]
  4. Loss function
    Loss(w0, w1) = SIGMA[1/2 (y - (w0 + w1x))^2]
    The task is to find the model parameters w0 and w1 that minimize the loss function.
  5. Optimization by gradient descent
    Pick a random initial pair of model parameters w0 and w1, then repeat:
    1) compute the gradient of the loss at the current parameters:
       [DLoss/Dw0, DLoss/Dw1]
    2) form the correction step against the gradient direction (n is the learning rate):
       [-n DLoss/Dw0, -n DLoss/Dw1]
    3) compute the next pair of parameters:
       w0 = w0 - n DLoss/Dw0
       w1 = w1 - n DLoss/Dw1
    until a termination condition is met:
    enough iterations have run,
    the loss is already small enough, or
    the loss is no longer decreasing noticeably.
    Loss = SIGMA[1/2 (y - y')^2],  y' = w0 + w1x
    DLoss/Dw0
    = SIGMA[D(1/2 (y - y')^2)/Dw0]
    = SIGMA[(y - y') D(y - y')/Dw0]
    = SIGMA[(y - y')(Dy/Dw0 - Dy'/Dw0)]
    = -SIGMA[(y - y') Dy'/Dw0]
    = -SIGMA[(y - y')]
    DLoss/Dw1
    = SIGMA[D(1/2 (y - y')^2)/Dw1]
    ...
    = -SIGMA[(y - y') Dy'/Dw1]
    = -SIGMA[(y - y') x]
    Code: gd.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import matplotlib.pyplot as mp
    from mpl_toolkits.mplot3d import axes3d
    train_x = np.array([0.5, 0.6, 0.8, 1.1, 1.4])
    train_y = np.array([5.0, 5.5, 6.0, 6.8, 7.0])
    n_epoches = 1000  # number of gradient descent iterations
    lrate = 0.01      # learning rate
    epoches, losses = [], []
    w0, w1 = [1], [1]
    for epoch in range(1, n_epoches + 1):
        epoches.append(epoch)
        losses.append(((train_y - (
            w0[-1] + w1[-1] * train_x)) ** 2 / 2).sum())
        print('{:4}> w0={:.8f}, w1={:.8f}, loss={:.8f}'.format(
            epoches[-1], w0[-1], w1[-1], losses[-1]))
        # partial derivatives of the loss w.r.t. w0 and w1
        d0 = -(train_y - (
            w0[-1] + w1[-1] * train_x)).sum()
        d1 = -((train_y - (
            w0[-1] + w1[-1] * train_x)) * train_x).sum()
        w0.append(w0[-1] - lrate * d0)
        w1.append(w1[-1] - lrate * d1)
    w0 = np.array(w0[:-1])
    w1 = np.array(w1[:-1])
    sorted_indices = train_x.argsort()
    test_x = train_x[sorted_indices]
    test_y = train_y[sorted_indices]
    pred_test_y = w0[-1] + w1[-1] * test_x
    grid_w0, grid_w1 = np.meshgrid(
        np.linspace(0, 9, 500),
        np.linspace(0, 3.5, 500))
    flat_w0, flat_w1 = grid_w0.ravel(), grid_w1.ravel()
    flat_loss = (((flat_w0 + np.outer(
        train_x, flat_w1)) - train_y.reshape(
        -1, 1)) ** 2).sum(axis=0) / 2
    grid_loss = flat_loss.reshape(grid_w0.shape)
    mp.figure('Linear Regression', facecolor='lightgray')
    mp.title('Linear Regression', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.scatter(train_x, train_y, marker='s',
               c='dodgerblue', alpha=0.5, s=80,
               label='Training')
    mp.scatter(test_x, test_y, marker='D',
               c='orangered', alpha=0.5, s=60,
               label='Testing')
    mp.scatter(test_x, pred_test_y, c='orangered',
               alpha=0.5, s=60, label='Predicted')
    for x, y, pred_y in zip(
            test_x, test_y, pred_test_y):
        mp.plot([x, x], [y, pred_y], c='orangered',
                alpha=0.5, linewidth=1)
    mp.plot(test_x, pred_test_y, '--', c='limegreen',
            label='Regression', linewidth=1)
    mp.legend()
    mp.figure('Training Progress', facecolor='lightgray')
    mp.subplot(311)
    mp.title('Training Progress', fontsize=20)
    mp.ylabel('w0', fontsize=14)
    mp.gca().xaxis.set_major_locator(
        mp.MultipleLocator(100))
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.plot(epoches, w0, c='dodgerblue', label='w0')
    mp.legend()
    mp.subplot(312)
    mp.ylabel('w1', fontsize=14)
    mp.gca().xaxis.set_major_locator(
        mp.MultipleLocator(100))
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.plot(epoches, w1, c='limegreen', label='w1')
    mp.legend()
    mp.subplot(313)
    mp.xlabel('epoch', fontsize=14)
    mp.ylabel('loss', fontsize=14)
    mp.gca().xaxis.set_major_locator(
        mp.MultipleLocator(100))
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.plot(epoches, losses, c='orangered', label='loss')
    mp.legend()
    mp.tight_layout()
    mp.figure('Loss Function')
    ax = mp.gca(projection='3d')
    mp.title('Loss Function', fontsize=20)
    ax.set_xlabel('w0', fontsize=14)
    ax.set_ylabel('w1', fontsize=14)
    ax.set_zlabel('loss', fontsize=14)
    mp.tick_params(labelsize=10)
    ax.plot_surface(grid_w0, grid_w1, grid_loss,
                    rstride=10, cstride=10, cmap='jet')
    ax.plot(w0, w1, losses, 'o-', c='orangered',
            label='BGD')
    mp.legend()
    mp.figure('Batch Gradient Descent',
              facecolor='lightgray')
    mp.title('Batch Gradient Descent', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.contourf(grid_w0, grid_w1, grid_loss, 1000,
                cmap='jet')
    cntr = mp.contour(grid_w0, grid_w1, grid_loss, 10,
                      colors='black', linewidths=0.5)
    mp.clabel(cntr, inline_spacing=0.1, fmt='%.2f',
              fontsize=8)
    mp.plot(w0, w1, 'o-', c='orangered', label='BGD')
    mp.legend()
    mp.show()
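    As a cross-check, the closed-form least-squares solution is what batch gradient descent should converge toward. A minimal sketch, reusing train_x and train_y from gd.py above:
    # Solve [1 x] [w0 w1]^T ~= y directly
    A = np.column_stack((np.ones_like(train_x), train_x))
    w_closed = np.linalg.lstsq(A, train_y, rcond=None)[0]
    print(w_closed)  # compare with w0[-1], w1[-1]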
    

    import sklearn.linear_model as lm
    model = lm.LinearRegression()
    model.fit(known inputs, known outputs)  # computes the model parameters
    model.predict(new inputs) -> new outputs
    Code: line.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.linear_model as lm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/single.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y)
    model = lm.LinearRegression()
    model.fit(x, y)
    pred_y = model.predict(x)
    # r2 score: 1 is a perfect fit; the larger the total error E, the smaller the score (think 1/(1+E))
    print(sm.r2_score(y, pred_y))
    mp.figure('Linear Regression', facecolor='lightgray')
    mp.title('Linear Regression', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60,
               label='Sample')
    sorted_indices = x.ravel().argsort()
    mp.plot(x[sorted_indices], pred_y[sorted_indices],
            c='orangered', label='Regression')
    mp.legend()
    mp.show()
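    r2_score is not the only option; sklearn.metrics provides several other regression metrics that could be printed here as well. A minimal sketch, reusing y and pred_y:
    print(sm.mean_absolute_error(y, pred_y))    # mean of |y - y'|
    print(sm.mean_squared_error(y, pred_y))     # mean of (y - y')^2
    print(sm.median_absolute_error(y, pred_y))  # median of |y - y'|, robust to outliers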
    

    Dumping and loading a model: pickle
    Code: dump.py, load.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import pickle
    import numpy as np
    import sklearn.linear_model as lm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/single.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y)
    model = lm.LinearRegression()
    model.fit(x, y)
    pred_y = model.predict(x)
    # r2 score: 1 is a perfect fit; the larger the total error E, the smaller the score (think 1/(1+E))
    print(sm.r2_score(y, pred_y))
    # Serialize the trained model to disk
    with open('../../data/linear.pkl', 'wb') as f:
        pickle.dump(model, f)
    mp.figure('Linear Regression', facecolor='lightgray')
    mp.title('Linear Regression', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60,
               label='Sample')
    sorted_indices = x.ravel().argsort()
    mp.plot(x[sorted_indices], pred_y[sorted_indices],
            c='orangered', label='Regression')
    mp.legend()
    mp.show()
    
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import pickle
    import numpy as np
    import sklearn.linear_model as lm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/single.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y)
    # Deserialize the previously trained model instead of fitting again
    with open('../../data/linear.pkl', 'rb') as f:
        model = pickle.load(f)
    pred_y = model.predict(x)
    # r2 score: 1 is a perfect fit; the larger the total error E, the smaller the score (think 1/(1+E))
    print(sm.r2_score(y, pred_y))
    mp.figure('Linear Regression', facecolor='lightgray')
    mp.title('Linear Regression', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60,
               label='Sample')
    sorted_indices = x.ravel().argsort()
    mp.plot(x[sorted_indices], pred_y[sorted_indices],
            c='orangered', label='Regression')
    mp.legend()
    mp.show()
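    pickle is not the only option: the standalone joblib package (assumed installed; sklearn depends on it) is commonly used for the same purpose and handles large numpy arrays efficiently. A minimal sketch with a hypothetical file path:
    import joblib
    joblib.dump(model, '../../data/linear_joblib.pkl')
    model = joblib.load('../../data/linear_joblib.pkl')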
    

V. Ridge Regression

  • Loss(w0, w1) = SIGMA[1/2 (y - (w0 + w1x))^2]
                   + regularization strength * f(w0, w1)
  • Regularization adds a penalty term to the loss function so that the model parameters match the training data less tightly, which keeps a few samples lying clearly outside the normal range from distorting the regression result.
    Code:
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.linear_model as lm
    import matplotlib.pyplot as mp
    x, y = [], []
    with open('../../data/abnormal.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            x.append(data[:-1])
            y.append(data[-1])
    x = np.array(x)
    y = np.array(y)
    # Plain least squares for comparison
    model1 = lm.LinearRegression()
    model1.fit(x, y)
    pred_y1 = model1.predict(x)
    # Ridge regression with regularization strength alpha=300
    model2 = lm.Ridge(300, fit_intercept=True)
    model2.fit(x, y)
    pred_y2 = model2.predict(x)
    mp.figure('Linear & Ridge Regression',
              facecolor='lightgray')
    mp.title('Linear & Ridge Regression', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60,
               label='Sample')
    sorted_indices = x.ravel().argsort()
    mp.plot(x[sorted_indices], pred_y1[sorted_indices],
            c='orangered', label='Linear')
    mp.plot(x[sorted_indices], pred_y2[sorted_indices],
            c='limegreen', label='Ridge')
    mp.legend()
    mp.show()
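    The effect of the penalty is easiest to see in the fitted parameters themselves. A minimal sketch, reusing model1 and model2 from above:
    # Ridge shrinks the coefficients relative to plain least squares
    print(model1.coef_, model1.intercept_)  # LinearRegression
    print(model2.coef_, model2.intercept_)  # Ridge with alpha=300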

VI. Polynomial Regression

  • Multivariate linear: y = w0 + w1x1 + w2x2 + w3x3 + ... + wnxn
    Substituting powers of a single variable,
    x1 = x^1, x2 = x^2, ..., xn = x^n,
    turns it into a univariate polynomial:
    y = w0 + w1x + w2x^2 + w3x^3 + ... + wnx^n
    x -> polynomial feature expander -x1...xn-> linear regressor -> w0...wn
         \_____________________________________________________/
                                pipeline
    Code: poly.py
    # -*- coding: utf-8 -*-
    from __future__ import unicode_literals
    import numpy as np
    import sklearn.pipeline as pl
    import sklearn.preprocessing as sp
    import sklearn.linear_model as lm
    import sklearn.metrics as sm
    import matplotlib.pyplot as mp
    train_x, train_y = [], []
    with open('../../data/single.txt', 'r') as f:
        for line in f.readlines():
            data = [float(substr) for substr
                    in line.split(',')]
            train_x.append(data[:-1])
            train_y.append(data[-1])
    train_x = np.array(train_x)
    train_y = np.array(train_y)
    # Chain feature expansion (degree 10) and linear regression into one model
    model = pl.make_pipeline(sp.PolynomialFeatures(10),
                             lm.LinearRegression())
    model.fit(train_x, train_y)
    pred_train_y = model.predict(train_x)
    print(sm.r2_score(train_y, pred_train_y))
    test_x = np.linspace(train_x.min(), train_x.max(),
                         1000).reshape(-1, 1)
    pred_test_y = model.predict(test_x)
    mp.figure('Polynomial Regression',
              facecolor='lightgray')
    mp.title('Polynomial Regression', fontsize=20)
    mp.xlabel('x', fontsize=14)
    mp.ylabel('y', fontsize=14)
    mp.tick_params(labelsize=10)
    mp.grid(linestyle=':')
    mp.scatter(train_x, train_y, c='dodgerblue',
               alpha=0.75, s=60, label='Sample')
    mp.plot(test_x, pred_test_y, c='orangered',
            label='Regression')
    mp.legend()
    mp.show()
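    The degree passed to PolynomialFeatures is the main knob: too low underfits, too high fits the training noise. A minimal sketch of a quick sweep over hypothetical degrees, reusing train_x and train_y:
    # Compare training-set r2 across polynomial degrees
    for degree in (2, 4, 10):
        m = pl.make_pipeline(sp.PolynomialFeatures(degree),
                             lm.LinearRegression())
        m.fit(train_x, train_y)
        print(degree, sm.r2_score(train_y, m.predict(train_x)))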
    

Reposted from blog.csdn.net/qq_42584444/article/details/83994837