Machine Learning: Feature Selection

1. Introduction

Common feature selection methods fall into three categories: filter, wrapper, and embedded.

(1) Filter Methods

The best-known filter method is Relief. Its idea: for each sample, first find the nearest sample within the same class, called the "near-hit"; then find the nearest sample in the other class, called the "near-miss". These neighbors are used to compute a relevance statistic for each attribute:

$$\delta^j = \sum_i \left( -\operatorname{diff}\big(x_i^j, x_{i,\mathrm{nh}}^j\big)^2 + \operatorname{diff}\big(x_i^j, x_{i,\mathrm{nm}}^j\big)^2 \right)$$

where $x_a^j$ is the value of sample $x_a$ on attribute $j$, and $x_{i,\mathrm{nh}}$, $x_{i,\mathrm{nm}}$ are the near-hit and near-miss of sample $x_i$.

For discrete attributes:

$$\operatorname{diff}\big(x_a^j, x_b^j\big) = \begin{cases} 0, & x_a^j = x_b^j \\ 1, & \text{otherwise} \end{cases}$$

For continuous attributes, the values must first be normalized to $[0, 1]$; then:

$$\operatorname{diff}\big(x_a^j, x_b^j\big) = \big|x_a^j - x_b^j\big|$$
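To make the notation concrete, here is a minimal sketch of diff and the single-sample update for two attributes, one discrete and one (already normalized) continuous; the toy values are made up for illustration:

import numpy as np

def diff(a, b, discrete):
    # per-attribute difference: 0/1 for discrete values, |a - b| for normalized continuous values
    if discrete:
        return 0.0 if a == b else 1.0
    return abs(a - b)

x  = [1, 0.30]            # current sample
nh = [1, 0.25]            # its near-hit
nm = [0, 0.80]            # its near-miss
is_discrete = [True, False]

delta = np.array([-diff(x[j], nh[j], is_discrete[j]) ** 2
                  + diff(x[j], nm[j], is_discrete[j]) ** 2
                  for j in range(2)])
print(delta)              # larger values suggest more discriminative attributes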

Relief was designed for binary classification. For multi-class problems, Relief-F is used: first find $K$ "near-hits" in the same class as $x_i$, then find $K$ "near-misses" in each of the other classes:

$$\delta^j = \sum_i \left( -\operatorname{diff}\big(x_i^j, x_{i,\mathrm{nh}}^j\big)^2 + \sum_{l \neq k} p_l \times \operatorname{diff}\big(x_i^j, x_{i,l,\mathrm{nm}}^j\big)^2 \right)$$

where $k$ is the class of $x_i$, $x_{i,l,\mathrm{nm}}$ is the near-miss of $x_i$ in class $l$, and $p_l$ is the proportion of class-$l$ samples in the dataset.
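The only change from Relief is that the near-miss term becomes a sum over the other classes weighted by the class proportions $p_l$. A minimal sketch of the update for one sample (the arrays and priors here are hypothetical; note that the implementation in section 2 averages uniformly over the other classes instead of weighting by $p_l$):

import numpy as np

x  = np.array([0.3, 0.7])                          # current sample, from class k
nh = np.array([[0.28, 0.66], [0.35, 0.72]])        # its K near-hits (same class)
nm = {1: np.array([[0.9, 0.1], [0.8, 0.2]]),       # K near-misses in each other class
      2: np.array([[0.1, 0.9], [0.2, 0.95]])}
p  = {1: 0.3, 2: 0.2}                              # class proportions p_l of the other classes

delta  = -np.mean((x - nh) ** 2, axis=0)                            # near-hit term
delta += sum(p[l] * np.mean((x - nm[l]) ** 2, axis=0) for l in nm)  # weighted near-miss terms
print(delta)                                       # one relevance value per attribute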

(2) Wrapper Methods

Wrapper feature selection uses the performance of the final learner directly as the criterion for evaluating feature subsets. LVW (Las Vegas Wrapper) is a typical method of this kind: it searches for feature subsets with a random strategy, under the framework of the Las Vegas method. Let us first introduce this random strategy through two classic examples; a minimal LVW sketch follows them.

Two random strategies are well known: the Las Vegas method mentioned above, and the Monte Carlo method.

Monte Carlo algorithm: the more you sample, the closer you get to the optimal solution. For example: a basket holds 100 apples, and I pick one at a time with my eyes closed, trying to end up with the biggest. I randomly take one, then randomly take another and compare, keeping the larger; then take another, and so on. Each time I pick, the apple I keep is at least as large as the previous one. The more times I pick, the bigger the apple I end up with, but unless I pick all 100 times, I cannot be certain I have the biggest. This apple-picking procedure is Monte Carlo: try to find a good answer, with no guarantee it is the best.
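A few lines of Python capture the procedure (the apple sizes are randomly generated here for illustration):

import random

apples = [random.random() for _ in range(100)]   # hidden apple sizes
best = random.choice(apples)
for _ in range(30):                              # more draws give a better apple, but no guarantee
    best = max(best, random.choice(apples))
print(best, max(apples))                         # usually close to, but not always equal to, the true max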

Las Vegas algorithm: the more you sample, the more likely you are to find the optimal solution. For example: there is a lock and 100 keys, only one of which fits. I randomly try one key at a time; if it does not open the lock, I try another. The more times I try, the greater my chance of opening the lock (the optimal solution), but until it opens, the wrong keys are of no use. This key-trying procedure is Las Vegas: keep searching for the right answer, with no guarantee of finding it within a bounded number of tries.
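With the Las Vegas strategy in mind, here is a minimal LVW-style sketch: repeatedly draw a random feature subset, evaluate the learner on it, and keep the subset only if it improves the error (smaller subsets win ties). The learner (k-NN), the evaluation (5-fold cross-validated error), and the fixed budget of draws are all arbitrary choices for illustration, not part of any library API:

import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=4)
n = X.shape[1]

def cv_error(feats):
    # wrapper evaluation: the learner's cross-validated error on the candidate subset
    return 1 - cross_val_score(KNeighborsClassifier(), X[:, feats], y, cv=5).mean()

best_feats = list(range(n))
best_err = cv_error(best_feats)

for _ in range(100):                                        # fixed budget of random draws (the stopping rule)
    feats = random.sample(range(n), random.randint(1, n))   # Las Vegas step: a uniformly random subset
    err = cv_error(feats)
    if err < best_err or (err == best_err and len(feats) < len(best_feats)):
        best_feats, best_err = feats, err                   # keep only strict improvements (or smaller subsets at equal error)

print(sorted(best_feats), best_err)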

(3) Embedded Methods

Embedded feature selection merges the feature selection process and the learner's training process into a single step.

Consider the simplest linear regression model, whose optimization objective is:

$$\min_{w} \sum_{i=1}^{m} \big(y_i - w^\top x_i\big)^2$$

To prevent overfitting, we add a regularization term:

$$\min_{w} \sum_{i=1}^{m} \big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert_2^2$$

With $\lambda > 0$, this is ridge regression (an L2 penalty).

$$\min_{w} \sum_{i=1}^{m} \big(y_i - w^\top x_i\big)^2 + \lambda \lVert w \rVert_1$$

With $\lambda > 0$ and the L1 penalty instead, this is LASSO regression. The L1 norm drives some coefficients exactly to zero, which is why LASSO performs feature selection as a by-product of training.
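To see the difference in practice, here is a small sketch comparing the two penalties on synthetic data (the alpha values and dataset sizes are arbitrary choices): ridge shrinks all coefficients but rarely zeroes them, while LASSO zeroes most of the uninformative ones, which is exactly the embedded selection effect:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("ridge zero coefs:", np.sum(ridge.coef_ == 0))   # typically 0
print("lasso zero coefs:", np.sum(lasso.coef_ == 0))   # typically most of the 15 noise features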

To find the optimum we would like to differentiate the objective, but the L1 term is not differentiable at zero, so direct differentiation is difficult. Writing the smooth squared-loss part as $f(x)$, we instead take a second-order Taylor expansion around the current iterate $x_k$:

$$\hat{f}(x) \simeq f(x_k) + \big\langle \nabla f(x_k),\, x - x_k \big\rangle + \tfrac{1}{2}\,(x - x_k)^\top \nabla^2 f(x_k)\,(x - x_k)$$

Then, using the L-Lipschitz condition on the gradient, the second-order term can be bounded by the constant $L$:

$$\lVert \nabla f(x') - \nabla f(x) \rVert_2 \le L\,\lVert x' - x \rVert_2 \qquad \forall\, x, x'$$

This simplifies the Taylor expansion to:

$$\hat{f}(x) \simeq f(x_k) + \big\langle \nabla f(x_k),\, x - x_k \big\rangle + \frac{L}{2}\,\lVert x - x_k \rVert_2^2$$

which can be rearranged, by completing the square, into:

$$\hat{f}(x) = \frac{L}{2}\,\Big\lVert x - \Big(x_k - \frac{1}{L}\nabla f(x_k)\Big) \Big\rVert_2^2 + \mathrm{const}$$

Let:

$$z = x_k - \frac{1}{L}\nabla f(x_k)$$

Each iteration then minimizes $\frac{L}{2}\lVert x - z \rVert_2^2 + \lambda \lVert x \rVert_1$, which separates over components and admits the closed-form (soft-thresholding) solution:

$$x_{k+1}^i = \begin{cases} z^i - \lambda/L, & z^i > \lambda/L \\ 0, & \lvert z^i \rvert \le \lambda/L \\ z^i + \lambda/L, & z^i < -\lambda/L \end{cases}$$

where $x^i$ denotes the $i$-th component of $x$.
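Iterating the gradient step $z = x_k - \frac{1}{L}\nabla f(x_k)$ followed by this soft-thresholding is proximal gradient descent (ISTA) for LASSO. A minimal numpy sketch, assuming the common convention $f(x) = \frac{1}{2}\lVert Ax - y \rVert_2^2$, so that $\nabla f(x) = A^\top(Ax - y)$ and $L$ is the largest eigenvalue of $A^\top A$; the data is synthetic:

import numpy as np

def soft_threshold(z, t):
    # the closed-form solution above: shrink each component toward zero by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    L = np.linalg.eigvalsh(A.T @ A).max()       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - (A.T @ (A @ x - y)) / L         # gradient step: z = x_k - (1/L) * grad f(x_k)
        x = soft_threshold(z, lam / L)          # proximal step: the soft-thresholding rule
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
true_x = np.zeros(20); true_x[:3] = [2.0, -1.5, 1.0]    # sparse ground truth
y = A @ true_x + 0.01 * rng.standard_normal(100)
print(np.round(ista(A, y, lam=0.5), 2))                 # most coefficients come out exactly zero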

2. Code Implementation

The Relief algorithm:

import numpy as np
from random import randrange

from sklearn.datasets import make_classification
from sklearn.preprocessing import normalize


def distanceNorm(Norm, D_value):
    if Norm == '1':
        counter = np.absolute(D_value)
        counter = np.sum(counter)
    elif Norm == '2':
        counter = np.power(D_value, 2)
        counter = np.sum(counter)
        counter = np.sqrt(counter)
    elif Norm == 'Infinity':
        counter = np.absolute(D_value)
        counter = np.max(counter)
    else:
        raise ValueError('unsupported norm: %s' % Norm)

    return counter


def Relief(features, labels, iter_ratio):
    (m, n) = np.shape(features)
    distance = np.zeros((m, m))
    weight = np.zeros(n)

    if iter_ratio >= 0.5:
        # precompute the full pairwise distance matrix
        for index_i in range(m):
            for index_j in range(index_i + 1, m):
                D_value = features[index_i] - features[index_j]
                distance[index_i, index_j] = distanceNorm('2', D_value)   # Euclidean distance between the two samples
        distance += distance.T   # symmetrize the upper triangle into a full matrix
    else:
        pass  # distances are computed per sampled instance inside the main loop below

    for iter_num in range(int(iter_ratio * m)):
        nearHit = None
        nearMiss = None
        distance_sort = list()

        # randomly pick a sample
        index_i = randrange(0, m, 1)
        self_features = features[index_i]

        # find the near-hit and near-miss
        if iter_ratio >= 0.5:
            distance[index_i, index_i] = np.max(distance[index_i])  # set the self-distance to the row max so the sample is never its own neighbor
            for index in range(m):
                distance_sort.append([distance[index_i, index], index, labels[index]])  # collect (distance, index, label) triples
        else:
            dist_row = np.zeros(m)  # distances from the sampled instance only
            for index_j in range(m):
                D_value = features[index_i] - features[index_j]
                dist_row[index_j] = distanceNorm('2', D_value)
            dist_row[index_i] = np.max(dist_row)  # exclude the sample itself
            for index in range(m):
                distance_sort.append([dist_row[index], index, labels[index]])
        distance_sort.sort(key=lambda x: x[0])                        # sort by distance, ascending
        for index in range(m):
            if nearHit is None and distance_sort[index][2] == labels[index_i]:
                nearHit = features[distance_sort[index][1]]    # nearest same-class sample: the near-hit
            elif nearMiss is None and distance_sort[index][2] != labels[index_i]:
                nearMiss = features[distance_sort[index][1]]   # nearest other-class sample: the near-miss
            elif nearHit is not None and nearMiss is not None:
                break

        # update the weights: subtract the near-hit term, add the near-miss term
        weight = weight - np.power(self_features - nearHit, 2) + np.power(self_features - nearMiss, 2)
    return weight

if __name__ == '__main__':
    features, labels = make_classification(n_samples=500)   # random binary classification set (500 samples, 20 features)
    features = normalize(X=features, norm='l2', axis=0)     # scale each feature column to unit norm
    for x in range(1, 10):                                  # repeat to see the variability of the sampled weights
        weight = Relief(features, labels, 1)
        print(weight)

The Relief-F algorithm:

import numpy as np
from random import randrange

from sklearn.datasets import make_classification
from sklearn.preprocessing import normalize


def distanceNorm(Norm, D_value):
    if Norm == '1':
        counter = np.absolute(D_value)
        counter = np.sum(counter)
    elif Norm == '2':
        counter = np.power(D_value, 2)
        counter = np.sum(counter)
        counter = np.sqrt(counter)
    elif Norm == 'Infinity':
        counter = np.absolute(D_value)
        counter = np.max(counter)
    else:
        raise ValueError('unsupported norm: %s' % Norm)

    return counter


def ReliefF(features, labels, iter_ratio, k=5):
    (m, n) = np.shape(features)
    distance = np.zeros((m, m))
    weight = np.zeros(n)

    if iter_ratio >= 0.5:
        # precompute the full pairwise distance matrix
        for index_i in range(m):
            for index_j in range(index_i + 1, m):
                D_value = features[index_i] - features[index_j]
                distance[index_i, index_j] = distanceNorm('2', D_value)   # Euclidean distance between the two samples
        distance += distance.T   # symmetrize the upper triangle into a full matrix
    else:
        pass  # distances are computed per sampled instance inside the main loop below

    for iter_num in range(int(iter_ratio * m)):
        # randomly pick a sample
        index_i = randrange(0, m, 1)
        self_features = features[index_i]

        nearHit = list()
        nearMiss = dict()
        n_labels = list(set(labels))
        termination = np.zeros(len(n_labels))   # one completion flag per class; assumes integer labels 0..n_classes-1
        temp = np.ones(len(n_labels))
        del n_labels[n_labels.index(labels[index_i])]   # keep only the other classes
        for label in n_labels:
            nearMiss[label] = list()
        distance_sort = list()

        # find the near-hits and near-misses
        if iter_ratio >= 0.5:
            distance[index_i, index_i] = np.max(distance[index_i])  # set the self-distance to the row max so the sample is never its own neighbor
            for index in range(m):
                distance_sort.append([distance[index_i, index], index, labels[index]])  # collect (distance, index, label) triples
        else:
            dist_row = np.zeros(m)  # distances from the sampled instance only
            for index_j in range(m):
                D_value = features[index_i] - features[index_j]
                dist_row[index_j] = distanceNorm('2', D_value)
            dist_row[index_i] = np.max(dist_row)  # exclude the sample itself
            for index in range(m):
                distance_sort.append([dist_row[index], index, labels[index]])
        distance_sort.sort(key=lambda x: x[0])                        # sort by distance, ascending
        for index in range(m):
            if distance_sort[index][2] == labels[index_i]:    # near-hits
                if len(nearHit) < k:
                    nearHit.append(features[distance_sort[index][1]])
                else:
                    termination[distance_sort[index][2]] = 1
            else:                                             # near-misses
                if len(nearMiss[distance_sort[index][2]]) < k:
                    nearMiss[distance_sort[index][2]].append(features[distance_sort[index][1]])
                else:
                    termination[distance_sort[index][2]] = 1

            if (termination == temp).all():   # stop once k neighbors have been collected for every class
                break

        # update the weights
        nearHit_term = np.zeros(n)
        for x in nearHit:
            nearHit_term += np.power(self_features - x, 2)   # fixed: accumulate into nearHit_term, not nearHit
        nearMiss_term = np.zeros((len(list(set(labels))), n))
        for index, label in enumerate(nearMiss.keys()):
            for x in nearMiss[label]:
                nearMiss_term[index] += np.power(self_features - x, 2)
            weight += nearMiss_term[index] / (k * len(nearMiss.keys()))   # uniform average over the other classes (the p_l weighting is omitted)
        weight -= nearHit_term / k
    return weight

if __name__ == '__main__':
    features, labels = make_classification(n_samples=500, n_classes=4, n_informative=3)   # random 4-class set (500 samples, 20 features)
    features = normalize(X=features, norm='l2', axis=0)    # scale each feature column to unit norm
    for x in range(1, 10):                                 # repeat to see the variability of the sampled weights
        weight = ReliefF(features, labels, 1)
        print(weight)

Feature selection with LASSO:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
import numpy as np

# load_boston was removed in scikit-learn 1.2; the California housing set is used here instead
housing = fetch_california_housing()
scaler = StandardScaler()
X = scaler.fit_transform(housing["data"])
Y = housing["target"]
names = housing["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)

def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))   # sort by absolute coefficient size
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

print("Lasso model coefficients:", pretty_print_linear(lasso.coef_, names, sort=True))


Reposted from blog.csdn.net/lyn5284767/article/details/81530437