Importing the dataset files
['test.csv', 'train.csv', 'sample_submission.csv']
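The list above looks like the result of listing the data directory. A minimal sketch of that step follows; the helper name `list_csv_files` and the temporary directory are assumptions made for illustration, since the source does not state where the files are stored.

```python
import os
import tempfile

def list_csv_files(data_dir):
    """Return the sorted names of the CSV files found in data_dir."""
    return sorted(f for f in os.listdir(data_dir) if f.endswith('.csv'))

# Demonstration on a temporary directory holding empty placeholder files.
# With the real data, pass the directory that contains the competition files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['test.csv', 'train.csv', 'sample_submission.csv']:
        open(os.path.join(tmp, name), 'w').close()
    found = list_csv_files(tmp)
    print(found)
```

`os.listdir` returns names in arbitrary order, so the sketch sorts them to get a stable listing.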
Data type to use for each feature in the dataset
dtypes = {
'MachineIdentifier': 'category',
'ProductName': 'category',
'EngineVersion': 'category',
'AppVersion': 'category',
'AvSigVersion': 'category',
'IsBeta': 'int8',
'RtpStateBitfield': 'float16',
'IsSxsPassiveMode': 'int8',
'DefaultBrowsersIdentifier': 'float32',
'AVProductStatesIdentifier': 'float32',
'AVProductsInstalled': 'float16',
'AVProductsEnabled': 'float16',
'HasTpm': 'int8',
'CountryIdentifier': 'int16',
'CityIdentifier': 'float32',
'OrganizationIdentifier': 'float16',
'GeoNameIdentifier': 'float16',
'LocaleEnglishNameIdentifier': 'int16',
'Platform': 'category',
'Processor': 'category',
'OsVer': 'category',
'OsBuild': 'int16',
'OsSuite': 'int16',
'OsPlatformSubRelease': 'category',
'OsBuildLab': 'category',
'SkuEdition': 'category',
'IsProtected': 'float16',
'AutoSampleOptIn': 'int8',
'PuaMode': 'category',
'SMode': 'float16',
'IeVerIdentifier': 'float16',
'SmartScreen': 'category',
'Firewall': 'float16',
'UacLuaenable': 'float64', # was 'float32'
'Census_MDC2FormFactor': 'category',
'Census_DeviceFamily': 'category',
'Census_OEMNameIdentifier': 'float32', # was 'float16'
'Census_OEMModelIdentifier': 'float32',
'Census_ProcessorCoreCount': 'float16',
'Census_ProcessorManufacturerIdentifier': 'float16',
'Census_ProcessorModelIdentifier': 'float32', # was 'float16'
'Census_ProcessorClass': 'category',
'Census_PrimaryDiskTotalCapacity': 'float64', # was 'float32'
'Census_PrimaryDiskTypeName': 'category',
'Census_SystemVolumeTotalCapacity': 'float64', # was 'float32'
'Census_HasOpticalDiskDrive': 'int8',
'Census_TotalPhysicalRAM': 'float32',
'Census_ChassisTypeName': 'category',
'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32', # was 'float16'
'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32', # was 'float16'
'Census_InternalPrimaryDisplayResolutionVertical': 'float32', # was 'float16'
'Census_PowerPlatformRoleName': 'category',
'Census_InternalBatteryType': 'category',
'Census_InternalBatteryNumberOfCharges': 'float64', # was 'float32'
'Census_OSVersion': 'category',
'Census_OSArchitecture': 'category',
'Census_OSBranch': 'category',
'Census_OSBuildNumber': 'int16',
'Census_OSBuildRevision': 'int32',
'Census_OSEdition': 'category',
'Census_OSSkuName': 'category',
'Census_OSInstallTypeName': 'category',
'Census_OSInstallLanguageIdentifier': 'float16',
'Census_OSUILocaleIdentifier': 'int16',
'Census_OSWUAutoUpdateOptionsName': 'category',
'Census_IsPortableOperatingSystem': 'int8',
'Census_GenuineStateName': 'category',
'Census_ActivationChannel': 'category',
'Census_IsFlightingInternal': 'float16',
'Census_IsFlightsDisabled': 'float16',
'Census_FlightRing': 'category',
'Census_ThresholdOptIn': 'float16',
'Census_FirmwareManufacturerIdentifier': 'float16',
'Census_FirmwareVersionIdentifier': 'float32',
'Census_IsSecureBootEnabled': 'int8',
'Census_IsWIMBootEnabled': 'float16',
'Census_IsVirtualDevice': 'float16',
'Census_IsTouchEnabled': 'int8',
'Census_IsPenCapable': 'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
'Wdft_IsGamer': 'float16',
'Wdft_RegionIdentifier': 'float16',
'HasDetections': 'int8'
}
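A mapping like this is typically passed to `pd.read_csv` through its `dtype` parameter so that the large training file takes less memory. The sketch below assumes that usage; `sample_dtypes` and the in-memory CSV text are hypothetical stand-ins for the full mapping and for train.csv.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt of the dtype mapping, for illustration only.
sample_dtypes = {
    'ProductName': 'category',
    'IsBeta': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
}

# Stand-in for train.csv; with the real file, pass its path instead of the buffer.
csv_text = (
    "ProductName,IsBeta,Census_TotalPhysicalRAM\n"
    "win8defender,0,4096\n"
    "win8defender,0,\n"
    "mse,1,8192\n"
)

sample = pd.read_csv(io.StringIO(csv_text), dtype=sample_dtypes)
print(sample.dtypes)
```

NumPy integer dtypes cannot represent NaN, which is presumably why the columns that contain missing values are typed as floats in the mapping.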
Split the features by data type (numeric features in one group, string-valued features in the other)
numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_columns = [c for c,v in dtypes.items() if v in numerics]
categorical_columns = [c for c,v in dtypes.items() if v not in numerics]
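As a cross-check, a similar split can be read off an already loaded DataFrame with `select_dtypes` instead of the dtype mapping. This is an alternative sketch, not the approach used in the text, and the toy frame `toy` is made up for illustration.

```python
import pandas as pd

# Toy frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    'a': pd.Series([1, 2], dtype='int8'),
    'b': pd.Series(['x', 'y'], dtype='category'),
    'c': pd.Series([0.5, 1.5], dtype='float32'),
})

# 'number' matches every NumPy integer and float dtype.
numerical = list(toy.select_dtypes(include='number').columns)
categorical = list(toy.select_dtypes(include='category').columns)
print(numerical, categorical)
```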
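For each feature, compute the share of missing values and the share taken by its most frequent value
Approach:
(1) Start with an empty list. It will hold one list or tuple per feature (tuples are the usual choice, because they cannot be modified by accident).
(2) Each tuple has five fields. First: the feature name. Second: the number of distinct values the feature takes. Third: the percentage of samples in which the feature is missing. Fourth: the percentage of samples taken by the feature's most frequent value (the code passes dropna=False, so a missing value counts as a value here). Fifth: the feature's data type.
(3) Loop over the features and append each feature's tuple to the list.
(4) Build a DataFrame from the collected list, using 'Feature', 'Unique_values', 'Percentage of missing values', 'Percentage of values in the biggest category' and 'type' as the column headers.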
import pandas as pd

# Assumes `train` already holds train.csv loaded as a DataFrame.
stats = []
for col in train.columns:
    # dropna=False: missing values count as a category of their own.
    stats.append((col, train[col].nunique(), train[col].isnull().sum() * 100 / train.shape[0], train[col].value_counts(normalize=True, dropna=False).values[0] * 100, train[col].dtype))
stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values', 'Percentage of missing values', 'Percentage of values in the biggest category', 'type'])
stats_df.sort_values('Percentage of missing values', ascending=False)
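To see what each of the five fields contains, the same computation can be run on a small frame. This is a sketch under assumptions: the function name `column_stats` and the frame `toy` are hypothetical, and `isnull().mean()` is used as an equivalent way to get the missing-value share.

```python
import numpy as np
import pandas as pd

def column_stats(df):
    """Build one summary row per column of df."""
    rows = []
    for col in df.columns:
        # Relative frequencies, most frequent first; NaN counts as a value.
        counts = df[col].value_counts(normalize=True, dropna=False)
        rows.append((
            col,
            df[col].nunique(),              # distinct non-missing values
            df[col].isnull().mean() * 100,  # percentage of missing values
            counts.iloc[0] * 100,           # percentage of the top value
            df[col].dtype,
        ))
    return pd.DataFrame(rows, columns=[
        'Feature', 'Unique_values', 'Percentage of missing values',
        'Percentage of values in the biggest category', 'type'])

toy = pd.DataFrame({
    'a': [1, 1, 1, 2],
    'b': [np.nan, np.nan, 'x', 'y'],
})
toy_stats = column_stats(toy)
print(toy_stats)
```

In column `b` the most frequent "value" is the missing value itself, so its biggest-category share equals its missing share. That is worth keeping in mind when reading the fourth field.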
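Selecting the features to keep
Approach:
(1) Copy the training set's column names into a list.
(2) For each feature, compute the share taken by its most frequent value. If that share is above 90%, the feature is nearly constant and gives a classifier little to separate samples with, so treat it as not needed and remove it from the list.
(3) When the loop ends, reassign train (the training set variable) to the selected columns only.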
good_cols = list(train.columns)
for col in train.columns:
    # Share of the most frequent value, counting NaN as a value.
    rate = train[col].value_counts(normalize=True, dropna=False).values[0]
    if rate > 0.9:
        good_cols.remove(col)
train = train[good_cols]
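The filter can also be written as a reusable function and checked on a toy frame. The sketch below assumes a hypothetical helper `drop_dominated_columns`; its `keep` parameter is an addition not present in the original, meant to protect columns such as the target from being dropped.

```python
import pandas as pd

def drop_dominated_columns(df, threshold=0.9, keep=()):
    """Drop columns whose most frequent value exceeds `threshold` of rows."""
    kept = []
    for col in df.columns:
        # Share of the most frequent value, counting NaN as a value.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if col in keep or top_share <= threshold:
            kept.append(col)
    return df[kept]

toy = pd.DataFrame({
    'mostly_constant': [0] * 19 + [1],  # 95% zeros, so it is dropped
    'varied': list(range(20)),          # every value distinct
    'target': [0, 1] * 10,              # balanced label
})
filtered = drop_dominated_columns(toy, keep=('target',))
print(list(filtered.columns))
```

Building a new list of kept columns, rather than removing items from a list, keeps the loop free of side effects on the collection being built.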
Displaying the filtered table and counting the detections
train.head()
This displays the first five rows of the filtered table.
Count the machines with and without detections
train['HasDetections'].value_counts()
Output:
0 4462591
1 4458892
Name: HasDetections, dtype: int64
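The two counts are nearly equal, so the target classes are close to balanced. A quick way to express this as proportions, re-entering the counts printed above by hand as a stand-in for the real column:

```python
import pandas as pd

# Counts copied from the output above.
counts = pd.Series({0: 4462591, 1: 4458892}, name='HasDetections')

# Fraction of samples in each class.
shares = counts / counts.sum()
print(shares.round(4))
```

On the real data, `train['HasDetections'].value_counts(normalize=True)` gives the same proportions directly.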