"Author's Homepage": Shibie Sanri wyx
"Author's Profile": CSDN top100, Alibaba Cloud Blog Expert, Huawei CloudShare Expert, High-quality Creator in the Network Security Field
"Recommended Column": Friends who are interested in network security can pay attention to the column "Introduction to Mastery of Network Security"
sklearn dataset
Machine learning is one way to realize artificial intelligence: it automatically analyzes "data" to build a "model", then uses that model to "predict" unknown data.
Simply put, it summarizes rules from historical data and applies them to new problems.
To learn rules from data, a "dataset" must be provided. A dataset consists of two parts: "feature values" and "target values".
There are many useful tools for machine learning; here we use sklearn.
sklearn (scikit-learn) is a Python machine learning toolkit that ships with a large number of built-in datasets for practicing machine learning algorithms.
2. Install sklearn
Environment requirements:
- Python(>=2.7 or >=3.3)
- NumPy (>= 1.8.2)
- SciPy (>= 0.13.3)
Install numpy and scipy first, then install scikit-learn.
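For example, with pip (assuming `pip` points at the Python environment you want to use):

```shell
# Install the numeric dependencies first
pip install numpy
pip install scipy
# Then install scikit-learn itself
pip install scikit-learn
```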
Alternatively, install it through PyCharm: upper-left [File] - [Settings] - [Project: pythonProject] - [Python Interpreter].
3. Get the dataset
sklearn provides three ways to "get a dataset":
- sklearn.datasets.load_*(): small-scale datasets (local loading)
- sklearn.datasets.fetch_*(): large-scale datasets (online download)
- sklearn.datasets.make_*(): locally generate datasets (local construction)
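The load_* style is demonstrated below; as a quick sketch of the make_* style, `make_classification` (a standard sklearn generator; the n_samples/n_features/random_state values here are arbitrary choices for illustration) constructs a synthetic classification dataset locally:

```python
from sklearn.datasets import make_classification

# Generate a synthetic dataset: 100 samples, each with 4 feature values
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

print(X.shape)  # (100, 4) -- feature values
print(y.shape)  # (100,)   -- target values
```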
The "return value" of the sklearn dataset is in dictionary format:
- data: eigenvalue data array
- target: target value data array (label)
- target_names: label name (correspondence between target value and label)
- DESCR: data description
- feature_names: feature names
Next, we load a built-in local dataset:
from sklearn import datasets
# Load the dataset
iris = datasets.load_iris()
# Print the dataset
print(iris)
output:
The output shows that the returned dataset is a dictionary-like object containing the feature values (data), target values (target), and other information.
We can also view individual pieces of the dataset by accessing the return value's "attributes":
from sklearn import datasets
# Load the dataset
iris = datasets.load_iris()
# View the feature values
print(iris.data)
# View the target values (labels)
print(iris.target)
# View the label names
print(iris.target_names)
# View the data description
print(iris.DESCR)
# View the feature names
print(iris.feature_names)
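As a quick sketch of what these attributes contain (the shapes shown are for the standard iris dataset, which has 150 samples and 4 features):

```python
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples, each with 4 feature values
print(iris.data.shape)    # (150, 4)
# One target value (0, 1, or 2) per sample
print(iris.target.shape)  # (150,)
# The three class labels the target values map to
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']
```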
4. Dataset splitting
Datasets are usually split into two parts:
- "Training data": used to train and generate the model.
- "Test data": used to test whether the model is valid.
sklearn.model_selection.train_test_split() is used to split a dataset.
Parameters:
- x: (required) array, the feature values of the dataset
- y: (required) array, the target values of the dataset
- test_size: (optional, default 0.25) float, the proportion of the dataset to use as the test set
- random_state: (optional) integer, random seed; different seeds produce different sampling results.
Return value:
- training set feature values, test set feature values, training set target values, test set target values (in that order)
Next, we split the local dataset we just loaded. Since test_size is not given, the default 0.25 applies: 25% of the samples become test data and the remaining 75% become training data.
from sklearn import datasets
from sklearn import model_selection
# Load the dataset
iris = datasets.load_iris()
# Feature values of the dataset
data_arr = iris.data
# Target values (labels) of the dataset
target_arr = iris.target
# Returns: training features, test features, training targets, test targets
x_train, x_test, y_train, y_test = model_selection.train_test_split(data_arr, target_arr)
print('Training set feature values', x_train)
print('Test set feature values', x_test)
print('Training set target values', y_train)
print('Test set target values', y_test)
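To confirm the 75/25 split, we can check the resulting sizes. For iris's 150 samples, the default split yields 112 training and 38 test samples (test size rounds up); random_state=0 here is an arbitrary choice that makes the split reproducible:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Fixing random_state makes the split reproducible across runs
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

print(len(x_train))  # 112 training samples (75%)
print(len(x_test))   # 38 test samples (25%)
```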