[Machine Learning] Using sklearn datasets: acquiring and splitting datasets

"Author's Homepage": Shibie Sanri wyx
"Author's Profile": CSDN top100, Alibaba Cloud Blog Expert, Huawei CloudShare Expert, High-quality Creator in the Network Security Field
"Recommended Column": Friends who are interested in network security can pay attention to the column "Introduction to Mastery of Network Security"

Machine learning is one way to realize artificial intelligence: it automatically analyzes data to obtain a model, then uses that model to make predictions on unknown data.

Simply put, it summarizes patterns from historical data and applies them to solve new problems.

To extract patterns from data, a dataset must be provided. A dataset consists of two parts: feature values and target values.

There are many useful tools for machine learning; here we use sklearn.

sklearn (scikit-learn) is a Python-based machine learning toolkit that ships with a large number of datasets for practicing various machine learning algorithms.

1. Install sklearn

Environmental requirements:

  • Python(>=2.7 or >=3.3)
  • NumPy (>= 1.8.2)
  • SciPy (>= 0.13.3)

Install numpy and scipy first, then install scikit-learn.
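A typical way to install these packages is with pip (assuming pip is on your PATH; in a conda environment the `conda install` equivalents apply):

```shell
# Install the numerical dependencies first, then scikit-learn itself
pip install numpy scipy
pip install scikit-learn
```

Note that the package is installed as `scikit-learn` but imported in code as `sklearn`.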

Alternatively, in PyCharm: upper left corner [File] → [Settings] → [Project: pythonProject] → [Python Interpreter].


2. Get the dataset

sklearn provides three ways to obtain a dataset:

  • sklearn.datasets.load_*(): small-scale datasets (local loading)
  • sklearn.datasets.fetch_*(): large-scale datasets (online download)
  • sklearn.datasets.make_*(): locally generate datasets (local construction)
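As a quick illustration of the third option, `make_classification` generates a synthetic classification dataset locally (the parameter values here are arbitrary examples, not recommendations):

```python
from sklearn.datasets import make_classification

# Locally generate 100 samples with 4 features and 2 classes
X, y = make_classification(n_samples=100, n_features=4,
                           n_classes=2, random_state=0)
print(X.shape)  # (100, 4)
print(y.shape)  # (100,)
```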

The "return value" of the sklearn dataset is in dictionary format:

  • data: array of feature values
  • target: array of target values (labels)
  • target_names: label names (the mapping between target values and labels)
  • DESCR: dataset description
  • feature_names: feature names
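Strictly speaking, the return value is a `Bunch` object, a dictionary subclass whose keys can also be read as attributes, so both access styles are equivalent:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# A Bunch supports both dictionary-style and attribute-style access
print(iris["feature_names"] == iris.feature_names)  # True
print(isinstance(iris, dict))                       # True
```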

Next, we load a built-in local dataset:

from sklearn import datasets

# Load the dataset
iris = datasets.load_iris()
# Print the dataset
print(iris)


As the output shows, the returned dataset is a dictionary containing the feature values (data), the target values (target), and other information.

We can access the return value's attributes to view individual pieces of the dataset:

from sklearn import datasets

# Load the dataset
iris = datasets.load_iris()

# View the feature values
print(iris.data)
# View the target values (labels)
print(iris.target)
# View the label names
print(iris.target_names)
# View the dataset description
print(iris.DESCR)
# View the feature names
print(iris.feature_names)
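For a quick sense of scale, the iris dataset contains 150 samples with 4 features each, which we can confirm from the array shapes:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# 150 samples x 4 features; one target label per sample
print(iris.data.shape)          # (150, 4)
print(iris.target.shape)        # (150,)
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```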

3. Splitting the dataset

Datasets are usually divided into two parts:

  • "Training data" : used for training and generating models.
  • "Test data" : used for testing to determine whether the model is valid.

sklearn.model_selection.train_test_split() is used to split a dataset.

Parameters:

  • x: (required) array, the feature values of the dataset
  • y: (required) array, the target values of the dataset
  • test_size: (optional, default 0.25) float, the proportion of the test set
  • random_state: (optional) integer, random seed; different seeds produce different sampling results

Return values:

  • Training set feature values, test set feature values, training set target values, test set target values

Next, we split the dataset we just loaded. Since test_size is not given, the default of 0.25 applies: 25% of the samples are used as test data and the remaining 75% as training data.

from sklearn import datasets
from sklearn import model_selection

# Load the dataset
iris = datasets.load_iris()

# Feature values of the dataset
data_arr = iris.data
# Target values (labels) of the dataset
target_arr = iris.target

# Split into training and test sets
x_train, x_test, y_train, y_test = model_selection.train_test_split(data_arr, target_arr)
print('Training set feature values:', x_train)
print('Test set feature values:', x_test)
print('Training set target values:', y_train)
print('Test set target values:', y_test)
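With the default test_size of 0.25, scikit-learn rounds the test set size up, so the 150 iris samples split into 112 training and 38 test samples. Fixing random_state makes the split reproducible:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# random_state=42 is an arbitrary example seed
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

print(len(x_train), len(x_test))  # 112 38

# The same seed always yields the same partition
x_train2, _, _, _ = train_test_split(iris.data, iris.target, random_state=42)
print((x_train == x_train2).all())  # True
```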


Origin blog.csdn.net/wangyuxiang946/article/details/131338818