New features in scikit-learn 1.3: target encoding, missing-value support in decision trees, and more

 Author: Coggle

Introduction

scikit-learn 1.3

This update adds many bug fixes and improvements, and introduces some important new features. For an exhaustive list of all changes, see the release notes.

https://scikit-learn.org/stable/whats_new/v1.3.html#changes-1-3

Install the latest version using pip:

pip install --upgrade scikit-learn

Or use conda:

conda install -c conda-forge scikit-learn

Feature 1: Metadata Routing

https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html

Estimators gain new methods for requesting metadata such as sample_weight, which affect how that metadata is routed by meta-estimators like pipeline.Pipeline and model_selection.GridSearchCV.

Although the infrastructure for this feature is included in this release, work is still in progress and not all meta-estimators support this new feature. You can learn more about this feature in the Metadata Routing User Guide.
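
As a minimal sketch of the request API (the feature is experimental in 1.3 and must be enabled explicitly; treat the overall flow as illustrative rather than definitive), a consumer estimator declares which metadata it wants via set_fit_request:

from sklearn import set_config
from sklearn.linear_model import LogisticRegression

# Metadata routing is opt-in in 1.3.
set_config(enable_metadata_routing=True)

# Request that sample_weight be routed to this estimator's fit method;
# a meta-estimator that supports routing will then forward it along.
lr = LogisticRegression().set_fit_request(sample_weight=True)

# Inspect which metadata this estimator requests.
print(lr.get_metadata_routing())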

Feature 2: HDBSCAN: hierarchical density-based clustering

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html

cluster.HDBSCAN performs a modified version of cluster.DBSCAN over multiple epsilon values simultaneously, which lets it find clusters of varying densities and makes it more robust to parameter selection than cluster.DBSCAN.

import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score

X, true_labels = load_digits(return_X_y=True)
print(f"数字的数量:{len(np.unique(true_labels))}")

hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
# HDBSCAN labels noise points as -1; keep only the clustered samples.
non_noise_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"Number of clusters found: {len(np.unique(non_noise_labels))}")

print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noise_labels))
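
Samples labeled -1 are noise points that HDBSCAN declined to cluster, so the V-measure above is computed only on the samples actually assigned to a cluster.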

Feature 3: TargetEncoder

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html

preprocessing.TargetEncoder is ideal for categorical features with high cardinality. It encodes each category based on a shrunk estimate of the mean target value for observations belonging to that category.

import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["cat"] * 30 + ["dog"] * 20 + ["snake"] * 38], dtype=object).T
y = [90.3] * 30 + [20.4] * 20 + [21.2] * 38

enc = TargetEncoder(random_state=0)
X_trans = enc.fit_transform(X, y)

enc.encodings_
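
To make "shrunk estimate" concrete, here is an illustrative hand computation using simple additive smoothing with a fixed weight m. This is only a sketch: the real TargetEncoder chooses the smoothing automatically (smooth="auto" by default) and applies cross fitting inside fit_transform, so its numbers will differ.

import numpy as np

# Same data as above; the "shrunk" encoding blends each category's mean
# target with the global mean, pulling small categories toward it.
y = np.array([90.3] * 30 + [20.4] * 20 + [21.2] * 38)
global_mean = y.mean()
m = 10.0  # hypothetical fixed smoothing strength, for illustration only

groups = {"cat": y[:30], "dog": y[30:50], "snake": y[50:]}
for category, values in groups.items():
    n = len(values)
    shrunk = (n * values.mean() + m * global_mean) / (n + m)
    print(category, round(shrunk, 2))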

Feature 4: Decision trees support missing values

The tree.DecisionTreeClassifier and tree.DecisionTreeRegressor classes now support missing values. For each potential threshold on the non-missing data, the splitter evaluates the split with all missing values going to the left node or to the right node.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)
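
Here the splitter routes the missing value to the same side as the class-1 samples, so the call returns array([0, 0, 1, 1]).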

Feature 5: Validation Curve Display

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ValidationCurveDisplay.html

You can now create a ValidationCurveDisplay instance with the from_estimator class method to visualize a validation curve.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ValidationCurveDisplay

X, y = make_classification(1000, 10, random_state=0)

_ = ValidationCurveDisplay.from_estimator(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=np.geomspace(1e-5, 1e3, num=9),
    score_type="both",
    score_name="Accuracy",
)
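
With score_type="both", the display plots the training and test score curves on the same axes, making it easy to spot where the model over- or under-fits across the C range.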

Feature 6: Gamma loss

Through loss="gamma"parameters, ensemble.HistGradientBoostingRegressorthe class supports the use of Gamma bias loss function. This loss function is suitable for modeling strictly positive-valued targets with right-skewed distributions.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 500, 10
rng = np.random.RandomState(0)
X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
coef = rng.uniform(low=-10, high=20, size=n_features)
y = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)
gbdt = HistGradientBoostingRegressor(loss="gamma")
cross_val_score(gbdt, X, y).mean()
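
Note that y is drawn here from a gamma distribution, so it is strictly positive, which is exactly the kind of target the Gamma deviance loss expects.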

Feature 7: Grouping Infrequent Categories

Similar to preprocessing.OneHotEncoder, preprocessing.OrdinalEncoder now supports grouping infrequent categories into a single output per feature. The parameters that enable this grouping are min_frequency and max_categories.

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

X = np.array(
    [["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T
enc = OrdinalEncoder(min_frequency=6).fit(X)
enc.infrequent_categories_
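
Since dog appears 5 times and snake only 3, both fall below min_frequency=6 and are flagged as infrequent: enc.infrequent_categories_ returns [array(['dog', 'snake'], dtype=object)], and both categories are encoded to a single shared value.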
