Author: Coggle
Introduction
scikit-learn 1.3
This release includes many bug fixes and improvements, along with several important new features. For an exhaustive list of all changes, see the release notes:
https://scikit-learn.org/stable/whats_new/v1.3.html#changes-1-3
Install the latest version using pip:
pip install --upgrade scikit-learn
Or use conda:
conda install -c conda-forge scikit-learn
Feature 1: Metadata Routing
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html
scikit-learn is introducing a new way to route metadata such as sample_weight through the codebase. This changes how meta-estimators like pipeline.Pipeline and model_selection.GridSearchCV route metadata.
Although the infrastructure for this feature is included in this release, work is still in progress and not all meta-estimators support this new feature. You can learn more about this feature in the Metadata Routing User Guide.
Feature 2: HDBSCAN: hierarchical density-based clustering
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html
By running a modified version of cluster.DBSCAN over multiple epsilon values simultaneously, cluster.HDBSCAN finds clusters of varying densities, which makes it more robust to parameter selection than cluster.DBSCAN.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score
X, true_labels = load_digits(return_X_y=True)
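print(f"Number of digits: {len(np.unique(true_labels))}")
hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
# Label -1 marks noise points; drop them before counting and scoring clusters.
non_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"Number of clusters found: {len(np.unique(non_noisy_labels))}")
print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))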
Feature 3: TargetEncoder
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
preprocessing.TargetEncoder is well suited to categorical features with high cardinality. It encodes each category using a shrunk estimate of the mean target value of the observations belonging to that category.
import numpy as np
from sklearn.preprocessing import TargetEncoder
X = np.array([["cat"] * 30 + ["dog"] * 20 + ["snake"] * 38], dtype=object).T
y = [90.3] * 30 + [20.4] * 20 + [21.2] * 38
enc = TargetEncoder(random_state=0)
X_trans = enc.fit_transform(X, y)
enc.encodings_
Feature 4: Decision trees support missing values
The tree.DecisionTreeClassifier and tree.DecisionTreeRegressor classes now support missing values. For each candidate threshold on the non-missing data, the splitter evaluates the split twice: once with all missing values sent to the left node and once with all of them sent to the right node.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)
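The split search described above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's actual splitter: it handles one feature, uses Gini impurity, and the helper names gini and best_split_with_missing are made up for this example.

```python
import numpy as np


def gini(labels):
    # Gini impurity of a label array; an empty node contributes nothing.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split_with_missing(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    missing = np.isnan(x)
    values = np.unique(x[~missing])
    thresholds = (values[:-1] + values[1:]) / 2  # midpoints between sorted values
    best = None
    for t in thresholds:
        # Try the same threshold twice: missing values all left, then all right.
        for missing_side in ("left", "right"):
            go_left = np.where(missing, missing_side == "left", x <= t)
            left, right = y[go_left], y[~go_left]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or impurity < best[2]:
                best = (float(t), missing_side, float(impurity))
    return best


best = best_split_with_missing([0, 1, 6, np.nan], [0, 0, 1, 1])
print(best)  # (threshold, side that receives missing values, weighted impurity)
```

On the four-sample data from the example above, the sketch picks the threshold between 1 and 6 and sends the missing value to the right, which separates the two classes perfectly.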
Feature 5: Validation Curve
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ValidationCurveDisplay.html
You can now use ValidationCurveDisplay.from_estimator to create a ValidationCurveDisplay instance that plots the validation curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ValidationCurveDisplay
X, y = make_classification(1000, 10, random_state=0)
# Plot train and test accuracy as the regularization parameter C varies.
_ = ValidationCurveDisplay.from_estimator(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=np.geomspace(1e-5, 1e3, num=9),
    score_type="both",
    score_name="Accuracy",
)
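If you want the numbers behind the plot rather than the figure, model_selection.validation_curve returns the raw per-fold scores. The following is a small sketch that reuses the same data and parameter range and relies on the default 5-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X, y = make_classification(1000, 10, random_state=0)
param_range = np.geomspace(1e-5, 1e3, num=9)

# Each row corresponds to one value of C, each column to one cross-validation fold.
train_scores, test_scores = validation_curve(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=param_range,
)

# Average over folds to get one train and one test score per value of C.
for C, train, test in zip(
    param_range, train_scores.mean(axis=1), test_scores.mean(axis=1)
):
    print(f"C={C:.0e}  train={train:.3f}  test={test:.3f}")
```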
Feature 6: Gamma loss
With loss="gamma", the ensemble.HistGradientBoostingRegressor class supports the Gamma deviance loss function. This loss is suitable for modeling strictly positive targets with a right-skewed distribution.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_low_rank_matrix
from sklearn.ensemble import HistGradientBoostingRegressor
n_samples, n_features = 500, 10
rng = np.random.RandomState(0)
X = make_low_rank_matrix(n_samples, n_features, random_state=rng)
coef = rng.uniform(low=-10, high=20, size=n_features)
y = rng.gamma(shape=2, scale=np.exp(X @ coef) / 2)
gbdt = HistGradientBoostingRegressor(loss="gamma")
cross_val_score(gbdt, X, y).mean()
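For reference, the mean Gamma deviance between targets y and predictions p is the average of 2 * (log(p / y) + y / p - 1). The sketch below computes it by hand and compares the result with sklearn.metrics.mean_gamma_deviance. The sample values are arbitrary, and the loss used internally by the estimator may differ from this metric by constant factors.

```python
import numpy as np
from sklearn.metrics import mean_gamma_deviance


def gamma_deviance(y_true, y_pred):
    # Both arrays must be strictly positive for the logarithm and ratio to be defined.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(2 * (np.log(y_pred / y_true) + y_true / y_pred - 1))


y_true = np.array([2.0, 0.5, 1.0, 4.0])
y_pred = np.array([0.5, 0.5, 2.0, 2.0])

print(gamma_deviance(y_true, y_pred))
print(mean_gamma_deviance(y_true, y_pred))
```

The deviance is zero when predictions equal the targets. Because it depends only on the ratio y / p, it penalizes relative rather than absolute errors.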
Feature 7: Long-tail Category Aggregation
Similar to preprocessing.OneHotEncoder, preprocessing.OrdinalEncoder now supports aggregating infrequent categories into a single output per feature. The parameters that enable this aggregation are min_frequency and max_categories.
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
X = np.array(
[["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3], dtype=object
).T
enc = OrdinalEncoder(min_frequency=6).fit(X)
enc.infrequent_categories_