Feature selection - Weka search functions

Due to the limitations of the author's knowledge, this article may be obscure and difficult to understand in places. The author apologizes in advance to the friends who read it!
Weka 3.8.1 provides three search functions for feature selection: Ranker, GreedyStepwise, and BestFirst. All three work together with an evaluation function to filter and rank the features in a training data set. What follows is the author's personal understanding of these three classes; if there are mistakes, readers who spot them are welcome to criticize and correct them.

In general, Ranker is the fastest. If relationships between features can be ignored (as with Naive Bayes, for example), it is a good choice. GreedyStepwise is the second fastest; if resources are limited but relationships between features matter (as with logistic regression or decision trees), GreedyStepwise is recommended. If more resources are available and relationships between features should be considered, BestFirst is recommended, because its results are better than those of GreedyStepwise.
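
All three searches plug into the same Weka attribute-selection harness: choose an evaluation function, choose a search function, and run them over the training set. Note that the evaluator must match the search: Ranker needs a single-attribute evaluator, while GreedyStepwise and BestFirst need a subset evaluator such as CfsSubsetEval. Below is a minimal sketch of the shared pattern; the file name and the particular evaluator/search pairing are illustrative assumptions, not requirements.

```java
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectionHarness {
    // Shared pattern: every search in this article plugs into this same harness.
    static int[] select(Instances data, ASEvaluation eval, ASSearch search) throws Exception {
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(eval);  // the evaluation function
        selector.setSearch(search);   // the search function
        selector.SelectAttributes(data);
        return selector.selectedAttributes(); // chosen indices, with the class index appended last
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // illustrative file name
        data.setClassIndex(data.numAttributes() - 1);
        int[] chosen = select(data, new CfsSubsetEval(), new GreedyStepwise());
        System.out.println(java.util.Arrays.toString(chosen));
    }
}
```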

Ranker
Obtains a ranked list of features, sorted by each feature's individual contribution.
Advantage:
Fast; each feature is evaluated independently.
Disadvantage:
Can only evaluate single features, not combinations of features.
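
A minimal Ranker sketch, assuming InfoGainAttributeEval as the single-feature evaluator and an ARFF file named train.arff (both illustrative choices); rankedAttributes() returns the sorted list described above.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankerDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // illustrative file name
        data.setClassIndex(data.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10); // keep only the top 10 features (illustrative value)

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // scores each feature on its own
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // One {attribute index, merit} pair per feature, best first
        double[][] ranked = selector.rankedAttributes();
        for (double[] pair : ranked) {
            System.out.printf("attribute %d: merit %.4f%n", (int) pair[0], pair[1]);
        }
    }
}
```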

GreedyStepwise
Obtains a recommended feature set. The search starts from an initially specified feature set; if none is specified, it starts from the empty set.
Starting from that initial set, each feature is traversed in order and evaluated for its effect when added to the current set: if it has a positive effect, the feature is added to the set; if not, the feature is ignored.
Advantages:
Every feature gets evaluated, so a reasonably valuable feature set can usually be found; the amount of computation is not large and the running time is not too long.
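
A GreedyStepwise sketch under the same assumptions (CfsSubsetEval as the subset evaluator, illustrative file name). setSearchBackwards(false) gives the forward search described above, starting from the empty set.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GreedyStepwiseDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // illustrative file name
        data.setClassIndex(data.numAttributes() - 1);

        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false); // forward selection, starting from the empty set

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // scores feature subsets, not single features
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(java.util.Arrays.toString(selector.selectedAttributes()));
    }
}
```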
Doubt 1:
The base set against which each feature is evaluated keeps changing, so features are not judged on an equal footing. For example, the first feature contributes 1.0 and the second contributes 1.1, and both are selected. By the time the n-th feature is considered, the current set already contains many valuable features, so even a feature worth 2.0 may be rejected. In other words, selection is relatively loose for early features and much stricter for features late in the traversal.
Doubt 2:
Because features are screened in a fixed order, a particularly good combination may never be found. For example, suppose the combination of features 1, 2, and 3 scores 1.0, adding feature 4 to {1, 2, 3} does not help, and adding feature 5 does not help either, yet the pair {4, 5} on its own scores better than {1, 2, 3}. In that case the combination of 4 and 5 can never be found. (A doubtful point, still to be verified; see the sketch below.)
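
A toy sketch of the mechanism behind Doubt 2. The merit table below is hypothetical (the numbers just mirror the example above; this is not Weka's actual scoring): the greedy loop commits to {1, 2, 3} and never scores the pair {4, 5} together.

```java
import java.util.*;

public class GreedyMissDemo {
    // Hypothetical merit scores for a few feature subsets (illustrative numbers
    // from the example above); every subset not listed scores 0.
    static final Map<Set<Integer>, Double> MERIT = Map.of(
            Set.of(1), 0.5,
            Set.of(1, 2), 0.8,
            Set.of(1, 2, 3), 1.0,
            Set.of(4, 5), 1.3   // jointly strong pair, weak individually
    );

    static double merit(Set<Integer> subset) {
        return MERIT.getOrDefault(subset, 0.0);
    }

    public static void main(String[] args) {
        List<Integer> features = List.of(1, 2, 3, 4, 5);
        Set<Integer> selected = new TreeSet<>();
        double best = 0.0;
        // Greedy forward traversal: add a feature only if it improves the score.
        for (int f : features) {
            Set<Integer> candidate = new TreeSet<>(selected);
            candidate.add(f);
            if (merit(candidate) > best) {
                best = merit(candidate);
                selected = candidate;
            }
        }
        // Prints [1, 2, 3]: the pair {4, 5} is never evaluated together.
        System.out.println(selected + " with merit " + best);
    }
}
```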

BestFirst
1. Initialize the feature set r as the basis for subsequent computation. The initial set can be specified as a parameter; otherwise it is empty, or, if searching in reverse (backward elimination), it starts as the full feature set.
2. Clone the feature set r to obtain a new set c. Traverse all candidate features, find the n features that combine best with the clone c, add each of them to the current set, and store the resulting candidate sets in a list l (n is configurable).
3. Take the first object from list l, i.e. the feature set with the best current score, as the new clone set c. Compare c with r; if c scores better, assign c to r. Then delete the first object from list l and repeat step 2.
4. Repeat step 3 in a loop. If the feature set r has not changed after n consecutive iterations, exit and return the feature set r.
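
A BestFirst sketch, again assuming CfsSubsetEval and an illustrative file name. setSearchTermination corresponds to the n in step 4: the number of consecutive non-improving candidates allowed before the search stops and returns r.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BestFirstDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff"); // illustrative file name
        data.setClassIndex(data.numAttributes() - 1);

        BestFirst search = new BestFirst();
        // Step 4 above: give up after this many consecutive non-improving subsets
        search.setSearchTermination(5);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // subset evaluator, as BestFirst requires
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(java.util.Arrays.toString(selector.selectedAttributes()));
    }
}
```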
Advantages:
Each feature participates independently, so every feature competes in a fair environment; this resolves Doubt 1 raised for GreedyStepwise.
Combinations are not built in one fixed feature order, which resolves Doubt 2 raised for GreedyStepwise.
If the parameters are set large enough, almost all possibilities can be traversed. Following the principle of always picking the best of the best, when computing resources are sufficient it is possible to find a genuinely good and reliable feature set.
Doubt:
The amount of computation is relatively large and consumes considerable computing resources. If the parameters are set small, the computation stays manageable, but there is still a high chance that the optimal feature set will not be found.
