Reading data from files and databases
See reference links 1 and 2.
The most commonly used Weka components are:
- Instances: your data
- Filter: preprocessing of the data
- Classifier / Clusterer: classification / clustering built on the preprocessed data
- Evaluation: evaluating a classifier / clusterer
- Attribute selection: removing irrelevant attributes from the data
The following code can be used for testing:
import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.experiment.InstanceQuery;

public static Instances getInstances(String filePath) {
    try {
        // Hard-coded for this test; note that it overrides the filePath argument
        filePath = "C:\\Weka-3-8\\data\\iris.arff";
        // Weka 3.5.5 and 3.4.x
        Instances data = new Instances(new BufferedReader(new FileReader(filePath)));
        // Setting the class attribute.
        // The class index is the index of the target attribute used for classification.
        // In ARFF files it is by convention the last attribute, which is why it is set to numAttributes() - 1.
        // You must set the class index before calling a Weka function such as
        // weka.classifiers.Classifier.buildClassifier(data).
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("#################data:");
        System.out.println(data);
        // Weka 3.5.5 and newer
        // DataSource is not limited to ARFF files; it can also read CSV and other formats
        // (basically every format Weka can import through its converters).
        DataSource source = new DataSource(filePath);
        Instances data2 = source.getDataSet();
        // Set the class attribute if the data format does not provide this information
        // (e.g., the XRFF format saves the class attribute information as well)
        if (data2.classIndex() == -1)
            data2.setClassIndex(data2.numAttributes() - 1);
        System.out.println("#################data2:");
        System.out.println(data2);
        // Read from a database
        InstanceQuery query = new InstanceQuery();
        // The database connection is configured inside weka.jar (see reference link 3)
        query.setUsername("root");
        query.setPassword("");
        query.setQuery("select * from url_features limit 0,10"); // url_features
        // If your data is sparse, you can say so too
        // query.setSparseData(true);
        Instances data3 = query.retrieveInstances();
        // Print the whole dataset
        System.out.println(data3);
        // numInstances() returns the number of instances in the dataset
        for (int i = 0; i < data3.numInstances(); i++) {
            // instance(i) returns the i-th instance
            System.out.println(data3.instance(i));
        }
        return data2;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
For database configuration instructions, see reference link 3.
Classifier
See reference link 4.
To classify a dataset, the first step is to specify which column is the class attribute. If this step is forgotten (and it often is), the error "Class index is negative (not set)!" appears. The class attribute is set with the setClassIndex method of the Instances class. It is usually the last column, whose index is the value returned by the numAttributes method of the Instances class minus one.
import weka.classifiers.trees.J48;
import weka.core.Instances;

Instances m_instances = getInstances(filePath); // uses data2 returned by the method above
J48 classifier = new J48();
//NaiveBayes classifier2 = new NaiveBayes();
//SMO classifier = new SMO();
classifier.buildClassifier(m_instances);
// Prints the classification result (the index of the predicted class value) for rows 0, 60 and 110 of the data
System.out.println(classifier.classifyInstance(m_instances.instance(0)));
System.out.println(classifier.classifyInstance(m_instances.instance(60)));
System.out.println(classifier.classifyInstance(m_instances.instance(110)));
Classification and evaluation
See reference link 5.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

// First create an Evaluation object. The Evaluation class has no no-argument constructor;
// an Instances object is usually passed as the constructor argument.
// If you do not have separate training and test sets, you can use cross-validation.
// crossValidateModel takes four arguments: the classifier, the dataset to evaluate on,
// the number of folds (10 is a common choice), and a random number generator.
// Note that the classifier does not need to be trained before calling crossValidateModel.
// Evaluation offers many output methods. If you have used the Weka GUI, you will find that each
// method's output corresponds to a part of the results shown there. The three methods used in
// this example, toClassDetailsString, toSummaryString and toMatrixString, are the most common.
public static void crossValidation() throws Exception
{
    J48 classifier = new J48();
    //NaiveBayes classifier = new NaiveBayes();
    //SMO classifier = new SMO();
    Evaluation eval = new Evaluation(m_instances);
    eval.crossValidateModel(classifier, m_instances, 10, new Random(1));
    System.out.println(eval.toClassDetailsString());
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());
}
// If you have a training set and a test set, use the evaluateModel method of the Evaluation class.
// Its arguments are: a trained classifier, and the dataset to evaluate it on.
// For simplicity this example reuses the training set as the test set; I hope that does not confuse anyone.
public static void evaluateTestData() throws Exception
{
    J48 classifier = new J48();
    //NaiveBayes classifier = new NaiveBayes();
    //SMO classifier = new SMO();
    classifier.buildClassifier(m_instances);
    Evaluation eval = new Evaluation(m_instances);
    eval.evaluateModel(classifier, m_instances);
    System.out.println(eval.toClassDetailsString());
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());
}
The output is:
=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     Iris-setosa
                 0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     Iris-versicolor
                 0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     Iris-virginica
Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924

Correctly Classified Instances         144               96      %
Incorrectly Classified Instances         6                4      %
Kappa statistic                          0.94
Mean absolute error                      0.035
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
Description:
                Actually positive       Actually negative
Returned        true positive (tp)      false positive (fp)
Not returned    false negative (fn)     true negative (tn)
1. FN (False Negative): the sample is judged negative but is in fact positive.
2. FP (False Positive): the sample is judged positive but is in fact negative.
3. TN (True Negative): the sample is judged negative and is indeed negative.
4. TP (True Positive): the sample is judged positive and is indeed positive.
TP Rate: TP / (TP + FN), the proportion of all positive samples that the classifier identifies as positive.
FP Rate: FP / (FP + TN), the proportion of all negative samples that the classifier mistakes for positive.
Precision: P = tp / (tp + fp), the proportion of the samples the system identifies as positive that are actually positive.
Recall: R = tp / (tp + fn), the proportion of all actually positive samples that the system identifies. It is the same quantity as the TP Rate.
F-Measure: the F value is the weighted harmonic mean of Precision and Recall. P and R sometimes pull in opposite directions, so they need to be considered together; the most common way to do this is the F-Measure (also known as F-Score):
F1 = 2 * P * R / (P + R)
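As a rough check of these definitions, the sketch below recomputes the per-class rates from the confusion matrix printed above, treating each class in turn as "positive" and the other two as "negative". The class name ClassMetrics and its helper methods are made up for this illustration and are not part of Weka.

```java
// Illustrative helper (not part of Weka): per-class metrics from a confusion matrix.
final class ClassMetrics {
    static double tpRate(int tp, int fn) { return (double) tp / (tp + fn); }
    static double fpRate(int fp, int tn) { return (double) fp / (fp + tn); }
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn) { return tpRate(tp, fn); } // same quantity as TP Rate
    static double f1(double p, double r) { return 2 * p * r / (p + r); }

    // One-vs-rest counts {tp, fp, fn, tn} for class k.
    // Rows of m are actual classes, columns are predicted classes.
    static int[] counts(int[][] m, int k) {
        int tp = m[k][k], fp = 0, fn = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == k && j != k) fn += m[i][j]; // actual k, predicted something else
                if (i != k && j == k) fp += m[i][j]; // predicted k, actually something else
            }
        }
        return new int[] {tp, fp, fn, total - tp - fp - fn};
    }

    public static void main(String[] args) {
        int[][] iris = {{49, 1, 0}, {0, 47, 3}, {0, 2, 48}};
        String[] names = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"};
        for (int k = 0; k < iris.length; k++) {
            int[] c = counts(iris, k);
            double p = precision(c[0], c[1]);
            double r = recall(c[0], c[2]);
            System.out.printf("%-16s TPRate=%.3f FPRate=%.3f P=%.3f R=%.3f F=%.3f%n",
                    names[k], tpRate(c[0], c[2]), fpRate(c[1], c[3]), p, r, f1(p, r));
        }
    }
}
```

For Iris-virginica, for example, the counts are tp = 48, fp = 3, fn = 2, tn = 97, which gives a precision of about 0.941 and an F-Measure of about 0.950, matching the corresponding row of the output.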
MCC: the Matthews correlation coefficient, which is a better measure for imbalanced datasets. Among the common performance scores, MCC is the only one that properly takes the relative sizes of all four cells of the confusion matrix into account. Especially on an imbalanced dataset (e.g., one in which positive examples make up 99.9% of the data), MCC can tell accurately whether the predictions are any good, while accuracy or F1 cannot.
MCC is essentially a correlation coefficient between the actual classification and the predicted classification. Its range is [-1, 1]: a value of 1 means perfect prediction, 0 means the prediction is no better than random, and -1 means the predicted classification is completely inconsistent with the actual one.
The formula is:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
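To make the MCC definition concrete, here is a minimal sketch that applies the standard MCC formula (written out in the code comment) to the one-vs-rest counts read off the confusion matrix above. The class name Mcc and the convention of returning 0 when the denominator is zero are assumptions of this sketch, not Weka's own code.

```java
// Illustrative helper (not part of Weka): Matthews correlation coefficient.
final class Mcc {
    // MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    static double mcc(long tp, long fp, long fn, long tn) {
        double denom = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        // Assumed convention: report 0 when any marginal sum is zero.
        return denom == 0 ? 0.0 : (tp * tn - fp * fn) / denom;
    }

    public static void main(String[] args) {
        // Counts (tp, fp, fn, tn) per class, taken from the iris confusion matrix.
        System.out.printf("Iris-setosa     %.3f%n", mcc(49, 0, 1, 100));
        System.out.printf("Iris-versicolor %.3f%n", mcc(47, 3, 3, 97));
        System.out.printf("Iris-virginica  %.3f%n", mcc(48, 3, 2, 97));
    }
}
```

With the iris counts this gives roughly 0.985, 0.910 and 0.925, in line with the MCC column of the Weka output.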
ROC Area
PRC Area
The x-axis of the ROC curve is the false positive rate (FPR) and the y-axis is the true positive rate (TPR).
Both curves are usually summarized by the area under the curve (AUC). It normally lies between 0.5 and 1.0, and the larger the AUC, the better the model performs. Optimizing the ROC curve tends to maximize both the correctly classified positives and the correctly classified negatives. The PR curve is optimized differently: it tends to maximize the correctly classified positives and does not directly consider the correctly classified negatives.
When positive and negative samples are very unevenly distributed (highly skewed datasets), the PRC reflects the quality of a classifier more effectively than the ROC. Its x-axis is Recall and its y-axis is Precision.
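One way to build intuition for the ROC area: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The class name RocAuc and the toy scores are invented for illustration; Weka derives its per-class ROC Area from the classifier's class probability estimates, which this sketch does not reproduce.

```java
// Illustrative helper (not part of Weka): ROC AUC by pairwise comparison.
final class RocAuc {
    // Fraction of (positive, negative) pairs in which the positive sample
    // has the higher score; ties count as half a win.
    static double auc(double[] scores, boolean[] positive) {
        double wins = 0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        // Undefined when one of the two classes is missing.
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.4, 0.35, 0.8};
        boolean[] positive = {false, false, true, true};
        System.out.println(auc(scores, positive)); // 3 of the 4 pairs are ordered correctly
    }
}
```

The quadratic loop is fine for a small example; for large datasets a sort-based computation would be the usual choice.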
Correctly Classified Instances: the instances that were classified correctly
Incorrectly Classified Instances: the instances that were misclassified
Kappa statistic: Cohen's kappa coefficient, a measure of agreement (here between the predicted and the actual classes) that corrects for agreement expected by chance. Values usually lie between 0 and 1 (negative values are possible when agreement is worse than chance). Kappa ≥ 0.75 indicates good agreement; 0.4 ≤ Kappa < 0.75 indicates fair agreement; Kappa < 0.4 indicates poor agreement.
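The kappa value in the output can be reproduced from the confusion matrix alone. The following sketch uses the usual definition kappa = (po - pe) / (1 - pe), where po is the observed agreement (the accuracy) and pe is the agreement expected by chance from the row and column totals. The class name Kappa is hypothetical and the code is an illustration, not Weka's implementation.

```java
// Illustrative helper (not part of Weka): Cohen's kappa from a confusion matrix.
final class Kappa {
    static double kappa(int[][] m) {
        int n = m.length;
        double total = 0, agree = 0;
        double[] rowSum = new double[n];
        double[] colSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                if (i == j) agree += m[i][j];
            }
        }
        double po = agree / total; // observed agreement
        double pe = 0;             // agreement expected by chance
        for (int i = 0; i < n; i++) {
            pe += rowSum[i] * colSum[i] / (total * total);
        }
        return (po - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        int[][] iris = {{49, 1, 0}, {0, 47, 3}, {0, 2, 48}};
        System.out.printf("kappa = %.2f%n", kappa(iris));
    }
}
```

For the iris matrix po = 144/150 = 0.96 and pe = 1/3, so kappa = 0.94, the value shown in the summary.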
Mean absolute error
Root mean squared error
The mean absolute error and the root mean squared error measure the difference between the classifier's predictions and the actual results; the smaller, the better.
Relative absolute error
Root relative squared error
The relative absolute error and the root relative squared error: an absolute error sometimes does not show how large the error really is, so the relative errors express it as a proportion, which better reflects its true size.
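The four error measures can be sketched for a numeric target as follows. This is the textbook form, under the assumption that the relative errors are normalized by the error of a baseline that always predicts the mean of the actual values; the class name ErrorMeasures is hypothetical. Weka itself computes these figures for nominal classes from the predicted class probabilities, so the sketch illustrates the idea rather than reproducing the numbers in the output above.

```java
// Illustrative helper (not part of Weka): error measures for numeric predictions.
final class ErrorMeasures {
    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    // Mean absolute error: average of |predicted - actual|.
    static double mae(double[] predicted, double[] actual) {
        double s = 0;
        for (int i = 0; i < actual.length; i++) s += Math.abs(predicted[i] - actual[i]);
        return s / actual.length;
    }

    // Root mean squared error: square root of the average squared difference.
    static double rmse(double[] predicted, double[] actual) {
        double s = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = predicted[i] - actual[i];
            s += d * d;
        }
        return Math.sqrt(s / actual.length);
    }

    // Relative absolute error: total absolute error divided by the total
    // absolute error of always predicting the mean (assumed baseline).
    static double rae(double[] predicted, double[] actual) {
        double m = mean(actual), num = 0, den = 0;
        for (int i = 0; i < actual.length; i++) {
            num += Math.abs(predicted[i] - actual[i]);
            den += Math.abs(actual[i] - m);
        }
        return num / den;
    }

    // Root relative squared error: the squared-error analogue of rae.
    static double rrse(double[] predicted, double[] actual) {
        double m = mean(actual), num = 0, den = 0;
        for (int i = 0; i < actual.length; i++) {
            num += (predicted[i] - actual[i]) * (predicted[i] - actual[i]);
            den += (actual[i] - m) * (actual[i] - m);
        }
        return Math.sqrt(num / den);
    }

    public static void main(String[] args) {
        double[] actual = {1, 2, 3, 4};
        double[] predicted = {1, 2, 3, 6};
        System.out.printf("MAE=%.3f RMSE=%.3f RAE=%.1f%% RRSE=%.1f%%%n",
                mae(predicted, actual), rmse(predicted, actual),
                100 * rae(predicted, actual), 100 * rrse(predicted, actual));
    }
}
```

A relative error below 100 % means the model does better than the baseline; Weka prints the two relative errors as percentages in the same way.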
Confusion Matrix: each row is an actual class and each column is the class the instances were classified as.
Attribute selection
The following two pieces of code can be used for testing; the mathematical principles behind each function are not covered in depth here. See reference links 8 and 9.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;

public static void selectAtt() throws Exception
{
    // AttributeSelection here is weka.attributeSelection.AttributeSelection
    AttributeSelection attsel = new AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    attsel.setEvaluator(eval);
    attsel.setSearch(search);
    attsel.SelectAttributes(m_instances);
    int[] attarray = attsel.selectedAttributes();
    System.out.println("result:" + attsel.toResultsString());
    System.out.println("the selected attributes are as follows:");
    for (int i = 0; i < attarray.length; i++) {
        //System.out.println(attarray[i]);
        System.out.print(m_instances.attribute(attarray[i]).name() + ",");
    }
}
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;

public static void selectAttribute() throws Exception
{
    // Initialize the search method and the attribute evaluator
    Ranker rank = new Ranker();
    InfoGainAttributeEval eval = new InfoGainAttributeEval();
    // 3. Evaluate each attribute with the evaluator
    eval.buildEvaluator(m_instances);
    // 4. Filter the attributes with the chosen search method.
    // The Ranker used here simply sorts the attributes by their InfoGain value.
    int[] attrIndex = rank.search(eval, m_instances);
    // 5. Print the results: the attribute ranking together with each attribute's InfoGain value
    StringBuilder attrIndexInfo = new StringBuilder();
    StringBuilder attrInfoGainInfo = new StringBuilder();
    attrIndexInfo.append("Selected attributes:");
    attrInfoGainInfo.append("Ranked attributes:\n");
    for (int i = 0; i < attrIndex.length; i++) {
        attrIndexInfo.append(attrIndex[i]);
        attrIndexInfo.append(",");
        attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]));
        attrInfoGainInfo.append("\t");
        attrInfoGainInfo.append(m_instances.attribute(attrIndex[i]).name());
        attrInfoGainInfo.append("\n");
    }
    System.out.println(attrIndexInfo.toString());
    System.out.println(attrInfoGainInfo.toString());
}
The summary of functions below is based on reference link 10; link 11 covers additional material in more detail for further reading.
Weka has two attribute selection modes:
1. Attribute subset evaluator + search method (the search runs in a loop, and the evaluator is applied in each iteration)
2. Single-attribute evaluator + ranking method
Feature evaluation function
Evaluation criteria play an important role in the feature selection process; they are the basis on which features are selected. They can be divided into two kinds: criteria that measure the predictive power of each feature individually, and criteria that evaluate the overall predictive performance of a subset of features. Correspondingly, there are two important families of methods: Filter and Wrapper.
A Filter method generally does not depend on a specific learning algorithm to evaluate a feature subset. Instead it borrows ideas from statistics, information theory and other fields, evaluates the predictive power of each feature from the intrinsic properties of the dataset, and combines the features that score best into a feature subset. A Wrapper method embeds the subsequent learning algorithm in the feature selection process: it judges a feature subset by testing the prediction performance of the algorithm on that subset, and pays little attention to the predictive performance of each individual feature in it. Thus the optimal feature subset does not require each of its features to be individually optimal.
Attribute subset evaluators
CfsSubsetEval: considers the predictive value of each individual attribute together with the degree of redundancy among the attributes.
ClassifierSubsetEval: evaluates attribute subsets with a classifier.
ConsistencySubsetEval: checks how consistent the class values are when the training data is projected onto the attribute subset.
WrapperSubsetEval: uses a classifier plus cross-validation (wrapper method).
Search methods
BestFirst: greedy search with backtracking
ExhaustiveSearch: exhaustive search
GeneticSearch: search with a genetic algorithm
GreedyStepwise: greedy search without backtracking
RandomSearch: random search
RankSearch: ranks the attributes and then uses a subset evaluator to evaluate the promising subsets along that ranking
Single-attribute evaluators
ChiSquaredAttributeEval: evaluates attributes by the chi-squared (X2) statistic with respect to the class
GainRatioAttributeEval: evaluates attributes by gain ratio
InfoGainAttributeEval: evaluates attributes by information gain
OneRAttributeEval: evaluates attributes with the OneR method
PrincipalComponents: principal component analysis and transformation
ReliefFAttributeEval: instance-based attribute evaluation
SymmetricalUncertAttributeEval: evaluates attributes by symmetrical uncertainty
Ranking method
Ranker: ranks the attributes according to their evaluation scores
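To give an idea of what an information-gain evaluator measures, the sketch below computes the gain of one nominal attribute with respect to the class as H(class) - H(class | attribute), using entropy in bits. The class InfoGainSketch and the small weather-style arrays are made up for this illustration. It is a simplified stand-in, not the code of InfoGainAttributeEval, and it only handles nominal values without missing data.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative helper (not part of Weka): information gain of a nominal attribute.
final class InfoGainSketch {
    // Shannon entropy (in bits) of a distribution given as value counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0;
        for (int c : counts.values()) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // InfoGain = H(class) - sum over attribute values v of P(v) * H(class | v)
    static double infoGain(String[] attr, String[] cls) {
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> byValue = new HashMap<>();
        for (int i = 0; i < attr.length; i++) {
            classCounts.merge(cls[i], 1, Integer::sum);
            byValue.computeIfAbsent(attr[i], k -> new HashMap<>()).merge(cls[i], 1, Integer::sum);
        }
        double gain = entropy(classCounts, attr.length);
        for (Map<String, Integer> subset : byValue.values()) {
            int n = 0;
            for (int c : subset.values()) n += c;
            gain -= (double) n / attr.length * entropy(subset, n);
        }
        return gain;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: one nominal attribute and a yes/no class.
        String[] outlook = {"sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                            "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"};
        String[] play = {"no", "no", "yes", "yes", "yes", "no", "yes",
                         "no", "yes", "yes", "yes", "yes", "yes", "no"};
        System.out.printf("InfoGain(outlook) = %.3f%n", infoGain(outlook, play));
    }
}
```

On this toy data the gain comes out at about 0.247 bits. Ranking attributes then amounts to sorting them by this score, which is the role Ranker plays in the second code example above.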
Clustering algorithms
See reference link 1.
Building a clusterer is similar to building a classifier, except that the buildClusterer(Instances) method is used instead of buildClassifier(Instances). The following code snippet shows how to build an EM clusterer with a maximum of 100 iterations.
To evaluate a clusterer, you can use the ClusterEvaluation class, e.g., to output how many clusters were found.
import weka.clusterers.EM;
import weka.clusterers.ClusterEvaluation;
String[] options = new String[2];
options[0] = "-I"; // max. iterations
options[1] = "100";
EM clusterer = new EM(); // new instance of clusterer
clusterer.setOptions(options); // set the options
clusterer.buildClusterer(m_instances);
ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(clusterer); // the clusterer to evaluate
// data to evaluate the clusterer on
eval.evaluateClusterer(m_instances);//newData
// output # of clusters
System.out.println("# of clusters: " + eval.getNumClusters());
//eval.crossValidateModel( // cross-validate
// clusterer, m_instances, 10, // with 10 folds,newData
// new Random(1)); // and random number generator with seed 1
System.out.println(eval.clusterResultsToString());
Note that the setClassIndex call must be commented out for clustering; otherwise you get the error "weka.clusterers.EM: Cannot handle any class attribute!"
The output for the iris.arff dataset is:
EM
==
Number of clusters selected by cross validation: 4
Number of iterations performed: 16

                   Cluster
Attribute                0        1        2        3
                    (0.32)   (0.33)    (0.2)   (0.14)
====================================================
sepallength
  mean               5.897    5.006   6.9426   6.1304
  std. dev.         0.5279   0.3489    0.498   0.2943
sepalwidth
  mean              2.7519    3.418   3.1103   2.8088
  std. dev.         0.3103   0.3772   0.2952   0.2361
petallength
  mean              4.2267    1.464   5.8559   5.0993
  std. dev.          0.445   0.1718   0.4626   0.2462
petalwidth
  mean              1.3134    0.244   2.1495   1.8254
  std. dev.         0.1864   0.1061    0.232   0.2152
class
  Iris-setosa            1       51        1        1
  Iris-versicolor  48.1125        1   1.0182   3.8693
  Iris-virginica    2.0983        1  31.0375  19.8641
  [total]          51.2108       53  33.0557  24.7335

Clustered Instances
0       48 ( 32%)
1       50 ( 33%)
2       29 ( 19%)
3       23 ( 15%)

Log likelihood: -2.03504
References
1. Weka development [-1]: Using Weka in your code, https://blog.csdn.net/u010968153/article/details/46275445
2. Weka development [1]: the Instances class, https://blog.csdn.net/zt_706/article/details/8855286
3. Detailed steps for connecting Weka to a database, https://blog.csdn.net/qq_34760892/article/details/54630723
4. Weka development [2]: the Classifier class, https://blog.csdn.net/zt_706/article/details/8855314
5. Weka development [3]: the Evaluation class, https://blog.csdn.net/zt_706/article/details/8855339
6. Analysis of the Evaluation output for Weka classification, https://blog.csdn.net/qiao1245/article/details/50886070
7. Weka feature selection (Attribute Selection), http://blog.sciencenet.cn/blog-713110-568654.html
8. Weka feature selection (Attribute Selection), http://blog.sciencenet.cn/blog-713110-568654.html
9. Notes on secondary development with Weka, http://www.cnblogs.com/thinkml/p/4170399.html
10. Weka attribute selection, https://www.cnblogs.com/xaf-dfg/p/3558383.html
11. Summary of using the machine learning tool WEKA, including algorithm selection, parameter optimization and attribute selection, https://www.cnblogs.com/lutaitou/p/5818027.html
12. Papers on attribute selection algorithms, https://www.docin.com/p-215712031.html , https://wenku.baidu.com/view/2f7b18ece009581b6bd9eb46.html