Prediction(3)Model - Decision Tree

Error Message:
[error] (run-main-0) java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Solution:
Add these to your environment:
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxx
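
The error message also names two Hadoop properties, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey, as an alternative to the environment. A minimal sketch of that route, assuming a spark-shell session where `sc` is the SparkContext; the helper name `s3nPropsFromEnv` is made up for illustration:

```scala
// Hypothetical helper: map the two exported variables to the Hadoop
// property names quoted in the error message.
def s3nPropsFromEnv(env: Map[String, String]): Map[String, String] = {
  val envToProp = Seq(
    "AWS_ACCESS_KEY_ID"     -> "fs.s3n.awsAccessKeyId",
    "AWS_SECRET_ACCESS_KEY" -> "fs.s3n.awsSecretAccessKey"
  )
  // Variables that are not set are simply skipped.
  envToProp.flatMap { case (name, prop) => env.get(name).map(value => prop -> value) }.toMap
}

val props = s3nPropsFromEnv(sys.env)

// In spark-shell (assumes `sc` exists), apply them to the Hadoop configuration:
// props.foreach { case (key, value) => sc.hadoopConfiguration.set(key, value) }
```

Reading the keys from the environment keeps them out of the notebook or script itself.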

Exception:
numClasses: Int = 1500
categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int] = Map(0 -> 57, 1 -> 29674)
impurity: String = gini
maxDepth: Int = 5
maxBins: Int = 30000
java.lang.IllegalArgumentException: requirement failed: RandomForest/DecisionTree given maxMemoryInMB = 256, which is too small for the given features. Minimum value = 340
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:187)

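Solution:
Raise maxMemoryInMB above the reported minimum (340 here). The train helpers with fixed parameter lists do not expose this setting, so build a Strategy and pass it to DecisionTree.train.
http://stackoverflow.com/questions/31965611/how-to-increase-maxmemoryinmb-for-decisiontree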

https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree$

This code helps:
val categoricalFeaturesInfo2 = Map[Int, Int]( 0 -> 57, 1-> 29674)
val impurity2 = "variance"
val maxDepth2 = 5
val maxBins2 = 30000

val model2 = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo2, impurity2,
maxDepth2, maxBins2)

Core Code for Decision Tree Training
1. Import the Classes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.tree.configuration.Strategy
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Gini, Impurity}

2. Use a Case Class to Name the DataFrame Columns
case class City(cityName:String, cityCode:Double)

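// Number the distinct cities with zipWithIndex. A counter variable captured in
// the map closure is copied into every task, so codes could repeat across partitions.
val cities_df = sqlContext.sql("select distinct(city) from jobs").map(row => {
    row.getString(0)
}).zipWithIndex().map { case (cityName, index) =>
    // Codes start at 1; 0 is used later for cities that are not found.
    City(cityName, (index + 1).toDouble)
}.toDF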

//cities_df.count()

cities_df.registerTempTable("cities")

3. Use a Glob Pattern to Load the Data
val date = "{20,21,22,23,24,25,26}"
val clicksRaw_df = sqlContext.load(s"s3n://xx-prediction-engine/xxx/decision_tree_data/clicks/2015/09/" + date + "/*/*", "json")
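
The `{20,21,...}` part of the path is a Hadoop glob alternation, so one path matches all seven day directories. As a small sketch, the same string can be built from a range instead of being typed by hand. `dayGlob` is a hypothetical helper, not part of Spark, and the two-digit zero padding is an assumption based on directory names such as 09:

```scala
// Hypothetical helper: build a "{a,b,c}" glob alternation from a range of days.
// Days are zero-padded to two digits (assumed directory naming).
def dayGlob(days: Range): String =
  days.map(d => f"$d%02d").mkString("{", ",", "}")

val date = dayGlob(20 to 26)
// date is "{20,21,22,23,24,25,26}", the same value as the hand-written one
```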

Filter and aggregate the data:
clicksRaw_df.registerTempTable("clicks_raw")

val clicks_df = sqlContext.sql("select sum(count_num) as count_num,job_id as job_id from clicks_raw group by job_id")
clicks_df.registerTempTable("clicks")

val jobsRaw_df = sqlContext.load(s"s3n://xxx-prediction-engine/predictData/decision_tree_data/jobs_with_num/2015/09/" + date + "/*", "json")
jobsRaw_df.registerTempTable("jobs_raw")

val jobsall_df = sqlContext.sql("select sum(num) as num, id as id, city as city, industry as industry from jobs_raw group by id, city, industry ")
val jobindustry_df = jobsall_df.filter(jobsall_df("num") > 6)
jobindustry_df.registerTempTable("jobindustry")

val jobs_df = sqlContext.sql("select id as id, city as city, industry as industry from jobindustry where industry is not null and industry <> '' ")
jobs_df.registerTempTable("jobs")

val total = jobsRaw_df.count()
val total7days = jobs_df.count()
val total1 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num = 1").count()
val total2 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num = 2").count()
val total5 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num > 2 and c.count_num < 5").count()
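// Upper bound changed from "> 10" to "< 10": the original condition made the "> 4" test redundant.
val total10 = sqlContext.sql("select * from clicks c, jobs j where c.job_id = j.id and c.count_num > 4 and c.count_num < 10").count()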

println("total jobs= " + total)
println("total 7 days jobs= " + total7days)
println("1 clicks = " + total1 + " " + total1 * 100 / total7days + "%")
println("2 clicks = " + total2 + " " + total2 * 100 / total7days + "%")
println("5 clicks = " + total5 + " " + total5 * 100 / total7days + "%")
println("10 clicks = " + total10 + " " + total10 * 100 / total7days + "%")
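
Because count() returns a Long, an expression such as `total1 * 100 / total7days` is integer division and drops the fractional part. If one decimal place is wanted, a sketch like the following could be used; `percent` is a hypothetical helper:

```scala
// Hypothetical helper: share of `part` in `whole` as a percentage string
// with one decimal place. A zero denominator yields "n/a".
def percent(part: Long, whole: Long): String =
  if (whole == 0L) "n/a"
  else "%.1f%%".formatLocal(java.util.Locale.ROOT, part * 100.0 / whole)

// Usage with the counts from above would look like:
// println("1 clicks = " + total1 + " " + percent(total1, total7days))
```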

4. Prepare the LabeledPoint
val data = sqlContext.sql("select j.id, j.industry, c.count_num, cities.cityCode from jobs as j left join cities as cities on j.city = cities.cityName left join clicks as c on j.id = c.job_id ").map( row => {
    // column 0 - id
    // column 1 - industry
    // column 2 - count_num
    // column 3 - cityCode

    // The label is the click count; jobs without clicks (null after the left join) get 0.
    val label = row.get(2) match {
        case s: Long => s.toDouble
        case _ => 0.0
    }
    val industry = java.lang.Double.parseDouble(row.getString(1))
    // Cities missing from the cities table get code 0.
    val cityCode = row.get(3) match {
        case s: Double => s
        case _ => 0.0
    }
    val features = Vectors.dense(industry, cityCode)
    LabeledPoint(label, features)
})
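
The two `match` expressions above guard against nulls produced by the left joins: a job with no clicks or an unknown city yields null, which falls through to the default case. The same pattern on plain values, as a sketch (`toDoubleOrZero` is a hypothetical name):

```scala
// Hypothetical helper: turn a numeric cell into a Double;
// anything else, including null, becomes 0.0.
def toDoubleOrZero(cell: Any): Double = cell match {
  case d: Double => d
  case l: Long   => l.toDouble
  case i: Int    => i.toDouble
  case _         => 0.0
}
```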

val splits = data.randomSplit(Array(0.2,0.1))
val (trainingData, testData) = (splits(0), splits(1))
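
randomSplit normalizes its weights when they do not sum to 1, so Array(0.2, 0.1) sends roughly two thirds of the rows to training and one third to testing. The arithmetic, as a sketch (`normalize` is only an illustration of the rule, not Spark's code):

```scala
// Each weight divided by the sum of all weights.
def normalize(weights: Array[Double]): Array[Double] = {
  val total = weights.sum
  weights.map(_ / total)
}

val fractions = normalize(Array(0.2, 0.1))
// fractions(0) is about 0.667 (training), fractions(1) about 0.333 (test)
```

Writing the weights as Array(0.67, 0.33) would state the same split more directly. randomSplit also accepts a seed as a second argument if a repeatable split is needed.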

5. Decision Tree Classification and Regression
// Train a DecisionTree model for classification
// categoricalFeaturesInfo maps a feature index to its number of categories:
// feature 0 (industry) has 57, feature 1 (cityCode) has 29674.
val numClasses1 = 4000
val categoricalFeaturesInfo1 = Map[Int, Int]( 0 -> 57, 1-> 29674)
val impurity1 = "gini"
val maxDepth1 = 5
val maxBins1 = 30000

val strategy = new Strategy( Algo.Classification, Gini , maxDepth1, numClasses1, maxBins = maxBins1, categoricalFeaturesInfo = categoricalFeaturesInfo1, maxMemoryInMB = 1024)

val model1 = DecisionTree.train(trainingData, strategy)

// Train a DecisionTree model for regression
// Both features are categorical here as well, with the same category counts.
val categoricalFeaturesInfo2 = Map[Int, Int]( 0 -> 57, 1-> 29674)
val impurity2 = "variance"
val maxDepth2 = 5
val maxBins2 = 30000

val model2 = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo2, impurity2,
maxDepth2, maxBins2)

6. Evaluate the Model
// Evaluate model on test instances and compute test error
val labelAndPreds2 = testData.map { point =>
    val prediction = model2.predict(point.features)
    (point.label, prediction)
}

// Sample of rows with at least one click
labelAndPreds2.filter(x => x._1 > 0.0).take(5).foreach { case (label, prediction) =>
    println("label = " + label + " predict = " + prediction)
}

println("=============================================")

// Sample of rows with no clicks
labelAndPreds2.filter(x => x._1 == 0.0).take(5).foreach { case (label, prediction) =>
    println("label = " + label + " predict = " + prediction)
}

// Fraction of test points whose prediction differs from the label
val testErr = labelAndPreds2.filter(r => r._1 != r._2).count.toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned regression tree model:\n" + model2.toDebugString)
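
Since model2 is a regressor, its predictions are real numbers and rarely equal the label exactly, so the mismatch ratio above will often be close to 1. Mean squared error is a common alternative. A plain-Scala sketch on a local sequence of (label, prediction) pairs; the sample values are made up for illustration:

```scala
// Mean squared error over (label, prediction) pairs.
def meanSquaredError(pairs: Seq[(Double, Double)]): Double = {
  require(pairs.nonEmpty, "need at least one pair")
  pairs.map { case (label, prediction) => math.pow(label - prediction, 2) }.sum / pairs.size
}

// Made-up pairs, not results from the model.
val sample = Seq((0.0, 0.5), (2.0, 1.0), (1.0, 1.0))
val mse = meanSquaredError(sample)
// squared errors are 0.25, 1.0 and 0.0, so mse is 1.25 / 3
```

On the RDD from this step, the same idea would be `labelAndPreds2.map { case (l, p) => math.pow(l - p, 2) }.mean()`.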

Reference:
Decision Tree
http://spark.apache.org/docs/latest/mllib-guide.html

Factorization Machines
http://blog.csdn.net/itplus/article/details/40536025

http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
