[Practical guide] Spend a day (about 5 hours) learning Hadoop + Spark: big data analysis and machine learning in practice

The theme of this article is big data analysis and machine learning with Hadoop + Spark. Hadoop is widely known as the most commonly used big data platform. Spark, however, has risen rapidly: it is compatible with Hadoop and faster, and companies have begun to adopt it. For example, IBM has joined the Apache Spark community and intends to train one million data scientists. Google and Microsoft have also applied Spark to build services for cloud development, data analysis, and machine learning platforms. With these large companies on board, more companies are likely to adopt Hadoop + Spark for big data analysis in the future.

Although there are many big data books on the market, most lean toward either theory or application-level introductions, and while a great deal of information is available online, it is poorly organized. This article aims to explain the principles clearly and pair them with hands-on operations and example programs, lowering the learning curve of big data technology and leading the reader into the field of big data and machine learning. The big data ecosystem is very large and there is a great deal to learn. I hope that, once readers have picked up the basic concepts here, they can enter the field more easily and go on to study big data and related technologies in depth.

About this document

This document starts from an easy-to-understand introduction to the principles of big data and machine learning and describes their basic concepts, such as classification, analysis, training, modeling, and prediction, along with machine learning for recommendation engines, binary classification, multiclass classification, and regression analysis, and data visualization applications. It provides plenty of hands-on procedures and worked examples to lower the barrier to learning big data technology. It shows how to use VirtualBox to install multiple Linux virtual machines on a single Windows system, how to build a Hadoop cluster, and then how to set up a Spark development environment. The hands-on platform described here is not limited to a single physical computer: companies and schools with the resources can follow the text to build the same platform on multiple physical computers, which is closer to a real big data and machine learning operating environment.

Chapter overview

Chapter 1 Big Data and Machine Learning: introduces big data, Hadoop, HDFS, MapReduce, Spark, and machine learning.
Chapter 2 Installing VirtualBox: hands-on. Install the VirtualBox virtual machine software so that multiple Linux virtual machines can be installed on a Windows system.
Chapter 3 Installing Ubuntu Linux: hands-on. Install the Ubuntu Linux operating system.
Chapter 4 Installing a Hadoop Single Node Cluster: hands-on. Install a Hadoop Single Node Cluster on one machine.
Chapter 5 Installing a Hadoop Multi Node Cluster: hands-on. Install a Hadoop Multi Node Cluster across multiple machines.
Chapter 6 Hadoop HDFS Commands: hands-on. Demonstrates HDFS commands.
Chapter 7 Hadoop MapReduce: introduces the principles of Hadoop MapReduce. The WordCount.java sample program demonstrates using Hadoop MapReduce to count how often each word appears in a text.

Chapter 8 Installing and Introducing Spark: hands-on. Install Spark and run the spark-shell interface in different environments.

Chapter 9 Spark RDD: hands-on. Covers the basic operations of the RDD (Resilient Distributed Dataset), Spark's most fundamental feature.
Chapter 10 Spark Integrated Development Environment: hands-on. Install an integrated development environment (IDE). The WordCount.scala sample program demonstrates using MapReduce-style operations in Spark to count how often each word appears in a text.

Chapter 11 Building a Recommendation Engine: describes how to use Spark MLlib and the MovieLens movie dataset to build a recommendation engine. The Recommend.scala sample program demonstrates how to obtain the data, train a model, and make recommendations for a user or for a movie, building a movie recommendation system. The AlsEvalution.scala sample program demonstrates how to tune the recommendation engine's parameters and find the best combination of parameters.
Chapter 12 The StumbleUpon Dataset: the StumbleUpon dataset poses a binary classification problem. Based on the features of a web page, you predict whether the page will be short-lived or long-lasting.

Chapter 13 Decision Tree Binary Classification: RunDecisionTreeBinary.scala sample program. Demonstrates how to use decision tree binary classification to analyze the StumbleUpon dataset, predict which pages will be short-lived or long-lasting, and find the best combination of parameters to improve prediction accuracy.

Chapter 14 Logistic Regression Binary Classification: RunLogisticRegressionWithSGDBinary.scala sample program. Demonstrates how to use logistic regression binary classification to analyze the StumbleUpon dataset, predict which pages will be short-lived or long-lasting, and find the best combination of parameters to improve prediction accuracy.

Chapter 15 SVM Binary Classification: RunSVMWithSGDBinary.scala sample program. Demonstrates how to use SVM binary classification to analyze the StumbleUpon dataset, predict which pages will be short-lived or long-lasting, and find the best combination of parameters to improve prediction accuracy.
Chapter 16 Naive Bayes Binary Classification: RunNaiveBayesBinary.scala sample program. Demonstrates how to use Naive Bayes binary classification to analyze the StumbleUpon dataset, predict which pages will be short-lived or long-lasting, and find the best combination of parameters to improve prediction accuracy.

Chapter 17 Decision Tree Multiclass Classification: RunDecisionTreeMulti.scala sample program. Demonstrates how to use decision tree multiclass classification to analyze the Covtype dataset (forest cover types), predict the type of vegetation from the conditions of the land, and find the best combination of parameters to improve prediction accuracy.

Chapter 18 Decision Tree Regression Analysis: RunDecisionTreeRegression.scala sample program. Introduces decision tree regression analysis using the Bike Sharing dataset: from the day (for example, whether it is a holiday) and other conditions, predict the number of bikes rented per hour, and find the best combination of parameters to improve prediction accuracy.
Chapter 19 Data Visualization with Apache Zeppelin: hands-on. Install Zeppelin and use it with the ml-100k dataset to demonstrate data analysis with Spark SQL and data visualization.
Most people assume that learning big data requires an environment with many machines. In fact, with virtual machines you can practice building a Hadoop cluster and setting up a Spark development environment on your own computer. This book introduces the basic concepts and hands-on operation of Hadoop's MapReduce and HDFS, as well as the basic concepts of Spark's RDD and MapReduce-style processing.
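To make the MapReduce idea behind the word-count examples more concrete, here is a minimal sketch in plain Python. It is not the book's WordCount.java or WordCount.scala and uses neither the Hadoop nor the Spark API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical, chosen only to illustrate the three stages on a single machine.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on in-memory lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a text file on HDFS.
    sample = ["spark and hadoop", "hadoop stores data", "spark processes data"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In a real cluster the map and reduce steps run in parallel on different machines and the framework performs the shuffle. Here everything runs in one process, purely to show the data flow.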

Real-world big data analysis cases: MovieLens (movie recommendation engine), StumbleUpon (web page binary classification), CovType (forest cover type classification), and Bike Sharing (predictive analysis of Ubike-style bike rentals). Detailed example code covering a variety of machine learning algorithms demonstrates how to obtain the data, analyze it, build models, and make predictions, introducing Spark machine learning step by step.
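Several of the chapters listed above mention finding the best combination of parameters. One common way to do this is an exhaustive grid search, sketched below in plain Python. This is an assumption about the general approach, not the book's Scala code. The parameter names `max_depth` and `max_bins`, the candidate values, and the `toy_evaluate` scoring function are all hypothetical stand-ins. In practice the evaluation step would train a model and score it on validation data.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination; return the best one and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Hypothetical stand-in for 'train a model, then score it on validation data'."""
    return -abs(params["max_depth"] - 10) - abs(params["max_bins"] - 50) / 10


if __name__ == "__main__":
    # Hypothetical candidate values for two illustrative parameters.
    grid = {"max_depth": [5, 10, 20], "max_bins": [10, 50, 100]}
    best, score = grid_search(grid, toy_evaluate)
    print(best, score)
```

The cost grows with the product of the list lengths, so the number of candidate values per parameter is usually kept small.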

The editor has already compiled the "Hadoop + Spark big data" material for everyone.


Origin blog.csdn.net/Ppikaqiu/article/details/104718822