An Overview of Common Hadoop Questions

Recently, quite a few newcomers have been asking about common Hadoop problems, so I have put together a simple summary below and am sharing it here in the hope that it helps beginners.

1. Do companies today mainly use Hadoop 1.X or 2.X?

Currently, the big Internet companies such as Baidu, Tencent, and Alibaba are all based on Hadoop:

a. 1.X is used as the base version; of course, each company does secondary development to customize it for the needs of its own clusters.

b. 2.X has not been officially adopted inside Baidu, where 1.X is still the mainstay, but Baidu has developed the HCE system (Hadoop C++ Expand System) to address problems in 1.X.

Addendum: Hadoop 2.X is used by many other companies, such as Jingdong (JD.com).

2. I want to work in big data later; to what extent do I need to master algorithms, and do algorithms make up the major part of the work?


First of all, if you want to work in a big-data-related field, Hadoop is something you use as a tool, so the first thing you need to learn is how to use it; you do not need to dive into the details at the source-code level.

Next comes understanding of algorithms: you will often need to design distributed implementations of data-mining algorithms, but you still need to understand the algorithm itself first, such as the commonly used k-means clustering.
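To make that concrete, below is a minimal single-machine sketch of the k-means loop (assignment step plus centroid update) on 2-D points; the sample data, the fixed k = 2, and all names are illustrative assumptions, not taken from any particular library.

```java
import java.util.Arrays;

// Minimal k-means sketch: repeated assignment + centroid-update steps on 2-D points.
public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}, {8.5, 9.5}};
        double[][] centroids = {{1, 1}, {9, 9}}; // k = 2, initial guesses

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            int[] assign = new int[points.length];
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double dx = points[p][0] - centroids[c][0];
                    double dy = points[p][1] - centroids[c][1];
                    double d = dx * dx + dy * dy; // squared Euclidean distance
                    if (d < best) { best = d; assign[p] = c; }
                }
            }
            // Update step: move each centroid to the mean of its assigned points.
            double[][] sum = new double[centroids.length][2];
            int[] count = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                sum[assign[p]][0] += points[p][0];
                sum[assign[p]][1] += points[p][1];
                count[assign[p]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c][0] = sum[c][0] / count[c];
                    centroids[c][1] = sum[c][1] / count[c];
                }
            }
        }
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

In a distributed version, the assignment step parallelizes naturally (each node labels its own partition of the points) and the update step aggregates the partial sums and counts, which is exactly the shape of a MapReduce job.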

3. Spark and Storm are becoming more and more popular, and Google has also released Cloud Dataflow. Should my later Hadoop study focus mainly on HDFS and YARN? And will Hadoop programmers in the future mostly just wrap things up and provide interfaces for ordinary programmers to use, the way Cloudera and Google do?

Don't worry: Hadoop on the one hand and Spark and Storm on the other solve different problems, and none of them is simply better or worse, so you should still learn the mainstream versions of Hadoop. Compared with 1.X, the most important addition in 2.X is the YARN framework, which is easy to understand. If you do research and development on Hadoop itself, it is suggested you look at both; if you do Hadoop application development, looking at the mainstream 1.X is enough.

4. A beginner's question: big data processing software is installed on servers anyway, so how does it affect the programs themselves? And which part of an engineer's job do clusters and big data operations and maintenance belong to?

A traditional program can only run on a single machine, whereas big data processing programs are often written with a distributed programming framework such as Hadoop MapReduce and can only run on a Hadoop cluster platform.
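As an illustration, here is a sketch of the classic WordCount job, essentially Apache's own example, written against the org.apache.hadoop.mapreduce API; the class names and the input/output paths taken from the command line are illustrative choices.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would run this on a cluster with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>; the framework, not the programmer, takes care of splitting the input, scheduling tasks across nodes, and shuffling map output to the reducers.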

Operations and maintenance: responsible for ensuring the stability and reliability of the cluster and its machines.

Hadoop systems development: improving the performance of the Hadoop system itself and adding new features to it.

Big data application development: using Hadoop as a tool to implement massive-scale data processing and related requirements.

5. After a large file has been split into many small files, how can Hadoop be used to deal with these small files effectively? And how can the load on each node be kept as balanced as possible?

a. How can Hadoop be used to deal with these small files effectively?

Hadoop is very effective when processing large-scale data, but when it processes a large number of small files, the per-file system overhead becomes too large and efficiency drops. For this kind of problem, the small files can be packed into one large file, for example using the SequenceFile file format: take each file's signature (such as its name) as the key and the file's own contents as the value, and write one SequenceFile record per small file. In this way many small files can be merged into one large file through the SequenceFile format, with each of the original small files mapped to a record in the SequenceFile.
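A minimal sketch of such a packer is shown below; the use of a local input directory, BytesWritable values, and the Hadoop 1.X-style SequenceFile.createWriter call are my own assumptions for illustration.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack a local directory of small files into a single SequenceFile on HDFS:
// key = file name (Text), value = raw file contents (BytesWritable).
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[1]); // destination SequenceFile, e.g. /user/demo/packed.seq

        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
        try {
            File[] smallFiles = new File(args[0]).listFiles(); // args[0]: local input directory
            if (smallFiles == null) {
                throw new IllegalArgumentException(args[0] + " is not a readable directory");
            }
            for (File f : smallFiles) {
                if (!f.isFile()) continue;
                byte[] bytes = Files.readAllBytes(f.toPath());
                // One SequenceFile record per small file: name -> contents.
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}
```

Each small file can later be recovered by its key, and downstream MapReduce jobs read one large, splittable file instead of opening thousands of tiny ones.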

b. How can the load on each node be kept balanced?

Load balancing in a Hadoop cluster is critical. This situation often arises because user data is unevenly distributed while the number of compute slots really is evenly allocated across nodes; as a result, when a job runs, its non-local tasks cause a large amount of data transfer, and the cluster load becomes unbalanced. The key to solving this kind of imbalance is therefore to distribute the user data evenly, which can be done with Hadoop's built-in balancer script command (start-balancer.sh).

For imbalance caused by resource scheduling, the specific scheduler and its job allocation mechanism need to be considered.


Origin: blog.csdn.net/fdfsdrjku/article/details/92720380