Big Data Engineer Interview Questions, Part 2

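2.7. Implement the following requirement with MapReduce.
There are 10 folders, each containing 1,000,000 URLs. Find the top 1,000,000 URLs.
Method 1:
Use two jobs. The first job reads the 10 folders directly (through FileSystem) as map input, emits each URL as the key, and sums the occurrences of each URL in the reducer.
The second job's map also uses the URL as the key, applies a secondary sort on the sum, and takes the top 1,000,000 in the reducer.
1: First run a wordcount-style count.
2: Then run the secondary sort.
The code for launching the two jobs is as follows: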
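import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// FirstMapper, FirstReduce, SecondSortMapper, SecondSortReduce and
// TextIntComparator are not shown in this listing.
public class TopUrl {
    public static void main(String[] args) throws Exception {
        depending();
    }

    public static void depending() throws Exception {
        Configuration conf = new Configuration();
        // Job 2: sort the URLs by count
        Job job2 = initJob2(conf, SecondSortMapper.class, SecondSortReduce.class);
        // Job 1: read the URLs and count them
        Job job1 = initJob(conf, FirstMapper.class, FirstReduce.class);

        JobControl jobControl = new JobControl("groupName");
        // Counting job
        ControlledJob controlledJob1 = new ControlledJob(conf);
        controlledJob1.setJob(job1);

        // Sorting job; it starts only after the counting job has succeeded
        ControlledJob controlledJob2 = new ControlledJob(conf);
        controlledJob2.setJob(job2);
        controlledJob2.addDependingJob(controlledJob1);
        jobControl.addJob(controlledJob2);
        jobControl.addJob(controlledJob1);

        Thread jcThread = new Thread(jobControl);
        jcThread.start();
        // Poll with a short sleep instead of spinning in a busy loop
        while (!jobControl.allFinished() && jobControl.getFailedJobList().isEmpty()) {
            Thread.sleep(500);
        }
        System.out.println("Succeeded: " + jobControl.getSuccessfulJobList());
        System.out.println("Failed: " + jobControl.getFailedJobList());
        jobControl.stop();

        // Mark job 1's intermediate output for deletion when the FileSystem is closed
        FileSystem fs = FileSystem.get(conf);
        boolean ret = fs.deleteOnExit(new Path("hdfs://master:9000/user/hadoop/20130601/output"));
        System.out.println(ret);
    }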

    // Job 2: reads job 1's output and sorts it with TextIntComparator
    public static Job initJob2(Configuration conf, Class o1, Class o2) throws Exception {
        Job job = Job.getInstance(conf, "Join2");
        job.setJarByClass(TopUrl.class);
        job.setMapperClass(o1);
        job.setMapOutputKeyClass(Text.class);          // map output key
        job.setMapOutputValueClass(IntWritable.class); // map output value
        job.setReducerClass(o2);
        job.setOutputKeyClass(Text.class);             // reduce output key
        job.setOutputValueClass(IntWritable.class);    // reduce output value

        // A global top-N needs every record to reach the same reducer
        job.setNumReduceTasks(1);
        // job.setPartitionerClass(cls);
        job.setSortComparatorClass(TextIntComparator.class);
        // job.setGroupingComparatorClass(cls);

        FileInputFormat.addInputPath(job, new Path("hdfs://master:9000/user/hadoop/20130601/output"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://master:9000/user/hadoop/20130601/output2"));
        return job;
    }

    // Job 1: counts how often each URL occurs
    public static Job initJob(Configuration conf, Class o1, Class o2) throws Exception {
        Job job = Job.getInstance(conf, "Join1");
        job.setJarByClass(TopUrl.class);
        job.setMapperClass(o1);
        job.setMapOutputKeyClass(Text.class);          // map output key
        job.setMapOutputValueClass(IntWritable.class); // map output value
        job.setReducerClass(o2);
        job.setOutputKeyClass(Text.class);             // reduce output key
        job.setOutputValueClass(IntWritable.class);    // reduce output value
        FileInputFormat.addInputPath(job, new Path("hdfs://master:9000/user/hadoop/20130601/ippaixv.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://master:9000/user/hadoop/20130601/output"));
        return job;
    }
}
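
The driver above only wires the two jobs together; the mapper, reducer and comparator classes are not shown. As a rough illustration of the data flow they are meant to implement (count first, then sort by count and cut off at N), here is a minimal in-memory sketch in plain Java. It is not Hadoop code, and the class and method names (TopUrlSketch, countUrls, topN) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the two-job data flow (plain Java, not Hadoop code).
class TopUrlSketch {

    // Phase 1 (the "wordcount" job): group identical URLs and sum their occurrences.
    static Map<String, Integer> countUrls(List<String> urls) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : urls) {
            counts.merge(url, 1, Integer::sum);
        }
        return counts;
    }

    // Phase 2 (the sort job): order by count descending and keep the first n.
    // Ties are broken by URL so that the result is deterministic.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a.com", "b.com", "a.com", "c.com", "b.com", "a.com");
        System.out.println(topN(countUrls(urls), 2));
    }
}
```

In the real jobs the grouping is done by the shuffle and the ordering by the sort comparator, so neither phase has to hold all URLs in one JVM as this sketch does.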

Method 2:
Create a Hive table A partitioned by channel, with each folder loaded as one partition.
select x.url,x.c from(select url,count(1) as c from A where channel ='' group by url) x order by x.c desc limit 1000000;

You can also use a TreeMap: once it holds 1,000,000 entries, add each new entry as it arrives and then remove the smallest one.
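
A minimal sketch of that bounded top-N idea follows. One caveat: a plain TreeMap keyed by count would overwrite URLs that share the same count, so this sketch swaps in a TreeSet ordered by (count, URL) instead. It also assumes each URL is offered once with its final count, for example from the output of the counting job. The class name BoundedTopN and its methods are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Keeps only the `capacity` entries with the largest counts.
class BoundedTopN {
    private final int capacity;

    // Ordered by count ascending, then by URL, so first() is always the current minimum.
    private final TreeSet<Map.Entry<String, Long>> entries = new TreeSet<>((a, b) -> {
        int byCount = Long.compare(a.getValue(), b.getValue());
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });

    BoundedTopN(int capacity) {
        this.capacity = capacity;
    }

    // Add one (url, count) pair; once over capacity, evict the smallest entry.
    void offer(String url, long count) {
        entries.add(new SimpleEntry<>(url, count));
        if (entries.size() > capacity) {
            entries.pollFirst();
        }
    }

    // Current contents, largest count first.
    List<Map.Entry<String, Long>> descending() {
        return new ArrayList<>(entries.descendingSet());
    }
}
```

Each offer costs O(log N) and memory stays bounded by the capacity, which is why this fits in a single reducer even when the input is much larger than N.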

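2.8. What does a Combiner do in Hadoop?
A combiner is written as a Reducer implementation but runs on the map side, pre-aggregating each map task's output to reduce the amount of data sent to the reducers.
Its purpose is purely optimization.
However, a combiner only applies when its input and output key/value types both match the map output types, and when the operation (such as sum or max) gives the same result whether it is applied zero, one or several times.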
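
To make the effect concrete, here is a small plain-Java simulation of a word count with a map-side combine step. It is only a sketch of the idea, not Hadoop API code; CombinerSketch and its methods are invented for this example, and each list stands in for one input split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates map-side pre-aggregation (plain Java, not Hadoop code).
class CombinerSketch {

    // Map + combine: conceptually the mapper emits (word, 1) for every word;
    // the combiner sums those records per key before anything is shuffled.
    static Map<String, Integer> combine(List<String> split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : split) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce: sum the partial counts coming from all map tasks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> split1 = Arrays.asList("a", "b", "a", "a");
        List<String> split2 = Arrays.asList("b", "b", "c");

        List<Map<String, Integer>> partials = new ArrayList<>();
        partials.add(combine(split1));
        partials.add(combine(split2));

        int withoutCombiner = split1.size() + split2.size();
        int withCombiner = partials.get(0).size() + partials.get(1).size();
        System.out.println("records shuffled without combiner: " + withoutCombiner);
        System.out.println("records shuffled with combiner: " + withCombiner);
        System.out.println(reduce(partials));
    }
}
```

Summation works here because adding partial sums gives the same total as adding the raw ones. An average computed the same way would be wrong, which is why not every reducer can double as a combiner.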

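2.9. Briefly describe how to install Hadoop
1) Create a hadoop user
2) Set the IP address and add the hostnames to the hosts file
3) Install SSH and configure passwordless (key-based) login
4) Install Java and configure the Java environment variables
5) Unpack Hadoop
6) Configure core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml under conf
7) Configure the Hadoop environment variables
8) hadoop namenode -format
9) start-all.sh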

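2.10. List the Hadoop process names
NameNode: manages the cluster and records which DataNodes hold each file's data.
SecondaryNameNode: can serve as a cold backup; it takes snapshot-style backups of the metadata at intervals.
DataNode: stores the data.
ResourceManager: manages jobs and hands tasks over to the MRAppMaster.
NodeManager: executes the tasks.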

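2.11. Resolve the error below
1. A permission problem: the cluster may have been started as root at some point. (For example, on a cluster set up by the hadoop user, the directory is tmp/hadoop-hadoop/.....)
2. The folder may not exist.
3. Fix: delete that file under tmp, or change its owner to the current user.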

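2.12. Write the following commands
1) Kill a job
hadoop job -list to get the job-id,
hadoop job -kill job-id
2) Delete the /tmp/aaa directory on HDFS
hadoop fs -rmr /tmp/aaa (deprecated in Hadoop 2; the current form is hadoop fs -rm -r /tmp/aaa)
3) Commands for refreshing the cluster state when adding a new storage node and removing a compute node
When adding a new node:
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager
When removing a node:
hadoop mradmin -refreshNodes (on a YARN cluster the equivalent is yarn rmadmin -refreshNodes)
hadoop dfsadmin -refreshNodes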
