[Interview Question] From Shell Scripts to MapReduce

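Problem: compute a website's daily, weekly, and monthly activity (the number of distinct active users in each period).
The data format is as follows:

[time][user ID][operation name][other parameters]...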

Shell

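When the data volume is small, we can compute the statistics directly with Shell or Python.
For one day's access log file, "access.log", the daily activity can be computed in the shell: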

cat access.log | awk '{print $2}' | sort | uniq | wc -l

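This command uses awk to extract the second column (the user ID) from access.log, sorts it with sort, removes duplicates with uniq, and finally counts the lines with wc. The resulting line count is the daily activity.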
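The one-liner above covers a single day's file. The same distinct-count idea extends to weekly and monthly activity by grouping user IDs per period. Below is a minimal Python sketch, not a definitive implementation. It assumes the first field is an ISO 8601 timestamp without spaces (for example 2024-01-15T10:00:00), which the original format does not specify; the helper names period_key and active_users are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def period_key(ts, period):
    """Map a timestamp to the day, ISO week, or month it falls in."""
    if period == "day":
        return ts.strftime("%Y-%m-%d")
    if period == "week":
        year, week, _ = ts.isocalendar()
        return f"{year}-W{week:02d}"
    if period == "month":
        return ts.strftime("%Y-%m")
    raise ValueError(f"unknown period: {period}")


def active_users(lines, period):
    """Return {period key: number of distinct user IDs}."""
    users = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: field 1 is an ISO 8601 timestamp without spaces.
        ts = datetime.fromisoformat(fields[0])
        users[period_key(ts, period)].add(fields[1])
    return {key: len(ids) for key, ids in users.items()}
```

Calling active_users(open("access.log", encoding="utf-8"), "week") would give one count per ISO week. Keeping a set per period is fine for small logs, which is the case this section targets.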
In the same way, we can compute the union of two files a and b and write it to a_b.union:

cat a b | sort | uniq > a_b.union

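Compute the intersection (uniq -d prints only lines that are repeated on adjacent lines, so a and b must each be free of internal duplicates):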

cat a b | sort | uniq -d > a_b.intersect

Compute the difference (a-b):

cat a_b.union b | sort | uniq -u >a_b.diff

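Here uniq -u prints only the lines that are not repeated on adjacent lines. Every line of b appears twice in the combined input (once from a_b.union and once from b), so only the lines unique to a survive.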
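The uniq -d and uniq -u tricks above only behave like set operations when each input file has no internal duplicates. The following Python sketch emulates the pipelines with Counter to make that visible; the function names uniq_d and uniq_u and the sample values are hypothetical.

```python
from collections import Counter


def uniq_d(lines):
    """Emulate `sort | uniq -d`: values that occur more than once."""
    counts = Counter(lines)
    return sorted(value for value, n in counts.items() if n > 1)


def uniq_u(lines):
    """Emulate `sort | uniq -u`: values that occur exactly once."""
    counts = Counter(lines)
    return sorted(value for value, n in counts.items() if n == 1)


a = ["u1", "u2"]
b = ["u2", "u3"]

union = sorted(set(a + b))        # like: cat a b | sort | uniq
intersection = uniq_d(a + b)      # like: cat a b | sort | uniq -d
difference = uniq_u(union + b)    # like: cat a_b.union b | sort | uniq -u

# If a repeats a line, uniq -d reports it even though b never contains it.
false_hit = uniq_d(["u1", "u1"] + b)
```

In the last line, u1 is reported as part of the "intersection" although it never appears in b, which is why the inputs should be deduplicated first.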

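To compute the user retention rate, first deduplicate the users for each of the two days, then count the users in the intersection of day 1 and day 2. Dividing that count by the number of deduplicated users on day 1 gives the retention rate.
The following command counts the users common to both days (each file is assumed to contain every user at most once):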

cat uniq_day_1.log uniq_day_2.log | awk '{print $2}' | sort | uniq -d | wc -l

Python

We can also process the user data with a Python script to compute the retention rate:

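# coding=utf-8

def load_users(path):
    # Collect the distinct user IDs (second whitespace-separated field) from a log file.
    users = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or malformed lines
                users.add(fields[1])
    return users

def count():
    set_1 = load_users('access1.log')
    set_2 = load_users('access2.log')

    # Retention rate = users seen on both days / users seen on day 1.
    tmp = set_1.intersection(set_2)
    if not set_1:
        print(0.0)  # no day-1 users, so avoid dividing by zero
        return
    print(len(tmp) / len(set_1))

if __name__ == '__main__':
    count()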

Python's set type makes deduplication easy. For the various set operations, see "Python set operations".
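As a small illustration of those set operations applied to two days of users, here is a sketch with made-up user IDs; the variable names are hypothetical.

```python
day_1 = {"u1", "u2", "u3"}
day_2 = {"u2", "u3", "u4"}

retained = day_1 & day_2      # active on both days
churned = day_1 - day_2       # active on day 1 only
new_users = day_2 - day_1     # active on day 2 only
either_day = day_1 | day_2    # active on at least one day

# Retention rate: share of day-1 users who came back on day 2.
retention_rate = len(retained) / len(day_1)
```

The operators &, - and | correspond to intersection, difference, and union, so the shell pipelines from the previous section each collapse into a single expression.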

MapReduce

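When the data volume is massive, we turn to big-data computation frameworks such as Hadoop or Spark to solve this problem. MapReduce is Hadoop's computation framework; its flow is shown in the figure below:
[Figure: MapReduce processing flow]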
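Building on the retention-rate example above, we can design a solution in the MapReduce style:
1. Map stage: read in the user data, which is then sorted and partitioned. The input to Map is a line of text; in the code below, each line is expected to hold two space-separated fields, which the mapper emits as a key-value pair.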

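import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperRate extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable ikey, Text ivalue, Context context) throws IOException, InterruptedException {
        // Each input line is expected to hold exactly two space-separated fields.
        String[] line = ivalue.toString().split(" ");
        if (line.length == 2) {
            // Emit the first field as the key and the second as the value.
            context.write(new Text(line[0]), new Text(line[1]));
        }
    }

}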

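import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceRate extends Reducer<Text, Text, Text, Text> {

    // Running count of matching keys; local to this reducer task, not global across tasks.
    private long num = 0;

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // A key qualifies when it carries at least two distinct values.
        boolean label = false;
        String first = null;
        for (Text value : values) {
            // Hadoop reuses the same Text object across iterations, so copy it to a String before comparing.
            String current = value.toString();
            if (first == null) {
                first = current;
            } else if (!first.equals(current)) {
                label = true;
                break;
            }
        }
        if (label) {
            // reduce() is called sequentially within a task, so no lock is needed here.
            num++;
            context.write(new Text(String.valueOf(num)), _key);
        }
    }

}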
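
To make the data flow through the two classes above easier to follow, the map, shuffle, and reduce steps can be simulated locally in plain Python. This is only a sketch of the idea, not Hadoop code. It assumes each input line has the form "userID dayLabel" (the source does not state what the two fields are), and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit (key, value) for every line with exactly two space-separated fields."""
    for line in lines:
        fields = line.strip().split(" ")
        if len(fields) == 2:
            yield fields[0], fields[1]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Keep the keys that carry at least two distinct values."""
    return sorted(key for key, values in groups.items() if len(set(values)) > 1)


def retained_users(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Under the stated assumption, a user who appears with both day labels has two distinct values after the shuffle, so the reduce step keeps exactly the users present on both days. Dividing that count by the number of day-1 users would then give the retention rate.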