MR Practice: Deduplicating uids

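Copyright notice: this is an original work. Reposting is permitted, provided that the original source, the author information, and this notice are indicated with a hyperlink; otherwise legal action will be pursued. http://blog.csdn.net/qq_34291777 https://blog.csdn.net/qq_34291777/article/details/75760890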

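In this exercise the map phase does the extraction: it emits each record's uid as the output key. Identical keys are grouped before they reach the reducer, so the reduce phase only has to write each distinct key (uid) to the output file.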
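The idea can be sketched without Hadoop. The snippet below is only a plain-Java illustration under stated assumptions, not how the framework is implemented: `ShuffleSketch` and its methods are hypothetical names, and a `TreeMap` stands in for the shuffle, which sorts the map output keys and groups identical ones before the reducer sees them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle and reduce steps; this is not Hadoop code.
final class ShuffleSketch {

    // "Shuffle": group identical keys together, sorted by key.
    // The Integer counts how many map outputs shared that key.
    static Map<String, Integer> group(List<String> mapOutputKeys) {
        Map<String, Integer> groups = new TreeMap<>();
        for (String key : mapOutputKeys) {
            groups.merge(key, 1, Integer::sum);
        }
        return groups;
    }

    // "Reduce": invoked once per group, it keeps the key and ignores the values.
    static List<String> reduce(Map<String, Integer> groups) {
        return new ArrayList<>(groups.keySet());
    }

    public static void main(String[] args) {
        // Made-up keys: "u2" is emitted twice by the map phase.
        List<String> mapOutputKeys = Arrays.asList("u2", "u1", "u2", "u3");
        System.out.println(reduce(group(mapOutputKeys)));
    }
}
```

Because the reducer runs once per group rather than once per record, writing only the key is enough to drop the duplicates.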

Data

20111230000005  57375476989eea12893c0c3811607bcf    奇艺高清    1   1   http://www.qiyi.com/
20111230000005  57375476989eea12893c0c3811607bcf    凡人修仙传   3   1   http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007  b97920521c78de70ac38e3713f524b50    本本联盟    1   1   http://www.bblianmeng.com/
20111230000008  6961d0c97fe93701fc9c0d861d096cd9    华南师范大学图书馆   1   1   http://lib.scnu.edu.cn/
20111230000008  f2f5a21c764aebde1e8afcc2871e086f    在线代理    2   1   http://proxyie.cn/
20111230000009  96994a0480e7e1edcaef67b20d8816b7    伟大导演    1   1   http://movie.douban.com/review/1128960/
20111230000009  698956eb07815439fe5f46e9a4503997    youku   1   1   http://www.youku.com/
20111230000009  599cd26984f72ee68b2b6ebefccf6aed    安徽合肥365房产网  1   1   http://hf.house365.com/
20111230000010  f577230df7b6c532837cd16ab731f874    哈萨克网址大全 1   1   http://www.kz321.com/
20111230000010  285f88780dd0659f5fc8acc7cc4949f2    IQ数码    1   1   http://www.iqshuma.com/

There are 10 records. The second column is the uid, and the first two rows share the same uid.
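A minimal plain-Java sketch of reading the uid column follows. It assumes the fields in the real file are separated by tab characters, as the `split("\t")` call in the mapper implies (the listing above renders them as spaces). `UidColumn` and `extractUid` are hypothetical names used only for illustration, and returning an `Optional` for short lines is a design choice of this sketch, not something the original job does.

```java
import java.util.Optional;

// Hypothetical helper; the name and the Optional return type are illustrative choices.
final class UidColumn {

    // Returns the second tab-separated field, or empty if the line has fewer than two fields.
    static Optional<String> extractUid(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2) {
            return Optional.empty();
        }
        return Optional.of(fields[1]);
    }

    public static void main(String[] args) {
        // One row of the sample data, rebuilt with tab separators.
        String row = "20111230000009\t698956eb07815439fe5f46e9a4503997\tyouku\t1\t1\thttp://www.youku.com/";
        System.out.println(extractUid(row).orElse("<malformed line>"));
    }
}
```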

Code

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UuidMain {

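    /**
     * Emits the uid (second tab-separated column) of every input line as the
     * map output key. The value is unused, so NullWritable stands in for it.
     */
    public static class UuidMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        // Reused across map() calls to avoid allocating a new Text per record.
        private final Text val = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line on tabs.
            String[] line = value.toString().split("\t");
            // Skip malformed lines that have no second column.
            if (line.length < 2) {
                return;
            }
            // The uid is the second column.
            String uid = line[1];
            val.set(uid);
            System.out.println("Mapper......");
            System.out.println("-------------uid :" + uid);
            context.write(val, NullWritable.get());
        }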

    }
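    /**
     * Called once per distinct uid, because identical keys are grouped
     * before the reduce phase. Writing the key once removes the duplicates.
     */
    public static class UuidReduce extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("Reduce.......");
            System.out.println("key:" + key);
            // Write the key once, no matter how many values were grouped under it.
            context.write(key, NullWritable.get());
        }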

    }
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        if (null == args || args.length != 2) {
            System.err.println("Usage: UuidMain <input path> <output path>");
            System.exit(1);
        }

        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(new Configuration(), "Uuid");

        // Class used to locate the job jar
        job.setJarByClass(UuidMain.class);
        // Mapper class
        job.setMapperClass(UuidMapper.class);
        // Reducer class
        job.setReducerClass(UuidReduce.class);

        // Output key/value types (the map output uses the same types here)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Input and output paths
        FileInputFormat.addInputPath(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

Console output

2017-07-22 16:38:40,404 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:<clinit>(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-22 16:38:41,701 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2017-07-22 16:38:41,712 INFO  [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2017-07-22 16:38:42,154 WARN  [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-07-22 16:38:42,159 WARN  [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(259)) - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2017-07-22 16:38:42,440 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 1
2017-07-22 16:38:42,583 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:1
2017-07-22 16:38:43,161 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_local1393818339_0001
2017-07-22 16:38:43,273 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/staging/zkpk1393818339/.staging/job_local1393818339_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-07-22 16:38:43,284 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/staging/zkpk1393818339/.staging/job_local1393818339_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-07-22 16:38:43,924 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/local/localRunner/zkpk/job_local1393818339_0001/job_local1393818339_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2017-07-22 16:38:43,982 WARN  [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-zkpk/mapred/local/localRunner/zkpk/job_local1393818339_0001/job_local1393818339_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2017-07-22 16:38:44,035 INFO  [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://localhost:8080/
2017-07-22 16:38:44,038 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_local1393818339_0001
2017-07-22 16:38:44,058 INFO  [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2017-07-22 16:38:44,093 INFO  [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2017-07-22 16:38:44,435 INFO  [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2017-07-22 16:38:44,441 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local1393818339_0001_m_000000_0
2017-07-22 16:38:44,535 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:initialize(587)) -  Using ResourceCalculatorProcessTree : [ ]
2017-07-22 16:38:44,540 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:runNewMapper(733)) - Processing split: hdfs://master:9000/user/wordcount/input/sogou.10.utf8:0+966
2017-07-22 16:38:44,578 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:createSortingCollector(388)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2017-07-22 16:38:44,842 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:setEquator(1182)) - (EQUATOR) 0 kvi 26214396(104857584)
2017-07-22 16:38:44,842 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(975)) - mapreduce.task.io.sort.mb: 100
2017-07-22 16:38:44,842 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(976)) - soft limit at 83886080
2017-07-22 16:38:44,842 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(977)) - bufstart = 0; bufvoid = 104857600
2017-07-22 16:38:44,842 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:init(978)) - kvstart = 26214396; length = 6553600
2017-07-22 16:38:45,042 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_local1393818339_0001 running in uber mode : false
2017-07-22 16:38:45,043 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) -  map 0% reduce 0%
2017-07-22 16:38:45,728 INFO  [LocalJobRunner Map Task Executor #0] input.LineRecordReader (LineRecordReader.java:skipUtfByteOrderMark(156)) - Found UTF-8 BOM and skipped it
Mapper......
-------------uid :57375476989eea12893c0c3811607bcf
Mapper......
-------------uid :57375476989eea12893c0c3811607bcf
Mapper......
-------------uid :b97920521c78de70ac38e3713f524b50
Mapper......
-------------uid :6961d0c97fe93701fc9c0d861d096cd9
Mapper......
-------------uid :f2f5a21c764aebde1e8afcc2871e086f
Mapper......
-------------uid :96994a0480e7e1edcaef67b20d8816b7
Mapper......
-------------uid :698956eb07815439fe5f46e9a4503997
Mapper......
-------------uid :599cd26984f72ee68b2b6ebefccf6aed
Mapper......
-------------uid :f577230df7b6c532837cd16ab731f874
Mapper......
-------------uid :285f88780dd0659f5fc8acc7cc4949f2
2017-07-22 16:38:45,732 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 
2017-07-22 16:38:46,078 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1437)) - Starting flush of map output
2017-07-22 16:38:46,079 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1455)) - Spilling map output
2017-07-22 16:38:46,079 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1456)) - bufstart = 0; bufend = 330; bufvoid = 104857600
2017-07-22 16:38:46,079 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:flush(1458)) - kvstart = 26214396(104857584); kvend = 26214360(104857440); length = 37/6553600
2017-07-22 16:38:46,191 INFO  [LocalJobRunner Map Task Executor #0] mapred.MapTask (MapTask.java:sortAndSpill(1641)) - Finished spill 0
2017-07-22 16:38:46,203 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:done(1001)) - Task:attempt_local1393818339_0001_m_000000_0 is done. And is in the process of committing
2017-07-22 16:38:46,223 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2017-07-22 16:38:46,224 INFO  [LocalJobRunner Map Task Executor #0] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local1393818339_0001_m_000000_0' done.
2017-07-22 16:38:46,224 INFO  [LocalJobRunner Map Task Executor #0] mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local1393818339_0001_m_000000_0
2017-07-22 16:38:46,224 INFO  [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2017-07-22 16:38:46,229 INFO  [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for reduce tasks
2017-07-22 16:38:46,229 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(302)) - Starting task: attempt_local1393818339_0001_r_000000_0
2017-07-22 16:38:46,241 INFO  [pool-6-thread-1] mapred.Task (Task.java:initialize(587)) -  Using ResourceCalculatorProcessTree : [ ]
2017-07-22 16:38:46,250 INFO  [pool-6-thread-1] mapred.ReduceTask (ReduceTask.java:run(362)) - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@10080e18
2017-07-22 16:38:46,274 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:<init>(193)) - MergerManager: memoryLimit=304244320, maxSingleShuffleLimit=76061080, mergeThreshold=200801264, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2017-07-22 16:38:46,283 INFO  [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(61)) - attempt_local1393818339_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2017-07-22 16:38:46,591 INFO  [localfetcher#1] reduce.LocalFetcher (LocalFetcher.java:copyMapOutput(140)) - localfetcher#1 about to shuffle output of map attempt_local1393818339_0001_m_000000_0 decomp: 352 len: 356 to MEMORY
2017-07-22 16:38:46,614 INFO  [localfetcher#1] reduce.InMemoryMapOutput (InMemoryMapOutput.java:shuffle(100)) - Read 352 bytes from map-output for attempt_local1393818339_0001_m_000000_0
2017-07-22 16:38:46,637 INFO  [localfetcher#1] reduce.MergeManagerImpl (MergeManagerImpl.java:closeInMemoryFile(307)) - closeInMemoryFile -> map-output of size: 352, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->352
2017-07-22 16:38:46,656 INFO  [EventFetcher for fetching Map Completion Events] reduce.EventFetcher (EventFetcher.java:run(76)) - EventFetcher is interrupted.. Returning
2017-07-22 16:38:46,657 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-07-22 16:38:46,658 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(667)) - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2017-07-22 16:38:46,665 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-07-22 16:38:46,666 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 317 bytes
2017-07-22 16:38:46,670 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(742)) - Merged 1 segments, 352 bytes to disk to satisfy reduce memory limit
2017-07-22 16:38:46,670 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(772)) - Merging 1 files, 356 bytes from disk
2017-07-22 16:38:46,678 INFO  [pool-6-thread-1] reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(787)) - Merging 0 segments, 0 bytes from memory into reduce
2017-07-22 16:38:46,678 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(591)) - Merging 1 sorted segments
2017-07-22 16:38:46,680 INFO  [pool-6-thread-1] mapred.Merger (Merger.java:merge(690)) - Down to the last merge-pass, with 1 segments left of total size: 317 bytes
2017-07-22 16:38:46,681 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-07-22 16:38:46,775 INFO  [pool-6-thread-1] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
Reduce.......
key:285f88780dd0659f5fc8acc7cc4949f2
Reduce.......
key:57375476989eea12893c0c3811607bcf
Reduce.......
key:599cd26984f72ee68b2b6ebefccf6aed
Reduce.......
key:6961d0c97fe93701fc9c0d861d096cd9
Reduce.......
key:698956eb07815439fe5f46e9a4503997
Reduce.......
key:96994a0480e7e1edcaef67b20d8816b7
Reduce.......
key:b97920521c78de70ac38e3713f524b50
Reduce.......
key:f2f5a21c764aebde1e8afcc2871e086f
Reduce.......
key:f577230df7b6c532837cd16ab731f874
2017-07-22 16:38:47,078 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) -  map 100% reduce 0%
2017-07-22 16:38:47,319 INFO  [pool-6-thread-1] mapred.Task (Task.java:done(1001)) - Task:attempt_local1393818339_0001_r_000000_0 is done. And is in the process of committing
2017-07-22 16:38:47,323 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2017-07-22 16:38:47,323 INFO  [pool-6-thread-1] mapred.Task (Task.java:commit(1162)) - Task attempt_local1393818339_0001_r_000000_0 is allowed to commit now
2017-07-22 16:38:47,375 INFO  [pool-6-thread-1] output.FileOutputCommitter (FileOutputCommitter.java:commitTask(439)) - Saved output of task 'attempt_local1393818339_0001_r_000000_0' to hdfs://master:9000/user/wordcount/output1/_temporary/0/task_local1393818339_0001_r_000000
2017-07-22 16:38:47,379 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - reduce > reduce
2017-07-22 16:38:47,379 INFO  [pool-6-thread-1] mapred.Task (Task.java:sendDone(1121)) - Task 'attempt_local1393818339_0001_r_000000_0' done.
2017-07-22 16:38:47,380 INFO  [pool-6-thread-1] mapred.LocalJobRunner (LocalJobRunner.java:run(325)) - Finishing task: attempt_local1393818339_0001_r_000000_0
2017-07-22 16:38:47,380 INFO  [Thread-12] mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - reduce task executor complete.
2017-07-22 16:38:48,079 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) -  map 100% reduce 100%
2017-07-22 16:38:48,080 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1373)) - Job job_local1393818339_0001 completed successfully
2017-07-22 16:38:48,201 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 38
    File System Counters
        FILE: Number of bytes read=1086
        FILE: Number of bytes written=458862
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1932
        HDFS: Number of bytes written=297
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Map output bytes=330
        Map output materialized bytes=356
        Input split bytes=118
        Combine input records=0
        Combine output records=0
        Reduce input groups=9
        Reduce shuffle bytes=356
        Reduce input records=10
        Reduce output records=9
        Spilled Records=20
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=394264576
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=966
    File Output Format Counters 
        Bytes Written=297

The resulting output file, part-r-00000

285f88780dd0659f5fc8acc7cc4949f2
57375476989eea12893c0c3811607bcf
599cd26984f72ee68b2b6ebefccf6aed
6961d0c97fe93701fc9c0d861d096cd9
698956eb07815439fe5f46e9a4503997
96994a0480e7e1edcaef67b20d8816b7
b97920521c78de70ac38e3713f524b50
f2f5a21c764aebde1e8afcc2871e086f
f577230df7b6c532837cd16ab731f874

As the file shows, no uid appears more than once: the 10 input records have been reduced to 9 distinct uids.
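The counters in the console output agree with this result, and the byte counts can be reproduced by hand as a rough cross-check. The sketch below rests on three assumptions: each uid is 32 ASCII characters, the output file holds one uid plus a newline per line, and a `Text` key this short is serialized as a one-byte length prefix followed by its bytes while `NullWritable` adds nothing. `CounterCheck` is a hypothetical name.

```java
// Hypothetical cross-check of the job counters; the byte layouts are the assumptions stated above.
final class CounterCheck {
    static final int UID_LENGTH = 32; // hex characters per uid in the sample data

    // Bytes in the output file: each distinct uid plus one newline.
    static int outputFileBytes(int distinctUids) {
        return distinctUids * (UID_LENGTH + 1);
    }

    // Map output bytes: one length byte plus the uid bytes per record, with an empty value.
    static int mapOutputBytes(int records) {
        return records * (1 + UID_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(outputFileBytes(9));  // compare with "Bytes Written=297"
        System.out.println(mapOutputBytes(10));  // compare with "Map output bytes=330"
    }
}
```

The record counters tell the same story: "Reduce input records=10" against "Reduce input groups=9" and "Reduce output records=9" means exactly one duplicate was collapsed.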

Reposted from blog.csdn.net/qq_34291777/article/details/75760890