Hadoop Streaming

[hadoop@master test]$ hadoop jar /home/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -info
Warning: $HADOOP_HOME is deprecated.

14/12/15 14:06:32 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.
-verbose

Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

In -input: globbing on <path> is supported and can have multiple -input
Default Map input format: a line is a record in UTF-8
the key part ends at first TAB, the rest of the line is the value
Custom input format: -inputformat package.MyInputFormat
Map output format, reduce input/output format:
Format defined by what the mapper command outputs. Line-oriented

The files named in the -file argument[s] end up in the
working directory when the mapper and reducer are run.
The location of this working directory is unspecified.

To set the number of reduce tasks (num. of output files):
-D mapred.reduce.tasks=10
To skip the sort/combine/shuffle/sort/reduce step:
Use -numReduceTasks 0
A Task's Map output then becomes a 'side-effect output' rather than a reduce input
This speeds up processing, This also feels more like "in-place" processing
because the input filename and the map input order are preserved
This equivalent -reducer NONE

To speed up the last maps:
-D mapred.map.tasks.speculative.execution=true
To speed up the last reduces:
-D mapred.reduce.tasks.speculative.execution=true
To name the job (appears in the JobTracker Web UI):
-D mapred.job.name='My Job'
To change the local temp directory:
-D dfs.data.dir=/tmp/dfs
-D stream.tmpdir=/tmp/streaming
Additional local temp directories with -cluster local:
-D mapred.local.dir=/tmp/local
-D mapred.system.dir=/tmp/system
-D mapred.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
-D stream.non.zero.exit.is.failure=false
Use a custom hadoopStreaming build along a standard hadoop install:
$HADOOP_HOME/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
[...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
http://wiki.apache.org/hadoop/JobConfFile
To set an environement variable in a streaming command:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/

Shortcut:
setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar"

Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
-file /local/filter.pl -input "/logs/0604*/*" [...]
Ships a script, invokes the non-shipped perl interpreter
Shipped files go to the working directory so filter.pl is found by perl
Input files are all the daily logs for days in month 2006-04

Streaming Command Failed!
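
The help text above says that, by default, the key part of a record ends at the first TAB and the rest of the line is the value. A minimal sketch of that rule in plain shell, for checking how a given line would be split. The name split_kv is a hypothetical helper, not part of Hadoop, and the no-TAB case (whole line as key, empty value) follows the Hadoop Streaming documentation rather than the help text itself.

```shell
#!/bin/sh
# A literal TAB character, built portably.
TAB=$(printf '\t')

# split_kv (hypothetical helper): split one record at the FIRST tab.
# Sets the shell variables "key" and "value".
split_kv() {
  case $1 in
    *"$TAB"*)
      key=${1%%"$TAB"*}    # text before the first tab
      value=${1#*"$TAB"}   # everything after the first tab (may hold more tabs)
      ;;
    *)
      key=$1               # no tab: the whole line is the key
      value=
      ;;
  esac
}

split_kv "user42${TAB}GET${TAB}/index.html"
printf 'key=[%s] value=[%s]\n' "$key" "$value"
```

Only the first TAB matters, so here the value is "GET", a TAB, then "/index.html".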

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[hadoop@master test]$ hadoop jar /home/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -Dmapred.reduce.tasks=1 -input /test/t1* -output /testoutput -mapper cat -reducer cat

Warning: $HADOOP_HOME is deprecated.

packageJobJar: [/jeep/hadoop/tmp/hadoop-unjar3744254068843470100/] [] /tmp/streamjob4491453334356355069.jar tmpDir=null
14/12/15 12:27:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/15 12:27:07 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/15 12:27:07 INFO mapred.FileInputFormat: Total input paths to process : 1
14/12/15 12:27:07 INFO streaming.StreamJob: getLocalDirs(): [/jeep/hadoop/tmp/mapred/local]
14/12/15 12:27:07 INFO streaming.StreamJob: Running job: job_201412150933_0007
14/12/15 12:27:07 INFO streaming.StreamJob: To kill this job, run:
14/12/15 12:27:07 INFO streaming.StreamJob: /home/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -Dmapred.job.tracker=master:9001 -kill job_201412150933_0007
14/12/15 12:27:07 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201412150933_0007
14/12/15 12:27:08 INFO streaming.StreamJob: map 0% reduce 0%
14/12/15 12:27:13 INFO streaming.StreamJob: map 50% reduce 0%
14/12/15 12:27:14 INFO streaming.StreamJob: map 100% reduce 0%
14/12/15 12:27:21 INFO streaming.StreamJob: map 100% reduce 33%
14/12/15 12:27:22 INFO streaming.StreamJob: map 100% reduce 100%
14/12/15 12:27:24 INFO streaming.StreamJob: Job complete: job_201412150933_0007
14/12/15 12:27:24 INFO streaming.StreamJob: Output: /testoutput
[hadoop@master test]$ hadoop dfs -cat /testoutput/part-00000
Warning: $HADOOP_HOME is deprecated.
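
A job with -mapper cat -reducer cat and one reducer can be roughly previewed on a small sample without a cluster by piping the data through sort. This is only a local approximation under stated assumptions: simulate_identity_job is a hypothetical name, LC_ALL=C byte ordering stands in for Hadoop's key comparison, and the -s flag (supported by GNU and BSD sort) keeps input order within a key, which a real job does not guarantee.

```shell
#!/bin/sh
# simulate_identity_job (hypothetical helper): read records on stdin and
# mimic map (cat) -> sort by key -> reduce (cat) for a single reducer.
# The key is the text before the first tab, so sort on field 1 only.
simulate_identity_job() {
  cat |
    LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 |
    cat
}

# Three sample records in "key<TAB>value" form.
printf 'b\t2\na\t1\nb\t1\n' | simulate_identity_job
```

For the sample records, the "a" record comes out first, followed by the two "b" records in their original order.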

To keep everything on the grid, use Hadoop Streaming with a single reducer and cat as both the mapper and the reducer (effectively a no-op). Compression can be added with MapReduce flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat
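
The command above depends on four variables that the snippet never sets. A small dry-run sketch that refuses to continue when one is missing and otherwise just prints the command for review. build_identity_cmd is a hypothetical helper, and the values in the usage lines (/opt/hadoop, default) are placeholders, not taken from the original post.

```shell
#!/bin/sh
# build_identity_cmd (hypothetical helper): print, but do not run, the
# streaming command. ${VAR:?msg} aborts with msg if VAR is unset or empty.
build_identity_cmd() {
  : "${HADOOP_PREFIX:?set HADOOP_PREFIX to the Hadoop install directory}"
  : "${QUEUE:?set QUEUE to the job queue name}"
  : "${INPUT:?set INPUT to the DFS input path or glob}"
  : "${OUTPUT:?set OUTPUT to a DFS directory that does not exist yet}"

  # Generic -D options come first, streaming options after them.
  printf '%s\n' \
    "hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar" \
    "  -Dmapred.reduce.tasks=1" \
    "  -Dmapred.job.queue.name=$QUEUE" \
    "  -input $INPUT" \
    "  -output $OUTPUT" \
    "  -mapper cat" \
    "  -reducer cat"
}

# Placeholder values for illustration only.
HADOOP_PREFIX=/opt/hadoop
QUEUE=default
INPUT='/test/t1*'
OUTPUT=/testoutput
build_identity_cmd
```

Printing first makes it easy to confirm the glob and the output directory before submitting; Hadoop rejects a job whose output directory already exists.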

If you want compression, add these next to the other -D options (generic options go before -input):
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
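
The two compression flags always travel together, so they can be generated from a short codec name. A sketch under stated assumptions: compression_opts is a hypothetical helper, and the short names gzip and bzip2 are my own labels for the Hadoop classes GzipCodec and BZip2Codec. Per the general syntax shown in the help output (command [genericOptions] [commandOptions]), the printed flags belong before -input.

```shell
#!/bin/sh
# compression_opts (hypothetical helper): print the pair of -D flags that
# turn on output compression for the given codec short name.
compression_opts() {
  case $1 in
    gzip)  codec=org.apache.hadoop.io.compress.GzipCodec ;;
    bzip2) codec=org.apache.hadoop.io.compress.BZip2Codec ;;
    *)
      echo "unsupported codec: $1" >&2
      return 1
      ;;
  esac
  printf '%s\n' \
    "-Dmapred.output.compress=true" \
    "-Dmapred.output.compression.codec=$codec"
}

compression_opts gzip
```

With gzip the reducer output file gets a .gz suffix; hadoop fs -text is the usual way to read such a file back as plain text.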
