A Hadoop Context write problem (job fails after reduce reaches 100%)

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/zpf336/article/details/79480179

Scenario:

        In one reducer I used both context.write and multipleOutputs.write (a minimal sketch of this setup appears after the log below),

        and the job printed the following:

2018-03-07 17:45:39,425 INFO [submiter1] org.apache.hadoop.mapreduce.Job:  map 100% reduce 98%
2018-03-07 17:45:44,449 INFO [submiter1] org.apache.hadoop.mapreduce.Job:  map 100% reduce 99%
2018-03-07 17:46:30,637 INFO [submiter1] org.apache.hadoop.mapreduce.Job:  map 100% reduce 100%
2018-03-07 17:55:05,124 INFO [submiter1] org.apache.hadoop.mapreduce.Job: Task Id : attempt_1519244607863_106078_r_000088_0, Status : FAILED
2018-03-07 17:55:06,153 INFO [submiter1] org.apache.hadoop.mapreduce.Job:  map 100% reduce 99%
2018-03-07 17:55:14,359 INFO [submiter1] org.apache.hadoop.mapreduce.Job: Task Id : attempt_1519244607863_106078_r_000088_1, Status : FAILED
2018-03-07 17:55:20,400 INFO [submiter1] org.apache.hadoop.mapreduce.Job: Task Id : attempt_1519244607863_106078_r_000088_2, Status : FAILED
2018-03-07 17:55:36,752 INFO [submiter1] org.apache.hadoop.mapreduce.Job:  map 100% reduce 100%
2018-03-07 17:55:37,794 INFO [submiter1] org.apache.hadoop.mapreduce.Job: Job job_1519244607863_106078 failed with state FAILED due to: Task failed task_1519244607863_106078_r_000088
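
        For reference, here is a minimal sketch (not the author's actual code) of the setup described above: a reducer that writes to the default output through context.write and to a side output through MultipleOutputs. The class name, field names, types, and base path are hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ExampleReducer extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // main output: ends up as part-r-NNNNN under the job output directory
            context.write(NullWritable.get(), value);
            // side output: written by MultipleOutputs under the given base path
            multipleOutputs.write(NullWritable.get(), value, "side/" + key.toString());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // MultipleOutputs keeps its own writers and must be closed explicitly
        multipleOutputs.close();
    }
}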

        First, the job stalled for a long time after the reduce reached 100%; eventually the task timed out and failed, the retries failed as well, and the job finally failed.

        Next, I looked at the logs of the failed reduce task:

stdout:

Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary is on
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Validation is off
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
Mar 7, 2018 5:23:06 PM INFO: parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary is on
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Validation is off
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
Mar 7, 2018 5:23:07 PM INFO: parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
        The last line here ("allocated memory: 0") made me believe for a long time that the task was running out of memory: before the failure I had seen that the _temporary directory under my output path existed and that the files inside it looked essentially complete, so I mistakenly assumed the task ran out of memory while flushing from the temporary directory to the final directory. I increased the memory, but the result was the same. Then I thought it over more carefully: that could not be right, because promoting files from the _temporary directory to the final directory is just a move, which should not need much memory (a rough sketch of that idea follows). So I changed the direction of my investigation.
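
        To illustrate that reasoning (purely a sketch, not anything from the original job): the promotion is handled by Hadoop's FileOutputCommitter, and at its core it is an HDFS rename, i.e. a namenode metadata operation, which is why giving the task more memory could not have helped. The paths below are hypothetical.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitIsJustARename {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // hypothetical paths, standing in for a task's committed file and its final location
        Path temp = new Path("/user/darren/xxx/_temporary/1/task_x/part-r-00088.snappy.parquet");
        Path dest = new Path("/user/darren/xxx/part-r-00088.snappy.parquet");
        fs.rename(temp, dest); // no data is copied; only namenode metadata changes
    }
}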

        The log of a reduce task that finished normally, by contrast, looked like this:

Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Dictionary is on
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Validation is off
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
Mar 8, 2018 9:11:48 AM INFO: parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 264,951
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 49B for [level_1_id] BINARY: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 13B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 50B for [level_2_id] BINARY: 3,008 values, 16B raw, 18B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 5 entries, 38B raw, 5B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 274B for [level_3_id] BINARY: 3,008 values, 230B raw, 234B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 82 entries, 1,140B raw, 82B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 881B for [level_4_id] BINARY: 3,008 values, 1,296B raw, 841B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 199 entries, 2,778B raw, 199B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 31B for [level_5_id] BINARY: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 4B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 811B for [demographic_id] INT32: 3,008 values, 943B raw, 775B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 21 entries, 84B raw, 21B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [level_type_code] INT32: 3,008 values, 34B raw, 36B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 5 entries, 20B raw, 5B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [country_id] INT32: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 4B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [platform_type_code] INT32: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 4B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [device_type_code] INT32: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 4B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 571B for [frequency] BINARY: 3,008 values, 1,511B raw, 541B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 11 entries, 58B raw, 11B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 47B for [intab_period_id] INT64: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 8B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [intab_period_type_code] INT32: 3,008 values, 3B raw, 5B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1 entries, 4B raw, 1B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 523B for [desktop_reach] DOUBLE: 3,008 values, 587B raw, 479B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 115 entries, 920B raw, 115B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 572B for [desktop_impressions] DOUBLE: 3,008 values, 684B raw, 528B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 129 entries, 1,032B raw, 129B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 685B for [mobile_reach] DOUBLE: 3,008 values, 828B raw, 641B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 153 entries, 1,224B raw, 153B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 754B for [mobile_impressions] DOUBLE: 3,008 values, 920B raw, 710B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 188 entries, 1,504B raw, 188B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 174B for [ott_reach] DOUBLE: 3,008 values, 128B raw, 130B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 2 entries, 16B raw, 2B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 174B for [ott_impressions] DOUBLE: 3,008 values, 128B raw, 130B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 2 entries, 16B raw, 2B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 751B for [total_digital_reach] DOUBLE: 3,008 values, 878B raw, 707B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 196 entries, 1,568B raw, 196B comp}
Mar 8, 2018 9:11:49 AM INFO: parquet.hadoop.ColumnChunkPageWriteStore: written 834B for [total_digital_impressions] DOUBLE: 3,008 values, 997B raw, 790B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 230 entries, 1,840B raw, 230B comp}

        So I guessed that the problem occurred while writing to the temporary directory: the failing task had nothing to write, which suggested the bug was in the program itself.


        Then I looked at the

syslog:

2018-03-07 17:23:07,294 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.fs.FileAlreadyExistsException: /user/darren/xxx/part-r-00149.snappy.parquet for client 10.251.26.21 already exists
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2847)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2739)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2624)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:595)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.create(AuthorizationProviderProxyClientProtocol.java:112)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:395)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

        This error looks more concrete, but I already knew what causes it, and it is not the root cause; see my earlier post Hadoop多文件(目录)输出 以及MultipleInputs存在的问题 (on multi-file/multi-directory output and the problems with MultipleInputs). The exception comes from using MultipleOutputs: when the task timed out and failed, the files it had already written were not deleted, so the retried attempt collided with an existing file name and threw this error (a small standalone demo follows). In other words, it is a consequence of the failure and retry, not the root cause of the failure itself.
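
        To make that failure mode concrete, here is a small standalone sketch (not code from the job): an HDFS create with overwrite disabled, on a path that a previous attempt already created, fails in exactly the way the syslog shows. The demo shows generic HDFS behavior; that the job's writer effectively used overwrite=false is inferred from the exception above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateCollisionDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/user/darren/xxx/part-r-00149.snappy.parquet"); // path taken from the syslog above
        // First attempt: the file does not exist yet, so create() succeeds.
        fs.create(out, false).close();
        // Retried attempt: it rebuilds the same name; with overwrite=false the
        // namenode rejects the create with FileAlreadyExistsException.
        fs.create(out, false).close();
    }
}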

        This reinforced my earlier guess that the program itself was broken: context.write never actually wrote anything, so something must have gone wrong before the write, and the task then timed out and was retried. That made me suspect an infinite loop, so I added log statements at the places where a loop might fail to terminate (in the spirit of the sketch below), and that finally exposed the problem: it was indeed an infinite loop.
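
        A minimal sketch of that kind of guard logging (illustrative only: process() and isDone() stand in for whatever the real loop body and exit condition were, and the threshold is arbitrary):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoopGuardExample {
    private static final Log LOG = LogFactory.getLog(LoopGuardExample.class);

    static void runSuspectedLoop(String key) {
        long iterations = 0;
        while (!isDone()) {
            process();
            // A loop that keeps hitting this line long after the input should
            // have been consumed is probably not terminating, which is what
            // made the reduce task time out in the first place.
            if (++iterations % 1_000_000 == 0) {
                LOG.warn("suspicious loop: " + iterations + " iterations for key " + key);
            }
        }
    }

    private static boolean isDone() { return true; } // placeholder exit condition
    private static void process() { }                // placeholder loop body
}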

        Case closed.


        I hope this helps.
