Newline characters in Hive table fields split one row into several or misalign the columns

1. Problem

A field value contains newline characters, so the query result has extra rows: https://issues.apache.org/jira/browse/HIVE-1898
( LazySimpleSerDe.SerDeParameters --> which eliminates the newline chars during deserialization )

#################### HBase data: field values contain newline characters
hbase(main):001:0> get 'test','r2'
COLUMN                                   CELL
 f:age                                   timestamp=1599799051999, value=23
 f:name                                  timestamp=1600840740995, value=test2\x0Atest1\x0Aa\x0Aa\x0Aa\x0Aa
2 row(s) in 0.1930 seconds

hbase(main):003:0> get 'test','r4'
COLUMN                                   CELL
 f:id                                    timestamp=1600842800999, value=2
 f:name                                  timestamp=1600842800999, value=tom1\x0Dtom2\x0Dtom3
2 row(s) in 0.0050 seconds


#################### Hive external table over the HBase data: query results are split across rows
hive> show create table test_hbase_ext;
CREATE EXTERNAL TABLE `test_hbase_ext`(
  `rowkey` string COMMENT 'from deserializer',
  `name` string COMMENT 'from deserializer',
  `age` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,f:name,f:age',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='test',
  'transient_lastDdlTime'='1600840936')
Time taken: 0.095 seconds, Fetched: 14 row(s)


hive> select * from test_hbase_ext where rowkey ='r2';
Stage-Stage-1: Map: 1   Cumulative CPU: 5.51 sec   HDFS Read: 11818 HDFS Write: 26 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 510 msec
OK
test_hbase_ext.rowkey   test_hbase_ext.name     test_hbase_ext.age
r2      test2   NULL
test1   NULL    NULL
a       NULL    NULL
a       NULL    NULL
a       NULL    NULL
a       23      NULL
Time taken: 22.267 seconds, Fetched: 6 row(s)

hive> select * from test_hbase_ext where rowkey ='r4';
Stage-Stage-1: Map: 1   Cumulative CPU: 5.37 sec   HDFS Read: 4368 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 370 msec
OK
test_hbase_ext.rowkey   test_hbase_ext.name     test_hbase_ext.age
r4      tom1    NULL
tom2    NULL    NULL
tom3    NULL    NULL
Time taken: 16.522 seconds, Fetched: 3 row(s)
hive>

2. Cause

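#-- Cause: (HBase: \x0A or \x0D --> Hive: data1<newline>data2<newline>data3 --> extra rows)
#-- Idea: when Hive reads the data, convert the newlines inside it into a custom marker such as MYEOF
\x0A: ASCII 10, i.e. \n (line feed)
\x0D: ASCII 13, i.e. \r (carriage return)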
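The splitting can be reproduced outside Hive. The following is a minimal Java sketch, not Hive's actual code path: it assumes a row is written as tab-delimited text and read back line by line, with \r\n, \r, and \n all treated as row terminators. The class and method names (NewlineSplitDemo, serializeRow, readRows) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class NewlineSplitDemo {
    // Join the fields of one row with tabs (assumed field delimiter for this demo).
    static String serializeRow(String... fields) {
        return String.join("\t", fields);
    }

    // Read the text back line by line; \r\n, \r and \n each end a row.
    static List<String> readRows(String text) {
        return Arrays.asList(text.split("\r\n|\r|\n"));
    }

    public static void main(String[] args) {
        // Row r2 from the HBase table: the name field holds five \n characters.
        String r2 = serializeRow("r2", "test2\ntest1\na\na\na\na", "23");
        for (String row : readRows(r2)) {
            // Show field boundaries explicitly instead of raw tabs.
            System.out.println(row.replace("\t", " | "));
        }
    }
}
```

Under these assumptions the single r2 row is read back as six lines. The first carries only rowkey and the start of name, and the last carries the tail of name followed by age, which matches the shifted columns and NULLs in the Hive output above.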

3. Solution

Hive (data1MYEOFdata2MYEOFdata3) --> intermediate Java processing step: string.replaceAll("MYEOF", "\n") --> the data displays correctly

hive> SELECT rowkey,regexp_replace(name,'\r|\n|\r\n','MYEOF') from test_hbase_ext2;
Stage-Stage-1: Map: 1   Cumulative CPU: 5.46 sec   HDFS Read: 4206 HDFS Write: 139 SUCCESS
rowkey  _c1
r1      test
r2      test2MYEOFtest1MYEOFaMYEOFaMYEOFaMYEOFa
r3      tom1MYEOFtom2MYEOFtom3
r4      tom1MYEOFtom2MYEOFtom3
Time taken: 15.117 seconds, Fetched: 4 row(s)
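The Java side of this round trip might look like the sketch below. It is illustrative only: the class name NewlinePlaceholder and its methods are hypothetical, and it assumes MYEOF never occurs in real data. The encode pattern lists \r\n first so that a Windows-style line ending becomes one marker instead of two.

```java
import java.util.regex.Pattern;

class NewlinePlaceholder {
    // Marker used in the Hive query above; assumed never to appear in real values.
    static final String PLACEHOLDER = "MYEOF";

    // \r\n comes first so a CRLF pair is replaced once, not twice.
    private static final Pattern LINE_BREAK = Pattern.compile("\r\n|\r|\n");

    // Same effect as the regexp_replace in the Hive query, for values handled in Java.
    static String encode(String value) {
        return LINE_BREAK.matcher(value).replaceAll(PLACEHOLDER);
    }

    // Restore line breaks after the rows have been fetched from Hive.
    // Every marker becomes \n; the original \r vs \n distinction is lost.
    static String decode(String value) {
        return value.replace(PLACEHOLDER, "\n");
    }

    public static void main(String[] args) {
        String fromHive = "test2MYEOFtest1MYEOFaMYEOFaMYEOFaMYEOFa";
        System.out.println(decode(fromHive));
    }
}
```

decode uses String.replace rather than replaceAll, so the marker is treated as a literal string and not as a regular expression. If the difference between \r and \n matters downstream, two distinct markers would be needed.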


Reposted from blog.csdn.net/eyeofeagle/article/details/108780507