1. Problem
A field value contains newline characters, so the query result shows extra rows: https://issues.apache.org/jira/browse/HIVE-1898
( LazySimpleSerDe.SerDeParameters --> which eliminates the newline chars during deserialization )
#################### HBase data: field values contain newline characters
hbase(main):001:0> get 'test','r2'
COLUMN CELL
f:age timestamp=1599799051999, value=23
f:name timestamp=1600840740995, value=test2\x0Atest1\x0Aa\x0Aa\x0Aa\x0Aa
2 row(s) in 0.1930 seconds
hbase(main):003:0> get 'test','r4'
COLUMN CELL
f:id timestamp=1600842800999, value=2
f:name timestamp=1600842800999, value=tom1\x0Dtom2\x0Dtom3
2 row(s) in 0.0050 seconds
#################### Hive external table over the HBase data: query results break across lines
hive> show create table test_hbase_ext;
CREATE EXTERNAL TABLE `test_hbase_ext`(
`rowkey` string COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`age` string COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping'=':key,f:name,f:age',
'serialization.format'='1')
TBLPROPERTIES (
'hbase.table.name'='test',
'transient_lastDdlTime'='1600840936')
Time taken: 0.095 seconds, Fetched: 14 row(s)
hive> select * from test_hbase_ext where rowkey ='r2';
Stage-Stage-1: Map: 1 Cumulative CPU: 5.51 sec HDFS Read: 11818 HDFS Write: 26 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 510 msec
OK
test_hbase_ext.rowkey test_hbase_ext.name test_hbase_ext.age
r2 test2 NULL
test1 NULL NULL
a NULL NULL
a NULL NULL
a NULL NULL
a 23 NULL
Time taken: 22.267 seconds, Fetched: 6 row(s)
hive> select * from test_hbase_ext where rowkey ='r4';
Stage-Stage-1: Map: 1 Cumulative CPU: 5.37 sec HDFS Read: 4368 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 370 msec
OK
test_hbase_ext.rowkey test_hbase_ext.name test_hbase_ext.age
r4 tom1 NULL
tom2 NULL NULL
tom3 NULL NULL
Time taken: 16.522 seconds, Fetched: 3 row(s)
hive>
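2. Cause
#-- Cause: (hbase: \x0A or \x0D --> hive: data1<newline>data2<newline>data3 --> extra rows)
#-- Idea: when Hive reads the data, convert the newlines inside it to a custom marker such as MYEOF
\x0A: ASCII 10, i.e. \n (line feed)
\x0D: ASCII 13, i.e. \r (carriage return)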
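
The row splitting can be reproduced outside Hive. The sketch below is only a simulation, assuming the query result is emitted as text with tab-separated fields and that the consumer treats \n and \r as row terminators; it is not Hive's own code. The class and method names (EmbeddedNewlineDemo, countLines) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class EmbeddedNewlineDemo {
    // Counts how many lines a line-oriented reader would see in one
    // serialized record. \R (Java 8+) matches \r\n, \n and \r.
    static int countLines(String record) {
        return record.split("\\R", -1).length;
    }

    public static void main(String[] args) {
        // Byte values of the two characters HBase prints as \x0A and \x0D.
        byte[] codes = "\n\r".getBytes(StandardCharsets.US_ASCII);
        System.out.println("LF=" + codes[0] + " CR=" + codes[1]);

        // r2 as one logical record: rowkey, name (with embedded \n), age.
        String r2 = "r2\ttest2\ntest1\na\na\na\na\t23";
        // r4 as one logical record: name uses \r instead of \n.
        String r4 = "r4\ttom1\rtom2\rtom3\t2";

        System.out.println("r2 lines: " + countLines(r2));
        System.out.println("r4 lines: " + countLines(r4));
    }
}
```

Under these assumptions r2 is seen as six lines and r4 as three, which matches the "Fetched: 6 row(s)" and "Fetched: 3 row(s)" counts in the Hive output above.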
3. Solution
hive (data1MYEOFdata2MYEOFdata3) --> intermediate Java processing step: string.replaceAll("MYEOF", "\n") --> data displays correctly
hive> SELECT rowkey,regexp_replace(name,'\r|\n|\r\n','MYEOF') from test_hbase_ext2;
Stage-Stage-1: Map: 1 Cumulative CPU: 5.46 sec HDFS Read: 4206 HDFS Write: 139 SUCCESS
rowkey _c1
r1 test
r2 test2MYEOFtest1MYEOFaMYEOFaMYEOFaMYEOFa
r3 tom1MYEOFtom2MYEOFtom3
r4 tom1MYEOFtom2MYEOFtom3
Time taken: 15.117 seconds, Fetched: 4 row(s)
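
The Java side of this round trip might look like the sketch below. It assumes MYEOF never occurs in the real data and that turning every restored break into \n is acceptable (the original \r is not recovered). The class and method names (NewlineMarker, encode, restore) are hypothetical. The encode method mirrors the regexp_replace call but lists \r\n first: with the order \r|\n|\r\n, a CRLF pair matches \r and then \n separately and yields two markers.

```java
final class NewlineMarker {
    // Custom marker used in place of line breaks (same token as in the query).
    static final String MARKER = "MYEOF";

    // Replaces each line break with the marker. \r\n comes first in the
    // alternation so a CRLF pair becomes one marker, not two.
    static String encode(String value) {
        return value.replaceAll("\r\n|\r|\n", MARKER);
    }

    // Restores line breaks. String.replace does a literal (non-regex) replace.
    static String restore(String value) {
        return value.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        // Value as returned by the Hive query for rowkey r4.
        String fromHive = "tom1MYEOFtom2MYEOFtom3";
        System.out.println(restore(fromHive));

        // Alternation order matters for CRLF input.
        System.out.println(encode("a\r\nb"));
        System.out.println("a\r\nb".replaceAll("\r|\n|\r\n", MARKER));
    }
}
```

For the data shown here, which contains only single \n or \r characters, both alternation orders give the same result; the difference only appears once a value contains \r\n.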