1. a1.sinks.s1.type = hive
(1) When using Hive as a Flume sink, the target Hive table must meet these requirements:
- The table must be a transactional table
- The table must be a bucketed table
- The table must be stored as ORC
(Partitioned tables are also supported, but not required; the example below uses an unpartitioned table.)
In short: bucketed (clustered by), transactional, ORC storage format.
(2) Copy Hive's HCatalog jar dependencies into the lib directory of the Flume installation, so that the Hive sink can start without missing-class errors.
cp /usr/local/apache-hive-3.1.2-bin/hcatalog/share/hcatalog/*.jar /usr/local/apache-flume-1.9.0-bin/lib/
(3) Modify the Hive configuration file to enable transaction support.
vi /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml
Add the following content:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
(4) Start Hadoop, MySQL, and the Hive metastore
Start the Hadoop cluster:
cd /usr/local/hadoop-3.1.4/sbin/
./start-all.sh
Start the MySQL service:
service mysqld start
Start the Hive metastore service:
hive --service metastore &
Create table:
create database flume;
use flume;
create table people(
id int,
name string,
age int)
clustered by (id) into 2 buckets
row format delimited lines terminated by '\n'
stored as orc
tblproperties('transactional'='true');
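The `clustered by (id) into 2 buckets` clause tells Hive to route each row to one of two bucket files by hashing the `id` column. A minimal sketch of that routing rule (assuming Hive's default hash for an int column, which is the value itself; this helper is illustrative, not part of Hive's API):

```python
def bucket_for(id_value: int, num_buckets: int = 2) -> int:
    """Approximate Hive's bucket assignment for an int clustering column.

    For int columns Hive's default hash is the value itself, so the bucket
    index is simply the (non-negative) hash modulo the bucket count.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets

# With 2 buckets, even ids land in bucket 0 and odd ids in bucket 1.
print([bucket_for(i) for i in range(1, 5)])  # [1, 0, 1, 0]
```

This is why the Hive sink needs a bucketed table: the streaming writer must know deterministically which bucket file each incoming record belongs to.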
Collection plan:
# Define the names of the three major components Source, Channel, and Sink in this agent
a1.sources = r1
a1.channels = c1
a1.sinks = s1
# Configure the Source component: collect data from port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = ic
a1.sources.r1.interceptors.ic.type = timestamp
a1.sources.r1.interceptors.ic.headerName = time
a1.sources.r1.interceptors.ic.preserveExisting = false
# Configure the Channel component: the intermediate cache uses memory cache
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.keep-alive = 5
# Configure Sink component: data storage is Hive
a1.sinks.s1.type = hive
# URL of hive metastore
a1.sinks.s1.hive.metastore = thrift://master:9083
# hive database name
a1.sinks.s1.hive.database = flume
# hive table name
a1.sinks.s1.hive.table = people
a1.sinks.s1.batchSize = 150
a1.sinks.s1.serializer = DELIMITED
a1.sinks.s1.serializer.delimiter = "\t"
a1.sinks.s1.serializer.serdeSeparator = '\t'
a1.sinks.s1.serializer.fieldnames = id,name,age
# Describe and configure the connection relationship between source channel sink
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
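Once the agent is running, the netcat source listens on port 44444 for newline-terminated text, and the DELIMITED serializer splits each event body on the configured tab delimiter into the `id,name,age` fields. A quick way to feed test events is a small client like this sketch (the `send_events` helper and the host/port are just illustrations matching the config above):

```python
import socket

def send_events(host: str, port: int, rows) -> None:
    """Send tab-delimited rows as newline-terminated events to a Flume netcat source."""
    with socket.create_connection((host, port)) as sock:
        for row in rows:
            line = "\t".join(str(field) for field in row) + "\n"
            sock.sendall(line.encode("utf-8"))

# Example: two rows matching the people(id, name, age) table.
# send_events("master", 44444, [(1, "tom", 20), (2, "mary", 19)])
```

After sending a few rows, `select * from people;` in Hive should show the ingested records (allow a moment for the sink's batch to commit).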
2. a1.sinks.sk1.type = hdfs
Create table:
create database flume;
use flume;
create table student(
source string,
name string,
grade string
)
row format delimited fields terminated by '\t';
Collection plan:
# Define the names of the three major components Source, Channel, and Sink in this agent
a1.sources = s1
a1.channels = c1
a1.sinks = sk1
# Configure the Source component: collect data from port 8888
a1.sources.s1.type = netcat
a1.sources.s1.bind = master
a1.sources.s1.port = 8888
# Configure the Channel component: the intermediate cache uses memory cache
a1.channels.c1.type = memory
# Configure the Sink component: write events into the table's warehouse directory on HDFS
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /user/hive/warehouse/flume.db/student
# DataStream writes the raw event bodies (no SequenceFile wrapping)
a1.sinks.sk1.hdfs.fileType = DataStream
# Describe and configure the connection relationship between source, channel, and sink
a1.sources.s1.channels = c1
a1.sinks.sk1.channel = c1
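With `fileType = DataStream` the HDFS sink writes raw event bodies straight into the table directory, so each line must already match the `'\t'` field delimiter declared in the `student` DDL for Hive to read it back correctly. A small sketch of that round trip (the `parse_student_line` helper is hypothetical; the column names come from the DDL above):

```python
def parse_student_line(line: str) -> dict:
    """Split one HDFS-sink output line into the student(source, name, grade) columns."""
    source, name, grade = line.rstrip("\n").split("\t")
    return {"source": source, "name": name, "grade": grade}

print(parse_student_line("flume\talice\tA\n"))
# {'source': 'flume', 'name': 'alice', 'grade': 'A'}
```

If events are sent without tab delimiters, the files still land in HDFS, but `select * from student;` will show NULLs for the columns that could not be split.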