Flume stores the data received on a network port into Hive

1. a1.sinks.s1.type = hive

(1) When using Hive as a Flume sink, the Hive table must meet these requirements:

  • The table must be a transactional table
  • The table may be partitioned (this is optional; the example below uses an unpartitioned table)
  • The table must be a bucketed table
  • The table must be stored as ORC

In short: bucketed, transactional, and stored in the ORC format.

(2) Copy Hive's HCatalog jar dependencies into the lib directory of the Flume installation; otherwise the Hive sink will fail to start.

cp /usr/local/apache-hive-3.1.2-bin/hcatalog/share/hcatalog/*.jar /usr/local/apache-flume-1.9.0-bin/lib/
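
To confirm the dependencies were copied, you can list the Flume lib directory (a quick sanity check; the exact jar names vary with the Hive version):

ls /usr/local/apache-flume-1.9.0-bin/lib/ | grep hcatalog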

(3) Modify the hive configuration file to support transaction processing.

vi /usr/local/apache-hive-3.1.2-bin/conf/hive-site.xml

Add the following content.

<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
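
After restarting Hive, the settings can be verified from the command line (in the Hive CLI, set with a property name but no value prints its current value):

hive -e "set hive.txn.manager; set hive.support.concurrency;"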

(4) Start Hadoop, MySQL, and the Hive metastore

Start the Hadoop cluster.

cd /usr/local/hadoop-3.1.4/sbin/

./start-all.sh

Start the MySQL service.

service mysqld start

Start the Hive metastore service in the background.

hive --service metastore &
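
Before moving on, it helps to confirm all three services are up. A minimal check (jps lists the Hadoop daemons and the metastore JVM, which appears as RunJar; 9083 is the default metastore port):

jps
netstat -tlnp | grep 9083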

Create the database and table:

create database flume;
use flume;

create table people(
id int,
name string,
age int)
clustered by (id) into 2 buckets
row format delimited lines terminated by '\n'
stored as orc
tblproperties('transactional'='true');
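
To double-check that the table meets the requirements from step (1), inspect its metadata; the output shows the bucket count, the ORC storage format, and the transactional table property:

describe formatted people;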

Collection plan:

# Define the names of the three major components Source, Channel, and Sink in this agent
a1.sources = r1
a1.channels = c1
a1.sinks = s1

# Configure the Source component: collect data from port 44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = ic
a1.sources.r1.interceptors.ic.type = timestamp
a1.sources.r1.interceptors.ic.headerName = time
a1.sources.r1.interceptors.ic.preserveExisting = false

# Configure the Channel component: the intermediate cache uses memory cache
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.keep-alive = 5

# Configure the Sink component: store the data in Hive
a1.sinks.s1.type = hive
# URL of hive metastore
a1.sinks.s1.hive.metastore = thrift://master:9083
# hive database name
a1.sinks.s1.hive.database = flume
# hive table name
a1.sinks.s1.hive.table = people
a1.sinks.s1.batchSize = 150
a1.sinks.s1.serializer = DELIMITED
a1.sinks.s1.serializer.delimiter = "\t"
a1.sinks.s1.serializer.serdeSeparator = '\t'
a1.sinks.s1.serializer.fieldnames = id,name,age

# Describe and configure the connection relationship between source channel sink
a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
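
To run this plan, save it to a file and start the agent. The file name hive-sink.conf below is an assumption; adjust it to wherever you saved the configuration:

cd /usr/local/apache-flume-1.9.0-bin
bin/flume-ng agent --conf conf --conf-file conf/hive-sink.conf --name a1 -Dflume.root.logger=INFO,console

Then, from another terminal, send a test record whose fields are separated by the configured delimiter, and query the table in Hive:

printf '1\tTom\t25\n' | nc master 44444

select * from flume.people;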

2. a1.sinks.sk1.type = hdfs

Create the table:
create database if not exists flume;
use flume;
create table student(
source string,
name string,
grade string
)
row format delimited fields terminated by '\t';
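
The HDFS sink in the plan below writes events straight into this table's warehouse directory, so the sink's hdfs.path must match the table's location. You can confirm the location from Hive (check the Location field in the output):

describe formatted student;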

Collection plan:
# Define the names of the three major components Source, Channel, and Sink in this agent
a1.sources = s1
a1.channels = c1
a1.sinks = sk1

# Configure the Source component: collect data from port 8888
a1.sources.s1.type = netcat
a1.sources.s1.bind = master
a1.sources.s1.port = 8888

# Configure the Channel component: the intermediate cache uses memory cache
a1.channels.c1.type = memory

# Configure the Sink component: write events into the Hive table's warehouse directory
a1.sinks.sk1.type = hdfs
a1.sinks.sk1.hdfs.path = /user/hive/warehouse/flume.db/student
# DataStream writes the event body as plain text
a1.sinks.sk1.hdfs.fileType = DataStream

# Describe and configure the connection relationship between source channel sink
a1.sinks.sk1.channel = c1
a1.sources.s1.channels = c1
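
As before, start the agent (the file name hdfs-sink.conf is an assumption), send a tab-separated record matching the student table's three columns, and query it from Hive:

bin/flume-ng agent --conf conf --conf-file conf/hdfs-sink.conf --name a1 -Dflume.root.logger=INFO,console

printf 'web\tTom\tA\n' | nc master 8888

select * from flume.student;

Because hdfs.fileType = DataStream writes the events as plain text directly into the table's directory, Hive can query them immediately, with no separate load step.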
