Kafka Summary

Kafka zero-copy (see the transferTo sketch after the traditional path below):
1. Data is copied from the kernel page cache to the socket buffer.
2. Data is copied from the socket buffer to the NIC (network adapter) buffer and sent over the network.

Traditional path:
1. Data is read from disk into the kernel-space page cache.
2. The application reads the data from kernel space into a user-space buffer.
3. The application copies the data from the user-space buffer into the kernel socket buffer.
4. Data is copied from the socket buffer to the NIC (network adapter) buffer.
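
To make the contrast concrete, here is a minimal Scala sketch (not Kafka's actual source) of the java.nio transferTo call behind sendfile-style zero-copy; the segment path and destination address are placeholders.

import java.io.RandomAccessFile
import java.net.InetSocketAddress
import java.nio.channels.SocketChannel

object ZeroCopySend {
  def main(args: Array[String]): Unit = {
    // Placeholder log segment and destination; adjust to your environment.
    val file = new RandomAccessFile("/tmp/kafka-logs/mykafka-0/00000000000000000000.log", "r")
    val fileChannel = file.getChannel
    val socket = SocketChannel.open(new InetSocketAddress("localhost", 9099))
    try {
      var position = 0L
      val count = fileChannel.size()
      // transferTo lets the kernel move bytes from the page cache to the socket
      // without copying them through user space (sendfile).
      while (position < count)
        position += fileChannel.transferTo(position, count - position, socket)
    } finally {
      socket.close(); fileChannel.close(); file.close()
    }
  }
}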

Spark Streaming + Kafka integration

Receiver-based Approach
1. Kafka topic partitions have no relationship to the RDD partitions generated by Spark Streaming.
2. Increasing the number of partitions in KafkaUtils.createStream only increases the number of threads within a single receiver; it does not increase Spark's parallelism.
3. You can create multiple Kafka input DStreams with different groups and topics, so that multiple receivers receive data in parallel (this is what raises Spark's parallelism).
4. If a fault-tolerant storage system such as HDFS is used and the write-ahead log is enabled, the received data is already replicated in the log,
so set the input stream's storage level to StorageLevel.MEMORY_AND_DISK_SER (i.e. KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER)).

Direct Approach (no receivers: the stream connects to Kafka directly instead of going through a receiver)
Simplified parallelism: there is no need to create multiple Kafka input streams and union them. With directStream, Spark Streaming creates as many RDD partitions as there are Kafka partitions to consume, and all of them read from Kafka in parallel, so there is a one-to-one mapping between Kafka partitions and RDD partitions.


import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.{Assign, Subscribe}


val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],     // StringDeserializer for both key and value
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",                        // start from the latest offset when no committed offset exists
  "enable.auto.commit" -> (false: java.lang.Boolean)
)


val topics = Array("topicA","topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)


stream.map(record =>(record.key,record.value))

Create an RDD over a defined range of offsets, for batch processing:
val offsetRanges = Array(
  // topic, partition, inclusive starting offset, exclusive ending offset
  OffsetRange("test", 0, 0, 100),   // partition 0, offsets 0-99
  OffsetRange("test", 1, 0, 100)    // partition 1, offsets 0-99
)

val  rdd = KafkaUtils.createRDD[String,String](sparkContext,kafkaParams,offsetRanges,PreferConsistent)


Obtaining Offsets

stream.foreachRDD{ rdd =>
       val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
       rdd.foreachPartition{ iter=>
            val o:OffsetRange = offsetRanges(TaskContext.get.partitionId)
            println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
       }
}

Efficiency: with the first approach, achieving zero data loss requires the data to be stored in a write-ahead log, which replicates the data again. This is inefficient -- the data is copied twice, once by Kafka and a second time into the write-ahead log (WAL). The direct approach eliminates this problem: there is no receiver, so no WAL is needed, provided Kafka's data retention is long enough.


Exactly-once:
1. Receiver: uses Kafka's high-level API to store consumed offsets in ZooKeeper, which is traditionally how Kafka consumers track progress. Combined with a WAL this guarantees zero data loss (at-least-once), but under failure messages can be consumed twice, because the data reliably received by Spark Streaming can get out of sync with the offsets tracked in ZooKeeper.
2. Direct: does not use ZooKeeper to track consumed offsets; Spark Streaming tracks the offsets in its checkpoints. This removes the inconsistency between Spark Streaming and ZooKeeper, so each record is effectively received exactly once despite failures. To get exactly-once semantics for the output as well, the operation that saves the data to the external store must either be idempotent (the same output for the same data)
or an atomic transaction that saves both the results and the offsets.


Storing offsets externally

(1) Checkpoint
1. Enabling Spark Streaming's checkpoint is the simplest way to store offsets (it can still be inconsistent: the offset may be recorded even though the batch was not processed successfully).
Drawbacks:
1. Spark cannot recover the offsets across applications.
2. Upgrading Spark makes recovery from the old checkpoint impossible.
3. For critical production applications, managing offsets with Spark checkpoints is not recommended (upgrades break it, and it cannot be shared across applications).

2. Streaming checkpoints are meant for saving the application's state, e.g. on HDFS, so that it can be recovered after a failure.
Compared with ZooKeeper or HBase, HDFS has higher latency; if managed carelessly, writing the offsetRanges of every batch to HDFS can also lead to a small-files problem.
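
For reference, a minimal sketch of driver-side checkpointing with StreamingContext.getOrCreate; the checkpoint directory and batch interval are placeholders, not values from this article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder checkpoint location; in practice an HDFS path.
val checkpointDir = "hdfs:///tmp/spark-streaming-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-direct-checkpoint")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)   // offsets are stored as part of the checkpoint
  // ... create the direct stream and define the processing here ...
  ssc
}

// On restart, the context (and the Kafka offsets it tracked) is rebuilt from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()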


(2) HBase
1. A generic HBase-based design: a single table can store the offsets of topics consumed by multiple Spark Streaming applications.

2. rowkey = topicName + groupId + batchTimeOfStreaming (milliseconds). The batchTime.milliSeconds component is not strictly required, but it lets you inspect how offsets were managed for each historical batch.

3. Kafka offsets are stored in the following table, with entries expiring automatically after 30 days:
create 'spark_kafka_offsets', {NAME=>'offsets', TTL=>2592000}


4. Offset retrieval scenarios
Scenario 1: the streaming job starts for the first time. It looks up the number of partitions of the given topic in ZooKeeper and returns "0" as the offset for every topic partition.

Scenario 2: a long-running streaming job was stopped and new partitions were added to the Kafka topic. The job looks up the partition count in ZooKeeper; for all pre-existing topic partitions the latest offsets stored in HBase are returned, and for all new partitions it returns "0".

Scenario 3: a long-running streaming job was stopped and the topic did not change. In this case the latest offsets found in HBase are returned as the offsets for each topic partition.


hbase(main):009:0> scan 'spark_kafka_offsets'
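
A minimal sketch of persisting the offset ranges of one batch into the table above with the standard HBase client; the connection settings, column qualifiers, and helper name are illustrative assumptions, not code from the original article:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.kafka010.OffsetRange

def saveOffsetsToHBase(topic: String, groupId: String, batchTime: Long,
                       offsetRanges: Array[OffsetRange]): Unit = {
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("spark_kafka_offsets"))
  try {
    // rowkey = topicName + groupId + batchTime (milliseconds), as described above
    val put = new Put(Bytes.toBytes(s"$topic:$groupId:$batchTime"))
    offsetRanges.foreach { o =>
      // one column per partition, value = untilOffset of this batch
      put.addColumn(Bytes.toBytes("offsets"), Bytes.toBytes(o.partition.toString),
        Bytes.toBytes(o.untilOffset.toString))
    }
    table.put(put)
  } finally {
    table.close(); conn.close()
  }
}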


stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

(3) ZooKeeper

1. Path:
val zkPath = s"${kafkaOffsetRootPath}/${groupName}/${o.topic}/${o.partition}"

2. If no offset is saved in ZooKeeper, fall back to the latest or earliest offset according to the kafkaParams configuration.


3. If an offset is saved in ZooKeeper, use it as the starting position of the Kafka stream.
Inspect it with zkCli:
ls /kafka0.9/mykafka/consumer/offsets/testp/mytest1

get /kafka0.9/mykafka/consumer/offsets/testp/mytest1/0
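
A minimal sketch of reading and writing such a znode with Apache Curator; the connection string, root path, group name, and retry policy are assumptions for illustration:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.streaming.kafka010.OffsetRange

val zk = CuratorFrameworkFactory.newClient("hadoop:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()

def saveOffset(rootPath: String, groupName: String, o: OffsetRange): Unit = {
  val zkPath = s"$rootPath/$groupName/${o.topic}/${o.partition}"
  if (zk.checkExists().forPath(zkPath) == null)
    zk.create().creatingParentsIfNeeded().forPath(zkPath)
  // store the end of the processed range as the next starting position
  zk.setData().forPath(zkPath, o.untilOffset.toString.getBytes("UTF-8"))
}

def readOffset(rootPath: String, groupName: String, topic: String, partition: Int): Option[Long] = {
  val zkPath = s"$rootPath/$groupName/$topic/$partition"
  if (zk.checkExists().forPath(zkPath) == null) None   // fall back to auto.offset.reset
  else Some(new String(zk.getData.forPath(zkPath), "UTF-8").toLong)
}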


Drawback: if Hadoop, Hive, Spark, and HBase are all deployed as clusters that already depend on ZooKeeper, this puts extra pressure on an already heavily loaded ZooKeeper ensemble, making ZooKeeper failures more likely and affecting normal cluster operation.

(4) Kafka
Kafka itself can commit offsets periodically via the enable.auto.commit parameter, which guarantees that offsets are stored.
However, a problem remains: if a batch fails before your Spark output operation has actually succeeded, the offsets are committed anyway, which yields undefined semantics. This is why Spark disables the feature by default (enable.auto.commit=false).
You can instead use the commitAsync API. Compared with checkpoints, the benefit is that Kafka remains a durable offset store regardless of changes to your application code (checkpoints are sensitive to code changes and then require the offsets to be specified again). However, Kafka commits are not transactional, so your outputs must still
be idempotent, just as with checkpointing.


stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // some time later, after outputs have completed
  // CanCommitOffsets is only available on the result of createDirectStream, not after transformations.
  // commitAsync is thread-safe, but must happen after your outputs for the semantics to be meaningful.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

(5) Your own data store
For data stores that support transactions, saving the offsets in the same transaction as the results keeps the two in sync even under failure.
If you are careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting the results. This gives exactly-once semantics.
It is even possible to use this tactic for the outputs of aggregations, which are otherwise hard to make idempotent.

// The details depend on your data store, but the general idea looks like this.

// begin from the offsets committed to the database

val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>   // rowkey: topic + partition + batchTime (milliseconds)
  new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String,String](
           streamingContext,
           PreferConsistent,
           Assign[String,String](fromOffsets.keys.toList,kafkaParams,fromOffsets)
)


stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = yourCalculation(rdd)

  // begin your transaction

  // insert/update the results

  // update offsets where the end of the existing offsets matches the beginning of this batch of offsets

  // assert that offsets were updated correctly

  // end your transaction
}

SSL/TLS: securing the communication between Spark and Kafka
val kafkaParams = Map[String, Object](
  // the usual params; make sure to change the port in bootstrap.servers if 9092 is not TLS
  "security.protocol" -> "SSL",
  "ssl.truststore.location" -> "/some-directory/kafka.client.truststore.jks",
  "ssl.truststore.password" -> "test1234",
  "ssl.keystore.location" -> "/some-directory/kafka.client.keystore.jks",
  "ssl.keystore.password" -> "test1234",
  "ssl.key.password" -> "test1234"
)


(6) Not saving Kafka offsets at all: acceptable only if some data loss can be tolerated.


(7) Decide whether to manage offsets based on business requirements
1. For example, real-time activity monitoring only needs the most recent data and does not need offset management. In that case, with the old low-level API set auto.offset.reset to largest or smallest; with the new consumer API the equivalent values are latest or earliest.
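
A minimal sketch of consumer parameters for such a "latest data only" job, assuming the new consumer API; the group id is a placeholder:

import org.apache.kafka.common.serialization.StringDeserializer

val monitoringParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "realtime_monitoring",               // placeholder group id
  "auto.offset.reset" -> "latest",                   // always start from the newest data
  "enable.auto.commit" -> (true: java.lang.Boolean)  // losing a few records is acceptable here
)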


Each broker has its own broker.id, set in its server.properties.


Kafka installation: a single machine can host one or more brokers, depending on your requirements and production workload.

1. Download a Kafka version that matches your Spark and HDFS versions.

2. Unpack it:

tar xzvf kafka.tar.gz


3. Install it (details omitted) and configure ZooKeeper's zoo.cfg.

Create the ZooKeeper data and log directories:
sudo mkdir /usr/cdh/spark/zkdata/
sudo mkdir /usr/cdh/spark/zkdata/zklogs

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.


# Pick ONE dataDir/dataLogDir pair below for this ZooKeeper instance;
# the three blocks are alternatives used for different deployments.

# zookeeper data
dataDir=/usr/cdh/zookeeper/data/


# hadoop zk data/logs
dataDir=/usr/cdh/hadoop/zkdata
dataLogDir=/usr/cdh/hadoop/zkdata/zklogs


# spark zk data/logs
dataDir=/usr/cdh/spark/zkdata/
dataLogDir=/usr/cdh/spark/zkdata/zklogs

# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1


# assign several hostname:port pairs to server.<id>
# The entries below are only needed in a multi-machine cluster:
#server.0=Master:2888:3888
#server.1=Worker1:2888:3888
#server.2=Worker2:2888:3888

4. Under the appropriate user, configure Kafka's environment variables in the shell profile:
vim .profile

export KAFKA_HOME=/usr/cdh/kafka
export PATH=$PATH:$KAFKA_HOME/bin:$SCALA_HOME/bin:$JAVA_HOME/bin

Reload the configuration:
. .profile

5. Configure the Kafka broker configuration file (one configuration file per broker).


broker.id, listeners, and log.dirs must be different in every server.properties on the same machine.


server.properties

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
# Each broker has its own broker.id; ids must be unique within the Kafka cluster, otherwise the duplicate broker will not start.
broker.id=0

############################# Socket Server Settings #############################

# If several brokers run on the same machine, each must listen on a different port, otherwise the second broker will not start.
listeners=PLAINTEXT://:9092


# The port the socket server listens on
#port=9092

# Hostname the broker will bind to. If not set, the server will bind to all interfaces
#host.name=localhost

# Hostname the broker will advertise to producers and consumers. If not set, it uses the
# value for "host.name" if configured.  Otherwise, it will use the value returned from
# java.net.InetAddress.getCanonicalHostName().
#advertised.host.name=<hostname routable by clients>

# The port to publish to ZooKeeper for clients to use. If this is not set,
# it will publish the same port that the broker binds to.
#advertised.port=<port accessible by clients>

# The number of threads handling network requests
num.network.threads=3

# The number of threads doing disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600


############################# Log Basics #############################

# A comma seperated list of directories under which to store log files
# The directory where Kafka stores message data -- very important. If one machine hosts several brokers, this directory must also differ per broker, otherwise the same startup problem occurs.
log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
# ZooKeeper connection URL for Kafka; it is recommended to append a chroot path to the URL so that all Kafka znodes are kept together.
zookeeper.connect=hadoop:2181/kafka0.9

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000


A second broker on the same machine uses server1.properties. It is identical to server.properties above except for the entries below; broker.id, listeners, and log.dirs must all be unique for each broker on the same host, while zookeeper.connect stays the same:

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1

# A different listener port, since 9092 is taken by the first broker.
listeners=PLAINTEXT://:19092

# A different message-log directory, since /tmp/kafka-logs is used by the first broker.
log.dirs=/tmp/kafka-logs1

zookeeper.connect=hadoop:2181/kafka0.9

For a second broker on a different machine, server.properties is again identical to the first one except for the entries below. The listener port and log.dirs only have to be unique per host, so they can match the first broker's values; broker.id must still be unique across the whole cluster, and zookeeper.connect keeps the same chroot so all Kafka znodes stay together:

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1

# The same port is fine because this broker runs on a different machine.
listeners=PLAINTEXT://:9092

# The same directory is fine because this broker runs on a different machine.
log.dirs=/tmp/kafka-logs

# otherhostname: the other machine's IP or hostname
zookeeper.connect=otherhostname:2181/kafka0.9


6. Start Kafka (run one start command per broker configuration):

kafka-server-start.sh /usr/cdh/kafka/config/server.properties

kafka-server-start.sh /usr/cdh/kafka/config/server1.properties

To shut down:
kafka-server-stop.sh

7. Kafka operations

(a) Create a topic

kafka-topics.sh --zookeeper hadoop:2181/kafka0.9 --create --topic mykafka --replication-factor 2 --partitions 3

The replication factor of a topic cannot exceed the number of brokers, otherwise the command fails:

kafka-topics.sh --zookeeper hadoop:2181/kafka0.9 --create --topic mykafka --replication-factor 3 --partitions 3
Error while executing topic command : replication factor: 3 larger than available brokers: 2
[2018-04-05 19:31:35,515] ERROR kafka.admin.AdminOperationException: replication factor: 3 larger than available brokers: 2
    at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:77)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:236)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
 (kafka.admin.TopicCommand$)


(b) List topics (for the full set of options, run kafka-topics.sh --help)
kafka-topics.sh --zookeeper hadoop:2181/kafka0.9 --list

mykafka

(c) Delete a topic
kafka-topics.sh --zookeeper hadoop:2181/kafka0.9 --delete --topic mykafka

Then use zkCli.sh to remove the topic's metadata from ZooKeeper:
[zk: localhost:2181(CONNECTED) 13] ls /kafka0.9/brokers/topics
[mykafka]
[zk: localhost:2181(CONNECTED) 14] rmr /kafka0.9/brokers/topics/mykafka

Check the list again:
spark@hadoop:~$ kafka-topics.sh --zookeeper hadoop:2181/kafka0.9 --list


(d) If a topic was created incorrectly, delete it and recreate it; modifying it in place is cumbersome and error-prone.


(e) Start a console producer
spark@hadoop:~$ kafka-console-producer.sh --broker-list hadoop:9092,hadoop:19092 --topic mykafka
Type the lines below; they will show up in the consumer console:
df
jack
mary

(f) Start a console consumer
kafka-console-consumer.sh   --zookeeper hadoop:2181/kafka0.9 --topic mykafka
df
jack
mary


How Kafka stores data
Logically, data is organized by topic: messages are grouped by topic, and each topic manages the data of its partitions.
Physically, data is stored by partition. A partition directory contains segments, and each segment consists of an *.index file and a *.log file:
*.index is the index Kafka uses to locate messages; *.log stores the actual message data. For example:

Each partition directory is named topic + ordinal number (starting at 0; the largest is the partition count minus 1):
drwxr-xr-x  2 spark hadoop  4096 4月   5 20:01 mykafka-0/
drwxr-xr-x  2 spark hadoop  4096 4月   5 20:01 mykafka-1/
drwxr-xr-x  2 spark hadoop  4096 4月   5 20:01 mykafka-2/


(figure omitted: partition/segment directory and .index/.log file layout)

How data is located:
1. Within a partition, messages are strictly ordered by their offsets, regardless of which machine the partition lives on.
2. Segment data files are named after the offset of the first message they contain (the base offset), zero-padded to 20 digits. For example, with the files 00000000000000000000.log, 00000000000000170410.log, and 00000000000000239430.log:
00000000000000000000.log covers offsets 0 through 170409
00000000000000170410.log starts at offset 170410
00000000000000239430.log starts at offset 239430

Message lookup:

1. Find the topic, then the directory of the relevant partition under it.
2. From the requested offset, pick the *.index file whose base offset range covers it; that index points into the matching *.log file.
3. The *.log file name gives the base offset of its first message within the partition, so a message's absolute offset is base offset + the relative offset from the index entry.
4. If the requested offset is beyond the last entry of an index file, move on to the next index file; otherwise the message lives in the data file belonging to this index.
5. The index entry gives the message's position inside the log file, and thus its physical offset, from which the message data is read.

Example: the index entry [3, 348] means the message at relative offset 3 in the data file, stored at physical position 348; with data file 00000000000000170410.log its absolute offset is 170410 + 3 = 170413.

1. Start from the requested offset 170413 and find the index file that covers it.
2. Scanning from the first index file, 00000000000000000000.index, the second is 00000000000000170410.index and the third is 00000000000000239430.index (starting at offset 239430), so offset 170413 falls into the second file.
3. In that second index file, the entry for offset 170413 gives the physical position 348, from which the message data is read out of 00000000000000170410.log.
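
A small Scala sketch of the base-offset arithmetic described above; the segment list is the hypothetical one from this example, not read from disk:

// Base offsets of the segment files, i.e. the numeric part of the *.log / *.index names.
val segmentBaseOffsets = Vector(0L, 170410L, 239430L)

// Pick the segment whose base offset is the largest one not greater than the target,
// then express the target as (base offset, relative offset) for the index lookup.
def locate(targetOffset: Long): (Long, Long) = {
  val base = segmentBaseOffsets.filter(_ <= targetOffset).max
  (base, targetOffset - base)
}

// locate(170413) == (170410, 3): look up relative offset 3 in 00000000000000170410.index
println(locate(170413))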


Partition
1. A topic organizes its messages across multiple partitions.
2. Increasing the number of partitions increases read/write concurrency.
3. On disk, a partition consists of multiple segments; each segment has a log file, an index file, and a time-index file. Each log file is also called a segment file. Segments hold varying numbers of messages, which makes it easy to delete old segments and clean up already-consumed messages, improving disk utilization; each partition only needs sequential reads and writes. Segment lifecycle is controlled by broker settings such as log.segment.bytes and log.roll.{ms,hours}.
4. A partition can have multiple replicas, but only one replica is the leader.
5. Reads and writes of a partition go through the leader only.
6. Segment file naming: each segment file is named after the offset of its first message.


Message

Each message has exactly one offset.
Messages are only appended to segments; they cannot be modified or deleted individually.
Segments are deleted periodically (configured by log.segment.bytes, log.roll.{ms,hours}, and log.retention.bytes/hours); with the default retention of 168 hours they are removed after 7 days.


An offset is an ordered sequence number, 8 bytes long.

Consumer Group
1. A consumer group can contain multiple consumer instances.
2. Different consumer groups consume independently and do not affect each other.
3. Consumer instances within a group consume in parallel, without duplication.
4. Size the consumer group according to the number of partitions, so concurrency stays reasonable (instances are neither idle nor overloaded).

High-level consumer API
1. You do not need to manage offsets yourself.
2. By default it gives at-least-once semantics.
3. If there are more consumers than partitions, the extra consumers sit idle (wasteful, but the workload is still served).
4. If there are fewer consumers than partitions, one consumer handles several partitions and can become overloaded.
5. Ideally the partition count is an integer multiple of the consumer count.


Low-level consumer API
You manage offsets yourself.
You can implement whatever delivery semantics you need.

Producer message acknowledgement (acks):


0: do not wait for any acknowledgement that the send succeeded.
1: only the partition leader has to acknowledge the message.
-1 (all): the partition leader and its followers must acknowledge the message.
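
A minimal sketch of setting this on a producer with the standard Java client; the topic name and bootstrap servers are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop:9092,hadoop:19092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
// "0": fire and forget; "1": leader ack only; "all" (-1): leader and followers must ack
props.put(ProducerConfig.ACKS_CONFIG, "all")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("mykafka", "key", "value"))
producer.close()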


转载自blog.csdn.net/dymkkj/article/details/81278485