Published 2017/07/25 16:25

1.1 Introduction
Kafka is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:
  It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  It lets you store streams of records in a fault-tolerant way.
  It lets you process streams of records as they occur.
What is Kafka good for?
  It gets used for two broad classes of application:
  Building real-time streaming data pipelines that reliably get data between systems or applications
  Building real-time streaming applications that transform or react to the streams of data 

  To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.
First a few concepts:
  Kafka is run as a cluster on one or more servers.
  The Kafka cluster stores streams of records in categories called topics.
  Each record consists of a key, a value, and a timestamp. 
  Kafka has four core APIs:
  The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to 

  In Kafka, the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.

Topics and Logs
  Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.

  A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

  For each topic, the Kafka cluster maintains a partitioned log that looks like this:

  Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
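  The idea of an ordered, append-only sequence with offsets can be sketched in a few lines. This is a toy model, not Kafka code: the `Partition` class and its methods are hypothetical names chosen for illustration, and a record's offset is simply its index in a Python list.

```python
class Partition:
    """A toy append-only log: a record's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return the record stored at the given offset."""
        if offset < 0 or offset >= len(self._records):
            raise IndexError("offset out of range")
        return self._records[offset]


partition = Partition()
first = partition.append("a")   # first record in the partition
second = partition.append("b")  # appended after it, so it gets the next offset
```

  Existing records are never modified; new ones only go on the end, which is why an offset keeps identifying the same record within the partition.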

  The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
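  As a rough illustration of time-based retention, the sketch below drops records once they are older than the retention period, regardless of whether anyone has read them. `RetainedLog` and its methods are hypothetical names, timestamps are plain seconds, and the record-by-record expiry is a simplification: Kafka itself applies retention to log segments rather than to single records.

```python
from collections import deque

RETENTION_SECONDS = 2 * 24 * 60 * 60  # two days, as in the example above


class RetainedLog:
    """A toy log that discards records older than a retention period."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention_seconds = retention_seconds
        self._records = deque()  # (offset, timestamp, value), oldest first
        self._next_offset = 0

    def append(self, value, timestamp):
        """Store a record with its publish time (in seconds)."""
        self._records.append((self._next_offset, timestamp, value))
        self._next_offset += 1

    def expire(self, now):
        """Discard records older than the retention period, consumed or not."""
        while self._records and now - self._records[0][1] > self.retention_seconds:
            self._records.popleft()

    def offsets(self):
        """Offsets of the records that are still available."""
        return [offset for offset, _, _ in self._records]


log = RetainedLog()
log.append("old", timestamp=0)
log.append("new", timestamp=100_000)
log.expire(now=RETENTION_SECONDS + 1)  # only the first record is past two days
```

  Note that offsets are not reused: after expiry the surviving record keeps the offset it was given when it was appended.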


  In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads records, but because the consumer controls the position, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from "now".
  This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.
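  A consumer-controlled position can be sketched as follows. The `Consumer` class here is a hypothetical toy, not the Kafka client: the log is a shared Python list and each consumer keeps nothing but the offset of the next record it will read.

```python
class Consumer:
    """A toy consumer whose only state is its position in a shared log."""

    def __init__(self, log):
        self.log = log      # shared list of records; the consumer never modifies it
        self.position = 0   # offset of the next record to read

    def poll(self):
        """Return the next record (or None at the end) and advance the position."""
        if self.position >= len(self.log):
            return None
        record = self.log[self.position]
        self.position += 1
        return record

    def seek(self, offset):
        """Jump to any offset, for example an older one to reprocess data."""
        self.position = offset

    def seek_to_end(self):
        """Skip everything already in the log and start consuming from "now"."""
        self.position = len(self.log)


log = ["r0", "r1", "r2"]
reader = Consumer(log)
tailer = Consumer(log)

tailer.seek_to_end()  # the tailer ignores existing records
reader.poll()         # the reader consumes r0 ...
reader.poll()         # ... and r1
reader.seek(0)        # then rewinds to reprocess from the start
```

  Because each consumer only moves its own position, the tailer does not change what the reader sees, and neither of them changes the log.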
  The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

  The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

  Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
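  The leader and follower roles can be illustrated with a small simulation. Everything here is an assumption made for the sketch: the `ReplicatedPartition` class, the broker names, copying the leader's log to followers synchronously on every write, and promoting the first surviving replica. Kafka's real replication and leader election are more involved.

```python
class ReplicatedPartition:
    """A toy partition replicated on several servers, one of which leads."""

    def __init__(self, replicas):
        self.replicas = list(replicas)      # server names
        self.leader = self.replicas[0]      # assumption: first replica leads
        self.logs = {server: [] for server in self.replicas}

    def write(self, record):
        """The leader appends; every follower then copies the leader's log."""
        self.logs[self.leader].append(record)
        for server in self.replicas:
            if server != self.leader:
                self.logs[server] = list(self.logs[self.leader])

    def read(self):
        """Reads are served from the leader's copy of the log."""
        return list(self.logs[self.leader])

    def fail(self, server):
        """Remove a server; if it was the leader, promote a follower."""
        self.replicas.remove(server)
        del self.logs[server]
        if server == self.leader:
            if not self.replicas:
                raise RuntimeError("no replicas left for this partition")
            self.leader = self.replicas[0]  # assumption: first survivor wins


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("m1")
partition.fail("broker-1")  # the leader goes away; a follower takes over
```

  After the failure the new leader already holds a copy of `m1`, so reads continue to return it.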

  Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
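  The two strategies mentioned above, round-robin and key-based, might look like this in outline. `Partitioner` is a hypothetical helper, and CRC32 is used only as a convenient stable hash for the sketch; it is not the hash Kafka's own clients use.

```python
import itertools
import zlib


class Partitioner:
    """Chooses a partition: round-robin without a key, hash-based with one."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: rotate through the partitions to balance load.
            return next(self._round_robin)
        # With a key: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = Partitioner(num_partitions=4)
unkeyed = [partitioner.partition() for _ in range(5)]
keyed_once = partitioner.partition("user-42")
keyed_again = partitioner.partition("user-42")
```

  Routing by key is what lets all records for one key land in the same partition, and therefore be read back in the order they were written.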


  Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

  If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
  If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes. 
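  The two cases above can be checked with a small simulation. The `deliver` function and the group and instance names are hypothetical, and picking an instance per record in round-robin order is a simplification; Kafka assigns whole partitions to instances rather than choosing per record.

```python
from collections import defaultdict


def deliver(records, groups):
    """Deliver each record to exactly one instance in every consumer group.

    `groups` maps a group name to the list of its instance names.
    Returns a dict mapping each instance to the records it received.
    """
    received = defaultdict(list)
    for index, record in enumerate(records):
        for instances in groups.values():
            chosen = instances[index % len(instances)]  # one instance per group
            received[chosen].append(record)
    return dict(received)


records = ["r0", "r1", "r2", "r3"]

# One group with two instances: the records are split between them.
balanced = deliver(records, {"group": ["c1", "c2"]})

# Two groups with one instance each: every instance sees every record.
broadcast = deliver(records, {"group-1": ["c1"], "group-2": ["c2"]})
```

  The same delivery rule gives queue-like load balancing in the first case and publish-subscribe broadcast in the second; only the group membership differs.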

  A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.
  More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

  The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.
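  Dividing partitions over the instances of a group, and redistributing them when membership changes, can be sketched like this. The `assign` function is a hypothetical round-robin scheme written for illustration; it is not one of Kafka's actual assignment strategies.

```python
def assign(partitions, instances):
    """Spread partitions over the live instances as evenly as possible.

    Each partition goes to exactly one instance, so every instance is the
    exclusive consumer of its share.
    """
    members = sorted(instances)
    if not members:
        return {}
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

before = assign(partitions, ["c1", "c2"])
after_join = assign(partitions, ["c1", "c2", "c3"])  # c3 takes over a partition
after_failure = assign(partitions, ["c1"])           # c1 inherits c2's partitions
```

  Recomputing the assignment whenever an instance joins or leaves keeps every partition owned by exactly one live instance in the group.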

  Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

  At a high level, Kafka gives the following guarantees:
  Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  A consumer instance sees records in the order they are stored in the log.
  For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log. 

More details on these guarantees are given in the design section of the documentation. 


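Friends, your recommendations, follows, and comments are what keep me writing. Author: 刘洋 (Liu Yang). WeChat for collaboration: intsmaze. Copyright of this article is shared by the author and 博客园. Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a link to the original article must be shown in a prominent place on the page; otherwise the author reserves the right to pursue legal liability.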

