Before we delve into the topic at hand, let's cover the basics of Kafka, its terminology, and its architecture.
Kafka Architecture
Overall Recommended Layout
Terminology
Zookeeper
ZooKeeper is a centralized service for managing distributed processes, and it is a mandatory component in every Apache Kafka cluster. To provide high availability, you will need at least three ZooKeeper nodes (allowing for one node failure) or five nodes (allowing for two node failures). All ZooKeeper nodes are equivalent, so they usually run on identical hardware. The number of ZooKeeper nodes MUST be odd, since an even count adds cost without adding fault tolerance.
Broker / Confluent Server
Apache Kafka uses messaging semantics, and Kafka brokers are Kafka's main storage and messaging components (in Confluent Platform, a Kafka broker is referred to as Confluent Server). The Kafka cluster maintains streams of messages, called topics; topics are sharded into partitions (ordered, immutable logs of messages), and partitions are replicated and distributed for high availability. The servers that run the Kafka cluster are called brokers. You will usually want at least three Kafka brokers in a cluster, each running on a separate server. This lets you replicate each Kafka partition at least three times and gives you a cluster that can survive the failure of two nodes without data loss.
Confluent REST Proxy
The Confluent REST Proxy is an HTTP server that provides a RESTful interface to a Kafka cluster. You can use it to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients. The REST Proxy is not a mandatory component of the platform. Deploy the REST Proxy if you wish to produce and consume messages to or from Kafka using a RESTful HTTP protocol. If your applications only use native Kafka clients, you can choose not to deploy the REST Proxy.
For additional throughput and high availability, it is recommended that you deploy multiple REST Proxy servers behind a sticky load balancer. Confluent Server contains its own embedded REST Proxy, which can be used for administrative operations but not for producing or consuming messages.
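As an illustration, producing messages through the REST Proxy v2 API amounts to a single HTTP POST with a JSON-wrapped record list. The proxy URL and topic name below are assumptions for the sketch, not values from this deployment:

```python
import json

def build_produce_request(topic, messages, proxy_url="http://localhost:8082"):
    """Build the URL, headers, and body for a REST Proxy v2 produce call.

    The v2 API wraps messages in a "records" array and uses an
    embedded-format content type (json, binary, or avro).
    """
    url = f"{proxy_url}/topics/{topic}"
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    body = json.dumps({"records": [{"value": m} for m in messages]})
    return url, headers, body

url, headers, body = build_produce_request("orders", [{"id": 1}, {"id": 2}])
# The request itself would be sent with any HTTP client, e.g.:
#   requests.post(url, headers=headers, data=body)
```

This is why the REST Proxy is attractive for mobile and mainframe clients: any platform that can issue an HTTPS POST can produce to Kafka.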
Confluent Replicator
Confluent Replicator is a component added to Confluent Platform to help manage multi-cluster deployments of Confluent Platform and Apache Kafka. Replicator provides centralized configuration of cross-cluster replication. Unlike Apache Kafka’s MirrorMaker, it replicates topic configuration in addition to topic messages.
Confluent Replicator is integrated with the Kafka Connect framework and should be installed on the Connect nodes in the destination cluster. If there are multiple Connect worker nodes, install Replicator on all of them. When installed on a larger number of nodes, Replicator can scale to replicate at higher throughput and will be highly available through its built-in failover mechanism.
Confluent Self-Balancing Clusters
Confluent Self-Balancing Clusters is a component of Confluent Server that optimizes resource utilization and helps to scale Kafka clusters. Self-Balancing Clusters evaluates information about the number of brokers, partitions, leaders, and partition sizes, decides on a balanced placement of partitions on brokers, and modifies the replicas assigned to each broker to achieve a balanced placement. For example, when a new broker is added to the cluster, Self-Balancing Clusters will move partitions to the new broker to balance the load between all brokers available in the cluster. To avoid impact on production workloads, Self-Balancing Clusters throttles the re-balancing traffic to a fraction of the available network capacity.
You can enable Self-Balancing Clusters via a configuration option in Confluent Server. When enabled, it runs continuously in the background.
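As a sketch, enabling Self-Balancing Clusters is a single setting in the Confluent Server broker configuration; the throttle value shown is illustrative, not a recommendation:

```properties
# server.properties (Confluent Server)
confluent.balancer.enable=true
# Cap rebalancing traffic (bytes/sec) to protect production workloads
confluent.balancer.throttle.bytes.per.second=10485760
```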
Confluent Control Center
Confluent Control Center is Confluent’s web-based tool for managing and monitoring Apache Kafka. It is part of Confluent Platform and provides three key types of functionality for building and monitoring production data pipelines and streaming applications:
Data stream monitoring and alerting
You can use Control Center to monitor your data streams end to end, from producer to consumer. Use it to verify that every message sent is received (and received only once) and to measure end-to-end system performance. Drill down to better understand cluster usage and identify any problems, and configure alerts to notify you when end-to-end performance does not match your SLAs.
Multi-cluster monitoring and management
A single Control Center node can monitor data flows in multiple clusters and can manage data replication between the clusters.
Kafka Connect configuration
You can also use Control Center to manage and monitor Kafka Connect, the open source toolkit for connecting external systems to Kafka. You can easily add new sources to load data from external data systems, add new sinks to write data into external data systems, and manage, monitor, and configure connectors.
Kafka Latency View
Kafka Implementation Options
Summary of Comparison
| Features | Kafka Plain Vanilla | Confluent Kafka |
| --- | --- | --- |
| Apache Kafka | 2.7 | 2.5 |
| Client | Java | C, C++, C#, Java |
| File Connector | Available | Available |
| Other Connectors | Not available \* | Available |
| REST endpoint | Not available \* | Available |
| KSQL | Not available \* | Available |
| Schema Registry | Not available \* | Available |
| Replication | Not available | Available |
| Auto Load Balance | Not available | Available |
| Dashboard | Not available | Available |

\* Manual provisioning can be done using the community license.
For most small and medium businesses, Confluent Kafka seems to be the better option. For large enterprises with a dedicated DevOps team and data clusters, Apache Kafka seems to be the better choice.
Why is Confluent a better option for Small & Medium Businesses
REST endpoint
The Confluent environment has built-in support for a REST endpoint, which is critical for writing to and reading from Kafka on Android or iOS devices. It is also helpful on mainframes and other platforms without native Kafka clients.
Replication
Any production installation needs to be DR-1 compliant, and for that you need more than one cluster. Once you have multiple clusters, you need to replicate all written data across them. Confluent Replicator takes care of this automatically.
Control Center
Control Center is another crucial piece: it monitors the health of your Kafka installation, helps you recognize trends and patterns so you can plan proactively for future needs, and surfaces current problems.
Self balancer
Self-balancing becomes critical as more applications are onboarded onto Kafka, bringing more topics, more traffic, and varied message sizes.
Role-Based Access Control (RBAC)
Centrally manages access to critical resources across the entire platform, with fine-tuned authorization granularity
Structured Audit Logs
Trace user actions in a set of dedicated Kafka topics, to help operators detect abnormal behavior and security threats
Secret Protection
Avoids exposure of secrets by encrypting sensitive information (e.g. passwords) that is stored in configuration files
Encryption of data
Data at rest can be encrypted in both solutions using third-party encryption providers or home-grown encryption packages. If you have your own key management service, these encryption options can integrate with it. If the cloud option is chosen, data can be sent over a secure connection using the REST endpoint with HTTPS.
Data Privacy
Data privacy can be achieved in both solutions with third-party products such as Privitar, which de-identifies datasets to remove PII. Alternatively, an internally developed de-identification API can be integrated with Kafka as a connector.
Network Latency & Throughput
Reading or writing data to a local on-prem Kafka from an internal server will have lower latency than writing to a Kafka cloud endpoint. Since the use case is reading from and writing to Kafka from internal servers (devices on the go connect to the internal network), on-prem latency should not be a bottleneck. Overall throughput depends on many factors, such as latency, packet size, encryption, data schema, payload format, serialization/deserialization, and others. However, if issues such as lag or uneven partition distribution arise, having a Confluent support plan will help to get them resolved.
Project completion, rollout and enablement
The completion date, and hence the cost of implementation, will depend on the options chosen. Based on the reasons given above, a Confluent Kafka on-prem implementation might take the shortest time. Overall implementation cost might also be lower for Confluent Kafka given the ease of rollout, maintenance, upgrades, notifications, load balancing, and more, as described above.
Kafka Implementation Calculation
There are some open questions that need to be addressed before we move forward with installation, deployment, and rollout:

- Number of hosts that will be used for rollout
- Number of cores per host
- Network card type
- Storage volumes per host
- Storage capacity per volume
- Storage volume configuration
- Volume read/write throughput (per volume)
- Memory
- Linux flavor & version
Kafka OnPrem Hardware Recommendation
Ideal Configuration
| Component | VMs | Storage | Min Memory | Min Cores |
| --- | --- | --- | --- | --- |
| ZooKeeper | 5 | Transaction log: 512 GB SSD; storage: 2 × 1 TB SATA, RAID 10 | 32 GB | 2-4 |
| Broker | 5 | 12 × 1 TB disks | 64 GB | Dual 12-core |
| Connect | 2 | Any | Any | 2 |
| REST Proxy | 2 | Any | 64 MB × num producers + 16 MB × num consumers + 2 | 16 |
| Control Center | 1 | 300 GB min | 64 GB | 8 |
Basic Configuration
| Component | VMs | Storage | Min Memory | Min Cores |
| --- | --- | --- | --- | --- |
| ZooKeeper | 3 | 1 × 64 GB SSD | 4 GB | 1 |
| Broker | 3 | RAID 10 | 64 GB | 12-core |
| Connect | 2 | Any | Any | 2 |
| Control Center | 1 | 300 GB min | 4 GB | 2 |
Zookeeper
ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol, a Paxos-like algorithm, to reach consensus by a majority of nodes. Although Kafka is a very stable system, hardware failures and other outages are always possible, so it makes sense to plan for some VMs being unavailable. For a fault-tolerant system, an odd number of nodes is always preferable for consensus, because an odd ensemble and the next even ensemble tolerate the same number of failures. It is therefore more cost effective to run an odd number of nodes for ZooKeeper as well as the brokers.
For example, if there are 3 nodes the majority is 2, and even with 4 nodes the majority is 3. Either way, only one node can fail, so having 3 or 4 nodes gives us a fault tolerance of 1; the fourth node adds cost without adding tolerance.
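The majority arithmetic above can be sketched in a few lines:

```python
def zk_fault_tolerance(nodes: int) -> int:
    """Number of nodes that can fail while a majority quorum survives."""
    majority = nodes // 2 + 1
    return nodes - majority

# An even ensemble tolerates no more failures than the odd one below it.
assert zk_fault_tolerance(3) == 1
assert zk_fault_tolerance(4) == 1   # 4th node adds cost, not tolerance
assert zk_fault_tolerance(5) == 2
```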
Broker
Kafka relies heavily on the filesystem for storing and caching messages. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk, so multiple dedicated drives are needed to maximize throughput. Drives should not be shared with application logs or other OS filesystem activity, to ensure good latency. You can either combine these drives into a single volume (RAID) or format and mount each drive as its own directory. RAID 10 should be used if the additional cost is acceptable; otherwise, the Kafka server can be configured with multiple log directories, each mounted on a separate drive.
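For the non-RAID option, a minimal sketch of the broker setting looks like this (the mount paths are illustrative):

```properties
# server.properties: one log directory per dedicated drive
log.dirs=/mnt/kafka-disk1,/mnt/kafka-disk2,/mnt/kafka-disk3
```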
A fast and reliable network is an essential performance component in a distributed system. Low latency ensures that nodes can communicate easily, while high bandwidth helps partition movement and recovery.
Control Center
Confluent Control Center is good for out-of-the-box Kafka cluster monitoring, making it easy to manage the entire Confluent Platform. Control Center is a web-based application that allows you to manage your cluster, to monitor Kafka system health in predefined dashboards and to alert on triggers. Additionally, Control Center reports end-to-end stream monitoring to assure that every message is delivered from producer to consumer, measures how long messages take to be delivered, and determines the source of any issues in your cluster.
3 steps to a healthy deployment of Kafka
Pick an optimization goal
Throughput
To optimize for throughput, the producers, brokers, and consumers need to move as much data as possible within a given amount of time, so aim to maximize the rate at which data flows through the system.
Latency
Keeping latency to a minimum does not necessarily mean higher throughput; batching, for example, trades a little latency for a lot of throughput.
Durability
Durability is all about reducing the chance for a message to get lost. The most important feature that enables durability is replication, which ensures that messages are copied to multiple brokers. If a broker has a failure, the data is available from at least one other broker.
Availability
To optimize for high availability, Kafka should be tuned to recover as quickly as possible from failure scenarios. Higher partition counts may increase parallelism, but more partitions can also increase recovery time in the event of a broker failure.
Configure cluster to achieve the goal
Throughput Optimization
Producer:
| Parameter | Value |
| --- | --- |
| batch.size | 100000 – 200000 (default 16384) |
| linger.ms | 10 – 100 |
| compression.type | lz4 |
| acks | 1 |
| buffer.memory | default 33554432 |
Consumer:

| Parameter | Value |
| --- | --- |
| fetch.min.bytes | ~100000 |
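Pulled together, the throughput-oriented client settings above would look like this in properties form (the specific values are tuning starting points, not prescriptions):

```properties
# producer.properties
batch.size=150000
linger.ms=50
compression.type=lz4
acks=1
buffer.memory=33554432

# consumer.properties
fetch.min.bytes=100000
```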
Latency Optimization
Producer:

| Parameter | Value |
| --- | --- |
| linger.ms | 0 |
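In properties form, the latency-oriented producer setting is simply:

```properties
# producer.properties: send immediately instead of waiting to fill a batch
linger.ms=0
```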
Durability Optimization
Producer:

| Parameter | Value |
| --- | --- |
| replication.factor (topic config) | 3 |
| acks | all |
| enable.idempotence | true |
| max.in.flight.requests.per.connection | 1 |
Consumer:

| Parameter | Value |
| --- | --- |
| enable.auto.commit | false |
| isolation.level | read_committed (when using EOS transactions) |
Broker:

| Parameter | Value |
| --- | --- |
| default.replication.factor | 3 |
| auto.create.topics.enable | false |
| min.insync.replicas | 2 |
| unclean.leader.election.enable | false |
| broker.rack | rack of the broker |
| log.flush.interval.messages and log.flush.interval.ms | low values (flush frequently) |
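In properties form, the durability settings above would look like this; the flush intervals and rack name are illustrative, and the right values depend on your hardware and topology:

```properties
# producer.properties
acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=1

# consumer.properties
enable.auto.commit=false
isolation.level=read_committed

# server.properties (broker)
default.replication.factor=3
auto.create.topics.enable=false
min.insync.replicas=2
unclean.leader.election.enable=false
broker.rack=rack1
# flush frequently for durability (illustrative low values)
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```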
Availability Optimization
Consumer:

| Parameter | Value |
| --- | --- |
| session.timeout.ms | as low as possible (default 10000) |
Broker:

| Parameter | Value |
| --- | --- |
| unclean.leader.election.enable | true |
| num.recovery.threads.per.data.dir | one per directory in log.dirs |
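In properties form, the availability settings above would look like this; the timeout value and thread count are illustrative (a broker with three log directories is assumed):

```properties
# consumer.properties: detect failed consumers quickly
session.timeout.ms=6000

# server.properties (broker)
# prefer availability over consistency on leader failure
unclean.leader.election.enable=true
# one recovery thread per directory in log.dirs
num.recovery.threads.per.data.dir=3
```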
Benchmark Monitor and Tune
Use Confluent Control Center on-prem or the performance dashboard in the cloud. A few important metrics to measure are throughput, end-to-end latency, and consumer lag.
Kafka Options and Pricing
Based on the comparison given below, Confluent on the cloud seems to be the best option: it is inexpensive and comes with a solid uptime guarantee, 24×7 Confluent support, and great throughput and performance. On-prem plain vanilla Kafka will turn out to be far higher in both initial and maintenance cost, and Confluent Kafka on-prem is 70K more than on-prem plain vanilla Kafka.
OnPrem Confluent
- Maintenance cost of hardware
- Additionally, 70K / year with a 2-year commitment for the Starter Essentials Package, which includes:
● 9 prod nodes
● 9 pre-prod or DR nodes
● Unlimited Dev nodes
● 1 CP Connector Pack (for both Prod & Pre-Prod)
● Support
● 24×7 (P1: 60 Min, P2: 4 Hrs)
● 3 cases / month limit
● Kafka streams support
Plain Vanilla Kafka
Local implementation and maintenance cost of hardware
Confluent On AWS Cloud
$200 / month, based on 200 KB/s of writes and reads and a basic Kafka cluster with 30 partitions, which is much more than needed for a low-to-medium usage case. There is no other hardware or system maintenance cost. This comes with 24×7 support from Confluent and a guarantee of 99.5% uptime, which is more than what is needed to start with. If more partitions are needed beyond the 30 included, the cost is $0.004 / partition.
Another option is a standard Confluent Cloud cluster. This costs $1,000 / month but allows 500 free partitions and offers a better uptime SLA of 99.95%.