
How to choose between Apache & Confluent Kafka

Before we delve into the topic of choosing between them, let's cover the basics of Kafka, its terminology, and its architecture.

Kafka Architecture

Overall Recommended Layout

Terminology

Zookeeper

ZooKeeper is a centralized service for managing distributed processes, and it is a mandatory component of every Apache Kafka cluster. To provide high availability, you will need at least three ZooKeeper nodes (tolerating a one-node failure) or five nodes (tolerating two-node failures). All ZooKeeper nodes are equivalent, so they usually run on identical machines. The number of ZooKeeper nodes must be odd.

Broker / Confluent Server

Apache Kafka uses messaging semantics, and Kafka brokers are Kafka's main storage and messaging components. The Kafka cluster maintains streams of messages, called topics; topics are sharded into partitions (ordered, immutable logs of messages), and partitions are replicated and distributed for high availability. The servers that run the Kafka cluster are called brokers; in Confluent Platform, a Kafka broker is referred to as Confluent Server. You will usually want at least three Kafka brokers in a cluster, each running on a separate server. This lets you replicate each Kafka partition at least three times, giving a cluster that survives the failure of two nodes without data loss.
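The replication guarantee described above can be sketched as a quick calculation (a hedged sketch; the function names are illustrative, not part of any Kafka API):

```python
def broker_fault_tolerance(replication_factor: int) -> int:
    """With each partition replicated `replication_factor` times,
    the cluster can lose this many brokers without losing data."""
    return replication_factor - 1

def min_brokers(replication_factor: int) -> int:
    """Each replica of a partition must live on a distinct broker,
    so the broker count must be at least the replication factor."""
    return replication_factor

# Three brokers with replication factor 3 survive two broker
# failures, matching the recommendation above.
assert broker_fault_tolerance(3) == 2
assert min_brokers(3) == 3
```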

Confluent REST Proxy

The Confluent REST Proxy is an HTTP server that provides a RESTful interface to a Kafka cluster. You can use it to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients. The REST Proxy is not a mandatory component of the platform: deploy it if you wish to produce and consume messages over a RESTful HTTP protocol. If your applications only use the native clients, you can choose not to deploy it.

For additional throughput and high availability, it is recommended that you deploy multiple REST Proxy servers behind a sticky load balancer. Confluent Server contains its own embedded REST Proxy, which can be used for administrative operations but not for producing or consuming messages.
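As a sketch of what a REST Proxy produce call looks like, the snippet below builds the URL, headers, and JSON body for the v2 produce endpoint without sending anything; the host/port and topic name are illustrative assumptions for your deployment:

```python
import json

# Hypothetical endpoint; adjust host/port for your deployment.
REST_PROXY_URL = "http://localhost:8082"

def build_produce_request(topic, values):
    """Build (url, headers, body) for a REST Proxy v2 produce call.
    Send it with any HTTP client, e.g.
    requests.post(url, headers=headers, data=body)."""
    url = f"{REST_PROXY_URL}/topics/{topic}"
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}
    body = json.dumps({"records": [{"value": v} for v in values]})
    return url, headers, body

url, headers, body = build_produce_request("orders", [{"id": 1}])
```

Each entry in `records` becomes one Kafka message; keys and partition hints can be added per record if needed.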

Confluent Replicator

Confluent Replicator is a component added to Confluent Platform to help manage multi-cluster deployments of Confluent Platform and Apache Kafka. Replicator provides centralized configuration of cross-cluster replication. Unlike Apache Kafka’s MirrorMaker, it replicates topic configuration in addition to topic messages.

Confluent Replicator is integrated with the Kafka Connect framework and should be installed on the Connect nodes in the destination cluster. If there are multiple Connect worker nodes, install Replicator on all of them. When installed on a larger number of nodes, Replicator can scale to replicate at higher throughput and will be highly available through its built-in failover mechanism.

Confluent Self-Balancing Clusters

Confluent Self-Balancing Clusters is a component of Confluent Server that optimizes resource utilization and helps to scale Kafka clusters. Self-Balancing Clusters evaluates information about the number of brokers, partitions, leaders, and partition sizes, decides on a balanced placement of partitions on brokers, and modifies the replicas assigned to each broker to achieve a balanced placement. For example, when a new broker is added to the cluster, Self-Balancing Clusters will move partitions to the new broker to balance the load between all brokers available in the cluster. To avoid impact on production workloads, Self-Balancing Clusters throttles the re-balancing traffic to a fraction of the available network capacity.

You can enable Self-Balancing Clusters via a configuration option in Confluent Server. When enabled, it runs continuously in the background.

Confluent Control Center

Confluent Control Center is Confluent’s web-based tool for managing and monitoring Apache Kafka. It is part of Confluent Platform and provides three key types of functionality for building and monitoring production data pipelines and streaming applications:

Data stream monitoring and alerting

You can use Control Center to monitor your data streams end to end, from producer to consumer. Use it to verify that every message sent is received (and received only once) and to measure end-to-end system performance. Drill down to better understand cluster usage and identify any problems, and configure alerts to notify you when end-to-end performance does not match your SLAs.

Multi-cluster monitoring and management

A single Control Center node can monitor data flows in multiple clusters and can manage data replication between the clusters.

Kafka Connect configuration

You can also use Control Center to manage and monitor Kafka Connect, the open source toolkit for connecting external systems to Kafka. You can easily add new sources to load data from external data systems and add new sinks to write data into external data systems. Additionally, you can use Confluent Control Center to manage, monitor, and configure connectors.

Kafka Latency View

Kafka Implementation Options

Summary of Comparison

Features | Kafka Plain Vanilla | Confluent Kafka
Apache Kafka | 2.7 | 2.5
Client | Java | C, C++, C#, Java
File Connector | Available | Available
Other Connectors | Not available * | Available
REST endpoint | Not available * | Available
KSQL | Not available * | Available
Schema Registry | Not available * | Available
Replication | Not available | Available
Auto Load Balance | Not available | Available
Dashboard | Not available | Available

* Manual provisioning can be done using the community license

For most small and medium businesses, Confluent Kafka seems to be the better option. For large enterprises with a dedicated DevOps team and data clusters, Apache Kafka seems to be the better choice.

Why is Confluent a better option for Small & Medium Businesses

REST endpoint

The Confluent environment has built-in support for a REST endpoint, which is critical for writing to and reading from Kafka on Android or iOS phones and tablets. It is also helpful on mainframes and other platforms.

Replication

For any production installation you need to be DR-1 compliant, and for that you need more than one cluster. Once you have multiple clusters, you need to be able to replicate all data written to each of them. Confluent Replicator takes care of this need automatically.

Control Center

This is another crucial piece for monitoring the health of your Kafka installation: it lets you recognize trends and patterns, be proactive about future needs, and keep an eye on current problems.

Self-Balancing Clusters

This becomes critical as more applications are onboarded onto Kafka, bringing more topics, more traffic, and varied data packet sizes.

Role-Based Access Control (RBAC)

Centrally manages access to critical resources across the entire platform, with fine-tuned authorization granularity

Structured Audit Logs

Trace user actions in a set of dedicated Kafka topics, to help operators detect abnormal behavior and security threats

Secret Protection

Avoids exposure of secrets by encrypting sensitive information (e.g. passwords) that is stored in configuration files

Encryption of data

Data at rest can be encrypted in both solutions using third-party encryption providers or home-grown encryption packages. If you have your own key management service, these encryption options can integrate with it. If the cloud is chosen, data can be sent over a secure connection using the REST endpoint with HTTPS.

Data Privacy

Data privacy can be achieved in both solutions via third-party products such as Privitar, which de-identifies datasets to protect PII. Alternatively, an internally developed de-identification API can be integrated with Kafka as a connector.

Network Latency & Throughput

Reading or writing data to a local on-prem Kafka from an internal server will have lower latency than writing to a Kafka cloud endpoint. Since the use case is reading from and writing to Kafka from internal servers, with devices on the go connecting through the internal network, on-prem latency should not be a bottleneck. Overall throughput will depend on many factors: latency, packet size, encryption, data schema, payload format, serialization/deserialization, and others. However, if issues such as consumer lag or uneven partition distribution arise, a Confluent support plan will help get them resolved.

Project completion, rollout and enablement

The completion date, and hence the cost of implementation, will depend on the options chosen. Based on the reasons given above, an on-prem Confluent Kafka implementation will likely take the shortest time. Overall implementation cost may also be lower for Confluent Kafka given the ease of rollout, maintenance, upgrades, notifications, load balancing, and more, as described above.

Kafka Implementation Calculation

There are some open questions that need to be addressed before we move forward with installation, deployment, and rollout:

Number of hosts that will be used for rollout

Number of cores per host

Network Card Type

Storage Volumes Per Host

Storage Capacity Per Volume

Storage Volume Configuration

Volume Read / Write Throughput (Per Volume)

Memory 

Linux Flavor & Version

Kafka OnPrem Hardware Recommendation

Ideal Configuration

Component | VM | Storage | Min Memory | Min Cores
Zookeeper | 5 | Transaction log: 512 GB SSD; Storage: 2 x 1 TB SATA, RAID 10 | 32 GB | 2-4
Broker | 5 | 12 x 1 TB disk | 64 GB | Dual 12-core
Connect | 2 | Any | Any | 2
REST Proxy | 2 | Any | 64 MB x Num Producers + 16 MB x Num Consumers + 2 | 16
Control Center | 1 | 300 GB min | 64 GB | 8
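The REST Proxy memory formula in the table above can be turned into a small sizing helper. This is a sketch: the table's trailing constant is read here as a fixed overhead in GB, which is an assumption, and the function name is illustrative.

```python
def rest_proxy_memory_mb(num_producers: int, num_consumers: int,
                         overhead_gb: int = 2) -> int:
    """Memory sizing per the table above: 64 MB per producer plus
    16 MB per consumer, plus a fixed overhead (assumed here to be
    the table's trailing constant, read as GB)."""
    return 64 * num_producers + 16 * num_consumers + overhead_gb * 1024

# e.g. 10 producers and 20 consumers
assert rest_proxy_memory_mb(10, 20) == 64 * 10 + 16 * 20 + 2048
```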

Basic Configuration

Component | VM | Storage | Min Memory | Min Cores
Zookeeper | 3 | 1 SSD, 64 GB | 4 GB | 1
Broker | 3 | RAID 10 | 64 GB | 12-core
Connect | 2 | Any | Any | 2
Control Center | 1 | 300 GB min | 4 GB | 2

Zookeeper

ZooKeeper uses the Zab consensus protocol (a Paxos-like algorithm) to reach consensus among a majority of nodes. Although Kafka is a very stable system, hardware failures and other outages are always possible, so it makes sense to plan for some VMs being unavailable. For a fault-tolerant system, an odd number of nodes is always more cost effective for consensus, since an odd number of nodes and the next even number tolerate the same number of failures. So it's more cost effective to have an odd number of nodes for ZooKeeper as well as the broker.

For example, with 3 nodes the majority is 2, so one failure can be tolerated; with 4 nodes the majority is 3, so still only one failure can be tolerated. Having 3 or 4 nodes therefore gives the same fault tolerance of one node.
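The majority arithmetic above can be sketched as a quick calculation (function names are illustrative):

```python
def majority(n: int) -> int:
    """Smallest majority of an n-node ZooKeeper ensemble."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains."""
    return n - majority(n)

# 3 and 4 nodes both tolerate a single failure, which is why
# the extra (even) fourth node buys no additional tolerance.
assert tolerated_failures(3) == 1
assert tolerated_failures(4) == 1
assert tolerated_failures(5) == 2
```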

Broker

Kafka relies heavily on the filesystem for storing and caching messages. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk, so multiple dedicated drives are needed to maximize throughput. Drives should not be shared with application logs or other OS filesystem activity, to ensure good latency. You can either combine these drives into a single volume (RAID) or format and mount each drive as its own directory. RAID 10 should be used if the additional cost is acceptable; otherwise, the Kafka server can be configured with multiple log directories, each mounted on a separate drive.
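The two layouts above correspond to the broker's `log.dirs` setting in `server.properties`. The mount paths below are illustrative, not prescribed:

```properties
# Option A: one RAID 10 volume mounted at a single log directory
log.dirs=/var/kafka-logs

# Option B: JBOD - one log directory per dedicated drive
log.dirs=/mnt/disk1/kafka,/mnt/disk2/kafka,/mnt/disk3/kafka
```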

A fast and reliable network is an essential performance component in a distributed system. Low latency ensures that nodes can communicate easily, while high bandwidth helps partition movement and replica recovery.

Control Center

Confluent Control Center is good for out-of-the-box Kafka cluster monitoring, making it easy to manage the entire Confluent Platform. Control Center is a web-based application that allows you to manage your cluster, to monitor Kafka system health in predefined dashboards and to alert on triggers. Additionally, Control Center reports end-to-end stream monitoring to assure that every message is delivered from producer to consumer, measures how long messages take to be delivered, and determines the source of any issues in your cluster.

3 steps to a healthy deployment of Kafka

Pick an optimization goal

Throughput

To optimize for throughput, the producers, brokers, and consumers need to move as much data as they can within a given amount of time, so try to maximize the rate at which this data moves.

Latency

Keeping latency to a minimum does not necessarily mean higher throughput; the two often trade off, since batching improves throughput at the cost of per-message latency.

Durability

Durability is all about reducing the chance for a message to get lost. The most important feature that enables durability is replication, which ensures that messages are copied to multiple brokers. If a broker has a failure, the data is available from at least one other broker.

Availability

To optimize for high availability, Kafka should be tuned to recover as quickly as possible from failure scenarios. Higher partition counts may increase parallelism, but more partitions can also increase recovery time in the event of a broker failure.

Configure cluster to achieve the goal

Throughput Optimization

Producer:
- batch.size: 100000-200000 (default 16384)
- linger.ms: 10-100
- compression.type: lz4
- acks: 1
- buffer.memory: default 33554432

Consumer:
- fetch.min.bytes: ~100000
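Expressed as client configuration, the throughput settings above look like the dicts below (a sketch using the Kafka Java-client property names; the concrete values chosen within the recommended ranges are assumptions):

```python
# Throughput-oriented settings from the list above.
throughput_producer_config = {
    "batch.size": 150000,        # default 16384; larger batches, fewer requests
    "linger.ms": 50,             # wait up to 50 ms to fill a batch
    "compression.type": "lz4",   # cheap on CPU, shrinks network payload
    "acks": "1",                 # leader-only ack: faster, weaker durability
    "buffer.memory": 33554432,   # default; raise if producers block on send
}

throughput_consumer_config = {
    "fetch.min.bytes": 100000,   # broker waits for ~100 KB before responding
}
```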

Latency Optimization

Producer:
- linger.ms: 0

Durability Optimization

Producer:
- replication.factor: 3
- acks: all
- enable.idempotence: true
- max.in.flight.requests.per.connection: 1

Consumer:
- enable.auto.commit: false
- isolation.level: read_committed (when using EOS transactions)

Broker:
- default.replication.factor: 3
- auto.create.topics.enable: false
- min.insync.replicas: 2
- unclean.leader.election.enable: false
- broker.rack: rack of the broker
- log.flush.interval.messages and log.flush.interval.ms: low value
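The interaction between `acks=all` and `min.insync.replicas` above is worth making concrete. A sketch of the broker's accept/reject decision (the function is illustrative, not a Kafka API):

```python
def write_accepted(in_sync_replicas: int,
                   min_insync_replicas: int = 2,
                   acks: str = "all") -> bool:
    """With acks=all, the broker rejects writes (NotEnoughReplicas)
    unless at least min.insync.replicas replicas are in sync."""
    if acks != "all":
        return True  # acks=0/1 do not enforce min.insync.replicas
    return in_sync_replicas >= min_insync_replicas

# replication.factor=3 with min.insync.replicas=2: one broker can
# be down and writes still succeed; two down and writes are rejected.
assert write_accepted(2) is True
assert write_accepted(1) is False
```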

Availability Optimization

Consumer:
- session.timeout.ms: as low as possible (default 10000)

Broker:
- unclean.leader.election.enable: true
- num.recovery.threads.per.data.dir: same as the number of directories in log.dirs

Benchmark Monitor and Tune

Use Confluent Control Center on-prem, or the performance dashboard in the cloud, to track a few important metrics.

Kafka Options and Pricing

Based on the comparison given below, Confluent on the cloud seems to be the best option: it is inexpensive and comes with a solid uptime guarantee and 24×7 Confluent support, giving great throughput and performance at low cost. On-prem plain vanilla Kafka will turn out to be much higher in initial cost as well as maintenance. On-prem Confluent Kafka is $70K more than on-prem plain vanilla Kafka.

OnPrem Confluent

  1. Maintenance cost of hardware
  2. An additional $70K/year with a 2-year commitment for the Starter Essentials Package, which includes:

● 9 prod nodes

● 9 pre-prod or DR nodes

● Unlimited Dev nodes

● 1 CP Connector Pack (for both Prod & Pre-Prod)

● Support

  ○ 24×7 (P1: 60 min, P2: 4 hrs)

  ○ 3 cases / month limit

  ○ Kafka Streams support

Plain Vanilla Kafka

Local implementation and maintenance cost of hardware

Confluent On AWS Cloud 

$200/month, based on 200 KB/s of writes and reads and a Kafka cluster with 30 partitions for a basic cluster, which is much more than needed for a low-to-medium usage case. There is no other hardware or system maintenance cost. This comes with 24×7 support from Confluent and a 99.5% uptime guarantee, which appears to be more than what is needed to start with. If more partitions are needed than the 30 included, the cost is $0.004 per partition.
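The pricing arithmetic above can be sketched as a small estimator. Note that the text does not state the billing unit for the extra-partition rate (in Confluent Cloud it is typically per partition-hour), so this sketch treats it as a flat monthly add-on, which is an assumption:

```python
def basic_cluster_monthly_cost(partitions: int,
                               base_cost: float = 200.0,
                               included_partitions: int = 30,
                               extra_partition_rate: float = 0.004) -> float:
    """Monthly estimate per the figures above: $200 base with 30
    partitions included, plus $0.004 per extra partition (billing
    unit assumed to be monthly; verify against your contract)."""
    extra = max(0, partitions - included_partitions)
    return base_cost + extra * extra_partition_rate

assert basic_cluster_monthly_cost(30) == 200.0
assert basic_cluster_monthly_cost(130) == 200.0 + 100 * 0.004
```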

Another option is a standard Confluent Cloud cluster. This costs $1,000/month but allows 500 free partitions and a better uptime guarantee of 99.95%.
