Kafka Stream Processing

So far, we have learned about topics, partitions, sending data to Kafka, and consuming data from Kafka. These producer and consumer operations work at a relatively low level of abstraction. For processing data, a higher level of abstraction is useful, which is what Kafka Streams provides.

Kafka Streams

In general, a stream is a flow of data elements that are made available over time. In Apache Kafka, a stream is a continuous, real-time flow of facts or records, where each record is a key-value pair. Kafka Streams is a lightweight client library, shipped with Apache Kafka, that is used for building applications and microservices. Both the input and the output data of these applications are stored in Kafka clusters. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.

Why Kafka Streams?

The following properties explain why Kafka Streams is used:

  1. Kafka Streams is highly scalable and elastic.
  2. It can be deployed to containers, the cloud, bare metal, etc.
  3. It suits use cases of any size, i.e., small, medium, or large.
  4. It is fault-tolerant. If a failure occurs, Kafka Streams can handle it.
  5. It allows writing standard Java and Scala applications.
  6. It does not require a separate processing cluster for streaming.
  7. It is supported on Mac, Linux, and Windows operating systems.
  8. It has no external dependencies other than Kafka itself.
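Because Kafka Streams is only a library inside an ordinary application, getting started is mostly a matter of configuration. The sketch below builds such a configuration with plain java.util.Properties. The keys "application.id" and "bootstrap.servers" are standard Kafka Streams settings, while the class name and both values are placeholders assumed for illustration.

```java
import java.util.Properties;

// Minimal configuration sketch for a Kafka Streams application.
// The class name and both values are hypothetical placeholders.
class StreamsConfigSketch {
    static Properties buildConfig() {
        Properties props = new Properties();
        // Identifies the application; instances that share this id share the work.
        props.put("application.id", "demo-streams-app");
        // Kafka brokers to connect to; no separate processing cluster is configured.
        props.put("bootstrap.servers", "localhost:9092");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        System.out.println(props.getProperty("application.id"));
        System.out.println(props.getProperty("bootstrap.servers"));
    }
}
```

In a real application, a Properties object like this would be passed to the Kafka Streams client together with the topology.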

Stream Processing

Similar to data-flow programming, stream processing allows some applications to exploit a limited form of parallel processing more easily. In this way, it simplifies the parallel execution of applications. Businesses implement their core functions using software known as stream processing software or applications.

Stream Processing Topology

The stream is the most important abstraction that Kafka Streams provides. A stream is a replayable, ordered, and fault-tolerant sequence of immutable records.
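The properties "ordered", "replayable", and "immutable" can be pictured with a small in-memory model. The sketch below is plain Java and not the Kafka Streams API: the classes StreamRecord, RecordLog, and ReplayDemo are hypothetical names, and fault tolerance is not modeled at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Conceptual model only: plain Java, not the Kafka Streams API.
// An immutable key-value record: fields are final and never change.
final class StreamRecord {
    final String key;
    final String value;

    StreamRecord(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

// An append-only log: records keep the order in which they were appended.
class RecordLog {
    private final List<StreamRecord> records = new ArrayList<>();

    // Appends a record and returns its offset (its position in the log).
    int append(String key, String value) {
        records.add(new StreamRecord(key, value));
        return records.size() - 1;
    }

    // Replays the records in order, starting at the given offset.
    List<StreamRecord> replayFrom(int offset) {
        return Collections.unmodifiableList(
                new ArrayList<>(records.subList(offset, records.size())));
    }
}

class ReplayDemo {
    public static void main(String[] args) {
        RecordLog log = new RecordLog();
        log.append("user-1", "login");
        log.append("user-2", "login");
        log.append("user-1", "logout");
        // Reading from offset 0 again always yields the same sequence.
        for (StreamRecord r : log.replayFrom(0)) {
            System.out.println(r.key + " -> " + r.value);
        }
    }
}
```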

A stream processing application is a program that uses the Kafka Streams library. It defines its computational logic through one or more processor topologies. A processor topology is a graph in which 'stream processors' are the nodes and 'streams' are the edges that connect them.

A stream processor represents a processing step that transforms the data in streams. It receives one input record at a time from its upstream processors in the topology, applies its operation to that record, and finally produces one or more output records for its downstream processors.
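The behavior described above, one record in and one or more records out, can be sketched in plain Java. This is a conceptual model, not the Kafka Streams API: the RecordProcessor interface and the SplitWordsProcessor class are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Conceptual model only: plain Java, not the Kafka Streams API.
// A processor receives one key-value record and returns its output records.
interface RecordProcessor {
    List<Map.Entry<String, String>> process(Map.Entry<String, String> input);
}

// Example step: splits the value into words and emits one record per word.
class SplitWordsProcessor implements RecordProcessor {
    @Override
    public List<Map.Entry<String, String>> process(Map.Entry<String, String> input) {
        List<Map.Entry<String, String>> output = new ArrayList<>();
        for (String word : input.getValue().split(" ")) {
            // Each output record keeps the key of the input record.
            output.add(new SimpleImmutableEntry<>(input.getKey(), word));
        }
        return output;
    }
}

class ProcessorDemo {
    public static void main(String[] args) {
        RecordProcessor processor = new SplitWordsProcessor();
        Map.Entry<String, String> input =
                new SimpleImmutableEntry<>("line-1", "hello kafka streams");
        for (Map.Entry<String, String> out : processor.process(input)) {
            System.out.println(out.getKey() + " -> " + out.getValue());
        }
    }
}
```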


The following two special types of processors are present in a topology:

  1. Source Processor: A stream processor that has no upstream processors. It consumes data from one or more topics and produces an input stream for its topology.
  2. Sink Processor: A stream processor that has no downstream processors. It sends the records received from its upstream processors to a specified topic.
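The roles of the source and sink processors can be pictured with a tiny in-memory topology. The sketch below is a conceptual model in plain Java, not the Kafka Streams API: the MiniTopology class is a hypothetical name, and ordinary lists stand in for Kafka topics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Conceptual model only: in-memory lists stand in for Kafka topics.
class MiniTopology {
    private final Map<String, List<String>> topics = new HashMap<>();

    // Returns the named topic, creating it on first use.
    List<String> topic(String name) {
        return topics.computeIfAbsent(name, n -> new ArrayList<>());
    }

    // Runs source -> processor -> sink once over the current topic contents.
    void run(String sourceTopic, Function<String, String> processor, String sinkTopic) {
        // Source: no upstream processor, reads records from a topic.
        List<String> input = new ArrayList<>(topic(sourceTopic));
        for (String record : input) {
            // Intermediate processor: transforms one record at a time.
            String transformed = processor.apply(record);
            // Sink: no downstream processor, writes records to a topic.
            topic(sinkTopic).add(transformed);
        }
    }
}

class TopologyDemo {
    public static void main(String[] args) {
        MiniTopology topology = new MiniTopology();
        topology.topic("input-topic").add("first");
        topology.topic("input-topic").add("second");
        topology.run("input-topic", String::toUpperCase, "output-topic");
        System.out.println(topology.topic("output-topic"));
    }
}
```

The topic names here are placeholders. In Kafka Streams itself, the topics live in the Kafka cluster, and the library handles reading from and writing to them.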

In addition, Kafka Streams provides two ways to define the stream processing topology:

  1. Kafka Streams DSL: It is built on top of the Processor API. Here, DSL stands for 'Domain Specific Language'. It is the recommended choice for beginners.
  2. Processor API: Developers use this API to define arbitrary stream processors that process one received record at a time, and to connect these processors with their state stores to compose a processor topology. The composed topology represents customized processing logic.
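The relationship between the two styles can be sketched in plain Java: a fluent, high-level layer that simply adds low-level processors behind the scenes. This is a conceptual model and not the Kafka Streams API; LowLevelProcessor, ProcessorChain, and MiniDsl are hypothetical names, and a single shared map stands in for a state store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Low-level style: handles one record at a time and may use a state store.
interface LowLevelProcessor {
    // Returns the output value, or null to drop the record.
    String process(String value, Map<String, Integer> stateStore);
}

// A linear chain of processors sharing one state store (a simplification).
class ProcessorChain {
    final List<LowLevelProcessor> processors = new ArrayList<>();
    final Map<String, Integer> stateStore = new HashMap<>();

    List<String> run(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String value : input) {
            String current = value;
            for (LowLevelProcessor p : processors) {
                if (current == null) break; // record was dropped upstream
                current = p.process(current, stateStore);
            }
            if (current != null) output.add(current);
        }
        return output;
    }
}

// High-level style: fluent methods that add low-level processors for you.
class MiniDsl {
    final ProcessorChain chain = new ProcessorChain();

    MiniDsl filter(Predicate<String> keep) {
        chain.processors.add((v, store) -> keep.test(v) ? v : null);
        return this;
    }

    MiniDsl mapValues(Function<String, String> mapper) {
        chain.processors.add((v, store) -> mapper.apply(v));
        return this;
    }
}

class DslVersusProcessorDemo {
    public static void main(String[] args) {
        // DSL style: declare what should happen to the stream.
        MiniDsl dsl = new MiniDsl().filter(v -> !v.isEmpty()).mapValues(String::toUpperCase);
        System.out.println(dsl.chain.run(Arrays.asList("a", "", "b")));

        // Low-level style: a custom processor that counts values in the state store.
        ProcessorChain custom = new ProcessorChain();
        custom.processors.add((v, store) -> v + ":" + store.merge(v, 1, Integer::sum));
        System.out.println(custom.run(Arrays.asList("x", "y", "x")));
    }
}
```

The design point is that the high-level layer adds no new capability. It trades flexibility for brevity, which is why stateful, custom logic such as the counting step is written against the lower-level interface.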