Real-time event streaming at a global scale, with persistent storage and stream processing instead of ad-hoc queries like in a relational database.
It's a distributed log.
We can grab those events from anywhere and integrate them anywhere.
There are things that happen in the world, and it's our job to process those events.
This leads to: Single platform to connect everyone to every event. Not just a file system to store things.
Real-time stream of events.
All events stored for historical view.
Event driven architecture. The paradigm shift is from static snapshots of data to a continuous stream of events.
E.g., an occasional call from a friend > a constant feed about the activities of all your friends.
Daily news report > Real time news feeds, accessible online anytime, anywhere.
Kafka is not made for ad-hoc queries, it is made for quick inserts and reads.
ZooKeeper handles cluster management, failure detection and recovery, and stores ACLs and secrets.
Kafka is composed of brokers. A broker listens for connections (port 9092 by default). Producers produce content, and consumers consume content from the broker.
The connection from the producer to the broker is bi-directional and runs over TCP.
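A minimal sketch of that connection with kafka-python (the broker address and topic name are assumptions; the client opens and manages the TCP connection for you):

from kafka import KafkaProducer

# Open a TCP connection to the broker listening on port 9092 and send one message.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("first_kafka_topic", b"hello broker")
producer.flush()   # block until the message has actually been handed to the broker
producer.close()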
Topics: streams of related messages/events. A logical representation that categorises messages into groups. Developers define topics, and there can be an unlimited number of them. Producer <> topic is an N-to-N relation.
A topic is basically a log (technically, a partition is a log). A topic can have partitions, and within a partition you have multiple segments (like a rolling file). These partitions can be allocated to different brokers, which enables us to scale out read/write performance.
Partition location is handled automatically by Kafka.
A segment exists on physical disk as a file (a bunch of them actually).
Topics are the logical groupings you write content to. A producer has to specify which topic it wants to write to, and consumers specify which topic they want to consume. You can only append to the end of the log.
When you consume, it reads sequentially from offset zero.
When a table grows too large, we shard it; shards are called partitions in Kafka.
So, at the lowest level, producers and consumers deal with a topic, a partition, and a position (offset).
Every event is a key/value pair. Every message has a timestamp, plus optional headers (metadata).
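A rough kafka-python sketch tying these ideas together (bootstrap address and topic name are assumptions): produce a keyed message with a header, then pick a topic + partition and read it sequentially from offset zero.

import json
from kafka import KafkaProducer, KafkaConsumer, TopicPartition

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Every event is a key/value pair; headers are optional metadata, and the
# timestamp is filled in automatically if you don't pass timestamp_ms.
producer.send(
    "messages",
    key="user-1",
    value={"user_id": 1, "recipient_id": 2, "message": "Hi."},
    headers=[("source", b"demo")],
)
producer.flush()

# Consume: pick a topic + partition and read sequentially from offset 0.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092", consumer_timeout_ms=5000)
tp = TopicPartition("messages", 0)
consumer.assign([tp])
consumer.seek(tp, 0)  # start at position zero
for record in consumer:  # stops after 5s of silence because of consumer_timeout_ms
    print(record.partition, record.offset, record.key, record.value, record.timestamp)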
Brokers manage partitions; they handle storage and pub/sub.
Producers send messages to brokers. Brokers receive and store messages. Kafka cluster can have many brokers. Each broker manages multiple partitions.
Each partition is replicated a configurable number of times on other brokers. The replication factor is typically 3. One of these is called the leader, the others the followers.
When you produce, you do so to the leader.
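The replication factor is set per topic at creation time. A hedged sketch with kafka-python's admin client (topic name is made up; replication factor 3 requires at least 3 brokers, so on the single-broker setup later in these notes you would use 1):

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# 3 partitions, each replicated to 3 brokers; for every partition one replica
# is the leader (produce requests go there) and the other two are followers.
admin.create_topics([NewTopic(name="replicated_topic", num_partitions=3, replication_factor=3)])
admin.close()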
Queue: a message is published once, then consumed once. Good for job execution; jobs won't be executed twice.
Pub/sub: published once, but consumed many times (it isn't removed from the source).
How do we get both behaviours? Consumer groups. A consumer group removes awareness of the partition from the user. By default it acts like a queue: you create a consumer, add it to a group, and it consumes from the topic; it gets all partitions if it is the only consumer in the group, otherwise the partitions are split amongst the consumers in the group (up to the number of partitions). Content from a partition is only delivered to the consumer assigned to that partition. To act like pub/sub, put each consumer in a unique group.
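A sketch of both behaviours with kafka-python (group names and address are arbitrary): start this script twice with the same group id and the topic's partitions are split between the two processes (queue-like); start each copy with a unique group id and every copy receives every message (pub/sub).

import sys
from kafka import KafkaConsumer

# Pass the group id on the command line, e.g. `python consumer_group_demo.py workers`.
group = sys.argv[1] if len(sys.argv) > 1 else "workers"

consumer = KafkaConsumer(
    "messages",
    bootstrap_servers="localhost:9092",
    group_id=group,                 # same group id   -> queue semantics
    auto_offset_reset="earliest",   # new/unique group -> read everything, pub/sub style
)
for record in consumer:
    print(group, record.partition, record.offset, record.value)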
In a system with multiple brokers, there is the abstraction of a leader and followers. You write to the leader, not to a follower; followers replicate the leader's data so one of them can take over if the leader fails.
Topics are replicated to both machines, but the partition is only designated to be a leader on machine 1, and a follower on machine 2.
ZooKeeper herds the 'cats': it tracks which brokers are alive and which broker is the leader for each partition.
open terminal:
and run: docker compose -f docker-compose.yml up -d
shell into your kafka:
docker exec -it kafka /bin/sh
nav to folder:
cd opt/kafka/bin
in this folder is a list of Kafka shell scripts; to create a new topic run this command:
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic first_kafka_topic
to list topics:
kafka-topics.sh --list --zookeeper zookeeper:2181
Note: I tried to spin up my docker environment with the below commands but I could not resolve the hostname of my machine. (docker compose was my way around this)
docker run --name zookeeper -p 2181:2181 -d zookeeper
docker run -p 9092:9092 --name kafka -e KAFKA_ZOOKEEPER_CONNECT=localhost:2181 -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 -d confluentinc/cp-kafka
Install the Docker extension. Under Containers, right-click the container and click Attach Visual Studio Code. You can also Attach Shell in the same way.
Kafka Topics: create, list, describe, delete
Kafka Producers: send data to a topic
Kafka Consumers: read data from a topic
Create another topic in the shell:
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic dummy_topic
List the topics:
kafka-topics.sh --list --zookeeper zookeeper:2181
List info on topic:
kafka-topics.sh --describe --zookeeper zookeeper:2181 --topic dummy_topic
Delete a topic:
kafka-topics.sh --delete --zookeeper zookeeper:2181 --topic dummy_topic
Create a new message topic:
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic messages
kafka-console-producer.sh --broker-list kafka:9092 --topic messages
{'user_id': 1, 'recipient_id': 2, 'message': 'Hi.'}
{'user_id': 2, 'recipient_id': 1, 'message': 'Hello there.'}
Ctrl-c to escape.
Open consumer to read messages:
New terminal window > docker exec -it kafka /bin/sh
cd opt/kafka/bin
kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic messages
It will not show existing messages, but only new ones that arrive, so add another message on the producer and it will appear in the consumer console:
kafka-console-producer.sh --broker-list kafka:9092 --topic messages
{'user_id': 1, 'recipient_id': 2, 'message': 'Hi.'}
{'user_id': 2, 'recipient_id': 1, 'message': 'Hello there.'}
To list all messages:
kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic messages --from-beginning
Install kafka-python:
python3 -m pip install kafka-python
touch data_generator.py
touch producer.py
touch consumer.py
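A minimal sketch of what producer.py and consumer.py might contain, mirroring the console examples above (the bootstrap address is an assumption; adjust it to match your advertised listeners). The message-building part could be pulled out into data_generator.py.

# producer.py
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
while True:
    # In a fuller example this dict would come from data_generator.py.
    msg = {"user_id": random.randint(1, 2), "recipient_id": random.randint(1, 2), "message": "Hi."}
    producer.send("messages", msg)
    producer.flush()
    time.sleep(1)

# consumer.py
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "messages",
    bootstrap_servers="localhost:9092",
    group_id="python-consumers",
    auto_offset_reset="earliest",   # same idea as --from-beginning
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)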