Introduction
This document is the first in a series exploring Apache Flink and its capabilities. In this
installment, we will provide a step-by-step guide to installing PyFlink with Kafka, using a Conda
environment for easy reproducibility. By following this guide, you will set up PyFlink and Kafka in
a controlled environment, ensuring compatibility and ease of use. This foundation will be essential
for the next parts of the series, where we will dive deeper into building and deploying Flink
applications.
Assumptions
Before proceeding with the installation, we assume the following about the user and their system:
- Conda Knowledge: The user is familiar with Conda and has it installed on their system.
- Java Installation: Java 11 or above is installed and configured. If Java is not installed, follow this link for installation instructions. Please ensure that you install the Oracle JDK, as OpenJDK is not currently supported by PyFlink.
- Understanding of JAR Files: A basic understanding of JAR files and their use in Java environments is required.
- Operating System: The system is Unix-based (e.g., Linux, macOS), not Windows.
- Kafka Familiarity: A basic understanding of message brokers, particularly Kafka, is necessary.
Setting up Conda and Installing PyFlink
To create a Conda environment similar to the one used in our demo, follow these steps:
conda create -n flink_env python=3.11 -y
conda activate flink_env
pip install git+https://github.com/dpkp/kafka-python.git
pip install apache-flink
Note: We pip install kafka-python from the repository rather than from PyPI because of this bug.
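Before moving on, you can confirm that PyFlink works with a minimal sketch; the job name and sample values below are arbitrary:
from pyflink.datastream import StreamExecutionEnvironment

# Build a tiny local pipeline: double each number and print the results.
env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([1, 2, 3]).map(lambda x: x * 2).print()
env.execute("pyflink_sanity_check")
If the doubled values appear in your console, PyFlink is installed correctly.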
Congratulations, you have successfully set up your Conda environment with PyFlink!
Installing Kafka
To set up Kafka, follow these steps:
- Download Kafka: Visit the official Apache Kafka website and download version 3.2.0 with Scala 2.13. It’s crucial to use this specific version, as explained later in the documentation. You can download Kafka from this link.
- Unzip and Place Kafka: Once the download is complete, unzip the Kafka package and place it in a directory of your choice.
- Start Zookeeper and Kafka Server: To start Kafka, you'll need to run both the Zookeeper service and the Kafka server. Open two terminal windows and follow these steps:
- Navigate to Kafka Directory: In both terminal windows, navigate to your Kafka installation directory:
cd path/to/location/kafka_2.13-3.2.0/
- Start Zookeeper: In the first terminal, start Zookeeper by running the following command:
bin/zookeeper-server-start.sh config/zookeeper.properties
- Start Kafka Server: In the second terminal, start the Kafka server using the command:
bin/kafka-server-start.sh config/server.properties
Once both services are running, Kafka will be up and ready for use.
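With both services running, you can verify the broker from Python using the kafka-python client installed earlier. This is a minimal sketch; the topic name is arbitrary, and it assumes the broker is on its default address, localhost:9092:
from kafka import KafkaProducer, KafkaConsumer

# Produce one message; with default broker settings the topic is auto-created.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello, kafka")
producer.flush()

# Read the message back, giving up after five seconds of silence.
consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for record in consumer:
    print(record.value)
If b'hello, kafka' is printed, the broker is up and reachable.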
Installing Flink
Download Apache Flink: Download Apache Flink version 1.19.1 from the official website and move it to a location of your choice. You can download and unpack it with the following commands:
Note: We are using Flink v1.19.1 because the latest Flink release was not yet supported by PyFlink as of the writing of this article.
wget https://archive.apache.org/dist/flink/flink-1.19.1/flink-1.19.1-bin-scala_2.12.tgz
tar -xvzf flink-1.19.1-bin-scala_2.12.tgz
Download the Kafka SQL Connector JAR: Download the flink-sql-connector-kafka-3.2.0-1.19.jar from the Maven repository and move it to a folder of your choice.
This JAR file is essential for integrating PyFlink with Kafka and SQL. The naming convention of the JAR file provides important information:
- 3.2.0: This refers to the version of Kafka that is compatible with the connector.
- 1.19: This indicates the version of Flink that is compatible with the JAR.
Explanation: The reason for downloading these specific versions of Kafka (3.2.0) and Flink (1.19.1) is to ensure compatibility with the connector JAR, which allows Flink to interact with Kafka and handle SQL queries.
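As an illustration of how the connector JAR is used, the following sketch registers it with a PyFlink TableEnvironment and declares a Kafka-backed table; the JAR path, topic, and group id are placeholders for your own values:
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# Point PyFlink at the connector JAR downloaded above (adjust the path).
t_env.get_config().set(
    "pipeline.jars",
    "file:///path/to/flink-sql-connector-kafka-3.2.0-1.19.jar",
)
# Declare a table backed by a Kafka topic; any query on it uses the connector.
t_env.execute_sql("""
    CREATE TABLE demo_source (
        message STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'test-topic',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'demo-group',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'raw'
    )
""")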
Starting the cluster: Once you have downloaded Flink, open your terminal and enter the following commands:
cd /path/to/your/flink/folder
./bin/start-cluster.sh
Access the UI: Once the cluster is running, open a browser and navigate to the following URL to see Flink’s Web UI:
http://localhost:8081/
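The Web UI is backed by Flink's REST API on the same port, so you can also check the cluster programmatically; here is a quick sketch using only the Python standard library:
import json
from urllib.request import urlopen

# Query the cluster overview endpoint exposed by the Flink REST API.
with urlopen("http://localhost:8081/overview") as resp:
    overview = json.load(resp)
print("Flink version:", overview["flink-version"])
print("TaskManagers:", overview["taskmanagers"])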
Congratulations, you have successfully set up your environment to use Flink with Kafka!
Want a Cloud-Native, Fully-Managed Solution?
From maintaining Kafka brokers and Zookeeper nodes to ensuring compatibility with Flink, the process can become cumbersome and time-consuming.
To streamline your Kafka and Flink management, consider using the Confluent Platform. To get started with Confluent, you have two primary options:
For a comprehensive overview of the benefits, features, and integrations offered by Confluent, please refer to the Official Confluent Documentation.