Setting Up PyFlink with Kafka - Part 1

15 September, 2023

Introduction
This document is the first in a series exploring Apache Flink and its capabilities. In this installment, we will provide a step-by-step guide to installing PyFlink with Kafka, using a Conda environment for easy reproducibility. By following this guide, you will set up PyFlink and Kafka in a controlled environment, ensuring compatibility and ease of use. This foundation will be essential for the next parts of the series, where we will dive deeper into building and deploying Flink applications.

Assumptions

Before proceeding with the installation, we assume the following about the user and their system:

  • Conda Knowledge: You are familiar with Conda and have it installed on your system.
  • Java Installation: Java 11 or above is installed and configured (a quick way to check is shown after this list). If Java is not installed, follow this link for installation instructions. Please ensure that you install the Oracle JDK, as OpenJDK is not currently supported by PyFlink.
  • Understanding of JAR Files: A basic understanding of JAR files and their use in Java environments is required.
  • Operating System: The system is Unix-based (e.g., Linux, macOS), not Windows.
  • Kafka Familiarity: A basic understanding of message brokers, particularly Kafka, is necessary.
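As a quick sanity check for the Java assumption above, you can confirm your JDK setup from a terminal (the exact version string and path will vary with your installation):

java -version
echo $JAVA_HOME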

Setting up Conda and Installing PyFlink

To create a Conda environment similar to the one used in our demo, follow these steps:

conda create -n flink_env python=3.11 -y
conda activate flink_env
pip install git+https://github.com/dpkp/kafka-python.git
pip install apache-flink

Note: We pip install kafka-python from the GitHub repository rather than from PyPI because of this bug
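To confirm that both packages are importable in the new environment, a quick check like the following should run without errors:

python -c "import pyflink; import kafka; print('PyFlink and kafka-python imported successfully')"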

Congratulations, you have successfully set up your Conda environment with PyFlink!

Installing Kafka

To set up Kafka, follow these steps:

  • Download Kafka: Visit the official Apache Kafka website and download version 3.2.0 built for Scala 2.13. It’s crucial to use this specific version, as explained later in the documentation.
    You can download Kafka from this link (example download commands are shown after this list).
  • Unzip and Place Kafka: Once the download is complete, unzip the Kafka package and place it in a directory of your choice.
  • Start Zookeeper and Kafka Server: To start Kafka, you'll need to run both the Zookeeper service and the Kafka server. Open two terminal windows and follow these steps:
    • Navigate to Kafka Directory: In both terminal windows, navigate to your Kafka installation directory: cd path/to/location/kafka_2.13-3.2.0/
    • Start Zookeeper: In the first terminal, start Zookeeper by running the following command: bin/zookeeper-server-start.sh config/zookeeper.properties
    • Start Kafka Server: In the second terminal, start the Kafka server using the command: bin/kafka-server-start.sh config/server.properties
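For reference, the download and unpacking steps above can also be done from the terminal. The commands below are a sketch that assumes the Apache archive URL for Kafka 3.2.0 and extraction into your current directory:

# Assumed archive URL for Kafka 3.2.0 (Scala 2.13); adjust the target directory as needed
wget https://archive.apache.org/dist/kafka/3.2.0/kafka_2.13-3.2.0.tgz
tar -xvzf kafka_2.13-3.2.0.tgz
cd kafka_2.13-3.2.0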

Once both services are running, Kafka will be up and ready for use.
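As an optional smoke test, you can create and list a topic against the local broker from the Kafka directory; the topic name test-topic below is just a placeholder:

bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
bin/kafka-topics.sh --list --bootstrap-server localhost:9092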

Installing Flink

Download Apache Flink: Download Apache Flink version 1.19.1 from the official website and move it to a location of your choice.

Note: We are using Flink v1.19.1 because, as of the writing of this article, PyFlink does not support the latest Flink release.

wget https://archive.apache.org/dist/flink/flink-1.19.1/flink-1.19.1-bin-scala_2.12.tgz
tar -xvzf flink-1.19.1-bin-scala_2.12.tgz

Download the Kafka SQL Connector JAR: Download the flink-sql-connector-kafka-3.2.0-1.19.jar from the Maven repository and move it to a folder of your choice.
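If you prefer the command line, the connector JAR can also be fetched directly. The URL below assumes the standard Maven Central layout for the org.apache.flink:flink-sql-connector-kafka artifact:

# Assumed Maven Central path for the connector artifact
wget https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.2.0-1.19/flink-sql-connector-kafka-3.2.0-1.19.jar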

This JAR file is essential for integrating PyFlink with Kafka and SQL. The naming convention of the JAR file provides important information:

  • 3.2.0: This refers to the version of Kafka that is compatible with the connector.
  • 1.19: This indicates the version of Flink that is compatible with the JAR.

Explanation: The reason for downloading these specific versions of Kafka (3.2.0) and Flink (1.19.1) is to ensure compatibility with the connector JAR, which allows Flink to interact with Kafka and handle SQL queries.
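How you make the JAR visible to Flink is up to you. One common option, sketched below with placeholder paths, is to copy it into the lib/ directory of your Flink installation so the standalone cluster picks it up on its classpath; alternatively, keep it anywhere and point your PyFlink job at it later.

# Placeholder paths: adjust to wherever you saved the JAR and extracted Flink
cp flink-sql-connector-kafka-3.2.0-1.19.jar /path/to/your/flink/folder/lib/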

Starting the cluster: Once you have downloaded and extracted Flink, open your terminal and enter the following commands:

cd /path/to/your/flink/folder
./bin/start-cluster.sh

Access UI: Once the cluster is running, open a browser and enter the following URL to see Flink’s Web UI:

http://localhost:8081/
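To double-check that the standalone cluster is actually up, you can also query Flink's REST API or look for the cluster's Java processes; stop-cluster.sh shuts everything down again when you are done:

curl http://localhost:8081/overview   # REST endpoint served by the JobManager
jps                                   # should list the JobManager and TaskManager JVM processes
./bin/stop-cluster.sh                 # stop the cluster when finished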

Congratulations, you have successfully set up your environment to use Flink with Kafka!

Want a Cloud-Native, Fully-Managed Solution?

From maintaining Kafka brokers and Zookeeper nodes to ensuring compatibility with Flink, the process can become cumbersome and time-consuming.

To streamline your Kafka and Flink management, consider using the Confluent Platform. To get started with Confluent, you have two primary options:

  • Self-Managed Installation: Visit the Confluent Website and explore the various installation options, including Docker images and package installers. This will help you set up the Confluent Platform tailored to your on-premises deployment needs.
  • Confluent Cloud: For a cloud-based setup, head to the Confluent Cloud Documentation to learn about creating a Confluent Cloud account, setting up Kafka clusters, and configuring cloud providers (AWS, GCP, Azure).

For a comprehensive overview of the benefits, features, and integrations offered by Confluent, please refer to the Official Confluent Documentation.

Our next guide will go into building your first Flink application. Until then, we hope you find this helpful and enjoy building robust data pipelines!