Flink Operations Playground

Posted by mcxiaoracle on 2023-03-11

This playground consists of a long-lived Flink Cluster and a Kafka Cluster.

A Flink Cluster always consists of a JobManager and one or more TaskManagers. The JobManager is responsible for handling Job submissions, the supervision of Jobs, and resource management. The TaskManagers are the worker processes and are responsible for executing the actual Tasks that make up a Flink Job. In this playground you will start with a single TaskManager and scale out to more TaskManagers later. Additionally, this playground comes with a dedicated client container, which we use to submit the Flink Job initially and to perform various operational tasks later on. The client container is not needed by the Flink Cluster itself and is only included for ease of use.

The Kafka Cluster consists of a Zookeeper server and a Kafka Broker.


When the playground is started, a Flink Job called Click Event Count will be submitted to the JobManager. Additionally, two Kafka topics, input and output, are created.


The Job consumes ClickEvents from the input topic, each with a timestamp and a page. The events are then keyed by page and counted in 15-second windows. The results are written to the output topic.
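To make the windowing concrete, the following sketch shows how a single timestamp maps to a 15-second window. It assumes windows are aligned to the epoch with no offset, which is a common default for tumbling windows but is not stated here. The sample timestamp and all variable names are illustrative only.

```shell
# Window size from the text: 15 seconds, expressed in milliseconds.
window_size_ms=15000

# Illustrative event timestamp in epoch milliseconds (hypothetical value).
ts_ms=1678500007250

# Assumption: windows are aligned to the epoch with zero offset.
window_start_ms=$(( ts_ms - ts_ms % window_size_ms ))
window_end_ms=$(( window_start_ms + window_size_ms ))

echo "event at $ts_ms falls into [$window_start_ms, $window_end_ms)"
```

Every event whose timestamp lies in the same half-open interval and that carries the same page contributes to the same count.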


There are six different pages, and we generate 1000 click events per page every 15 seconds. Hence, the output of the Flink Job should show 1000 views per page and window.
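As a quick sanity check, the steady-state rates implied by these figures can be worked out with shell arithmetic. This is only a sketch: it assumes the Job emits exactly one output record per page and window, and the variable names are made up for the example.

```shell
pages=6
events_per_page_per_window=1000
window_seconds=15

# Input rate: total click events generated per second across all pages.
input_events_per_second=$(( pages * events_per_page_per_window / window_seconds ))

# Output rate: one record per page and window (assumption stated above).
windows_per_minute=$(( 60 / window_seconds ))
output_records_per_minute=$(( pages * windows_per_minute ))

echo "input:  $input_events_per_second events/second"
echo "output: $output_records_per_minute records/minute"
```

Under these assumptions the generator produces 400 events per second in total, and the Job writes 24 records per minute while it keeps up with the input.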


Starting the Playground 

The playground environment is set up in just a few steps. We will walk you through the necessary commands and show how to validate that everything is running correctly.


We assume that you have Docker (1.12+) and docker-compose (2.1+) installed on your machine.


The required configuration files are available in the playground's repository. First check out the code and build the Docker image.
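A minimal sketch of this step follows. It assumes the repository is apache/flink-playgrounds on GitHub and that this playground lives in its operations-playground directory. Neither is stated here, so verify both before relying on them. The function and variable names are hypothetical helpers, not part of the playground.

```shell
# Assumption: repository URL and subdirectory; adjust if they differ.
PLAYGROUND_REPO="https://github.com/apache/flink-playgrounds.git"
PLAYGROUND_DIR="flink-playgrounds/operations-playground"

# Clone the repository, change into the playground directory, and build
# the images defined in docker-compose.yaml. Stops at the first failure.
fetch_and_build_playground() {
  git clone "$PLAYGROUND_REPO" &&
    cd "$PLAYGROUND_DIR" &&
    docker-compose build
}
```

Calling fetch_and_build_playground leaves your shell inside the playground directory, which is where the docker-compose commands used in this walkthrough need to be run.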


Then before starting the playground, create the checkpoint and savepoint directories on the Docker host machine (these volumes are mounted by the jobmanager and taskmanager, as specified in docker-compose.yaml):
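For example, assuming docker-compose.yaml mounts /tmp/flink-checkpoints-directory and /tmp/flink-savepoints-directory (check the volumes entries of the jobmanager and taskmanager services and adjust the paths if yours differ), the directories could be created like this:

```shell
# Assumption: these are the host paths referenced in docker-compose.yaml.
# -p makes the commands safe to repeat if the directories already exist.
mkdir -p /tmp/flink-checkpoints-directory
mkdir -p /tmp/flink-savepoints-directory
```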


Then start the playground:

docker-compose up -d

Afterwards, you can inspect the running Docker containers with the following command:

docker-compose ps


This indicates that the client container has successfully submitted the Flink Job (Exit 0) and all cluster components as well as the data generator are running (Up).

You can stop the playground environment by calling:

docker-compose down -v



Entering the Playground 


There are many things you can try and check out in this playground. In the following two sections we will show you how to interact with the Flink Cluster and demonstrate some of Flink’s key features.



Flink WebUI 

The most natural starting point to observe your Flink Cluster is the WebUI exposed under http://localhost:8081. If everything went well, you’ll see that the cluster initially consists of one TaskManager and executes a Job called Click Event Count.


The Flink WebUI contains a lot of useful and interesting information about your Flink Cluster and its Jobs (JobGraph, Metrics, Checkpointing Statistics, TaskManager Status, and more).


Logs 

JobManager

The JobManager logs can be tailed via docker-compose:

docker-compose logs -f jobmanager


TaskManager

The TaskManager log can be tailed in the same way:

docker-compose logs -f taskmanager



Flink CLI 

The Flink CLI can be used from within the client container. For example, to print the help message of the Flink CLI you can run:

docker-compose run --no-deps client flink --help

Flink REST API 

The Flink REST API is exposed via localhost:8081 on the host or via jobmanager:8081 from the client container. For example, to list all currently running jobs, you can run:



curl localhost:8081/jobs
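The response is a JSON document with a jobs array, where each entry carries an id and a status. As a sketch of how the JobIDs could be extracted in a script, the example below works on an illustrative response body. The job ID in it is a made-up placeholder, and the variable names are assumptions for the example.

```shell
# Illustrative response body; the job ID is a made-up placeholder.
response='{"jobs":[{"id":"c2f5a7e1d3b94c6f8a0b1e2d3c4f5a6b","status":"RUNNING"}]}'

# Against the playground, you would capture the real response instead:
#   response=$(curl -s localhost:8081/jobs)

# Pull out every "id" value without requiring a JSON tool such as jq.
job_ids=$(printf '%s\n' "$response" | grep -o '"id":"[0-9a-f]*"' | cut -d '"' -f 4)

echo "$job_ids"
```

The pattern matching keeps the example free of extra dependencies. If a JSON processor is available on your machine, it is the more robust choice.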




Kafka Topics 


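You can look at the records that are written to the Kafka topics by running the console consumer against the input topic or the output topic:

docker-compose exec kafka kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 --topic input

docker-compose exec kafka kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 --topic output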




Time to Play! 

Now that you have learned how to interact with Flink and the Docker containers, let’s have a look at some common operational tasks that you can try out on our playground. All of these tasks are independent of each other, i.e. you can perform them in any order. Most tasks can be executed via the CLI and the REST API.


Listing Running Jobs 

Command

docker-compose run --no-deps client flink list


Expected Output

The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the CLI or REST API.
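If you want to reuse the JobID in later commands, you can capture it in a shell variable. The sketch below assumes the JobID is printed as 32 hexadecimal characters and matches on that shape rather than on the exact layout of the CLI output. The sample line, including its date and ID, is an illustrative placeholder and not real output.

```shell
# Illustrative line in the style of the CLI listing; date and ID are placeholders.
sample_line='11.03.2023 10:12:00 : c2f5a7e1d3b94c6f8a0b1e2d3c4f5a6b : Click Event Count (RUNNING)'

# Match the first run of 32 hexadecimal characters and treat it as the JobID.
job_id=$(printf '%s\n' "$sample_line" | grep -oE '[0-9a-f]{32}' | head -n 1)

echo "$job_id"
```

Against the running playground, the same filter can be applied to the output of the list command shown above in place of the sample line.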


Observing Failure & Recovery

Flink provides exactly-once processing guarantees under (partial) failure. In this playground you can observe and - to some extent - verify this behavior.


Step 1: Observing the Output 

As described above, the events in this playground are generated such that each window contains exactly one thousand records. So, in order to verify that Flink successfully recovers from a TaskManager failure without data loss or duplication, you can tail the output topic and check that - after recovery - all windows are present and the count is correct.


For this, start reading from the output topic and leave this command running until after recovery (Step 3).


docker-compose exec kafka kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 --topic output


Step 2: Introducing a Fault

In order to simulate a partial failure you can kill a TaskManager. In a production setup, this could correspond to a loss of the TaskManager process, the TaskManager machine, or simply a transient exception being thrown from the framework or user code (e.g. due to the temporary unavailability of an external resource).

docker-compose kill taskmanager


After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and immediately resubmit it for recovery. When the Job gets restarted, its tasks remain in the SCHEDULED state, which the WebUI indicates with purple squares.



At this point, the tasks of the Job cannot move from the SCHEDULED state to RUNNING because there are no resources (TaskSlots provided by TaskManagers) to run the tasks. Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.



In the meantime, the data generator keeps pushing ClickEvents into the input topic. This is similar to a real production setup where data is produced while the Job to process it is down.


Step 3: Recovery 

Once you restart the TaskManager, it reconnects to the JobManager.

docker-compose up -d taskmanager


When the JobManager is notified about the new TaskManager, it schedules the tasks of the recovering Job to the newly available TaskSlots. Upon restart, the tasks recover their state from the last successful checkpoint that was taken before the failure and switch to the RUNNING state.

The Job will quickly process the full backlog of input events (accumulated during the outage) from Kafka and produce output at a much higher rate (> 24 records/minute) until it reaches the head of the stream. In the output you will see that all keys (pages) are present for all time windows and that every count is exactly one thousand. Since we are using the Kafka producer in its “at-least-once” mode, there is a chance that you will see some duplicate output records.

Note: Most production setups rely on a resource manager (Kubernetes, YARN) to automatically restart failed processes.


Upgrading & Rescaling a Job 

Upgrading a Flink Job always involves two steps: First, the Flink Job is gracefully stopped with a Savepoint. A Savepoint is a consistent snapshot of the complete application state at a well-defined, globally consistent point in time (similar to a checkpoint). Second, the upgraded Flink Job is started from the Savepoint. In this context “upgrade” can mean different things, including the following:

  • An upgrade to the configuration (incl. the parallelism of the Job)
  • An upgrade to the topology of the Job (added/removed Operators)
  • An upgrade to the user-defined functions of the Job

Before starting with the upgrade you might want to start tailing the output topic, in order to observe that no data is lost or corrupted in the course of the upgrade.
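The two steps can be sketched as a pair of shell helpers around the Flink CLI in the client container. This is a sketch under assumptions, not the playground's own tooling: the function names are hypothetical, and the path of the Job jar inside the client container is not given here, so it is passed in as a parameter that you need to look up in docker-compose.yaml.

```shell
# Step 1: stop a running Job gracefully and take a Savepoint.
# $1: JobID of the running Job.
stop_with_savepoint() {
  docker-compose run --no-deps client flink stop "$1"
}

# Step 2: start the (upgraded) Job from a Savepoint.
# $1: Savepoint path, $2: parallelism, $3: path to the Job jar inside the
# client container (hypothetical, look it up in docker-compose.yaml).
# Any further arguments are passed on to the Job as program arguments.
start_from_savepoint() {
  savepoint_path=$1
  parallelism=$2
  jar_path=$3
  shift 3
  docker-compose run --no-deps client flink run \
    -s "$savepoint_path" -p "$parallelism" -d "$jar_path" "$@"
}
```

A call such as start_from_savepoint followed by the Savepoint path, the new parallelism, the jar path, and the original program arguments covers both a plain restart and a rescale. Keep in mind that a higher parallelism only takes effect once enough TaskSlots are available, which may require additional TaskManagers.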



Variants 

You might have noticed that the Click Event Count application was always started with the --checkpointing and --event-time program arguments. By omitting these in the command of the client container in docker-compose.yaml, you can change the behavior of the Job.


  • --checkpointing enables checkpointing, which is Flink’s fault-tolerance mechanism. If you run without it and go through Observing Failure & Recovery, you should see that data is actually lost.

  • --event-time enables event time semantics for your Job. When disabled, the Job will assign events to windows based on the wall-clock time instead of the timestamp of the ClickEvent. Consequently, the number of events per window will not be exactly one thousand anymore.

The Click Event Count application also has another option, turned off by default, that you can enable to explore the behavior of this Job under backpressure. You can add this option in the command of the client container in docker-compose.yaml.


  • --backpressure adds an additional operator into the middle of the Job that causes severe backpressure during even-numbered minutes (e.g., during 10:12, but not during 10:13). This can be observed by inspecting various metrics such as outputQueueLength and outPoolUsage, and/or by using the backpressure monitoring available in the WebUI.


Source: ITPUB blog, http://blog.itpub.net/69949806/viewspace-2939216/. Please credit the source when reposting.
