course: Apache Flink® 101

The Flink Runtime

5 min
David Anderson

Software Practice Lead

Overview

This video explains how a Flink cluster is organized, and the roles of Job Managers and Task Managers in running your applications.

Flink supports both batch and stream processing. These two execution modes place different demands on the runtime. Understanding these differences is helpful in appreciating what problems the Flink runtime has to solve, and why it is organized the way it is.

Topics:

  • Flink Client
  • Job Manager
  • Task Manager
  • Job Submission
  • Streaming vs. Batch Execution



The Flink Runtime

I'm David, from Confluent, and now I'm going to tell you about the Flink runtime.

When you write a Flink application using one of the Flink APIs, that application becomes a Flink client. When you execute that client, the API assembles the job graph for the job and submits it to the Job Manager.

When the Job Manager receives your job, it finds or creates the resources needed to run it. For example, in a containerized environment, the Job Manager will spin up as many Task Manager pods as are needed to provide the desired parallelism. Each Task Manager provides some task slots, each of which can execute one parallel instance of the job graph. Once the job is running, the Job Manager remains responsible for coordinating the activities of the Flink cluster: for example, it coordinates checkpointing, and it restarts Task Managers if they fail.

The Task Managers run the job. They pull data from the sources, transform it, send data to each other as needed for repartitioning and rebalancing, and push results out to the sinks.

A good way to understand stream processing is to contrast it with batch processing. Flink supports both: you write your code once, and depending on the context in which it runs, it executes as a streaming job or as a batch job. For Flink, batch is nothing more than a special case in the runtime, one that enables a number of optimizations. Those batch-mode optimizations are only possible with bounded streams; unbounded streams require streaming execution.

We expect a stream processor to produce results quickly, with minimal end-to-end latency across the entire pipeline. For this to work, the entire pipeline must be left running continuously. By contrast, when Flink runs in batch mode, it can reduce its resource requirements by running the pipeline stages sequentially rather than concurrently.
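The difference between the two execution modes can be sketched with a toy pipeline in plain Python. This is an illustration of the idea, not Flink code, and the color-filtering pipeline is just an example: in batch mode each stage runs to completion over the whole bounded dataset before the next stage starts, while in streaming mode every event flows through the entire pipeline as it arrives.

```python
# Toy model of the two execution modes (illustration only, not Flink code).
# The pipeline has two stages: filter out "orange" events, then count by color.

def is_not_orange(event):
    return event != "orange"

def run_batch(events):
    """Batch mode: each stage runs over the whole bounded dataset in turn."""
    filtered = [e for e in events if is_not_orange(e)]  # stage 1 completes first
    counts = {}
    for e in filtered:                                  # then stage 2 runs
        counts[e] = counts.get(e, 0) + 1
    return counts  # final results, produced only when the job is complete

def run_streaming(events):
    """Streaming mode: each event flows through the whole pipeline as it
    arrives, and results are emitted incrementally after every event."""
    counts = {}
    emitted = []
    for e in events:                           # events arrive one at a time
        if is_not_orange(e):                   # stage 1
            counts[e] = counts.get(e, 0) + 1   # stage 2
        emitted.append(dict(counts))           # incremental result per event
    return counts, emitted

stream = ["red", "orange", "blue", "red"]
print(run_batch(stream))                 # {'red': 2, 'blue': 1}
final, _ = run_streaming(stream)
assert final == run_batch(stream)        # same logic, same answer in both modes
```

Both functions apply the same logic, which mirrors the point above: the same Flink code can run in either mode, but only the bounded (batch) case can afford to run its stages sequentially.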
If you think back to the earlier example where the first step was to filter out the orange events: in batch mode we could first remove all of the orange events from the dataset, and then push the filtered dataset through the next step in the pipeline, and so on.

One of the optimizations available in batch mode is to pre-sort the input to make it easier to process. For example, databases sometimes implement joins by first sorting the input tables, after which the join is easier to do. Flink can do the same thing, but only in batch mode. When Flink is operating in streaming mode, it must accept each event as it arrives. Sometimes Flink cannot immediately process an event and has to buffer it until other necessary data has arrived. These buffers must be kept in some sort of durable state store, so that their contents aren't lost if a Task Manager fails and has to be restarted.

In batch mode, processing can continue until the job is complete, at which point the final results are produced. In streaming mode, any event could turn out to be the last, so results are produced incrementally, after every event, or periodically, based on timers.

In batch mode, Flink can recover from failures by restarting the job from the beginning. In streaming mode, Flink minimizes downtime by resuming from a recent snapshot; later in the course, I'll explain how this works in more detail. Flink provides effectively exactly-once guarantees in both batch and streaming modes, but achieving this for streaming is rather more complex, since restarts need to happen quickly and without significant re-processing.

As we've seen, stream processing puts some interesting demands on the runtime, and in some ways it is more complex to support than batch processing. Is it worth going to that trouble? Yes, absolutely: some use cases are only possible with a stream processor.
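The sort-based join technique mentioned above can be sketched in a few lines of plain Python. This is a sketch of the general database technique, not Flink's actual implementation: once both inputs are sorted on the join key, a single forward pass over each side is enough, which is exactly why it only works when the inputs are bounded.

```python
# Sort-merge join sketch: sort both inputs on the join key, then join them
# with one forward pass over each side. Only possible on bounded inputs,
# since sorting requires seeing all the data first.
def sort_merge_join(left, right):
    left = sorted(left)    # pre-sort both sides on the join key
    right = sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # gather the run of rows with this key on each side, cross them
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                i2 += 1
            j2 = j
            while j2 < len(right) and right[j2][0] == rk:
                j2 += 1
            for _, lval in left[i:i2]:
                for _, rval in right[j:j2]:
                    out.append((lk, lval, rval))
            i, j = i2, j2
    return out

orders = [(2, "order-b"), (1, "order-a")]          # (key, value) pairs
colors = [(1, "red"), (2, "blue"), (3, "green")]
print(sort_merge_join(orders, colors))
# [(1, 'order-a', 'red'), (2, 'order-b', 'blue')]
```

In streaming mode no such pre-sorting is possible, so a join must instead buffer rows in durable state until their matches arrive, as described above.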
Take, for example, any use case where low latency is a critical requirement, such as data center monitoring or fraud detection. For use cases that demand an immediate response to every event, the latency of batch processing can make it unworkable.

Okay, so some use cases need a stream processor. Are there use cases that need a batch processor? Not really. Stream processing is more powerful than batch processing, and any use case that can be satisfied with batch processing could be handled by a stream processor instead. However, it is still worthwhile to have a runtime that explicitly supports both, because batch processing can be much more efficient.

If you aren't already on Confluent Developer, head there now using the link in the video description to access other courses, hands-on exercises, and many other resources for continuing your learning journey.