# Streaming

## Some philosophy

Spark streaming techniques fall into two broad areas that don't have much to do with each other until you get to the advanced topics:

- How to get the data streamed into Spark from some external system like a database or a messaging system -- and sometimes how to get it streamed back out again.
- How to transform the data within Spark while making full use of the streaming features.

After the single "getting started" example below, we'll look at these two areas separately. Eventually there may need to be a section of "advanced" examples that ties them together again.

Of course, to transform streaming data we need to set up a streaming data source. Many of the sources you'll encounter in practice take considerable setup, so I've chosen to use Spark's file streaming mechanism and to provide a utility class for generating a stream of files containing random data. My hope is that for most users of these examples it will need no setup at all, and it has the useful side effect of bringing streaming "down to earth" by using such a "low tech" mechanism.

## Utilities

**CSVFileStreamGenerator.java**

A utility for creating a sequence of files of integers in the file system so that Spark can treat them like a stream. It follows a standard pattern to ensure correctness: each file is first created in a temporary folder and then atomically renamed into the destination folder, so that the file's point of creation is unambiguous and it is correctly recognized by the streaming mechanism. The sketch below illustrates this write-then-rename pattern.

Each generated file has the same number of key/value pairs. The keys have the same names from file to file, while the values are random numbers, and thus vary from file to file.

This class is used by several of the streaming examples.
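
To make the pattern concrete, here is a minimal sketch of the write-then-rename step, not the generator class itself; the directory paths, file name, and record layout are illustrative assumptions.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicFileDrop {
  public static void main(String[] args) throws IOException {
    // Illustrative paths: "prep" is where files are assembled,
    // "input" is the folder the streaming source watches.
    Path tempDir = Paths.get("/tmp/streamFiles/prep");
    Path destDir = Paths.get("/tmp/streamFiles/input");
    Files.createDirectories(tempDir);
    Files.createDirectories(destDir);

    // Write the file somewhere the streaming source is NOT watching ...
    Path scratch = tempDir.resolve("batch1.csv");
    try (PrintWriter out = new PrintWriter(scratch.toFile())) {
      out.println("key1,42");
      out.println("key2,17");
    }

    // ... then atomically move it into the watched folder, so the
    // streaming mechanism never sees a partially written file.
    Files.move(scratch, destDir.resolve("batch1.csv"),
        StandardCopyOption.ATOMIC_MOVE);
  }
}
```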

**StreamingItem.java**

An item of data to be streamed. This class is used both to generate the records in the CSV files and to parse them. Several of the example stream processing pipelines parse the text data into these objects for further processing.

## Getting started

**FileBased.java**

How to create a stream of data from files appearing in a directory. Start here. The sketch below shows the basic setup.
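
A minimal sketch of the core of this approach, assuming text files are atomically dropped into a watched directory; the paths, batch interval, and timeout are illustrative, and the actual example may differ in detail.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf()
        .setAppName("FileStreamSketch")
        .setMaster("local[4]");
    // A batch is produced every second from whatever new files appeared.
    JavaStreamingContext ssc =
        new JavaStreamingContext(conf, Durations.seconds(1));

    // Each new file in this directory becomes part of the next batch;
    // each line of the file becomes one entry in the stream.
    JavaDStream<String> lines = ssc.textFileStream("/tmp/streamFiles/input");
    lines.print();

    ssc.start();
    ssc.awaitTerminationOrTimeout(30_000); // run for 30 seconds, then stop
    ssc.stop(true, true);
  }
}
```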

## Processing the Data

**MultipleTransformations.java**

How to establish multiple streams on the same source of data and register multiple processing functions on a single stream. Both are sketched below.
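
A sketch of both ideas, reusing the file-stream source from above: two streams over the same directory, and two processing functions registered on one of them. The handler bodies are illustrative.

```java
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MultipleHandlersSketch {
  // Assumes an already-configured JavaStreamingContext.
  static void setup(JavaStreamingContext ssc) {
    // Two independent streams against the same directory: each one
    // sees every batch of data.
    JavaDStream<String> first = ssc.textFileStream("/tmp/streamFiles/input");
    JavaDStream<String> second = ssc.textFileStream("/tmp/streamFiles/input");

    // Two processing functions registered on a single stream: both are
    // called on every batch RDD the stream produces.
    first.foreachRDD(rdd -> System.out.println("handler 1: " + rdd.count()));
    first.foreachRDD(rdd -> System.out.println("handler 2: " + rdd.count()));

    second.foreachRDD(rdd -> System.out.println("other stream: " + rdd.count()));
  }
}
```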

**Filtering.java**

Much of the processing we require on streams is agnostic about batch boundaries. It's convenient to have methods on JavaDStream that let us transform the streamed data item by item (using map()) or filter it item by item (using filter()) without being concerned about batch boundaries as embodied by individual RDDs. This example again uses map() to parse the records in the text files, and then filter() to filter out individual entries, so that by the time we receive batch RDDs only the desired items remain.
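
A sketch of this map-then-filter shape. The Item class here is a stand-in for StreamingItem, whose actual fields and CSV layout may differ; the threshold in the filter is arbitrary.

```java
import org.apache.spark.streaming.api.java.JavaDStream;

public class FilteringSketch {
  // A stand-in for the repository's StreamingItem: the fields and
  // CSV layout are assumptions for illustration.
  public static class Item implements java.io.Serializable {
    public final String key;
    public final int value;
    Item(String key, int value) { this.key = key; this.value = value; }
    static Item parse(String csvLine) {
      String[] fields = csvLine.split(",");
      return new Item(fields[0], Integer.parseInt(fields[1]));
    }
  }

  static JavaDStream<Item> bigItems(JavaDStream<String> lines) {
    return lines
        .map(Item::parse)                 // parse each record individually
        .filter(item -> item.value > 50); // keep only the entries we want
  }
}
```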

**Windowing.java**

This example creates two derived streams with different window and slide durations. All three streams print their batch size every time they produce a batch, so you can compare the number of records across streams and batches.
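
A sketch of deriving windowed streams, assuming a one-second batch interval on the underlying context; the window and slide durations here are illustrative, and both must be multiples of the batch interval.

```java
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;

public class WindowingSketch {
  static void setup(JavaDStream<String> source) {
    // Every 2 seconds, emit a batch covering the last 4 seconds of data.
    JavaDStream<String> windowed =
        source.window(Durations.seconds(4), Durations.seconds(2));

    // Every 3 seconds, emit a batch covering the last 6 seconds of data.
    JavaDStream<String> slower =
        source.window(Durations.seconds(6), Durations.seconds(3));

    // Print each stream's batch size so the counts can be compared.
    source.foreachRDD(rdd -> System.out.println("source batch:   " + rdd.count()));
    windowed.foreachRDD(rdd -> System.out.println("windowed batch: " + rdd.count()));
    slower.foreachRDD(rdd -> System.out.println("slower batch:   " + rdd.count()));
  }
}
```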

**StateAccumulation.java**

This example uses an accumulator to keep a running total of the number of records processed. The size of each batch is added to it as the batch is processed, and the running total is printed.
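
A sketch of the running-total idea using a LongAccumulator (an assumption; the example may use a different accumulator type). Each batch RDD is counted and the count added to the driver-side total.

```java
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.util.LongAccumulator;

public class RunningTotalSketch {
  static void setup(JavaStreamingContext ssc, JavaDStream<String> source) {
    // A LongAccumulator tracks a running total across batches; it
    // survives from one batch to the next, unlike the batch RDDs.
    LongAccumulator total =
        ssc.sparkContext().sc().longAccumulator("recordsSeen");

    source.foreachRDD(rdd -> {
      total.add(rdd.count());
      System.out.println("records so far: " + total.value());
    });
  }
}
```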

## Streaming Sources

TBD

## Advanced Topics

**SimpleRecoveryFromCheckpoint.java**

How to persist configured JavaDStreams across a failure and restart. The example simulates failure by destroying the first streaming context (for which a checkpoint directory is configured) and creating a second one, not from scratch, but by reading the checkpoint directory.
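
A sketch of this recovery pattern using JavaStreamingContext.getOrCreate(), which either builds a fresh context or reconstitutes one, DStreams included, from the checkpoint directory; the paths and the stream set up in createContext() are illustrative, and the example's exact mechanics may differ.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointRecoverySketch {
  private static final String CHECKPOINT_DIR = "/tmp/streamFiles/checkpoint";

  // Builds a context from scratch and registers the checkpoint directory.
  private static JavaStreamingContext createContext() {
    SparkConf conf = new SparkConf()
        .setAppName("CheckpointRecoverySketch")
        .setMaster("local[4]");
    JavaStreamingContext ssc =
        new JavaStreamingContext(conf, Durations.seconds(1));
    ssc.checkpoint(CHECKPOINT_DIR);
    ssc.textFileStream("/tmp/streamFiles/input").print();
    return ssc;
  }

  public static void main(String[] args) throws InterruptedException {
    // On the first run this calls createContext(); after a failure and
    // restart it instead rebuilds the context from the checkpoint data.
    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(
        CHECKPOINT_DIR, CheckpointRecoverySketch::createContext);
    ssc.start();
    ssc.awaitTerminationOrTimeout(30_000);
    ssc.stop(true, true);
  }
}
```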

**MapWithState.java** (in progress)

**Pairs.java** (in progress)