# Streaming

## Some philosophy

Spark streaming techniques fall into two broad areas that don't have much to do with each other until you get to the advanced topics:

- How to get the data streamed into Spark from some external system like a database or a messaging system -- and sometimes how to get it streamed back out again.
- How to transform the data within Spark while making full use of the streaming features.

After the single "getting started" example below, we'll look at these two areas separately. Eventually there may need to be a section of "advanced" examples that ties them together again.

Of course, to transform streaming data we need to set up a streaming data source. Many of the sources you'll encounter in practice take considerable setup, so I've chosen to use Spark's file streaming mechanism and to provide a utility class for generating a stream of files containing random data. My hope is that for most users of these examples it will need no setup at all, and it has the useful side effect of bringing streaming "down to earth" by using such a "low tech" mechanism.

## Utilities

**CSVFileStreamGenerator.java**

A utility for creating a sequence of files of integers in the file system so that Spark can treat them like a stream. It follows a standard pattern to ensure correctness: each file is first created in a temporary folder and then atomically renamed into the destination folder, so that the file's point of creation is unambiguous and it is correctly recognized by the streaming mechanism. The sketch below illustrates this write-then-rename pattern.

Each generated file has the same number of key/value pairs. The keys have the same names from file to file, while the values are random numbers, and thus vary from file to file.

This class is used by several of the streaming examples.
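
To make the pattern concrete, here is a minimal sketch of the write-then-rename step, not the generator class itself; the directory paths, file name, and record layout are illustrative assumptions.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AtomicFileDrop {
  public static void main(String[] args) throws IOException {
    // Illustrative paths: "prep" is where files are assembled,
    // "input" is the folder the streaming source watches.
    Path tempDir = Paths.get("/tmp/streamFiles/prep");
    Path destDir = Paths.get("/tmp/streamFiles/input");
    Files.createDirectories(tempDir);
    Files.createDirectories(destDir);

    // Write the file somewhere the streaming source is NOT watching ...
    Path scratch = tempDir.resolve("batch1.csv");
    try (PrintWriter out = new PrintWriter(scratch.toFile())) {
      out.println("key1,42");
      out.println("key2,17");
    }

    // ... then atomically move it into the watched folder, so the
    // streaming mechanism never sees a partially written file.
    Files.move(scratch, destDir.resolve("batch1.csv"),
        StandardCopyOption.ATOMIC_MOVE);
  }
}
```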

**StreamingItem.java**

An item of data to be streamed. This class is used both to generate the records in the CSV files and to parse them. Several of the example stream processing pipelines parse the text data into these objects for further processing.

## Getting started

**FileBased.java**

How to create a stream of data from files appearing in a directory. Start here. The sketch below shows the basic setup.
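
A minimal sketch of the core of this approach, assuming text files are atomically dropped into a watched directory; the paths, batch interval, and timeout are illustrative, and the actual example may differ in detail.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FileStreamSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf()
        .setAppName("FileStreamSketch")
        .setMaster("local[4]");
    // A batch is produced every second from whatever new files appeared.
    JavaStreamingContext ssc =
        new JavaStreamingContext(conf, Durations.seconds(1));

    // Each new file in this directory becomes part of the next batch;
    // each line of the file becomes one entry in the stream.
    JavaDStream<String> lines = ssc.textFileStream("/tmp/streamFiles/input");
    lines.print();

    ssc.start();
    ssc.awaitTerminationOrTimeout(30_000); // run for 30 seconds, then stop
    ssc.stop(true, true);
  }
}
```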

## Processing the Data

**MultipleTransformations.java**

How to establish multiple streams on the same source of data and register multiple processing functions on a single stream. Both are sketched below.
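
A sketch of both ideas, reusing the file-stream source from above: two streams over the same directory, and two processing functions registered on one of them. The handler bodies are illustrative.

```java
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MultipleHandlersSketch {
  // Assumes an already-configured JavaStreamingContext.
  static void setup(JavaStreamingContext ssc) {
    // Two independent streams against the same directory: each one
    // sees every batch of data.
    JavaDStream<String> first = ssc.textFileStream("/tmp/streamFiles/input");
    JavaDStream<String> second = ssc.textFileStream("/tmp/streamFiles/input");

    // Two processing functions registered on a single stream: both are
    // called on every batch RDD the stream produces.
    first.foreachRDD(rdd -> System.out.println("handler 1: " + rdd.count()));
    first.foreachRDD(rdd -> System.out.println("handler 2: " + rdd.count()));

    second.foreachRDD(rdd -> System.out.println("other stream: " + rdd.count()));
  }
}
```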

**Filtering.java**

Much of the processing we require on streams is agnostic about batch boundaries. It's convenient to have methods on JavaDStream that let us transform the streamed data item by item (using map()) or filter it item by item (using filter()) without being concerned about batch boundaries as embodied by individual RDDs. This example again uses map() to parse the records in the text files, and then filter() to filter out individual entries, so that by the time we receive batch RDDs only the desired items remain.
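
A sketch of this map-then-filter shape. The Item class here is a stand-in for StreamingItem, whose actual fields and CSV layout may differ; the threshold in the filter is arbitrary.

```java
import org.apache.spark.streaming.api.java.JavaDStream;

public class FilteringSketch {
  // A stand-in for the repository's StreamingItem: the fields and
  // CSV layout are assumptions for illustration.
  public static class Item implements java.io.Serializable {
    public final String key;
    public final int value;
    Item(String key, int value) { this.key = key; this.value = value; }
    static Item parse(String csvLine) {
      String[] fields = csvLine.split(",");
      return new Item(fields[0], Integer.parseInt(fields[1]));
    }
  }

  static JavaDStream<Item> bigItems(JavaDStream<String> lines) {
    return lines
        .map(Item::parse)                 // parse each record individually
        .filter(item -> item.value > 50); // keep only the entries we want
  }
}
```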

**Windowing.java**

This example creates two derived streams with different window and slide durations. All three streams print their batch size every time they produce a batch, so you can compare the number of records across streams and batches.
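
A sketch of deriving windowed streams, assuming a one-second batch interval on the underlying context; the window and slide durations here are illustrative, and both must be multiples of the batch interval.

```java
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;

public class WindowingSketch {
  static void setup(JavaDStream<String> source) {
    // Every 2 seconds, emit a batch covering the last 4 seconds of data.
    JavaDStream<String> windowed =
        source.window(Durations.seconds(4), Durations.seconds(2));

    // Every 3 seconds, emit a batch covering the last 6 seconds of data.
    JavaDStream<String> slower =
        source.window(Durations.seconds(6), Durations.seconds(3));

    // Print each stream's batch size so the counts can be compared.
    source.foreachRDD(rdd -> System.out.println("source batch:   " + rdd.count()));
    windowed.foreachRDD(rdd -> System.out.println("windowed batch: " + rdd.count()));
    slower.foreachRDD(rdd -> System.out.println("slower batch:   " + rdd.count()));
  }
}
```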

**StateAccumulation.java**

This example uses an accumulator to keep a running total of the number of records processed. The size of each batch is added to it as the batch is processed, and the running total is printed.
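
A sketch of the running-total idea using a LongAccumulator (an assumption; the example may use a different accumulator type). Each batch RDD is counted and the count added to the driver-side total.

```java
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.util.LongAccumulator;

public class RunningTotalSketch {
  static void setup(JavaStreamingContext ssc, JavaDStream<String> source) {
    // A LongAccumulator tracks a running total across batches; it
    // survives from one batch to the next, unlike the batch RDDs.
    LongAccumulator total =
        ssc.sparkContext().sc().longAccumulator("recordsSeen");

    source.foreachRDD(rdd -> {
      total.add(rdd.count());
      System.out.println("records so far: " + total.value());
    });
  }
}
```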

## Streaming Sources

TBD

## Advanced Topics

**SimpleRecoveryFromCheckpoint.java**

How to persist configured JavaDStreams across a failure and restart. The example simulates failure by destroying the first streaming context (for which a checkpoint directory is configured) and creating a second one, not from scratch, but by reading the checkpoint directory.
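
A sketch of this recovery pattern using JavaStreamingContext.getOrCreate(), which either builds a fresh context or reconstitutes one, DStreams included, from the checkpoint directory; the paths and the stream set up in createContext() are illustrative, and the example's exact mechanics may differ.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointRecoverySketch {
  private static final String CHECKPOINT_DIR = "/tmp/streamFiles/checkpoint";

  // Builds a context from scratch and registers the checkpoint directory.
  private static JavaStreamingContext createContext() {
    SparkConf conf = new SparkConf()
        .setAppName("CheckpointRecoverySketch")
        .setMaster("local[4]");
    JavaStreamingContext ssc =
        new JavaStreamingContext(conf, Durations.seconds(1));
    ssc.checkpoint(CHECKPOINT_DIR);
    ssc.textFileStream("/tmp/streamFiles/input").print();
    return ssc;
  }

  public static void main(String[] args) throws InterruptedException {
    // On the first run this calls createContext(); after a failure and
    // restart it instead rebuilds the context from the checkpoint data.
    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(
        CHECKPOINT_DIR, CheckpointRecoverySketch::createContext);
    ssc.start();
    ssc.awaitTerminationOrTimeout(30_000);
    ssc.stop(true, true);
  }
}
```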

**MapWithState.java** (in progress)

**Pairs.java** (in progress)