Created Notebook on Chaining
tirthajyoti committed Oct 20, 2018
1 parent 7592c14 commit a9ea34b
Showing 1 changed file with 198 additions and 0 deletions.
198 changes: 198 additions & 0 deletions RDD_Chaining_Execution.ipynb
@@ -0,0 +1,198 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chaining\n",
"We can **chain** transformations and actions to create a computation **pipeline**.\n",
"Suppose we want to compute the sum of the squares:\n",
"$$ \\sum_{i=1}^n x_i^2 $$\n",
"where the elements $x_i$ are stored in an RDD."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Start the `SparkContext`"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from pyspark import SparkContext\n",
"sc = SparkContext(master=\"local[4]\")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7, 6, 4, 7, 9, 0, 3, 7, 6, 2, 7, 7, 1, 5, 3, 0, 3, 9, 2, 4, 4, 9, 5, 8, 9, 8, 3, 9, 3, 5, 1, 1, 9, 9, 0, 0, 5, 9, 2, 1, 1, 6, 9, 6, 3, 0, 4, 1, 3, 4, 1, 6, 3, 9, 1, 3, 7, 7, 1, 3, 3, 8, 6, 5, 5, 8, 3, 0, 6, 2, 7, 7, 7, 2, 0, 3, 7, 4, 4, 4, 7, 1, 9, 2, 7, 8, 8, 4, 7, 1, 9, 9, 9, 6, 5, 2, 7, 7, 3, 3, 0, 0, 9, 9, 7, 3, 6, 5, 1, 5, 9, 4, 8, 3, 2, 6, 7, 4, 8, 6, 8, 7, 5, 2, 5, 5, 9, 0, 6, 4, 7, 8, 2, 6, 6, 0, 7, 7, 1, 7, 6, 0, 8, 8, 1, 8, 0, 3, 9, 1, 1, 8, 4, 6, 0, 1, 4, 8, 0, 0, 6, 4, 4, 4, 8, 4, 1, 0, 0, 1, 5, 1, 1, 6, 6, 4, 7, 8, 2, 1, 6, 6, 1, 7, 9, 0, 3, 9, 3, 6, 7, 1, 5, 2, 3, 9, 1, 0, 9, 6, 3, 7, 9, 4, 2, 0, 5, 0, 8, 2, 8, 5, 8, 8, 1, 0, 3, 9, 2, 3, 6, 0, 0, 7, 6, 6, 5, 4, 2, 4, 8, 0, 4, 2, 6, 7, 8, 5, 9, 7, 0, 2, 0, 6, 1, 0, 0, 6, 3, 9, 1, 8, 7, 9, 5, 9, 0, 2, 9, 3, 7, 3, 3, 4, 8, 3, 5, 6, 5, 4, 4, 3, 9, 0, 0, 4, 1, 9, 9, 7, 9, 7, 0, 9, 8, 2, 6, 1, 4, 3, 3, 7, 8, 1, 1, 9, 1, 8, 4, 1, 3, 0, 3, 3, 1, 2, 2, 5, 9, 9, 1, 9, 4, 1, 4, 7, 8, 5, 8, 3, 9, 4, 6, 0, 5, 3, 6, 1, 6, 3, 8, 7, 3, 9, 0, 5, 9, 8, 5, 0, 7, 2, 8, 7, 6, 5, 2, 9, 3, 5, 6, 3, 4, 9, 3, 4, 3, 9, 9, 0, 5, 0, 2, 5, 7, 6, 7, 6, 7, 7, 6, 6, 0, 3, 5, 3, 7, 5, 9, 6, 3, 9, 2, 5, 1, 5, 7, 0, 5, 8, 5, 9, 0, 8, 0, 2, 5, 5, 1, 2, 9, 3, 1, 7, 2, 2, 6, 2, 9, 4, 0, 7, 9, 8, 9, 4, 2, 0, 7, 5, 5, 4, 2, 4, 8, 0, 3, 8, 0, 2, 3, 8, 5, 1, 5, 9, 4, 6, 8, 5, 8, 4, 0, 0, 0, 6, 1, 1, 8, 8, 5, 9, 0, 5, 9, 2, 8, 9, 4, 2, 4, 6, 2, 2, 6, 4, 4, 8, 1, 9, 6, 0, 7, 0, 5, 9, 1, 0, 6, 1, 6, 8, 7, 1, 8, 4, 8, 7, 1, 0, 8, 4, 8, 1, 9, 9, 1, 5, 4, 6, 7, 4, 4, 0, 1, 3, 0, 0, 1, 8, 8, 4, 5, 5, 4, 4, 8, 2, 0, 5, 1, 8, 5, 2, 0, 9, 8, 8, 7, 1, 0, 8, 6, 8, 3, 3, 6, 7, 6, 6, 6, 6, 9, 8, 6, 8, 5, 8, 6, 9, 2, 1, 0, 6, 8, 7, 6, 5, 8, 3, 3, 4, 3, 7, 9, 3, 8, 7, 8, 7, 5, 0, 3, 4, 7, 6, 5, 8, 9, 9, 5, 4, 0, 8, 4, 7, 5, 8, 7, 1, 4, 2, 6, 5, 8, 5, 7, 9, 8, 6, 0, 0, 8, 6, 1, 0, 0, 6, 4, 6, 1, 7, 7, 9, 0, 8, 7, 8, 0, 8, 8, 6, 3, 3, 8, 7, 3, 2, 1, 7, 7, 5, 7, 8, 3, 1, 7, 2, 7, 8, 5, 6, 3, 7, 8, 0, 6, 8, 7, 7, 3, 7, 0, 7, 7, 8, 5, 6, 4, 2, 0, 8, 7, 6, 3, 2, 
3, 9, 4, 2, 3, 1, 1, 0, 1, 9, 3, 2, 6, 4, 8, 7, 0, 4, 2, 1, 2, 5, 4, 8, 6, 2, 2, 7, 7, 8, 6, 8, 2, 8, 9, 4, 7, 2, 1, 5, 5, 1, 9, 2, 1, 1, 4, 2, 2, 1, 6, 8, 2, 4, 2, 3, 7, 8, 9, 8, 6, 1, 1, 1, 9, 3, 6, 8, 2, 1, 2, 4, 1, 9, 4, 9, 6, 1, 0, 0, 1, 3, 8, 1, 3, 9, 8, 9, 5, 4, 9, 1, 6, 6, 3, 1, 2, 7, 5, 4, 7, 7, 4, 7, 1, 0, 0, 3, 4, 5, 4, 4, 5, 4, 1, 5, 2, 8, 1, 8, 0, 0, 2, 5, 5, 2, 2, 0, 3, 4, 3, 6, 0, 2, 6, 7, 7, 6, 7, 9, 6, 3, 6, 2, 4, 5, 0, 6, 7, 6, 4, 6, 5, 1, 1, 6, 9, 7, 6, 7, 6, 5, 2, 2, 7, 3, 7, 7, 2, 7, 1, 1, 2, 1, 9, 3, 8, 4, 5, 1, 7, 0, 3, 6, 5, 1, 2, 1, 8, 8, 0, 7, 2, 4, 6, 6, 8, 0, 5, 4, 6, 4, 2, 3, 2, 2, 8, 5, 2, 8, 2, 8, 2, 2, 4, 0, 4, 2, 9, 7, 9, 1, 2, 0, 4, 4, 9, 6, 3, 5, 3, 7, 1, 2, 6, 0, 7, 8, 7, 8, 1, 6, 9, 4, 1, 5, 5, 9, 3, 6, 9, 5, 1, 7, 3, 0, 8, 7, 5, 5, 2, 4, 2, 0, 9, 6, 0, 1, 5, 9, 2, 7, 1, 8, 3, 2, 9, 8, 6, 9, 4, 5, 0, 0, 5, 7, 0, 7, 0, 9, 1, 4, 7, 1, 7, 8, 6, 3, 8, 1, 1, 1, 7, 1, 6, 4, 3, 7, 7, 4, 1, 0, 5, 2, 1, 8, 4, 7, 2, 8, 1, 4, 6, 8, 8, 5, 2, 6, 2, 9, 7, 1, 6, 2, "
]
}
],
"source": [
"# Build an RDD of 1000 random digits (integers in 0..9) and print them\n",
"B = sc.parallelize(np.random.randint(0, 10, size=1000))\n",
"lst = B.collect()\n",
"for i in lst:\n",
"    print(i, end=', ')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sequential syntax for chaining\n",
"Perform assignment after each computation"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 15.2 ms, sys: 11.9 ms, total: 27.1 ms\n",
"Wall time: 1.01 s\n"
]
}
],
"source": [
"%%time\n",
"# map is a lazy transformation; the work happens when reduce (an action) runs\n",
"Squares = B.map(lambda x: x * x)\n",
"summation = Squares.reduce(lambda x, y: x + y)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29395\n"
]
}
],
"source": [
"print(summation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cascaded syntax for chaining\n",
"Combine computations into a single cascaded command"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 13.9 ms, sys: 13 ms, total: 26.9 ms\n",
"Wall time: 304 ms\n"
]
},
{
"data": {
"text/plain": [
"29395"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"B.map(lambda x: x * x).reduce(lambda x, y: x + y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Both syntaxes mean exactly the same thing\n",
"The only difference:\n",
"* In the sequential syntax the intermediate RDD has a name, `Squares`\n",
"* In the cascaded syntax the intermediate RDD is *anonymous*\n",
"\n",
"The execution is identical in both cases.\n",
"\n",
"### Sequential execution\n",
"The standard (eager) way to execute a map followed by a reduce is:\n",
"* perform the map\n",
"* store the resulting RDD in memory\n",
"* perform the reduce\n",
"\n",
"### Disadvantages of sequential execution\n",
"\n",
"1. The intermediate result (`Squares`) requires memory space.\n",
"2. Two scans of memory (of `B`, then of `Squares`), which doubles the cache misses.\n",
"\n",
"### Pipelined execution\n",
"Perform the whole computation in a single pass. For each element of **`B`**:\n",
"1. Compute the square.\n",
"2. Feed the square as input to the `reduce` operation.\n",
"\n",
"Spark evaluates transformations lazily, so this is how it executes both syntaxes above.\n",
"\n",
"### Advantages of pipelined execution\n",
"\n",
"1. Less memory required: the intermediate result is not stored.\n",
"2. Faster: only one pass through the input RDD."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"sc.stop()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
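
The notebook's distinction between sequential and pipelined execution can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's actual implementation. The function names `sum_squares_sequential` and `sum_squares_pipelined` and the sample `data` list are hypothetical and chosen only for illustration. The sequential version stores every square before reducing, while the pipelined version feeds each square to the reduce as it is produced.

```python
from functools import reduce


def sum_squares_sequential(data):
    """Eager version: materialize all squares, then reduce over them."""
    squares = [x * x for x in data]  # intermediate list is held in memory
    return reduce(lambda x, y: x + y, squares)


def sum_squares_pipelined(data):
    """Single-pass version: each square goes straight into the reduce."""
    squares = (x * x for x in data)  # generator: no intermediate list is stored
    return reduce(lambda x, y: x + y, squares)


# Hypothetical sample input (the first few digits printed by the notebook)
data = [7, 6, 4, 7, 9, 0, 3]
sequential_total = sum_squares_sequential(data)
pipelined_total = sum_squares_pipelined(data)
print(sequential_total == pipelined_total)
```

Both functions return the same total. Only the memory footprint and the number of passes over the data differ, which mirrors the trade-off listed in the notebook. As with `RDD.reduce`, `functools.reduce` without an initial value raises an error on empty input.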
