Add files via upload
tirthajyoti authored Jul 6, 2020
1 parent 5285b99 commit 0e84d6d
Showing 1 changed file with 61 additions and 10 deletions.
71 changes: 61 additions & 10 deletions Dataframe_introduction.ipynb
@@ -198,7 +198,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### The `describe` method \n",
"### The `describe` and `summary` methods\n",
"\n",
"Similar to Pandas, the `describe` method is used for the statistical summary. But unlike Pandas, calling only `describe()` returns a DataFrame! This is due to the **[lazy evaluation](https://data-flair.training/blogs/apache-spark-lazy-evaluation/)** - the actual computation is delayed as much as possible."
]
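Since `describe()` and `summary()` only set up transformations, nothing is computed until an action such as `show()` is called. A minimal, self-contained sketch of this behavior (the three-row DataFrame below mirrors the people data used in this notebook; the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Build a tiny DataFrame mirroring the data used in this notebook
spark = SparkSession.builder.appName("describe-demo").getOrCreate()
df = spark.createDataFrame(
    [(None, 'Michael'), (30, 'Andy'), (19, 'Justin')], ['age', 'name'])

summary_df = df.describe()   # lazily returns another DataFrame; no stats computed yet
print(type(summary_df))      # <class 'pyspark.sql.dataframe.DataFrame'>

summary_df.show()                          # show() is an action, so Spark computes now
df.summary("count", "min", "max").show()   # summary() can also take specific statistics
```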
@@ -296,15 +296,66 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### How you can define your own Data Schema\n",
"### The `take` and `collect` methods to read/collect rows\n",
"\n",
"Import data types and structure types to build the data schema yourself"
"These methods return some or all rows as a Python list."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Row(age=None, name='Michael'), Row(age=30, name='Andy')]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.take(2)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Row(age=None, name='Michael'),\n",
" Row(age=30, name='Andy'),\n",
" Row(age=19, name='Justin')]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.collect()"
]
},
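Both methods hand back plain `pyspark.sql.Row` objects, so values can be pulled out with ordinary Python. A small sketch, continuing with the `df` used above:

```python
# collect() returns a Python list of Row objects
rows = df.collect()

first = rows[0]
print(first.name)       # 'Michael'  (attribute-style access)
print(first['age'])     # None       (key-style access)
print(first.asDict())   # {'age': None, 'name': 'Michael'}
```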
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defining your own Data Schema\n",
"\n",
"Import data types and structure types to build the data schema yourself"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql.types import StructField, IntegerType, StringType, StructType"
@@ -314,12 +365,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Define your data schema by supplying name and data types to the structure fields you will be importing"
"Define your data schema by supplying name and data types to the structure fields you will be importing. It will be a simple Python list of `StructField` objects. You have to use Spark data types like `IntegerType` and `StringType`."
]
},
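The cell that builds this list is collapsed in the diff above. A sketch of what such a schema list can look like, using the field names from the data shown earlier (the `nullable=True` flags are assumptions):

```python
# Field names follow the people data used in this notebook
data_schema = [
    StructField('age', IntegerType(), True),   # nullable, since Michael's age is missing
    StructField('name', StringType(), True),
]
```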
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@@ -331,12 +382,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now create a `StrucType` with this schema as field"
"Now create a `StrucType` object called `final_struc` with this schema as field"
]
},
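The corresponding cell is also collapsed here; a one-line sketch, assuming the `data_schema` list from the previous step:

```python
# Wrap the list of StructFields into a single StructType
final_struc = StructType(fields=data_schema)
```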
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
@@ -347,12 +398,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now read in the same old JSON with this new schema"
"Now read in the same old JSON with this new schema `final_struc`"
]
},
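The cell body is collapsed in this diff as well; a minimal sketch, assuming the source file is called `people.json` (the actual file name is not visible here) and that `spark` is the active `SparkSession`:

```python
# Pass the hand-built schema instead of letting Spark infer one
df = spark.read.json('people.json', schema=final_struc)
```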
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -379,7 +430,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now when you print the schema, you will see that the `age` is read as int and not long. By default Spark could not figure out for this column the exact data type that you wanted, so it went with long. But this is how you can build your own schema and instruct Spark to read the data accoridngly."
"Now when you print the schema, **you will see that the `age` is read as `int` and not `long`**. By default Spark could not figure out for this column the exact data type that you wanted, so it went with `long`. But this is how you can build your own schema and instruct Spark to read the data accoridngly."
]
}
],
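A short sketch of verifying the result, assuming `df` was read with `final_struc` as above (the printed schema reflects the description in the text, not output copied from the notebook):

```python
df.printSchema()
# root
#  |-- age: integer (nullable = true)
#  |-- name: string (nullable = true)
```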