diff --git a/01-intro-to-computing.Rmd b/01-intro-to-computing.Rmd
index 53aeb60..536cced 100644
--- a/01-intro-to-computing.Rmd
+++ b/01-intro-to-computing.Rmd
@@ -195,7 +195,9 @@
 Some types, such as ints, are able to use a more efficient algorithm when invoked using the three argument form.
 ```

-This shows the function takes in three input arguments: `base`, `exp`, and `mod=None`. When an argument has an assigned value of `mod=None`, that means the input argument already has a value, and you don't need to specify anything, unless you want to.
+We can also find a similar help document, in a [nicer rendered form online](https://docs.python.org/3/library/functions.html#pow). We will practice looking at function documentation throughout the course, because that is a fundamental skill for learning new functions on your own.
+
+The documentation shows the function takes in three input arguments: `base`, `exp`, and `mod=None`. When an argument has an assigned value of `mod=None`, that means the input argument already has a default value, and you don't need to specify anything, unless you want to.

 The following ways are equivalent ways of using the `pow()` function:

@@ -219,11 +221,11 @@ And there is an operational equivalent:

 We will mostly look at functions with input arguments and return types in this course, but not all functions need to have input arguments and output return. Let's look at some examples of functions that don't always have an input or output:

-| Function call | What it takes in | What it does | Returns |
-|---------------|---------------|----------------------------|---------------|
-| `pow(a, b)` | integer `a`, integer `b` | Raises `a` to the `b`th power. | Integer |
-| `print(x)` | any data type `x` | Prints out the value of `x` to the console. | None |
-| `dir()` | Nothing | Gives a list of all the variables defined in the environment. | List |
+| Function call | What it takes in | What it does | Returns |
+|----------------|----------------|-------------------------|----------------|
+| [`pow(a, b)`](https://docs.python.org/3/library/functions.html#pow) | integer `a`, integer `b` | Raises `a` to the `b`th power. | Integer |
+| [`print(x)`](https://docs.python.org/3/library/functions.html#print) | any data type `x` | Prints out the value of `x` to the console. | None |
+| [`dir()`](https://docs.python.org/3/library/functions.html#dir) | Nothing | Gives a list of all the variables defined in the environment. | List |

 ## Tips on writing your first code
diff --git a/02-data-structures.Rmd b/02-data-structures.Rmd
index ad3e1dc..ea3b2c2 100644
--- a/02-data-structures.Rmd
+++ b/02-data-structures.Rmd
@@ -105,12 +105,12 @@ Object methods are functions that does something with the object you are using i

 Here are some more examples of methods with lists:

-| Function method | What it takes in | What it does | Returns |
-|----------------|----------------|-------------------------------------|------------------|
-| `chrNum.count(x)` | list `chrNum`, data type `x` | Counts the number of instances `x` appears as an element of `chrNum`. | Integer |
-| `chrNum.append(x)` | list `chrNum`, data type `x` | Appends `x` to the end of the `chrNum`. | None (but `chrNum` is modified!) |
-| `chrNum.sort()` | list `chrNum` | Sorts `chrNum` by ascending order. | None (but `chrNum` is modified!) |
-| `chrNum.reverse()` | list `chrNum` | Reverses the order of `chrNum`. | None (but `chrNum` is modified!) |
+| Function method | What it takes in | What it does | Returns |
+|---------------|---------------|---------------------------|---------------|
+| [`chrNum.count(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Counts the number of instances `x` appears as an element of `chrNum`. | Integer |
+| [`chrNum.append(x)`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum`, data type `x` | Appends `x` to the end of `chrNum`. | None (but `chrNum` is modified!) |
+| [`chrNum.sort()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Sorts `chrNum` in ascending order. | None (but `chrNum` is modified!) |
+| [`chrNum.reverse()`](https://docs.python.org/3/tutorial/datastructures.html) | list `chrNum` | Reverses the order of `chrNum`. | None (but `chrNum` is modified!) |

 ## Dataframes

@@ -118,7 +118,7 @@ A Dataframe is a two-dimensional data structure that stores data like a spreadsh

 The Dataframe data structure is found within a Python module called "Pandas". A Python module is an organized collection of functions and data structures. The `import` statement below gives us permission to access the "Pandas" module via the variable `pd`.

-To load in a Dataframe from existing spreadsheet data, we use the function `pd.read_csv()`:
+To load in a Dataframe from existing spreadsheet data, we use the function [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):

 ```{python}
 import pandas as pd

 metadata = pd.read_csv("classroom_data/metadata.csv")
 type(metadata)
 ```

-There is a similar function `pd.read_excel()` for loading in Excel spreadsheets.
+There is a similar function [`pd.read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for loading in Excel spreadsheets.

 Let's investigate the Dataframe as an object:

@@ -166,7 +166,7 @@ metadata.shape

 ### What can a Dataframe do (in terms of operations and functions)?

-We can use the `head()` and `tail()` functions to look at the first few rows and last few rows of `metadata`, respectively:
+We can use the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [`.tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods to look at the first few rows and last few rows of `metadata`, respectively:

 ```{python}
 metadata.head()
@@ -179,7 +179,7 @@ Both of these functions (without input arguments) are considered as **methods**:

 Perhaps the most important operation you can do with Dataframes is subsetting them. There are two ways to do it. The first way is to subset by numerical indices, exactly like how we did for lists.

-You will use the `iloc` and bracket operations, and you give two slices: one for the row, and one for the column.
+You will use the [`iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) and bracket operations, and you give two slices: one for the rows, and one for the columns.

 Let's start with a small dataframe to see how it works before returning to `metadata`:

diff --git a/03-data-wrangling1.Rmd b/03-data-wrangling1.Rmd
index f15538d..eeac127 100644
--- a/03-data-wrangling1.Rmd
+++ b/03-data-wrangling1.Rmd
@@ -65,7 +65,7 @@ expression.head()
 ```

 | Dataframe | The observation is | Some variables are | Some values are |
-|------------------|------------------|-------------------|------------------|
+|-----------------|-----------------|--------------------|------------------|
 | metadata | Cell line | ModelID, Age, OncotreeLineage | "ACH-000001", 60, "Myeloid" |
 | expression | Cell line | KRAS_Exp | 2.4, .3 |
 | mutation | Cell line | KRAS_Mut | TRUE, FALSE |

@@ -94,7 +94,7 @@ To subset for rows implicitly, we will use the conditional operators on Datafram
 metadata['OncotreeLineage'] == "Lung"
 ```

-Then, we will use the `.loc` operation (which is different than `.iloc` operation!) and subsetting brackets to subset rows and columns Age and Sex at the same time:
+Then, we will use the [`.loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) operation (which is different from the [`.iloc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) operation!) and subsetting brackets to subset rows and the columns Age and Sex at the same time:

 ```{python}
 metadata.loc[metadata['OncotreeLineage'] == "Lung", ["Age", "Sex"]]
 ```

@@ -126,12 +126,12 @@ Now that your Dataframe has be transformed based on your scientific question, yo

 If we look at the data structure of a Dataframe's column, it is actually not a List, but an object called Series. It has methods that can compute summary statistics for us. Let's take a look at a few popular examples:

-| Function method | What it takes in | What it does | Returns |
-|----------------|----------------|-------------------------|----------------|
-| `metadata.Age.mean()` | `metadata.Age` as a numeric Series | Computes the mean value of the `Age` column. | Float (NumPy) |
-| `metadata['Age'].median()` | `metadata['Age']` as a numeric Series | Computes the median value of the `Age` column. | Float (NumPy) |
-| `metadata.Age.max()` | `metadata.Age` as a numeric Series | Computes the max value of the `Age` column. | Float (NumPy) |
-| `metadata.OncotreeSubtype.value_counts()` | `metadata.OncotreeSubtype` as a string Series | Creates a frequency table of all unique elements in `OncotreeSubtype` column. | Series |
+| Function method | What it takes in | What it does | Returns |
+|----------------|----------------|------------------------|----------------|
+| [`metadata.Age.mean()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html) | `metadata.Age` as a numeric Series | Computes the mean value of the `Age` column. | Float (NumPy) |
+| [`metadata['Age'].median()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.median.html) | `metadata['Age']` as a numeric Series | Computes the median value of the `Age` column. | Float (NumPy) |
+| [`metadata.Age.max()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html) | `metadata.Age` as a numeric Series | Computes the max value of the `Age` column. | Float (NumPy) |
+| [`metadata.OncotreeSubtype.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) | `metadata.OncotreeSubtype` as a string Series | Creates a frequency table of all unique elements in the `OncotreeSubtype` column. | Series |

 Let's try it out, with some nice print formatting:

@@ -144,10 +144,10 @@ Notice that the output of some of these methods are Float (NumPy). This refers t

 ## Simple data visualization

-We will dedicate extensive time later this course to talk about data visualization, but the Dataframe's column, Series, has a method called `.plot()` that can help us make simple plots for one variable. The `.plot()` method will by default make a line plot, but it is not necessary the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot.
+We will dedicate extensive time later in this course to data visualization, but the Dataframe's column, Series, has a method called [`.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) that can help us make simple plots for one variable. The `.plot()` method will by default make a line plot, but that is not necessarily the plot style we want, so we can give the optional argument `kind` a String value to specify the plot style. We use it for making a histogram or bar plot.

 | Plot style | Useful for | kind = | Code |
-|-----------|-----------|-----------|--------------------------------------|
+|-------------|-------------|-------------|---------------------------------|
 | Histogram | Numerics | "hist" | `metadata.Age.plot(kind = "hist")` |
 | Bar plot | Strings | "bar" | `metadata.OncotreeSubtype.value_counts().plot(kind = "bar")` |
diff --git a/04-data-wrangling2.Rmd b/04-data-wrangling2.Rmd
new file mode 100644
index 0000000..3e51640
--- /dev/null
+++ b/04-data-wrangling2.Rmd
@@ -0,0 +1,205 @@
+```{r, include = FALSE}
+ottrpal::set_knitr_image_path()
+```
+
+# Data Wrangling, Part 2
+
+We will continue to learn about data analysis with Dataframes. Let's load in our three Dataframes from the Depmap project again:
+
+```{python}
+import pandas as pd
+import numpy as np
+
+metadata = pd.read_csv("classroom_data/metadata.csv")
+mutation = pd.read_csv("classroom_data/mutation.csv")
+expression = pd.read_csv("classroom_data/expression.csv")
+```
+
+## Creating new columns
+
+Often, we want to perform some kind of transformation on our data's columns: perhaps you want to add the values of columns together, or perhaps you want to represent your column in a different scale.
+
+To create a new column, you simply modify it as if it already exists using the bracket operation `[ ]`, and the column will be created:
+
+```{python}
+metadata['AgePlusTen'] = metadata['Age'] + 10
+expression['KRAS_NRAS_exp'] = expression['KRAS_Exp'] + expression['NRAS_Exp']
+expression['log_PIK3CA_Exp'] = np.log(expression['PIK3CA_Exp'])
+```
+
+where [`np.log(x)`](https://numpy.org/doc/stable/reference/generated/numpy.log.html) is a function imported from the module NumPy that takes in a numeric value and returns the log-transformed value.
+
+Note: you cannot create a new column by referring to it as an attribute of the Dataframe, such as: `expression.KRAS_Exp_log = np.log(expression.KRAS_Exp)`.
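A minimal sketch of this pitfall, using a toy Dataframe rather than the lesson's Depmap data: attribute-style assignment sets a plain Python attribute on the Dataframe object, while bracket assignment creates an actual column.

```python
import numpy as np
import pandas as pd

# Toy Dataframe standing in for `expression` (hypothetical values).
df = pd.DataFrame({"KRAS_Exp": [1.0, 2.0, 4.0]})

# Attribute-style assignment does NOT create a column;
# recent pandas versions emit a UserWarning here.
df.KRAS_Exp_log = np.log(df.KRAS_Exp)
print("KRAS_Exp_log" in df.columns)  # False

# Bracket assignment creates a real column.
df["KRAS_Exp_log"] = np.log(df["KRAS_Exp"])
print("KRAS_Exp_log" in df.columns)  # True
```

This is why the lesson sticks to the bracket operation `[ ]` for creating columns.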
+
+## Merging two Dataframes together
+
+Suppose we have the following Dataframes:
+
+`expression`
+
+| ModelID | PIK3CA_Exp | log_PIK3CA_Exp |
+|--------------|------------|----------------|
+| "ACH-001113" | 5.138733 | 1.636806 |
+| "ACH-001289" | 3.184280 | 1.158226 |
+| "ACH-001339" | 3.165108 | 1.152187 |
+
+`metadata`
+
+| ModelID | OncotreeLineage | Age |
+|--------------|-----------------|-----|
+| "ACH-001113" | "Lung" | 69 |
+| "ACH-001289" | "CNS/Brain" | NaN |
+| "ACH-001339" | "Skin" | 14 |
+
+Suppose that I want to examine the relationship between `OncotreeLineage` and `PIK3CA_Exp`, but they are columns in different Dataframes. We want a new Dataframe that looks like this:
+
+| ModelID | PIK3CA_Exp | log_PIK3CA_Exp | OncotreeLineage | Age |
+|--------------|------------|----------------|-----------------|-----|
+| "ACH-001113" | 5.138733 | 1.636806 | "Lung" | 69 |
+| "ACH-001289" | 3.184280 | 1.158226 | "CNS/Brain" | NaN |
+| "ACH-001339" | 3.165108 | 1.152187 | "Skin" | 14 |
+
+We see that in both dataframes,
+
+- the rows (observations) represent cell lines.
+
+- there is a common column `ModelID`, with shared values between the two dataframes that can facilitate the merging process. We call this an **index**.
+
+We will use the method [`.merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for Dataframes. It takes a Dataframe to merge with as the required input argument. The method looks for a common index column between the two dataframes and merges based on that index.
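A minimal sketch of this merge, re-creating the two small tables above as toy Dataframes (values copied from the example):

```python
import numpy as np
import pandas as pd

# Toy versions of the two tables above.
expression = pd.DataFrame({
    "ModelID": ["ACH-001113", "ACH-001289", "ACH-001339"],
    "PIK3CA_Exp": [5.138733, 3.184280, 3.165108],
    "log_PIK3CA_Exp": [1.636806, 1.158226, 1.152187],
})
metadata = pd.DataFrame({
    "ModelID": ["ACH-001113", "ACH-001289", "ACH-001339"],
    "OncotreeLineage": ["Lung", "CNS/Brain", "Skin"],
    "Age": [69, np.nan, 14],
})

# .merge() finds the shared column (ModelID) and joins on it.
merged = metadata.merge(expression)
print(merged.shape)  # (3, 5): one row per shared ModelID, columns from both
```

The five columns are `ModelID`, `OncotreeLineage`, `Age`, `PIK3CA_Exp`, and `log_PIK3CA_Exp`: the left Dataframe's columns first, then the right Dataframe's non-index columns.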
+
+```{python}
+merged = metadata.merge(expression)
+```
+
+It's usually better to specify what that index column is, to avoid ambiguity, using the `on` optional argument:
+
+```{python}
+merged = metadata.merge(expression, on='ModelID')
+```
+
+If the index columns for the two Dataframes are named differently, you can specify the column name for each Dataframe:
+
+```{python}
+merged = metadata.merge(expression, left_on='ModelID', right_on='ModelID')
+```
+
+One of the most important checks you should do when merging dataframes is to look at the number of rows and columns before and after merging to see whether it makes sense or not:
+
+The number of rows and columns of `metadata`:
+
+```{python}
+metadata.shape
+```
+
+The number of rows and columns of `expression`:
+
+```{python}
+expression.shape
+```
+
+The number of rows and columns of `merged`:
+
+```{python}
+merged.shape
+```
+
+We see that the number of *columns* in `merged` combines the number of columns in `metadata` and `expression`, while the number of *rows* in `merged` is the smaller of the number of rows in `metadata` and `expression`: it only keeps rows that are found in both Dataframes' index columns. This kind of join is called an "inner join", because in the Venn diagram of elements in the two index columns, we keep the inner overlap:
+
+![](images/join.png)
+
+You can specify the join style by changing the optional input argument `how`.
+
+- `how = "outer"` keeps all observations - also known as a "full join"
+
+- `how = "left"` keeps all observations in the left Dataframe.
+
+- `how = "right"` keeps all observations in the right Dataframe.
+
+- `how = "inner"` keeps observations common to both Dataframes. This is the default value of `how`.
+
+## Grouping and summarizing Dataframes
+
+In a dataset, there may be groups of observations that we want to understand, such as case vs. control, or different cancer subtypes. For example, in `metadata`, each observation is a cell line, and perhaps we want to group cell lines by their respective cancer type, `OncotreeLineage`, and look at the mean age for each cancer type.
+
+We want to take `metadata`:
+
+| ModelID | OncotreeLineage | Age |
+|--------------|-----------------|-----|
+| "ACH-001113" | "Lung" | 69 |
+| "ACH-001289" | "Lung" | 23 |
+| "ACH-001339" | "Skin" | 14 |
+| "ACH-002342" | "Brain" | 23 |
+| "ACH-004854" | "Brain" | 56 |
+| "ACH-002921" | "Brain" | 67 |
+
+into:
+
+| OncotreeLineage | MeanAge |
+|-----------------|---------|
+| "Lung" | 46 |
+| "Skin" | 14 |
+| "Brain" | 48.67 |
+
+To get there, we need to:
+
+- **Group** the data based on some criteria: here, the elements of `OncotreeLineage`.
+
+- **Summarize** each group via a summary statistic performed on a column, such as `Age`.
+
+We first subset the two columns we need, and then use the methods [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and `.mean()`.
+
+```{python}
+metadata_grouped = metadata.groupby("OncotreeLineage")
+metadata_grouped['Age'].mean()
+```
+
+Here's what's going on:
+
+- We use the Dataframe method [`.groupby(x)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and specify the column we want to group by. The output of this method is a Grouped Dataframe object. It still contains all the information of the `metadata` Dataframe, but it makes a note that it's been grouped.
+
+- We subset to the column `Age`. The grouping information still persists (this is a Grouped Series object).
+
+- We use the method `.mean()` to calculate the mean value of `Age` within each group defined by `OncotreeLineage`.
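The grouped means above can be checked with a toy version of `metadata` built from the small table shown earlier (a sketch; the real `metadata` is loaded from the Depmap CSV):

```python
import pandas as pd

# Toy version of the `metadata` table above.
toy = pd.DataFrame({
    "ModelID": ["ACH-001113", "ACH-001289", "ACH-001339",
                "ACH-002342", "ACH-004854", "ACH-002921"],
    "OncotreeLineage": ["Lung", "Lung", "Skin", "Brain", "Brain", "Brain"],
    "Age": [69, 23, 14, 23, 56, 67],
})

# Group by cancer type, then take the mean of Age within each group.
mean_age = toy.groupby("OncotreeLineage")["Age"].mean()
print(mean_age["Lung"])             # 46.0
print(round(mean_age["Brain"], 2))  # 48.67
```

The result is a Series indexed by the group labels, which is why we can look up each group's mean by name.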
+
+Alternatively, this could have been done in a chain of methods:
+
+```{python}
+metadata.groupby("OncotreeLineage")["Age"].mean()
+```
+
+Once a Dataframe has been grouped and a column is selected, all the summary statistic methods you learned last week, such as `.mean()`, `.median()`, and `.max()`, can be used. One new summary statistic method that is useful for this grouping-and-summarizing analysis is [`.count()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.count.html), which tells you how many entries there are within each group.
+
+### Optional: Multiple grouping, Multiple columns, Multiple summary statistics
+
+Sometimes, when performing grouping and summary analysis, you want to operate on multiple columns simultaneously.
+
+For example, you may want to group by a combination of `OncotreeLineage` and `AgeCategory`, such as "Lung" and "Adult" as one grouping. You can do so like this:
+
+```{python}
+metadata_grouped = metadata.groupby(["OncotreeLineage", "AgeCategory"])
+metadata_grouped['Age'].mean()
+```
+
+You can also summarize on multiple columns simultaneously. For each column, you have to specify what summary statistic functions you want to use. This can be specified via the [`.agg(x)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html) method on a Grouped Dataframe.
+
+For example, consider the following age case-control Dataframe:
+
+```{python}
+df = pd.DataFrame(data={'status': ["treated", "untreated", "untreated", "discharged", "treated"],
+                        'age_case': [25, 43, 21, 65, 7],
+                        'age_control': [49, 20, 32, 25, 32]})
+
+df
+```
+
+We group by `status` and summarize `age_case` and `age_control` with a few summary statistics each:
+
+```{python}
+df.groupby("status").agg({"age_case": "mean", "age_control": ["min", "max", "mean"]})
+```
+
+The input argument to the `.agg(x)` method is called a [Dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries), which lets you structure information in paired relationships; you can learn more about dictionaries at the link.
+
+## Exercises
+
+The exercise for week 4 can be found [here](https://colab.research.google.com/drive/1ntkUdKQ209vu1M89rcsBst-pKKuwzdwX?usp=sharing).
diff --git a/_bookdown.yml b/_bookdown.yml
index 90c6f3a..31cc82e 100644
--- a/_bookdown.yml
+++ b/_bookdown.yml
@@ -5,6 +5,7 @@ rmd_files: ["index.Rmd",
             "01-intro-to-computing.Rmd",
             "02-data-structures.Rmd",
             "03-data-wrangling1.Rmd",
+            "04-data-wrangling2.Rmd",
             "About.Rmd",
             "References.Rmd"]
 new_session: yes
diff --git a/images/join.png b/images/join.png
new file mode 100644
index 0000000..d408d6b
Binary files a/images/join.png and b/images/join.png differ