08_big_data.Rmd

# Big data {#big_data}

```{r include = FALSE}
# Caching this markdown file 
#knitr::opts_chunk$set(cache = TRUE)
```

## The Big Picture

- Big data problem: data is too big to fit into memory (=local environment).
- R reads data into random-access memory (RAM) at once, and this object lives in memory entirely. So, if object.size > memory.size, the process will crash R. 
- Therefore, the key to dealing with big data in R is reducing the size of data you want to bring into it.

**Techniques to deal with big data**

- Medium-sized file (1-2 GB)
   - Try to reduce the size of the file using slicing and dicing
   - Tools: 
      - R:`data.table::fread(file path, select = c("column 1", "column 2"))`. This command imports data faster than `read.csv()` does.
      - Command-line: [`csvkit`](https://csvkit.readthedocs.io/en/latest/) - a suite of command-line tools to and working with CSV 
- Large file (> 2-10 GB)
   - Put the data into a database and **ACCESS** it 
   - Explore the data and pull the objects of interest 
   
**Databases**

- Types of databases
   - Relational database = a **collection** of **tables** (fixed columns and rows): SQL is a staple tool to define, **query** (the focus of the workshop today), control, and manipulate this type of database
   - Non-relational database = a collection of documents (MongoDB), key-values (Redis and DyanoDB), wide-column stores (Cassandra and HBase), or graph (Neo4j and JanusGraph). Note that this type of database does not preclude SQL. NoSQL stands for ["not only SQL."](https://www.mongodb.com/nosql-explained)
  
**Relational database example**

![Relational Database. Source: MySQL Tutorial](https://sp.mysqltutorial.org/wp-content/uploads/2009/12/MySQL-Sample-Database-Schema.png)

## SQL

- Structured Query Language. Called SEQUEL and was developed by IBM Corporation in the 1970s.

- Remains the standard language for a relational database management system.

- It's a DECLARATIVE language ([what to do > how to do](https://www.sqlite.org/queryplanner.html))
  - Database management systems figure an optimal way to execute a query (query optimization)
    
```sql
SELECT COLUMN FROM TABLE 
```

### Learning objectives 

* Embracing a new mindset: shifting from ownership (opening CSVs stored in your laptop) to access (accessing data stored in a database)

* Learning how to use R and SQL to access and query a database

### SQL and R

* SQL and R

SQL           | R
------------- | --------------------------------------------------------------------------
SELECT        | select() for columns, mutate() for expressions, summarise() for aggregates
FROM          | which data frame 
WHERE         | filter()
GROUP BY      | group_by()
HAVING        | filter() **after group_by()**
ORDER BY      | arrange()
LIMIT         | head()
  
**Challenge 1**

1. Can you tell me the difference in the order in which the following `R` and `SQL` codes were written to manipulate data? For instance, in R, what command comes first? In contrast, in SQL, what command comes first?

- R example 

```r

data %>% # Data 
  select() %>% # Column
  filter() %>% # Row 
  group_by() %>% # Group by 
  summarise(n = n()) %>% # n() is one of the aggregate functions in r; it's count() used inside summarise() function 
  filter() %>% # Row 
  order_by() # Arrange 

```

- SQL example (in a SQL chunk, use `--` instead of `#` to comment) 

```sql 

SELECT column, aggregation (count())` -- Column

FROM data # Data 

WHERE condition -- Filter rows 

GROUP BY column -- Group by

HAVING condition -- Filter rows after group by  

ORDER BY column -- Arrange 

```

![SQL Zine by by [Julia Evans](https://jvns.ca/)](https://wizardzines.com/zines/sql/samples/from.png)

### Setup

Let's get to work. 

### Packages 

- `pacman::p_load()` reduces steps for installing and loading several packages simultaneously. 

```{r}
# pacman 
if (!require("pacman")) install.packages("pacman")

# The rest of pkgs 
pacman::p_load(
 tidyverse, # tidyverse packages 
 DBI, # using SQL queries
 RSQLite, # SQLite
 dbplyr, # use database with dplyr 
 glue, # glue to automate workflow 
 nycflights13 # toy data 
)
```

### NYC flights data 

- [The flight on-time performance data](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236) from the Bureau of Transportation Statistics of the U.S. government. The data goes back to 1987, and its size is more than 20 gigabytes. For practice, we only use a small subset of the original data (flight data departing NYC in 2013) provided by RStudio.

![From RStudio.](https://d33wubrfki0l68.cloudfront.net/245292d1ea724f6c3fd8a92063dcd7bfb9758d02/5751b/diagrams/relational-nycflights.png)

### Workflow 

1. Create/connect to a database 

- Note that the server also can be your laptop (called [localhost](https://en.wikipedia.org/wiki/Localhost#:~:text=In%20computer%20networking%2C%20localhost%20is,via%20the%20loopback%20network%20interface.)).

- Short answer: To do so, you need interfaces between R and a database. We use [`RSQLite`](https://github.com/r-dbi/RSQLite) in this tutorial because it's easy to set up. 

- Long answer: The `DBI` package in R provides a client-side interface that allows `dplyr` to work with databases. DBI is automatically installed when you install `dbplyr`. However, you need to install a specific backend engine (a tool for communication between R and a database management system) for the database (e.g., `RMariaDB`, `RPostgres`, `RSQLite`). In this workshop, we use SQLite because it is the easiest to get started with. I love PostgreSQL because it's open-source and also powerful to do [many amazing things](https://www.postgresql.org/docs/current/functions.html) (e.g., text mining, geospatial analysis). If you want to build a data warehouse, an analytical platform, consider using Spark (Hadoop).

2. Copy a table to the database 

- Option 1: You can create a table and insert rows manually. You also need to define the data schema (the database structure) to do that. 

- Table
    - Collection of rows 
    - Collection of columns (fields or attributes)
    - Each col has a type:
        - String: `VARCHAR(20)`
        - Integer: `INTEGER`
        - Floating-point: `FLOAT`, `DOUBLE`
        - Date/time: `DATE`, `TIME`, `DATETIME`
    - **Schema**: the structure of the database
        - The table name
        - The names and types of its columns
        - Various optional additional information 
            - [Constraints](https://www.w3schools.com/sql/sql_constraints.asp)
                - Syntax: `column datatype constraint`
                - Examples: `NOT NULL`, `UNIQUE`, `INDEX`
        
```sql

-- Create table 

CREATE TABLE students (
    id INT AUTO_INCREMENT,
    name VARCHAR(30),
    birth DATE,
    gpa FLOAT,
    grad INT,
    PRIMARY KEY(id));

-- Insert one additional row 

INSERT INTO students(name, birth, gpa, grad)
      VALUES ('Adam', '2000-08-04', 4.0, 2020);

```

- Option 2: Copy a file (object) to a table in a database using `copy_to`). We take this option as it's fast, and we would like to focus on querying in this workshop. 

3. Query the table 

- Main focus 

4. Pull the results of interests (**data**) using `collect()`

5. Disconnect the database 

#### Create a database 

```{r}

# Define a backend engine 

drv <- RSQLite::SQLite()

# Create an empty in-memory database 
con <- DBI::dbConnect(drv, 
                      dbname = ":memory:")

# Connect to an existing database 
#con <- DBI::dbConnect(RMariaDB::MariaDB(), 
 # host = "database.rstudio.com",
 # user = "hadley",
 # password = rstudioapi::askForPassword("Database password")
#)

dbListTables(con)

# character(0) = NULL

```

- Note that `con` is empty at this stage.

#### Copy an object as a table to the database (push)

```{r}

# Copy objects to the data 
# copy_to() comes from dplyr
copy_to(dest = con, 
        df = flights)

copy_to(dest = con, 
        df = airports)

copy_to(dest = con,
        df = planes)

copy_to(dest = con, 
        df = weather)

# If you need, you can also select which columns you would like to copy:

# copy_to(dest = con, 
#          df = flights, 
#          name = "flights",
#          indexes = list(c("year", "tailnum", "dest")))

```


```{r basic information on tables and fields}

# Show two tables in the database 

dbListTables(con)

# Show the columns/attributes/fields of a table 

dbListFields(con, "flights")

dbListFields(con, "weather")

```

#### Quick demonstrations:

- SELECT desired columns 
- FROM tables 

- Select all columns (*) from `flights` table and show the `first ten rows`
- Note that you can combine SQL and R commands thanks to `dbplyr.`

- Option 1 

```{r}

DBI::dbGetQuery(con, 
                "SELECT * FROM flights;") %>% # SQL
  head(10) # dplyr 

```

- Option 2 (works faster)

```{sql connection=con, eval=FALSE}

SELECT * 
FROM flights 
LIMIT 10

```

- Option 3 (automating workflow)

  - When local variables are updated, the SQL query is also automatically updated. This approach is called [parameterized query](https://www.php.net/manual/en/pdo.prepared-statements.php) (or prepared statement).

```{r}
######################## PREPARATION ########################

# Local variables 
tbl <- "flights"
var <- "dep_delay"
num <- 10

# Glue SQL query string 
# Note that to indicate a numeric value, you don't need.

sql_query <- glue_sql("
  SELECT {`var`}
  FROM {`tbl`}
  LIMIT {num} 
  ", .con = con)

######################## EXECUTION ########################

# Run the query 
dbGetQuery(con, sql_query)
```

**Challenge 2** 
Can you rewrite the above code using `LIMIT` instead of `head(10)`? 

- You may notice that using only SQL code makes querying faster.

- Select `dep_delay` and `arr_delay` from flights table, show the first ten rows, then turn the result into a tibble.

**Challenge 3**
Could you remind me how to see the list of attributes of a table? Let's say you want to see the `flights` table attributes. How can you do it?

- Collect the selected columns and filtered rows 

```{r}

df <- dbGetQuery(con, 
  "SELECT dep_delay, arr_delay FROM flights;") %>%
  head(10) %>%
  collect()

```
  
- Counting rows 

  - Count all (*)
  
```{r}

dbGetQuery(con, 
          "SELECT COUNT(*) 
           FROM flights;") 

```
  
```{r}

dbGetQuery(con, 
           "SELECT COUNT(dep_delay)
           FROM flights;")

```

  - Count distinct values 
  
```{r}
dbGetQuery(con, 
           "SELECT COUNT(DISTINCT dep_delay)
           FROM flights;")

```

#### Tidy-way: dplyr -> SQL

Thanks to the `dbplyr` package, you can use the `dplyr` syntax to query SQL. 

- Note that pipe (%) works.

```{r}

# tbl select tables
flights <- con %>% tbl("flights")
airports <- con %>% tbl("airports")
planes <- con %>% tbl("planes")
weather <- con %>% tbl("weather")

```

- `select` = `SELECT`

```{r}

flights %>% 
  select(contains("delay"))

```

**Challenge 4** 
Your turn: write the same code in SQL. Don't forget to add the `connection` argument to your SQL code chunk.

- `mutate` = `SELECT` `AS`

```{r}

flights %>%
  select(distance, air_time) %>%  
  mutate(speed = distance / (air_time / 60)) 

```

**Challenge 5** 
Your turn: write the same code in SQL. (
Hint: `mutate(new_var = var 1 * var2` (R) = `SELECT var1 * var2 AS near_var` (SQL)

- `filter` = `WHERE` 

```{r}

flights %>% 
  filter(month == 1, day == 1) # filter(month ==1 & day == 1) Both work in the same way.

```

**Challenge 6** 
Your turn: write the same code in SQL (hint: `filter(condition1, condition2)` = `WHERE condition1 and condition2`)

**Additional tips**

Note that R and SQL operators are not exactly alike. R uses `!=` for `Not equal to`. SQL uses `<>` or `!=`. Furthermore, there are some cautions about using `NULL` (NA; unknown or missing): it should be `IS NULL` or `IS NOT NULL` not `=NULL` or `!=NULL` (this makes sense because NULL represents an absence of a value). 

Another pro-tip is [`LIKE` operator](https://www.w3schools.com/sql/sql_like.asp), used in a `WHERE` statement to find values based on string patterns.

```{sql, connection = con}
SELECT DISTINCT(origin) -- Distinct values from origin column
FROM flights
WHERE origin LIKE 'J%'; -- Find any origin values that start with "J"
```

`%` is one of the wildcards you can use for string matching. `%` matches any number of characters. So, `J%` matches Jae, JFK, Joseph, etc. `_` is another useful wildcard that matches exactly one character. So `J_` matches only JA, JE, etc. If wildcards are not enough, then you should consider using regular expressions.

- `arrange` = `ORDER BY`

```{r}

flights %>% 
  arrange(carrier, desc(arr_delay)) %>%
  show_query()

```

**Challenge 7** 
Your turn: write the same code in SQL.
Hint: `arrange(var1, desc(var2)` (R) = `ORDER BY var1, var2 DESC` (SQL)

- `summarise` = `SELECT` `AS` and `group by` = `GROUP BY`

```{r}

flights %>%
  group_by(month, day) %>%
  summarise(delay = mean(dep_delay)) 

```

**Challenge 8** 
Your turn: write the same code in SQL (hint: in SQL the order should be `SELECT group_var1, group_var2, AVG(old_var) AS new_var` -> `FROM` -> `GROUP BY`)

- If you feel too much challenged, here's a help.

```{r}

flights %>%
  group_by(month, day) %>%
  summarise(delay = mean(dep_delay)) %>%
  show_query() # Show the SQL equivalent!

```

- Joins 

- Using joins is more straightforward in R than it is in SQL.

- However, more flexible joins exist in SQL, and they are not available in R. 

  - Joins involving 3+ tables are not supported.
  - Some advanced joins available in SQL are not supported. 
  - For more information, check out [`tidyquery`](https://github.com/ianmcook/tidyquery/issues) to see the latest developments.

- SQL command 

`FROM one table LEFT JOIN another table ON condition = condition` (`ON` in SQL = `BY` in R)

```{sql, connection = con}
SELECT *
FROM flights AS f
LEFT JOIN weather AS w 
ON f.year = w.year AND f.month = w.month
```

Can anyone explain why SQL query using `dplyr` then translated by `show_query()` looks more complex than the above? ([Hint](https://stackoverflow.com/questions/36808295/how-to-remove-duplicate-columns-from-join-in-sql))

```{r}

flights %>% 
  left_join(weather, by = c("year", "month")) %>%
  show_query()

```

#### Collect (pull)

* `collect()` is used to pull the data. Depending on the data size, it may take a long time to run.

- The following code won't work.

> Error in UseMethod("collect") : no applicable method for 'collect' applied to an object of class "c('LayerInstance', 'Layer', 'ggproto', 'gg')"

```{r eval = FALSE}

origin_flights_plot <- flights %>%
  group_by(origin) %>%
  tally() %>%
  ggplot() +
  geom_col(aes(x = origin, y = n)) %>%
  collect()

```

- This works. 

```{r}
df <- flights %>%
  group_by(origin) %>%
  tally() %>%
  collect()

origin_flights_plot <- ggplot(df) +
  geom_col(aes(x = origin, y = n))

origin_flights_plot
```

#### Disconnect 

```{r}

DBI::dbDisconnect(con)

```

### Things we didn't cover 

#### Subquery

Subquery = a query nested inside a query 

This hypothetical example is inspired by [dofactory blog post](https://www.dofactory.com/sql/subquery).

```{sql, eval = FALSE}
SELECT names  -- Outer query 
FROM consultants
WHERE Id IN (SELECT ConsultingId
                FROM consulting_cases 
                WHERE category = 'r' AND category = 'sql'); -- Subquery 
```

#### Common table expression (WITH clauses)

This is just a hypothetical example inspired by [James LeDoux's blog post](https://jamesrledoux.com/code/sql-cte-common-table-expressions.

```{sql, eval = FALSE}
-- cases about R and SQL from dlab-database 
WITH r_sql_consulting_cases AS ( -- The name of the CTE expression 
  -- The CTE query 
  SELECT
    id 
  FROM 
    dlab 
  WHERE
    tags LIKE '%sql%'
  AND
    tags LIKE '%r%'
),
-- count the number of open cases about this consulting category 
-- The outer query 
SELECT status, COUNT(status) AS open_status_count
FROM dlab as d 
INNER JOIN r_sql_consulting_cases as r
  ON d.id = r.id 
WHERE status = 'open'; 
```

### References 

- [csv2db](https://github.com/csv2db/csv2db) - for loading large CSV files in to a database 
- R Studio, [Database using R](https://db.rstudio.com/)
- Ian Cook, ["Bridging the Gap between SQL and R"](https://github.com/ianmcook/rstudioconf2020/blob/master/bridging_the_gap_between_sql_and_r.pdf) rstudio::conf 2020 slides
  - [Video recording](https://www.youtube.com/watch?v=JwP5KdWSgqE&ab_channel=RStudio)
- Data Carpentry contributors, [SQL database and R](https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html), Data Carpentry, September 10, 2019.
- [Introduction to dbplyr](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html)
- Josh Erickson, [SQL in R](http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html), STAT 701, University of Michigan
- [SQL zine](https://wizardzines.com/zines/sql/) by Julia Evans
- [q](http://harelba.github.io/q/) - a command-line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files)