-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsql.qmd
121 lines (82 loc) · 5.14 KB
/
sql.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
---
title: "SQL"
engine: knitr
---
## Overview
So far, we've been able to use [{{< fa brands python >}} Python]{.py} or [{{< fa brands r-project >}} R]{.r} on CSV files without issue. However, when the data that we are working with become very large (e.g., millions of rows) it can be very inefficient to rely on your computer's internal memory to do operations. `r fontawesome::fa(name = "database", fill = "#283E82", a11y = "sem")` [SQL]{.sql} (<u>S</u>tructured <u>Q</u>uery <u>L</u>anguage) is another programming language specifically built to work with larger databases where users "query" just the component of the data that they need for a given operation. This minimizes the amount of data that must be held in memory at any given time.
A caveat before we begin: `r fontawesome::fa(name = "database", fill = "#283E82", a11y = "sem")` [SQL]{.sql} is a fully fledged language in its own right that carries a suite of verbs and syntax conventions. We'll need to engage with these a little in our examples but this website is not meant as a tutorial in this language. Note that [{{< fa brands r-project >}} R]{.r} users may have some `r fontawesome::fa(name = "database", fill = "#283E82", a11y = "sem")` [SQL]{.sql} intuition already because `r fontawesome::fa(name = "database", fill = "#283E82", a11y = "sem")` [SQL]{.sql} shares many of its verbs with the `dplyr` package.
## Library Loading
Begin by loading the needed libraries.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the `DBI` package.
```{r r-libs}
# Load needed libraries
library(DBI)
```
## [{{< fa brands python >}} Python]{.py}
Load the `pandas`, `os`, and `sqlite3` packages.
```{python py-libs}
# Load needed libraries
import os
import pandas as pd
import sqlite3
```
:::
## Connecting to Databases
SQLite operations require us to connect to a database, perform desired operations, and then _close the connection_ when we are finished. A good introduction to this is showing how one can list the data tables available in a given database; this is _vital_ information for doing any "real" queries as we need to know the name of the table from which we want to extract data.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r} we can open the connection with the `dbConnect` function, identify the tables in the database with the `dbListTables` function and--eventually--close the connection with the `dbDisconnect` function (all of which are in the `DBI` package).
```{r r-sql-connect}
# Open the connection
cxn <- DBI::dbConnect(RSQLite::SQLite(), file.path("data", "mammals.sqlite"))
# Identify tables in the database
DBI::dbListTables(conn = cxn)
# Close the connection
DBI::dbDisconnect(conn = cxn)
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py} we can open the connection with the `connect` function (from the `sqlite3` library), identify the tables in the database with the `read_sql_query` function (from `pandas`) and--eventually--close the connection with the `close` method.
```{python py-sql-connect}
# Open the connection
cxn = sqlite3.connect(os.path.join("data", "mammals.sqlite"))
# Query some columns in the 'surveys' object
pd.read_sql_query(sql = "SELECT name FROM sqlite_master WHERE type = 'table'", con = cxn)
# Close the connection
cxn.close()
```
:::
## Extracting Data
Connecting and listing the available data in a given database is all well and good but really what we want to do is extract some of the data for later use. To do this we still need to open (and eventually close!) the connection but between those two steps we can use `r fontawesome::fa(name = "database", fill = "#283E82", a11y = "sem")` [SQL]{.sql} syntax to actually perform a query. Note that it is a good practice to fully capitalize `r fontawesome::fa(name = "database", fill = "#283E82", a11y = "sem")` [SQL]{.sql} verbs (e.g., `SELECT`) to help differentiate them from column / data table names.
At a minimum a query must involve `SELECT` to dictate which columns to keep (accepts either column names or a `*` for 'all columns') and `FROM` to indicate which table in the database we want to query.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
```{r r-sql1}
# Open the connection
cxn <- DBI::dbConnect(RSQLite::SQLite(), file.path("data", "mammals.sqlite"))
# Query some columns in the 'surveys' object
query_r <- DBI::dbGetQuery(conn = cxn, statement = "SELECT year, species_id, plot_id FROM surveys")
# Close the connection
DBI::dbDisconnect(conn = cxn)
```
Note that we can still use the query data object even when we have closed our database connection!
```{r r-sql2}
# Check out the first few rows of that
head(query_r, n = 4)
```
## [{{< fa brands python >}} Python]{.py}
```{python py-sql1}
# Open the connection
cxn = sqlite3.connect(os.path.join("data", "mammals.sqlite"))
# Query some columns in the 'surveys' object
query_py = pd.read_sql_query(sql = "SELECT year, species_id, plot_id FROM surveys", con = cxn)
# Close the connection
cxn.close()
```
Note that we can still use the query data object even when we have closed our database connection!
```{python py-sql2}
# Check out the first few rows of that
query_py.head(4)
```
:::