Skip to content

Commit

Permalink
some work
Browse files Browse the repository at this point in the history
  • Loading branch information
xrotwang committed Jul 5, 2024
1 parent 090ce15 commit 8ccff62
Show file tree
Hide file tree
Showing 2 changed files with 181 additions and 54 deletions.
Binary file added docs/pycharm_sqlite.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
235 changes: 181 additions & 54 deletions docs/sql_tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,32 +5,44 @@

SQL - the Structured Query Language

### A DSL (domain-specific language) as opposed to a general purpose language like Python
### A DSL (domain-specific language)

SQL is a domain-specific language,

> particularly useful in handling structured data, i.e., data incorporating relations among entities and variables.
No file reading, writing, file system operations, etc.
It is not a general purpose language like Python, so among other things it does not support file reading,
writing, file system operations, etc.

As DSL it may be comparable to [POSIX Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression), and like
RegExes, SQL is well supported and integrated in as data access DSL in general purpose languages like
R (via [RSQLite](https://cran.r-project.org/web/packages/RSQLite/vignettes/RSQLite.html))
and Python (via the [sqlite3 module](https://docs.python.org/3/library/sqlite3.html)).


### SQL is old.

terminology like "join", "select", "group by", "intersect", "union" comes from SQL!

Learning SQL is like learning Latin. Makes it soemwhat easier to understand all of its derivatives.
Learning SQL is a bit like learning Latin. Makes it soemwhat easier to understand all of its derivatives.
But the only people speaking it are somewhat strange. In the case of SQL, though, the Vatican is a
really cool place!

SQL is (still) good to have on your CV if you ever apply for data scienctist jobs
SQL is (still) good to have on your CV if you ever apply for data scienctist jobs.
[Google](https://www.google.com/search?q=is+sql+needed+for+data+science)
has snippets saying "extremely important" and "absolutely necessary" on the first page of results for
"is sql needed for data science" :)


### SQL is *declarative* (as opposed to *imperative*), i.e. no "control flow statements",

> SQL is an example of a declarative programming language. Statements do not describe computations directly, but instead describe the desired result of some computation.
Also, modularization via functions is not a thing.
Also, modularization via functions is not a thing (we'll see ways to modularize SQL below, though).

Example:
- values for languages without glottocodes, Python vs. SQL
- or non-cognate coded lexical items?

While this "lack of expressiveness" might seem like a disadvantage, it actually is advantageous to judge
applicability of SQL: If you can fit your data aggregation needs into one statement (one *query*) (and possibly a handful of
Expand All @@ -49,55 +61,160 @@ applicability of SQL: If you can fit your data aggregation needs into one statem

> Despite the existence of standards, virtually no implementations in existence adhere to it fully, and most SQL code requires at least some changes before being ported to different database systems.
- even common functions may have (very) different names (`string_agg` vs `group_concat`).

But SQL is also very well supported in IDEs like PyCharm (professional). You can associate a project with a database
and get a query console to write SQL with tab-completion for table and column names, function names, etc.

<img src="pycharm_sqlite.png" width="400" />


### SQLite

### SQL the language vs. database managers aka RDBMS or database engines
SQL the language vs. database managers aka RDBMS or database engines

But the reason why SQL is a good choice for data manipulation in linguistics is SQLite, i.e. an
ubiquitous, [database engine](https://en.wikipedia.org/wiki/Database_engine) that works on single-file databases.
ubiquitous, [database engine](https://en.wikipedia.org/wiki/Database_engine) that works on **single-file** databases.
So this diversity only hurts us by making googling for SQL solutions a bit trickier. (But it's good to know
that should we run into performance issues with SQLite, PostgreSQL is a likely solution - and is known for adhering "most"
to the SQL standard.)

SQLite is an embedded SQL database engine (embedded in the code accessing the database) as opposed to
RDBMS like PostgreSQL or MySQL, where the database engine runs in a server-like process and is accessed
over the network (or sockets).
over the network (or sockets). Being embeddable means SQLite is everywhere! Firefox stores its bookmarks in SQLite

```sql
sqlite> select visit_count, url from moz_places where url like '%ufggym%' order by visit_count desc limit 3;
116|https://ufggym.eltern-portal.org/start
116|https://ufggym.eltern-portal.org/aktuelles/elternbriefe
81|https://ufggym.eltern-portal.org/aktuelles/schwarzes_brett
```

Shotwell stores its collection metadata in SQLite
```shell
$ sqlite3 .local/share/shotwell/data/photo.db
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> select count(*) from PhotoTable;
44111
```
SQLite adds functionality like input and output.
Example: Writing query results to CSV
The SQLite database format is also among the ["preferred" formats of the Library of Congress](https://www.loc.gov/preservation/resources/rfs/data.html)
for the archiving of datasets (together with CSV, JSON, and a couple more)
## SQL - Getting started
> maximizing the chances for survival and continued accessibility of creative content well into the future.
-> software carpentry: https://swcarpentry.github.io/sql-novice-survey/
Getting used to SQL as R user: Maybe start with dbplyr https://dbplyr.tidyverse.org/ - running
[show_query](https://dplyr.tidyverse.org/reference/explain.html) now and then.
## CLDF SQL
### Running SQL with `sqlite3`
Why is SQL(ite) useful when dealing with CLDF data?
Each CLDF dataset can be converted to a SQLite database, running the `cldf createdb` command installed
with `pycldf`. This SQLite database makes uniform access a lot easier for tools that are not CLDF aware:
- tables are named with component names rather than filenames, so data about languages will be stored
in a table called `LanguageTable`, no matter what the file in the CLDF dataset was called (e.g.
[`societies.csv` in the case of D-PLACE datasets](https://github.com/D-PLACE/dplace-dataset-ea/tree/main/cldf))
- columns are named with CLDF property terms rather than - possibly custom - column names. Thus, the column
holding Glottocodes will always be called `cldf_glottocode`
- foreign keys are inferred from CLDF reference properties
We can create a SQLite database loaded with version v2020.3 of WALS running
```shell
cldf createdb https://raw.githubusercontent.com/cldf-datasets/wals/v2020.3/cldf/StructureDataset-metadata.json wals.sqlite
```
[sqlite3](https://sqlite.org/cli.html)
Interactively:
## SQL - Getting started
The software carpentry lesson [Databases and SQL](https://swcarpentry.github.io/sql-novice-survey/)
provides a really good introduction to both, SQL and SQLite.
From the shell:
```shell
$ sqlite3 lsi.sqlite "select cldf_id, cldf_name from languagetable limit 5"
JAPANESE|JAPANESE
AINU|AINU
KOREAN|KOREAN
TURKI|TURKI
MANCHU|MANCHU
```
To recapitulate the main points of this intro, we can inspect the
following query which computes the number of datapoints in WALS for each Glottocode assigned to a
WALS language in the [Central Pama-Nyungan genus](https://wals.info/languoid/genus/centralpamanyungan).
Note that we exploit the fact that values are linked to languages by the required
[languageReference](https://github.com/cldf/cldf/blob/d529ef48e99ab6a36d43f8734809190d54ea2a4a/components/values/ValueTable-metadata.json#L15-L21)
in `ValueTable`.

or save SQL in a text file and feed it into `sqlite3` via [input redirection](https://www.gnu.org/software/bash/manual/html_node/Redirections.html#Redirecting-Input):
```shell
$ sqlite3 < q.sql
```sql
-- 1. Selecting data
SELECT coalesce(l.cldf_glottocode, '########') AS Glottocode, -- 5. Missing data + 4. Calculating new values
group_concat(distinct l.cldf_name) AS Names, -- 2. ... removing duplicates
count(v.cldf_id) AS Datapoints -- 6. Aggregation
FROM LanguageTable AS l
JOIN -- 7. Combining data
ValueTable AS v
ON l.cldf_id = v.cldf_languageReference
WHERE l.Genus = 'Central Pama-Nyungan' -- 3. Filtering
GROUP BY l.cldf_glottocode -- 6. Aggregation
ORDER BY Datapoints DESC; -- 2. Sorting ...
```
Output:
> [!NOTE]
> Except for the `WHERE l.Genus = ...` clause, this query will work on basically every CLDF StructureDataset.
The result of the query shows that one language has no Glottocode, represented by the special `NULL`
value, which we turned into `'######'` via the [`coalesce` function](https://www.sqlitetutorial.net/sqlite-functions/sqlite-coalesce/).
Glottocode|Names|Datapoints
--- | --- | ---
east2379|Arrernte (Mparntwe)|79
dier1241|Diyari|73
alya1239|Alyawarra|52
pitt1247|Pitta Pitta|41
darl1243|Paakantyi|41
arab1267|Arabana|36
yand1253|Innamincka|26
west2441|Arrernte (Western)|22
ngur1261|Wangkumara|16
adny1235|Adynyamathanha|15
wira1265|Wirangu|12
kaur1267|Kaurna|3
bang1339|Banggarla|3
########|Arrernte|2
wang1290|Wangkangurru|1
kayt1238|Kaytej|1
### Running SQL queries
With [sqlite3](https://sqlite.org/cli.html):
- interactively:
```sql
$ sqlite
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> select cldf_id, cldf_name from languagetable limit 5;
aab|Arapesh (Abu)
aar|Aari
aba|Abau
abb|Arabic (Chadian)
abd|Abidji
```
- from the shell:
```shell
$ sqlite3 wals.sqlite "select cldf_id, cldf_name from languagetable limit 5;"
aab|Arapesh (Abu)
aar|Aari
aba|Abau
abb|Arabic (Chadian)
abd|Abidji
```
- or save SQL in a text file and feed it into `sqlite3` via [input redirection](https://www.gnu.org/software/bash/manual/html_node/Redirections.html#Redirecting-Input):
```shell
$ sqlite3 wals.sqlite < q.sql
aab|Arapesh (Abu)
aar|Aari
aba|Abau
abb|Arabic (Chadian)
abd|Abidji
```
Controlling the output formatting:
```shell
$ sqlite3 --header --csv lsi.sqlite "select cldf_id, cldf_name from languagetable limit 5"
cldf_id,cldf_name
Expand All @@ -109,18 +226,8 @@ MANCHU,MANCHU
```

## CLDF SQL

Each CLDF dataset can be converted to a SQLite database, running the `cldf createdb` command installed
with `pycldf`. This SQLite database makes uniform access a lot easier for tools that are not CLDF aware:
- tables are named with component names rather than filenames
- columns are named with CLDF propery terms rather than - possibly custom - column names

### Examples

1. Investigating coverage in terms of languages of a dataset
2. Computing WALS-style value tables for StructureDatasets
Getting used to SQL as R user: Maybe start with dbplyr https://dbplyr.tidyverse.org/ - running
[show_query](https://dplyr.tidyverse.org/reference/explain.html) now and then.
## SQLite advanced
Expand Down Expand Up @@ -221,21 +328,37 @@ $ sqlite3 glottolog.sqlite < q.sql
```
### Window functions: OVER
https://www.sqlite.org/windowfunctions.html
> A window function is an SQL function where the input values are taken from a "window" of one or more rows in the results set of a SELECT statement.
https://www.sqlitetutorial.net/sqlite-window-functions/
> An SQLite window function performs a calculation on a set of rows that are related to the current row. Unlike an aggregate function, a window function does not cause rows to become grouped into a single result row. a window function retains the row identities. Behind the scenes, window functions can access more than just the current row of the query result.
In the follwing example from doreco, we use the `row_number` window function to augment the list of phones
for a word with an index number, and then only select the first ones to get "word initial phones":
https://github.com/cldf-datasets/doreco/blob/8124db8c45f1186e998d7536401dc6d6dceebb32/etc/views.sql#L3-L12
### A "real world" example
```sql
attach database "phoible.sqlite" as phoible;
attach database "clts.sqlite" as clts;
attach database "lsi-cldf/lsi.sqlite" as lsi;
ATTACH DATABASE "phoible.sqlite" AS phoible;
ATTACH DATABASE "clts.sqlite" AS clts;
ATTACH DATABASE "lsi-cldf/lsi.sqlite" AS lsi;
WITH
lsigraphemes AS (
SELECT
'' as grapheme,
'LSI' as source,
f.cldf_segments || ' ' as segments
from lsi.formtable as f, lsi.languagetable as l
where f.cldf_languagereference = l.cldf_id and l.cldf_glottocode = 'mala1464'
'' AS grapheme,
'LSI' AS source,
f.cldf_segments || ' ' AS segments
FROM lsi.formtable AS f, lsi.languagetable AS l
WHERE f.cldf_languagereference = l.cldf_id AND l.cldf_glottocode = 'mala1464'
UNION ALL
SELECT
substr(segments, 0, instr(segments, ' ')),
Expand All @@ -244,9 +367,9 @@ WITH
FROM lsigraphemes
WHERE segments != ''
),
phoiblegraphemes as (
select distinct c.cltsgrapheme as grapheme, 'PHOIBLE' as source
from
phoiblegraphemes AS (
SELECT DISTINCT c.cltsgrapheme AS grapheme, 'PHOIBLE' AS source
FROM
(
select v.cldf_value as grapheme from phoible.valuetable as v
where cldf_languagereference = 'mala1464' and contribution_id = 1762
Expand All @@ -273,6 +396,11 @@ where clts.name like '%vowel'
order by g.source, g.grapheme;
```
- operators,
- like SQLite's string concatenation `||`
- comparison with `like`
- functions like `substr`, `instr`
### JSON
Expand All @@ -282,7 +410,6 @@ order by g.source, g.grapheme;
## SQL as ideal intermediate data aggregation step between raw data and analysis (e.g. in R)
- CLDF SQL helps with re-use (of code, etc.) because it focuses on the commonalitites between CLDF datasets
- DoReCo: SQLite is fast!
- DoReCo: SQLite is fast! https://github.com/FredericBlum/initial_lengthening/blob/main/init_query.sql
When the result is supposed to be a single table to be fed into R, SQL makes for a transparent aggregation.

0 comments on commit 8ccff62

Please sign in to comment.