<!DOCTYPE html>
<html>
<head>
<title>Digital Summer School 2024: THU03</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="../style.css"> </head>
<body>
<textarea id="source">
class: center, middle, titlepage
### THU03: *Scaling Digital Preservation Workflows*
---
class: contentpage
### **Agenda**
1. Scheduling Tools
1.1 Unix Cron
1.2 Windows Task Scheduler
2. Environment variables
2.1 Setting them in Unix/Windows
2.2 Accessing them in code
3. Expanding workflows with APIs
3.1 RESTful API basics
3.2 Requests library
3.3 Rate limiting
4. Elasticsearch and SQLite databases
4.1 Elasticsearch
4.2 SQLite database
---
class: contentpage
### **1. Scheduling Tools**
Script scheduling involves automating the execution of scripts or programs at predetermined times or intervals. This is crucial for:
- Routine maintenance tasks
- Automated backups
- Periodic data processing
- Recurring reports generation
Key components of script scheduling include the scheduling mechanism, execution parameters and the script, programme or command.
```sh
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
#
```
---
class: contentpage
### **1.1 Unix Cron**
Cron is the standard Unix scheduler for Mac and Linux. It uses crontab files to manage schedules, structuring the timing parameters: `Minute Hour Day Month DayOfWeek Command`
```sh
# MIN HR DAY MTH DAY/WK USR FLOCK LOCK DETAILS COMMAND
*/15 * * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_rawcook.lock ${DPX_SCRIPTS}qnap/dpx_rawcook.sh
```
This example runs a shell script for automated DPX encoding every 15 minutes. The Linux `flock` lock can help prevent multiple instances of a script from running concurrently, which is essential for maintaining system stability and data integrity.
Advanced cron features include (shown together in the sketch below):
- Reboot tasks at system startup using the `@reboot` prefix
- Environment variable setting by adding `$PATH` information at the top of the crontab
- Log redirection to capture stdout/stderr: `>> /path/to/logfile 2>&1`
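A brief sketch of how those features might look together in a user crontab (the script path and log location are illustrative only):
```sh
# Environment variables set at the top apply to every job below
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Run once at system startup, appending stdout/stderr to a log file
@reboot /home/user/scripts/startup_checks.sh >> /var/log/startup_checks.log 2>&1
```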
To launch a cron editing session you can use one of the following:
- `crontab -e` opens the current user's crontab for editing. All users, including root, can schedule scripts here
- `/etc/crontab` is a system-wide crontab intended for system scheduling purposes
---
class: contentpage
### **1.1 Unix Cron**
```sh
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields.
# MIN HR DAY MTH DAY/WK USR FLOCK LOCK DETAILS COMMAND
*/15 * * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_rawcook.lock ${DPX_SCRIPTS}qnap/dpx_rawcook.sh
0 */4 * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_post_rawcook.lock ${DPX_SCRIPTS}qnap/dpx_post_rawcook.sh
30 */4 * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_check_script.lock ${DPX_SCRIPTS}qnap/dpx_check_script.sh
0 20 * * 5 root ${PY3_ENV} ${DPX_SCRIPTS}qnap/dpx_part_whole_move.py > /tmp/python_cron.log 2>&1
*/5 * * * * root ${CODE}flock_rebuild.sh
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
17 * * * * root cd / && run-parts --report /etc/cron.hourly
25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6 1 * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )
#
```
---
class: contentpage
### **1.1 Unix Cron**
When cron jobs fail it can be hard to identify the cause of the problem. Common causes include:
- A syntax error somewhere in a schedule line
- Incorrect numbers in a column (e.g. an 8 or 9 in the day-of-week column)
- An unknown or mistyped user name
- Insufficient user permissions for the command
When the crontab fails to run, these steps can help you find the problem:
- Restart cron after making changes
```sh
sudo systemctl restart cron
```
- Check the logs
```sh
/var/log/cron (CentOS) /var/log/syslog (Ubuntu or Debian)
```
- Run the CronitorCLI software, which includes a shell command to test-run any job emulating cron, with debugging output
```sh
cronitor select
```
---
class: contentpage
### **1.2 Windows Task Scheduler**
The Windows Task Scheduler is the Windows equivalent of cron.
It provides a graphical interface for creating and managing scheduled tasks, which can also be managed via the command line (`schtasks`).
You can launch the Task Scheduler GUI from `CMD.exe` or `PowerShell` by typing:
```sh
taskschd.msc
```
The Task Scheduler offers key features:
- Flexible scheduling options (daily, weekly, monthly, etc.)
- Ability to set conditions (e.g., run only when idle)
- Can run scripts, programs, or send emails
Example schtasks command:
```sh
schtasks /create /tn "Backup daily" /tr C:\scripts\backup.bat /sc daily /st 12:00
```
This command creates a task named "Backup daily" that runs `backup.bat` every day at 12:00 noon.
---
class: contentpage
### **2. Environment variables**
Environment variables are dynamic values that can affect the behavior of running processes on a computer. They are part of the environment in which a process runs and are typically used to:
- Store configuration settings
- Manage system-wide settings
- Pass information between processes
- Enhance security by keeping sensitive data out of code
- Facilitate cross-environment development
Advantages when using them in your code:
- They can change application behaviour without modifying the code itself
- They store sensitive information (like API keys), keeping it out of repositories
- The same variable name can be used across environments but contain different information, like paths
- They help manage different settings for development, testing and production
- They aid cross-platform code compatibility when storing OS-specific data
---
class: contentpage
### **2.1 Setting environment variables in Windows**
Create a temporary environment variable for the current session only - make the variable name unique!
In Command Prompt, then PowerShell:
```sh
set MY_VARIABLE=my_value
$env:MY_VARIABLE = "my_value"
```
Permanent for the logged-in user.
Using Command Prompt, then PowerShell:
```sh
setx MY_VARIABLE "my_value"
[System.Environment]::SetEnvironmentVariable("MY_VARIABLE", "my_value", "User")
```
System-wide using PowerShell run as Administrator:
```sh
[System.Environment]::SetEnvironmentVariable("MY_VARIABLE", "my_value", "Machine")
```
---
class: contentpage
### **2.1 Setting environment variables in Unix**
You can create temporary, permanent and system-wide environment variables.
Temporary variable to be used only in your current session:
```sh
export MY_VARIABLE="my_value"
```
Permanent for the specific user creating the variable. Depending on the shell being used, you would add this to `~/.bashrc`, `~/.bash_profile`, or `~/.zshrc`:
```sh
echo 'export MY_VARIABLE="my_value"' >> ~/.bashrc
source ~/.bashrc
```
You can add a system-wide environment variable so all users and automations can access it, either by appending to `/etc/environment` (using `sudo tee`, since plain shell redirection would run without root permissions) or by editing the file directly:
```sh
echo 'MY_VARIABLE="my_value"' | sudo tee -a /etc/environment
sudo nano /etc/environment
```
To make the variable available in your current session, source the file; other sessions will pick it up after you log out and in again:
```sh
source /etc/environment
```
---
class: contentpage
### **2.2 Accessing environment variables in code**
You have seen many examples of this already, particularly in the example bash scripts:
```sh
variable="$ENV_VARIABLE"
echo $ENV_VARIABLE
```
To access them in Python code:
```python3
import os
import sys

# Safely check whether the environment variable exists
if 'ENV_VARIABLE' in os.environ:
    # Retrieve the environment variable
    variable = os.environ['ENV_VARIABLE']
else:
    sys.exit("ENV_VARIABLE is not set")
```
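`os.environ.get()` also accepts a fallback value, which avoids the existence check when a sensible default exists. A short sketch (the variable name below matches the cron examples earlier, and the default path is illustrative):
```python3
import os

# Fall back to a default path if the variable is unset
scripts_path = os.environ.get('DPX_SCRIPTS', '/home/user/scripts/')
print(f"Using scripts path: {scripts_path}")
```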
---
class: contentpage
### **3. Expanding workflows with APIs**
.center[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-22 at 12.39.31.png" width="840">]
---
class: contentpage
### **3.1 RESTful API basics**
RESTful APIs (REpresentational State Transfer) allow two systems to exchange information over the internet, using the HTTP methods GET, PUT, POST and DELETE.
Key elements of RESTful APIs include:
- Client or software on user computer initiates the communication
- Server offers the API as a means of accessing its data or service
- The client retrieves a resource from the API
- Each REST message sent and received is self-descriptive
Different HTTP response codes are returned depending on the success of your call (see the sketch after this list):
- 200 reports that a request has been successful
- 201 can sometimes be used to report a successful create (POST)
- 400 Bad Request is returned when something is wrong with the call or the endpoint
- 404 Not Found is returned when an API URL is incorrect or the API is off-line
- 401 is returned when an unauthorised request is placed
- 500 is returned when the API system has a problem, like the API cannot connect to the database
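A minimal sketch of checking these codes with Python's requests library (the URL is the local Flask example introduced later in this session):
```python3
import requests

response = requests.get("http://127.0.0.1:5000/api/v1/resources/books/all")
if response.status_code == 200:
    print(f"Success: {len(response.json())} records returned")
elif response.status_code == 404:
    print("Not Found: check the endpoint URL, or whether the API is running")
else:
    # raise_for_status() converts any other 4xx/5xx code into an exception
    response.raise_for_status()
```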
---
class: contentpage
### **3.1 RESTful API basics**
- An endpoint URL for the API
```sh
URL = "http://127.0.0.1:5000/api/v1/resources/books"
```
- The HTTP method for Create (POST), Read (GET), Update (PUT), Delete (DELETE) - A 'CRUD' action
```sh
POST
```
- HTTP headers such as authentication tokens or content type confirmation
```sh
"content-type": "application/json"
"accept": "application/json"
"apikey": os.environ['API_KEY']
```
- Body data usually supplied in XML or JSON encoded strings
```sh
{
"title": "Titus Groan",
"author": "Mervyn Peake",
"first_sentence": "Gormenghast, that is, the main massing of the original stone, taken by itself would have displayed a certain ponderous architectural quality were it possible to have ignored the circumfusion of those mean dwellings that swarmed like an epidemic around its outer walls",
"year_published": "1968"
}
```
---
class: contentpage
### **3.2 Requests library**
.left[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-23 at 22.18.23.png" width="600">]
---
class: contentpage
### **3.2 Requests library**
Requests is the standard method for sending HTTP requests in Python. The library simplifies the process of sending and receiving, leaving developers with more time to process the data received!
Let's try out the requests library using a Flask API linked to a database of sci-fi book authors. Open a new terminal/command prompt and `cd` into the `thursday_scripts` folder:
```sh
pip install Flask
python3 api/launch_api.py
```
Flask is a micro web framework written in Python that is well-suited for building REST APIs due to its flexibility and simplicity. This API has no authentication requirements and requires the wifi to be on.
Load up your web browser and see if this address is showing you a basic database:
`http://127.0.0.1:5000/api/v1/resources/books/all`
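For a sense of what sits behind that URL, a minimal sketch of a Flask endpoint (illustrative only, not the actual `launch_api.py`):
```python3
from flask import Flask, jsonify

app = Flask(__name__)
BOOKS = [{"id": 1, "title": "Among Others", "author": "Jo Walton"}]

@app.route('/api/v1/resources/books/all', methods=['GET'])
def all_books():
    # Return the whole collection as JSON
    return jsonify(BOOKS)

if __name__ == '__main__':
    app.run(port=5000)
```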
---
class: contentpage
### **3.2 Requests library**
<font color="orange">Practice:</font> Install the requests library from your terminal with `pip install requests`, then return to your Python shell. Test a simple GET request and look at the different responses available:
```python
import requests

# Set the URL for retrieving all contents
url = "http://127.0.0.1:5000/api/v1/resources/books/all"
# Create the HTTP GET request
response = requests.get(url)
# Assess the response data
print(f"Method used: {response.request.method}")
print(f"Response status code: {response.status_code}")
print(f"Content-Type: {response.headers.get('Content-Type')}")
print(f"Response date: {response.headers.get('Date')}")
print(f"Response: {response.json()}")
```
---
class: contentpage
### **3.2 Requests library**
<font color="orange">Practice:</font> Try more focused GET requests by picking an author and searching for the author's name, publishing date or title of the book:
```python3
import requests
URL = "http://127.0.0.1:5000/api/v1/resources/books"
query = {
"author": "Jo Walton"
}
book = requests.get(URL, params=query, verify=False)
print(book.text)
print(book.json())
```
---
class: contentpage
### **3.2 Requests library**
To make a POST request, use a book sample from the repository path `thursday_scripts/api/new_books.json`. The payload dictionary needs wrapping in the `json.dumps()` method, and ensure you include the headers needed to inform the API how the POST data is formatted:
```python
import json
import requests
URL = "http://127.0.0.1:5000/api/v1/resources/books"
HEADERS = {'Content-Type': 'application/json'}
# Populate a JSON dictionary with sample data from the new_books.json and pick a unique ID
# to avoid replicating another user's POST
payload = {
"id": 49,
"title": "Titus Groan",
"author": "Mervyn Peake",
"first_sentence": "Gormenghast, that is, the main massing of the original stone...",
"year_published": "1968"
}
response = requests.post(URL, headers=HEADERS, data=json.dumps(payload))
print(response.json())
```
---
class: contentpage
### **3.2 Requests library**
You can refresh the website to see if the new book is there:
`http://127.0.0.1:5000/api/v1/resources/books/all`
It's possible to delete the book again using a DELETE request. Import `json` again to format the delete query. Select the title and unique ID of the book you just wrote, and build a query dictionary to identify the record for deletion.
```python
import json
import requests
URL = "http://127.0.0.1:5000/api/v1/resources/books"
HEADERS = {'Content-Type': 'application/json'}
query = {
"title": "Titus Groan",
"id": 49
}
response = requests.delete(URL, headers=HEADERS, data=json.dumps(query))
print(response.json())
```
---
class: contentpage
### **3.3 Rate limiting and Retrying**
API rate limiting is a technique used to control the number of requests that a client can make to an API within a specified period.
- Rate limiting protects against Denial-of-Service (DoS) attacks and offenses like spamming API endpoints
- By limiting requests, servers can efficiently manage resources so no one client can dominate access
- It can save money when APIs are based in cloud environments and charged based on usage
- There may be legal or contractual obligations to limit data throughput or access
If you encounter problems, you can manage the way your code calls the API by using Python tools like `ratelimit` to control how often your requests are made:
```python3
from ratelimit import limits
import requests
FIFTEEN_MINUTES = 900
@limits(calls=15, period=FIFTEEN_MINUTES)
def call_api(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    return response
```
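With this decorator, the sixteenth call inside a fifteen-minute window raises a `RateLimitException`. The same library also provides `sleep_and_retry`, which pauses until the window resets instead of raising. A short sketch:
```python3
from ratelimit import limits, sleep_and_retry
import requests

FIFTEEN_MINUTES = 900

@sleep_and_retry
@limits(calls=15, period=FIFTEEN_MINUTES)
def call_api(url):
    # Sleeps until the rate limit window resets, rather than raising
    response = requests.get(url)
    response.raise_for_status()
    return response
```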
---
class: contentpage
### **3.3 Rate limiting and Retrying**
Sometimes APIs have communication problems, like connection timeouts, or an internal error such as a failed database connection (500 error code). When this happens it's useful to have some means of catching the failure and retrying your call. Tenacity is my favourite Python package for this: a library for retrying code after failures of these kinds.
```sh
pip install tenacity
```
Tenacity uses function `decorators` to tell functions to retry. By default `@retry` will retry after any raised exception.
```python3
import requests
from tenacity import retry

@retry
def func(url):
    try:
        response = requests.get(url)
        return response
    except requests.exceptions.Timeout as error:
        raise error
```
It's really easy to customise the `@retry` decorator to limit the number of retries and to set a wait duration:
```python
@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
```
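Putting these pieces together, a sketch of a retrying API call that only retries on timeouts (the function name is illustrative):
```python3
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

@retry(retry=retry_if_exception_type(requests.exceptions.Timeout),
       stop=stop_after_attempt(5), wait=wait_fixed(2))
def call_api(url):
    # Retry up to 5 times, waiting 2 seconds between attempts,
    # but only when the request times out
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response
```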
---
class: contentpage
### **4. Elasticsearch indexes and SQLite databases**
The BFI National Archive have a few resources built upon Elasticsearch and SQLite. We use them to store temporary request information between departments.
.center[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-23 at 21.50.54.png" width="800">]
Our SQLite interaction gathers user requests from curatorial teams to preserve television programming. A web interface is provided for them to select a TV channel and date; this is stored in our SQLite database and queried daily by Python, which moves files automatically from their backup location into our ingest pathway.
---
class: contentpage
### **4. Elasticsearch indexes and SQLite databases**
.center[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-23 at 21.52.42.png" width="1000">]
Our Elasticsearch index integrates with the BFI DPI Browser, which allows colleagues to review ingested files and request downloads/transcodes from our Black Pearl tape library. The Elasticsearch index receives this data from the UI frontend, and is queried by my Python scripts to see if new download requests have been placed.
When a file has been processed the `Current status` is updated from `Requested` to `Download complete`, and the Python scripts then ignore it.
Storage capacity:
- An SQLite database can hold up to 281TB, equivalent to many trillions of rows of data (a theoretical maximum of 18446744073709551616)
- Elasticsearch in comparison limits a single document to 2GB, and by default you cannot retrieve more than 10,000 results in a single query
---
class: contentpage
### **4.1 Elasticsearch Indexing, searching documents**
About Elasticsearch:
- Searching in Elasticsearch is very accurate and incredibly fast
- It’s highly scalable across servers and it's easy to replicate data between indexes
- Elasticsearch is very popular for log searching and analysis
- Documents are the basic units of information within Elasticsearch, and are stored in JSON format
Using Elasticsearch from Python requires installation of the Elasticsearch software, plus a pip installation of the `elasticsearch` client library:
```sh
pip install elasticsearch
```
An Elasticsearch instance can be reached from your computer at `http://localhost:9200` (or another port number). From here you can view the contents of your index.
```python
from elasticsearch import Elasticsearch
import sys

# Set up Elasticsearch
ES = Elasticsearch(['http://localhost:9200'])
if not ES.ping():
    print("Something's broken with your path: 'http://localhost:9200'")
    sys.exit()
```
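From the same client you can index and search documents. A minimal sketch, assuming a recent (8.x) `elasticsearch` client and a hypothetical `film_scans` index with illustrative field names:
```python3
from elasticsearch import Elasticsearch

ES = Elasticsearch(['http://localhost:9200'])

# Index a JSON document; the index is created automatically if missing
ES.index(index='film_scans', id='1', document={'fname': '0086400.dpx', 'status': 'In Progress'})

# Search for documents matching a status value
results = ES.search(index='film_scans', query={'match': {'status': 'In Progress'}})
for hit in results['hits']['hits']:
    print(hit['_source'])
```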
---
class: contentpage
### **4.1 Elasticsearch Indexing, searching documents**
To help demonstrate how both Elasticsearch and SQLite work, I have built a script which creates an example of each, allows us to populate them with data, then search and update specific fields in both. You can launch it from the Python shell again:
```python
import sqlite_elasticsearch as se
se.create_index()
```
Visit `http://localhost:9200` to see the Elasticsearch instance that's launched. Then you can populate your Elasticsearch index with a folder full of files, such as the `film_scans` folder with the DPX files in:
```python
se.populate_elasticsearch('../media/film_scans/')
se.search_status('Complete')
se.search_status('In Progress')
se.update_status('')
```
---
class: contentpage
### **4.2 Connecting to SQLite databases**
SQLite is one of the most used database engines in the world, operational on mobile phones and computers worldwide. It's a lightweight, open-source database, and you can easily create an SQLite database using Python's sqlite3 library.
The database is stored into a `.db` file in a location of your choice. For this script it will appear inside your `thursday_scripts` folder.
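As a flavour of how little code this takes, a minimal `sqlite3` sketch (the table and filenames here are illustrative):
```python3
import sqlite3

# connect() creates the .db file if it does not already exist
connection = sqlite3.connect('film_scans.db')
cursor = connection.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS scans (fname TEXT, status TEXT)")
cursor.execute("INSERT INTO scans VALUES (?, ?)", ('0086400.dpx', 'In Progress'))
connection.commit()

# Query the rows straight back using a parameterised SELECT
for row in cursor.execute("SELECT * FROM scans WHERE status = ?", ('In Progress',)):
    print(row)
connection.close()
```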
Return to the Python shell, and import the `sqlite_elasticsearch` script to test SQLite functionality:
```python3
import sqlite_elasticsearch as se
se.create_database_content('../media/film_scan/')
all_entries = se.fetch_all()
complete_entries = se.fetch_database_content('status', 'Complete')
filename = se.fetch_database_content('fname', '0086400.dpx')
```
The script also includes a function that will change a status; you can then fetch the record again to see that it has changed:
```python3
se.change_status('0086400.dpx', 'Transcoding', 'Complete')
filename = se.fetch_database_content('fname', '0086400.dpx')
```
</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js" type="text/javascript"></script>
<script type="text/javascript">var slideshow = remark.create({ratio: "16:9"});</script>
</body>
</html>