<!DOCTYPE html>
<html>
<head>
<title>Digital Summer School 2024: THU03</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="../style.css"> </head>
<body>
<textarea id="source">
class: center, middle, titlepage
### THU03: *Scaling Digital Preservation Workflows*
---
class: contentpage
### **Agenda**
1. Scheduling Tools
1.1 Unix Cron
1.2 Windows Task Scheduler
2. Environment variables
2.1 Setting them in Unix/Windows
2.2 Accessing them in code
3. Expanding workflows with APIs
3.1 RESTful API basics
3.2 Requests library
3.3 Rate limiting
4. Elasticsearch and SQLite databases
4.1 Elasticsearch
4.2 SQLite database
---
class: contentpage
### **1. Scheduling Tools**
Script scheduling involves automating the execution of scripts or programs at predetermined times or intervals. This is crucial for:
- Routine maintenance tasks
- Automated backups
- Periodic data processing
- Recurring reports generation
Key components of script scheduling include the scheduling mechanism, execution parameters and the script, programme or command.
```sh
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
#
```
---
class: contentpage
### **1.1 Unix Cron**
Cron is the standard Unix scheduler for Mac and Linux. It uses crontab files to manage schedules, structuring the timing parameters: `Minute Hour Day Month DayOfWeek Command`
```sh
# MIN HR DAY MTH DAY/WK USR FLOCK LOCK DETAILS COMMAND
*/15 * * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_rawcook.lock ${DPX_SCRIPTS}qnap/dpx_rawcook.sh
```
This example runs a shell script for automated DPX encoding every 15 minutes. The Linux `flock` lock can help prevent multiple instances of a script from running concurrently, which is essential for maintaining system stability and data integrity.
Advanced cron features include (shown together in the sketch below):
- Reboot tasks at system startup using the `@reboot` prefix
- Environment variable setting by adding `$PATH` information at the top of the crontab
- Log redirection to capture stdout/stderr: `>> /path/to/logfile 2>&1`
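A brief sketch of how those features might look together in a user crontab (the script path and log location are illustrative only):
```sh
# Environment variables set at the top apply to every job below
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Run once at system startup, appending stdout/stderr to a log file
@reboot /home/user/scripts/startup_checks.sh >> /var/log/startup_checks.log 2>&1
```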
To launch a cron editing session you can use one of the following:
- `crontab -e` opens the current user's crontab for editing. All users, including root, can schedule scripts here
- `/etc/crontab` is a system-wide crontab intended for system scheduling purposes
---
class: contentpage
### **1.1 Unix Cron**
```sh
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields.
# MIN HR DAY MTH DAY/WK USR FLOCK LOCK DETAILS COMMAND
*/15 * * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_rawcook.lock ${DPX_SCRIPTS}qnap/dpx_rawcook.sh
0 */4 * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_post_rawcook.lock ${DPX_SCRIPTS}qnap/dpx_post_rawcook.sh
30 */4 * * * root /usr/bin/flock -w 0 --verbose /var/run/dpx_check_script.lock ${DPX_SCRIPTS}qnap/dpx_check_script.sh
0 20 * * 5 root ${PY3_ENV} ${DPX_SCRIPTS}qnap/dpx_part_whole_move.py > /tmp/python_cron.log 2>&1
*/5 * * * * root ${CODE}flock_rebuild.sh
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
17 * * * * root cd / && run-parts --report /etc/cron.hourly
25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6 1 * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )
#
```
---
class: contentpage
### **1.1 Unix Cron**
When cron jobs fail it can be hard to identify the cause of the problem. Common causes include:
- A syntax error somewhere in a schedule line
- Incorrect numbers in a column (e.g. an 8 or 9 in the day-of-week column)
- An unknown or mistyped user name
- Insufficient user permissions for the command
When the crontab fails to run, these steps can help you find the problem:
- Restart cron after making changes
```sh
sudo systemctl restart cron
```
- Check the logs
```sh
/var/log/cron (CentOS) /var/log/syslog (Ubuntu or Debian)
```
- Run the CronitorCLI software, which includes a shell command to test-run any job emulating cron, with debugging output
```sh
cronitor select
```
---
class: contentpage
### **1.2 Windows Task Scheduler**
The Windows Task Scheduler is the Windows equivalent of cron.
It provides a graphical interface for creating and managing scheduled tasks, which can also be managed via the command line (`schtasks`).
You can launch the Task Scheduler GUI from `CMD.exe` or `PowerShell` by typing:
```sh
taskschd.msc
```
The Task Scheduler offers key features:
- Flexible scheduling options (daily, weekly, monthly, etc.)
- Ability to set conditions (e.g., run only when idle)
- Can run scripts, programs, or send emails
Example schtasks command:
```sh
schtasks /create /tn "Backup daily" /tr C:\scripts\backup.bat /sc daily /st 12:00
```
This command creates a task named "Backup daily" that runs `backup.bat` every day at 12:00 noon.
---
class: contentpage
### **2. Environment variables**
Environment variables are dynamic values that can affect the behavior of running processes on a computer. They are part of the environment in which a process runs and are typically used to:
- Store configuration settings
- Manage system-wide settings
- Pass information between processes
- Enhance security by keeping sensitive data out of code
- Facilitate cross-environment development
Advantages when using them in your code:
- They can change application behaviour without modifying the code itself
- They store sensitive information (like API keys), keeping it out of repositories
- The same variable name can be used across environments but contain different information, like paths
- They help manage different settings for development, testing and production
- They aid cross-platform code compatibility when storing OS-specific data
---
class: contentpage
### **2.1 Setting environment variables in Windows**
Create a temporary environment variable for the current session only - make the variable name unique!
In Command Prompt, then PowerShell:
```sh
set MY_VARIABLE=my_value
$env:MY_VARIABLE = "my_value"
```
Permanent for the logged-in user.
Using Command Prompt, then PowerShell:
```sh
setx MY_VARIABLE "my_value"
[System.Environment]::SetEnvironmentVariable("MY_VARIABLE", "my_value", "User")
```
System-wide using PowerShell run as Administrator:
```sh
[System.Environment]::SetEnvironmentVariable("MY_VARIABLE", "my_value", "Machine")
```
---
class: contentpage
### **2.1 Setting environment variables in Unix**
You can create temporary, permanent and system-wide environment variables.
Temporary variable to be used only in your current session:
```sh
export MY_VARIABLE="my_value"
```
Permanent for the specific user creating the variable. Depending on the shell being used, you would add this to `~/.bashrc`, `~/.bash_profile`, or `~/.zshrc`:
```sh
echo 'export MY_VARIABLE="my_value"' >> ~/.bashrc
source ~/.bashrc
```
You can add a system-wide environment variable so all users and automations can access it, either by appending to `/etc/environment` (using `sudo tee`, since plain shell redirection would run without root permissions) or by editing the file directly:
```sh
echo 'MY_VARIABLE="my_value"' | sudo tee -a /etc/environment
sudo nano /etc/environment
```
To make the variable available in your current session, source the file; other sessions will pick it up after you log out and in again:
```sh
source /etc/environment
```
---
class: contentpage
### **2.2 Accessing environment variables in code**
You have seen many examples of this already, particularly in the example bash scripts:
```sh
variable="$ENV_VARIABLE"
echo $ENV_VARIABLE
```
To access them in Python code:
```python3
import os
import sys

# Safely check whether the environment variable exists
if 'ENV_VARIABLE' in os.environ:
    # Retrieve the environment variable
    variable = os.environ['ENV_VARIABLE']
else:
    sys.exit("ENV_VARIABLE is not set")
```
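`os.environ.get()` also accepts a fallback value, which avoids the existence check when a sensible default exists. A short sketch (the variable name below matches the cron examples earlier, and the default path is illustrative):
```python3
import os

# Fall back to a default path if the variable is unset
scripts_path = os.environ.get('DPX_SCRIPTS', '/home/user/scripts/')
print(f"Using scripts path: {scripts_path}")
```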
---
class: contentpage
### **3. Expanding workflows with APIs**
.center[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-22 at 12.39.31.png" width="840">]
---
class: contentpage
### **3.1 RESTful API basics**
RESTful APIs (REpresentational State Transfer) allow two systems to exchange information over the internet, using the HTTP methods GET, PUT, POST and DELETE.
Key elements of RESTful APIs include:
- Client or software on user computer initiates the communication
- Server offers the API as a means of accessing its data or service
- The client retrieves a resource from the API
- Each REST message sent and received is self-descriptive
Different HTTP response codes are returned depending on the success of your call (see the sketch after this list):
- 200 reports that a request has been successful
- 201 can sometimes be used to report a successful create (POST)
- 400 Bad Request is returned when something is wrong with the call or the endpoint
- 404 Not Found is returned when an API URL is incorrect or the API is off-line
- 401 is returned when an unauthorised request is placed
- 500 is returned when the API system has a problem, like the API cannot connect to the database
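A minimal sketch of checking these codes with Python's requests library (the URL is the local Flask example introduced later in this session):
```python3
import requests

response = requests.get("http://127.0.0.1:5000/api/v1/resources/books/all")
if response.status_code == 200:
    print(f"Success: {len(response.json())} records returned")
elif response.status_code == 404:
    print("Not Found: check the endpoint URL, or whether the API is running")
else:
    # raise_for_status() converts any other 4xx/5xx code into an exception
    response.raise_for_status()
```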
---
class: contentpage
### **3.1 RESTful API basics**
- An endpoint URL for the API
```sh
URL = "http://127.0.0.1:5000/api/v1/resources/books"
```
- The HTTP method for Create (POST), Read (GET), Update (PUT), Delete (DELETE) - A 'CRUD' action
```sh
POST
```
- HTTP headers such as authentication tokens or content type confirmation
```sh
"content-type": "application/json"
"accept": "application/json"
"apikey": os.environ['API_KEY']
```
- Body data usually supplied in XML or JSON encoded strings
```sh
{
"title": "Titus Groan",
"author": "Mervyn Peake",
"first_sentence": "Gormenghast, that is, the main massing of the original stone, taken by itself would have displayed a certain ponderous architectural quality were it possible to have ignored the circumfusion of those mean dwellings that swarmed like an epidemic around its outer walls",
"year_published": "1968"
}
```
---
class: contentpage
### **3.2 Requests library**
.left[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-23 at 22.18.23.png" width="600">]
---
class: contentpage
### **3.2 Requests library**
Requests is the standard method for sending HTTP requests in Python. The library simplifies the process of sending and receiving, leaving developers with more time to process the data received!
Let's try out the requests library using a Flask API linked to a database of sci-fi book authors. Open a new terminal/command prompt and `cd` into the `thursday_scripts` folder:
```sh
pip install Flask
python3 api/launch_api.py
```
Flask is a micro web framework written in Python that is well-suited for building REST APIs due to its flexibility and simplicity. This API has no authentication requirements and requires the wifi to be on.
Load up your web browser and see if this address is showing you a basic database:
`http://127.0.0.1:5000/api/v1/resources/books/all`
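For a sense of what sits behind that URL, a minimal sketch of a Flask endpoint (illustrative only, not the actual `launch_api.py`):
```python3
from flask import Flask, jsonify

app = Flask(__name__)
BOOKS = [{"id": 1, "title": "Among Others", "author": "Jo Walton"}]

@app.route('/api/v1/resources/books/all', methods=['GET'])
def all_books():
    # Return the whole collection as JSON
    return jsonify(BOOKS)

if __name__ == '__main__':
    app.run(port=5000)
```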
---
class: contentpage
### **3.2 Requests library**
<font color="orange">Practice:</font> Install the requests library from your terminal with `pip install requests`, then return to your Python shell. Test a simple GET request and look at the different responses available:
```python
import requests

# Set the URL for retrieving all contents
url = "http://127.0.0.1:5000/api/v1/resources/books/all"
# Create the HTTP GET request
response = requests.get(url)
# Assess the response data
print(f"Method used: {response.request.method}")
print(f"Response status code: {response.status_code}")
print(f"Content-Type: {response.headers.get('Content-Type')}")
print(f"Response date: {response.headers.get('Date')}")
print(f"Response: {response.json()}")
```
---
class: contentpage
### **3.2 Requests library**
<font color="orange">Practice:</font> Try more focused GET requests by picking an author and searching for the author's name, publishing date or title of the book:
```python3
import requests
URL = "http://127.0.0.1:5000/api/v1/resources/books"
query = {
"author": "Jo Walton"
}
book = requests.get(URL, params=query, verify=False)
print(book.text)
print(book.json())
```
---
class: contentpage
### **3.2 Requests library**
To make a POST request, use a book sample from the repository path `thursday_scripts/api/new_books.json`. The payload dictionary needs wrapping in the `json.dumps()` method, and ensure you include the headers needed to inform the API how the POST data is formatted:
```python
import json
import requests
URL = "http://127.0.0.1:5000/api/v1/resources/books"
HEADERS = {'Content-Type': 'application/json'}
# Populate a JSON dictionary with sample data from the new_books.json and pick a unique ID
# to avoid replicating another user's POST
payload = {
"id": 49,
"title": "Titus Groan",
"author": "Mervyn Peake",
"first_sentence": "Gormenghast, that is, the main massing of the original stone...",
"year_published": "1968"
}
response = requests.post(URL, headers=HEADERS, data=json.dumps(payload))
print(response.json())
```
---
class: contentpage
### **3.2 Requests library**
You can refresh the website to see if the new book is there:
`http://127.0.0.1:5000/api/v1/resources/books/all`
It's possible to delete the book again using a DELETE request. Import `json` again to format the delete query. Select the title and unique ID of the book you just wrote, and build a query dictionary to identify the record for deletion.
```python
import json
import requests
URL = "http://127.0.0.1:5000/api/v1/resources/books"
HEADERS = {'Content-Type': 'application/json'}
query = {
"title": "Titus Groan",
"id": 49
}
response = requests.delete(URL, headers=HEADERS, data=json.dumps(query))
print(response.json())
```
---
class: contentpage
### **3.3 Rate limiting and Retrying**
API rate limiting is a technique used to control the number of requests that a client can make to an API within a specified period.
- Rate limiting protects against Denial-of-Service (DoS) attacks and offenses like spamming API endpoints
- By limiting requests, servers can efficiently manage resources so no one client can dominate access
- It can save money when APIs are based in cloud environments and charged based on usage
- There may be legal or contractual obligations to limit data throughput or access
If you encounter problems, you can manage the way your code calls the API by using Python tools like `ratelimit` to control how often your requests are made:
```python3
from ratelimit import limits
import requests
FIFTEEN_MINUTES = 900
@limits(calls=15, period=FIFTEEN_MINUTES)
def call_api(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    return response
```
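With this decorator, the sixteenth call inside a fifteen-minute window raises a `RateLimitException`. The same library also provides `sleep_and_retry`, which pauses until the window resets instead of raising. A short sketch:
```python3
from ratelimit import limits, sleep_and_retry
import requests

FIFTEEN_MINUTES = 900

@sleep_and_retry
@limits(calls=15, period=FIFTEEN_MINUTES)
def call_api(url):
    # Sleeps until the rate limit window resets, rather than raising
    response = requests.get(url)
    response.raise_for_status()
    return response
```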
---
class: contentpage
### **3.3 Rate limiting and Retrying**
Sometimes APIs have communication problems, like connection timeouts, or an internal error such as a failed database connection (500 error code). When this happens it's useful to have some means of catching the failure and retrying your call. Tenacity is my favourite Python package for this: a library for retrying code after failures of these kinds.
```sh
pip install tenacity
```
Tenacity uses function `decorators` to tell functions to retry. By default `@retry` will retry after any raised exception.
```python3
import requests
from tenacity import retry

@retry
def func(url):
    try:
        response = requests.get(url)
        return response
    except requests.exceptions.Timeout as error:
        raise error
```
It's really easy to customise the `@retry` decorator to limit the number of retries and to set a wait duration:
```python
@retry(stop=stop_after_attempt(5), wait=wait_fixed(2))
```
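Putting these pieces together, a sketch of a retrying API call that only retries on timeouts (the function name is illustrative):
```python3
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

@retry(retry=retry_if_exception_type(requests.exceptions.Timeout),
       stop=stop_after_attempt(5), wait=wait_fixed(2))
def call_api(url):
    # Retry up to 5 times, waiting 2 seconds between attempts,
    # but only when the request times out
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response
```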
---
class: contentpage
### **4. Elasticsearch indexes and SQLite databases**
The BFI National Archive have a few resources built upon Elasticsearch and SQLite. We use them to store temporary request information between departments.
.center[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-23 at 21.50.54.png" width="800">]
Our SQLite interaction gathers user requests from curatorial teams to preserve television programming. A web interface is provided for them to select a TV channel and date; this is stored in our SQLite database and queried daily by Python, which moves files automatically from their backup location into our ingest pathway.
---
class: contentpage
### **4. Elasticsearch indexes and SQLite databases**
.center[<img src="https://raw.githubusercontent.com/digitensions/summer-school-2024-local/main/thursday/images/Screenshot 2024-09-23 at 21.52.42.png" width="1000">]
Our Elasticsearch index integrates with the BFI DPI Browser, which allows colleagues to review ingested files and request downloads/transcodes from our Black Pearl tape library. The Elasticsearch index receives this data from the UI frontend, and is queried by my Python scripts to see if new download requests have been placed.
When a file has been processed the `Current status` is updated from `Requested` to `Download complete`, and the Python scripts then ignore it.
Storage capacity:
- An SQLite database can hold up to 281TB, equivalent to many trillions of rows of data (a theoretical maximum of 18446744073709551616)
- Elasticsearch in comparison limits a single document to 2GB, and by default you cannot retrieve more than 10,000 results in a single query
---
class: contentpage
### **4.1 Elasticsearch Indexing, searching documents**
About Elasticsearch:
- Searching in Elasticsearch is very accurate and incredibly fast
- It’s highly scalable across servers and it's easy to replicate data between indexes
- Elasticsearch is very popular for log searching and analysis
- Documents are the basic units of information within Elasticsearch, and are stored in JSON format
Using Elasticsearch from Python requires installation of the Elasticsearch software, plus a pip installation of the `elasticsearch` client library:
```sh
pip install elasticsearch
```
An Elasticsearch instance can be reached from your computer at `http://localhost:9200` (or another port number). From here you can view the contents of your index.
```python
from elasticsearch import Elasticsearch
import sys

# Set up Elasticsearch
ES = Elasticsearch(['http://localhost:9200'])
if not ES.ping():
    print("Something's broken with your path: 'http://localhost:9200'")
    sys.exit()
```
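From the same client you can index and search documents. A minimal sketch, assuming a recent (8.x) `elasticsearch` client and a hypothetical `film_scans` index with illustrative field names:
```python3
from elasticsearch import Elasticsearch

ES = Elasticsearch(['http://localhost:9200'])

# Index a JSON document; the index is created automatically if missing
ES.index(index='film_scans', id='1', document={'fname': '0086400.dpx', 'status': 'In Progress'})

# Search for documents matching a status value
results = ES.search(index='film_scans', query={'match': {'status': 'In Progress'}})
for hit in results['hits']['hits']:
    print(hit['_source'])
```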
---
class: contentpage
### **4.1 Elasticsearch Indexing, searching documents**
To help demonstrate how both Elasticsearch and SQLite work, I have built a script which creates an example of each, allows us to populate them with data, then search and update specific fields in both. You can launch it from the Python shell again:
```python
import sqlite_elasticsearch as se
se.create_index()
```
Visit `http://localhost:9200` to see the Elasticsearch instance that's launched. Then you can populate your Elasticsearch index with a folder full of files, such as the `film_scans` folder with the DPX files in:
```python
se.populate_elasticsearch('../media/film_scans/')
se.search_status('Complete')
se.search_status('In Progress')
se.update_status('')
```
---
class: contentpage
### **4.2 Connecting to SQLite databases**
SQLite is one of the most used database engines in the world, operational on mobile phones and computers worldwide. It's a lightweight, open-source database, and you can easily create an SQLite database using Python's sqlite3 library.
The database is stored into a `.db` file in a location of your choice. For this script it will appear inside your `thursday_scripts` folder.
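As a flavour of how little code this takes, a minimal `sqlite3` sketch (the table and filenames here are illustrative):
```python3
import sqlite3

# connect() creates the .db file if it does not already exist
connection = sqlite3.connect('film_scans.db')
cursor = connection.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS scans (fname TEXT, status TEXT)")
cursor.execute("INSERT INTO scans VALUES (?, ?)", ('0086400.dpx', 'In Progress'))
connection.commit()

# Query the rows straight back using a parameterised SELECT
for row in cursor.execute("SELECT * FROM scans WHERE status = ?", ('In Progress',)):
    print(row)
connection.close()
```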
Return to the Python shell, and import the `sqlite_elasticsearch` script to test SQLite functionality:
```python3
import sqlite_elasticsearch as se
se.create_database_content('../media/film_scan/')
all_entries = se.fetch_all()
complete_entries = se.fetch_database_content('status', 'Complete')
filename = se.fetch_database_content('fname', '0086400.dpx')
```
The script also includes a function that will change a status; you can then fetch the record again to see that it has changed:
```python3
se.change_status('0086400.dpx', 'Transcoding', 'Complete')
filename = se.fetch_database_content('fname', '0086400.dpx')
```
</textarea>
<script src="https://remarkjs.com/downloads/remark-latest.min.js" type="text/javascript"></script>
<script type="text/javascript">var slideshow = remark.create({ratio: "16:9"});</script>
</body>
</html>