Commit

normalize dataset names to new versions after 2024 portal relaunch
knudmoeller committed Nov 28, 2024
1 parent 01282f7 commit c6976b6
Showing 4 changed files with 43,962 additions and 46,787 deletions.
50 changes: 48 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,52 @@ The structure of the data is as follows:
}
```

## Normalization of Dataset Names

In August 2024, the Open Data Portal received a major update, which resulted in slightly changed URLs for many datasets.
A mapping from the old dataset names to the new ones is maintained at https://github.com/berlinonline/berlin_dataset_name_mapping.

Starting in November 2024, the dataset names in the usage statistics have been normalized: the new names are used everywhere, so that usage can be compared over time.

In cases where requests to both the old and the new name were made in the same month, the two entries have been combined into a single one, with impressions and visits summed.
This normalization is implemented in the [map_dataset_names.py](bin/map_dataset_names.py) script.

For example, there is the following mapping:

```csv
old_name,new_name
verlauf-der-berliner-mauer-1989-wms,verlauf-der-berliner-mauer-1989-wms-bc24fb23
```

The usage data contains:

```json
...
"2024-10": {
"verlauf-der-berliner-mauer-1989-wms-bc24fb23": {
"impressions": 247,
"visits": 210
},
...
"verlauf-der-berliner-mauer-1989-wms": {
"impressions": 1,
"visits": 1
},
...
```

These two entries have been combined into:

```json
...
"2024-10": {
"verlauf-der-berliner-mauer-1989-wms-bc24fb23": {
"impressions": 248,
"visits": 211
},
...
```
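The combination step shown above can be sketched as a small standalone function. This is an illustrative sketch, not the repository's implementation; the function and variable names here are hypothetical, while the merge rule (sum impressions and visits, fall back to the unchanged name when no mapping exists) follows the description above:

```python
def combine_entries(month_stats: dict, mapping: dict) -> dict:
    """Merge usage entries whose old and new dataset names map to the same name."""
    combined = {}
    for name, stats in month_stats.items():
        # Unmapped names fall through unchanged.
        new_name = mapping.get(name, name)
        if new_name in combined:
            # Old and new name both occur in this month: sum the two entries.
            combined[new_name] = {
                "impressions": combined[new_name]["impressions"] + stats["impressions"],
                "visits": combined[new_name]["visits"] + stats["visits"],
            }
        else:
            combined[new_name] = dict(stats)
    return combined

month = {
    "verlauf-der-berliner-mauer-1989-wms-bc24fb23": {"impressions": 247, "visits": 210},
    "verlauf-der-berliner-mauer-1989-wms": {"impressions": 1, "visits": 1},
}
mapping = {
    "verlauf-der-berliner-mauer-1989-wms": "verlauf-der-berliner-mauer-1989-wms-bc24fb23",
}
# The two entries collapse into one, with impressions 248 and visits 211.
print(combine_entries(month, mapping))
```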

## License

All software in this repository is published under the [MIT License](LICENSE). All data in this repository (in particular the `.csv` and `.json` files) is published under [CC BY 3.0 DE](https://creativecommons.org/licenses/by/3.0/de/).
Expand All @@ -127,6 +173,6 @@ Dataset URL: [https://daten.berlin.de/datensaetze/zugriffsstatistik-daten-berlin

This page was generated from the github repository at [https://github.com/berlinonline/berlin_dataportal_usage](https://github.com/berlinonline/berlin_dataportal_usage).

2020, Knud Möller, [BerlinOnline GmbH](https://www.berlinonline.net)
2024, Knud Möller, [BerlinOnline GmbH](https://www.berlinonline.net)

Last changed: 2024-11-27
Last changed: 2024-11-28
57 changes: 57 additions & 0 deletions bin/map_dataset_names.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
'''
A script to replace old package/dataset names in the usage data with the new ones.
See https://github.com/berlinonline/berlin_dataset_name_mapping
'''

import csv
import json
import logging
import sys
import urllib.request
from datetime import datetime

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)

def sum_stats(stat_1: dict, stat_2: dict) -> dict:
    '''Combine two usage entries by summing their impressions and visits.'''
    return {
        'impressions': stat_1['impressions'] + stat_2['impressions'],
        'visits': stat_1['visits'] + stat_2['visits'],
    }

# Read the old-name -> new-name mapping from the mapping repository.
mapping_url = "https://raw.githubusercontent.com/berlinonline/berlin_dataset_name_mapping/refs/heads/main/dataset_name_mapping.2024-09-06.csv"
LOG.info(f" reading dataset name mappings from {mapping_url} ...")
response = urllib.request.urlopen(mapping_url)
lines = [line.decode('utf-8') for line in response.readlines()]
reader = csv.DictReader(lines)
mapping = {row['old_name']: row['new_name'] for row in reader}

usage_data_path = "data/current/daten_berlin_de.stats.json"
if len(sys.argv) > 1:
    usage_data_path = sys.argv[1]

LOG.info(f" reading usage data from {usage_data_path} ...")
with open(usage_data_path) as f:
    usage_data = json.load(f)

sub_page_counts = usage_data['stats']['pages']['datensaetze']['sub_page_counts']

LOG.info(" normalising dataset names in usage data ...")
updated_usage_data = {}
for date, ids in sub_page_counts.items():
    updated_usage_data[date] = {}
    for current_package_name, stats in ids.items():
        # Fall back to the current name if there is no mapping for it.
        new_package_name = mapping.get(current_package_name, current_package_name)
        if new_package_name in updated_usage_data[date]:
            # Old and new name both occur in this month: merge the two entries.
            stats = sum_stats(stats, updated_usage_data[date][new_package_name])
        updated_usage_data[date][new_package_name] = stats

usage_data['stats']['pages']['datensaetze']['sub_page_counts'] = updated_usage_data
usage_data['timestamp'] = datetime.now().astimezone().strftime('%Y-%m-%d %H:%M:%S %z')

print(json.dumps(usage_data, indent=2))
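One property of this normalization worth noting: because unmapped names fall through unchanged and the mapped-to names are already the new ones, running the normalization over already-normalized data changes nothing. A minimal check of this idempotence, with hypothetical names and counts (the real script reads its mapping from the repository above):

```python
mapping = {"old-name": "new-name"}

def normalize(ids: dict) -> dict:
    """Apply the name mapping to one month of usage entries, merging collisions."""
    out = {}
    for name, stats in ids.items():
        new_name = mapping.get(name, name)
        if new_name in out:
            # Sum impressions and visits when old and new name collide.
            stats = {k: out[new_name][k] + stats[k] for k in ("impressions", "visits")}
        out[new_name] = stats
    return out

once = normalize({"old-name": {"impressions": 1, "visits": 1},
                  "new-name": {"impressions": 9, "visits": 5}})
twice = normalize(once)
print(once == twice)  # True: a second pass is a no-op
```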
