Commit

normalize dataset names to new versions after 2024 portal relaunch
knudmoeller committed Nov 28, 2024
1 parent 01282f7 commit c6976b6
Showing 4 changed files with 43,962 additions and 46,787 deletions.
50 changes: 48 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,52 @@ The structure of the data is as follows:
}
```

## Normalization of Dataset Names

In August 2024, the Open Data Portal received a major update, which resulted in slightly changed URLs for many datasets.
A mapping from the old dataset names to the new ones is maintained at https://github.com/berlinonline/berlin_dataset_name_mapping.

Starting in November 2024, the dataset names in the usage statistics have been normalized: the new names are used everywhere, so that usage can be compared over time.

In cases where requests to both the old and the new name were made in the same month, the two entries have been combined into a single one, with impressions and visits summed.
This normalization is implemented in the [map_dataset_names.py](bin/map_dataset_names.py) script.

For example, there is the following mapping:

```csv
old_name,new_name
verlauf-der-berliner-mauer-1989-wms,verlauf-der-berliner-mauer-1989-wms-bc24fb23
```

The usage data contains:

```json
...
"2024-10": {
"verlauf-der-berliner-mauer-1989-wms-bc24fb23": {
"impressions": 247,
"visits": 210
},
...
"verlauf-der-berliner-mauer-1989-wms": {
"impressions": 1,
"visits": 1
},
...
```

These two entries have been combined into:

```json
...
"2024-10": {
"verlauf-der-berliner-mauer-1989-wms-bc24fb23": {
"impressions": 248,
"visits": 211
},
...
```
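The combination step shown above can be sketched as a small standalone function. This is an illustrative sketch, not the repository's implementation; the function and variable names here are hypothetical, while the merge rule (sum impressions and visits, fall back to the unchanged name when no mapping exists) follows the description above:

```python
def combine_entries(month_stats: dict, mapping: dict) -> dict:
    """Merge usage entries whose old and new dataset names map to the same name."""
    combined = {}
    for name, stats in month_stats.items():
        # Unmapped names fall through unchanged.
        new_name = mapping.get(name, name)
        if new_name in combined:
            # Old and new name both occur in this month: sum the two entries.
            combined[new_name] = {
                "impressions": combined[new_name]["impressions"] + stats["impressions"],
                "visits": combined[new_name]["visits"] + stats["visits"],
            }
        else:
            combined[new_name] = dict(stats)
    return combined

month = {
    "verlauf-der-berliner-mauer-1989-wms-bc24fb23": {"impressions": 247, "visits": 210},
    "verlauf-der-berliner-mauer-1989-wms": {"impressions": 1, "visits": 1},
}
mapping = {
    "verlauf-der-berliner-mauer-1989-wms": "verlauf-der-berliner-mauer-1989-wms-bc24fb23",
}
# The two entries collapse into one, with impressions 248 and visits 211.
print(combine_entries(month, mapping))
```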

## License

All software in this repository is published under the [MIT License](LICENSE). All data in this repository (in particular the `.csv` and `.json` files) is published under [CC BY 3.0 DE](https://creativecommons.org/licenses/by/3.0/de/).
Expand All @@ -127,6 +173,6 @@ Dataset URL: [https://daten.berlin.de/datensaetze/zugriffsstatistik-daten-berlin

This page was generated from the github repository at [https://github.com/berlinonline/berlin_dataportal_usage](https://github.com/berlinonline/berlin_dataportal_usage).

2020, Knud Möller, [BerlinOnline GmbH](https://www.berlinonline.net)
2024, Knud Möller, [BerlinOnline GmbH](https://www.berlinonline.net)

Last changed: 2024-11-27
Last changed: 2024-11-28
57 changes: 57 additions & 0 deletions bin/map_dataset_names.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
'''
A script to replace old package/dataset names in the usage data with the new ones.
See https://github.com/berlinonline/berlin_dataset_name_mapping
'''

import csv
import json
import logging
import sys
import urllib.request
from datetime import datetime

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)

def sum_stats(stat_1: dict, stat_2: dict) -> dict:
    '''Combine two usage entries by summing their impressions and visits.'''
    return {
        'impressions': stat_1['impressions'] + stat_2['impressions'],
        'visits': stat_1['visits'] + stat_2['visits'],
    }

# Read the old-name -> new-name mapping from the mapping repository.
mapping_url = "https://raw.githubusercontent.com/berlinonline/berlin_dataset_name_mapping/refs/heads/main/dataset_name_mapping.2024-09-06.csv"
LOG.info(f" reading dataset name mappings from {mapping_url} ...")
response = urllib.request.urlopen(mapping_url)
lines = [line.decode('utf-8') for line in response.readlines()]
reader = csv.DictReader(lines)
mapping = {row['old_name']: row['new_name'] for row in reader}

usage_data_path = "data/current/daten_berlin_de.stats.json"
if len(sys.argv) > 1:
    usage_data_path = sys.argv[1]

LOG.info(f" reading usage data from {usage_data_path} ...")
with open(usage_data_path) as f:
    usage_data = json.load(f)

sub_page_counts = usage_data['stats']['pages']['datensaetze']['sub_page_counts']

LOG.info(" normalising dataset names in usage data ...")
updated_usage_data = {}
for date, ids in sub_page_counts.items():
    updated_usage_data[date] = {}
    for current_package_name, stats in ids.items():
        # Fall back to the current name if there is no mapping for it.
        new_package_name = mapping.get(current_package_name, current_package_name)
        if new_package_name in updated_usage_data[date]:
            # Old and new name both occur in this month: merge the two entries.
            stats = sum_stats(stats, updated_usage_data[date][new_package_name])
        updated_usage_data[date][new_package_name] = stats

usage_data['stats']['pages']['datensaetze']['sub_page_counts'] = updated_usage_data
usage_data['timestamp'] = datetime.now().astimezone().strftime('%Y-%m-%d %H:%M:%S %z')

print(json.dumps(usage_data, indent=2))
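One property of this normalization worth noting: because unmapped names fall through unchanged and the mapped-to names are already the new ones, running the normalization over already-normalized data changes nothing. A minimal check of this idempotence, with hypothetical names and counts (the real script reads its mapping from the repository above):

```python
mapping = {"old-name": "new-name"}

def normalize(ids: dict) -> dict:
    """Apply the name mapping to one month of usage entries, merging collisions."""
    out = {}
    for name, stats in ids.items():
        new_name = mapping.get(name, name)
        if new_name in out:
            # Sum impressions and visits when old and new name collide.
            stats = {k: out[new_name][k] + stats[k] for k in ("impressions", "visits")}
        out[new_name] = stats
    return out

once = normalize({"old-name": {"impressions": 1, "visits": 1},
                  "new-name": {"impressions": 9, "visits": 5}})
twice = normalize(once)
print(once == twice)  # True: a second pass is a no-op
```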
