Add GCMD keywords extractor util #51

Merged
merged 7 commits into from
Jun 7, 2024
1 change: 1 addition & 0 deletions .github/workflows/build_deploy_edge.yml
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@ on:
- '**/*.md'
- '.github/environment/**'
- 'geonetwork-config/**'
- 'utilities/**'

permissions:
id-token: write
1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -5,6 +5,7 @@ on:
paths-ignore:
- '**/*.md'
- '.github/environment/**'
- 'utilities/**'

concurrency:
group: ${{ github.ref }}
2 changes: 2 additions & 0 deletions .gitignore
@@ -33,3 +33,5 @@ gn4_data/
**/target/

**/.git-versioned-pom.xml

**/__pycache__
4 changes: 4 additions & 0 deletions README.md
@@ -121,3 +121,7 @@ Once you have the json, you can generate code like the one here in Java to acces
The repository contains a config file related to S3, but it is not used: in our experiments it
did not work well, because GN4 issues file-not-found warnings when given a relative folder name. The
code is kept only as a record.

## Utilities folder

Utilities for working with GeoNetwork are available in the `utilities` folder of this repository. See the README files in that folder for details.
1 change: 1 addition & 0 deletions utilities/README.md
@@ -0,0 +1 @@
# Utilities for GeoNetwork
14 changes: 14 additions & 0 deletions utilities/geonetwork-gcmd-extractor/.gitignore
@@ -0,0 +1,14 @@
.vscode
.idea
*.iml
outputs/*.txt
outputs/*.csv

**/*#

dev.py

**/__pycache__

pregenerated/*.csv
pregenerated/*.txt
674 changes: 674 additions & 0 deletions utilities/geonetwork-gcmd-extractor/LICENSE

Large diffs are not rendered by default.

87 changes: 87 additions & 0 deletions utilities/geonetwork-gcmd-extractor/README.md
@@ -0,0 +1,87 @@

# GCMD Keywords Extractor

A Python tool for extracting GCMD (Global Change Master Directory) keywords used by metadata records from AODN's GeoNetwork catalog via the CSW service.

This tool helps the AODN Metadata Governance Officer produce on-demand GCMD keyword reports.

It works with the CSW service of both GeoNetwork3 and GeoNetwork4.

### Requirements

- Python 3.10
- Poetry
- Conda (recommended for creating a virtual environment)

### Installation

1. **Install Conda** (if not already installed):

Follow the instructions at [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).

2. **Create and activate a Conda virtual environment:**

```bash
conda create -n gcmd_extractor python=3.10
conda activate gcmd_extractor
```

3. **Install Poetry** (if not already installed):

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

Make sure to add Poetry to your PATH as instructed during the installation.

4. **Clone the repository:**

```bash
# after cloning the repository with git clone, from the repository root:
cd utilities/geonetwork-gcmd-extractor
```

5. **Install dependencies using Poetry:**

```bash
poetry install
```

### Usage

Configuration is defined in `config/config.json`. For example, you can change the CSW service source URL there.
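
If the CSW source needs to be switched from a script rather than by hand, a minimal sketch could look like the following. The helper name `update_csw_url` is an assumption made for illustration and is not part of this tool; only the `csw_url` key comes from `config/config.json`.

```python
import json
from pathlib import Path


def update_csw_url(config_path, new_url):
    """Rewrite the `csw_url` entry of a JSON config file.

    Hypothetical helper (not part of the extractor): all other keys are
    preserved, and the updated config is returned as a dict.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["csw_url"] = new_url
    # Write the file back with readable indentation and a trailing newline.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `update_csw_url("config/config.json", url)` with the URL of another catalogue's CSW endpoint would then point the next run at that catalogue.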

Run the script:

```bash
poetry run python main.py
```

For usage instructions and available parameters:
```bash
poetry run python main.py --help
```

### NLP Grouping Similar Texts

The tool also includes an NLP-based module that fuzzy-groups similar texts despite typos, plural forms, differences in letter case, and so on. For example:

Inputs:
```python
["Sea surface tempoerature", "SEA SURFACE TEMPERATUR", "car", "cars", "elephant", "ellephent", "antarticca"]
```

Outputs:
```python
['SEA SURFACE TEMPERATURE', 'CAR', 'ELEPHANT', 'ANTARCTICA']
```

This module is not used by the processor class; it is included for reference only. To use it, run `poetry install`, then `poetry run download-spacy-model`, and import it where needed:

```python
from utils.nlp_grouping import GroupingSimilarTexts
```
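
The general idea of fuzzy grouping can be illustrated without spaCy. The sketch below is not the `GroupingSimilarTexts` implementation: it uses `difflib.SequenceMatcher` from the Python standard library, and the function name `group_similar` and the 0.8 threshold are assumptions chosen for illustration. Unlike the output shown above, it does not correct spellings; it only clusters texts around the first member seen.

```python
from difflib import SequenceMatcher


def group_similar(texts, threshold=0.8):
    """Greedily cluster texts whose normalised forms are similar.

    Hypothetical sketch: each text is upper-cased and whitespace-collapsed,
    then compared with the first member of every existing group. It joins
    the first group whose similarity ratio reaches `threshold`, otherwise
    it starts a new group.
    """
    groups = []
    for text in texts:
        norm = " ".join(text.upper().split())
        for group in groups:
            if SequenceMatcher(None, norm, group[0]).ratio() >= threshold:
                group.append(norm)
                break
        else:
            # No existing group was close enough.
            groups.append([norm])
    return groups
```

With the seven inputs listed above, this returns four groups, one each for the sea surface temperature variants, `CAR`/`CARS`, `ELEPHANT`/`ELLEPHENT`, and `ANTARTICCA`. A character-ratio threshold is sensitive to string length, so short keywords may need a different value.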

### Extracted results

Output files will be generated in the `outputs` folder.
Empty file.
11 changes: 11 additions & 0 deletions utilities/geonetwork-gcmd-extractor/config/config.json
@@ -0,0 +1,11 @@
{
  "output_folder": "outputs",
  "unique_gcmd_keywords_file": "unique_gcmd_keywords.csv",
  "non_unique_gcmd_keywords_file": "non_unique_full_term_gcmd_keywords.csv",
  "unique_gcmd_thesaurus_file": "unique_gcmd_thesaurus.csv",
  "records_failed_file": "records_failed.txt",
  "is_harvested_by_identifier_file": "is_harvested_by_identifier.csv",
  "csw_url": "https://catalogue.aodn.org.au/geonetwork/srv/eng/csw?request=GetCapabilities&service=CSW&version=2.0.2",
  "output_schema": "http://standards.iso.org/iso/19115/-3/mdb/2.0",
  "batch_size": 10
}
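
The `batch_size` setting suggests that records are fetched from the CSW service in pages. How the processor actually does this is not shown in this diff; the sketch below only illustrates the paging arithmetic, assuming CSW 2.0.2 `GetRecords` semantics where `startPosition` is 1-based and `maxRecords` caps each page. The function name `page_windows` is hypothetical.

```python
def page_windows(total_records, batch_size):
    """Yield (start_position, max_records) pairs covering `total_records`.

    Hypothetical helper: start positions are 1-based, as in CSW GetRecords,
    and the final page is shortened so it never runs past the total.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = 1
    while start <= total_records:
        yield start, min(batch_size, total_records - start + 1)
        start += batch_size
```

For 25 records and a `batch_size` of 10, this produces three requests, the last of which asks for only 5 records.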
21 changes: 21 additions & 0 deletions utilities/geonetwork-gcmd-extractor/main.py
@@ -0,0 +1,21 @@
import os
import argparse
from processor.processor import GCMDProcessor

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="GeoNetwork GCMD Extractor")
    parser.add_argument(
        "--test",
        type=int,
        help="Run in test mode with the specified number of records. If not provided, the full dataset will be processed.",
    )
    args = parser.parse_args()

    config_path = os.path.join("config", "config.json")
    processor = GCMDProcessor(config_path)

    if args.test:
        print(f"Running in test mode with {args.test} records.")
        processor.run(total_records=args.test)
    else:
        processor.run()
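
One edge case in the script above: `if args.test:` is false for `--test 0`, so that invocation falls through to a full run. A possible way to make the intent explicit is sketched below; the helper names `positive_int`, `build_parser`, and `record_limit` are assumptions for illustration and are not part of this PR.

```python
import argparse


def positive_int(value):
    """argparse type that accepts only integers greater than zero."""
    number = int(value)  # a ValueError here is reported by argparse as invalid input
    if number <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value!r}")
    return number


def build_parser():
    parser = argparse.ArgumentParser(description="GeoNetwork GCMD Extractor")
    parser.add_argument(
        "--test",
        type=positive_int,
        default=None,
        help="Run in test mode with the specified number of records.",
    )
    return parser


def record_limit(argv):
    """Return the requested record count, or None when a full run is wanted."""
    return build_parser().parse_args(argv).test
```

The caller can then branch on `record_limit(...) is not None` instead of truthiness, and `--test 0` is rejected with a usage error rather than silently processing the full dataset.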