Merge pull request #51 from aodn/gcmd-utils
Add GCMD keywords extractor util
utas-raymondng authored Jun 7, 2024
2 parents bebfda4 + 37490b2 commit 9dbcd64
Showing 23 changed files with 2,851 additions and 0 deletions.
1 change: 1 addition & 0 deletions .github/workflows/build_deploy_edge.yml
@@ -8,6 +8,7 @@ on:
- '**/*.md'
- '.github/environment/**'
- 'geonetwork-config/**'
- 'utilities/**'

permissions:
id-token: write
1 change: 1 addition & 0 deletions .github/workflows/test.yml
@@ -5,6 +5,7 @@ on:
paths-ignore:
- '**/*.md'
- '.github/environment/**'
- 'utilities/**'

concurrency:
group: ${{ github.ref }}
2 changes: 2 additions & 0 deletions .gitignore
@@ -33,3 +33,5 @@ gn4_data/
**/target/

**/.git-versioned-pom.xml

**/__pycache__
4 changes: 4 additions & 0 deletions README.md
@@ -121,3 +121,7 @@ Once you have the json, you can generate code like the one here in Java to acces
A config file related to S3 is also included, but we do not use it: in our experiments it did not
work well, because GN4 issues file-not-found warnings for relative folder names. The code is kept
only as a record.

## Utilities folder

Additional utilities are available in the `utilities` folder of this repository. See the README files in that folder for details.
1 change: 1 addition & 0 deletions utilities/README.md
@@ -0,0 +1 @@
# Utilities for GeoNetwork
14 changes: 14 additions & 0 deletions utilities/geonetwork-gcmd-extractor/.gitignore
@@ -0,0 +1,14 @@
.vscode
.idea
*.iml
outputs/*.txt
outputs/*.csv

**/*#

dev.py

**/__pycache__

pregenerated/*.csv
pregenerated/*.txt
674 changes: 674 additions & 0 deletions utilities/geonetwork-gcmd-extractor/LICENSE

Large diffs are not rendered by default.

87 changes: 87 additions & 0 deletions utilities/geonetwork-gcmd-extractor/README.md
@@ -0,0 +1,87 @@

# GCMD Keywords Extractor

A Python tool for extracting GCMD (Global Change Master Directory) keywords used by metadata records from AODN's GeoNetwork catalog via the CSW service.

This tool helps the AODN Metadata Governance Officer generate on-demand GCMD keyword reports.

It works with the CSW service of both GeoNetwork3 and GeoNetwork4.

### Requirements

- Python 3.10
- Poetry
- Conda (recommended for creating a virtual environment)

### Installation

1. **Install Conda** (if not already installed):

Follow the instructions at [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).

2. **Create and activate a Conda virtual environment:**

```bash
conda create -n gcmd_extractor python=3.10
conda activate gcmd_extractor
```

3. **Install Poetry** (if not already installed):

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

Make sure to add Poetry to your PATH as instructed during the installation.

4. **Clone the repository:**

```bash
# after cloning the repo with git clone command
cd geonetwork-gcmd-extractor
```

5. **Install dependencies using Poetry:**

```bash
poetry install
```

### Usage

Configuration is defined in `config/config.json`; for example, you can change the CSW service source URL there.

Run the script:

```bash
poetry run python main.py
```

For parameter usage instructions:

```bash
poetry run python main.py --help
```

### NLP Grouping Similar Texts

The tool includes an NLP-based implementation that fuzzy-groups similar texts regardless of typos, plurals, letter case, etc. For example:

Inputs:
```python
["Sea surface tempoerature", "SEA SURFACE TEMPERATUR", "car", "cars", "elephant", "ellephent", "antarticca"]
```

Outputs:
```python
['SEA SURFACE TEMPERATURE', 'CAR', 'ELEPHANT', 'ANTARCTICA']
```
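
As a rough illustration of the grouping idea, here is a minimal sketch that uses only Python's standard-library `difflib` rather than the spaCy-based `GroupingSimilarTexts` module. It is not the implementation shipped with this tool: the names `normalise` and `group_similar` and the `0.8` similarity threshold are assumptions made for this example, and the sketch only groups near-duplicates without correcting their spelling.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Upper-case and collapse whitespace so that letter case is ignored."""
    return " ".join(text.upper().split())


def group_similar(texts: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily add each text to the first group whose first member is similar enough.

    Hypothetical helper for illustration; the threshold is an assumed value.
    """
    groups: list[list[str]] = []
    for text in texts:
        key = normalise(text)
        for group in groups:
            ratio = SequenceMatcher(None, key, normalise(group[0])).ratio()
            if ratio >= threshold:
                group.append(text)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([text])
    return groups


inputs = [
    "Sea surface tempoerature",
    "SEA SURFACE TEMPERATUR",
    "car",
    "cars",
    "elephant",
    "ellephent",
    "antarticca",
]
groups = group_similar(inputs)
for group in groups:
    print(group)
```

A character-level ratio like this handles typos, plurals and case, but it has no notion of vocabulary, so it cannot map a misspelling to its canonical term the way a language-model-based approach can.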

This module is not used by the processor class; it is included for reference only. To use it, run `poetry install`, then you may need to run `poetry run download-spacy-model` before importing it where needed.

```python
from utils.nlp_grouping import GroupingSimilarTexts
```

### Extracted results

Output files will be generated in the `outputs` folder.
Empty file.
11 changes: 11 additions & 0 deletions utilities/geonetwork-gcmd-extractor/config/config.json
@@ -0,0 +1,11 @@
{
    "output_folder": "outputs",
    "unique_gcmd_keywords_file": "unique_gcmd_keywords.csv",
    "non_unique_gcmd_keywords_file": "non_unique_full_term_gcmd_keywords.csv",
    "unique_gcmd_thesaurus_file": "unique_gcmd_thesaurus.csv",
    "records_failed_file": "records_failed.txt",
    "is_harvested_by_identifier_file": "is_harvested_by_identifier.csv",
    "csw_url": "https://catalogue.aodn.org.au/geonetwork/srv/eng/csw?request=GetCapabilities&service=CSW&version=2.0.2",
    "output_schema": "http://standards.iso.org/iso/19115/-3/mdb/2.0",
    "batch_size": 10
}
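
To make the role of these keys more concrete, the following sketch shows one way such a config could be read. It is an illustration only and does not reproduce how `GCMDProcessor` consumes the file, which is not shown in this diff. The helper `batch_starts`, the inlined subset of keys, and the assumption that `batch_size` drives paging with 1-based start positions (as CSW `GetRecords` does) are all hypothetical.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit

# A subset of the keys from config/config.json, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
    "output_folder": "outputs",
    "unique_gcmd_keywords_file": "unique_gcmd_keywords.csv",
    "csw_url": "https://catalogue.aodn.org.au/geonetwork/srv/eng/csw?request=GetCapabilities&service=CSW&version=2.0.2",
    "batch_size": 10
}
"""


def batch_starts(total_records: int, batch_size: int) -> list[int]:
    """Return the 1-based start position of each batch (hypothetical helper)."""
    return list(range(1, total_records + 1, batch_size))


config = json.loads(CONFIG_TEXT)

# Strip the GetCapabilities query string to obtain the bare CSW endpoint.
parts = urlsplit(config["csw_url"])
endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Build the path of one report file relative to the working directory.
output_path = str(PurePosixPath(config["output_folder"]) / config["unique_gcmd_keywords_file"])

# With 25 records and a batch size of 10, three requests would be needed.
starts = batch_starts(25, config["batch_size"])

print(endpoint)
print(output_path)
print(starts)
```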
21 changes: 21 additions & 0 deletions utilities/geonetwork-gcmd-extractor/main.py
@@ -0,0 +1,21 @@
import os
import argparse
from processor.processor import GCMDProcessor

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="GeoNetwork GCMD Extractor")
    parser.add_argument(
        "--test",
        type=int,
        help="Run in test mode with the specified number of records. If not provided, the full dataset will be processed.",
    )
    args = parser.parse_args()

    config_path = os.path.join("config", "config.json")
    processor = GCMDProcessor(config_path)

    if args.test:
        print(f"Running in test mode with {args.test} records.")
        processor.run(total_records=args.test)
    else:
        processor.run()
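
The branching on `--test` in `main.py` can be exercised in isolation with the sketch below. It recreates the parser and replaces the `GCMDProcessor` calls with a returned string, so `build_parser` and `describe_mode` are hypothetical names introduced only for this example. One thing it makes visible: because the check is `if args.test:`, passing `--test 0` is treated the same as omitting the flag and results in a full run.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the parser from main.py so its behaviour can be checked on its own."""
    parser = argparse.ArgumentParser(description="GeoNetwork GCMD Extractor")
    parser.add_argument(
        "--test",
        type=int,
        help="Run in test mode with the specified number of records.",
    )
    return parser


def describe_mode(argv: list[str]) -> str:
    """Mirror the `if args.test:` branch without calling the processor."""
    args = build_parser().parse_args(argv)
    if args.test:
        return f"test mode with {args.test} records"
    return "full run"


no_flag = describe_mode([])
five_records = describe_mode(["--test", "5"])
zero_records = describe_mode(["--test", "0"])  # 0 is falsy, so this is a full run

print(no_flag)
print(five_records)
print(zero_records)
```

If `--test 0` should be rejected or handled separately, comparing with `args.test is not None` would distinguish it from the flag being absent.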