WHG PLACE (Place Linkage, Alignment, and Concordance Engine) #381

Open
docuracy opened this issue Oct 3, 2024 · 0 comments
docuracy commented Oct 3, 2024

Vespa would facilitate the development of the WHG PLACE (Place Linkage, Alignment, and Concordance Engine), a system designed not only to improve search results delivered through our APIs and on our website, but also to significantly streamline the incorporation of newly contributed datasets into WHG. The PLACE engine (using G2P+mBERT+FeatureClasses and trained on GeoNames Variants) would greatly reduce the manual effort required to reconcile, match, and rank places across different datasets—an essential step in the ongoing establishment of WHG as a central resource in the Digital Humanities ecosystem.

  1. Overview

    This project aims to use GeoNames toponym variants to train a Siamese network for generating vector embeddings that can be used in Vespa for phonetic and semantic similarity searches.

    The process begins by identifying GeoNames places that record at least one variant in addition to their primary name. PanPhon phonetic feature vectors (derived from G2P output) and mBERT semantic embeddings are then calculated for every toponym, and these embeddings are clustered within each place's set of toponyms to identify phonetic and semantic similarities. From within these clusters, pairs of toponyms that share phonetic or semantic similarity, or both, are used to train a Siamese network, which learns to focus on relative similarities between input pairs; small errors in the phonetic or semantic representations therefore won't significantly impact the model's ability to differentiate true variants from non-variants. In combination with the PanPhon and mBERT preprocessing, the Siamese network is then used to generate vector embeddings for all GeoNames toponyms, which are stored in a Vespa index for efficient searching. This constitutes the Place Matching Engine.

  2. Steps Involved

    • Preprocessing GeoNames Data

      • Dataset: Use the GeoNames dataset, which contains place names and their variants, often together with an indication of their language.
      • Goal: Preprocess the variants to identify sufficiently similar pairs of toponyms that are valid variants of each other.
        • Remove non-relevant symbols and normalize text.
        • For each toponym, calculate:
          • PanPhon phonetic feature vectors: Extracted from IPA representations generated using a G2P model such as CharsiuG2P. PanPhon converts the IPA into phonetic feature vectors that capture detailed articulatory characteristics.
          • Semantic embeddings using mBERT: Generate embeddings using an mBERT model to capture the semantic context of each toponym in its language. Where possible use a fine-tuned mBERT model to account for language-specific variations.
        • Clustering similar names: Use DBSCAN or similar clustering techniques on the combined PanPhon and mBERT embeddings to identify, for each GeoNames place, pairs of toponyms that share phonetic or semantic similarity, or both. The combination helps identify nuanced similarities across languages and dialects.
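The pair-selection logic above can be sketched in miniature. The vectors below are toy stand-ins for real PanPhon and mBERT embeddings, and the thresholds are illustrative, not tuned values; a production version would cluster with DBSCAN rather than compare all pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def candidate_pairs(toponyms, phon, sem, phon_thresh=0.9, sem_thresh=0.9):
    """Return toponym pairs that are phonetically OR semantically similar,
    mirroring the 'phonetic or semantic similarity, or both' criterion."""
    names = list(toponyms)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i], names[j]
            if (cosine(phon[a], phon[b]) >= phon_thresh
                    or cosine(sem[a], sem[b]) >= sem_thresh):
                pairs.append((a, b))
    return pairs

# Toy vectors standing in for the embeddings of one place's variants
phon = {"Koeln": [0.9, 0.1, 0.2], "Cologne": [0.88, 0.12, 0.21], "Colonia": [0.2, 0.9, 0.1]}
sem = {"Koeln": [0.5, 0.5, 0.1], "Cologne": [0.2, 0.1, 0.9], "Colonia": [0.21, 0.12, 0.88]}
pairs = candidate_pairs(phon.keys(), phon, sem)
```

Here "Koeln"/"Cologne" pair on phonetic similarity alone and "Cologne"/"Colonia" on semantic similarity alone, illustrating why the OR condition recovers variants that only one signal catches.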
  3. Training the Siamese Network

    • Siamese Network Architecture

      • Goal: Train a Siamese network that takes attributes of pairs of toponyms as input and generates combined embeddings to determine whether they are variants of the same place name.
      • Input:
        • PanPhon phonetic feature vectors.
        • Corresponding mBERT semantic embeddings.
        • GeoNames Feature Classes (categorical attributes of the places).
      • Output: Vector embeddings of each place name.
      • Objective: Minimize the distance between embeddings of variants and maximize the distance between non-variants.
    • Training Process

      • Dataset: Use identified GeoNames variant pairs as positive examples and randomly selected non-variant pairs as negative examples.
      • Feature Integration:
        • Combine PanPhon phonetic embeddings and mBERT semantic embeddings into a unified representation.
        • Incorporate GeoNames Feature Classes into the training data to provide additional categorical context (e.g., whether the toponyms refer to populated places, rivers, mountains, etc.).
      • Training Loop: Train the network to produce close embeddings for positive pairs and distant embeddings for negative pairs, leveraging both phonetic and semantic information.
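The training objective described above (close embeddings for positive pairs, distant for negative pairs) is commonly realised with a contrastive loss; a minimal numeric sketch, with toy embeddings standing in for the Siamese network's outputs and an assumed margin of 1.0:

```python
import math

def euclid(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb_a, emb_b, is_variant, margin=1.0):
    """Pull variant pairs together; push non-variants apart until they
    are at least `margin` away (beyond that, no further penalty)."""
    d = euclid(emb_a, emb_b)
    if is_variant:
        return d ** 2
    return max(0.0, margin - d) ** 2

# A positive pair that is already close incurs a small loss;
# a negative pair that is still close incurs a large one.
pos = contrastive_loss([0.1, 0.2], [0.12, 0.18], is_variant=True)
neg = contrastive_loss([0.1, 0.2], [0.15, 0.25], is_variant=False)
```

In training, this loss would be minimised over batches of the GeoNames variant pairs and sampled non-variant pairs described above.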
  4. Inference Pipeline

    • Goal: Build a pipeline that integrates G2P, PanPhon, mBERT, and the Siamese network to infer vector embeddings for place names in a single pass.

      • Pipeline Steps:

        1. G2P model converts a place name to its IPA representation.
        2. PanPhon converts the IPA representation into detailed phonetic feature vectors.
        3. mBERT generates a semantic embedding based on the place name's textual representation and language.
        4. GeoNames Feature Class is integrated to add categorical context.
        5. The Siamese network combines the PanPhon, mBERT embeddings, and GeoNames Feature Class to generate a final vector embedding representing the place name.
    • Chained Model Inference:

      • Automation: Create an end-to-end function that automatically preprocesses a place name (normalizing text and removing non-relevant symbols), runs G2P to convert the name to IPA, extracts the PanPhon phonetic features, generates the mBERT semantic embedding, and incorporates the GeoNames Feature Class. The Siamese network then generates the final vector embedding in one efficient pass.
      • Efficient Vector Generation: This process enables fast and accurate embedding generation for new or unknown place names by automating the chaining of phonetic and semantic similarity measures.
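The chained inference described above can be sketched as a single function. Every stage below is a stub standing in for the real model (CharsiuG2P, PanPhon, mBERT, the trained Siamese encoder); only the chaining and the normalization step are meant literally.

```python
import unicodedata

def normalize(name):
    """Strip diacritics and non-relevant symbols, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if c.isalnum() or c.isspace()).lower()

# Stub stages: each would be replaced by the real model call.
def g2p(name):
    return name                                  # would return an IPA string

def panphon_features(ipa):
    return [float(len(ipa))]                     # would return phonetic feature vectors

def mbert_embedding(name, lang):
    return [float(len(name))]                    # would return a semantic embedding

def siamese_encode(phon, sem, feature_class):
    return phon + sem + [float(ord(feature_class))]  # would return the learned embedding

def embed_toponym(name, lang, feature_class):
    """Chain preprocessing, G2P, PanPhon, mBERT, and the Siamese encoder
    into one pass, as described in the pipeline steps above."""
    clean = normalize(name)
    ipa = g2p(clean)
    return siamese_encode(panphon_features(ipa),
                          mbert_embedding(clean, lang),
                          feature_class)
```

With the real models plugged in, `embed_toponym` is the single entry point the batch-feeding and query stages below would call.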
  5. Storing Embeddings in Vespa

  • Vespa Schema

    • Goal: Store the vector embeddings generated by the inference pipeline in Vespa for efficient similarity searches.
    • Schema Design: Define a schema in Vespa that includes the toponyms, their corresponding vector embeddings, and associated GeoNames Feature Classes. This will support vector search queries, allowing both phonetic and semantic similarity searches.
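A schema along these lines might look as follows; the field names, the 128-dimensional tensor, the angular distance metric, and the HNSW parameters are all illustrative assumptions, to be fixed once the Siamese network's output dimension is known.

```
schema toponym {
    document toponym {
        field name type string {
            indexing: summary | index
        }
        field feature_class type string {
            indexing: summary | attribute
        }
        field embedding type tensor<float>(x[128]) {
            indexing: summary | attribute | index
            attribute {
                distance-metric: angular
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 200
                }
            }
        }
    }
    rank-profile similarity {
        inputs {
            query(q) tensor<float>(x[128])
        }
        first-phase {
            expression: closeness(field, embedding)
        }
    }
}
```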
  • Inserting Data into Vespa

    • Batch Processing: Process the GeoNames toponyms through the inference pipeline in batches. Once each toponym has been processed, insert its vector embedding and feature class into Vespa.
    • Optimisation: To ensure efficient scalability, optimise the batch size and utilise asynchronous processes to manage the high volume of toponyms and embeddings.
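The batch-feed step could be prepared as below. The generator builds documents in Vespa's /document/v1 JSON shape for the assumed `toponym` schema; `embed_fn` stands in for the full inference pipeline, and the document-id namespace (`whg`) is hypothetical. The actual feed call (e.g. via pyvespa against a running instance) is left as a comment.

```python
def vespa_feed_docs(toponyms, embed_fn):
    """Yield Vespa feed operations for a batch of GeoNames toponym records."""
    for record in toponyms:
        yield {
            "id": f"id:whg:toponym::{record['geonameid']}-{record['name']}",
            "fields": {
                "name": record["name"],
                "feature_class": record["feature_class"],
                "embedding": {"values": embed_fn(record["name"])},
            },
        }

batch = list(vespa_feed_docs(
    [{"geonameid": 2886242, "name": "Cologne", "feature_class": "P"}],
    embed_fn=lambda name: [0.0] * 4,  # stand-in for the real pipeline
))
# These documents would then be fed asynchronously in tuned batch sizes,
# e.g. with pyvespa's feed helpers against a running Vespa application.
```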
  6. Querying Vespa for Similar Names

    • Vector Search Capability

      • Goal: Use Vespa's vector search functionality to find place names that are similar in both phonetic and semantic terms based on the embeddings.
      • Query Process:
        • New place names will be processed through the same G2P, PanPhon, mBERT, and Siamese network pipeline to generate their vector embeddings.
        • Vespa will perform a nearest neighbour search using these embeddings, enabling fast and precise retrieval of similar names.
      • Search Parameters: Leverage vector distance metrics within Vespa to ensure phonetic and semantic nuances are accurately reflected in the search results.
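The query step above can be sketched as a request body using Vespa's `nearestNeighbor` operator. The field name `embedding`, the rank profile `similarity`, and the query tensor label `q` match the assumed schema sketched earlier and are not fixed names.

```python
def similarity_query(embedding, hits=10):
    """Build a Vespa query body that runs an approximate nearest
    neighbour search over the stored toponym embeddings."""
    return {
        "yql": "select * from toponym where "
               "{targetHits: %d}nearestNeighbor(embedding, q)" % hits,
        "input.query(q)": embedding,
        "ranking.profile": "similarity",
        "hits": hits,
    }

# The embedding would come from the same G2P + PanPhon + mBERT + Siamese
# pipeline used at indexing time, so query and document vectors are comparable.
query = similarity_query([0.1, 0.2], hits=5)
```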
  7. Expected Outcomes

    • Improved Similarity Matching: The combination of G2P, PanPhon, and mBERT embeddings, coupled with the Siamese network, is expected to improve the accuracy of place name variant matching. The system will better handle nuanced phonetic variations across languages and dialects, as well as capture semantic similarities.
    • Fast and Accurate Searches: Storing vector embeddings in Vespa allows for rapid retrieval of similar place names, with the ability to fine-tune the balance between phonetic and semantic similarity in the search results.
    • Scalability and Flexibility: The solution is designed to be scalable for larger datasets and adaptable to other geospatial applications. Its modular design allows easy expansion or refinement of the model.
  8. Next Steps

    • Preprocessing Pipeline: Complete and test the preprocessing pipeline for GeoNames data, including G2P conversion, phonetic vector extraction using PanPhon, and mBERT embedding generation.
    • Siamese Network Training: Train the Siamese network using the identified pairs of place names, refining the integration of mBERT embeddings and GeoNames Feature Classes into the training process.
    • Pipeline Integration: Build the complete inference pipeline that automates the generation of embeddings from place names, ensuring the smooth chaining of the G2P, PanPhon, mBERT, and Siamese network components.
    • Vespa Testing: Store embeddings in Vespa and conduct performance tests to validate the accuracy and efficiency of vector search for place names, ensuring the system can scale and adapt to various data loads.
@docuracy docuracy added the enhancement New feature or request label Oct 3, 2024
@docuracy docuracy self-assigned this Oct 3, 2024
@docuracy docuracy changed the title Phonetic Vector Embeddings Train model with G2P, mBERT, and Feature Classes for Toponym Embeddings in Vespa Oct 3, 2024
@docuracy docuracy changed the title Train model with G2P, mBERT, and Feature Classes for Toponym Embeddings in Vespa Place Matching Engine (G2P+mBERT+FeatureClasses): trained on GeoNames Variants Oct 3, 2024
@docuracy docuracy pinned this issue Oct 5, 2024
This was referenced Oct 14, 2024
@docuracy docuracy changed the title Place Matching Engine (G2P+mBERT+FeatureClasses): trained on GeoNames Variants WHG PLACE (Place Linkage and Advanced Concordance Engine) Oct 16, 2024
@docuracy docuracy changed the title WHG PLACE (Place Linkage and Advanced Concordance Engine) WHG PLACE (Place Linkage, Alignment, and Concordance Engine) Oct 17, 2024