Including city name in forward geocoding text search not working as expected. #107

Open
gagandeepsingh1105 opened this issue May 21, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@gagandeepsingh1105

Hi there,

I am an engineer at the Public Health Agency of Canada. We have a use case for which we are looking to deploy an instance of the Pelias Geocoder.
We have custom input data (a CSV file) covering Canadian locations only, and we want to use Pelias forward geocoding to convert text addresses to latitude and longitude.
To do this we are trying to deploy csv-importer. Below is a snapshot of the input data that we have ingested into our Elasticsearch instance:
image

With forward geocoding, if we supply the street number, street name and province, the API returns a response with confidence = 1 and source = custom:

API request: https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr nl"&sources=custom
image

But if we also include the city name in the input text, the confidence drops to 0.6 and the match type changes to fallback. As you may have noticed, our input data does have a column named 'city', but csv-importer does not seem to read it, and the query falls back to the whosonfirst data source.

We have tried a couple of things on our end to resolve this issue:

  1. In the pelias.json configuration file, we added a "docs" key to map the columns in the CSV file to those in the Pelias schema, but got the following error:

image

Snapshot of pelias.json file:
"csv": {
  "datapath": "/data/csv-importer-files",
  "files": ["NLFD_test_changed.csv"],
  "docs": [
    {
      "name": "LAT",
      "type": "number",
      "required": true
    },
    {
      "name": "LON",
      "type": "number",
      "required": true
    },
    {
      "name": "SOURCE",
      "type": "number",
      "required": true
    },
    {
      "name": "LAYER",
      "type": "number",
      "required": true
    },
    {
      "name": "NUMBER",
      "type": "string",
      "required": false,
      "es_field": "address.number"
    },
    {
      "name": "STREET",
      "type": "string",
      "required": false,
      "es_field": "address.street"
    },
    {
      "name": "CITY",
      "type": "string",
      "required": false,
      "es_field": "address.city"
    },
    {
      "name": "NAME",
      "type": "string",
      "required": false,
      "es_field": "address.name"
    },
    {
      "name": "MAIL_PROV_ABVN",
      "type": "string",
      "required": false,
      "es_field": "address.region"
    },
    {
      "name": "POSTALCODE",
      "type": "string",
      "required": false,
      "es_field": "address.postalcode"
    }
  ],
  "download": []
}

  2. We also tried supplying the column mapping in a separate file, but that did not work either and produced the same error:

image

Snapshot of pelias.json file
{
  "imports": {
    "csv": {
      "datapath": "/data",
      "files": [
        "canada-locations.csv"
      ],
      "mappings": "/code/csv_mapping.json"
    }
  }
}

We then defined the column mappings in a separate file:
{
  "mappings": {
    "id": "id",
    "latitude": "latitude",
    "longitude": "longitude",
    "number": "house_number",
    "street": "street",
    "city": "city",
    "region": "region",
    "province": "province",
    "country": "country",
    "postalcode": "postalcode",
    "category": "category",
    "name": "name",
    "layer": "address"
  }
}

Steps to Reproduce

  1. Deploy an instance of the Pelias Geocoder with csv-importer running.
  2. Make the configuration changes described above in the pelias.json file.
  3. Try the following API calls:
    https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr nl"&sources=custom
    https://geocoder.alpha.phac.gc.ca/v1/search?text="283 prince philip dr st john's nl"&sources=custom
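
For scripted testing, the same two requests can be built with the text percent-encoded. This is a minimal Python sketch, assuming the host from this report and the `text`, `sources` and `layers` query parameters of the `/v1/search` endpoint; the helper name `build_search_url` is hypothetical and not part of Pelias.

```python
from urllib.parse import quote, urlencode

# Host taken from this report; replace with your own instance.
BASE_URL = "https://geocoder.alpha.phac.gc.ca/v1/search"


def build_search_url(text, sources=None, layers=None, base_url=BASE_URL):
    """Return a /v1/search URL with percent-encoded query parameters."""
    params = {"text": text}
    if sources:
        params["sources"] = ",".join(sources)
    if layers:
        params["layers"] = ",".join(layers)
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return base_url + "?" + urlencode(params, quote_via=quote)


without_city = build_search_url("283 prince philip dr nl", sources=["custom"])
with_city = build_search_url("283 prince philip dr st john's nl", sources=["custom"])
print(without_city)
print(with_city)
```

Comparing the responses to `without_city` and `with_city` should reproduce the difference in confidence and match type described above.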

Expected behavior
Including the city name in the search text should also give confidence=1 and source=custom.

Environment (please complete the following information):
We are currently running an instance of the Pelias Geocoder on a Kubernetes cluster on Google Cloud Platform.

Please do let us know in case you require any additional information to debug this issue.
Thanks in advance.

@missinglink
Member

Hi @gagandeepsingh1105, the 'administrative hierarchy' (ie. the city/province/country) of each record in Pelias is sourced exclusively from the WhosOnFirst dataset through point-in-polygon lookups at index time.

@missinglink
Member

I believe this is a duplicate of #74

@missinglink
Member

missinglink commented May 24, 2024

I'm not against adding this option to custom builds, the issue is that currently all administrative regions are composed of a source, id and term (with an optional abbreviation).

We could use 'custom' as the source, but each admin region would need to have a unique id in order to correctly generate the _gid field.

An autoincrement value could work here but would have the disadvantage that two places in the same area would have differing parent IDs.
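
To illustrate the concern, here is a small Python sketch. It assumes a gid of the form `source:layer:id` (the shape of identifiers such as `whosonfirst:locality:101753015`) and a hypothetical per-row autoincrement counter. Neither function is part of Pelias, and the second row is made up for illustration.

```python
from itertools import count


def make_gid(source, layer, record_id):
    """Compose a global id in the assumed source:layer:id shape."""
    return f"{source}:{layer}:{record_id}"


def assign_autoincrement_parents(rows, layer="locality", source="custom"):
    """Hypothetical scheme: every CSV row gets a fresh id for its city."""
    counter = count(1)
    return [(row["city"], make_gid(source, layer, next(counter))) for row in rows]


# Illustrative rows only: two addresses in the same city.
rows = [
    {"street": "prince philip dr", "city": "St. John's"},
    {"street": "water st", "city": "St. John's"},
]
parents = assign_autoincrement_parents(rows)
# Same city name, but each row ends up with a different parent gid.
print(parents)
```

Because the counter advances per row, the two records share a city name but not a parent id, which is the disadvantage described above.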

@missinglink
Member

missinglink commented May 24, 2024

It's possible to have multiple associated 'parents' for a single layer, so for example a record can have multiple 'region' records associated.

The issue would be that we only return one (ie. the first one), so it would either need to be decided (or configurable) whether the record from the CSV file was returned, or the WOF one, in the case where both data sources returned a match.
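
A configurable preference could look roughly like the following Python sketch. The function `pick_parent`, the `preferred_source` option and the custom candidate are hypothetical; only the WOF region id and name come from this thread.

```python
def pick_parent(candidates, preferred_source="whosonfirst"):
    """Return the candidate from preferred_source, else the first candidate."""
    for candidate in candidates:
        if candidate["source"] == preferred_source:
            return candidate
    return candidates[0] if candidates else None


# Two 'region' parents for one record: one from WOF, one from the CSV file.
region_candidates = [
    {"source": "whosonfirst", "id": "85682123", "name": "Newfoundland and Labrador"},
    {"source": "custom", "id": "1", "name": "NL"},  # hypothetical CSV-supplied parent
]

print(pick_parent(region_candidates)["name"])
print(pick_parent(region_candidates, preferred_source="custom")["name"])
```

Falling back to the first candidate keeps the current behaviour when the preferred source has no match.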

@the-epeecurean

Hello,

I am a developer on the original poster's team. I think this is an issue of how WOF is passed back as the first record returned, or how readily it is searched for a 'fallback' match, if a locality name is present despite a focus on a more granular location.

I performed the same two searches in the original post excluding the "sources=custom" filter from the API call and encountered the same behaviour. A search for "283 Prince Philip dr NL" (https://geocoder.alpha.phac.gc.ca/api/search?text="283%20prince%20philip%20dr%20NL") resulted in a match from the custom source with confidence 1.0.

However, a search for "283 Prince Philip dr St. John's NL" results in a match from WOF, and seemingly ignores a filter on the address layer type:
https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22
OR
https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22&layers=address

We'd like to use the custom data source in performing batch forward geocoding, and it is useful to pass an 'address, city, province' search term where the inclusion of the city helps refine the search. As identified in the original issue, this does not appear to be what is happening due to the inclusion of the city name.

We understand that WOF is the exclusive source for administrative hierarchy in Pelias, but the inclusion of the place name shouldn't cue the fallback behaviour when an accurate match to the desired layer granularity (street address) is available. In this scenario a street address supplemented by a city name should refine the area for a search, but it seems that it prompts a fallback match instead. It also seems to ignore a layer search filter in the API call when the city name is included, triggering the returned fallback result from WOF.

Thank you for your help!

@missinglink
Member

missinglink commented May 31, 2024

The debug query param displays a bunch more info:
https://geocoder.alpha.phac.gc.ca/api/search?text=%22283%20prince%20philip%20dr%20st%20john%27s%20nl%22&layers=address&debug=1

You can see that the Placeholder service ran, it found a matching locality:

{
  "controller:placeholder": [
    {
      "id": 890456615,
      "name": "St. John's",
      "placetype": "locality",
      "population": 99182,
      "lineage": [
        {
          "country": {
            "id": 85633041,
            "name": "Canada",
            "abbr": "CAN",
            "languageDefaulted": false
          },
          "county": {
            "id": 1158869009,
            "name": "Division No. 1",
            "languageDefaulted": false
          },
          "locality": {
            "id": 890456615,
            "name": "St. John's",
            "languageDefaulted": false
          },
          "region": {
            "id": 85682123,
            "name": "Newfoundland and Labrador",
            "abbr": "NL",
            "languageDefaulted": false
          }
        }
      ],
      "geom": {
        "bbox": "-52.72931,47.54494,-52.68931,47.58494",
        "lat": 47.56494,
        "lon": -52.70931
      },
      "languageDefaulted": false
    }
  ]
}

Then when the Elasticsearch query is run, the ID of the locality matched above is added as a Filter condition (ie. mandatory condition):

{
  "filter": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        {
          "terms": {
            "parent.locality_id": [
              "890456615"
            ]
          }
        }
      ],
      "must": [
        {
          "terms": {
            "layer": [
              "address"
            ]
          }
        }
      ]
    }
  }
}

Of course this results in 0 hits:

{
  "controller:search": {
    "queryType": {
      "address_search_using_ids": {
        "es_took": 36,
        "response_time": 42,
        "retries": 0,
        "es_hits": 0,
        "es_result_count": 0
      }
    }
  }
}
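
The effect of this filter can be sketched in a few lines of Python. This is only an in-memory imitation of the bool filter shown above, not Elasticsearch itself, and the shape of `address_doc` is an assumption about how the custom record was indexed (with no locality parent attached).

```python
def matches_filter(doc, locality_ids, layers):
    """Imitate the filter: layer must match AND a locality id must match."""
    # "must" clause: the layer has to be one of the requested layers.
    layer_ok = doc.get("layer") in layers
    # "should" clause with minimum_should_match=1: at least one id must match.
    parent_ids = doc.get("parent", {}).get("locality_id", [])
    locality_ok = any(pid in locality_ids for pid in parent_ids)
    return layer_ok and locality_ok


# Assumed document shape: the address has a region parent but no locality parent.
address_doc = {
    "layer": "address",
    "source": "custom",
    "parent": {"region_id": ["85682123"]},
}

hits = [d for d in [address_doc] if matches_filter(d, {"890456615"}, {"address"})]
print(len(hits))  # 0
```

If the same document carried `"locality_id": ["890456615"]` in its parent, it would pass the filter.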

At this point there are zero matches. I forget the exact workflow here, but I believe it falls back to a legacy search method which was more lenient.

I don't like that the request specifies only address layers but returns other layers. This is likely a bug, but one which doesn't often occur outside of custom installations such as this.

@missinglink
Member

missinglink commented May 31, 2024

The geometry of 890456615 St. John's is of type Point, which explains why the address wasn't associated via the PIP service (the address must lie inside the boundary).
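
As a rough illustration of why a Point geometry cannot act as a parent, here is a ray-casting sketch in Python. It is not the Pelias PIP service. The polygon is simply the bbox reported by Placeholder turned into a rectangle, and the address coordinate is a made-up point inside it.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def contains(geometry, lon, lat):
    """Only a Polygon can contain a point; a Point geometry has no area."""
    if geometry["type"] == "Polygon":
        return point_in_ring(lon, lat, geometry["coordinates"][0])
    return False


# Point geometry as reported for 890456615.
point_geom = {"type": "Point", "coordinates": [-52.70931, 47.56494]}
# Illustrative rectangle built from the bbox in the debug output.
bbox_polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-52.72931, 47.54494], [-52.68931, 47.54494],
        [-52.68931, 47.58494], [-52.72931, 47.58494],
        [-52.72931, 47.54494],
    ]],
}
address = (-52.70, 47.57)  # hypothetical address coordinate

print(contains(point_geom, *address))
print(contains(bbox_polygon, *address))
```

With a polygon boundary the address can be assigned to the locality; with a Point it never can, whatever the coordinates.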

@missinglink
Member

missinglink commented May 31, 2024

Maybe for your use case you can disable the Placeholder service, or possibly not add any data to it?
I haven't tested it, but it might prevent the filter condition from being added to the Elasticsearch query, which sounds like what you want.

@missinglink
Member

@the-epeecurean is there better open geo data for that region?

The only dataset I can find is points only. Does the Canadian government publish something better than this? https://opendata.gov.nl.ca/public/opendata/page/?page-id=datasetdetails&id=265

@the-epeecurean

@missinglink There are ... Statistics Canada publishes a hierarchy of delineated boundaries. I've just been evaluating some cherry-picked WOF 'fallback' results we've been seeing in testing.

Here's a link to an open REST point for the collected Cartographic Boundary files published by Statistics Canada:
https://geo.statcan.gc.ca/geo_wa/rest/services/2021/Cartographic_boundary_files/MapServer

And a reference to descriptions of the Cartographic Boundary files made available (at the bottom under "1. Spatial information products"):
https://www150.statcan.gc.ca/n1/pub/92-196-x/92-196-x2021001-eng.htm

A polygon for the example cited in the Issue above (St. John's NL) appears at the CSD (census subdivision) and CMA (census metropolitan area) levels.
However, some smaller localities (within a larger CMA, e.g., Halifax, NS) show up as polygons in the DPL (designated place) boundary file.

If there is any way that we could help get this spatial information included in WOF, please let us know. It would help our use case greatly to see more Canadian localities represented as polygons.

@nvkelso

nvkelso commented Jun 3, 2024

Adding an issue upstream in Who's On First to help facilitate this work:

tl;dr the new 2021 cartographic boundary files from Stats Canada look great and we'd love to import them!

@nick-rv

nick-rv commented Jan 10, 2025

Hi @missinglink, I am facing a similar issue.
I re-ran the OpenAddresses and custom CSV imports, after having:

  • adapted the pelias.json configuration for the import
  • provided WOF data inside the import directory and fixed the PIP service settings

The csv-import jobs ran successfully, and I can now see WOF attributes displayed for most of the address records returned in the API responses.

But if I include the city name in the requested address, I get only fallback records based on whosonfirst data (no address records).
As you suggested earlier in this issue, I tried stopping the placeholder service, but forward geocoding requests have failed since then.

To be clearer:

  • The /v1/search?text=4 avenue de paris 78000 request returns an exact match for the address.
  • But the /v1/search?text=4 avenue de paris 78000 versailles request fails to return an address record (match_type: fallback, WOF records only).

Is this behaviour normal?
Or could something be wrong with the import job?

For example, I can see that the returned address record has WOF properties for the country only (not the city, region, borough, etc.):
image

Does each CSV record need to be explicitly associated with its city, borough, etc. through the parent_json column?
My understanding from the repository documentation (csv-importer, wof-admin-lookup) is that this should not be necessary.

Thank you so much for your help.

@nick-rv

nick-rv commented Jan 10, 2025

If I compare with the public Pelias instance https://pelias.github.io/compare/#/v1/search?text=4+avenue+de+paris+78000+versailles , the address is returned even when the city is included in the search string:
image

On my own Pelias instance, when the address record is returned, it does not have all these WOF properties, only the country:
image

When I send the same request including the city name, /v1/search?text=4%20avenue%20de%20paris%2078000%20versailles , these are the logs I get:

From the Pelias API:

2025-01-10T15:03:53.394Z - debug: [api] [lang] 'fr' via 'header'
2025-01-10T15:03:53.394Z - debug: [libpostal] libpostal: http://libpostal:4400/
2025-01-10T15:03:53.396Z - debug: [placeholder] placeholder: http://placeholder:4100/
2025-01-10T15:03:53.409Z - info: [api] placeholder response_time=13, text=4 avenue de paris 78000 versailles, size=10, private=false, name=French, iso6391=fr, iso6393=fra, via=header, defaulted=false, querySize=20, parser=libpostal, housenumber=4, street=avenue de paris, postalcode=78000, city=versailles, result_count=0, text_length=34, controller=placeholder
2025-01-10T15:03:53.412Z - debug: [api] [controller:placeholder] [result_count:23]
2025-01-10T15:03:53.421Z - info: [api] elasticsearch controller=search, queryType=address_search_using_ids, es_hits=0, result_count=0, es_took=5, response_time=9, text=4 avenue de paris 78000 versailles, size=10, private=false, name=French, iso6391=fr, iso6393=fra, via=header, defaulted=false, querySize=20, parser=libpostal, housenumber=4, street=avenue de paris, postalcode=78000, city=versailles, retries=0, text_length=34
2025-01-10T15:03:53.421Z - debug: [api] [ES response]
2025-01-10T15:03:53.424Z - info: [api] pelias_parser response_time=3, text=4 avenue de paris 78000 versailles, size=10, private=false, name=French, iso6391=fr, iso6393=fra, via=header, defaulted=false, querySize=20, parser=libpostal, housenumber=4, street=avenue de paris, postalcode=78000, city=versailles, solutions=3, text_length=34
2025-01-10T15:03:53.425Z - debug: [api] req.clean: {"text":"4 avenue de paris 78000 versailles","size":10,"private":false,"lang":{"name":"French","iso6391":"fr","iso6393":"fra","via":"header","defaulted":false},"querySize":20,"parser":"libpostal","parsed_text":{"housenumber":"4","street":"avenue de paris","postalcode":"78000","city":"versailles"}}, pre-sort: [whosonfirst:locality:101753015,whosonfirst:neighbourhood:85853675,whosonfirst:locality:101718731,whosonfirst:localadmin:404484333,whosonfirst:localadmin:404497371,whosonfirst:macrocounty:404228265,whosonfirst:locality:1729432751,whosonfirst:localadmin:404363071,whosonfirst:county:102072567,whosonfirst:locality:85946975,whosonfirst:locality:85971525,whosonfirst:locality:101712821,whosonfirst:locality:85941529,whosonfirst:locality:85940625,whosonfirst:neighbourhood:421174889,whosonfirst:neighbourhood:420781695,whosonfirst:locality:1125798833,whosonfirst:locality:1125976233,whosonfirst:locality:1226645223,whosonfirst:locality:1243297577,whosonfirst:locality:1276360471,whosonfirst:locality:1343733151,whosonfirst:locality:1343994861], post-sort: [whosonfirst:locality:101753015,whosonfirst:macrocounty:404228265,whosonfirst:county:102072567,whosonfirst:locality:101718731,whosonfirst:locality:1729432751,whosonfirst:locality:85946975,whosonfirst:locality:85971525,whosonfirst:locality:101712821,whosonfirst:locality:85941529,whosonfirst:locality:85940625,whosonfirst:locality:1125798833,whosonfirst:locality:1125976233,whosonfirst:locality:1226645223,whosonfirst:locality:1243297577,whosonfirst:locality:1276360471,whosonfirst:locality:1343733151,whosonfirst:locality:1343994861,whosonfirst:localadmin:404484333,whosonfirst:localadmin:404497371,whosonfirst:localadmin:404363071,whosonfirst:neighbourhood:85853675,whosonfirst:neighbourhood:421174889,whosonfirst:neighbourhood:420781695]
2025-01-10T15:03:53.425Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:101753015, inferior=Versailles whosonfirst:macrocounty:404228265
2025-01-10T15:03:53.425Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:101753015, inferior=Versailles whosonfirst:county:102072567
2025-01-10T15:03:53.425Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:101753015, inferior=Versailles whosonfirst:localadmin:404363071
2025-01-10T15:03:53.425Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:county:102072567, inferior=Versailles whosonfirst:macrocounty:404228265
2025-01-10T15:03:53.426Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:localadmin:404363071, inferior=Versailles whosonfirst:macrocounty:404228265
2025-01-10T15:03:53.426Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:localadmin:404363071, inferior=Versailles whosonfirst:county:102072567
2025-01-10T15:03:53.427Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:101718731, inferior=Versailles whosonfirst:localadmin:404484333
2025-01-10T15:03:53.429Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:1729432751, inferior=Versailles whosonfirst:locality:85940625
2025-01-10T15:03:53.429Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:1729432751, inferior=Versailles Township whosonfirst:localadmin:404497371
2025-01-10T15:03:53.430Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:locality:85940625, inferior=Versailles Township whosonfirst:localadmin:404497371
2025-01-10T15:03:53.432Z - debug: [api] [dupe][replacing] query=4 avenue de paris 78000 versailles, superior=Versailles whosonfirst:neighbourhood:421174889, inferior=Versailles whosonfirst:neighbourhood:420781695
2025-01-10T15:03:53.433Z - debug: [language] language: http://placeholder:4100/
2025-01-10T15:03:53.441Z - info: [api] language response_time=8, language=fr, controller=language
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 404484333
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 404497371
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 1729432751
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 404516999
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 404524347
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 404500301
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 1125798833
2025-01-10T15:03:53.441Z - debug: [api] [language] [debug] missing translation fra 1125976233
2025-01-10T15:03:53.443Z - info: [api] [IP removed] - - [10/Jan/2025:15:03:53 +0000] "GET /v1/search?text=%5Bremoved%5D HTTP/1.1" 200 9480
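
When comparing many requests, the relevant fields can be pulled out of the `info: [api]` lines with a short script. This Python sketch assumes the comma-separated `key=value` layout seen in the logs above; the helper `parse_info_line` is hypothetical, and the sample line is an abbreviated excerpt of the Elasticsearch entry.

```python
def parse_info_line(line):
    """Split an 'info: [api]' log line into a dict of key=value fields."""
    _, _, payload = line.partition("info: [api] ")
    fields = {}
    for part in payload.split(", "):
        key, sep, value = part.partition("=")
        if sep:
            # The first part looks like 'elasticsearch controller=search';
            # keep only the last word before '=' as the key.
            fields[key.split()[-1]] = value
    return fields


# Abbreviated excerpt of the Elasticsearch log entry.
sample = (
    "2025-01-10T15:03:53.421Z - info: [api] elasticsearch controller=search, "
    "queryType=address_search_using_ids, es_hits=0, result_count=0, "
    "es_took=5, response_time=9"
)
fields = parse_info_line(sample)
print(fields["queryType"], fields["es_hits"])
```

Here the address query (`address_search_using_ids`) reports `es_hits=0`, so the results returned to the client come from a later step rather than from an address match.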

From the placeholder service:

took: 0.931ms
parent not found! country_id -1
parent not found! region_id 85687331
2025-01-10T15:03:53.409Z - info: [placeholder] ::ffff:192.168.200.34 - GET /parser/search HTTP/1.1 200 15508 - 10.416 ms
2025-01-10T15:03:53.441Z - info: [placeholder] ::ffff:192.168.200.34 - GET /parser/findbyid HTTP/1.1 200 16945 - 4.532 ms

From libpostal:

15:03:53.395813 [wof-libpostal-server] STATUS parse '4 avenue de paris 78000 versailles' 246.785µs

Thanks again @missinglink
