
fetch US census data from a hosted FlatGeoBuf #432

Merged (16 commits), Jan 1, 2021

Conversation

michaelkirk (Collaborator):

Rather than prepping a cropped geojson for each city, I've hosted an agglomerated flatgeobuf file with all the areas and their populations.

Because FGBs are quite efficient, I was able to store the most granular census area: "census blocks". A tract can be several times larger than a block, so getting this more fine-grained information should give us a more accurate picture of where trips terminate.

I'm actually optimistic that it would be reasonable to serve other countries' population areas in the same unified FGB file. Then we don't need any client changes, per-city handling, or server-side logic about which census/population data we want - we "just" need to update the hosted FGB with the new data, and this code, which requests features based on the map's WGS84 bounding box, will continue to work.
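Roughly, the client-side flow is just a bbox query against the hosted file. A minimal sketch of that flow (assuming the flatgeobuf crate's select_bbox / streaming next() API from around this time; the URL and coordinates below are placeholders, not the real hosted file):

use anyhow::Result;
use flatgeobuf::*;
use geozero::FeatureProperties;

async fn sketch_bbox_query() -> Result<()> {
    // Placeholder URL and WGS84 bounds; the real code uses the hosted
    // population_areas.fgb and the current map's GPS bounds.
    let mut fgb = HttpFgbReader::open("https://example.com/population_areas.fgb").await?;
    // Only the index pages and the features intersecting this bbox are fetched,
    // via HTTP range requests.
    fgb.select_bbox(-122.35, 47.50, -122.25, 47.60).await?;
    while let Some(feature) = fgb.next().await? {
        let props = feature.properties()?;
        let _population: usize = props["population"].parse()?;
        // ... convert the geometry and keep (polygon, population) pairs
    }
    Ok(())
}

The FGB's internal spatial index is what lets this resolve to a handful of small range requests instead of downloading the whole file.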

Preparing the fgb is currently a manual process, but hopefully we don't need to do it often. I've included a script to show what I did for the US data which could be adapted to incorporate other countries. I imagine if it gets more use it'd make sense to turn it into something more robust than a bash script.

The reason this is a draft is that the app crashes in some cities (e.g. try running it in South Seattle) due to the census geometries containing invalid polygon rings.

The crash can be avoided with this change: 7e382f5

...but I assume we don't want to wholesale disable ring validation... right?

I've verified that, at least in the cases I inspected, the error is legit. The rings really are self-intersecting (just touching, in those cases), but I don't think that will stop the building population assignment from working.

One approach - rather than converting to an abstreet::geom::Polygon we could hang on to the geo::Polygon, which won't explode when faced with an invalid ring. WDYT?

Cargo.toml Outdated

# Minimize bandwidth and requests
# Waiting for release of https://github.com/bjornharrtell/flatgeobuf/pull/93
flatgeobuf = { git = "https://github.com/bjornharrtell/flatgeobuf" }
michaelkirk (Collaborator Author):

With these changes it's fairly snappy. It's already faster than downloading and parsing a prepared geojson.

However if we get a much larger data structure, I have some wild ideas for how to make things even faster.

dabreegster (Collaborator):

I was half-following the discussion in the PRs -- what a detailed, cool contribution to a common lib! I'm sure many other use cases will benefit from this

dabreegster (Collaborator):

Also, if you're interested, this would be pretty cool work to demo to cugos -- the next meeting should be Jan 20

#!/bin/bash -e

# Outputs a FlatGeoBuf of geometries and their corresponding population,
# suitable for use in the popdat crate.
michaelkirk (Collaborator Author):

This script is kind of heinous. I struggled a bit with ogr2ogr, but I'm not sure how to make it better.

Part of this would be straightforward to port to Rust, but unfortunately Rust doesn't yet have an FGB writer.

let collection = geojson::FeatureCollection::try_from(geojson)?;
pub fn fetch_all_for_map(map: &Map, timer: &mut Timer) -> anyhow::Result<Vec<CensusArea>> {
timer.start("processing population areas fgb");
let areas = tokio::runtime::Runtime::new()
michaelkirk (Collaborator Author):

It changes from map to map, but anecdotally, this seems to require 8-10 requests totaling 3-4MB of bandwidth.

dabreegster (Collaborator):

That's awesome

timer.start("opening FGB reader");
// See the import handbook for how to prepare this file.
let mut fgb =
HttpFgbReader::open("https://s3.amazonaws.com/mjk_asdf/abs/population_areas.fgb")
michaelkirk (Collaborator Author):

It probably makes sense to move this file to the abstreet S3 bucket.

(It's currently 11.5 GB, FYI.)

dabreegster (Collaborator):

Just finished downloading it, now struggling to re-upload. The aws CLI couldn't figure out how to do a direct copy from another public-through-HTTP bucket. :P
It'll eventually be at abstreet.s3-website.us-east-2.amazonaws.com/population_areas.fgb. I don't think it makes sense to put it in /dev/data/input or anything like that, because that directory is synced with a local copy. I'm not sure if it makes sense to let people opt into having a local copy of this file through the updater.

michaelkirk (Collaborator Author):

Yeah, I've been thinking about the offline use case and haven't come to any conclusions. Relying on downloading a multi-gigabyte file seems fraught. Maybe it'd make sense to instead have the importer generate and cache the final scenario?

dabreegster (Collaborator):

That might be the easiest route forward. It might be a choppy gameplay experience to have to read from a remote file, do all of the pathfinding to pick modes, etc just to start a scenario. So for most maps, we can precompute this. If we still want to have live tuning of some of the parameters into the scenario generation (which seems likely -- % of students, if census doesn't tell us), then we can figure out some caching for the extracted CensusAreas. Some of it might be a little more interesting on the web.

michaelkirk (Collaborator Author):

Thanks - I've updated the link.

dabreegster (Collaborator) left a comment:

> we could hang on to the geo::Polygon, which won't explode when faced with an invalid ring

It depends how the code in popdat/src/distribute_people.rs evolves. Right now it's brute-force checking all buildings against all areas. I'm guessing we'll want to throw a quadtree around this at some point. There isn't a consistent way we're doing that in the rest of the code yet, but making it handle both geom::Polygon and geo::Polygon wouldn't be hard.

So we could either keep geo::Polygon or do Polygon::buggy_new or make Ring::unchecked_new (in the vein of PolyLine::unchecked_new, disabling the checks). I lean towards the former, since it's simpler and CensusArea isn't used outside of popdat yet. (Although maybe we'll want a simple UI for visualizing the demographics, unless there's already a nice web version somewhere.)
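For illustration, the Ring::unchecked_new option might look roughly like this -- purely a hypothetical sketch, not the actual geom internals (the pts field and the Pt2D point type are assumptions here):

impl Ring {
    /// Hypothetical: build a Ring without validating that the points form a
    /// closed, non-self-intersecting loop, in the same spirit as
    /// PolyLine::unchecked_new.
    pub fn unchecked_new(pts: Vec<Pt2D>) -> Ring {
        Ring { pts }
    }
}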


@@ -7,12 +7,16 @@ edition = "2018"
[dependencies]
abstutil = { path = "../abstutil" }
anyhow = "1.0.35"
flatgeobuf = { version = "*" }
dabreegster (Collaborator):

I noticed this brought in lots of new dependencies. cargo tree -i -p openssl points to a dep on reqwest. This'll prevent us from building in WASM because of the dependency on native SSL. I think we'll need a way to change the reqwest features that fgb uses to something like default-features = false, features = ["blocking", "rustls-tls"]. I'm always a little confused by overriding transitive deps, but I think the proper way is to add a feature to fgb that in turn changes the dep on reqwest, instead of using patch.crates-io.

michaelkirk (Collaborator Author):

Ooph, I once again have mostly forgotten about the wasm use case.

Allegedly reqwest has some support for the wasm target, but I'm a little nervous about what I'll find. I'll look into it.

dabreegster (Collaborator):

Oh right, this maybe isn't straightforward -- in our wasm file loader, we use a call that doesn't even expose progress, let alone let you make range requests. If there's not something easy we can do, we can rethink having game depend on popdat, and instead just generate a scenario through the importer.

# ## Input
#
# I wasn't able to find a reasonable solution using the official census.gov
# files or API's. Instead I batch-downloaded all the census blocks and their
dabreegster (Collaborator):

Ah, nhgis.org needs a login and has a fancy interface for selecting downloads, so we can't just stick a URL in here. :( The number of tables they have is a bit overwhelming and I couldn't easily tell what you've selected -- which year, table name? Could you be a bit more specific here so the instructions are more reproducible?

michaelkirk (Collaborator Author):

I've added some specific steps

michaelkirk (Collaborator Author):

> So we could either keep geo::Polygon or do Polygon::buggy_new or make Ring::unchecked_new (in the vein of PolyLine::unchecked_new, disabling the checks). I lean towards the former, since it's simpler and CensusArea isn't used outside of popdat yet. (Although maybe we'll want a simple UI for visualizing the demographics, unless there's already a nice web version somewhere.)

I've gone the route of plumbing through the geo::Polygon. I didn't realize that the underlying intersection/area code was relying on to_geo anyway, so skipping the to/from conversion is a nice little perf boost.

In the longer running conversation we've had, I think this is a vote in favor of having abstreet::geom::Polygon be a wrapper for geo::Polygon.

michaelkirk (Collaborator Author) commented Dec 29, 2020:

By way of an update, I believe the only thing still outstanding is the wasm interop.

There's a bunch of things I don't know and am trying to learn about, but it seems like a large task.

It's unclear to me whether you can "just" drop in a bunch of async machinery written for native Rust and have it run in wasm, due to the runtime constraints of the browser (i.e. it's single-threaded).

There are tools for passing data into and out of JS, but I'm still trying to figure out whether and how I can use them to get this code running in the browser. I'll keep trying to solve it this way tomorrow, but I'll probably pivot to having the importer generate and cache scenarios for now if I can't make progress soon.

As a side note, a couple of things I had to do:

  • update tokio, and use "rt" feature
  • add CORS policy to s3 with allow headers *

dabreegster (Collaborator):

Exciting progress!

> probably pivot to having the importer generate and cache scenarios

Agreed this is a good way forwards. Getting wasm support glued to reqwest could be a big detour, and since we don't have any online parameter tuning ideas yet, we can just have this work offline only to start.

> update tokio, and use "rt" feature

The one thing to watch out for, if you also updated hyper, is making sure cargo run --bin headless -- --port=1234 still works. I tried upgrading to the new tokio a few days ago and hit an issue figuring out the new hyper API's socket binding.

Census block geometries often have self-intersecting rings. In the cases
I inspected, this was due to the ring *touching* itself at a point, not
overlapping along a line nor crossing into an area.

e.g. best I can represent in ascii...

.____. .__.
|    | |  |
|    |<   |  <--- Ring touching itself at the "|<"
|    |_|  |
|_________|

This isn't expected to work out of the box. It's more like really
precise documentation. In fact, I first tried writing this *as*
documentation, but in order to be clear enough it required so many code
snippets that I just opted to write the script instead.

Similar to the FileLoader, the FutureLoader will load any future async
while polling from a loading panel.

On native, this will dispatch to a background thread via tokio. On wasm,
this must use the same single-threaded run loop as all promises.
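Not the actual FutureLoader code, just a minimal sketch of the native half of that pattern (assuming tokio with the "rt" feature, as above): spawn the future on the runtime, have it push its result through a channel, and let the loading panel poll the receiving end on every event-loop tick. On wasm, the spawn would instead be wasm_bindgen_futures::spawn_local on the same thread.

use std::sync::mpsc;

// Kick off `future` on a background tokio worker and hand back a Receiver that
// the loading state can poll (rx.try_recv()) on every redraw.
fn start_loading<T: Send + 'static>(
    runtime: &tokio::runtime::Runtime,
    future: impl std::future::Future<Output = T> + Send + 'static,
) -> mpsc::Receiver<T> {
    let (tx, rx) = mpsc::channel();
    runtime.spawn(async move {
        let result = future.await;
        // The UI side may already be gone; ignore a closed channel.
        let _ = tx.send(result);
    });
    rx
}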
michaelkirk (Collaborator Author):

Ok! After much ado, this is ready for another review.

Other than a rebase, here's the primary change to fix wasm: 6fe2428

// See the import handbook for how to prepare this file.
let mut fgb =
// HttpFgbReader::open("https://abstreet.s3.amazonaws.com/population_areas.fgb").await?;
HttpFgbReader::open("https://s3.amazonaws.com/mjk_asdf/abs/population_areas.fgb").await?;
michaelkirk (Collaborator Author), Dec 31, 2020:

I changed this back to my bucket until your bucket has a CORS policy.

Here's the one I used:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": []
    }
]

To set it up from the web console: log in to AWS > S3 > abstreet bucket > Permissions > CORS.

dabreegster (Collaborator):

Thanks! I just updated it

michaelkirk (Collaborator Author):

Ok! After much ado, this is ready for another review.

I will say it's kind of complicated and I'm open to other approaches.

I suspect that this won't be the last time we want to do some async stuff in the browser though, so I'm gambling that, if not the code, at least the knowledge will be reused.

dabreegster (Collaborator) left a comment:

This certainly is complex, but I don't think any of it is avoidable, and it's pretty nicely isolated in load.rs. The only real wart is having to care about Send or not in LoadScenario (or any other place making use of this async magic in the future).

I tried Manhattan natively and on wasm, and it seemed to all just work! (Using the population layer, I saw a hotspot at https://www.openstreetmap.org/relation/3697456 -- no familiarity with the area, but it's a nice big building, so I'll believe it.)

My vote is to go with this approach. Once we have more knobs to turn at the activity generation level, being able to dynamically generate the scenario seems desirable -- so making the importer create one file per non-Seattle map isn't really what we want. I suspect we might want to cache the CensusAreas if we want to use this in an extremely live-tuning kind of way. But:

  • We could maybe put that in PerMap, similar to Scenario
  • Generating the scenario is slow anyway because of pathfinding for mode choice, so a few seconds to re-download isn't bad
  • Maybe we can split the pipeline a bit and assign number of people to buildings at importer time, and avoid the network during the rest of the popdat gen

So, I'll convert this from a draft and after the few misc fixes, we can merge. Thanks for fighting through the winit/future/async mess! Definitely valuable knowledge gained.

Cargo.toml Outdated
@@ -34,4 +34,7 @@ members = [
opt-level = 3

[patch.crates-io]
# Some niceties for parsing geojson feature properties.
# Upstreaming at https://github.com/georust/geojson/pull/155
geojson = { git = "https://github.com/georust/geojson", branch = "mkirk/try_from" }
dabreegster (Collaborator):

This is merged, so we can pin to geojson git and clean this up, right?

michaelkirk (Collaborator Author):

done!

let map_bounds = map.get_gps_bounds().clone();
let mut rng = sim::fork_rng(&mut rng);

LoadScenario::Future(Box::pin(async move {
dabreegster (Collaborator):

First use of pin! My understanding of this is pretty fuzzy, but https://rust-lang.github.io/async-book/04_pinning/01_chapter.html makes sense.

michaelkirk (Collaborator Author):

Yeah, I read that and it sort of makes sense, but I haven't internalized it at all. I'm primarily driven by compiler errors at this point.
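For reference, the pattern in question as a standalone toy example (not code from this PR): an async block has an anonymous, unnameable type, so to store it in something like LoadScenario::Future it gets boxed as a trait object, and a future must be pinned before it can be polled.

use std::future::Future;
use std::pin::Pin;

// Box::pin turns the anonymous `async move { ... }` type into a concrete,
// storable Pin<Box<dyn Future>>, guaranteeing the future won't move in memory
// once polling starts.
fn make_future(x: u32) -> Pin<Box<dyn Future<Output = u32> + Send>> {
    Box::pin(async move { x + 1 })
}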

LoadScenario::Future(Box::pin(async move {
let areas = popdat::CensusArea::fetch_all_for_map(&map_area, &map_bounds).await?;

let scenario_from_app: Box<dyn Send + FnOnce(&App) -> Scenario> =
dabreegster (Collaborator):

so IIUC, the structure is that an async piece will fetch remote data, then the "callback" runs synchronously (and expensively :( ) to produce the scenario. Makes sense -- conceptually the same as LoadScenario::Path on web.
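As a generic sketch of that split (placeholder names throughout, not the actual load.rs code): the future only does the Send-friendly fetching and hands back a boxed closure, which the main thread later runs against its non-Send state.

use anyhow::Result;
use std::future::Future;

struct MainThreadState; // stand-in for App, which isn't Send

async fn fetch_then_build<T, R>(
    fetch: impl Future<Output = Result<T>>,
    build: impl FnOnce(&MainThreadState, T) -> R + Send + 'static,
) -> Result<Box<dyn FnOnce(&MainThreadState) -> R + Send>>
where
    T: Send + 'static,
    R: 'static,
{
    // Async part: network only.
    let data = fetch.await?;
    // Sync, expensive part: deferred until the main thread applies the closure.
    Ok(Box::new(move |state| build(state, data)))
}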

michaelkirk (Collaborator Author):

That's right. I'd like to move more of the computation onto the background thread (at least for native) but right now it all requires Map which is not Send.

dabreegster (Collaborator):

Hmm, that might be easy to fix. I think the only reason it's not Send is because of the path_calc: ThreadLocal<RefCell<PathCalculator>> trick deep inside the pathfinding layer. It may not be hard to rearchitect and plumb the PathCalculator in. (It's an internal fast_paths thing that makes it cheaper to run queries repeatedly on the same thread.)

})
}

#[cfg(not(target_arch = "wasm32"))]
dabreegster (Collaborator):

We need a separate copy to express the +Send constraint, right? Doesn't seem worth parameterizing this one method, so this is fine.

michaelkirk (Collaborator Author), Dec 31, 2020:

Yes, the Send constraint can't be present for wasm32, but note also that the wasm version does wasm_bindgen_futures::spawn_local while the native version does tokio::runtime::Runtime::spawn.

I briefly tried to hide the spawn divergence by moving it into AppLike...

trait AppLike {
    // Associated future type: Send on native, not necessarily Send on wasm.
    type PossiblyConcurrentFuture: Future<Output = ()>;
    fn spawn(&self, future: Self::PossiblyConcurrentFuture);
    ...
}

#[cfg(not(target_arch = "wasm32"))]
impl AppLike for App {
    type PossiblyConcurrentFuture = ...; // some Future<Output = ()> + Send
    fn spawn(&self, future: Self::PossiblyConcurrentFuture) {
        self.runtime.spawn(future);
    }
    ...
}

#[cfg(target_arch = "wasm32")]
impl AppLike for App {
    type PossiblyConcurrentFuture = ...; // some Future<Output = ()>, no Send bound
    fn spawn(&self, future: Self::PossiblyConcurrentFuture) {
        wasm_bindgen_futures::spawn_local(future);
    }
    ...
}

But I abandoned the approach after seeing how many new compiler errors it caused. I might revisit - I suspect it's better for the tokio Runtime instance to live on something permanent like App.

@@ -244,3 +252,118 @@ mod wasm_loader {
}
}
}

pub struct FutureLoader<A, T>
dabreegster (Collaborator):

I'm noticing that this and FileLoader don't actually care that the app is AppLike at all. I suspect we'll be able to lift both of these to widgetry, but no urgency for that.

michaelkirk (Collaborator Author):

hmm - interesting!

@@ -7,12 +7,16 @@ edition = "2018"
[dependencies]
abstutil = { path = "../abstutil" }
anyhow = "1.0.35"
flatgeobuf = { version = "*" }
dabreegster (Collaborator):

Now that the fix is upstreamed, should we pin to 0.4.0? I guess using wildcards is fine too; it just means when we occasionally try to upgrade everything, we use cargo update -p instead of changing it here.

michaelkirk (Collaborator Author):

done!

@@ -2,7 +2,7 @@ use rand::Rng;
use rand_xorshift::XorShiftRng;

use abstutil::prettyprint_usize;
use geom::Polygon;
use geo::algorithm::{area::Area, contains::Contains};
dabreegster (Collaborator):

nit: Belongs in the block above. I hereby regret switching over to this import style. :(

michaelkirk (Collaborator Author):

Fixed, thanks!


warn!("skipping feature with missing population");
continue;
}
let population: usize = props["population"].parse()?;
dabreegster (Collaborator):

Much nicer API!

michaelkirk (Collaborator Author), Jan 1, 2021:

yeahhhhh geojson properties api is rough. georust/geojson#157

return Transition::Push(FutureLoader::<App, Scenario>::new(
ctx,
Box::pin(future),
"Loading Scenario",
dabreegster (Collaborator):

The UX is a bit clumsy now -- there's no indication of what's actually happening. Not sure what we could do better. If there was a clear sense of progress, or it was very easy to subscribe to the number of bytes the FGB reader had received so far, it'd be kind of nice to display it. But since there's not a known total, not really useful. And it would vastly complicate this interface. So seems fine for now.

michaelkirk (Collaborator Author):

The major hurdle is that any communication would potentially need to be cross-thread, and App is pretty strongly locked to the main thread.

If there's some way to make a handle to Timer Send, I think this could get better... the immediate hurdle is that Timer.sink needs to be Send, which blows up lots of things.

dabreegster marked this pull request as ready for review on December 31, 2020 at 23:09.
michaelkirk (Collaborator Author):

Alright - I believe I've responded to all of your feedback.

dabreegster (Collaborator):

LGTM, thank you!
