fetch US census data from a hosted FlatGeoBuf #432
Conversation
Cargo.toml (outdated):

# Minimize bandwidth and requests
# Waiting for release of https://github.com/bjornharrtell/flatgeobuf/pull/93)
flatgeobuf = { git = "https://github.com/bjornharrtell/flatgeobuf" }
With these changes it's fairly snappy. It's already faster than downloading and parsing a prepared geojson.
However, if we get a much larger data structure, I have some wild ideas for how to make things even faster.
I was half-following the discussion in the PRs -- what a detailed, cool contribution to a common lib! I'm sure many other use cases will benefit from this
Also, if you're interested, this would be pretty cool work to demo to cugos -- the next meeting should be Jan 20
#!/bin/bash -e

# Outputs a FlatGeoBuf of geometries and their corresponding population,
# suitable for use in the popdat crate.
This script is kind of heinous. I struggled a bit with ogr2ogr, but I'm not sure how to make it better.
Part of this would be straightforward to port to Rust, but unfortunately Rust doesn't yet have an FGB writer.
popdat/src/import_census.rs (outdated):

let collection = geojson::FeatureCollection::try_from(geojson)?;
pub fn fetch_all_for_map(map: &Map, timer: &mut Timer) -> anyhow::Result<Vec<CensusArea>> {
    timer.start("processing population areas fgb");
    let areas = tokio::runtime::Runtime::new()
It changes from map to map, but anecdotally, this seems to require 8-10 requests totaling 3-4MB of bandwidth.
That's awesome
popdat/src/import_census.rs (outdated):

timer.start("opening FGB reader");
// See the import handbook for how to prepare this file.
let mut fgb =
    HttpFgbReader::open("https://s3.amazonaws.com/mjk_asdf/abs/population_areas.fgb")
It probably makes sense to move this file to the abstreet S3 buckets.
(It's currently 11.5GB, FYI.)
Just finished downloading it, now struggling to re-upload. aws CLI couldn't figure out how to do a direct copy from another public-through-HTTP bucket. :P
It'll eventually be at abstreet.s3-website.us-east-2.amazonaws.com/population_areas.fgb. I don't think it makes sense to put it in /dev/data/input or anything like that, because that directory is synced with a local copy. I'm not sure if it makes sense to let people opt into having a local copy of this file through the updater.
Yeah, I've been thinking about the offline use case and haven't come to any conclusions. Relying on downloading a multi-GB file seems fraught. Maybe it'd make sense to instead have the importer generate and cache the final scenario?
That might be the easiest route forward. It might be a choppy gameplay experience to have to read from a remote file, do all of the pathfinding to pick modes, etc just to start a scenario. So for most maps, we can precompute this. If we still want to have live tuning of some of the parameters into the scenario generation (which seems likely -- % of students, if census doesn't tell us), then we can figure out some caching for the extracted CensusAreas. Some of it might be a little more interesting on the web.
Thanks - I've updated the link.
> we could hang on to the geo::Polygon, which won't explode when faced with an invalid ring

It depends how the code in popdat/src/distribute_people.rs evolves. Right now it's brute-force checking all buildings against all areas. I'm guessing we'll want to throw a quadtree around this at some point. There isn't a consistent way we're doing that in the rest of the code yet, but making it handle both geom::Polygon and geo::Polygon wouldn't be hard.

So we could either keep geo::Polygon, or do Polygon::buggy_new, or make Ring::unchecked_new (in the vein of PolyLine::unchecked_new, disabling the checks). I lean towards the former, since it's simpler and CensusArea isn't used outside of popdat yet. (Although maybe we'll want a simple UI for visualizing the demographics, unless there's already a nice web version somewhere.)
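For a rough sense of what keeping geo::Polygon would look like on the distribute_people side, here's a minimal sketch of the brute-force check described above, assuming the geo crate's Contains trait; the struct shape and function are illustrative only, not the actual popdat code.

use geo::algorithm::contains::Contains;

// Illustrative only -- the real CensusArea lives in popdat and has more fields.
struct CensusArea {
    // Keep the raw geo::Polygon, even if its ring touches itself.
    polygon: geo::Polygon<f64>,
    population: usize,
}

// Brute-force assignment: scan all areas for the first one containing the
// building's center. A quadtree over the areas would replace this scan later.
fn area_for_building<'a>(
    areas: &'a [CensusArea],
    building_center: geo::Point<f64>,
) -> Option<&'a CensusArea> {
    areas
        .iter()
        .find(|area| area.polygon.contains(&building_center))
}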
popdat/Cargo.toml (outdated):

@@ -7,12 +7,16 @@ edition = "2018"
[dependencies]
abstutil = { path = "../abstutil" }
anyhow = "1.0.35"
flatgeobuf = { version = "*" }
I noticed this brought in lots of new dependencies. cargo tree -i -p openssl points to a dep on reqwest. This'll prevent us from building in WASM because of the dependency on native SSL. I think we'll need a way to change the reqwest features that fgb uses to something like default-features = false, features = ["blocking", "rustls-tls"]. I'm always a little confused by overriding transitive deps, but I think the proper way is to add a feature to fgb that in turn changes the dep on reqwest, instead of using patch.crates-io.
Ooph, I once again have mostly forgotten about the wasm use case.
Allegedly reqwest has some support for the wasm target, but I'm a little nervous about what I'll find. I'll look into it.
Oh right, this maybe isn't straightforward -- in our wasm file loader, we use a call that doesn't even expose progress, let alone let you make range requests. If there's not something easy we can do, we can just rethink having game depend on popdat, and instead just generate a scenario through importer.
# ## Input
#
# I wasn't able to find a reasonable solution using the official census.gov
# files or API's. Instead I batch-downloaded all the census blocks and their
Ah, nhgis.org needs a login and has a fancy interface for selecting downloads, so we can't just stick a URL in here. :( The number of tables they have is a bit overwhelming and I couldn't easily tell what you've selected -- which year, table name? Could you be a bit more specific here so the instructions are more reproducible?
I've added some specific steps
Force-pushed from fbdfdd1 to 9df82a7.
I've gone the route of plumbing through the […]. In the longer-running conversation we've had, I think this is a vote in favor of having […].
By way of an update, I believe the only thing still outstanding is the wasm interop. There's a bunch of things I don't know and am trying to learn about, but it seems like a large task. It's unclear to me if you can "just" dump a bunch of async machinery written to work in Rust and have it run in wasm, due to the runtime constraints of the browser (i.e. single-threaded). There are certain tools which allow you to send data into and pop data out of JS, but I'm still trying to figure out if and how I can use them to have this code run in the browser. I'll keep trying to solve it this way tomorrow, but I'll probably pivot to having the importer generate and cache scenarios for now if I can't make progress soon. As a side note, a couple of things I had to do: […]
Exciting progress!
Agreed this is a good way forward. Getting wasm support glued to reqwest could be a big detour, and since we don't have any online parameter tuning ideas yet, we can just have this work offline only to start.
The one thing to watch out for if you also updated […]
Force-pushed from c941076 to 4d4b0b3.
Census block geometries often have self intersecting rings. In the cases I inspected this was due to *touching* itself at a point, not overlapping at a line nor crossing into an area. e.g. best I can represent in ascii...

.____. .__.
|    | |  |
|    |<   |   <--- Ring touching itself at the "|<"
|    |_|  |
|_________|
This reverts commit 7e382f5.
This isn't expected to work out of the box. It's more like really precise documentation. In fact, I first tried writing this *as* documentation, but in order to be clear enough it required so many code snippets that I just opted to write the script instead.
Similar to the FileLoader, the FutureLoader will load any future asynchronously while polling from a loading panel. On native, this will dispatch to a background thread via tokio. On wasm, this must use the same single-threaded runloop as all promises.
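As a rough sketch of the mechanism that commit message describes -- not the actual load.rs code, and assuming a futures-channel oneshot channel for handing the result back to the UI loop -- the native/wasm split looks something like this:

use std::future::Future;
use std::pin::Pin;

use futures_channel::oneshot;

// Native: run the future on a tokio worker thread; the loading panel polls
// the receiver every frame with try_recv().
#[cfg(not(target_arch = "wasm32"))]
fn spawn_loader<T: Send + 'static>(
    runtime: &tokio::runtime::Runtime,
    future: Pin<Box<dyn Future<Output = T> + Send>>,
) -> oneshot::Receiver<T> {
    let (tx, rx) = oneshot::channel();
    runtime.spawn(async move {
        let _ = tx.send(future.await);
    });
    rx
}

// wasm: there are no threads, so the future runs on the browser's
// single-threaded event loop alongside everything else.
#[cfg(target_arch = "wasm32")]
fn spawn_loader<T: 'static>(
    future: Pin<Box<dyn Future<Output = T>>>,
) -> oneshot::Receiver<T> {
    let (tx, rx) = oneshot::channel();
    wasm_bindgen_futures::spawn_local(async move {
        let _ = tx.send(future.await);
    });
    rx
}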
Force-pushed from 4d4b0b3 to 6fe2428.
Ok! After much ado, this is ready for another review. Other than a rebase, here's the primary change to fix wasm: 6fe2428
popdat/src/import_census.rs (outdated):

// See the import handbook for how to prepare this file.
let mut fgb =
    // HttpFgbReader::open("https://abstreet.s3.amazonaws.com/population_areas.fgb").await?;
    HttpFgbReader::open("https://s3.amazonaws.com/mjk_asdf/abs/population_areas.fgb").await?;
I changed this back to my bucket until your bucket has a cors policy.
Here's the one I used:
[
{
"AllowedHeaders": [
"*"
],
"AllowedMethods": [
"GET"
],
"AllowedOrigins": [
"*"
],
"ExposeHeaders": []
}
]
To set it up from the web: log in to the console > S3 > abstreet bucket > Permissions > CORS.
Thanks! I just updated it
I will say it's kind of complicated and I'm open to other approaches. I suspect that this won't be the last time we want to do some async stuff in the browser though, so I'm gambling that, if not the code, at least the knowledge will be reused.
This certainly is complex, but I don't think any of it is avoidable, and it's pretty nicely isolated in load.rs. The only real wart is having to care about Send or not in LoadScenario (or any other place making use of this async magic in the future).

I tried Manhattan natively and on wasm, and it seemed to all just work! (Using the population layer, I saw a hotspot at https://www.openstreetmap.org/relation/3697456 -- no familiarity with the area, but it's a nice big building, so I'll believe it.)

My vote is to go with this approach. Once we have more knobs to turn at the activity generation level, being able to dynamically generate the scenario seems desirable -- so making the importer create one file per non-Seattle map isn't really what we want. I suspect we might want to cache the CensusAreas if we want to use this in an extremely live-tuning kind of way. But:

- We could maybe put that in PerMap, similar to Scenario
- Generating the scenario is slow anyway because of pathfinding for mode choice, so a few seconds to re-download isn't bad
- Maybe we can split the pipeline a bit and assign number of people to buildings at importer time, and avoid the network during the rest of the popdat gen
So, I'll convert this from a draft and after the few misc fixes, we can merge. Thanks for fighting through the winit/future/async mess! Definitely valuable knowledge gained.
Cargo.toml (outdated):

@@ -34,4 +34,7 @@ members = [
opt-level = 3

[patch.crates-io]
# Some niceties for parsing geojson feature properties.
# Upstreaming at https://github.com/georust/geojson/pull/155
geojson = { git = "https://github.com/georust/geojson", branch = "mkirk/try_from" }
This is merged, so we can pin to geojson git and clean this up, right?
done!
let map_bounds = map.get_gps_bounds().clone();
let mut rng = sim::fork_rng(&mut rng);

LoadScenario::Future(Box::pin(async move {
First use of pin! My understanding of this is pretty fuzzy, but https://rust-lang.github.io/async-book/04_pinning/01_chapter.html makes sense.
Yeah, I read that and it sort of makes sense, but I haven't internalized it at all. I'm primarily driven by compiler errors at this point.
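For anyone else puzzling over the same thing, here's a stripped-down illustration of the pattern: the async block has an anonymous type, so storing it in an enum variant means boxing it up as a trait object and pinning it. The Output type here is purely for illustration -- the real LoadScenario produces a scenario-building callback, not a String.

use std::future::Future;
use std::pin::Pin;

enum LoadScenario {
    Nothing,
    // On native this trait object would also need `+ Send` so tokio can move it
    // to a worker thread; see the cfg discussion below.
    Future(Pin<Box<dyn Future<Output = anyhow::Result<String>>>>),
}

fn make_loader() -> LoadScenario {
    LoadScenario::Future(Box::pin(async move {
        // ... awaits on the FGB reader would go here ...
        Ok::<_, anyhow::Error>("placeholder".to_string())
    }))
}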
LoadScenario::Future(Box::pin(async move {
    let areas = popdat::CensusArea::fetch_all_for_map(&map_area, &map_bounds).await?;

    let scenario_from_app: Box<dyn Send + FnOnce(&App) -> Scenario> =
so IIUC, the structure is that an async piece will fetch remote data, then the "callback" synchronously (and expensively :( ) runs to produce the scenario. Makes sense -- conceptually the same as LoadScenario::Path on web.
That's right. I'd like to move more of the computation onto the background thread (at least for native), but right now it all requires Map, which is not Send.
Hmm, that might be easy to fix. I think the only reason it's not Send is because of the path_calc: ThreadLocal<RefCell<PathCalculator>> trick deep inside the pathfinding layer. It may not be hard to rearchitect and plumb the PathCalculator in. (It's an internal fast_paths thing that makes it cheaper to run queries repeatedly on the same thread.)
})
}

#[cfg(not(target_arch = "wasm32"))]
We need a separate copy to express the +Send constraint, right? Doesn't seem worth parameterizing this one method, so this is fine.
Yes, the Send constraint can't be present for wasm32, but note also that the wasm version does wasm_bindgen_futures::spawn_local while the native one does tokio::runtime::Runtime::spawn.

I briefly tried to hide the spawn divergence by moving it into AppLike...
trait AppLike {
+   type PossiblyConcurrentFuture: Future<Output = ()>;
+   fn spawn(&self, future: Self::PossiblyConcurrentFuture);
    ...
}

#[cfg(not(target_arch = "wasm32"))]
impl AppLike for App {
    // Native: the future runs on a tokio worker thread, so it must be Send.
    type PossiblyConcurrentFuture = Pin<Box<dyn Future<Output = ()> + Send>>;
    fn spawn(&self, future: Self::PossiblyConcurrentFuture) {
        self.runtime.spawn(future);
    }
    ...
}

#[cfg(target_arch = "wasm32")]
impl AppLike for App {
    // wasm: everything shares the browser's single-threaded event loop.
    type PossiblyConcurrentFuture = Pin<Box<dyn Future<Output = ()>>>;
    fn spawn(&self, future: Self::PossiblyConcurrentFuture) {
        wasm_bindgen_futures::spawn_local(future);
    }
    ...
}
But I abandoned the approach after seeing how many new compiler errors it caused. I might revisit - I suspect it's better for the tokio Runtime instance to live on something permanent like App.
@@ -244,3 +252,118 @@ mod wasm_loader {
}
}
}

pub struct FutureLoader<A, T>
I'm noticing that this and FileLoader don't actually care that the app is AppLike at all. I suspect we'll be able to lift both of these to widgetry, but no urgency for that.
hmm - interesting!
popdat/Cargo.toml (outdated):

@@ -7,12 +7,16 @@ edition = "2018"
[dependencies]
abstutil = { path = "../abstutil" }
anyhow = "1.0.35"
flatgeobuf = { version = "*" }
Now that the fix is upstreamed, should we pin to 0.4.0? I guess using wildcards is fine too; it just means when we occasionally try to upgrade everything, we use cargo update -p instead of changing it here.
done!
popdat/src/distribute_people.rs (outdated):

@@ -2,7 +2,7 @@ use rand::Rng;
use rand_xorshift::XorShiftRng;

use abstutil::prettyprint_usize;
use geom::Polygon;
use geo::algorithm::{area::Area, contains::Contains};
nit: Belongs in the block above. I hereby regret switching over to this import style. :(
Fixed, thanks!
warn!("skipping feature with missing population"); | ||
continue; | ||
} | ||
let population: usize = props["population"].parse()?; |
Much nicer API!
yeahhhhh geojson properties api is rough. georust/geojson#157
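For contrast, here's a sketch of roughly what pulling the same property out of a geojson::Feature looks like (properties there are an Option of a serde_json map rather than plain strings), which is what makes the indexing-plus-parse style above feel so much nicer. The helper is illustrative only, not code from this PR.

fn population_of(feature: &geojson::Feature) -> anyhow::Result<usize> {
    let props = feature
        .properties
        .as_ref()
        .ok_or_else(|| anyhow::anyhow!("feature has no properties"))?;
    // Depending on how the source file was written, the value may be a JSON
    // number or a string, so both cases have to be handled.
    match props
        .get("population")
        .ok_or_else(|| anyhow::anyhow!("missing population"))?
    {
        serde_json::Value::Number(n) => {
            let n = n
                .as_u64()
                .ok_or_else(|| anyhow::anyhow!("population not a u64"))?;
            Ok(n as usize)
        }
        serde_json::Value::String(s) => Ok(s.parse()?),
        other => anyhow::bail!("unexpected population type: {:?}", other),
    }
}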
return Transition::Push(FutureLoader::<App, Scenario>::new(
    ctx,
    Box::pin(future),
    "Loading Scenario",
The UX is a bit clumsy now -- there's no indication of what's actually happening. Not sure what we could do better. If there was a clear sense of progress, or it was very easy to subscribe to the number of bytes the FGB reader had received so far, it'd be kind of nice to display it. But since there's not a known total, not really useful. And it would vastly complicate this interface. So seems fine for now.
The major hurdle is that any communication would need to be potentially cross-thread, and App is pretty strongly locked into the main thread.
If there's some way to make a handle to Timer Send, I think this could get better... the immediate hurdle is that Timer.sink needs to be Send, which blows up lots of things.
Alright - I believe I've responded to all of your feedback.
LGTM, thank you!
Rather than prepping a cropped geojson for each city, I've hosted an agglomerated flatgeobuf file with all the areas and their populations.
Because FGBs are quite efficient, I was able to store the most granular census area: "census blocks". A tract can be several times larger than a block, so getting this finer-grained information should give us a more accurate picture of where trips terminate.
I'm actually optimistic that it would be reasonable to serve other countries' population areas in the same unified fgb file, and then we don't need any client changes, per-city handling, or server-side logic about which census/population data we want -- we "just" need to update the hosted FGB with the new data, and this code, which requests based on the WGS84 bounding box, will continue to work.
Preparing the fgb is currently a manual process, but hopefully we don't need to do it often. I've included a script to show what I did for the US data which could be adapted to incorporate other countries. I imagine if it gets more use it'd make sense to turn it into something more robust than a bash script.
The reason this is a draft is that the app crashes in some cities (e.g. try running it in South Seattle), due to the census geometries containing invalid polygon rings.
The crash can be avoided with this change: 7e382f5
...but I assume we don't want to wholesale disable ring validation... right?
I've verified that, at least in the cases I inspected, the error is legit. They really are self-intersecting (just touching in the cases I inspected), but I don't think that will stop the building population assignment from working.

One approach: rather than converting to an abstreet::geom::Polygon, we could hang on to the geo::Polygon, which won't explode when faced with an invalid ring. WDYT?