make json preprocess version numbers available external to jsons #2

jeremydouglass · 2019-05-26T20:49:06Z

We cut down substantially on times for reruns, but just opening, parsing, and closing each JSON still takes a long time, even when the check turns out to be a no-op. This is something like 90 seconds for 300 articles, even if every article is only read and then ignored (spaCy never runs, zip never saved).

In the future we might want to move to keeping a list of article names with version numbers in the root of the zip -- like we are keeping a list of processed zips. Then we don't have to open them (which takes a surprisingly long time).

For example, we might have a single json file in the root that helps us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when it was last processed.

{
  "article1.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  },
  "article2.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  }
}
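A minimal Python sketch of how such an index might be consulted, so that only one small file is parsed per zip instead of every article. The file name `manifest.json`, the constant `CURRENT_PPVERSION`, and both function names are hypothetical, not part of the existing code. Only `ppversion` is checked here; `hashversion` and `processed` could be compared the same way.

```python
import json
import zipfile

MANIFEST_NAME = "manifest.json"  # hypothetical name for the index file in the zip root
CURRENT_PPVERSION = "0.2"        # hypothetical: the version the preprocessor writes now


def read_manifest(zf):
    """Return the manifest dict, or an empty dict if the zip has none yet."""
    try:
        with zf.open(MANIFEST_NAME) as fh:
            return json.load(fh)
    except KeyError:
        # ZipFile.open raises KeyError when the member does not exist.
        return {}


def articles_to_reprocess(zf, ppversion=CURRENT_PPVERSION):
    """List article members whose recorded ppversion is missing or different.

    Only the zip directory and the manifest are read; no article JSON is opened.
    """
    manifest = read_manifest(zf)
    names = [n for n in zf.namelist()
             if n.endswith(".json") and n != MANIFEST_NAME]
    return [n for n in names
            if manifest.get(n, {}).get("ppversion") != ppversion]
```

With an open `zipfile.ZipFile`, `articles_to_reprocess(zf)` returns the member names that still need work. An article with no manifest entry is treated as stale, so a zip without a manifest is simply reprocessed in full.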
