make json preprocess version numbers available external to jsons #2

Open
jeremydouglass opened this issue May 26, 2019 · 0 comments

We cut rerun times down substantially, but just opening, parsing, and closing each JSON still takes a long time, even when the check on a rerun turns out to be a no-op. This is something like 90 seconds for 300 articles, even if every article is only read and then ignored (spaCy never runs, the zip is never saved).

In the future we might want to move to keeping a list of article names with version numbers in the root of the zip -- like we are keeping a list of processed zips. Then we don't have to open them (which takes a surprisingly long time).

For example, we might have a single json file in the root that helps us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when it was last processed.

{
  "article1.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  },
  "article2.json": {
    "ppversion": "0.1",
    "hashversion": "0.1",
    "processed": "20190503123114"
  }
}
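
A rough sketch of how a rerun could use such a file, in case it helps the discussion. Everything named here is an assumption for illustration: the file name `manifest.json`, the constants `PP_VERSION` and `HASH_VERSION`, and the helper functions are hypothetical and not part of the current code. The idea is to read one small file from the zip root and only open the articles whose entry is missing or out of date.

```python
import json
import zipfile

# Hypothetical names and values, chosen only for this sketch.
MANIFEST_NAME = "manifest.json"
PP_VERSION = "0.1"
HASH_VERSION = "0.1"


def load_manifest(zf):
    """Return the manifest dict stored in the zip root, or {} if there is none."""
    try:
        with zf.open(MANIFEST_NAME) as fh:
            return json.load(fh)
    except KeyError:
        # ZipFile.open raises KeyError when the member does not exist.
        return {}


def needs_processing(name, manifest,
                     ppversion=PP_VERSION, hashversion=HASH_VERSION):
    """True if the article has no entry or was processed with other versions."""
    entry = manifest.get(name)
    if entry is None:
        return True
    return (entry.get("ppversion") != ppversion
            or entry.get("hashversion") != hashversion)


def articles_to_process(zf):
    """List article JSON names that need work, without opening any of them."""
    manifest = load_manifest(zf)
    return [
        name for name in zf.namelist()
        if name.endswith(".json")
        and name != MANIFEST_NAME
        and needs_processing(name, manifest)
    ]
```

With this, a rerun over an already processed zip would read one member and a name listing instead of parsing every article. A condition on `processed` could be added to `needs_processing` in the same way if we want to reprocess by age.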