We cut rerun times down substantially, but just opening, parsing, and closing each JSON still takes a long time -- even when the check we do immediately afterward makes the whole pass a no-op. This is something like 90 seconds for 300 articles, even when every article is only read and then ignored (spacy never runs, the zip is never saved).
In the future we might want to keep a list of article names with version numbers in the root of the zip -- just as we keep a list of processed zips. Then we wouldn't have to open each article JSON (which takes a surprisingly long time).
For example, we might have a single JSON file in the root that lets us write conditions to decide what to update / reprocess, based on the preprocess version, the hash, and/or when the article was last processed.