Technologies cleanup automation #52
Conversation
```sql
tech.technology NOT IN (SELECT DISTINCT name FROM wappalyzer.technologies)
OR category NOT IN (SELECT DISTINCT name FROM wappalyzer.categories)
```
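The quoted condition flags detections whose technology or category no longer exists in the wappalyzer definition tables. A minimal sketch of how such a cleanup step might look (the target table name `crawl.pages_technologies` is an assumption for illustration; only the WHERE condition comes from the diff):

```sql
-- Hypothetical cleanup sketch: remove detections that no longer
-- match the current wappalyzer definitions. Table name is assumed.
DELETE FROM crawl.pages_technologies AS tech
WHERE tech.technology NOT IN (SELECT DISTINCT name FROM wappalyzer.technologies)
   OR tech.category NOT IN (SELECT DISTINCT name FROM wappalyzer.categories);
```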
Are the `wappalyzer` tables updated with every crawl? Is that planned to continue after we remove the legacy tables? (I thought this WAS a legacy table!)
Those are updated with every repo update, see https://github.com/HTTPArchive/wappalyzer/blob/main/.github/workflows/upload.yml.
They are not dependent on crawl detections, but they should represent the crawl configuration, since we don't merge changes during a crawl.
Any risk that a technology removed (or renamed) mid-crawl causes its data to be deleted? I try not to merge things mid-crawl, but stranger things have happened. Not sure how to deal with that, to be honest...
Sure. But I think monthly manual changes pose a bigger risk.
- I hope 🤞🏽 we'll find a solution to avoid corrupted values in wptagent during the next attempt.
- To avoid retrospective issues, I expect we'll be able to extend `reports.cwv_tech_technologies` with clean historical data.
- We could also back up the original data for N months, just in case.
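The backup idea above could be sketched as a snapshot taken before the cleanup runs, with an expiry so it cleans itself up after N months. This is a hypothetical illustration, not the actual pipeline step; all table names and the 90-day window are assumptions:

```sql
-- Hypothetical backup sketch: snapshot the rows the cleanup would
-- delete, with an expiration so the backup is dropped automatically.
-- `backups.technologies_snapshot` and the 90-day interval are assumed.
CREATE TABLE backups.technologies_snapshot
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)
AS
SELECT *
FROM crawl.pages_technologies
WHERE technology NOT IN (SELECT DISTINCT name FROM wappalyzer.technologies)
   OR category NOT IN (SELECT DISTINCT name FROM wappalyzer.categories);
```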
Co-authored-by: Barry Pollard <[email protected]>
I added two steps to the crawl pipeline for the `crawl.pages` table. This should give us clean monthly technology data until we find time to fix those few anomalies in wptagent.
Closes #43