
Technologies cleanup automation #52

Merged
merged 14 commits into from
Jan 20, 2025
Conversation

@max-ostapenko (Contributor) commented Jan 20, 2025

I added two steps to the crawl pipeline:

  • an assertion on the technologies data in staging (optional, so it doesn't block downstream tasks) to track the scale of the issue
  • a cleanup step for the crawl.pages table.

This should give us clean monthly technology data until we find time to fix the few remaining anomalies in wptagent.
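The cleanup step boils down to an anti-join: drop any detection whose technology or category name is unknown to the wappalyzer reference tables. A minimal sketch using SQLite in place of BigQuery (the table `pages_tech` and the sample rows are hypothetical simplifications of the real `crawl.pages` schema):

```python
import sqlite3

# In-memory stand-ins for the real BigQuery tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE technologies (name TEXT);  -- stand-in for wappalyzer.technologies
CREATE TABLE categories (name TEXT);    -- stand-in for wappalyzer.categories
CREATE TABLE pages_tech (page TEXT, technology TEXT, category TEXT);
""")
con.executemany("INSERT INTO technologies VALUES (?)", [("WordPress",), ("React",)])
con.executemany("INSERT INTO categories VALUES (?)", [("CMS",), ("JavaScript frameworks",)])
con.executemany(
    "INSERT INTO pages_tech VALUES (?, ?, ?)",
    [
        ("a.com", "WordPress", "CMS"),        # valid detection, kept
        ("b.com", "W0rdPress", "CMS"),        # corrupted technology name, removed
        ("c.com", "React", "JS fram3works"),  # corrupted category, removed
    ],
)

# Delete detections whose technology or category is unknown to the
# wappalyzer reference tables -- the condition shown in the PR diff.
con.execute("""
DELETE FROM pages_tech
WHERE technology NOT IN (SELECT DISTINCT name FROM technologies)
   OR category NOT IN (SELECT DISTINCT name FROM categories)
""")
remaining = [row[0] for row in con.execute("SELECT page FROM pages_tech")]
print(remaining)  # ['a.com']
```

Note that `NOT IN` against a subquery silently matches nothing if the reference table ever contains a NULL name, which is one reason to also run the assertion step before deleting.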

Closes #43

@max-ostapenko max-ostapenko marked this pull request as ready for review January 20, 2025 21:25
Comment on lines +24 to +25
tech.technology NOT IN (SELECT DISTINCT name FROM wappalyzer.technologies)
OR category NOT IN (SELECT DISTINCT name FROM wappalyzer.categories)
Member

Are the wappalyzer tables updated with every crawl? Is that planned to be continued after we remove the legacy tables? (I thought this WAS a legacy table!)

Contributor Author (max-ostapenko)

Those tables are updated on every repo update, see https://github.com/HTTPArchive/wappalyzer/blob/main/.github/workflows/upload.yml.

They don't depend on crawl detections, but they should match the crawl configuration, given that we don't merge changes during a crawl.

Member


Any risk that a technology removed (or renamed) mid-crawl causes valid detections to be deleted? I try not to merge things mid-crawl, but stranger things have happened. Not sure how to deal with that, to be honest...

Contributor Author (max-ostapenko)


Sure. But I think monthly manual changes pose a bigger risk.

  • I hope 🤞🏽 we'll find a solution that avoids corrupted values in wptagent during the next attempt.
  • To avoid retrospective issues, I expect we'll be able to extend reports.cwv_tech_technologies with clean historical data.
  • We could also back up the original data for N months, just in case.
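The backup idea above is essentially "snapshot what you're about to delete, then delete it". A minimal SQLite sketch (table names, the `deleted_month` column, and the inlined filter are hypothetical; the real job would run in BigQuery with the full anti-join condition):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pages_tech (page TEXT, technology TEXT);
CREATE TABLE pages_tech_backup (deleted_month TEXT, page TEXT, technology TEXT);
""")
con.executemany("INSERT INTO pages_tech VALUES (?, ?)",
                [("a.com", "WordPress"), ("b.com", "W0rdPress")])

month = "2025-01"
# Placeholder for the real anti-join condition against the wappalyzer tables.
bad = "technology NOT IN ('WordPress')"

# Snapshot the rows we are about to remove, then delete them.
con.execute(
    f"INSERT INTO pages_tech_backup SELECT ?, page, technology "
    f"FROM pages_tech WHERE {bad}",
    (month,),
)
con.execute(f"DELETE FROM pages_tech WHERE {bad}")

backed_up = [row[1] for row in con.execute("SELECT * FROM pages_tech_backup")]
kept = [row[0] for row in con.execute("SELECT page FROM pages_tech")]
print(backed_up, kept)  # ['b.com'] ['a.com']
```

Tagging each backup row with the crawl month makes it easy to expire snapshots after N months, as suggested.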

Review threads on definitions/output/crawl/pages.js: 1 resolved, 3 outdated and resolved.
@max-ostapenko max-ostapenko merged commit de144b0 into main Jan 20, 2025
19 checks passed
@max-ostapenko max-ostapenko deleted the vertical-butterfly branch January 20, 2025 23:20
Successfully merging this pull request may close: Fix corrupted values of technologies detections (#43).