Technologies cleanup automation #52

Merged Jan 20, 2025 (14 commits)
46 changes: 35 additions & 11 deletions definitions/declarations/httparchive.js
@@ -1,17 +1,41 @@
-const stagingTables = ['pages', 'requests', 'parsed_css']
-for (const table of stagingTables) {
+// Staging tables source: https://github.com/HTTPArchive/crawl/blob/main/crawl.py
+['pages', 'requests', 'parsed_css'].forEach(table =>
   declare({
     schema: 'crawl_staging',
     name: table
   })
-}
+)

-declare({
-  schema: 'wappalyzer',
-  name: 'technologies'
-})
+assert('corrupted_technology_values')
+  .tags(['crawl_complete'])
+  .query(ctx => `
+SELECT
+  date,
+  client,
+  tech,
+  COUNT(DISTINCT page) AS cnt_pages,
+  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
+FROM ${ctx.ref('crawl_staging', 'pages')} AS pages
+LEFT JOIN pages.technologies AS tech
+LEFT JOIN tech.categories AS category
+WHERE
+  date = '${constants.currentMonth}' AND
+  (
+    tech.technology NOT IN (SELECT DISTINCT name FROM wappalyzer.technologies)
+    OR category NOT IN (SELECT DISTINCT name FROM wappalyzer.categories)
+    OR ARRAY_LENGTH(tech.categories) = 0
Comment on lines +25 to +26
Member

Are the wappalyzer tables updated with every crawl? Is that planned to be continued after we remove the legacy tables? (I thought this WAS a legacy table!)

Contributor Author

Those are updated with every repo update, see https://github.com/HTTPArchive/wappalyzer/blob/main/.github/workflows/upload.yml.

These are not dependent on crawl detections.
But they should represent the crawl configuration considering we don't merge changes during crawl.

Member

Any risk that a technology that is removed (or renamed) mid-crawl causes it to be deleted? I try not to merge things mid-crawl, but stranger things have happened. But not sure how to deal with that to be honest...

Contributor Author

Sure. But I think monthly manual changes pose a bigger risk.

  • I hope 🤞🏽 we'll find a solution to avoid corrupted values on wptagent during the next attempt.
  • To avoid retrospective issues, I expect we'll be able to extend reports.cwv_tect_technologies with clean historical data.
  • We could also back up the original data for N months just in case.
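
The backup idea in the last bullet could look roughly like the sketch below. It is only an illustration and not part of this PR: the helper name `buildBackupSql`, its parameters, and the table names in the usage line are hypothetical. It builds a BigQuery statement that copies one month of `technologies` values into a table that expires on its own.

```javascript
// Hypothetical helper (not part of this PR): builds a BigQuery statement that
// copies one month of technologies into a table with an expiration timestamp.
function buildBackupSql ({ sourceTable, backupTable, month, retentionMonths }) {
  if (!Number.isInteger(retentionMonths) || retentionMonths <= 0) {
    throw new Error('retentionMonths must be a positive integer')
  }
  // TIMESTAMP_ADD has no MONTH date part, so N months is approximated in days.
  const retentionDays = retentionMonths * 31
  return `
CREATE OR REPLACE TABLE \`${backupTable}\`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL ${retentionDays} DAY)
)
AS
SELECT date, client, page, technologies
FROM \`${sourceTable}\`
WHERE date = '${month}';
`
}

// Example with hypothetical table names.
console.log(buildBackupSql({
  sourceTable: 'crawl_staging.pages',
  backupTable: 'scratchspace.pages_technologies_backup',
  month: '2025-01-01',
  retentionMonths: 3
}))
```

In a Dataform setup such a statement would presumably run before the cleanup `UPDATE`, so that the backup holds the values as detected.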

+  )
+GROUP BY
+  date,
+  client,
+  tech
+ORDER BY cnt_pages DESC
+`);

-declare({
-  schema: 'wappalyzer',
-  name: 'categories'
-})
+// Wappalyzer tables source: https://github.com/HTTPArchive/wappalyzer/blob/main/.github/workflows/upload.yml
+['technologies', 'categories'].forEach(table =>
+  declare({
+    schema: 'wappalyzer',
+    name: table
+  })
+)
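
For readers less familiar with BigQuery array joins, the three conditions of the `corrupted_technology_values` assertion can be mimicked in plain JavaScript. This is a minimal sketch on in-memory data, not the assertion itself: the reference sets, the sample pages, and the function names are hypothetical, and grouping is done by technology name rather than by the whole `tech` struct.

```javascript
// Hypothetical stand-ins for wappalyzer.technologies and wappalyzer.categories.
const knownTechnologies = new Set(['WordPress', 'jQuery'])
const knownCategories = new Set(['CMS', 'Blogs', 'JavaScript libraries'])

// Mirrors the three OR conditions of the assertion for a single detection.
function corruptionReasons (tech) {
  const reasons = []
  if (!knownTechnologies.has(tech.technology)) reasons.push('unknown technology')
  if (tech.categories.some(category => !knownCategories.has(category))) reasons.push('unknown category')
  if (tech.categories.length === 0) reasons.push('empty categories')
  return reasons
}

// Counts distinct pages per flagged technology and keeps up to 3 sample pages,
// ordered by page count descending (simplified: grouped by technology name).
function findCorrupted (pages) {
  const pagesByTech = new Map()
  for (const page of pages) {
    for (const tech of page.technologies) {
      if (corruptionReasons(tech).length === 0) continue
      if (!pagesByTech.has(tech.technology)) pagesByTech.set(tech.technology, new Set())
      pagesByTech.get(tech.technology).add(page.page)
    }
  }
  return [...pagesByTech]
    .map(([technology, pageSet]) => ({
      technology,
      cnt_pages: pageSet.size,
      sample_pages: [...pageSet].slice(0, 3)
    }))
    .sort((a, b) => b.cnt_pages - a.cnt_pages)
}

// Example input: one garbled technology name and one detection without categories.
const samplePages = [
  { page: 'https://a.example/', technologies: [{ technology: 'W0rdPr3ss', categories: ['CMS'] }] },
  { page: 'https://b.example/', technologies: [{ technology: 'jQuery', categories: [] }] }
]
console.log(findCorrupted(samplePages))
```

An empty result corresponds to a passing assertion; any returned row points at detections worth inspecting before the crawl is published.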
78 changes: 73 additions & 5 deletions definitions/output/crawl/pages.js
@@ -52,23 +52,91 @@ publish('pages', {
 DELETE FROM ${ctx.self()}
 WHERE date = '${constants.currentMonth}' AND
   client = 'desktop';
-`).query(ctx => `
+
+INSERT INTO ${ctx.self()}
 SELECT
   *
 FROM ${ctx.ref('crawl_staging', 'pages')}
 WHERE date = '${constants.currentMonth}' AND
   client = 'desktop'
-  ${constants.devRankFilter}
-`).postOps(ctx => `
+  ${constants.devRankFilter};
+
 DELETE FROM ${ctx.self()}
 WHERE date = '${constants.currentMonth}' AND
   client = 'mobile';
-
-INSERT INTO ${ctx.self()}
+`).query(ctx => `
 SELECT
   *
 FROM ${ctx.ref('crawl_staging', 'pages')}
 WHERE date = '${constants.currentMonth}' AND
   client = 'mobile'
   ${constants.devRankFilter}
+`).postOps(ctx => `
+CREATE TEMP TABLE technologies_cleaned AS (
+  WITH wappalyzer AS (
+    SELECT DISTINCT
+      name AS technology,
+      categories
+    FROM ${ctx.ref('wappalyzer', 'technologies')}
+  ), pages AS (
+    SELECT
+      client,
+      page,
+      tech.technology,
+      tech.categories,
+      tech.info
+    FROM ${ctx.self()} AS pages
+    LEFT JOIN pages.technologies AS tech
+    WHERE date = '${constants.currentMonth}' ${constants.devRankFilter}
+  ), -- Identify impacted pages
+  impacted_pages AS (
+    SELECT DISTINCT
+      client,
+      page
+    FROM pages
+    LEFT JOIN pages.categories AS category
+    WHERE
+      -- Technology is corrupted
+      technology NOT IN (SELECT DISTINCT technology FROM wappalyzer) OR
+      -- Technology's category is corrupted
+      CONCAT(technology, category) NOT IN (
+        SELECT DISTINCT
+          CONCAT(technology, category)
+        FROM wappalyzer
+        LEFT JOIN wappalyzer.categories AS category
+      )
+  ), -- Keep valid technologies and use correct categories
+  reconstructed_technologies AS (
+    SELECT
+      client,
+      page,
+      ARRAY_AGG(STRUCT(
+        pages.technology,
+        wappalyzer.categories,
+        pages.info
+      )) AS technologies
+    FROM pages
+    INNER JOIN impacted_pages
+    USING (client, page)
+    INNER JOIN wappalyzer
+    ON pages.technology = wappalyzer.technology
+    GROUP BY
+      client,
+      page
+  )
+
+  SELECT
+    client,
+    page,
+    technologies
+  FROM reconstructed_technologies
+);
+
+-- Update the crawl.pages table with the cleaned and restored technologies
+UPDATE ${ctx.self()} AS pages
+SET technologies = technologies_cleaned.technologies
+FROM technologies_cleaned
+WHERE pages.date = '${constants.currentMonth}' AND
+  pages.client = technologies_cleaned.client AND
+  pages.page = technologies_cleaned.page;
 `)
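
The cleanup in the new `postOps` block (find impacted pages, keep only known technologies, restore their categories from the reference table) can be followed more easily on a small in-memory example. The sketch below is an illustration under assumptions, not the pipeline code: the `wappalyzer` map, the sample page, and the function names are hypothetical, and NULL handling is simplified to what the SQL appears to do.

```javascript
// Hypothetical reference data standing in for wappalyzer.technologies.
const wappalyzer = new Map([
  ['WordPress', ['CMS', 'Blogs']],
  ['jQuery', ['JavaScript libraries']]
])

// A detection is valid when its name is known and every category it carries
// is one of the categories listed for that technology.
function isValid (tech) {
  const expected = wappalyzer.get(tech.technology)
  return expected !== undefined && tech.categories.every(category => expected.includes(category))
}

// Mirrors impacted_pages, reconstructed_technologies and the final UPDATE.
function cleanPage (page) {
  const impacted = page.technologies.some(tech => !isValid(tech))
  if (!impacted) return page

  const technologies = page.technologies
    .filter(tech => wappalyzer.has(tech.technology)) // INNER JOIN wappalyzer
    .map(tech => ({ ...tech, categories: wappalyzer.get(tech.technology) })) // reference categories win

  // If nothing survives the join, the SQL produces no row for this page,
  // so the UPDATE leaves its technologies untouched.
  if (technologies.length === 0) return page
  return { ...page, technologies }
}

// Example: one detection with a bogus category, one with a garbled name.
const dirtyPage = {
  client: 'mobile',
  page: 'https://a.example/',
  technologies: [
    { technology: 'WordPress', categories: ['CMS', 'N0pe'], info: ['6.4'] },
    { technology: 'W0rdPr3ss', categories: ['CMS'], info: [] }
  ]
}
console.log(JSON.stringify(cleanPage(dirtyPage), null, 2))
```

One consequence visible in the sketch: as far as the query reads, a page whose detections are all unknown never appears in `technologies_cleaned`, so its corrupted values would stay in place after the `UPDATE`.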