Commit
Merge branch 'main' into serp-tests
janheinrichmerker committed Nov 1, 2023
2 parents 9d8b4e2 + 3367559 commit f0714e0
Showing 20 changed files with 1,261 additions and 196 deletions.
1 change: 0 additions & 1 deletion .github/workflows/ci.yml
@@ -70,7 +70,6 @@ jobs:
       - name: "📤 Upload coverage to Codecov"
         uses: codecov/codecov-action@v2
         with:
-          fail_ci_if_error: true
           token: ${{ secrets.CODECOV_TOKEN }}
   release:
     name: "🚀 Create GitHub release"
109 changes: 2 additions & 107 deletions .gitignore
@@ -1,112 +1,7 @@
### JetBrains ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf

# AWS User-specific
.idea/**/aws.xml

# Generated files
.idea/**/contentModel.xml
*.iml

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr

# CMake
cmake-build-*/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
.idea
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# SonarLint plugin
.idea/sonarlint/

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests

# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser

### JetBrains Patch ###
# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721

# *.iml
# modules.xml
# .idea/misc.xml
# *.ipr

# Sonarlint plugin
# https://plugins.jetbrains.com/plugin/7973-sonarlint
.idea/**/sonarlint/

# SonarQube Plugin
# https://plugins.jetbrains.com/plugin/7238-sonarqube-community-plugin
.idea/**/sonarIssues.xml

# Markdown Navigator plugin
# https://plugins.jetbrains.com/plugin/7896-markdown-navigator-enhanced
.idea/**/markdown-navigator.xml
.idea/**/markdown-navigator-enh.xml
.idea/**/markdown-navigator/

# Cache file creation bug
# See https://youtrack.jetbrains.com/issue/JBR-2257
.idea/$CACHE_FILE$

# CodeStream plugin
# https://plugins.jetbrains.com/plugin/12206-codestream
.idea/codestream.xml
*.iml

### LaTeX ###
## Core latex/pdflatex auxiliary files:
11 changes: 0 additions & 11 deletions .idea/.gitignore

This file was deleted.

1 change: 0 additions & 1 deletion .idea/.name

This file was deleted.

5 changes: 0 additions & 5 deletions .idea/codeStyles/codeStyleConfig.xml

This file was deleted.

16 changes: 0 additions & 16 deletions .idea/inspectionProfiles/Project_Default.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/misc.xml

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/modules.xml

This file was deleted.

12 changes: 0 additions & 12 deletions .idea/php.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/vcs.xml

This file was deleted.

16 changes: 8 additions & 8 deletions README.md
@@ -59,10 +59,10 @@ If you want to learn more about each step here are some more detailed guides:
 
 Let's start with a small example and construct a query log for the [ChatNoir](https://chatnoir.eu) search engine:
-1. `python -m web_archive_query_log make archived-urls chatnoir`
-2. `python -m web_archive_query_log make archived-query-urls chatnoir`
-3. `python -m web_archive_query_log make archived-raw-serps chatnoir`
-4. `python -m web_archive_query_log make archived-parsed-serps chatnoir`
+1. `python -m archive_query_log make archived-urls chatnoir`
+2. `python -m archive_query_log make archived-query-urls chatnoir`
+3. `python -m archive_query_log make archived-raw-serps chatnoir`
+4. `python -m archive_query_log make archived-parsed-serps chatnoir`
 Got the idea? Now you're ready to scrape your own query logs! To scale things up and understand the data, just keep on reading. For more details on how to add more search providers, see [below](#contribute).
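
Since each of the four quick-start steps is a plain CLI invocation, the pipeline can be scripted end to end. A minimal sketch in Python (a hypothetical wrapper, assuming the renamed `archive_query_log` package is installed):

```python
import subprocess

# The four pipeline steps, in the order the quick-start example runs them.
STEPS = [
    "archived-urls",
    "archived-query-urls",
    "archived-raw-serps",
    "archived-parsed-serps",
]

def build_query_log(provider: str) -> None:
    """Run all four steps for one search provider, stopping on failure."""
    for step in STEPS:
        # Each step consumes the previous step's output from the data directory.
        subprocess.run(
            ["python", "-m", "archive_query_log", "make", step, provider],
            check=True,
        )

if __name__ == "__main__":
    build_query_log("chatnoir")
```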

@@ -123,7 +123,7 @@ Fetch all archived URLs for a search provider from the Internet Archive's Wayback Machine.
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to fetch archived URLs from:
 ```shell
-python -m web_archive_query_log make archived-urls <PROVIDER>
+python -m archive_query_log make archived-urls <PROVIDER>
 ```
 This will create multiple files in the `archived-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched:
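
The `<CDXPAGE>` in the file names corresponds to one page of results from the Wayback Machine's CDX API. A rough sketch of such a paged CDX request (the exact parameters and filters used by `archive_query_log` are assumptions here):

```python
import requests

# One paged request against the Wayback Machine's CDX API.
response = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "chatnoir.eu",  # the provider's domain (<DOMAIN>)
        "matchType": "domain",
        "output": "json",
        "page": 0,             # one CDX page (<CDXPAGE>) per request
    },
    timeout=60,
)
response.raise_for_status()
rows = response.json()
print(rows[0])    # field names: urlkey, timestamp, original, ...
print(rows[1:4])  # the first few captures
```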
@@ -154,7 +154,7 @@ Parse and filter archived URLs that contain a query and may point to a search engine result page.
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to parse query URLs from:
 ```shell
-python -m web_archive_query_log make archived-query-urls <PROVIDER>
+python -m archive_query_log make archived-query-urls <PROVIDER>
 ```
 This will create multiple files in the `archived-query-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched:
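
A simplified sketch of the underlying idea, keeping only URLs whose query string carries a search query (assumption: the actual implementation matches provider-specific URL patterns rather than a fixed `q` parameter):

```python
from urllib.parse import parse_qs, urlsplit

# Extract a search query from an archived URL, or None if there is none.
def extract_query(url: str, param: str = "q") -> str | None:
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None

print(extract_query("https://www.example-search.com/search?q=web+archives"))
# -> "web archives"; URLs that yield None would be filtered out.
```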
@@ -191,7 +191,7 @@ Download the raw HTML content of archived search engine result pages (SERPs).
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to download raw SERP HTML contents from:
 ```shell
-python -m web_archive_query_log make archived-raw-serps <PROVIDER>
+python -m archive_query_log make archived-raw-serps <PROVIDER>
 ```
 This will create multiple files in the `archived-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched. Archived raw SERPs are stored as 1GB-sized WARC chunk files, that is, WARC chunks are "filled" sequentially up to a size of 1GB each. If a chunk is full, a new chunk is created.
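
The resulting chunks are standard WARC files, so they can be read back with any WARC reader. A minimal sketch using the `warcio` library (the library choice and the `*.warc.gz` suffix are assumptions, not taken from this commit):

```python
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator

def iter_serp_urls(warc_dir: Path):
    """Yield the archived URL of every response record in a directory of WARC chunks."""
    for chunk in sorted(warc_dir.glob("*.warc.gz")):
        with chunk.open("rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":
                    yield record.rec_headers.get_header("WARC-Target-URI")
```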
@@ -228,7 +228,7 @@ Parse and filter archived SERPs from raw contents.
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to parse SERPs from:
 ```shell
-python -m web_archive_query_log make archived-parsed-serps <PROVIDER>
+python -m archive_query_log make archived-parsed-serps <PROVIDER>
 ```
 This will create multiple files in the `archived-serps` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched:
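A simplified sketch of the parsing idea behind this step (the CSS selector is purely illustrative; the real parsers are provider-specific):

```python
from bs4 import BeautifulSoup

def parse_results(html: str) -> list[tuple[str, str]]:
    """Extract (title, URL) pairs from a raw SERP's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (link.get_text(strip=True), link["href"])
        for link in soup.select("a.result[href]")  # illustrative selector only
    ]
```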
7 changes: 0 additions & 7 deletions archive_query_log/results/test/test_qq_serp_parsing.py
@@ -94,13 +94,6 @@ def test_parse_query_qq_xing_xing_di_qiu_2_1582812539():
     )
 
 
-def test_parse_query_qq_jin_cheng_wu_1319745059():
-    verify_serp_parsing(
-        "https://web.archive.org/web/20111027195059id_/http://v.qq.com/search.html?pagetype=3&ms_key=%E9%87%91%E5%9F%8E%E6%AD%A6",
-        "qq",
-    )
-
-
 def test_parse_query_qq_turn_that_finger_around_1324266860():
     verify_serp_parsing(
         "https://web.archive.org/web/20111219035420id_/http://v.qq.com:80/search.html?pagetype=3&ms_key=Turn+That+Finger+Around",
2 changes: 1 addition & 1 deletion cli
@@ -1,3 +1,3 @@
 #!/bin/bash -e
 
-pipenv run python -m web_archive_query_log "$@"
+pipenv run python -m archive_query_log "$@"
