Commit
Merge branch 'main' into serp-tests
janheinrichmerker committed Nov 1, 2023
2 parents 9d8b4e2 + 3367559 commit f0714e0
Showing 20 changed files with 1,261 additions and 196 deletions.
1 change: 0 additions & 1 deletion .github/workflows/ci.yml
@@ -70,7 +70,6 @@ jobs:
       - name: "📤 Upload coverage to Codecov"
         uses: codecov/codecov-action@v2
         with:
-          fail_ci_if_error: true
           token: ${{ secrets.CODECOV_TOKEN }}
   release:
     name: "🚀 Create GitHub release"
109 changes: 2 additions & 107 deletions .gitignore
@@ -1,112 +1,7 @@
### JetBrains ###
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio, WebStorm and Rider
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/usage.statistics.xml
.idea/**/dictionaries
.idea/**/shelf

# AWS User-specific
.idea/**/aws.xml

# Generated files
.idea/**/contentModel.xml
*.iml

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml
.idea/**/dbnavigator.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# Gradle and Maven with auto-import
# When using Gradle or Maven with auto-import, you should exclude module files,
# since they will be recreated, and may cause churn. Uncomment if using
# auto-import.
# .idea/artifacts
# .idea/compiler.xml
# .idea/jarRepositories.xml
# .idea/modules.xml
# .idea/*.iml
# .idea/modules
# *.iml
# *.ipr

# CMake
cmake-build-*/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
.idea
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# SonarLint plugin
.idea/sonarlint/

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests

# Android studio 3.1+ serialized cache file
.idea/caches/build_file_checksums.ser

### JetBrains Patch ###
# Comment Reason: https://github.com/joeblau/gitignore.io/issues/186#issuecomment-215987721

# *.iml
# modules.xml
# .idea/misc.xml
# *.ipr

# Sonarlint plugin
# https://plugins.jetbrains.com/plugin/7973-sonarlint
.idea/**/sonarlint/

# SonarQube Plugin
# https://plugins.jetbrains.com/plugin/7238-sonarqube-community-plugin
.idea/**/sonarIssues.xml

# Markdown Navigator plugin
# https://plugins.jetbrains.com/plugin/7896-markdown-navigator-enhanced
.idea/**/markdown-navigator.xml
.idea/**/markdown-navigator-enh.xml
.idea/**/markdown-navigator/

# Cache file creation bug
# See https://youtrack.jetbrains.com/issue/JBR-2257
.idea/$CACHE_FILE$

# CodeStream plugin
# https://plugins.jetbrains.com/plugin/12206-codestream
.idea/codestream.xml
*.iml

### LaTeX ###
## Core latex/pdflatex auxiliary files:
11 changes: 0 additions & 11 deletions .idea/.gitignore

This file was deleted.

1 change: 0 additions & 1 deletion .idea/.name

This file was deleted.

5 changes: 0 additions & 5 deletions .idea/codeStyles/codeStyleConfig.xml

This file was deleted.

16 changes: 0 additions & 16 deletions .idea/inspectionProfiles/Project_Default.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/misc.xml

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/modules.xml

This file was deleted.

12 changes: 0 additions & 12 deletions .idea/php.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/vcs.xml

This file was deleted.

16 changes: 8 additions & 8 deletions README.md
@@ -59,10 +59,10 @@ If you want to learn more about each step here are some more detailed guides:
 
 Let's start with a small example and construct a query log for the [ChatNoir](https://chatnoir.eu) search engine:
-1. `python -m web_archive_query_log make archived-urls chatnoir`
-2. `python -m web_archive_query_log make archived-query-urls chatnoir`
-3. `python -m web_archive_query_log make archived-raw-serps chatnoir`
-4. `python -m web_archive_query_log make archived-parsed-serps chatnoir`
+1. `python -m archive_query_log make archived-urls chatnoir`
+2. `python -m archive_query_log make archived-query-urls chatnoir`
+3. `python -m archive_query_log make archived-raw-serps chatnoir`
+4. `python -m archive_query_log make archived-parsed-serps chatnoir`
 Got the idea? Now you're ready to scrape your own query logs! To scale things up and understand the data, just keep on reading. For more details on how to add more search providers, see [below](#contribute).
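
Since each of the four quick-start steps is a plain CLI invocation, the pipeline can be scripted end to end. A minimal sketch in Python (a hypothetical wrapper, assuming the renamed `archive_query_log` package is installed):

```python
import subprocess

# The four pipeline steps, in the order the quick-start example runs them.
STEPS = [
    "archived-urls",
    "archived-query-urls",
    "archived-raw-serps",
    "archived-parsed-serps",
]

def build_query_log(provider: str) -> None:
    """Run all four steps for one search provider, stopping on failure."""
    for step in STEPS:
        # Each step consumes the previous step's output from the data directory.
        subprocess.run(
            ["python", "-m", "archive_query_log", "make", step, provider],
            check=True,
        )

if __name__ == "__main__":
    build_query_log("chatnoir")
```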

@@ -123,7 +123,7 @@ Fetch all archived URLs for a search provider from the Internet Archive's Wayback Machine.
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to fetch archived URLs from:
 ```shell
-python -m web_archive_query_log make archived-urls <PROVIDER>
+python -m archive_query_log make archived-urls <PROVIDER>
 ```
 This will create multiple files in the `archived-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched:
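
The `<CDXPAGE>` in the file names corresponds to one page of results from the Wayback Machine's CDX API. A rough sketch of such a paged CDX request (the exact parameters and filters used by `archive_query_log` are assumptions here):

```python
import requests

# One paged request against the Wayback Machine's CDX API.
response = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "chatnoir.eu",  # the provider's domain (<DOMAIN>)
        "matchType": "domain",
        "output": "json",
        "page": 0,             # one CDX page (<CDXPAGE>) per request
    },
    timeout=60,
)
response.raise_for_status()
rows = response.json()
print(rows[0])    # field names: urlkey, timestamp, original, ...
print(rows[1:4])  # the first few captures
```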
@@ -154,7 +154,7 @@ Parse and filter archived URLs that contain a query and may point to a search engine result page.
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to parse query URLs from:
 ```shell
-python -m web_archive_query_log make archived-query-urls <PROVIDER>
+python -m archive_query_log make archived-query-urls <PROVIDER>
 ```
 This will create multiple files in the `archived-query-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched:
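
A simplified sketch of the underlying idea, keeping only URLs whose query string carries a search query (assumption: the actual implementation matches provider-specific URL patterns rather than a fixed `q` parameter):

```python
from urllib.parse import parse_qs, urlsplit

# Extract a search query from an archived URL, or None if there is none.
def extract_query(url: str, param: str = "q") -> str | None:
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None

print(extract_query("https://www.example-search.com/search?q=web+archives"))
# -> "web archives"; URLs that yield None would be filtered out.
```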
@@ -191,7 +191,7 @@ Download the raw HTML content of archived search engine result pages (SERPs).
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to download raw SERP HTML contents from:
 ```shell
-python -m web_archive_query_log make archived-raw-serps <PROVIDER>
+python -m archive_query_log make archived-raw-serps <PROVIDER>
 ```
 This will create multiple files in the `archived-urls` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched. Archived raw SERPs are stored as 1GB-sized WARC chunk files, that is, WARC chunks are "filled" sequentially up to a size of 1GB each. If a chunk is full, a new chunk is created.
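
The resulting chunks are standard WARC files, so they can be read back with any WARC reader. A minimal sketch using the `warcio` library (the library choice and the `*.warc.gz` suffix are assumptions, not taken from this commit):

```python
from pathlib import Path

from warcio.archiveiterator import ArchiveIterator

def iter_serp_urls(warc_dir: Path):
    """Yield the archived URL of every response record in a directory of WARC chunks."""
    for chunk in sorted(warc_dir.glob("*.warc.gz")):
        with chunk.open("rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":
                    yield record.rec_headers.get_header("WARC-Target-URI")
```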
@@ -228,7 +228,7 @@ Parse and filter archived SERPs from raw contents.
 You can run this step with the following command line, where `<PROVIDER>` is the name of the search provider you want to parse SERPs from:
 ```shell
-python -m web_archive_query_log make archived-parsed-serps <PROVIDER>
+python -m archive_query_log make archived-parsed-serps <PROVIDER>
 ```
 This will create multiple files in the `archived-serps` subdirectory under the [data directory](#pro-tip--specify-a-custom-data-directory), based on the search provider's name (`<PROVIDER>`), domain (`<DOMAIN>`), and the Wayback Machine's CDX [page number][cdx-pagination] (`<CDXPAGE>`) from which the URLs were originally fetched:
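A simplified sketch of the parsing idea behind this step (the CSS selector is purely illustrative; the real parsers are provider-specific):

```python
from bs4 import BeautifulSoup

def parse_results(html: str) -> list[tuple[str, str]]:
    """Extract (title, URL) pairs from a raw SERP's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (link.get_text(strip=True), link["href"])
        for link in soup.select("a.result[href]")  # illustrative selector only
    ]
```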
7 changes: 0 additions & 7 deletions archive_query_log/results/test/test_qq_serp_parsing.py
@@ -94,13 +94,6 @@ def test_parse_query_qq_xing_xing_di_qiu_2_1582812539():
     )
 
 
-def test_parse_query_qq_jin_cheng_wu_1319745059():
-    verify_serp_parsing(
-        "https://web.archive.org/web/20111027195059id_/http://v.qq.com/search.html?pagetype=3&ms_key=%E9%87%91%E5%9F%8E%E6%AD%A6",
-        "qq",
-    )
-
-
 def test_parse_query_qq_turn_that_finger_around_1324266860():
     verify_serp_parsing(
         "https://web.archive.org/web/20111219035420id_/http://v.qq.com:80/search.html?pagetype=3&ms_key=Turn+That+Finger+Around",
2 changes: 1 addition & 1 deletion cli
@@ -1,3 +1,3 @@
 #!/bin/bash -e
 
-pipenv run python -m web_archive_query_log "$@"
+pipenv run python -m archive_query_log "$@"
