
Discrepancies with Parsing Robots.txt #203

Closed
elisa-luo opened this issue Dec 3, 2024 · 1 comment
@elisa-luo
Bug Report: Discrepancies with Parsing Robots.txt

Contact: Elisa Luo ([email protected]), on behalf of the authors of Somesite I Used To Crawl.

Summary

We have also been interested in the ability of content creators to control whether their content can be crawled by AI crawlers, and as a result we read your Consent in Crisis paper with great interest. While performing our own experiments on parsing the robots.txt files in various data sets, we believe we identified a few scenarios in which the parser (parse_robots.py) used in the Consent in Crisis paper does not parse robots.txt correctly according to Google’s interpretation of the RFC specification. We do not think these errors affect the conclusions of the paper, but they appear to lead to over-reporting of the None Disallowed category and under-reporting of the All Disallowed category in Table 6 by 10-30%. We wanted to inform you of what we found in case it is useful.

Description

According to the text (Table 2) in the paper, the parser follows “Google’s crawler rules”. However, after examining the parser implementation, our understanding is that it does not comply with the following aspects of Google’s robots.txt rules, which are Google’s interpretation of RFC 9309:

  1. Grouping of Lines and Rules
  2. User-agent Line Syntax
  3. Comment Syntax

Reproducing the Issues

The robots.txt examples below are derived from the examples in Google’s robots.txt specification. For each case, we list two outputs: the expected output and the output produced by running the function analyze_robots in parse_robots.py.

1. Grouping of Lines and Rules

The parser fails to associate directives with all listed crawlers when a group contains repeated user-agent lines (a spec-compliant grouping sketch follows the two examples below).

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#grouping-of-lines-and-rules

Example 1:

User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: GPTBot
Disallow: /
  • This is specified in Section 2.1 of the REP.
  • Issue: the parser only matches the Disallow rule to the last user-agent when there are multiple stacked user-agents.

Expected Output:
{'Google-Extended': 'all', 'anthropic-ai': 'all', 'GPTBot': 'all', '*All Agents*': 'some'}

Actual output:
{'Google-Extended': 'none', 'anthropic-ai': 'none', 'GPTBot': 'all', '*All Agents*': 'some'}

Example 2:

User-agent: GPTBot
Crawl-delay: 5

User-agent: CCBot
Disallow: /
  • Issue: the parser should ignore rules other than allow, disallow, and user-agent, so GPTBot and CCBot should be treated as one group.

Expected output:
{'GPTBot': 'all', 'CCBot': 'all', '*All Agents*': 'some'}

Actual output:
{'GPTBot': 'none', 'CCBot': 'all', '*All Agents*': 'some'}
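
For reference, here is a minimal sketch of the grouping behaviour described above (our illustration, not the parser from parse_robots.py): consecutive user-agent lines share one rule group, and any line other than user-agent, allow, or disallow is ignored without closing the group.

```python
# Minimal sketch of RFC 9309 / Google-style group building (illustration only,
# not the parser from parse_robots.py): consecutive User-agent lines share one
# rule group, and lines other than user-agent/allow/disallow are ignored
# without closing the group.
def build_groups(robots_txt: str):
    groups = []                # list of (agent set, [(directive, path), ...])
    agents, rules = set(), []
    collecting_agents = False  # True while reading consecutive user-agent lines

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue                          # blank lines do not close a group
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()

        if key == "user-agent":
            if not collecting_agents and agents:
                groups.append((agents, rules))  # previous group is complete
                agents, rules = set(), []
            agents.add(value.lower())
            collecting_agents = True
        elif key in ("allow", "disallow"):
            if agents:
                rules.append((key, value))
            collecting_agents = False
        # every other directive (crawl-delay, sitemap, ...) is ignored here

    if agents:
        groups.append((agents, rules))
    return groups
```

Tracing Examples 1 and 2 through this sketch produces a single group in each case ({'anthropic-ai', 'google-extended', 'gptbot'} and {'ccbot', 'gptbot'}, respectively) carrying the Disallow: / rule, matching the expected outputs above.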

2. User-agent Line Syntax

The parser does not treat the user-agent: line as case-insensitive (a small normalization sketch follows the two examples below).

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#user-agent

Example 1:

user-agent: GPTBot
Disallow: /
  • Issue: the parser does not recognize the lower-cased user-agent directive. parse_robots_txt returns 'Unmatched line: user-agent: GPTBot', 'Unmatched line: Disallow: /'

Expected output:
{'GPTBot': 'all', '*All Agents*': 'none'}

Actual output:
{'*All Agents*': 'none'}

Example 2:

User-agent: Googlebot
Disallow: /

User-agent: GoogleBot
Allow: /stats
  • Issue: the parser considers Googlebot and GoogleBot to be separate user-agents.

Expected output:
{'Googlebot': 'some', '*All Agents*': 'some'}

Actual output:
{'Googlebot': 'all', 'GoogleBot': 'none', '*All Agents*': 'some'}
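
A minimal sketch of the normalization this implies (again an illustration, not the paper's parser): directive names and user-agent tokens are compared case-insensitively, while path values keep their case.

```python
# Illustration of case-insensitive matching: directive names and user-agent
# tokens compare case-insensitively, while path values keep their case.
def parse_line(raw: str):
    """Split a robots.txt line into (directive, value), lowercasing the directive name."""
    line = raw.strip()
    if ":" not in line:
        return None
    key, _, value = line.partition(":")
    return key.strip().lower(), value.strip()

def same_agent(token_a: str, token_b: str) -> bool:
    """User-agent tokens such as 'Googlebot' and 'GoogleBot' name the same crawler."""
    return token_a.strip().lower() == token_b.strip().lower()

assert parse_line("user-agent: GPTBot") == ("user-agent", "GPTBot")
assert parse_line("DISALLOW: /Private") == ("disallow", "/Private")  # path case preserved
assert same_agent("Googlebot", "GoogleBot")
```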

3. Comment Syntax

The parser does not correctly ignore comments, so directives that appear below a comment line are dropped (a short comment-stripping sketch follows the example below).

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax

Example

User-agent: GPTBot
# Block GPTBot
Disallow: /
  • Issue: The parser only considers Allow/Disallow directives that start directly below the User-agent line.

Expected output:
{'GPTBot': 'all', '*All Agents*': 'none'}

Actual output:
{'GPTBot': 'none', '*All Agents*': 'none'}
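
As a sketch of the comment handling described above (illustration only): anything from '#' to the end of the line is dropped before parsing, and lines that become empty are skipped rather than ending the group.

```python
# Drop '#' comments before parsing; lines that become empty are simply skipped.
def strip_comment(raw: str) -> str:
    return raw.split("#", 1)[0].strip()

lines = ["User-agent: GPTBot", "# Block GPTBot", "Disallow: /"]
effective = [line for line in (strip_comment(l) for l in lines) if line]
print(effective)  # ['User-agent: GPTBot', 'Disallow: /']
```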

Additional Cases

  • Similar to (2), the parser does not appear to treat the allow and disallow directives as case-insensitive. While it is true that the values of these lines are case-sensitive, the directive names themselves are not.
  • Similar to (3), the parser does not appear to handle cases where there is a blank line between the User-agent: line and the Disallow/Allow directive.

Suggested Fixes

Since Google open-sources its robots.txt parser, one option is to adopt the Google parser (e.g., by putting a wrapper around it). Alternatively, if the authors would like to maintain their own parser, another option is to compare its results with Google’s parser to ensure consistency. Google’s robots.txt specification also provides a comprehensive list of example robots.txt files, which could be useful for extending the unit tests associated with the paper.
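
For instance, a test harness along the following lines could encode the examples from this report. This is only a sketch: it assumes analyze_robots accepts the raw robots.txt text and returns the agent → status dicts shown above; the actual interface in parse_robots.py may differ.

```python
# Hypothetical pytest harness (sketch): assumes analyze_robots takes the raw
# robots.txt text and returns the agent -> status dict shown in this report.
import pytest

from parse_robots import analyze_robots

CASES = [
    (  # stacked user-agent lines (Example 1 of issue 1)
        "User-agent: anthropic-ai\n"
        "User-agent: Google-Extended\n"
        "User-agent: GPTBot\n"
        "Disallow: /\n",
        {"Google-Extended": "all", "anthropic-ai": "all",
         "GPTBot": "all", "*All Agents*": "some"},
    ),
    (  # lower-cased user-agent directive (Example 1 of issue 2)
        "user-agent: GPTBot\n"
        "Disallow: /\n",
        {"GPTBot": "all", "*All Agents*": "none"},
    ),
]

@pytest.mark.parametrize("robots_txt, expected", CASES)
def test_matches_google_interpretation(robots_txt, expected):
    assert analyze_robots(robots_txt) == expected
```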

Please let us know if you have any additional questions or need additional details on any of the cases mentioned in our bug report.

@arielnlee (Collaborator)

Thank you so much for bringing this to our attention! We've updated our code accordingly and will be putting out an updated paper soon.
