
Discrepancies with Parsing Robots.txt #203

Closed
elisa-luo opened this issue Dec 3, 2024 · 1 comment
@elisa-luo
Bug Report: Discrepancies with Parsing Robots.txt

Contact: Elisa Luo ([email protected]), on behalf of the authors of Somesite I Used To Crawl.

Summary

We have also been interested in the ability of content creators to control whether their content can be crawled by AI crawlers, and as a result we read your Consent in Crisis paper with great interest. While performing our own experiments on parsing the robots.txt files in various data sets, we believe we identified a few scenarios in which the parser (parse_robots.py) used in the Consent in Crisis paper does not parse robots.txt correctly according to Google’s interpretation of the RFC specification. We do not think these errors affect the conclusions of the paper, but they appear to lead to over-reporting of the None Disallowed category and under-reporting of the All Disallowed category in Table 6 by 10-30%. We wanted to inform you of what we found in case it is useful.

Description

According to the text (Table 2) in the paper, the parser follows “Google’s crawler rules”. However, after examining the parser implementation, our understanding is that it does not comply with the following aspects of Google’s robots.txt rules, which are Google’s interpretation of RFC 9309:

  1. Grouping of Lines and Rules
  2. User-agent Line Syntax
  3. Comment Syntax

Reproducing the Issues

The robots.txt examples below are derived from the examples in Google’s robots.txt specification. For each case, we list two outputs: the expected output and the output produced by running the function analyze_robots in parse_robots.py.

1. Grouping of Lines and Rules

The parser fails to associate directives with all listed crawlers when a group contains repeated user-agent lines (a spec-compliant grouping sketch follows the two examples below).

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#grouping-of-lines-and-rules

Example 1:

User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: GPTBot
Disallow: /
  • This is specified in Section 2.1 of the REP.
  • Issue: the parser only matches the Disallow rule to the last user-agent when there are multiple stacked user-agents.

Expected Output:
{'Google-Extended': 'all', 'anthropic-ai': 'all', 'GPTBot': 'all', '*All Agents*': 'some'}

Actual output:
{'Google-Extended': 'none', 'anthropic-ai': 'none', 'GPTBot': 'all', '*All Agents*': 'some'}

Example 2:

User-agent: GPTBot
Crawl-delay: 5

User-agent: CCBot
Disallow: /
  • Issue: the parser should ignore rules other than allow, disallow, and user-agent, so GPTBot and CCBot should be treated as one group.

Expected output:
{'GPTBot': 'all', 'CCBot': 'all', '*All Agents*': 'some'}

Actual output:
{'GPTBot': 'none', 'CCBot': 'all', '*All Agents*': 'some'}
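
For reference, here is a minimal sketch of the grouping behaviour described above (our illustration, not the parser from parse_robots.py): consecutive user-agent lines share one rule group, and any line other than user-agent, allow, or disallow is ignored without closing the group.

```python
# Minimal sketch of RFC 9309 / Google-style group building (illustration only,
# not the parser from parse_robots.py): consecutive User-agent lines share one
# rule group, and lines other than user-agent/allow/disallow are ignored
# without closing the group.
def build_groups(robots_txt: str):
    groups = []                # list of (agent set, [(directive, path), ...])
    agents, rules = set(), []
    collecting_agents = False  # True while reading consecutive user-agent lines

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue                          # blank lines do not close a group
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()

        if key == "user-agent":
            if not collecting_agents and agents:
                groups.append((agents, rules))  # previous group is complete
                agents, rules = set(), []
            agents.add(value.lower())
            collecting_agents = True
        elif key in ("allow", "disallow"):
            if agents:
                rules.append((key, value))
            collecting_agents = False
        # every other directive (crawl-delay, sitemap, ...) is ignored here

    if agents:
        groups.append((agents, rules))
    return groups
```

Tracing Examples 1 and 2 through this sketch produces a single group in each case ({'anthropic-ai', 'google-extended', 'gptbot'} and {'ccbot', 'gptbot'}, respectively) carrying the Disallow: / rule, matching the expected outputs above.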

2. User-agent Line Syntax

The parser does not treat the user-agent: line as case-insensitive (a small normalization sketch follows the two examples below).

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#user-agent

Example 1:

user-agent: GPTBot
Disallow: /
  • Issue: the parser does not recognize the lower-cased user-agent directive. parse_robots_txt returns 'Unmatched line: user-agent: GPTBot', 'Unmatched line: Disallow: /'

Expected output:
{'GPTBot': 'all', '*All Agents*': 'none'}

Actual output:
{'*All Agents*': 'none'}

Example 2:

User-agent: Googlebot
Disallow: /

User-agent: GoogleBot
Allow: /stats
  • Issue: the parser considers Googlebot and GoogleBot to be separate user-agents.

Expected output:
{'Googlebot': 'some', '*All Agents*': 'some'}

Actual output:
{'Googlebot': 'all', 'GoogleBot': 'none', '*All Agents*': 'some'}
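
A minimal sketch of the normalization this implies (again an illustration, not the paper's parser): directive names and user-agent tokens are compared case-insensitively, while path values keep their case.

```python
# Illustration of case-insensitive matching: directive names and user-agent
# tokens compare case-insensitively, while path values keep their case.
def parse_line(raw: str):
    """Split a robots.txt line into (directive, value), lowercasing the directive name."""
    line = raw.strip()
    if ":" not in line:
        return None
    key, _, value = line.partition(":")
    return key.strip().lower(), value.strip()

def same_agent(token_a: str, token_b: str) -> bool:
    """User-agent tokens such as 'Googlebot' and 'GoogleBot' name the same crawler."""
    return token_a.strip().lower() == token_b.strip().lower()

assert parse_line("user-agent: GPTBot") == ("user-agent", "GPTBot")
assert parse_line("DISALLOW: /Private") == ("disallow", "/Private")  # path case preserved
assert same_agent("Googlebot", "GoogleBot")
```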

3. Comment Syntax

The parser does not correctly ignore comments, so directives that appear below a comment line are dropped (a short comment-stripping sketch follows the example below).

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax

Example

User-agent: GPTBot
# Block GPTBot
Disallow: /
  • Issue: The parser only considers Allow/Disallow directives that start directly below the User-agent line.

Expected output:
{'GPTBot': 'all', '*All Agents*': 'none'}

Actual output:
{'GPTBot': 'none', '*All Agents*': 'none'}
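
As a sketch of the comment handling described above (illustration only): anything from '#' to the end of the line is dropped before parsing, and lines that become empty are skipped rather than ending the group.

```python
# Drop '#' comments before parsing; lines that become empty are simply skipped.
def strip_comment(raw: str) -> str:
    return raw.split("#", 1)[0].strip()

lines = ["User-agent: GPTBot", "# Block GPTBot", "Disallow: /"]
effective = [line for line in (strip_comment(l) for l in lines) if line]
print(effective)  # ['User-agent: GPTBot', 'Disallow: /']
```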

Additional Cases

  • Similar to (2), the parser does not appear to treat the allow and disallow directives as case-insensitive. While it is true that the values of these lines are case-sensitive, the directive names themselves are not.
  • Similar to (3), the parser does not appear to handle cases where there is a blank line between the User-agent: line and the Disallow/Allow directive.

Suggested Fixes

Since Google open-sources its robots.txt parser, one option is to adopt the Google parser (e.g., by putting a wrapper around it). Alternatively, if the authors would like to maintain their own parser, another option is to compare its results with Google’s parser to ensure consistency. Google’s robots.txt specification also provides a comprehensive list of example robots.txt files, which could be useful for extending the unit tests associated with the paper.
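
For instance, a test harness along the following lines could encode the examples from this report. This is only a sketch: it assumes analyze_robots accepts the raw robots.txt text and returns the agent → status dicts shown above; the actual interface in parse_robots.py may differ.

```python
# Hypothetical pytest harness (sketch): assumes analyze_robots takes the raw
# robots.txt text and returns the agent -> status dict shown in this report.
import pytest

from parse_robots import analyze_robots

CASES = [
    (  # stacked user-agent lines (Example 1 of issue 1)
        "User-agent: anthropic-ai\n"
        "User-agent: Google-Extended\n"
        "User-agent: GPTBot\n"
        "Disallow: /\n",
        {"Google-Extended": "all", "anthropic-ai": "all",
         "GPTBot": "all", "*All Agents*": "some"},
    ),
    (  # lower-cased user-agent directive (Example 1 of issue 2)
        "user-agent: GPTBot\n"
        "Disallow: /\n",
        {"GPTBot": "all", "*All Agents*": "none"},
    ),
]

@pytest.mark.parametrize("robots_txt, expected", CASES)
def test_matches_google_interpretation(robots_txt, expected):
    assert analyze_robots(robots_txt) == expected
```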

Please let us know if you have any additional questions or need additional details on any of the cases mentioned in our bug report.

@arielnlee (Collaborator)

Thank you so much for bringing this to our attention! We've updated our code accordingly and will be putting out an updated paper soon.
