Bug Report: Discrepancies with Parsing Robots.txt
Contact: Elisa Luo ([email protected]), on behalf of the authors of Somesite I Used To Crawl.
Summary
We have also been interested in the ability of content creators to control whether their content can be crawled by AI crawlers, and as a result we read your Consent in Crisis paper with great interest. In performing our experiments with parsing robots.txt files in various data sets, we believe that we identified a few scenarios in which the parser (`parse_robots.py`) used in the Crisis paper did not parse robots.txt correctly according to Google’s interpretation of the RFC specification. We do not think that these errors affect the conclusions of the paper, but they appear to lead to over-reporting of the `None Disallowed` and under-reporting of the `All Disallowed` categories in Table 6 by 10-30%. As a result, we wanted to inform you of what we found in case it was useful.

Description
According to the text (Table 2) in the paper, the parser follows “Google’s crawler rules”. However, after examining the parser implementation, our understanding is that the parser does not comply with the following aspects of Google’s robots.txt rules, which are Google’s interpretation of RFC 9309.
Reproducing the Issues
The robots.txt examples below are derived from the examples in Google’s robots.txt specification. For each example, we list two outputs: the expected output and the output actually produced by running the function `analyze_robots` in `parse_robots.py`.

1. Grouping of Lines and Rules
The parser fails to recognize directives when there are repeated user-agent lines.
Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#grouping-of-lines-and-rules
Example 1:
Expected Output:
{'Google-Extended': 'all', 'anthropic-ai': 'all', 'GPTBot': 'all', '*All Agents*': 'some'}
Actual output:
{'Google-Extended': 'none', 'anthropic-ai': 'none', 'GPTBot': 'all', '*All Agents*': 'some'}
Example 2:
Expected output:
{'GPTBot': 'all', 'CCBot': 'all', '*All Agents*': 'some'}
Actual output:
{'GPTBot': 'none', 'CCBot': 'all', '*All Agents*': 'some'}
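The example robots.txt files themselves are not reproduced in this report. A file of roughly the following shape, which is our own reconstruction consistent with the Example 1 outputs above rather than the original test input, exercises the grouping rule that consecutive `user-agent` lines share the group of rules that follows them:

```
# Hypothetical reconstruction for Example 1 (not the original test file).
# All three agents share the single "Disallow: /" group below.
User-agent: Google-Extended
User-agent: anthropic-ai
User-agent: GPTBot
Disallow: /

# A separate catch-all group with a partial restriction, which would
# account for '*All Agents*': 'some'.
User-agent: *
Disallow: /private/
```

A parser that attaches `Disallow: /` only to the nearest preceding `user-agent` line reports GPTBot as fully disallowed but Google-Extended and anthropic-ai as unrestricted, which matches the actual output above; Example 2 follows the same pattern with GPTBot and CCBot.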
2. User-agent Line Syntax
The parser does not consider the `user-agent:` line to be case-insensitive.

Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#user-agent
Example 1:
`parse_robots_txt` returns 'Unmatched line: user-agent: GPTBot', 'Unmatched line: Disallow: /'
Expected output:
{'GPTBot': 'all', '*All Agents*': 'none'}
Actual output:
{'*All Agents*': 'none'}
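Judging from the unmatched-line messages, the input for this example had roughly the following shape (reconstructed here for convenience, not copied from the original example file):

```
# Lower-cased directive name; per Google's spec, directive names are
# case-insensitive, so this group should still apply to GPTBot.
user-agent: GPTBot
Disallow: /
```

Because the lower-cased `user-agent:` line is not matched, the entire group is dropped and GPTBot disappears from the parser’s output.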
Example 2:
Googlebot
andGoogleBot
to be separate user-agentsExpected output:
{'Googlebot': 'some', '*All Agents*': 'some'}
Actual output:
{'Googlebot': 'all', 'GoogleBot': 'none', '*All Agents*': 'some'}
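One file shape consistent with the outputs above (again our own reconstruction, not the original example): since user-agent matching is case-insensitive, the `Googlebot` and `GoogleBot` groups should be merged into a single set of rules.

```
# Hypothetical reconstruction consistent with the reported outputs.
# "Googlebot" and "GoogleBot" name the same crawler, so their rules
# should be combined (full disallow plus one allowed prefix -> 'some').
User-agent: Googlebot
Disallow: /

User-agent: GoogleBot
Allow: /public/

User-agent: *
Disallow: /private/
```

Treating the two spellings as distinct agents instead yields 'all' for `Googlebot` and 'none' for `GoogleBot`, as in the actual output.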
3. Comment Syntax
The parser does not correctly ignore comments, causing directives that appear below a comment line to be ignored.
Reference: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#syntax
Example:
The parser only recognizes `Allow`/`Disallow` directives that start directly below the `User-agent` line, so directives preceded by a comment are ignored.
Expected output:
{'GPTBot': 'all', '*All Agents*': 'none'}
Actual output:
{'GPTBot': 'none', '*All Agents*': 'none'}
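A minimal input that reproduces this behavior (a reconstruction on our part, not the original example file) places a comment between the `User-agent` line and its rule:

```
# Hypothetical reconstruction for the comment-syntax example.
User-agent: GPTBot
# block the GPT crawler from the entire site
Disallow: /
```

Per Google’s spec, the comment line should simply be ignored and `Disallow: /` should still apply to GPTBot (expected 'all'); a parser that requires the rule to immediately follow the `User-agent` line drops it instead (actual 'none').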
Additional Cases
- Similar to (2), the parser does not appear to consider the `allow` and `disallow` directives to be case-insensitive. While the values of these lines are case-sensitive, the directive names themselves are not.
- Similar to (3), the parser does not appear to handle cases where there is a blank line between the `User-agent:` line and the `Disallow`/`Allow` directive (see the snippet below for an illustration of both cases).
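As a quick illustration of both additional cases (hypothetical snippets, not taken from any data set):

```
# Lower-cased directive names; "disallow" should be treated the same as "Disallow".
User-agent: GPTBot
disallow: /

# A blank line between the user-agent line and its rule should not break the group.
User-agent: CCBot

Disallow: /
```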
Suggested Fixes
Since Google open-sources their robots.txt parser, one option is to adopt the Google parser (e.g., by putting a wrapper around it). Alternatively, if the authors would like to maintain their own parser, another option is to compare its results with Google’s parser to ensure consistency. Google’s robots.txt specification also provides a comprehensive list of example robots.txt files, which could be useful for extending the unit tests associated with the paper.
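As a rough illustration of the wrapper option, here is a minimal sketch (our own, not part of the paper’s code) that shells out to the `robots_main` binary built from Google’s open-source parser at https://github.com/google/robotstxt. The binary path and the exact output format are assumptions that should be verified against the version you build:

```python
"""Minimal sketch of cross-checking a custom parser against Google's
open-source robots.txt parser (https://github.com/google/robotstxt).

Assumptions (not from the paper's code): the `robots_main` binary has been
built from that repository, and it prints a line containing ALLOWED or
DISALLOWED for a (robots.txt file, user-agent, URL) triple, as in the
project README.
"""
import subprocess
import tempfile

ROBOTS_MAIN = "./robots_main"  # assumed path to the binary built with bazel


def google_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if Google's reference parser allows `user_agent` to fetch `url`."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
        f.write(robots_txt)
        f.flush()
        result = subprocess.run(
            [ROBOTS_MAIN, f.name, user_agent, url],
            capture_output=True,
            text=True,
        )
    # Key off "DISALLOWED" only; the exact wording of the output line may
    # differ between versions of the tool, and we do not rely on the exit code.
    return "DISALLOWED" not in result.stdout


if __name__ == "__main__":
    # Consecutive user-agent lines form a single group, so both agents
    # should be fully disallowed here.
    example = "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n"
    print(google_allows(example, "GPTBot", "https://example.com/page"))  # expect False
    print(google_allows(example, "CCBot", "https://example.com/page"))   # expect False
```

Running the paper’s parser and this wrapper over the same corpus of robots.txt files and diffing the per-agent verdicts would surface discrepancies like the ones above.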
Please let us know if you have any additional questions or need further details on any of the cases mentioned in our bug report.