RFC: add LLM text version to rustdoc #3751
I prefer my oceans unboiled, thanks.
I'd like to add that the proposed format is not only useful to LLM models, but also much more readable than other formats to humans when you just need a synopsis of a library.
You can already view code with implementations collapsed in most IDEs, so as far as I'm concerned, this isn't adding anything new for users. The largest issue with this format is it's not very searchable and thus not useful to humans. rustdoc is designed to be as easy for humans to digest as possible, so, I would recommend bringing up any shortcomings of it as issues for the maintainers to tackle. To clarify, I am perfectly fine with new formats for rustdoc if they serve some actual use case. Feeding information into statistical models which cannot understand it and waste massive amounts of resources to do so is not a good use case. As I said, I prefer my oceans unboiled.
@clarfonthey thanks for the comments. Let me explain why I want to add this feature. I needed an LLM to use the oas3 crate to generate OpenAPI Specifications. However, due to its knowledge cutoff, the code the LLM generated was based on v0.4.0, while the latest version is v0.13.1. I wanted the LLM to learn from the latest documentation to generate up-to-date code. While rustdoc has an experimental JSON format, it contains a lot of unrelated information that's not useful for LLMs and is quite large: the JSON output for oas3 v0.13.1 is 5.5MB. That's why I believe we need a new text format that helps AI learn new versions quickly. I estimate the text version of oas3 v0.13.1 would be less than 1KB, containing just the essential public API information.
Here is a small part of the JSON format output from oas3 v0.13.1. The complete JSON file is 161,873 lines long (after formatting), which is totally unsuitable for LLM consumption.
{
"root": "0:0:2543",
"crate_version": "0.13.1",
"includes_private": false,
"index": {
"0:485": {
"id": "0:485",
"crate_id": 0,
"name": null,
"span": {
"filename": "crates/oas3/src/spec/components.rs",
"begin": [
16,
16
],
"end": [
16,
21
]
},
"visibility": "default",
"docs": null,
"links": {},
"attrs": [
"#[automatically_derived]"
],
"deprecation": null,
"inner": {
"impl": {
"is_unsafe": false,
"generics": {
"params": [],
"where_predicates": []
},
"provided_trait_methods": [
"clone_from"
],
"trait": {
"name": "Clone",
"id": "2:2906:114",
"args": {
"angle_bracketed": {
"args": [],
"constraints": []
}
}
},
"for": {
"resolved_path": {
"name": "Components",
"id": "0:482:2759",
"args": {
"angle_bracketed": {
"args": [],
"constraints": []
}
}
}
},
"items": [
"0:486:501"
],
"is_negative": false,
"is_synthetic": false,
"blanket_impl": null
}
}
},
"0:1694:731": {
"id": "0:1694:731",
"crate_id": 0,
"name": "eq",
"span": {
"filename": "crates/oas3/src/spec/license.rs",
"begin": [
11,
23
],
"end": [
11,
32
]
},
"visibility": "default",
"docs": null,
"links": {},
"attrs": [
"#[inline]"
],
"deprecation": null,
"inner": {
"function": {
"sig": {
"inputs": [
[
"self",
{
"borrowed_ref": {
"lifetime": null,
"is_mutable": false,
"type": {
"generic": "Self"
}
}
}
],
[
"other",
{
"borrowed_ref": {
"lifetime": null,
"is_mutable": false,
"type": {
"resolved_path": {
"name": "License",
"id": "0:1686:3087",
"args": {
"angle_bracketed": {
"args": [],
"constraints": []
}
}
}
}
}
}
]
],
"output": {
"primitive": "bool"
},
"is_c_variadic": false
},
"generics": {
"params": [],
"where_predicates": []
},
"header": {
"is_const": false,
"is_unsafe": false,
"is_async": false,
"abi": "Rust"
},
"has_body": true
}
}
},
"0:1960:2987": {
"id": "0:1960:2987",
"crate_id": 0,
"name": "description",
"span": {
"filename": "crates/oas3/src/spec/link.rs",
"begin": [
116,
8
],
"end": [
116,
35
]
},
"visibility": "default",
"docs": "A description of the link.\n\n[CommonMark syntax](https://spec.commonmark.org) MAY be used for rich text\nrepresentation.",
"links": {},
"attrs": [
"#[serde(skip_serializing_if = \"Option::is_none\")]"
],
"deprecation": null,
"inner": {
"struct_field": {
"resolved_path": {
"name": "Option",
"id": "2:45752:206",
"args": {
"angle_bracketed": {
"args": [
{
"type": {
"resolved_path": {
"name": "String",
"id": "5:7975:258",
"args": {
"angle_bracketed": {
"args": [],
"constraints": []
}
}
}
}
}
],
"constraints": []
}
}
}
}
}
},
What about using a project such as https://crates.io/crates/rusty-man to produce a reduced textual output?
@clarfonthey I grabbed Copilot recently. The goal was to watch to see when it crossed the threshold for someone at my experience level. I expected that to be in a year or more. I have avoided such tools because I figured they'd take more time to work with, translating my problems into what they can understand. Then it started autocompleting lines, offering me full trait impls, and implementing macros. Without being asked in English, in a project doing horrifying things to the type system to see how far it can be pushed, in the niche domain of audio synthesis, just looking at my code and my cursor position. I've gone far enough out in the type system to have recently found an ICE; it's not typical code. I've only been using Copilot for 2 weeks and it's already saved me hours at minimum. Even the stupid small stuff it does reliably like
I don't have the background to evaluate your linked paper, or really to any extent evaluate whether or not an LLM "understands", but the abstract at least doesn't seem to be saying "LLMs don't understand", only that they quantified the limits. Besides, I'm not sure that philosophical understanding is what matters. To me the point is whether it has practical applications, not how it's thinking.
@Folyd This feels premature because of the pace of progress. Premature is so the wrong word, because by the nature of AI right now it will always be, but say this RFC stabilizes in winter 2025. I sure bet that "consume complex HTML" will be a solved problem by then: it seems entirely tractable via preprocessing without even needing AI advances, improving it generically is worth tons and tons of money/value, and we are still nonetheless getting AI advances on top. Even if we had it this exact second, riding the 12 week stabilization period is possibly long enough to render it entirely meaningless. In my opinion, at the very least something like this should wait until someone has started really doubling down on standardized protocols for this stuff. Or alternatively, maybe you would want to push for giving rustdoc plugins or something, if it doesn't already have them (to my knowledge it doesn't, but I have never had the need to look). Basically: see also O3. We won't know the impact of O3 for a while yet, but this RFC is entirely a bet that it won't be invalidated by the time it's done.
@ahicks92 I don't think this is premature. Providing text formats for LLMs is becoming a trend: see https://llmstxt.org/, which proposes standardizing /llms.txt files to help LLMs better understand websites at inference time. Will standardized protocols emerge for teaching LLMs context? I don't think so. AI models are smart enough to understand arbitrary text; what they need are suitable formats for different scenarios. Understanding versatile HTML pages to learn a new crate version isn't an intelligence problem; it actually requires significant engineering effort. By providing a text version of the docs, we can make this information easily accessible via URLs like https://docs.rs/oas3/0.11.3/docs.txt.
/// use std::collections::BTreeMap;
/// let map: BTreeMap<i32, &str> = BTreeMap::new();
/// ```
pub fn new() -> Self
Suggested change:
- pub fn new() -> Self
+ pub fn new() -> Self;
It would be much more useful if the output could at least be consumed by a standard parser.
(Alternatively, reuse the `-Z unpretty=everybody_loops` output:
- pub fn new() -> Self
+ pub fn new() -> Self { loop {} }
)
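To illustrate why the `{ loop {} }` form stays valid Rust: a body of `loop {}` diverges, so it type-checks against any return type. A minimal sketch, using a hypothetical `Stub` type as a stand-in (it is not part of the RFC or of std):

```rust
/// Hypothetical stand-in type, used only for illustration.
pub struct Stub;

impl Stub {
    /// Signature-only rendering: the real body is replaced by a diverging
    /// loop, whose type `!` coerces to the declared return type `Self`.
    pub fn new() -> Self {
        loop {}
    }
}
```

Such output is meant to be parsed or type-checked, never executed: calling `Stub::new()` would loop forever.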
# Reference-level explanation
The implementation will require:

1. Add `text` as a new value for the existing [`--output-format` flag](https://doc.rust-lang.org/nightly/cargo/commands/cargo-rustdoc.html#option-cargo-rustdoc---output-format)
As you have mentioned in the Prior Art section, the output format is closer to the .d.ts form. In other words, it is structured rather than merely "plain text". So the new format would be better named something like "rust-interface" instead of just "text".

1. Add `text` as a new value for the existing [`--output-format` flag](https://doc.rust-lang.org/nightly/cargo/commands/cargo-rustdoc.html#option-cargo-rustdoc---output-format)
2. New visitor pattern in rustdoc that:
- Only traverses public items
Irrelevant; I may want to use --document-private-items and --document-hidden-items.
- Only traverses public items
@Folyd I don't feel like official Rust tooling is the place. In the very best case, Rust does not have the ability to be nimble. Being at least 3 months plus an RFC behind advances in a field with a product cycle of 2 years or under isn't a good place to be. If O3 is actually as superhuman at coding as claimed, this RFC has already fallen even before you posted it. Mature programming language communities are rigid and somewhat inflexible on purpose. I would be willing to bet significant amounts of money that "make a text version as humans" won't matter in another 2 years. Not hyperbole, if I could actually place such a bet I'd do so. Being an official part of Rust puts such tooling in a spot where they cannot respond. What would change my mind is a good argument why this is a local maximum that we are going to be stuck in for a long enough time for it to matter.
Also, wrt providing context in a standardized manner, Anthropic is trying to launch the Model Context Protocol. You were right that there were no real efforts on that before late November, but then late November happened. That's really my point in a nutshell. You're directly arguing against Anthropic (who wants to standardize this stuff) and OpenAI (who claims that they made a coding AI better than most devs on the planet).
As a really concrete way to put it: this RFC argues for stripping context. History argues for providing more context. Are you just going to "un-strip" it for the next couple years every time the AI people outdo your format? I think doing that is actually fine, but it can't be done on a 4+ month cycle where every change requires community approval.
I do not believe that any changes that T-rustdoc provides will necessarily be useful to LLMs in the future, precisely because they are an in-flux technology that will likely, over time,
These mean a proposed "redux" format targeted at LLM usage is ill-suited for a formal addition, and thus for a stability guarantee of any kind. It is very likely not to be the desired format within 3 months, never mind 3 years.
...and, as @ahicks92 says, most importantly: this could simply be implemented as a library that filters the JSON output for useless-seeming fields and produces a summarized format, discarding data deemed currently irrelevant. There are technically no "stability guarantees" on the JSON format, but the rustdoc team tries to version it in such a way that makes migration easy, and it would be an easier "target".
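A rough sketch of what such an out-of-tree filter could look like. Everything here is an assumption for illustration: the `Item` struct is a hypothetical, heavily simplified stand-in for an entry of rustdoc's JSON index (the real schema is far richer), and `summarize` is an invented helper, not an existing rustdoc or crate API.

```rust
/// Hypothetical, simplified stand-in for one item of rustdoc's JSON output.
pub struct Item {
    /// Pre-rendered signature, e.g. "pub fn new() -> Self".
    pub signature: String,
    /// Doc comment text, if any.
    pub docs: Option<String>,
    /// True for compiler-generated items such as derived `Clone` impls.
    pub automatically_derived: bool,
}

/// Render a compact text summary: each doc line as `///`, then the signature.
/// Derived items are treated as noise and skipped.
pub fn summarize(items: &[Item]) -> String {
    let mut out = String::new();
    for item in items.iter().filter(|i| !i.automatically_derived) {
        if let Some(docs) = &item.docs {
            for line in docs.lines() {
                out.push_str("///");
                if !line.is_empty() {
                    out.push(' ');
                    out.push_str(line);
                }
                out.push('\n');
            }
        }
        out.push_str(&item.signature);
        out.push('\n');
    }
    out
}
```

Because the filter lives outside rustdoc, which fields count as "noise" can change as often as needed without going through an RFC.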
@workingjubilee I don't have the time. I'm too busy figuring out what I'm doing about career development given O3, lol (but seriously only half a joke, though, having to pivot out of coding on a 5-10 year timeline as a blind person isn't funny at all). But if I was going to do it, that's what I'd try first. Nicely future proof too, since LLMs don't regress on given prompts, only improve.