
RFC: add LLM text version to rustdoc #3751

Open · wants to merge 2 commits into master

Conversation

Folyd commented Dec 30, 2024

Rendered


@clarfonthey
Contributor

I prefer my oceans unboiled, thanks.

@lebensterben

I'd like to add that the proposed format is not only useful to LLMs, but also much more readable to humans than other formats when you just need a synopsis of a library.

@clarfonthey
Contributor

You can already view code with implementations collapsed in most IDEs, so as far as I'm concerned, this isn't adding anything new for users.

The largest issue with this format is that it's not very searchable and thus not useful to humans. rustdoc is designed to be as easy for humans to digest as possible, so I would recommend bringing up any shortcomings it has as issues for the maintainers to tackle.

To clarify, I am perfectly fine with new formats for rustdoc if they serve some actual use case. Feeding information into statistical models that cannot understand it, wasting massive amounts of resources in the process, is not a good use case.

As I said, I prefer my oceans unboiled.

@Folyd
Author

Folyd commented Dec 30, 2024

@clarfonthey thanks for the comments.

Let me explain why I want to add this feature. I needed an LLM to use the oas3 crate to generate OpenAPI Specifications. However, due to its knowledge cutoff, the code the LLM generated was based on v0.4.0, while the latest version is v0.13.1. I wanted the LLM to learn from the latest documentation to generate up-to-date code.

While rustdoc has an experimental JSON format, it contains a lot of unrelated information that's not useful for LLMs and is quite large - the JSON output for oas3 v0.13.1 is 5.5MB. That's why I believe we need a new text format that helps AI learn new versions quickly. I estimate the text version of oas3 v0.13.1 would be less than 1KB, containing just the essential public API information.

@Folyd
Author

Folyd commented Dec 30, 2024

Here is a small part of the JSON format output from oas3 v0.13.1. The complete JSON file is 161,873 lines long (after formatting), which is totally unsuitable for LLM consumption.

{
    "root": "0:0:2543",
    "crate_version": "0.13.1",
    "includes_private": false,
    "index": {
        "0:485": {
            "id": "0:485",
            "crate_id": 0,
            "name": null,
            "span": {
                "filename": "crates/oas3/src/spec/components.rs",
                "begin": [
                    16,
                    16
                ],
                "end": [
                    16,
                    21
                ]
            },
            "visibility": "default",
            "docs": null,
            "links": {},
            "attrs": [
                "#[automatically_derived]"
            ],
            "deprecation": null,
            "inner": {
                "impl": {
                    "is_unsafe": false,
                    "generics": {
                        "params": [],
                        "where_predicates": []
                    },
                    "provided_trait_methods": [
                        "clone_from"
                    ],
                    "trait": {
                        "name": "Clone",
                        "id": "2:2906:114",
                        "args": {
                            "angle_bracketed": {
                                "args": [],
                                "constraints": []
                            }
                        }
                    },
                    "for": {
                        "resolved_path": {
                            "name": "Components",
                            "id": "0:482:2759",
                            "args": {
                                "angle_bracketed": {
                                    "args": [],
                                    "constraints": []
                                }
                            }
                        }
                    },
                    "items": [
                        "0:486:501"
                    ],
                    "is_negative": false,
                    "is_synthetic": false,
                    "blanket_impl": null
                }
            }
        },
        "0:1694:731": {
            "id": "0:1694:731",
            "crate_id": 0,
            "name": "eq",
            "span": {
                "filename": "crates/oas3/src/spec/license.rs",
                "begin": [
                    11,
                    23
                ],
                "end": [
                    11,
                    32
                ]
            },
            "visibility": "default",
            "docs": null,
            "links": {},
            "attrs": [
                "#[inline]"
            ],
            "deprecation": null,
            "inner": {
                "function": {
                    "sig": {
                        "inputs": [
                            [
                                "self",
                                {
                                    "borrowed_ref": {
                                        "lifetime": null,
                                        "is_mutable": false,
                                        "type": {
                                            "generic": "Self"
                                        }
                                    }
                                }
                            ],
                            [
                                "other",
                                {
                                    "borrowed_ref": {
                                        "lifetime": null,
                                        "is_mutable": false,
                                        "type": {
                                            "resolved_path": {
                                                "name": "License",
                                                "id": "0:1686:3087",
                                                "args": {
                                                    "angle_bracketed": {
                                                        "args": [],
                                                        "constraints": []
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            ]
                        ],
                        "output": {
                            "primitive": "bool"
                        },
                        "is_c_variadic": false
                    },
                    "generics": {
                        "params": [],
                        "where_predicates": []
                    },
                    "header": {
                        "is_const": false,
                        "is_unsafe": false,
                        "is_async": false,
                        "abi": "Rust"
                    },
                    "has_body": true
                }
            }
        },
        "0:1960:2987": {
            "id": "0:1960:2987",
            "crate_id": 0,
            "name": "description",
            "span": {
                "filename": "crates/oas3/src/spec/link.rs",
                "begin": [
                    116,
                    8
                ],
                "end": [
                    116,
                    35
                ]
            },
            "visibility": "default",
            "docs": "A description of the link.\n\n[CommonMark syntax](https://spec.commonmark.org) MAY be used for rich text\nrepresentation.",
            "links": {},
            "attrs": [
                "#[serde(skip_serializing_if = \"Option::is_none\")]"
            ],
            "deprecation": null,
            "inner": {
                "struct_field": {
                    "resolved_path": {
                        "name": "Option",
                        "id": "2:45752:206",
                        "args": {
                            "angle_bracketed": {
                                "args": [
                                    {
                                        "type": {
                                            "resolved_path": {
                                                "name": "String",
                                                "id": "5:7975:258",
                                                "args": {
                                                    "angle_bracketed": {
                                                        "args": [],
                                                        "constraints": []
                                                    }
                                                }
                                            }
                                        }
                                    }
                                ],
                                "constraints": []
                            }
                        }
                    }
                }
            }
        },
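For contrast, a condensed text format along the lines this RFC proposes could render those same items in a handful of lines. A hypothetical sketch (item names are taken from the JSON above; the struct owning the `description` field is inferred from the `link.rs` file path, so treat it as illustrative, not the final format):

// Hypothetical condensed output, reconstructed from the JSON items above.
impl Clone for Components;

impl PartialEq for License {
    fn eq(&self, other: &License) -> bool;
}

pub struct Link {
    /// A description of the link. [CommonMark syntax](https://spec.commonmark.org)
    /// MAY be used for rich text representation.
    pub description: Option<String>,
    // ...
}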

@juntyr
Contributor

juntyr commented Dec 30, 2024

What about using a project such as https://crates.io/crates/rusty-man to produce a reduced textual output?

@ahicks92

@clarfonthey
I want to provide a bit of personal experience as a skeptic about whether or not LLMs are useful now (I have never been a skeptic that some form of AI coder is coming; my skepticism is about the now part):

I grabbed Copilot recently. The goal was to watch for when it crossed the threshold of usefulness for someone at my experience level. I expected that to be in a year or more. I had avoided such tools because I figured they'd take more time to work with, translating my problems into something they can understand.

Then it started autocompleting lines, offering me full trait impls, and implementing macros. Without being asked in English, in a project doing horrifying things to the type system to see how far it can be pushed, in the niche domain of audio synthesis, just by looking at my code and my cursor position. I've gone far enough out in the type system to have recently found an ICE; it's not typical code. I've only been using Copilot for 2 weeks and it's already saved me hours at minimum. Even the stupid small stuff it does reliably, like for<'il>: Signal<Input<'il>=..., is super helpful--and then, sometimes, it has an insight about what I want to do and just does it without being asked for the next 20 or 30 lines. Oh sure, none of it is senior-level coding. I won't disagree with that at all. But "does my chores, and sometimes reads my mind, producing code in my style" is super valuable.

I also have multiple friends using this stuff as a learning aid--what little understanding LLMs are capable of is beyond a new programmer, and they find having a personalized explanation is worth it. I think there are two usage modes, essentially: if you are just getting started, chatting with it can help; if you are super experienced, leaning on it for the boring, thoughtless parts is great. I will say that there was a little bit of an adjustment; if I had approached it adversarially, if you will, I'd probably not have started developing what little of the "work with the AI" skillset I've begun to pick up (I cannot yet put this into words; suffice it to say that the fact that it's an alien mind is obvious, it's just a useful alien mind).

I don't have the background to evaluate your linked paper, or really to evaluate to any extent whether or not an LLM "understands", but the abstract at least doesn't seem to be saying "LLMs don't understand", only that they quantified the limits. Besides, I'm not sure that philosophical understanding is what matters. To me the point is whether it has practical applications, not how it's thinking.

@Folyd
Now that said I see a few problems with this right off the bat:

  • There's no real reason why this can't be an external tool. This RFC doesn't lay out a path for consumption of the data.
  • No one in AI land has yet standardized on how to provide context. There are efforts to do so, but the most promising one that I know of is like a month old at most.
  • Every time anyone says "AI is like X", for all X, 6 months from now that's no longer the case.

This feels premature because of the pace of progress. Premature is almost the wrong word, because by the nature of AI right now it will always be premature, but say this RFC stabilizes in winter 2025. I'd bet that "consume complex HTML" will be a solved problem by then: it seems entirely tractable via preprocessing without even needing AI advances, improving it generically is worth tons and tons of money/value, and we are still getting AI advances on top regardless. Even if we had it this exact second, riding out the 12-week stabilization period is possibly a long enough time to render it entirely meaningless. In my opinion, at the very least, something like this should wait until someone has started really doubling down on standardized protocols for this stuff. Or alternatively, maybe you would want to push for giving rustdoc plugins or something, if it doesn't already have them (to my knowledge it doesn't, but I have never had the need to look).

Basically: see also O3. We won't know the impact of O3 for a while yet, but this RFC is entirely a bet that it won't be invalidated by the time it's done.

@Folyd
Author

Folyd commented Dec 30, 2024

@ahicks92 I don't think this is premature. Providing text formats for LLMs is becoming a trend - see https://llmstxt.org/, which proposes standardizing /llms.txt files to help LLMs better understand websites at inference time.

Will standardized protocols emerge for teaching LLMs context? I don't think so. AI models are smart enough to understand arbitrary text - what they need are formats suited to different scenarios. Parsing varied, complex HTML pages to learn a new crate version isn't an intelligence problem; it just requires significant engineering effort. By providing a text version of the docs, we can make this information easily accessible via URLs like https://docs.rs/oas3/0.11.3/docs.txt.
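For reference, the llms.txt proposal linked above is itself just structured Markdown: an H1 with the project name, a blockquote summary, and H2 sections of links. A hypothetical /llms.txt entry pointing at the text docs this RFC would generate might look like this (the summary wording and the crates.io link are illustrative):

# oas3

> Rust types for parsing and navigating OpenAPI v3 specifications.

## Docs

- [Condensed API text](https://docs.rs/oas3/0.11.3/docs.txt): the per-version output proposed in this RFC
- [Crate page](https://crates.io/crates/oas3): versions and metadata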

/// use std::collections::BTreeMap;
/// let map: BTreeMap<i32, &str> = BTreeMap::new();
/// ```
pub fn new() -> Self
Member

Suggested change
pub fn new() -> Self
pub fn new() -> Self;

It would be much more useful if the output could at least be consumed by a standard parser.

(Alternatively, reuse the -Z unpretty=everybody_loops output:

Suggested change
pub fn new() -> Self
pub fn new() -> Self { loop {} }

)
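A quick sketch of why the everybody_loops form helps: with every body replaced by `loop {}`, the stub file stays valid Rust, so an off-the-shelf parser can consume it without bespoke tooling. A minimal sketch, assuming the syn crate with its "full" feature (the item names are illustrative):

use syn::parse_file;

fn main() -> syn::Result<()> {
    // A stub in the `everybody_loops` style: bodies are replaced by
    // `loop {}`, keeping every item syntactically valid Rust.
    let stub = r#"
        pub struct Components {}

        impl Components {
            pub fn new() -> Self { loop {} }
        }
    "#;

    // Because the stub is valid Rust, a standard parser handles it directly.
    let file = parse_file(stub)?;
    println!("parsed {} top-level items", file.items.len());
    Ok(())
}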

# Reference-level explanation
The implementation will require:

1. Add `text` as a new value for the existing [`--output-format` flag](https://doc.rust-lang.org/nightly/cargo/commands/cargo-rustdoc.html#option-cargo-rustdoc---output-format)
Member

As you have mentioned in the Prior Art section, the output format is closer to the .d.ts form. In other words, it is structured rather than merely "plain text", so the new format would better be called something like "rust-interface" instead of just "text".


1. Add `text` as a new value for the existing [`--output-format` flag](https://doc.rust-lang.org/nightly/cargo/commands/cargo-rustdoc.html#option-cargo-rustdoc---output-format)
2. New visitor pattern in rustdoc that:
- Only traverses public items
Member

Irrelevant; I may like to use --document-private-items and --document-hidden-items.

Suggested change
- Only traverses public items

@ehuss added the T-rustdoc label (Relevant to rustdoc team, which will review and decide on the RFC) Dec 30, 2024
@ahicks92

@Folyd
Arguing that something is a trend isn't a useful technical argument in a field advancing faster than any other field on the planet. I'm not sure what is, because any bet is ultimately a bet that we won't turn around tomorrow and find some other advance. Most AI skeptics claim that x won't happen on y timeline, and then x happens on z < y timeline. No one can even say what the limits are. To put it the way I put it for my less technical family: in significantly less time than it takes to raise a teenager, we have gone from zero to teenager-level AI. Assuming that we will slow down on very basic parts of the process isn't what history aligns with.

I don't feel like official Rust tooling is the place. In the very best case, Rust does not have the ability to be nimble. Being at least 3 months plus an RFC behind advances in a field with a product cycle of 2 years or under isn't a good place to be. If O3 is actually as superhuman at coding as claimed, this RFC had already fallen behind even before you posted it. Mature programming language communities are rigid and somewhat inflexible on purpose. I would be willing to bet significant amounts of money that "make a text version the way humans would" won't matter in another 2 years. Not hyperbole: if I could actually place such a bet, I'd do so. Being an official part of Rust puts such tooling in a spot where it cannot respond.

What would change my mind is a good argument why this is a local maximum that we are going to be stuck in for a long enough time for it to matter.

Also, wrt providing context in a standardized manner, Anthropic is trying to launch the Model Context Protocol. You were right that there were no real efforts on that before late November, but then late November happened. That's really my point in a nutshell. You're directly arguing against Anthropic (who wants to standardize this stuff) and OpenAI (who claims that they made a coding AI better than most devs on the planet).

As a really concrete way to put it: this RFC argues for stripping context. History argues for providing more context. Are you just going to "un-strip" it for the next couple of years, every time the AI people outdo your format? I think doing that is actually fine, but it can't be done on a 4+ month cycle where every change requires community approval.

@workingjubilee
Member

workingjubilee commented Jan 1, 2025

I do not believe that any changes T-rustdoc provides will necessarily be useful to LLMs in the future, precisely because LLMs are an in-flux technology that will likely, over time,

  • have different context window lengths
  • have different ways of compressing context (currently we mostly use word/subword "tokens", but that could go up or down "levels" in the future), thus the same notional window length may cover radically different sizes of text
  • have differing benefit from different inputs to their context window

These mean that a proposed "redux" format targeted at LLM usage is ill-suited for a formal addition, and thus for a stability guarantee of any kind. It is very likely not to be the desired format within 3 months, never mind 3 years.

...and, as @ahicks92 says, most importantly: This could simply be implemented as a library that filters the JSON output for useless-seeming fields and produces a summarized format, discarding data deemed currently irrelevant. There's technically no "stability guarantees" on the JSON format but the rustdoc team tries to version it in such a way that makes migration easy, and it would be an easier "target".
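For concreteness, a filter like that could be a few dozen lines on top of the rustdoc-types crate, which mirrors the JSON schema. A rough sketch (the input path is hypothetical, and the field names track an unstable format, so expect churn):

use std::fs;

use rustdoc_types::{Crate, Visibility};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // JSON produced by `cargo +nightly rustdoc -- -Z unstable-options
    // --output-format json`; the path below is hypothetical.
    let json = fs::read_to_string("target/doc/oas3.json")?;
    let krate: Crate = serde_json::from_str(&json)?;

    // Keep only named public items; drop spans, impl plumbing, and the
    // other fields a summary doesn't need.
    for item in krate.index.values() {
        if matches!(item.visibility, Visibility::Public) {
            if let Some(name) = &item.name {
                // Emit the name plus the first doc line as a stand-in for
                // a real signature printer.
                let first_doc_line = item
                    .docs
                    .as_deref()
                    .and_then(|d| d.lines().next())
                    .unwrap_or("");
                println!("{name}: {first_doc_line}");
            }
        }
    }
    Ok(())
}

Keeping this out of tree also means the filter can change as fast as the models do, without riding rustdoc's stabilization cycle.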

@ahicks92

ahicks92 commented Jan 2, 2025

@workingjubilee
I've actually thought about this further... I bet you could implement this as a prompt to a current-gen LLM: feed it the "noisy" JSON/HTML - "This is noisy JSON not friendly to LLMs. Produce JSON friendly to LLMs", or whatever - and then that output is the context that goes back in. Obviously with better prompt engineering.

I don't have the time; I'm too busy figuring out what I'm doing about career development given O3, lol (but seriously, it's only half a joke: having to pivot out of coding on a 5-10 year timeline as a blind person isn't funny at all). But if I were going to do it, that's what I'd try first. It's nicely future-proof too, since LLMs don't regress on given prompts, only improve.
