Please help me test tool-use (function calling) #514
@agzam -- in case you're interested.
Oh wow, this is very cool. So many interesting ideas to try. I'm excited, will give it a go. Thank you!
Excellent. Will give it a go.
Wow, this works great. I just created tools for cat-ing files, doing ls, and creating new files. I had trouble with OpenAI calling the tools, but Claude Sonnet works fine. I even had Claude write a couple of tools, then evaluated the Emacs Lisp blocks inside the org mode buffer and had Claude immediately start using them.
Holy bootstrap, Batman!
Did it throw an error or just ignore the tools? If it was a silent failure you can check the `*gptel-log*` buffer.
It seems to be calling the tools but fails at the end. It also failed with Gemini, Llama, and Qwen for me, but I have to double-check my tools because I think it was working for simpler use cases a while ago. This same prompt works fine with the Claude models Sonnet and Haiku.
Here is the *Messages* buffer and attached is the log
[gptel-tool-use-log-openai.txt](https://github.com/user-attachments/files/18225771/gptel-tool-use-log-openai.txt)
```
Querying OpenAI...
gptel: moving from INIT to WAIT
gptel: moving from WAIT to TYPE
gptel: moving from TYPE to TOOL
error in process sentinel: let: Wrong type argument: stringp, nil
error in process sentinel: Wrong type argument: stringp, nil
```
Strange, that data stream does not look like what I'd expect at all. As a result, the tool call from OpenAI is not being parsed correctly. I'll look into this.
Independent of that, it looks like there's an error in gptel's tool call handler. Could you repeat this experiment after `M-x toggle-debug-on-error` and paste the backtrace here?
Update: it seems Gemini, Llama, and Qwen models work only if I make a request that requires a single tool call. For example, I asked each to summarize a URL, and separately to do a directory listing on my local machine, and these types of interactions work.
Could you try it with this OpenAI backend?

```elisp
(gptel-make-openai "openai-with-parallel-tool-calls"
  :key YOUR_OPENAI_API_KEY
  :stream t
  :models gptel--openai-models
  :request-params '(:parallel_tool_calls t))
```

Parallel tool calls are supposed to be enabled by default, so I'm not expecting that this will work, but it would be wise to verify.
Could you also share the tool definitions you used in this failed request? I'd like to try reproducing the error here.
For Ollama, I see the tool being sent to Ollama in the gptel log buffer, but none of the models ever actually seem to use the tools. I've tried Mistral Nemo, Qwen 2.5, Mistral Small, and Llama 3.2 Vision.
@ProjectMoon Could you share the tools you wrote so I can try to reproduce these issues?
Just a copy and paste of the example one. I will try again at some point in the coming days with Ollama debugging mode turned on to see what reaches the server. Edit: I also need to test with a direct Ollama connection. This might be (and probably is) a bug in Open WebUI's proxied Ollama API.
I get these types of errors. Attached are my gptel config files: the regular one I use, and a minimal version I made for testing tools using your suggested "openai-with-parallel-tool-calls" backend.
@jester7 Thanks for the tool definitions. I've fixed parallel tool calls for the Claude and OpenAI-compatible APIs. Please update and test both the streaming and non-streaming cases. You can turn off streaming with `gptel-stream`. Parallel tool calls with the Gemini and Ollama APIs are still broken. All these APIs validate their inputs differently, and the docs don't contain the validation schema, so adding tool calls is a long crapshoot. Still, we truck on.
Update: Parallel tool calls with Gemini work too, but only as long as all function calls involve arguments. Zero-arity functions break it.
OK, it definitely seems to be more of a problem with OpenWebUI's proxied Ollama API... although it was supposedly resolved to be able to pass in structured inputs. I will have to dig into the source code to see if it even does anything with the tools parameter. I was able to make a tool call when connecting directly to the Ollama instance using Mistral Nemo. Edit: Yep, it doesn't have the tools param in the API, so it's discarded silently.
Thanks. When we merge this we should add a note to the README about the tool-use incompatibility with OpenWebUI. |
Parallel tool calls now work with Ollama too, but you have to disable streaming: Ollama does not support tool calls with streaming responses.
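A minimal sketch of the workaround, assuming the standard `gptel-stream` streaming toggle:

```elisp
;; Workaround sketch: Ollama rejects tool calls on streaming requests,
;; so turn streaming off (globally here; it can also be let-bound
;; around individual requests).
(setq gptel-stream nil)
```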
I've updated the opening post with a status table I'll keep up to date.
So I added the tools parameter to OpenWebUI (it was just adding a single line to the chat completions form class, it seems). Then I get a response back from the proxied ollama API containing the tool call to use. But unlike when connecting directly, gptel seems to do nothing. Looking at the elisp code, the only thing that makes sense is the content from the OWUI response being non-empty, but both OWUI response and the direct connection response have `"content": ""` o_O
Is there any other difference between the OWUI and direct connection in the response log?
Also you can click on the status in the header-line (the text that says "Waiting..." or "Ready...") to see the process state. It will help to know if the :state is recorded as DONE, ERRS or TOOL.
> 3. Same as 2, but suggest ways that the feature can be improved, especially in the UI department.
I'm only recently picking Emacs back up; the last time I regularly used it was before tools like GPT existed. But if I understand correctly: since the tool results aren't echoed to the chat buffer created by `M-x gptel`, and if `gptel-send` sends the buffer contents before the point, then won't any added context that results from a tool call get dropped from the conversation in the next message round?
If that is the case, perhaps it would be nice to provide a way to help people capture tool results to the context tooling provided by gptel? Or maybe have the tool results echoed to the chat buffer (perhaps in a folded block?)
That's a good point. I've had this problem already in an exchange like the following:
Prompt: Summarize the article at <some link> for me
Response: Sure, let me fetch that article first. <tool call happens here>
<summary follows>
Prompt: Why do they believe X (referring to the article here)
Response: Let me fetch the article again <tool call happens here>
<Answer to my question>
So Claude had to fetch the article again after each response. We need some generic way to include the tool results (when relevant) in chat buffers. Of course, tool use isn't always about the results or back-and-forth conversations -- for many tools the tool call output is either meaningless or irrelevant. So we probably also need a way to identify which tool results need to be echoed to buffers.
@ProjectMoon This can happen if you have streaming turned on, since Ollama does not support tool calls with streaming responses. (I will eventually handle this internally, where streaming is automatically disabled for requests that include tools.)
I've added tool selection support to gptel's transient interface. Selecting a category (like "filesystem" or "emacs" here) will toggle all the tools in that category. Tool selection can be done globally, buffer-locally, or for the next request only, using the Scope option. This makes it much more convenient to select the right set of tools for the task at hand. (LLMs get confused if you include a whole bunch of irrelevant tools.) I've also updated the opening post above with the tool definitions you see in the above image. You can grab them from there and evaluate them. I'm not sure yet if gptel should include any tools by default.
And here is a demo of using the filesystem toolset to make the LLM do something that's otherwise annoying to do: gptel-tool-use-filesystem-demo.mp4
@jarreds Yes. It's viable and I'm working on more of it.
I don't quite understand this answer because to me it is definitely a tool call problem. Maybe it will make more sense after my demo.
It was supposing we did context cleaning/prompt filtering. Which we will, for sure. It's just going to take a while, because I both have to and want to move slowly. Have to because of severe time constraints; want to because I find the decisions are better and the project more sustainable at lower velocities, and while I can still load the whole thing into my head.
It's going to be at least two parses, not one, but the effort isn't the issue with text properties (unlike org-element). The problem is that the above is an 80% solution, for several reasons. Chat files are not static records -- properties get added and removed programmatically, as do various other kinds of text. Chat files also need to be persisted. That said, I'd like to whip up a quick prototype for you to test, hopefully soon.
Yeah, I'm not particularly worried about the memory, only about Emacs hitching for a second when you send the query. I hate that. I test gptel on a low-end 2012 Thinkpad now and then to ensure that creating the payload and streaming responses remains buttery-smooth.
I'd bank on the igc branch landing before long, so the upstream fix will happen before you get to a downstream bottleneck.
It can be useful to do this reactively via tools, but it's much simpler to include a repo map (identifiers, relationships etc) as context with the original query. This will work with every model and doesn't require tool-use capability or an expensive back-and-forth. That said, at this point you've played with tool-use applications more than I have, as I spent most of my time just getting the thing working. So I look forward to learning from your experience.
My high-value use cases are generally dependent on lots of successive tool calls, so auto-mimicry is a compounding failure-rate problem. The model fakes a tool call maybe 2% of the time on the first call -- I don't think I've ever seen it, actually. After there are real tool calls in the context, the rate goes up. I've tried to prompt-engineer around this without apparent success. The models are just instantly mesmerized by the tool calls in the context.
Here's a crappy first pass at filtering the tool-call header/footer. I need to test a bit and see if the LLM continues to confuse itself or not. I'd like to not guess too hard: I haven't seen any first-hand example of how tool calls usually exist in context, and I'd like to be as close as possible to what they train on. Think I saw another first: making my prompt request org mode output resulted in a fake tool call using an elisp source block instead of a normal ``` fence.
Thanks, FSM works great! Do you have any plans to support the programmatic use of tools with `gptel-request`?
This is good enough for your testing, I hope? I've got a more integrated, single-scan version partially working here; I'll share it soon.
Any thoughts on this approach? (In the future I plan to change the value format.)
If you want to do this properly, you have to populate the messages array with the tool call and response messages when creating the full prompt, and not include it in the response chunk. Then the LLM is guaranteed to "understand" that it called a tool with the provided results in previous conversation turns. This is doable from buffer text, but it's going to be messy and fragile. It's easier if you store the tool call log in an overlay or another text property, but as I mentioned above, I don't want to increase the amount of invisible state in the buffer.
This is very common. gptel should handle this fine already. If it doesn't, please report a bug.
I'm not planning to, no. |
In the spirit of the simplicity of the one-shot prompt example at https://github.com/karthink/gptel/wiki/Defining-custom-gptel-commands, such a simple call could transparently handle tools as well.
@link0ff Ah, I understand -- gptel already provides this:

```elisp
(defvar gptel-lookup--history nil)

(defun gptel-lookup (prompt)
  (interactive (list (read-string "Ask ChatGPT: " nil gptel-lookup--history)))
  (when (string= prompt "") (user-error "A prompt is required."))
  (let ((gptel-tools
         (mapcar #'gptel-get-tool '("search_web" "get_youtube_transcript"))) ;or use global value
        (gptel-include-tool-results t) ;or use global value
        (gptel-confirm-tool-calls t))  ;or use global value
    (gptel-request prompt
      :callback
      (lambda (response info)
        (cond
         ;; ERROR
         ((null response)
          (message "gptel-lookup failed with message: %s"
                   (plist-get info :status)))
         ;; Received a RESPONSE or TOOL CALL RESULT
         ((stringp response)
          (with-current-buffer (get-buffer-create "*gptel-lookup*")
            (let ((inhibit-read-only t))
              ;; (erase-buffer) ;might be called multiple times, don't erase!
              (insert response))
            (special-mode)
            (display-buffer (current-buffer)
                            `((display-buffer-in-side-window)
                              (side . bottom)
                              (window-height . ,#'fit-window-to-buffer)))))
         ;; Received TOOL CALL CONFIRMATION or TOOL CALL RESULT
         ((consp response)
          (gptel--display-tool-calls response info 'use-minibuffer)))))))
```

If the relevant booleans are enabled or let-bound, tool call confirmations are handled via minibuffer prompts, and tool results are passed back to the callback. The one-shot example in the wiki needs to be updated accordingly.
If tool calls are orthogonal to the other necessary messages, it completely makes sense why the model auto-mimics to such a high degree now. I'll see if I can hack this up.
This was my API error message from attempting to include tool call results in chat completion requests that weren't directly preceded by tool calls.
I don't see the nonces in the API. How does it match tool responses to calls if it's all stateless? 🤷
Yes, this is a good way. I did start ignoring all the extraneous stuff, including the function call and args, response separator etc. I moved the part that adds the response property upstream. The callback for inserting tool use results was the wrong place because it was clobbering 'ignore properties. General feeling is that auto-mimicry goes way down, so there's that.
Aha! Looks like the original tool calls need to be persisted as an assistant message. I just didn't look deep enough into the assistant message spec: https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages I can do this by... writing the accepted tool calls to the top of the response. I'm kind of breaking some things here and there. Only the ChatGPT backend will support this initially. The biggest thing I don't understand is how context works without a chat buffer. Can you fill me in a bit? Where does context come from for in-place work? Doing any kind of interposing of responses is looking more difficult, though. I think a better approach is to dynamically decide where to insert response headers so that I don't have empty response prefixes. This is for my HK-47 vanity problem.
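For reference, a sketch of the message sequence that spec describes, shown in the plist shape gptel's JSON parsing produces (the tool name, call id and values here are illustrative):

```elisp
;; Illustrative shape only: an assistant message carrying :tool_calls,
;; followed by a "tool" message whose :tool_call_id matches the call's :id.
'[(:role "user" :content "What's the weather in Seoul?")
  (:role "assistant" :content :null
   :tool_calls [(:id "call_abc123"
                 :type "function"
                 :function (:name "get_weather"
                            :arguments "{\"location\": \"Seoul\"}"))])
  (:role "tool" :tool_call_id "call_abc123" :content "18C and clear")
  (:role "assistant" :content "It's currently 18C and clear in Seoul.")]
```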
master...psionic-k:gptel:filter-tool-call This is almost completely working on the OpenAI backend. Whenever there is a huge pile of parallel requests I have an issue I need to look at, but parallel calls on simple tools are working. I need to completely redo the commits, and I'm sure fix up a lot of other garbage. See the diff; the commits are noisy.
Thanks. Regarding the "final state", it probably can't be detected with a universal condition, since such a thing isn't well-defined.
@psionic-k Are you intending to submit a PR or just creating a proof of concept? In case it's the former -- your approach breaks several aspects of gptel, including chat persistence, internal API boundaries and gptel's "modular" design.
("modular" in quotes because I've almost fully detached the In general the idea of using text properties for role assignment (prompt, response, tool call etc) is not compatible with chat persistence (to disk), so I try to limit it to just response boundaries. One solution is to go full org-element, but that requires (i) defining new syntax, (ii) performance optimizations and (iii) backwards compatibility guarantees, since org-element is slow and changed significantly between Org 9.6 and 9.8-pre. The few bits of org-element I use in
I'm not sure I understand the question. If you're asking how gptel keeps track of tool calls while a back-and-forth multi-turn request is ongoing, it stores intermediate tool calls in the state machine, same as response text. Specifically, it modifies the messages array in-place; see uses of:

```elisp
(thread-first
 gptel--fsm-last
 (gptel-fsm-info)
 (plist-get :data)
 (plist-get :messages))
```

When the interaction ends -- technically, when the state machine reaches its final state -- this record is not carried over to the next request. Essentially, you are attempting to treat the buffer as a lossless, mutable store of the messages array in between calls to `gptel-request`.
I don't know what you mean by "interposing" of responses. I don't plan to support dynamic response header placement because I view the ugliness of your example image (empty response header followed by tool call drawer) as a very niche problem. It requires very specific settings to reproduce, and you can find some way to improve the aesthetics in your personal configuration.
@link0ff This is not true unless you are using streaming responses, and even then the callback is simply not called if a streamed chunk is empty. (I haven't tested that it's impossible for the callback to get a nil response when streaming, but I'll fix it if there are exceptions.) The example callback I provided above also needs an additional cond clause to handle streaming responses. Do you have examples/logs of valid (HTTP 200) empty string responses from any LLM API when not using streaming?
Yes, this is correct.
I don't plan on doing this in gptel, at least right now. You can handle this from the callback as follows:
To be clear, you don't have to do any of the above unless you want full introspection.
Yes. I first just drove in a straight line hack & slashing until I reached what I wanted. Now we have the motivation and context necessary to reconcile.
It's compatible with serialization or decidable re-hydration.
Since this is all about persisting and not how the live data is handled, I would choose serialization. Text properties are atomic to the contents of the conversation during edits.
These details are needed. I'll let you do review before I opine blindly.
TBD. I have two ideas, one of which is to make a feature branch and allow the general demand for other backends like R1 to drive the feature branch back into master. IMO complex tool uses are broken AF without first-class handling of tools and results.
Okay, this gives me a clear expectation. Sounds like requests made against a region lose context unless there is an associated buffer somewhere. In the current version, can there be a buffer for context while making region edits? I have some ideas on the front-end but will cook some more. For now my goal is to figure out the breakage that sometimes happens with tool-result / tool-call correspondence.
Are you familiar with how persistence works in gptel right now?
I don't know what "decidable re-hydration" means, so all of this went over my head.
I'll add some comments.
Sorry, I don't understand what you mean here either. I'm not sure what you mean by first-class handling of tools -- there are at least two different interpretations. I think we are talking past each other; I couldn't follow most of the above points.
The buffer associated with a request on a region is the buffer the region is in.
👍
Intentional. It's speculative until about 48 hours from now?
It means instead of explicitly storing the data we want in the form that we want it, deducing it from the buffer content. However, in this example I'm immediately talking about org syntax, which brings up all the other problems of using the text content/structure. I don't think using the buffer structure is even worth considering as an implicit store of turns. Text properties are orthogonal to buffer contents and so don't depend on the mode or structure. They can get a little bit janky if the user inserts text with other properties, but structure-based approaches can get janky whenever the LLM outputs something that breaks out of the structure. This already happens whenever it decides to emit org mode headings that break branching context.
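A toy sketch of that orthogonality (the `gptel` property with value `response` matches what gptel uses for response text; the example string is mine):

```elisp
;; The role lives in an invisible text property, not in the visible text,
;; so no org heading or code fence the LLM emits can disturb it.
(insert (propertize "Paris is the capital of France." 'gptel 'response))
(get-text-property (1- (point)) 'gptel) ;=> response
```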
No. You said it needed an overhaul. Since I'm implementing new turns, I couldn't imagine not overhauling it.
Thanks. I could imagine cases where we want a working buffer and those are just hard questions to answer when digging into a code base.
First-class means... giving them an explicit representation. In master, after the first tool response to the LLM, subsequent calls just cram them in with assistant responses, which they are not.
Cool.
It currently works exactly how you're imagining it -- text property boundaries stored as local variables or Org properties. Try saving your chat buffer to disk.
Thanks for the explanation. I agree with you, but for different reasons. I think syntax can work if the user is aware of it. For example, if they know that everything inside a particular drawer or block belongs to the tool call.
They can go completely out of sync. I've seen plenty of notebooks from users reporting bugs now where the bounds applied from text properties as persisted to the file are completely off. There are many issues in this repo asking me to add syntax because of how unreliable storing buffer positions is. If you forget to turn on `gptel-mode` before saving, for instance. My hope is that the UI can be separated out from the core.
Good point. Your observation about mimicry makes this harder to solve, although I expect strategic use of the system message can help.
If you have a list of questions the answers to which can drastically speed up your pace of work on gptel, we can hop on a voice call + screen share, and I can save you some time.
You mean the 1:1 mapping between the messages array and chat buffer that I alluded to previously. I think you are narrowly focused on a particular style of LLM use which benefits from this faithful mapping, as guided by your interests/experiments with Emacs introspection. There are many other kinds of tasks where this level of fidelity is completely unnecessary, such as when your tool-use is primarily for side-effects, when the LLM edits other buffers by generating a diff, etc. Even for lookup tasks, quite often the LLM's response as informed by the tool result is all that's needed in the buffer. (The other interpretation I had of "first-class tool call results" is the ability to have structured results, not just strings.)
I'm insisting on the correctness of behavior of tools that build up the context. Whatever needs to change as a consequence, and whatever needs to be fixed in the cascade, is acceptable. It's only a matter of time before language servers, SLIME, etc. get used this way, as a RAG source. The results of calls don't go out of date quickly for these use cases and should be left in the turns.
A file-local variable to activate the mode. A gptel hook function that activates gptel whenever it finds the file-local variable. There are lots of ways to make these reliable enough.
Mainly I want to figure out the areas where the architecture is going one way and my changes appear to be going another. Favored client/handle etc.? I think I can do Google Meet on [email protected] (a Gmail ID). Timezone? I'm on Seoul time.
I agree, and thanks for taking the initiative on this.
I'm not sure about the utility of a chat interface for tasks like this. But creating a more accurate mapping between the buffer and the messages array means we'll be prepared if this works great, at least.
Yeah. They're mostly small changes so far, but the vector is pointing the wrong way.
I've emailed you.
I found my issue and am having some brainstorms and realizations. The breakage in my current changes happens because the callback (or something later) is responsible: I'm clearly propertizing the entire string with the response property. With or without the problem I found, I think that the markdown conversion to org mode is dangerous, simply because it looks like a regex solution for a parser problem. What happens when we are processing source blocks that should be about markdown syntax? I began to desire a block for the tool result. It solves another problem: not being able to see the call and the arguments while folded.
Note the escaped heading. Org can extract the un-escaped string from a block, and it can also escape the string when inserting. Drawer and block folding both break if we don't do this.
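Presumably these are the Org helpers in question (from org-src; the example string is illustrative):

```elisp
(require 'org-src)
;; Escaping prepends a comma to lines that would otherwise be parsed as
;; org structure ("*" headings, "#+" keywords); unescaping reverses it.
(org-escape-code-in-string "* a heading that would break folding")
;;=> ",* a heading that would break folding"
(org-unescape-code-in-string ",* a heading that would break folding")
;;=> "* a heading that would break folding"
```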
This is happening because of a change you made a few commits ago: you are propertizing the response ahead of

```elisp
(defun gptel-curl--stream-insert-response (response info &optional noprop)
  ...)
```

which handles the propertizing itself (skipped when `noprop` is non-nil).
It's really not -- take a closer look at `gptel--convert-markdown->org`.
Have you tried it? It should work fine unless the source text would also break Markdown, and...
...except for the leading stars/bullet point issue you mention here. It is easy to fix, but I have deliberately chosen to ignore it for now, along with three other edge cases I'm aware of that no one has ever brought up. If this is a common occurrence I can fix it in the markdown converter. EDIT: sorry if I sound a little frustrated. Every now and then I get questions based on the assumption that gptel is doing the simplest/most obvious thing under the hood because the result looks simple. This is followed by the realization that things are more complicated than they assumed because the problems they are expecting to occur already did a while ago and a more complex solution was developed to handle it. (An example: persisting the response bounds as an Org property in the file involves solving a fixed point calculation problem. I've received at least four comments informing me that the bounds as calculated must be wrong because the act of writing the bounds to the buffer changes them.)
Makes sense. I'm seeing the need for the separation in other ways: if I escape the data on the front-end, I need to un-escape it before presenting it to the backend. I think I can handle at least my org mode case in gptel-org, but it means the front-end will always copy the buffer. I think this is viable. Going to hack it together. I've had another recurring issue with the model absolutely loving to output fake tool calls.
I think I'm done with the hack-and-slash phase.
What is definitely bad:
Blocks for tool calls are definitely better. Seeing the arguments and function name post-call helps UX.
How do they break the user's callbacks? Also, it looks like you could use a single optional argument.
Have you seen the past issues/bug reports in this repo about parsing? Absolutely anything that can happen will happen. It has to be written assuming a two-year-old will be mashing on the keys.
@psionic-k I've changed all the buffer parsers (including OpenAI). You'll have to rebase and fix a merge conflict with your OpenAI buffer parser, but it should be easier to work off the new version, since it's closer to your implementation.
Note
Current status of tool-use:
This feature is now merged, and available in the master branch.
I've added tool-use/function calling support to all major backends in gptel -- OpenAI-compatible, Claude, Ollama and Gemini.
Demos
screencast_20241222T075329.mp4
gptel-tool-use-filesystem-demo.mp4
Same demo as the previous one, but with permissions and tool result inclusion turned on:
gptel-tool-use-filesystem-confirm-demo.mp4
Call to action
Please help me test it! It's on the `feature-tool-use` branch. There are multiple ways in which you can help, ranked from least to most intensive:

1. Switch to the `feature-tool-use` branch and just use gptel as normal -- no messing around with tool use. Adding tool use required a significant amount of reworking in gptel's core, so it will help to catch any regressions first. (Remember to reinstall/re-byte-compile the package after switching branches!)
2. Switch to the branch, define a tool or two, and try using gptel (instructions below). Let me know if something breaks.
3. Same as 2, but suggest ways that the feature can be improved, especially in the UI department.
What is "tool use"?
"Tool use" or "function calling" is LLM usage where
You can use this to give the LLM awareness of the world, by providing access to APIs, your filesystem, web search, Emacs etc. You can get it to control your Emacs frame, for instance.
How do I enable it in gptel?
There are three steps:

1. Use a model that supports tool use. Most of the big OpenAI/Anthropic/Google models do, as do llama3.1 and the newer mistral models if you're using Ollama.
2. Turn on tool use: `(setq gptel-use-tools t)`
3. Write tool definitions. See the documentation of `gptel-make-tool`. Here is an example of a tool definition:

Tool definition example
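A minimal sketch of what such a definition can look like (the `get_weather` tool and its stubbed body are illustrative):

```elisp
;; Illustrative tool definition -- the body is a stub; a real tool would
;; query a weather API. Argument specs are plists of :name/:type/:description.
(gptel-make-tool
 :name "get_weather"
 :function (lambda (location)
             (format "The weather in %s is 18C and clear." location))
 :description "Get the current weather for a given location"
 :args (list '(:name "location"
               :type "string"
               :description "The city to get the weather for"))
 :category "web")
```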
And here are a few simple tools for Filesystem/Emacs/Web access. You can copy and evaluate them in your Emacs session:
Code: Some tool definitions, copy to your Emacs
An async tool to fetch youtube metadata using yt-dlp
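A sketch of how such an async tool can be shaped (assuming `yt-dlp` is on your PATH; with `:async t`, gptel passes the tool function a callback as its first argument):

```elisp
;; Sketch of an async tool: start yt-dlp in the background and report the
;; result by calling the callback gptel supplies, once the process exits.
(gptel-make-tool
 :name "get_youtube_metadata"
 :async t
 :function
 (lambda (callback url)
   (let ((buf (generate-new-buffer " *yt-dlp*")))
     (set-process-sentinel
      (start-process "yt-dlp" buf "yt-dlp" "--skip-download" "--dump-json" url)
      (lambda (process _status)
        (when (eq (process-status process) 'exit)
          (with-current-buffer buf
            (funcall callback (buffer-string))
            (kill-buffer buf)))))))
 :description "Fetch metadata (title, duration, etc) for a youtube video URL"
 :args (list '(:name "url"
               :type "string"
               :description "The youtube video URL"))
 :category "web")
```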
As seen in gptel's menu:
See the documentation for `gptel-make-tool` for details on the keyword arguments.
Tip
@jester7 points out that you can get the LLM to write these tool definitions for you, and eval the Org Babel blocks to use them right away.
Important
Please share tools you write below so I can use them to test for issues.
In this case, the LLM may choose to ask for a call to `get_weather` if your question is related to the weather, as in the above demo video. You can help it along by saying something like "What's the weather like in Paris right now?"
Notes
- See the documentation of `gptel-make-tool` for an example.
- Your tool function can return `nil` if you want to run it for side-effects only.