Add a DSL to generate regular expressions #1403
I don't really have feature/style notes to add here. It looks solid to me.
Term.zero_or_more = zero_or_more


def to_regex(term: Term) -> str:
Just want to draw attention to a few terms that will need extra sets of parentheses:
>>> to_regex(QuantifyExact(String("test"), 3))
'test{3}'
>>> to_regex(Sequence([Regex("dog|cat"), String("fish")]))
'dog|catfish'
I realise this PR is still in draft though :)
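To make the precedence issue concrete, here is a quick check using only Python's standard re module. It is not code from the PR; the helper name full_match is made up for illustration.

```python
import re


def full_match(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# A quantifier binds only to the preceding atom: 'test{3}' is 'tes' followed by 'ttt'.
print(full_match(r"test{3}", "testtt"))          # matches
print(full_match(r"test{3}", "testtesttest"))    # does not match

# Grouping the operand restores the intended meaning.
print(full_match(r"(test){3}", "testtesttest"))  # matches

# Alternation has the lowest precedence: 'dog|catfish' is 'dog' OR 'catfish'.
print(full_match(r"dog|catfish", "dogfish"))     # does not match
print(full_match(r"(dog|cat)fish", "dogfish"))   # matches
```

Wrapping each sub-term in a group before quantifying or concatenating it is one simple way to avoid both problems.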
LGTM, added a few minor comments! The documentation page is great :)
Interesting case of "tests pass on my machine"
vLLM can currently only run on GPU (unless you want to go to extreme lengths to make it work on CPU), so we cannot run its tests in CI. We therefore split the test dependencies into "with GPU" and "without GPU" (a subset of "with GPU"), and the user has to pick one manually.
This PR introduces a DSL to simplify the use of regular expressions in Outlines.
Why a DSL
Not everyone is, or wants to be, familiar with regular expressions. However, regex-based structured generation can be very powerful, sometimes even more so than JSON Schema-based structured generation. This PR thus introduces new abstractions that allow users to create regular expressions from very simple and intuitive primitives.
Requirements
Besides providing an intuitive DSL we want:
sentence, digit, paragraph, email, phone_number, etc.
Features
The DSL works by building a graph from several kinds of nodes, or types. The generated graph can be inspected, used to match strings of text, used as a type in a Pydantic BaseModel, or converted to a regular expression.
Basic types
There is only one object that the user should ever have to manipulate, Regex. While we also define a String type, we try to automatically convert Python strings to String objects whenever possible. For instance:
Other types
Optional
The Optional node corresponds to the (...)? pattern in regular expressions. It can be instantiated directly with a factory function:
We also provide the following method:
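The original snippets did not survive on this page, so the following is only a minimal sketch of how a factory function and a method could both produce an Optional node. The class layout, the lowercase optional names, and this small to_regex are assumptions made for illustration, not the PR's actual code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Hypothetical base node; the PR's Term may be structured differently."""

    def optional(self) -> "Optional":
        # Method form (assumed name): term.optional()
        return Optional(self)


@dataclass(frozen=True)
class String(Term):
    value: str


@dataclass(frozen=True)
class Optional(Term):
    term: Term


def optional(term) -> Optional:
    # Factory form (assumed name): plain Python strings are promoted to String.
    if isinstance(term, str):
        term = String(term)
    return Optional(term)


def to_regex(term: Term) -> str:
    if isinstance(term, String):
        return re.escape(term.value)
    if isinstance(term, Optional):
        # Always parenthesise the operand so '?' applies to the whole sub-pattern.
        return f"({to_regex(term.term)})?"
    raise TypeError(f"unsupported term: {term!r}")


print(to_regex(optional("abc")))
print(String("abc").optional() == optional("abc"))
```

Both forms build the same node, so the choice between them is purely stylistic.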
KleeneStar
The Kleene star is an operator ((...)*) that means "zero or more". We provide a factory function and method to generate KleeneStar objects:
KleenePlus
The Kleene plus is an operator ((...)+) that means "one or more". We provide a factory function and method to generate KleenePlus objects:
Sequence
Sequence objects represent sequences of sub-patterns, and are loosely equivalent to an "and" operator. We overload the __add__ and __radd__ methods of the Term base object to generate sequences. Strings are automatically converted to a String object:
Alternatives
Alternatives objects are equivalent to "or" operators. We overload the __or__ and __ror__ methods of the Term base object to generate alternatives. Strings are automatically converted to a String object:
QuantifyExact
QuantifyExact objects are used to specify the exact number of repetitions of a sub-pattern. We provide a repeat function and times method to create these objects:
QuantifyMinimum
QuantifyMinimum objects are used to define repetitions of a sub-pattern, with a minimum number of repetitions. We provide a repeat function and repeat method to create these objects:
QuantifyMaximum
QuantifyMaximum objects are used to define repetitions of a sub-pattern, with a maximum number of repetitions. We provide a repeat function and repeat method to create these objects:
QuantifyBetween
QuantifyBetween objects are used to define a bounded number of repetitions of a sub-pattern. We provide a repeat function and repeat method to create these objects:
Pre-defined types
We provide pre-defined types based on regular expressions that can be composed to define more complex expressions:
sentence
paragraph
quoted_text
char
integer
newline
Generating JSON
It is possible to use a Pydantic model or JSON Schema to append JSON to the template:
Pydantic integration
The expressions can be used as types in Pydantic objects, and are translated to a string type with the pattern keyword when converted to JSON Schema:
Debugging
The DSL offers two ways to test and debug the generated expressions. The first is to test whether the expression matches a given string:
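As a rough illustration of this kind of check, the sketch below builds a tiny expression graph with overloaded + and | operators and tests strings against it. Everything here (the class bodies, the _as_term helper, the matches function, and this to_regex) is a hypothetical stand-in written for this example; the PR's real API may look different.

```python
import re
from dataclasses import dataclass


class Term:
    """Hypothetical base node with operator overloads for sequences and alternatives."""

    def __add__(self, other):
        return Sequence((self, _as_term(other)))

    def __radd__(self, other):
        return Sequence((_as_term(other), self))

    def __or__(self, other):
        return Alternatives((self, _as_term(other)))

    def __ror__(self, other):
        return Alternatives((_as_term(other), self))


@dataclass(frozen=True)
class String(Term):
    value: str


@dataclass(frozen=True)
class Regex(Term):
    pattern: str


@dataclass(frozen=True)
class Sequence(Term):
    terms: tuple


@dataclass(frozen=True)
class Alternatives(Term):
    terms: tuple


def _as_term(value) -> Term:
    # Promote plain Python strings to String nodes.
    if isinstance(value, str):
        return String(value)
    if isinstance(value, Term):
        return value
    raise TypeError(f"cannot convert {value!r} to a Term")


def to_regex(term: Term) -> str:
    if isinstance(term, String):
        return re.escape(term.value)
    if isinstance(term, Regex):
        return term.pattern
    # Each child is wrapped in a non-capturing group to keep precedence safe.
    if isinstance(term, Sequence):
        return "".join(f"(?:{to_regex(t)})" for t in term.terms)
    if isinstance(term, Alternatives):
        return "|".join(f"(?:{to_regex(t)})" for t in term.terms)
    raise TypeError(f"unsupported term: {term!r}")


def matches(term: Term, text: str) -> bool:
    """Return True when the whole string matches the expression."""
    return re.fullmatch(to_regex(term), text) is not None


pet = (String("dog") | "cat") + "fish"
print(to_regex(pet))
print(matches(pet, "dogfish"), matches(pet, "dog"))
```

Grouping every child is more verbose than strictly necessary, but it sidesteps the precedence pitfalls raised in the review above.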
The second is to pretty-print the generated graph:
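The PR's actual output format is not shown on this page, so the following is only one possible way to render such a graph as an indented tree. The node classes and the pretty function are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class String:
    value: str


@dataclass(frozen=True)
class Sequence:
    terms: tuple


@dataclass(frozen=True)
class KleeneStar:
    term: object


def children(node) -> list:
    """Return the direct sub-nodes of a node (hypothetical helper)."""
    if isinstance(node, Sequence):
        return list(node.terms)
    if isinstance(node, KleeneStar):
        return [node.term]
    return []


def label(node) -> str:
    if isinstance(node, String):
        return f"String({node.value!r})"
    return type(node).__name__


def _render(node, prefix: str, is_last: bool, is_root: bool) -> list:
    if is_root:
        lines = [label(node)]
        child_prefix = ""
    else:
        connector = "`-- " if is_last else "|-- "
        lines = [prefix + connector + label(node)]
        # Keep a vertical bar only while siblings remain below this node.
        child_prefix = prefix + ("    " if is_last else "|   ")
    kids = children(node)
    for i, kid in enumerate(kids):
        lines.extend(_render(kid, child_prefix, i == len(kids) - 1, False))
    return lines


def pretty(node) -> str:
    return "\n".join(_render(node, "", True, True))


tree = Sequence((String("a"), KleeneStar(String("b"))))
print(pretty(tree))
```

A tree view like this makes it easy to see which quantifier applies to which sub-pattern before converting anything to a regular expression.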
Todo
fsm to types
json factory to convert a Pydantic model / JSON Schema string to a regular expression
Closes #1302.