Add a DSL to generate regular expressions #1403

Merged
merged 3 commits into from
Feb 19, 2025

Conversation

rlouf
Member

@rlouf rlouf commented Feb 5, 2025

This PR introduces a DSL to simplify the use of regular expressions in Outlines.

Why a DSL

Not everyone is, or wants to be, familiar with regular expressions. However, regex-based structured generation can be very powerful, sometimes even more so than JSON Schema-based structured generation. This PR thus introduces new abstractions that allow users to create regular expressions from simple and intuitive primitives.

Requirements

Besides providing an intuitive DSL we want:

  • The ability to use the regular expressions thus defined as types in Pydantic models;
  • Pre-defined types for commonly used constructs:
    • sentence
    • digit
    • paragraph
    • And other types.
  • A few custom types that can be useful: email, phone_number, etc.
  • An easy way to debug and iterate on the regular expression.

Features

The DSL works by building a graph from several kinds of nodes, or types. The generated graph can be inspected, used to match strings of text, used as a type in a Pydantic BaseModel, or converted to a regular expression.

Basic types

There is only one object that the user should ever have to manipulate, Regex. While we also define a String type, we try to automatically convert Python strings to String objects whenever possible. For instance:

from outlines.types import regex

temp = "The answer is " + regex("[0-9]")
repr(temp)
# Sequence(terms=[String(value="The answer is "), Regex(pattern="[0-9]")])

temp = "Word" | regex("[0-9]")
repr(temp)
# Alternatives(terms=[String(value="Word"), Regex(pattern="[0-9]")])

Other types

Optional

The Optional node corresponds to the (...)? pattern in regular expressions. It can be instantiated directly with a factory function:

from outlines.types import regex, optional

temp = optional(regex("[0-9]"))
repr(temp)
# Optional(term=Regex(pattern="[0-9]"))

We also provide the following method:

from outlines.types import regex

temp = regex("[0-9]").optional()
repr(temp)
# Optional(term=Regex(pattern="[0-9]"))

KleeneStar

The Kleene star is an operator ((...)*) that means "zero or more". We provide a factory function and method to generate KleeneStar objects:

from outlines.types import regex, zero_or_more

temp = zero_or_more(regex("[0-9]"))
repr(temp)
# KleeneStar(term=Regex(pattern="[0-9]"))

temp = regex("[0-9]").zero_or_more()
repr(temp)
# KleeneStar(term=Regex(pattern="[0-9]"))

KleenePlus

The Kleene plus is an operator ((...)+) that means "one or more". We provide a factory function and method to generate KleenePlus objects:

from outlines.types import regex, one_or_more

temp = one_or_more(regex("[0-9]"))
repr(temp)
# KleenePlus(term=Regex(pattern="[0-9]"))

temp = regex("[0-9]").one_or_more()
repr(temp)
# KleenePlus(term=Regex(pattern="[0-9]"))
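
For reference, the three operators above can be checked against Python's standard `re` module. This is a minimal sketch with plain regular expressions rather than the DSL itself; the helper `accepts` is a hypothetical name introduced only for illustration.

```python
import re

digit = "[0-9]"

# Plain-regex counterparts of the Optional, KleeneStar and KleenePlus nodes.
optional_digit = re.compile(f"({digit})?")       # zero or one
zero_or_more_digits = re.compile(f"({digit})*")  # zero or more
one_or_more_digits = re.compile(f"({digit})+")   # one or more


def accepts(pattern, text):
    """Return True when the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


print(accepts(optional_digit, ""))         # True
print(accepts(optional_digit, "42"))       # False
print(accepts(zero_or_more_digits, ""))    # True
print(accepts(one_or_more_digits, ""))     # False
print(accepts(one_or_more_digits, "123"))  # True
```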

Sequence

Sequence objects represent sequences of sub-patterns, and are loosely equivalent to an "and" operator. We overload the __add__ and __radd__ methods of the Term base object to generate sequences. Strings are automatically converted to a String object:

from outlines.types import regex

temp = "digit: " + regex("[0-9]")
repr(temp)
# Sequence(terms=[String(value="digit: "), Regex(pattern="[0-9]")])

Alternatives

Alternatives objects are equivalent to "or" operators. We overload the __or__ and __ror__ methods of the Term base object to generate alternatives. Strings are automatically converted to a String object:

from outlines.types import regex

temp = "word" | regex("[0-9]")
repr(temp)
# Alternatives(terms=[String(value="word"), Regex(pattern="[0-9]")])
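
The operator overloading described for Sequence and Alternatives can be sketched in a few lines. This is not the code from the PR: the class names mirror the ones above, but the bodies and the `_to_term` helper are assumptions made for illustration, and nested sequences are not flattened.

```python
from dataclasses import dataclass
from typing import List


class Term:
    """Hypothetical base class carrying the overloaded operators."""

    def __add__(self, other):
        return Sequence([self, _to_term(other)])

    def __radd__(self, other):
        # Reached for `"text" + term`, where the left operand is a plain str.
        return Sequence([_to_term(other), self])

    def __or__(self, other):
        return Alternatives([self, _to_term(other)])

    def __ror__(self, other):
        # Reached for `"text" | term`; str defines no `|` of its own.
        return Alternatives([_to_term(other), self])


@dataclass
class String(Term):
    value: str


@dataclass
class Regex(Term):
    pattern: str


@dataclass
class Sequence(Term):
    terms: List[Term]


@dataclass
class Alternatives(Term):
    terms: List[Term]


def _to_term(value):
    """Wrap plain Python strings in String nodes; pass Term objects through."""
    if isinstance(value, Term):
        return value
    if isinstance(value, str):
        return String(value)
    raise TypeError(f"Cannot convert {type(value).__name__} to a Term")


sequence = "digit: " + Regex("[0-9]")
alternatives = "word" | Regex("[0-9]")
print(sequence)
print(alternatives)
```

Defining the reflected methods (`__radd__`, `__ror__`) is what lets a bare string appear on the left-hand side of the operator.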

QuantifyExact

QuantifyExact objects are used to specify the number of repetitions of a sub-pattern. We provide a repeat function and times method to create these objects:

from outlines.types import regex, repeat

digits = repeat(regex("[0-9]"), 2, 2)
repr(digits)
# QuantifyExact(term=Regex(pattern="[0-9]"), 2)

digits = regex("[0-9]").times(2)
repr(digits)
# QuantifyExact(term=Regex(pattern="[0-9]"), 2)

QuantifyMinimum

QuantifyMinimum objects are used to define repetitions of a sub-pattern, with a minimum number of repetitions. We provide a repeat function and repeat method to create these objects:

from outlines.types import regex, repeat

digits = repeat(regex("[0-9]"), 2, None)
repr(digits)
# QuantifyMinimum(term=Regex(pattern="[0-9]"), 2)

digits = regex("[0-9]").repeat(2, None)
repr(digits)
# QuantifyMinimum(term=Regex(pattern="[0-9]"), 2)

QuantifyMaximum

QuantifyMaximum objects are used to define repetitions of a sub-pattern, with a maximum number of repetitions. We provide a repeat function and repeat method to create these objects:

from outlines.types import regex, repeat

digits = repeat(regex("[0-9]"), None, 2)
repr(digits)
# QuantifyMaximum(term=Regex(pattern="[0-9]"), 2)

digits = regex("[0-9]").repeat(None, 2)
repr(digits)
# QuantifyMaximum(term=Regex(pattern="[0-9]"), 2)

QuantifyBetween

QuantifyBetween objects are used to define a bounded number of repetitions of a sub-pattern. We provide a repeat function and repeat method to create these objects:

from outlines.types import regex, repeat

digits = repeat(regex("[0-9]"), 1, 2)
repr(digits)
# QuantifyBetween(term=Regex(pattern="[0-9]"), 1, 2)

digits = regex("[0-9]").repeat(1, 2)
repr(digits)
# QuantifyBetween(term=Regex(pattern="[0-9]"), 1, 2)
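
The four quantifier sections suggest that `repeat` picks a node type from its two bounds. The sketch below shows one way such a dispatch could work; it is an assumption, not the PR's implementation. For brevity the nodes hold a raw pattern string instead of a Term, and `to_regex` here is a simplified stand-in that always wraps the pattern in a non-capturing group.

```python
import re
from dataclasses import dataclass


@dataclass
class QuantifyExact:
    pattern: str
    count: int


@dataclass
class QuantifyMinimum:
    pattern: str
    min_count: int


@dataclass
class QuantifyMaximum:
    pattern: str
    max_count: int


@dataclass
class QuantifyBetween:
    pattern: str
    min_count: int
    max_count: int


def repeat(pattern, min_count, max_count):
    """Choose a quantifier node from the (min_count, max_count) pair."""
    if min_count is None and max_count is None:
        raise ValueError("At least one bound is required")
    if min_count is None:
        return QuantifyMaximum(pattern, max_count)
    if max_count is None:
        return QuantifyMinimum(pattern, min_count)
    if min_count > max_count:
        raise ValueError("min_count must not exceed max_count")
    if min_count == max_count:
        return QuantifyExact(pattern, min_count)
    return QuantifyBetween(pattern, min_count, max_count)


def to_regex(node):
    """Render a quantifier node as a regular expression string."""
    group = f"(?:{node.pattern})"
    if isinstance(node, QuantifyExact):
        return f"{group}{{{node.count}}}"
    if isinstance(node, QuantifyMinimum):
        return f"{group}{{{node.min_count},}}"
    if isinstance(node, QuantifyMaximum):
        return f"{group}{{0,{node.max_count}}}"
    return f"{group}{{{node.min_count},{node.max_count}}}"


print(to_regex(repeat("[0-9]", 2, 2)))     # (?:[0-9]){2}
print(to_regex(repeat("[0-9]", 2, None)))  # (?:[0-9]){2,}
print(to_regex(repeat("[0-9]", None, 2)))  # (?:[0-9]){0,2}
print(to_regex(repeat("[0-9]", 1, 2)))     # (?:[0-9]){1,2}
```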

Pre-defined types

We provide pre-defined types based on regular expressions. They can be composed to define more complex expressions:

  • sentence
  • paragraph
  • quoted_text
  • char
  • integer
  • newline
  • (...)

from outlines.types import newline, paragraph

document = "A document with 3 paragraphs:" + newline + paragraph.times(3) + newline + "The end."

Generating JSON

It is possible to use a Pydantic model or JSON Schema to append JSON to the template:

from pydantic import BaseModel
from outlines.types import json_schema, paragraph, newline


class User(BaseModel):
    name: str
    age: int


template = "Think hard and generate JSON." + newline + paragraph.times(4) + newline + json_schema(User)

Pydantic integration

The expressions can be used as types in Pydantic objects, and are translated to a string type with the pattern keyword when converted to JSON Schema:

from pydantic import BaseModel
from outlines.types import Regex, sentence

digit = Regex("[0-9]")
answer_template = "So the answer is " + digit

class GSM8K(BaseModel):
    reasoning: sentence.times(4)
    answer: answer_template
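
To make the translation concrete, here is a hand-written sketch of the kind of JSON Schema such a model could produce. The exact schema emitted by Outlines and Pydantic is not shown in this PR description, so the layout below and the pattern string for `answer` are assumptions. The `reasoning` field is left out because the pattern behind `sentence` is not given here, and `answer_is_valid` is a hypothetical helper.

```python
import json
import re

# Assumed regular expression for "So the answer is " + digit.
answer_pattern = "So the answer is [0-9]"

# Hand-written schema: a string field constrained by the `pattern` keyword.
schema = {
    "title": "GSM8K",
    "type": "object",
    "properties": {
        "answer": {"type": "string", "pattern": answer_pattern},
    },
    "required": ["answer"],
}


def answer_is_valid(document):
    """Check the `answer` field of a JSON document against the schema pattern.

    Uses a full match, which is stricter than JSON Schema's unanchored
    `pattern` semantics but closer to what structured generation enforces.
    """
    data = json.loads(document)
    pattern = schema["properties"]["answer"]["pattern"]
    return re.fullmatch(pattern, data["answer"]) is not None


print(json.dumps(schema, indent=2))
print(answer_is_valid('{"answer": "So the answer is 7"}'))
print(answer_is_valid('{"answer": "So the answer is seven"}'))
```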

Debugging

The DSL offers two ways to test and debug the generated expressions. The first is to test whether the expression matches a given string:

from outlines.types import regex

digit = regex("[0-9]")
answer_template = "So the answer is " + digit

does_match = answer_template.matches("So the answer is 1")
print(does_match)
# True

does_match = answer_template.matches("So the answer is one")
print(does_match)
# False

The second one is pretty-printing of the graph generated:

from outlines.types.dsl import String, Regex, KleeneStar

a = String("a")
b = String("b")
c = Regex("[0-9]")
d = KleeneStar(a | b) + c

print(d)
#└── Sequence
#    ├── KleeneStar(*)
#    │   └── Alternatives(|)
#    │       ├── String('a')
#    │       └── String('b')
#    └── Regex('[0-9]')
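
A tree like the one above can be produced with a short recursive function. The sketch below is independent of the PR's code: `Node` and `render` are hypothetical names, and the node labels are copied from the printed output rather than derived from real DSL objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a label plus zero or more children."""

    label: str
    children: List["Node"] = field(default_factory=list)


def render(node, prefix="", is_last=True):
    """Return the lines of a box-drawing tree rooted at `node`."""
    connector = "└── " if is_last else "├── "
    lines = [prefix + connector + node.label]
    # Children of a last sibling are indented with spaces; otherwise a
    # vertical bar keeps the parent's branch visually connected.
    child_prefix = prefix + ("    " if is_last else "│   ")
    for index, child in enumerate(node.children):
        child_is_last = index == len(node.children) - 1
        lines.extend(render(child, child_prefix, child_is_last))
    return lines


tree = Node("Sequence", [
    Node("KleeneStar(*)", [
        Node("Alternatives(|)", [Node("String('a')"), Node("String('b')")]),
    ]),
    Node("Regex('[0-9]')"),
])

print("\n".join(render(tree)))
```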

Todo

  • Add more basic types: sentence, paragraph, quoted text
  • Move the python types defined in fsm to types
  • Add json factory to convert a Pydantic model / json schema string to a regular expression
  • Add more tests
    • Regexes
    • Factory functions
    • Custom types
  • Add documentation
  • Clean the imports

Closes #1302.

@rlouf rlouf added enhancement regex impact/user interface Related to improving the user interface labels Feb 5, 2025
@rlouf rlouf self-assigned this Feb 5, 2025
@cpfiffer
Contributor

cpfiffer commented Feb 5, 2025

I don't really have feature/style notes to add here. It looks solid to me.

@rlouf rlouf force-pushed the regex-dsl branch 2 times, most recently from fba1106 to 4cb7da8 Compare February 6, 2025 22:33
Term.zero_or_more = zero_or_more


def to_regex(term: Term) -> str:

Just want to draw attention to a few terms that will need extra sets of parentheses:

>>> to_regex(QuantifyExact(String("test"), 3))
'test{3}'
>>> to_regex(Sequence([Regex("dog|cat"), String("fish")]))
'dog|catfish'

I realise this PR is still in draft though :)
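
The two cases flagged in this review comment can be reproduced with Python's standard `re` module. The sketch below shows why the unparenthesized output is wrong and how wrapping sub-patterns in non-capturing groups fixes it; the `group` helper is a hypothetical name, not a function from the PR.

```python
import re


def group(pattern):
    """Wrap a sub-pattern in a non-capturing group."""
    return f"(?:{pattern})"


# Without a group, the quantifier binds only to the final "t".
print(re.fullmatch("test{3}", "testtt") is not None)        # True
print(re.fullmatch("test{3}", "testtesttest") is not None)  # False

# With a group, the whole word is repeated.
exact = group("test") + "{3}"
print(re.fullmatch(exact, "testtesttest") is not None)      # True

# Without a group, "|" splits the whole pattern into "dog" or "catfish".
print(re.fullmatch("dog|catfish", "dogfish") is not None)   # False

# With a group, the alternation is confined to "dog" or "cat".
sequence = group("dog|cat") + "fish"
print(re.fullmatch(sequence, "dogfish") is not None)        # True
print(re.fullmatch(sequence, "catfish") is not None)        # True
```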

@rlouf rlouf force-pushed the regex-dsl branch 2 times, most recently from a7f8e77 to b41b6e3 Compare February 7, 2025 08:51
@rlouf rlouf force-pushed the regex-dsl branch 6 times, most recently from aa83128 to 6062b9f Compare February 13, 2025 17:53
@rlouf rlouf marked this pull request as ready for review February 13, 2025 17:54
Contributor

@yvan-sraka yvan-sraka left a comment

LGTM, added a few minor comments! The documentation page is great :)

@rlouf rlouf force-pushed the regex-dsl branch 2 times, most recently from cc20b03 to ec49e61 Compare February 17, 2025 21:53
@rlouf rlouf force-pushed the regex-dsl branch 4 times, most recently from 63cf9ee to 4f583b8 Compare February 17, 2025 22:52
@rlouf
Member Author

rlouf commented Feb 18, 2025

Interesting case of "tests pass on my machine"

rlouf and others added 3 commits February 19, 2025 00:52
vLLM can only currently be run on GPU (unless you want to go to extreme
lengths to make it work on CPU), and we thus cannot run the tests in CI.
We thus separate the test dependencies in "with GPU" and "without
GPU" (a subset of "with GPU") that the user has to pick manually.
@rlouf rlouf merged commit 116ebff into dottxt-ai:main Feb 19, 2025
5 of 6 checks passed
@rlouf rlouf deleted the regex-dsl branch February 19, 2025 20:57
Labels
enhancement impact/user interface Related to improving the user interface regex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Regex DSL
5 participants