Show HN: LLMs can generate valid JSON 100% of the time
526 points by remilouf | 165 comments on Hacker News.
Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts; we started the project a few months ago because we wanted to better understand how the generation process works. Our original background is in probabilistic, relational and symbolic programming.

Recently we came up with a fast way to generate text that matches a regex ( https://ift.tt/7OPjnS8... ). The basic idea is simple: regular expressions have an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get the list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by the large language model, sample a new symbol, and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary. Generating the token masks then only requires a dictionary lookup at each state (a toy version of this loop is sketched at the end of this post). Our method blows other libraries like Microsoft's guidance out of the water.

From there it was only a small leap to generating text that follows a JSON schema ( https://ift.tt/6vSd4Ui ) or is parseable into a Pydantic model ( https://ift.tt/QGvi5gE ); the second sketch at the end shows the lowering step. The method works with union types, optional types, nested schemas, arrays, everything. The output is guaranteed to be parseable.

I think it's cool, and I've spent a lot of time over the weekend watching even tiny models output valid JSON. Hope you will too. I look forward to feedback, bug reports, feature requests and discussions!

Edit: link to our pre-print explaining the method and how it can be extended to generate text that follows a context-free grammar: https://ift.tt/epGhtcM
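
Here is the toy sketch of the generation loop mentioned above. It is a minimal, self-contained illustration, not the Outlines implementation: the DFA for the regex [0-9]+ is hand-built, an eight-token vocabulary stands in for a real tokenizer, and random numbers stand in for model logits. All names are made up for the example.

    import math
    import random

    # Toy vocabulary with multi-character tokens, like an LLM tokenizer's.
    VOCAB = ["0", "1", "42", "7a", "abc", "9", "00", "<eos>"]
    EOS = VOCAB.index("<eos>")

    # Hand-built DFA for [0-9]+ : state 0 is the start, state 1 is accepting.
    def dfa_step(state, char):
        return 1 if char.isdigit() else None  # None = dead end

    def token_destination(state, token):
        # Run the DFA over the token's characters; None means the token is illegal here.
        for ch in token:
            state = dfa_step(state, ch)
            if state is None:
                return None
        return state

    # Precompute token_id -> destination state for each DFA state. This is the
    # pass over the vocabulary; afterwards masking is a dictionary lookup.
    TRANSITIONS = {0: {}, 1: {}}
    for state in TRANSITIONS:
        for i, tok in enumerate(VOCAB):
            if tok != "<eos>":
                dest = token_destination(state, tok)
                if dest is not None:
                    TRANSITIONS[state][i] = dest

    def sample_matching(max_tokens=5):
        # Sample a token sequence whose concatenation matches [0-9]+.
        state, pieces = 0, []
        for _ in range(max_tokens):
            logits = [random.gauss(0.0, 1.0) for _ in VOCAB]  # stand-in for model logits
            allowed = dict(TRANSITIONS[state])
            if state == 1:                 # accepting state: stopping is now legal
                allowed[EOS] = None
            # Mask disallowed tokens, then sample from what remains.
            weights = [math.exp(l) if i in allowed else 0.0
                       for i, l in enumerate(logits)]
            tok = random.choices(range(len(VOCAB)), weights=weights)[0]
            if tok == EOS:
                break
            pieces.append(VOCAB[tok])
            state = allowed[tok]
        return "".join(pieces)

    print(sample_matching())  # e.g. '42009' -- always matches [0-9]+

The per-state loop over the vocabulary is just for clarity here; in the real method the table is built in a single pass over the vocabulary, and each decoding step then costs one dictionary lookup, as described above.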
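
And a toy version of the "small leap" from regexes to Pydantic models: lower the model's fields to a regular expression, and the loop above guarantees the output parses. This is not Outlines' schema compiler; it assumes Pydantic v2, handles only flat str/int fields, and ignores whitespace and string escaping.

    import json
    import re
    from pydantic import BaseModel

    class Character(BaseModel):
        name: str
        age: int

    # Regex fragment for each supported field type's JSON value.
    TYPE_PATTERNS = {str: r'"[^"]*"', int: r'-?[0-9]+'}

    def schema_to_regex(model_cls):
        # One '"field":value' fragment per field, in declaration order.
        parts = [rf'"{name}":{TYPE_PATTERNS[field.annotation]}'
                 for name, field in model_cls.model_fields.items()]
        return r'\{' + ','.join(parts) + r'\}'

    pattern = schema_to_regex(Character)
    print(pattern)  # \{"name":"[^"]*","age":-?[0-9]+\}

    text = '{"name":"Arthur","age":42}'
    assert re.fullmatch(pattern, text)
    print(Character.model_validate(json.loads(text)))  # name='Arthur' age=42

Unions map naturally to regex alternation and arrays to repetition, which is how the features listed above can fall out of the same machinery.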