Datalab’s new 9B model extracts clean JSON from PDFs in one pass

Datalab has released lift, a 9-billion-parameter open-weights vision model that converts PDFs and images into clean JSON in a single pass. Feed it a standard JSON Schema and the model returns a matching JSON object, with no extra parsing steps. It is Datalab's first model built specifically for structured extraction, and it extends the company's existing open-source OCR stack, which already includes chandra, marker, and surya.
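To make "schema in, matching JSON out" concrete, here is a minimal Python sketch. The invoice schema, the sample output, and the matches helper are all hypothetical and written for illustration; none of them come from lift's package, and the checker covers only the basic types the article lists.

```python
# Hypothetical invoice schema, for illustration only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "paid": {"type": "boolean"},
        "vendor": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["invoice_number", "total", "paid", "vendor", "line_items"],
}


def matches(schema, value):
    """Toy structural check: does value have the shape the schema describes?"""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        if any(key not in value for key in schema.get("required", [])):
            return False
        props = schema.get("properties", {})
        return all(matches(props[k], v) for k, v in value.items() if k in props)
    if kind == "array":
        item_schema = schema.get("items", {})
        return isinstance(value, list) and all(matches(item_schema, v) for v in value)
    if value is None:
        # The article notes that scalar fields may come back as null.
        return kind in ("string", "number", "boolean")
    if kind == "string":
        return isinstance(value, str)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "boolean":
        return isinstance(value, bool)
    return True  # no type given: nothing to check in this sketch


# Made-up example of the kind of object such a schema would describe.
sample_output = {
    "invoice_number": "INV-001",
    "total": 120.5,
    "paid": None,  # an abstained scalar
    "vendor": {"name": "Example Co"},
    "line_items": [{"description": "Widget", "amount": 120.5}],
}

print(matches(INVOICE_SCHEMA, sample_output))
```

A real pipeline would use a full JSON Schema validator; the point here is only that the schema fixes the keys, nesting, and types of the result before any document is read.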

One-pass extraction for multi-page files

lift reads entire documents at once, even when values span multiple pages. There's no need to split, OCR, and reassemble: every page is processed in a single inference pass. The model posts a median of 9.5 seconds per document on Datalab's benchmark and reaches 90.2% field accuracy across 225 test documents, placing it at the top of the small self-hostable extraction models the team evaluated.

Schema-constrained decoding keeps JSON valid

The key mechanism is schema-constrained decoding: lift compiles the provided JSON Schema into a grammar that the vLLM server applies token by token during generation. Only schema-compliant tokens remain in the sampling pool, so the output is guaranteed to match the required structure without post-processing. Supported types include strings, numbers, booleans, nested objects, and arrays. Scalar fields can also abstain by emitting null, which gives the model a graceful way to skip uncertain values without breaking the JSON shape. If the schema contains constructs that can't be compiled (enums, oneOf/anyOf, $refs, or additionalProperties), the system logs a warning and continues without constraints, which keeps the pipeline running but gives up the structural guarantee for that run.
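The fallback described above can be pictured with a short Python sketch. This is not lift's code: the function names, the logger name, and the "constrained"/"unconstrained" labels are assumptions made for illustration. It only shows the idea of scanning a schema for the constructs the article lists as uncompilable and warning instead of failing.

```python
import logging

# Keywords the article lists as not compilable into a grammar.
UNSUPPORTED = {"enum", "oneOf", "anyOf", "$ref", "additionalProperties"}


def find_unsupported(schema, path="$"):
    """Return the paths of unsupported keywords found in a schema (hypothetical helper)."""
    found = []
    if not isinstance(schema, dict):
        return found
    for key, sub in schema.items():
        if key in UNSUPPORTED:
            found.append(f"{path}.{key}")
        elif key == "properties" and isinstance(sub, dict):
            # Property names are data, not keywords, so recurse into the values only.
            for name, prop in sub.items():
                found += find_unsupported(prop, f"{path}.properties.{name}")
        elif key == "items":
            found += find_unsupported(sub, f"{path}.items")
    return found


def plan_decoding(schema):
    """Pick a decoding mode: warn and fall back rather than raise."""
    problems = find_unsupported(schema)
    if problems:
        logging.getLogger("lift_sketch").warning(
            "schema not compilable, decoding without constraints: %s",
            ", ".join(problems),
        )
        return "unconstrained"
    return "constrained"


schema_ok = {
    "type": "object",
    "properties": {"total": {"type": "number"}},
}
schema_with_enum = {
    "type": "object",
    "properties": {"status": {"type": "string", "enum": ["paid", "due"]}},
}

print(plan_decoding(schema_ok))
print(plan_decoding(schema_with_enum))
```

The design choice worth noticing is that the check degrades instead of rejecting the request: the document still gets processed, but the caller should then validate the output, since the structural guarantee no longer holds for that run.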

Two ways to run, Apache 2.0 code

The package offers both local inference via HuggingFace and remote inference through a vLLM server, with the latter recommended for production. The code is released under Apache 2.0, while the model weights use a modified OpenRAIL-M license, aligning with Datalab’s open ecosystem approach. lift slots into a growing niche of open extraction models, sitting alongside purpose-built tools like NuExtract and general vision-language models such as Qwen3.5-9B.

Source: MarkTechPost. AI-assisted editorial synthesis — TechnoExpress.
