Stilta | Agentic AI for high-stakes patent work

Box plot comparing prior art recall for Stilta, patent search tools, and general-purpose LLMs.

No public benchmark exists for measuring whether AI prior art search tools actually find the references that matter in patent litigation. We built one. Here is how Stilta performed.

Prior art search is the foundation of every IPR petition. It typically costs $15,000-50,000 and takes 4-8 weeks, yet the quality of the result is almost impossible to evaluate objectively, because there has been no standard to measure against.

Existing evaluations benchmark AI search tools against patent examiner citations during prosecution. [1][2] But prosecution and litigation are different tasks with different quality bars. An examiner looking for reasons to reject an application is solving a fundamentally different problem from a litigator building a petition that must survive adversarial challenge at the PTAB.

We wanted a ground truth that reflects the reality of patent invalidity work. PTAB IPR proceedings provide exactly that. The prior art cited in a successful IPR petition has been selected by experienced litigators, supported by expert declarations, and found sufficient by the Board to warrant institution. These references have cleared a meaningful threshold.[3]

We selected 40 PTAB IPR institution decisions spanning electrical/computer, mechanical, bio/pharma, and chemical technology domains, with decisions issued between February 2023 and January 2026. For each case, we extracted every prior art reference cited in the petition as ground truth. Then we gave Stilta only the challenged patent number - no information about the IPR, the petitioner, or the cited references - and measured what it found.
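
The measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the `CaseResult` type, the `normalize` helper, and the reference numbers are all hypothetical, and the write-up does not say how references were matched or whether the headline figure averages per case or pools all references, so both aggregations are shown.

```python
from dataclasses import dataclass


def normalize(ref: str) -> str:
    """Crude normalization (an assumption): keep letters and digits, uppercase.

    Good enough to equate 'US 7,123,456 B2' with 'us7123456b2'; matching
    non-patent literature would need something more careful.
    """
    return "".join(ch for ch in ref.upper() if ch.isalnum())


@dataclass
class CaseResult:
    cited: set[str]      # references cited in the petition (ground truth)
    retrieved: set[str]  # references the tool returned for the patent number

    def _cited(self) -> set[str]:
        return {normalize(r) for r in self.cited}

    def _retrieved(self) -> set[str]:
        return {normalize(r) for r in self.retrieved}

    def matched(self) -> set[str]:
        """Petition references that the tool also found."""
        return self._cited() & self._retrieved()

    def novel(self) -> set[str]:
        """References the tool found that the petition did not cite."""
        return self._retrieved() - self._cited()

    def recall(self) -> float:
        cited = self._cited()
        if not cited:
            raise ValueError("a case needs at least one cited reference")
        return len(self.matched()) / len(cited)


def macro_recall(cases: list[CaseResult]) -> float:
    """Average of per-case recall: every case counts equally."""
    return sum(c.recall() for c in cases) / len(cases)


def micro_recall(cases: list[CaseResult]) -> float:
    """Pooled recall: every cited reference counts equally."""
    found = sum(len(c.matched()) for c in cases)
    total = sum(len(c._cited()) for c in cases)
    return found / total


# Toy data with made-up reference numbers, for illustration only.
cases = [
    CaseResult(
        cited={"US 7,123,456 B2", "EP 1 234 567 A1", "WO 2010/012345"},
        retrieved={"us7123456b2", "EP1234567A1", "US 2005/0001111 A1"},
    ),
    CaseResult(cited={"US 8,000,001 B1"}, retrieved={"US 9,999,999 B2"}),
]

print(macro_recall(cases), micro_recall(cases))
```

The two aggregations can differ noticeably when petitions cite very different numbers of references, which is why it matters which one a reported recall figure uses.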

Annotated patent decision excerpt showing asserted grounds, challenged claim 16, and cited patent references.

We ran the same benchmark against five of the largest commercial patent search tools and three general-purpose LLMs.[4]

Stilta achieved 71% petition recall across 40 cases, with a median search time of 17 minutes per patent. That means Stilta independently surfaced more than two out of three references that professional search teams - often backed by specialized firms and expert declarations - identified and cited in successful petitions. Stilta also surfaced 161 novel references across the 40 cases - prior art not cited in the original petitions that may support alternative or additional invalidity theories.

In comparative evaluation, Stilta surfaced approximately twice as many petition-grade references as the patent search tool category and several times more than general-purpose LLMs, which suggests that chat-based models lack the structured retrieval and reasoning needed for litigation-grade prior art search. Even Stilta's below-average cases outperformed the typical result from commercial patent search tools.
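
One way to read the last comparison is in box-plot terms: the lower quartile of one system's per-case recall sits above the median of another's. The sketch below shows that check under this assumed reading. The `five_number_summary` helper and the recall values are invented for illustration and are not the benchmark's data.

```python
from statistics import quantiles


def five_number_summary(recalls: list[float]) -> dict[str, float]:
    """Summary statistics of the kind a box plot draws."""
    q1, q2, q3 = quantiles(recalls, n=4, method="inclusive")
    return {
        "min": min(recalls),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(recalls),
    }


# Made-up per-case recall values, for illustration only.
system_a = [0.4, 0.5, 0.7, 0.8, 1.0]
system_b = [0.0, 0.2, 0.3, 0.5, 0.6]

summary_a = five_number_summary(system_a)
summary_b = five_number_summary(system_b)

# True when A's weaker cases still beat B's typical case.
a_floor_beats_b_typical = summary_a["q1"] > summary_b["median"]
print(summary_a, summary_b, a_floor_beats_b_typical)
```

Comparing quartiles rather than averages keeps a handful of very strong or very weak cases from dominating the comparison.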

Invalidity search is inherently subjective. Two experienced attorneys given the same patent will decompose claims differently, prioritize different elements, and construct different obviousness narratives from the prior art landscape. The combinatorial space of possible reference combinations is vast. The 71% recall measures overlap with one team's chosen analytical path, not the universe of valid paths. Some of Stilta's "missed" references are effectively replaced by alternative references among the novel art it surfaced.

Where Stilta performed below 50%, the cases share two patterns. The first is cross-domain analogous art: references drawn from entirely different technical fields and applied by analogy. A petitioner's expert might recognize that an aerospace fluid dynamics technique is relevant to a microfluidics patent - a connection that depends on deep specialist knowledge rather than any form of search. The second is combination depth: when petitions rely on three or more references, the specific choice of secondary gap-filling references is highly subjective, and Stilta frequently surfaced alternative secondary references addressing the same claim gaps through different technical disclosures.

An AI search that surfaces the majority of petition-grade references in 21 minutes does not replace human judgment. It compresses the time between receiving a patent and having a strong set of candidate references to evaluate. Firms can use AI for initial triage and reserve expensive manual search for the gaps.[5]

This is not keyword or semantic search. Stilta's agents read the patent, decompose claims, search across 170 million patents, one trillion web archive pages, and 250 million scientific publications, then evaluate references against specific claim elements and build invalidity reasoning. The output is not a raw list requiring manual screening. It is a structured set of results that shifts attorney time from low-value triage to high-value strategic judgment.
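
To make "structured results" concrete, here is a minimal sketch of what a claim-element mapping could look like as a data structure. It is an illustration under our own assumptions, not Stilta's internal format: the `ClaimElement`, `Mapping`, and `ClaimChart` types, the element labels, and the reference names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimElement:
    label: str  # e.g. "1[a]"
    text: str   # the claim language for this element


@dataclass
class Mapping:
    element: str    # label of the claim element being addressed
    reference: str  # identifier of the prior art reference
    passage: str    # where in the reference the element is disclosed


@dataclass
class ClaimChart:
    elements: list[ClaimElement]
    mappings: list[Mapping] = field(default_factory=list)

    def coverage(self) -> dict[str, list[str]]:
        """For each claim element, the references mapped to it."""
        by_element: dict[str, list[str]] = {e.label: [] for e in self.elements}
        for m in self.mappings:
            if m.element in by_element:
                by_element[m.element].append(m.reference)
        return by_element

    def gaps(self) -> list[str]:
        """Claim elements that no reference addresses yet."""
        return [label for label, refs in self.coverage().items() if not refs]


# Hypothetical claim and references, for illustration only.
chart = ClaimChart(
    elements=[
        ClaimElement("1[a]", "a sensor configured to measure flow rate"),
        ClaimElement("1[b]", "a controller coupled to the sensor"),
        ClaimElement("1[c]", "wherein the controller adjusts a valve"),
    ],
    mappings=[
        Mapping("1[a]", "Reference A", "col. 3, ll. 10-25"),
        Mapping("1[b]", "Reference A", "Fig. 2"),
        Mapping("1[b]", "Reference B", "para. 0041"),
    ],
)

print(chart.gaps())
```

A per-element view like this is what lets a reviewer go straight to the uncovered elements instead of screening a flat list of documents.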

Stilta supports both a fully autonomous workflow (used for this benchmark) and an interactive, attorney-led mode where practitioners steer the analysis in real time - focusing on specific claim elements, trying different hypotheses, and applying the domain expertise that makes the difference on hard cases.

A less obvious advantage is consistency. Every attorney and every search firm approaches an invalidity search differently - different queries, different databases, different judgment calls. Stilta applies the same methodology to every case, producing reproducible results regardless of who initiates the search.