STILTA RESEARCH

February 2026 • 8 min read

Can AI Find What Was Cited at the PTAB?

No public benchmark exists for measuring whether AI prior art search tools actually find the references that matter in patent litigation. We built one. Here is how Stilta performed.

Oskar Block, Tobias Estreen

Stilta

71% recall • 21m time • +31pp vs. search tools • +49pp vs. LLMs

The problem

Prior art search is the foundation of every IPR petition. It typically costs $15,000 to $50,000 and takes 4 to 8 weeks, and the quality of the result is almost impossible to evaluate objectively, because there has been no standard to measure against.

Existing evaluations benchmark AI search tools against patent examiner citations during prosecution. [1][2] But prosecution and litigation are different tasks with different quality bars. An examiner looking for reasons to reject an application is solving a fundamentally different problem than a litigator building a petition that will survive adversarial challenge at the PTAB.

A litigation-grade benchmark

We wanted a ground truth that reflects the reality of patent invalidity work. PTAB IPR proceedings provide exactly that. The prior art cited in a successful IPR petition has been selected by experienced litigators, supported by expert declarations, and found sufficient by the Board to warrant institution. These references have cleared a meaningful threshold.[3]

We selected 40 PTAB IPR institution decisions spanning electrical/computer, mechanical, bio/pharma, and chemical technology domains, with decisions issued between February 2023 and January 2026. For each case, we extracted every prior art reference cited in the petition as ground truth. Then we gave Stilta only the challenged patent number - no information about the IPR, the petitioner, or the cited references - and measured what it found.

We ran the same benchmark against five of the largest commercial patent search tools and three general-purpose LLMs.[4]
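The measurement described above can be sketched in a few lines. This is a minimal illustration, not Stilta's evaluation code: the `Case` type, the case identifiers, and the reference labels are all hypothetical, and it assumes references have already been normalized so that the same document always gets the same identifier. It also shows two plausible ways to aggregate across cases (pooled and per-case mean), since the article does not say which one the 71% figure uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One benchmark case (hypothetical structure, for illustration only)."""
    ipr_id: str                # placeholder label, not a real proceeding
    cited: frozenset[str]      # ground truth: references cited in the petition
    retrieved: frozenset[str]  # references the tool surfaced from the patent number alone


def case_recall(case: Case) -> float:
    """Fraction of the petition's cited references that the tool also surfaced."""
    if not case.cited:
        raise ValueError(f"{case.ipr_id}: no cited references")
    return len(case.cited & case.retrieved) / len(case.cited)


def pooled_recall(cases: list[Case]) -> float:
    """Micro-average: total hits divided by total cited references."""
    hits = sum(len(c.cited & c.retrieved) for c in cases)
    total = sum(len(c.cited) for c in cases)
    return hits / total


def mean_case_recall(cases: list[Case]) -> float:
    """Macro-average: every case counts equally, regardless of petition size."""
    return sum(case_recall(c) for c in cases) / len(cases)


def novel_references(case: Case) -> frozenset[str]:
    """References the tool surfaced that the petition did not cite."""
    return case.retrieved - case.cited


# Made-up data: two toy cases with invented reference labels.
cases = [
    Case("CASE-A", frozenset({"US1", "US2", "NPL1"}), frozenset({"US1", "NPL1", "US9"})),
    Case("CASE-B", frozenset({"US3", "US4"}), frozenset({"US3", "US4", "US8", "NPL7"})),
]

for c in cases:
    print(c.ipr_id, round(case_recall(c), 2), sorted(novel_references(c)))
print("pooled:", pooled_recall(cases), "mean per case:", round(mean_case_recall(cases), 2))
```

The two aggregates differ whenever petitions cite different numbers of references, so anyone reproducing the benchmark would need to fix that choice before comparing numbers.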

Results

Stilta achieved 71% petition recall across 40 cases, with a median search time of 17 minutes per patent. That means Stilta independently surfaced more than two out of three references that professional search teams - often backed by specialized firms and expert declarations - identified and cited in successful petitions. Stilta also surfaced 161 novel references across the 40 cases - prior art not cited in the original petitions that may support alternative or additional invalidity theories.

In comparative evaluation, Stilta surfaced approximately twice as many petition-grade references as the patent search tool category and several times more than general-purpose LLMs. This suggests that chat-based models, used on their own, lack the structured retrieval and reasoning needed for litigation-grade prior art search. Even Stilta's below-average cases outperformed the typical result from commercial patent search tools.
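The headline figures state the gaps in percentage points, while the paragraph above states them as multiples. The short calculation below connects the two. It assumes the +31pp and +49pp gaps are measured against a category-level recall on the same metric; the article does not say whether that is a category average or a best-in-category result, so the derived baselines are approximations.

```python
# Reported figures from the article.
stilta_recall = 0.71
gap_search_tools_pp = 31  # percentage points
gap_llms_pp = 49          # percentage points


def baseline_from_gap(recall: float, gap_pp: float) -> float:
    """Recall implied for a comparison category, given a percentage-point gap."""
    return recall - gap_pp / 100


# Assumption: each gap is relative to one category-level recall figure.
search_tools_recall = baseline_from_gap(stilta_recall, gap_search_tools_pp)
llm_recall = baseline_from_gap(stilta_recall, gap_llms_pp)

print(f"search tools: ~{search_tools_recall:.0%}, "
      f"ratio {stilta_recall / search_tools_recall:.1f}x")
print(f"LLMs: ~{llm_recall:.0%}, ratio {stilta_recall / llm_recall:.1f}x")
```

Under that assumption the implied baselines are roughly 40% and 22%, which gives ratios of about 1.8x and 3.2x, in line with "approximately twice" and "several times".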

Interpreting 71%

Invalidity search is inherently subjective. Two experienced attorneys given the same patent will decompose claims differently, prioritize different elements, and construct different obviousness narratives from the prior art landscape. The combinatorial space of possible reference combinations is vast. The 71% recall measures overlap with one team's chosen analytical path, not the universe of valid paths. Some of Stilta's "missed" references are effectively replaced by alternative references among the novel art it surfaced.

Where Stilta performed below 50%, the cases share two patterns. The first is cross-domain analogous art: references drawn from entirely different technical fields and applied by analogy. A petitioner's expert might recognize that an aerospace fluid dynamics technique is relevant to a microfluidics patent, a connection that depends on deep specialist knowledge rather than any form of search. The second is combination depth: when petitions rely on three or more references, the specific choice of secondary gap-filling references is highly subjective, and Stilta frequently surfaced alternative secondary references addressing the same claim gaps through different technical disclosures.

What this means in practice

An AI search that surfaces the majority of petition-grade references in 21 minutes does not replace human judgment. It compresses the time between receiving a patent and having a strong set of candidate references to evaluate. Firms can use AI for initial triage and reserve expensive manual search for the gaps.[5]

This is not keyword or semantic search. Stilta's agents read the patent, decompose claims, search across 170 million patents, one trillion web archive pages, and 250 million scientific publications, then evaluate references against specific claim elements and build invalidity reasoning. The output is not a raw list requiring manual screening. It is a structured set of results that shifts attorney time from low-value triage to high-value strategic judgment.
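To make "structured results" concrete, here is one way an element-by-element mapping could be represented. This is a hypothetical sketch, not Stilta's data model: the function names, element labels, and reference labels are invented, and it assumes some earlier step has already decided which claim elements each reference discloses.

```python
def build_claim_chart(claim_elements: list[str],
                      disclosures: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each claim element to the references said to disclose it.

    `disclosures` maps a reference label to the set of claim elements it covers.
    Both inputs are assumed to come from an upstream analysis step.
    """
    chart: dict[str, list[str]] = {element: [] for element in claim_elements}
    for reference, elements in disclosures.items():
        for element in elements:
            if element in chart:
                chart[element].append(reference)
    return chart


def uncovered_elements(chart: dict[str, list[str]]) -> list[str]:
    """Claim elements with no supporting reference yet: the gaps left for manual search."""
    return [element for element, refs in chart.items() if not refs]


# Made-up example: three elements of a hypothetical claim 1, two invented references.
elements = ["1a", "1b", "1c"]
disclosures = {"REF-A": {"1a", "1b"}, "REF-B": {"1b"}}

chart = build_claim_chart(elements, disclosures)
print(chart)
print("gaps:", uncovered_elements(chart))
```

A structure like this is what lets a reviewer see at a glance which elements are well supported and which still need a secondary reference.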

Stilta supports both a fully autonomous workflow (used for this benchmark) and an interactive, attorney-led mode where practitioners steer the analysis in real time - focusing on specific claim elements, trying different hypotheses, and applying the domain expertise that makes the difference on hard cases.

A less obvious advantage is consistency. Every attorney and every search firm approaches an invalidity search differently - different queries, different databases, different judgment calls. Stilta applies the same methodology to every case, producing reproducible results regardless of who initiates the search.

Transparency and limitations

We want to be direct about the limitations. This study was conducted by the developers of the tool being evaluated. The ground truth is objective and public - anyone can verify the PTAB record - but the evaluation of novel references involves subjective judgment. At N=40, a case-level bootstrap yields a 95% confidence interval of approximately 64%-77% on the aggregate recall.
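For readers unfamiliar with a case-level bootstrap, the sketch below shows the general technique: resample whole cases with replacement, recompute the aggregate recall each time, and read the interval off the percentiles. It is an illustration under stated assumptions, not the code behind the reported interval. The per-case counts are synthetic, the aggregate is assumed to be pooled recall, and the percentile method is one of several ways to form a bootstrap interval.

```python
import random


def bootstrap_recall_ci(per_case: list[tuple[int, int]],
                        n_resamples: int = 10_000,
                        alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for pooled recall, resampling whole cases.

    `per_case` holds one (hits, cited) pair per case.
    """
    rng = random.Random(seed)
    n = len(per_case)
    stats = []
    for _ in range(n_resamples):
        # Draw n cases with replacement, keeping each case's counts together.
        sample = [per_case[rng.randrange(n)] for _ in range(n)]
        hits = sum(h for h, _ in sample)
        cited = sum(c for _, c in sample)
        stats.append(hits / cited)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic data only: 40 invented cases, NOT the benchmark's real counts.
data_rng = random.Random(42)
synthetic = []
for _ in range(40):
    cited = data_rng.randint(2, 8)
    hits = sum(data_rng.random() < 0.7 for _ in range(cited))
    synthetic.append((hits, cited))

point = sum(h for h, _ in synthetic) / sum(c for _, c in synthetic)
lo, hi = bootstrap_recall_ci(synthetic)
print(f"pooled recall {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Resampling at the case level rather than the reference level matters because references within one petition are not independent: a hard case tends to miss several of its references at once.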

We are publishing the full benchmark dataset - including IPR identifiers and challenged patents - so that other tool developers and researchers can run the same evaluation.

Appendix: Benchmark Dataset

Table 1 presents all 40 PTAB IPR proceedings in the benchmark dataset. All cases resulted in institution decisions where the Board granted review on at least one ground, providing verified ground truth for prior art evaluation. TTR = Stilta time to results (autonomous mode).

Table 1: PTAB benchmark dataset (N=40). All cases are PTAB IPR proceedings where institution was granted. TTR = Stilta time to results (autonomous mode).

Read the Full Technical Report

The full report includes the complete benchmark dataset of 40 IPR proceedings, detailed methodology, and all references.

Download Full Report

References

[1] Hafner, A., Modic, D., Damij, N., Furlan, A., Rampih, D. (2025). "AI-Powered Prior Art Search: Towards Enriching Intellectual Property Management?" Lecture Notes in Networks and Systems, vol. 1446, Springer.

[2] Setchi, R., Spasic, I., Morgan, J., Harrison, R., Sheridan, C. (2021). "Artificial intelligence for patent prior art searching." World Patent Information, 64, 102021.

[3] USPTO. "Trial Statistics FY 2025." Patent Trial and Appeal Board, October 2025.

[4] This comparative evaluation is intended to provide directional context rather than a definitive ranking. Each tool was used in a standardized manner, but expert users of any individual tool might achieve different results through optimized query strategies.

[5] Mintz. "The PTAB Pendulum Swings: How IPR Denials are Reshaping Patent Owner and Challenger Strategies." August 2025.

Stilta is an agentic AI system purpose-built for patent invalidity analysis - both prior art search and claim-level analysis. Built by ex-McKinsey QuantumBlack AI engineers.

To see how Stilta performs on your patents, reach out at oskar@stilta.com or book a demo below.

Book a Demo
