
# Promptfoo alternatives after the OpenAI acquisition

Two of the three biggest open-source LLM eval tools changed hands in 8 weeks. Langfuse was acquired by ClickHouse in January 2026 (HN 46656552). Promptfoo was acquired by OpenAI in March 2026 (HN 47312346). For the eval tooling layer this is a structural event, not a footnote. The practitioner web has not caught up. This page tells you what actually changed, what's still independent, and how to make your eval setup survive the next acquisition — because there will be one.

What the acquisitions actually changed

Both deals were framed in the press as "the projects stay open source." That part is true. The MIT licenses are intact, the GitHub repos still accept PRs, the SDKs still run against the providers they always supported. If you're looking at the changelogs in isolation, nothing dramatic happened.

What changed is governance. Roadmap priorities, hiring decisions, paid-tier shape, integration scope, and — eventually — default model choice now sit inside the acquirer. The Hacker News skepticism on the Langfuse thread named this directly: "Raised $4M in 2023, likely depleted those funds over 2yr without disclosed follow-on." The acquisition is the funding event, and the funding event determines product direction.

The OpenAI / Promptfoo case is sharper because OpenAI is one of the providers Promptfoo evaluates against. Independence of judgment matters in eval tooling the way it matters in audit. The repo can stay multi-provider while the development emphasis quietly tilts toward features that flatter the parent's models. Nobody has alleged this is happening; we're noting that the structural incentive now exists.

The independent landscape today

The remaining independent open-source players, in rough order of overlap with what Promptfoo and Langfuse covered:

| Tool | License | Self-host | OTEL-native | Strongest at |
|---|---|---|---|---|
| Phoenix (Arize) | Apache-2 | Yes | OpenInference | Tracing, span-level inspection, framework-agnostic |
| Helicone | Apache-2 | Yes | Partial | Proxy-based cost tracking, simplest install |
| Opik (Comet) | Apache-2 | Yes | Yes | Prompt optimization, rising fast (0→12.5k stars in 9 months) |
| Pydantic Logfire | OSS SDK + closed core | Limited | Yes | Python-stack teams already on Pydantic AI / FastAPI |
| Laminar | Apache-2 | Yes | Yes | Agent-focused tracing, Vercel AI SDK integration |
| OpenLLMetry (Traceloop) | Apache-2 | Exports to your stack | Yes | Adding LLM spans to existing Datadog / Grafana / Honeycomb |
| DeepEval | MIT | N/A (library) | No | Pytest-style eval, 30+ metrics, CI-native |
| Inspect AI | MIT | N/A (library) | No | Research-grade evals, 200+ pre-built tasks, sandboxed |

None of these is a one-for-one Promptfoo replacement. Promptfoo's value was the bundle: a CLI runner, a YAML config format, and a red-team module in one tool. The closest single-tool match for the eval-runner half is DeepEval. For the YAML-config half there is currently no equivalent — you either move to a Python-defined eval suite, or you keep Promptfoo and accept the governance risk. For the red-team half, the picks are cleaner: Garak (NVIDIA, 37+ probes, focused on jailbreak/PII/toxicity), PyRIT (Microsoft, more adversarial-ML academic), and DeepTeam (Confident AI, the team behind DeepEval).
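For teams weighing the Python-defined route, the sketch below shows what the eval-runner half looks like in DeepEval's pytest-style interface. It is a minimal illustration, not a Promptfoo conversion: the metric choice, threshold, and test data are placeholders, and the class names should be checked against the DeepEval version you install.

```python
# test_faq_bot.py — minimal DeepEval sketch; metric, threshold, and example
# data are illustrative placeholders, not a drop-in Promptfoo conversion.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


@pytest.mark.parametrize(
    "question, answer",
    [
        ("How do I rotate an API key?", "Open Settings > API Keys and click Rotate."),
        ("What is the default rate limit?", "60 requests per minute per key."),
    ],
)
def test_answer_relevancy(question: str, answer: str) -> None:
    # In a real suite the answer comes from your own model call, not a literal.
    case = LLMTestCase(input=question, actual_output=answer)
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```

Because this is plain pytest, it slots into CI the same way a Promptfoo YAML run did; the trade-off is that your eval definitions now live in Python rather than declarative config.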

The MCP eval gap nobody is publishing

The single most consequential thing Promptfoo never did was MCP server testing. From the Ask HN thread on the topic (id 47412524):

> Promptfoo was great for LLM evaluation but didn't handle MCP's transport layer, tool schema validation, or MCP-specific vulnerabilities like Tool Poisoning.

The alternatives developers named in that thread — MCP Inspector (interactive-only), MCPSpec, MCPjam, mcp-record, mcpbr, agent-vcr — are all nascent. None of them is a clear winner today. If your eval problem is "does my MCP server hold up under tool-calling agents," the OpenAI acquisition didn't make things worse, because the gap was there the whole time. See /topic/mcp-eval for the current state of that landscape.

How to make your eval setup acquisition-proof

The lesson from the past 8 weeks is that which tool you use is the wrong unit of decision. Tools get acquired. Processes don't. Shreya Shankar said this directly on X (thread):

> AI evals curricula should be tool-agnostic. It is better to learn the processes, because then you can (i) evaluate any tool and (ii) build your own.

Hamel Husain's evals FAQ puts a number on the process: 60-80% of dev time should sit in error analysis, not in eval tool selection. Picking the right tool is a one-day decision; running 100 traces through error analysis is a one-month discipline. The discipline is what produces working evals. The practical encoding of "process, not tool" is a written ruleset that any runner can execute against. The same judge prompts work in Promptfoo today, DeepEval next quarter, and whatever-OpenAI-ships-instead-of-Promptfoo in 2027. The same error-taxonomy categories survive every migration. The same trace-annotation template moves between Langfuse and Phoenix in an afternoon.
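One concrete (and entirely hypothetical) shape for that ruleset: keep the judge prompt and the error taxonomy as plain data in a file that imports no eval framework, and let each runner reach it through a thin adapter. The names below are illustrative.

```python
# ruleset.py — hypothetical tool-agnostic eval ruleset; names are illustrative.
# Nothing here imports Promptfoo, DeepEval, or any other runner.
from dataclasses import dataclass
from typing import Callable

# Error taxonomy from your own error analysis, not from a tool's metric list.
ERROR_TAXONOMY = [
    "hallucinated_fact",
    "missed_instruction",
    "wrong_tool_call",
    "tone_violation",
]

JUDGE_PROMPT = """You are grading an assistant reply.
Question: {question}
Reply: {reply}
Label the reply with exactly one of: {labels}.
Answer with the label only."""


@dataclass
class Trace:
    question: str
    reply: str


def render_judge_prompt(trace: Trace) -> str:
    """Render the runner-agnostic judge prompt for a single trace."""
    return JUDGE_PROMPT.format(
        question=trace.question,
        reply=trace.reply,
        labels=", ".join(ERROR_TAXONOMY),
    )


# A runner adapter is any callable that sends the rendered prompt to a judge
# model and returns the label. Promptfoo, DeepEval, or an in-house script can
# each provide one; swapping runners never touches the ruleset above.
RunnerAdapter = Callable[[str], str]
```

When the runner changes, only the adapter changes; the taxonomy, the judge prompt, and your annotated traces carry over unmodified.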

A migration sketch, by use case

  • If you used Promptfoo for CI evals on standard providers: keep it. Add 30 days to your annual planning cycle to re-evaluate. Set a calendar reminder for January 2027 and watch the changelog for OpenAI-specific feature work that signals a tilt.
  • If you used Promptfoo for red-teaming: move now to Garak or DeepTeam. The red-team module was always the weakest part of Promptfoo's bundle, and the dedicated red-team tools have caught up.
  • If you used Langfuse for self-hosted EU-residency observability: audit your threat model. The data plane is unchanged, but the project's governance is now under a US-acquired entity. For some teams this is a real compliance question; for others it isn't. The HN thread's most-upvoted point: "US companies can be legally compliant with GDPR, it's just that the likes of the CLOUD Act and FISA make it completely meaningless." That's a policy view, not a legal opinion, but it captures the chill.
  • If you used Langfuse for general observability: stay. Phoenix is the closest like-for-like if you ever need to leave, and it's been Apache-licensed and OTEL-native the whole time. Migration cost is moderate, not catastrophic.
  • If you depended on either tool for MCP testing: you were already in the gap. See /topic/mcp-eval.


Sources

  • Husain, Hamel. "Evals FAQ": 60-80% of dev time on error analysis.


Frequently asked

Is Promptfoo still open source after the OpenAI acquisition?
Yes. As of the March 2026 acquisition announcement (HN thread 47312346), the repo stays MIT-licensed and multi-provider — Anthropic, Google, Bedrock, and 50+ other providers continue to work. What changed is governance: roadmap, hiring, and prioritization now sit inside OpenAI. The pattern from previous OpenAI acquisitions is that the OSS project keeps shipping for 12-18 months, then quietly pivots toward platform integration. Plan as if you have one year of stability and start budgeting a migration path now.
What changed about Langfuse when ClickHouse acquired it?
Langfuse stays MIT, self-hosting still works, the SDKs are unchanged. The acquisition (announced January 2026, HN thread 46656552) was structurally a hire of the team into ClickHouse's product org. The skepticism on HN was sharper than the press release: multiple commenters noted Langfuse had raised $4M in 2023 with no disclosed follow-on, and the EU-data-residency story is muddied by US-acquirer CLOUD Act exposure. For EU teams who chose Langfuse specifically to avoid US vendors, this is a meaningful change.
What's left that's actually independent?
Phoenix (Arize), Helicone (YC W23), Opik (Comet), Pydantic Logfire (Pydantic team), Laminar (YC S24), and OpenLLMetry (Traceloop). Phoenix is the closest like-for-like to Langfuse on the tracing/observability side. DeepEval and Inspect AI cover the dataset/eval side that Promptfoo owned. None of these is a drop-in replacement for either tool, which is the point of this page.
Should I migrate off Promptfoo right now?
Not yet — and not the way most blog posts imply. The migration question depends on what you used Promptfoo for. If you used it for CI-native YAML-driven eval runs on standard providers, you can keep using it for 6-12 months and re-evaluate. If you depended on it for MCP testing (which Promptfoo never actually handled, see HN 47412524), the gap was always there and you need a different tool today. The honest move is to encode your eval process as a tool-agnostic ruleset now, so the choice of runner becomes interchangeable.
Is there a single 'best replacement' for Promptfoo?
No, because Promptfoo was three things bundled — a CLI runner, a YAML config format, and a red-team toolkit. The closest pure-eval replacement is DeepEval (pytest-style, 30+ metrics). For red-team, the picks are Garak (NVIDIA, 37+ probes), PyRIT (Microsoft), or DeepTeam (Confident AI). For YAML-driven prompt iteration the gap is real and is part of why the acquisition matters.
