Two of the Three Biggest LLM Eval Tools Were Acquired in 8 Weeks. Here's What to Do.

RuleSell Team

Langfuse went to ClickHouse in January. Promptfoo went to OpenAI in March. The eval stack consolidated faster than anyone planned for. Here's the migration map.

In a span of eight weeks, two of the three most-installed open-source LLM evaluation tools changed hands. ClickHouse acquired Langfuse in January 2026. OpenAI acquired Promptfoo in March 2026. The practitioner web has not caught up — and the teams who built their eval pipelines on these tools are quietly re-architecting.

If you run LLM features in production, this is the migration you didn't plan for.

What actually got acquired, and what stays free

Both deals preserve the open-source license — Langfuse stays MIT, Promptfoo stays MIT — but neither preserves the trajectory you bought into when you adopted the tool.

Langfuse → ClickHouse (January 2026). Langfuse's storage engine was already ClickHouse under the hood, so the deal is vertical integration rather than a predatory one. The skepticism on the original HN thread was sharp anyway: "Raised $4M in 2023, likely depleted those funds over 2yr without disclosed follow-on" was the top-voted comment. Multiple replies flagged the EU compliance problem — "US companies can be legally compliant with GDPR, it's just that the likes of the CLOUD Act and FISA make it completely meaningless." For European teams that adopted Langfuse specifically because it was an EU-headquartered open-source observability platform, the math just changed.

Promptfoo → OpenAI (March 2026). This one is louder. Promptfoo was the de facto CI-native LLM eval tool — YAML configs, deterministic runs, 50+ provider support, ~300k weekly downloads. Now it's owned by a foundation model lab. The repo stays multi-provider on paper. The roadmap, the priorities, and the eventual decisions about which providers get first-class treatment now sit inside OpenAI. The HN thread, and a fast-moving Ask HN about MCP eval posted after the acquisition, made the discomfort explicit: devs are now evaluating Anthropic models with a tool maintained by OpenAI.

The third tool — Helicone — is still independent, still YC-funded, still proxy-first. So are Phoenix (Arize), Opik (Comet), Pydantic Logfire, and Laminar. The "Big Three" of open-source LLM eval is now one: Helicone, surrounded by smaller specialized players.

What this means for the eval stack

Three concrete things change for teams who built on Langfuse or Promptfoo.

1. The "tool-agnostic processes" thesis got cheaper to argue. Shreya Shankar said it plainly before either deal closed: "AI evals curricula should be tool-agnostic. It is better to learn the processes, because then you can (i) evaluate any tool and (ii) build your own." That was good advice in 2025. After two acquisitions in eight weeks, it's the only defensible stance. If your eval workflow is encoded as a process ("review 100 traces, write a binary judge, regression-test in CI"), you can swap the tool. If your eval workflow is encoded as promptfoo eval --config promptfooconfig.yaml, you have a vendor problem. A sketch of the process-as-code version follows below.

2. The MCP eval gap got worse. Promptfoo never handled MCP's transport layer, tool-schema validation, or MCP-specific vulnerabilities like Tool Poisoning. We covered the structural reasons in our state-of-MCP-2026 piece. The OpenAI acquisition didn't fix that gap; it just made it less likely Promptfoo will close it on a timeline that matches Anthropic's MCP roadmap. The nascent alternatives (MCPSpec, MCPjam, mcpbr, agent-vcr) are all under 1k stars, and none has the CI-native UX Promptfoo earned.

3. EU data-residency teams need to revisit their stack. Langfuse's EU operations are still EU, but the parent is now ClickHouse Inc., a Delaware C-corp. For teams that picked Langfuse partly because the legal entity sat in Berlin, the CLOUD Act exposure is real. Self-host stays clean. Cloud doesn't.
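To make point 1 concrete, here is a minimal sketch of an eval encoded as process rather than as vendor config: a binary LLM judge plus a regression gate that fails CI. The judge prompt, the model name, and the 70% threshold are illustrative assumptions, not a prescription; swap in whatever provider your team standardizes on.

```python
# Minimal process-as-code eval: a binary judge + a CI regression gate.
# Judge prompt, model, and threshold are illustrative assumptions.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

JUDGE_PROMPT = (
    "You are grading an assistant's answer. Reply with exactly PASS or FAIL.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "PASS only if the answer is correct, grounded, and complete."
)

def judge(question: str, answer: str) -> bool:
    """Binary LLM-as-judge: returns True iff the judge says PASS."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

def regression_gate(cases: list[tuple[str, str]], threshold: float = 0.70) -> None:
    """Fail CI if the pass rate drops below the agreed threshold."""
    rate = sum(judge(q, a) for q, a in cases) / len(cases)
    assert rate >= threshold, f"eval pass rate {rate:.0%} is below {threshold:.0%}"
```

Because the judge and the gate are yours, moving from Promptfoo to DeepEval (or to nothing at all) changes plumbing, not process.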

Three alternatives ranked

Here is how we'd think about the migration if we were running an LLM feature in production today. Ranked by how much process you have to learn, not how many stars the repo has.

1. Helicone (the pragmatic pick)

Helicone is the closest like-for-like Langfuse replacement for teams that want a single observability layer with cost tracking and minimal config. It's a proxy — you swap your base_url to point at Helicone, and it captures every request, response, latency, and cost. Apache-2 license, free tier covers 10k requests/month, self-host available.
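A minimal sketch of the swap with the OpenAI Python SDK. The gateway URL and the Helicone-Auth header follow Helicone's published integration docs, but verify both against the current docs before shipping:

```python
# Route OpenAI traffic through Helicone's proxy; only the client config changes.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone gateway instead of api.openai.com
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

# Every call below is now captured: request, response, latency, cost.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```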

The trade-off: proxy-only means you don't get framework-aware spans the way Langfuse or Phoenix do. If your stack is "OpenAI SDK + Anthropic SDK + Bedrock," Helicone is plug-and-play. If your stack is "LangChain + LlamaIndex + DSPy," you'll miss the deep span tracing.

When to pick: small team, multiple providers, want cost tracking without rewriting code.

2. Phoenix (Arize) (the OTEL-native pick)

Phoenix is the OpenInference reference implementation — meaning it speaks OpenTelemetry natively, so your spans land in any OTEL backend (Datadog, Honeycomb, Grafana Tempo). It's Apache-2, the SDK is open, and Phoenix runs locally or in Docker.

Phoenix is the right pick if you already have an OTEL stack and don't want LLM observability to become a separate silo. The "LLM-as-judge" features are built in. The downside: setup is heavier than Helicone, and the UI has a steeper learning curve than Langfuse.
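A sketch of the wiring, assuming the arize-phoenix and openinference-instrumentation-openai packages. Module paths have moved between Phoenix releases, so treat this as directional rather than definitive:

```python
# Launch Phoenix locally and auto-instrument the OpenAI SDK via OpenInference.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # local Phoenix UI, typically at http://localhost:6006

# Register an OTEL tracer provider pointed at Phoenix's collector.
tracer_provider = register(project_name="prod-llm-feature")

# Spans from every OpenAI SDK call now land in Phoenix -- or in any other
# OTEL backend you point the provider at instead.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```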

When to pick: existing OTEL pipeline, polyglot framework stack, want LLM data alongside HTTP traces.

3. Pydantic Logfire (the Python-stack pick)

Logfire is the new one to watch. Built by the Pydantic team, OTEL-native, deeply integrated with Pydantic AI and FastAPI. Free tier exists, self-host limited, but the Python ergonomics are unmatched if your backend is FastAPI or Pydantic AI agents.

When to pick: Python-only stack, you already use Pydantic AI, you want one tool for app traces + LLM traces.
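A minimal sketch of the wiring, assuming the logfire package and a FastAPI app. instrument_fastapi and instrument_openai are documented Logfire helpers, but the SDK is young, so check current signatures:

```python
# App traces and LLM traces in one tree: Logfire on a FastAPI backend.
import logfire
from fastapi import FastAPI

app = FastAPI()

logfire.configure()              # reads LOGFIRE_TOKEN from the environment
logfire.instrument_fastapi(app)  # HTTP request spans
logfire.instrument_openai()      # OpenAI SDK spans nested in the same traces

@app.get("/health")
def health() -> dict[str, str]:
    return {"status": "ok"}
```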

What replaces Promptfoo

This is the harder migration. Promptfoo was lovely because it was YAML-driven, CI-native, and didn't care which provider you used. The closest open-source alternatives, ranked:

  • DeepEval — pytest-style, 10k+ stars, 30+ built-in metrics. Best for teams who already write Python tests and want eval to live in the same harness. The metric library is genuinely good (G-Eval, hallucination, faithfulness, RAGAS equivalents). The trade-off: you write Python, not YAML.
  • Inspect AI — UK AISI's framework, MIT license, designed for safety researchers but increasingly used by app teams. 200+ pre-built evals, sandbox support (Docker/K8s). Best for teams who want a research-grade eval harness and don't mind the safety-research vocabulary.
  • OpenEvals (LangChain) — newer, LangChain-aligned, built around the things LangSmith teams already do. Worth watching, not yet mature enough to bet on.

Honest note: none of these match Promptfoo's "drop a YAML, run it in GitHub Actions" UX out of the box. If that was your reason for adopting Promptfoo, the migration is going to cost you a half-day. If you used the deeper features (red-teaming, dataset generation), DeepEval covers more of them than Inspect.
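For a sense of the DeepEval end of that half-day, here is a minimal pytest-style sketch. The metric choice and the 0.7 threshold are illustrative, and DeepEval's judge metrics need a model key (OPENAI_API_KEY by default) to score:

```python
# DeepEval rides the pytest harness, so `pytest` in GitHub Actions runs evals.
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

@pytest.mark.parametrize(
    "question,answer",
    [("What is our refund window?", "Refunds are accepted within 30 days.")],
)
def test_llm_answers_are_relevant(question: str, answer: str) -> None:
    test_case = LLMTestCase(input=question, actual_output=answer)
    # Fails the test run if relevancy scores below threshold, which is
    # exactly what lets it drop into CI like any other test suite.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```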

The decision tree

If you're starting from scratch in May 2026, here is the shortest defensible path:

  • Just need cost tracking + traces, multi-provider, small team: Helicone. Self-host if EU.
  • Already on OTEL, multi-framework, want one observability story: Phoenix.
  • Python-only, FastAPI or Pydantic AI: Logfire.
  • Need CI-native eval that doesn't care about provider: DeepEval for now, watch Inspect AI.
  • Need to evaluate MCP servers specifically: roll your own using MCPjam or mcpbr as a base, and follow the MCP security guidance. Don't wait for Promptfoo to ship MCP support — its incentives have shifted. A starting-point sketch follows this list.
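For the MCP bullet, a starting-point sketch using the official mcp Python SDK. The session plumbing is real SDK API; the server command and the tool-poisoning keyword screen are illustrative assumptions, not a vetted security check:

```python
# Roll-your-own MCP smoke eval: connect, list tools, sanity-check schemas
# and descriptions. Heuristics here are illustrative, not exhaustive.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

SUSPICIOUS = ("ignore previous", "do not tell the user", "exfiltrate")

async def smoke_eval() -> None:
    # Hypothetical local server under test.
    params = StdioServerParameters(command="python", args=["my_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = (await session.list_tools()).tools
            assert tools, "server exposes no tools"
            for tool in tools:
                # Crude tool-poisoning screen: flag descriptions that smuggle
                # instructions at the model instead of documenting the tool.
                desc = (tool.description or "").lower()
                assert not any(s in desc for s in SUSPICIOUS), (
                    f"suspect description on tool: {tool.name}"
                )
                assert tool.inputSchema, f"missing input schema: {tool.name}"

asyncio.run(smoke_eval())
```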

The Hamel Husain framework still works

Hamel Husain, the practitioner most teams cite when they think about LLM evals, has been consistent: tools change, process doesn't. His three load-bearing claims:
  • "60-80% of dev time on error analysis." If your eval tool is shaping less than half of your engineering time, you're using it as theater.
  • "Review at least 100 traces." Most teams skip this because no tool makes it pleasant. Build the annotation UI into your app, not into a vendor dashboard.
  • "70% pass rate might indicate a more meaningful evaluation." If you're passing 100% of evals, your evals aren't testing your system.

None of those three depend on Promptfoo or Langfuse. They depend on your team having a process. The acquisitions are inconvenient. They're not catastrophic.

What's still independent (as of May 2026)

The open-source LLM observability and eval space, post-acquisitions:

| Tool | License | Independent? | Best for |
|---|---|---|---|
| Helicone | Apache-2 | Yes (YC W23) | Cost tracking + proxy traces |
| Phoenix (Arize) | Apache-2 | Yes (Arize AI) | OTEL-native, OpenInference reference |
| Opik | Apache-2 | Yes (Comet) | Prompt optimization + tracing |
| Pydantic Logfire | Closed core, OSS SDK | Yes (Pydantic team) | Python-stack full-stack |
| Laminar | Apache-2 | Yes (YC S24) | Agent-focused tracing |
| OpenLLMetry (Traceloop) | Apache-2 | Yes (Traceloop) | OTEL instrumentation library |
| Inspect AI | MIT | Yes (UK AISI) | Safety + app evals |
| DeepEval | Apache-2 | Yes (Confident AI) | pytest-style eval |
| Langfuse | MIT | No (ClickHouse, Jan 2026) | |
| Promptfoo | MIT | No (OpenAI, Mar 2026) | |

The independent column is longer than the acquired column. That is the actual story. The eval ecosystem didn't get smaller — it got more fragmented, with the two best-marketed tools now owned by infrastructure vendors. Smaller specialized tools are still shipping.

Where this analysis fails

We don't know what either acquisition will do to the roadmap. Langfuse's leadership stayed; Promptfoo's founder tweeted that the team is staying intact. Acquisitions often look benign for 12-18 months and then the parent company's priorities take over. If you're making a migration call in May 2026 because of these deals, you might be early. If you're making it in May 2027 because the roadmap drifted, you'll be late.

We also don't know what happens to the MCP eval story specifically. Anthropic could ship a first-party MCP eval tool tomorrow and absorb the category. The MCP ecosystem moves fast.

What we do know: the teams who built their eval pipelines on tool-specific YAML configs are doing more migration work than the teams who built theirs on process. That is the durable lesson.

What to do today

If you have a production LLM feature and you're on Promptfoo or Langfuse:

  1. Audit which features you actually use. If you use Promptfoo for promptfoo eval --config X in CI and nothing else, your migration is a half-day to DeepEval or Inspect.
  2. Self-host Langfuse if you're EU and on cloud. Their docker-compose works. The crossover math (when self-host beats SaaS) lands around 100k traces/month for most teams.
  3. Don't rip-and-replace yet. Both tools still work. The acquisitions matter for new commitments, not for systems that ship today.
  4. Write down your eval process. What questions does the eval answer? What are your judge prompts? What's the regression threshold? When that's written down, the tool is fungible.

If you're starting fresh: pick from the independent column and build the process Hamel describes. Tools will keep moving. Your team's eval discipline is what compounds.

Browse verified eval rulesets on RuleSell. Read more on LLM evals as process, the post-Promptfoo migration map, and Claude Code skill evaluation. For teams shipping AI features, the /for/startup-with-paying-customers page covers the first 30 days of production eval discipline. If you're a solo AI developer, the lightweight Helicone + DeepEval combo gets you 80% of what Langfuse gave you, free.
