How we measure quality: the 6 signals behind RuleSell's Quality Score
Star ratings measure popularity. Download counts measure marketing. We measure quality directly — with six automated signals and zero voting.
Every marketplace eventually faces the same problem: how do you surface the best stuff?
The standard answer is user ratings. Five stars. Thumbs up. "Was this helpful?" It sounds democratic. It's actually broken, and we have a decade of evidence to prove it.
Why star ratings fail
The VS Code Marketplace problem
The VS Code Marketplace has been fighting this battle since its inception. In a long-running GitHub issue about the rating system, developers identified the core failure mode: most users who dislike an extension just uninstall it. They don't rate it. This creates a selection bias where the only people who leave ratings are either enthusiastic fans or angry enough to complain.
The result is extensions with "several thousands of installs with very few ratings, if any at all." And with few ratings, a single unhappy user can destroy an extension's score. One developer noted that "many [ratings] that do exist are probably just friends helping out" — the friend-rating problem that plagues every small-scale marketplace.
The proposed fix? Display "a percentage of users that still have that extension installed" instead of stars. Retention rate as a quality signal. VS Code hasn't implemented this, but the insight is correct: behavioral signals are more honest than self-reported ones.
The App Store fraud machine
Apple removed 143 million fraudulent reviews in 2024 and terminated over 146,000 developer accounts tied to review fraud. Roughly 1 in 9 reviews submitted to Apple that year was classified as fraudulent. The Games category — the most competitive — accounted for 41% of apps with fake reviews.
The damage goes beyond misleading users. As one analysis noted, "fake reviews don't just mess with your rating — they throw off your entire feedback loop, with teams sometimes delaying critical launches because they were trying to fix complaints that came from bots."
When your quality signal is corrupted by bots, the marketplace can't function. Legitimate creators can't compete with bought ratings. Users can't trust what they see. The marketplace decays.
The review bombing pattern
Review bombing — coordinated waves of negative reviews triggered by controversy rather than product quality — hits every rating-dependent marketplace. Genshin Impact got bombed after a gameplay update. Political controversies regularly tank restaurant ratings on Google. The problem isn't that negative feedback exists — it's that rating systems can't distinguish "this product is bad" from "I'm angry about an unrelated decision."
npm's invisible quality crisis
npm doesn't have star ratings, which is its own kind of problem. Download counts are the de facto quality signal, but download counts measure distribution, not quality. A package can have millions of weekly downloads because it's a transitive dependency — pulled in by other packages — not because anyone chose it. Meanwhile, excellent packages with small audiences languish in obscurity because the discovery algorithm rewards momentum.
The npm ecosystem has quality problems that downloads completely obscure: abandoned packages with known vulnerabilities, packages that ship 10x more code than necessary, packages with zero tests. None of this shows up in the download count.
Figma Community's review gap
Figma's plugin and widget marketplace takes a different approach — manual review. According to Figma's review guidelines, they check that plugins "are completed, function as intended, and do not include temporary content." Plugins that "stop working or offer a low quality experience may be removed."
The manual review model catches obvious failures, but it doesn't scale to nuanced quality assessment. As one Figma user put it: "Half of the plugins are fantastic, but the other half seem like they were thrown together hastily." Binary pass/fail review catches the truly broken. It doesn't distinguish between "works" and "works well."
What we wanted instead
When we designed RuleSell's quality system, we started with three principles:
- Measure, don't ask. Behavioral and technical signals are harder to game than self-reported ratings.
- Automate, don't manually review. Manual review doesn't scale, creates bottlenecks, and introduces subjective bias.
- Multiple signals, not one number. A single metric (stars, downloads, retention) can be gamed. Six signals working together are much harder to manipulate.
The six signals
Every listing on RuleSell gets a Quality Score between 0 and 100, composed of six weighted signals. Here's what they are, how we measure them, and why each one matters.
1. Trigger reliability (weight: 20%)
**What it measures:** Does the skill/agent/plugin actually activate when a user needs it?

**How we test it:** We maintain a curated corpus of real user phrases — things developers actually type when they need a capability. We test each listing's description and trigger configuration against this corpus and measure the match rate.

**Why it matters:** The best skill in the world is worthless if it doesn't trigger. Claude Code's skill discovery relies on the description field — Claude's language model matches user intent against skill descriptions. A vague description like "helps with database stuff" won't match "I need to write a migration for Postgres." We test this empirically.

**What fails:** Descriptions shorter than 80 characters. Descriptions that use abstract language instead of concrete user scenarios. Skills that trigger on everything (too broad) or nothing (too narrow).

2. Token efficiency (weight: 15%)
**What it measures:** How many tokens does the asset cost to load and use?

**How we calculate it:** We measure the total token footprint: SKILL.md body + average reference file loads + per-turn overhead. A skill that loads a 3,000-line reference file on every activation costs more than one that uses progressive disclosure to load only what's needed.

**Why it matters:** The context window is a shared resource. Your skill competes with the conversation history, other skills, the system prompt, and the user's actual request. A great skill that costs 5,000 tokens per activation is objectively worse than a good skill that costs 500 — because it leaves less room for everything else. Research confirms this matters: skill overhead runs approximately 1,500+ tokens per turn compared to ~100 tokens for a normal tool call. Skills that bloat beyond necessity degrade the entire session.
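To make the footprint arithmetic concrete, here is a minimal sketch of how such an estimate might be computed. The ~4-characters-per-token heuristic and the `activation_rate` parameter are illustrative assumptions, not RuleSell's actual pipeline, which would use a real tokenizer.

```python
# Rough token-footprint estimate for a skill. The 4-chars-per-token
# heuristic is a common approximation; real scoring would run the
# model's actual tokenizer.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def skill_footprint(skill_md: str, reference_files: list[str],
                    activation_rate: float = 1.0) -> int:
    """Expected tokens per activation: the SKILL.md body plus the
    reference files that actually get loaded. Progressive disclosure
    means activation_rate < 1.0 for rarely needed references."""
    body = estimate_tokens(skill_md)
    refs = sum(estimate_tokens(f) for f in reference_files)
    return body + int(refs * activation_rate)

# A skill that inlines everything vs. one that defers references:
inlined = skill_footprint("x" * 20_000, [])             # 5000 tokens
deferred = skill_footprint("x" * 2_000, ["y" * 18_000],
                           activation_rate=0.1)         # 950 tokens
```

The comparison shows why progressive disclosure dominates: the deferred skill carries the same material but only pays for it when a session actually needs it.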
**What fails:** Inlined reference material that should be in separate files. Verbose explanations of things Claude already knows. Skills over 500 lines without progressive disclosure.

3. Schema cleanliness (weight: 15%)
**What it measures:** Does the YAML frontmatter, manifest, or MCP capability definition actually validate?

**How we test it:** We validate against the published specifications. For Claude Code skills, that means:

- `name` is max 64 characters, lowercase letters/numbers/hyphens only, no reserved words.
- `description` is max 1024 characters, non-empty, no XML tags.

For MCP servers, we validate the tool declarations against the actual tool implementations.
**Why it matters:** An invalid schema means the asset either won't install, won't be discovered, or will behave unpredictably. Surprisingly common: MCP servers that declare tools in their manifest that don't match their actual implementation. Skills with `name` fields that violate the spec.
**What fails:** Missing required frontmatter fields. Names with spaces or uppercase characters. MCP servers with mismatched tool declarations. Plugin directories missing `plugin.json`.
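The frontmatter rules above are mechanical enough to sketch as code. This is an illustrative validator, not RuleSell's actual implementation; the reserved-word list here is a made-up placeholder, since the real one comes from the published spec.

```python
import re

# Hypothetical reserved words for illustration only; the real list
# is defined by the published skill specification.
RESERVED = {"anthropic", "claude"}

# Lowercase letters/numbers, hyphen-separated, per the rules above.
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
XML_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")

def validate_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of spec violations; an empty list means valid."""
    errors = []
    if len(name) > 64:
        errors.append("name exceeds 64 characters")
    if not NAME_RE.match(name):
        errors.append("name must be lowercase letters/numbers/hyphens")
    if name in RESERVED:
        errors.append("name uses a reserved word")
    if not description:
        errors.append("description is empty")
    if len(description) > 1024:
        errors.append("description exceeds 1024 characters")
    if XML_TAG_RE.search(description):
        errors.append("description contains XML tags")
    return errors

ok = validate_frontmatter(
    "postgres-migrations",
    "Writes and reviews Postgres schema migrations.",
)  # []
bad = validate_frontmatter("My Skill", "db stuff")
```

The second call fails the name check ("My Skill" has a space and an uppercase letter), which is exactly the kind of error that would otherwise surface only at install time.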
4. Install success rate (weight: 20%)
**What it measures:** Does the asset work on first install in a clean environment?

**How we test it:** We run automated install tests on fresh environments — no pre-existing configuration, no assumed dependencies. If the listing says it works with Claude Code, we install it in Claude Code. If it says it works with Cursor, we test in Cursor.

**Why it matters:** The number one reason users abandon marketplace assets is a failed first-run experience. A developer installs something, it doesn't work, they uninstall it and never come back. This is the retention signal VS Code should be measuring but isn't.

**What fails:** Hardcoded paths (`/Users/me/code/myplugin/`). Missing dependency declarations. Platform-specific code without platform checks. Scripts that assume `python` is in PATH.
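One slice of that failure list can be caught before install even runs. Here is a hedged sketch of a pre-install lint for hardcoded user paths; the regex covers only macOS- and Linux-style home paths and is illustrative, not RuleSell's actual checker.

```python
import re

# Flag hardcoded user-specific paths, which break the moment anyone
# other than the author installs the asset. Illustrative subset:
# covers /Users/... (macOS) and /home/... (Linux) only.
HARDCODED_PATH = re.compile(r"(/Users/[\w./-]+|/home/[\w./-]+)")

def find_hardcoded_paths(source: str) -> list[str]:
    """Return every hardcoded user path found in the source text."""
    return [m.group(0) for m in HARDCODED_PATH.finditer(source)]

script = "DATA_DIR = '/Users/me/code/myplugin/data'"
hits = find_hardcoded_paths(script)  # ['/Users/me/code/myplugin/data']
```

A real scanner would also check dependency declarations and platform guards, but even this single pattern catches one of the most common first-run failures.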
5. Freshness (weight: 15%)
**What it measures:** When was the asset last meaningfully updated?

**How we track it:** We monitor the source repository (if linked) for commits, and we track version bumps on the listing itself. We distinguish between meaningful updates (functionality changes, compatibility fixes) and cosmetic ones (README tweaks).

**Why it matters:** AI tools move fast. Claude Code's plugin API evolves. MCP's spec evolves. Cursor's rules format evolves. An asset that was perfect six months ago may be broken today. Freshness isn't a guarantee of quality, but staleness is a strong signal of decay.

**What fails:** Any listing with no updates in 90+ days that claims "latest" compatibility. Version-locked dependencies on fast-moving tools. Assets that were built for one version and never tested against subsequent releases.

6. Security scan (weight: 15%)
**What it measures:** Does the asset touch credentials, make unexpected network calls, or fail static analysis?

**How we scan:** We run static analysis for common vulnerability patterns: command injection (the #1 MCP vulnerability category at 43% of CVEs), path traversal, SSRF, credential access, and unrestricted CORS. For MCP servers, we check authentication requirements and validate that tool scoping matches declared capabilities.

**Why it matters:** When 38.7% of MCP servers require no authentication and 82% of implementations are vulnerable to path traversal, security can't be optional. A marketplace that distributes insecure assets is distributing risk.

**What fails:** MCP servers that read the home directory on startup. Hooks without exit timeouts. Scripts with shell injection vulnerabilities. Assets that require `tools: "*"` (unrestricted tool access).
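To show the shape of such a scan, here is a minimal sketch of pattern-based static checks. The patterns and category names are illustrative assumptions; a production scanner would use a proper parser and taint analysis rather than regexes.

```python
import re

# Illustrative static-analysis patterns, one per vulnerability class
# named above. A real scanner parses the code; regexes are only a
# sketch of the idea.
PATTERNS = {
    "command-injection": re.compile(
        r"os\.system\(|subprocess\.\w+\([^)]*shell=True"),
    "credential-access": re.compile(
        r"\.aws/credentials|\.ssh/id_rsa|\.npmrc"),
    "unrestricted-tools": re.compile(r"tools:\s*[\"']\*[\"']"),
}

def scan(source: str) -> list[str]:
    """Return the names of every vulnerability pattern that matches."""
    return [name for name, pat in PATTERNS.items() if pat.search(source)]

findings = scan("subprocess.run(cmd, shell=True)")  # ['command-injection']
```

Even this toy version flags the `shell=True` injection vector and the `tools: "*"` anti-pattern from the failure list, which is the point: these checks are cheap, deterministic, and hard to game.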
How the score combines
The six signals are weighted and combined into a single 0-100 score. The weights reflect what we believe matters most for a user's experience:
| Signal | Weight | Rationale |
|---|---|---|
| Trigger reliability | 20% | If it doesn't activate, nothing else matters |
| Install success rate | 20% | If it doesn't install, nothing else matters |
| Token efficiency | 15% | Daily cost of using the asset |
| Schema cleanliness | 15% | Correctness of the specification |
| Freshness | 15% | Likelihood the asset still works |
| Security scan | 15% | Risk of installing the asset |
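The combination itself is a plain weighted sum. This sketch encodes the table above; the sample sub-scores are invented for illustration.

```python
# Weights from the table above; they sum to 1.0. Each signal is a
# 0-100 sub-score, so the combined score is also 0-100.
WEIGHTS = {
    "trigger_reliability": 0.20,
    "install_success": 0.20,
    "token_efficiency": 0.15,
    "schema_cleanliness": 0.15,
    "freshness": 0.15,
    "security": 0.15,
}

def quality_score(signals: dict[str, float]) -> float:
    """Combine six 0-100 sub-scores into one weighted 0-100 score."""
    assert set(signals) == set(WEIGHTS), "all six signals required"
    return round(sum(WEIGHTS[k] * signals[k] for k in WEIGHTS), 1)

# Hypothetical listing: strong triggers and installs, weaker freshness.
score = quality_score({
    "trigger_reliability": 90, "install_success": 100,
    "token_efficiency": 70, "schema_cleanliness": 100,
    "freshness": 60, "security": 80,
})  # 84.5
```

Because no single signal exceeds 20% of the total, gaming one dimension moves the score far less than genuinely improving the asset does.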
What we explicitly don't use
- Star ratings. Users don't rate, friends inflate, bots corrupt.
- Download counts. Distribution measures marketing, not quality. Transitive installs inflate counts.
- "Trending" algorithms. Recency bias penalizes stable, well-maintained assets.
- Manual editorial curation. Doesn't scale. Introduces bias. Creates bottlenecks.
The hard part: calibration
The honest truth is that v1 of our scoring pipeline uses fixture data — predefined test cases and static analysis rules. It's good enough to separate obviously broken from obviously working, and it catches the anti-patterns we've documented elsewhere. But it's not perfect.
v2, coming in our second month, will incorporate live telemetry from real installs. When users install an asset through RuleSell and it works (or doesn't), that data feeds back into install success rate scoring. When a skill triggers in a real session (or fails to), that refines trigger reliability.
We're building this in the open because we think the marketplace quality problem matters. Star ratings had their run. It's time for measured quality.
See the full scoring model at our Trust & Quality page, or browse listings sorted by Quality Score.
If you're building skills or MCP servers, our skill-building guide walks through how to optimize for each of these six signals. For the security angle specifically, read The real state of MCP servers in 2026 to understand why security scanning matters. And check out Claude Code skills and MCP servers that already score well.