
Category

Datasets

Datasets are the training, evaluation, and retrieval corpora that everything downstream depends on — fine-tuning runs, eval suites, RAG indexes. The quality of your model output is capped by the quality of the data underneath it, and most public datasets ship with no provenance, unclear licensing, and silent contamination. The datasets below are quality-scored on how completely they document their schema, their source, and their license, and each is attributed to the author who curated it so you can trace where the rows came from. License clarity matters most here — training on data you can't legally use is a liability you inherit. Pull a vetted dataset and you skip the collection and cleaning grind, starting from a corpus other engineers have already inspected. Browse the top datasets below ranked by quality score, or open the full catalog to filter by domain, format, and the task each one is built to support.

Top datasets right now

Ranked by quality score — freshness, schema completeness, and review signals. Refreshed daily.

No dataset listings yet during beta, but they're landing fast. Be the first to publish in Datasets:
