LLM Scorecard

The 2025 LLM Scorecard from Datasaur offers a practical framework for evaluating top language models across Privacy, Quality, Cost, and Speed, helping you choose the right model based on real-world performance, not hype. Whether you value self-hosting, affordability, or raw intelligence, this guide simplifies your decision with clear, comparative grades.
by Datasaur on June 3, 2025

The leading language models of 2025 should not be judged on raw intelligence alone; instead, their utility should be weighed as a careful balance of trust, performance, affordability, and speed. At Datasaur, we believe every organization and university deserves a practical framework for deciding which large‑language model (LLM) is right for them.

This scorecard compares eight of the most prominent models—ChatGPT‑4o, ChatGPT‑4.1, Claude 3.7 Sonnet, Claude Opus 4, Llama 3.3 (70B), Llama 4 Scout, Gemini 2.5 Pro, and Mistral Small 3.1—across four essential pillars: Privacy, Quality, Cost, and Speed. We will continue adding models as they are released, so you can revisit this page as an up-to-date quick reference.

These models were chosen for their widespread adoption, market relevance, and technical versatility. Whether you prioritize open‑source control, reasoning depth, or real‑time responsiveness, this scorecard is designed to move you beyond the hype and toward the best fit for you and your team.

Why These Models?

  • ChatGPT‑4o – OpenAI’s multimodal flagship with ~800M weekly active users sets the consumer standard for conversational AI.
  • ChatGPT‑4.1 – The newly released GPT‑4 upgrade offers a 1M‑token context window, faster inference, and ~25% lower token costs.
  • Claude 3.7 Sonnet – Anthropic’s model excels at long‑form reasoning and safe completion. Its “extended thinking” mode has gained traction in coding, education, and research.
  • Claude Opus 4 – Known for its thoughtful, structured responses and high safety alignment, it’s quickly become a favorite for research, legal, and enterprise use cases.
  • Llama 3.3 (70B) – Meta’s open‑source workhorse balances strong accuracy with full self‑hosting freedom; over 650M downloads attest to its popularity.
  • Llama 4 Scout – Meta’s new mixture‑of‑experts (MoE) architecture delivers GPT‑4‑class quality while remaining open‑weight and remarkably compute‑efficient.
  • Gemini 2.5 Pro – Google DeepMind’s multimodal powerhouse boasts a 2M‑token window and deep integration across Google products.
  • Mistral Small 3.1 – A lightweight Apache‑2.0 model that punches above its size on reasoning benchmarks and achieves industry‑leading throughput.

Grading Criteria

  • Quality – Accuracy, reasoning depth, benchmark performance.
  • Cost – Per‑token pricing, scalability, and open‑source availability.
  • Speed – Typical response latency and throughput under real‑world loads.
  • Privacy – Self‑hosting ability, fine‑tuning freedom, and data‑control guarantees.

Quality Scoring Methodology

To assess overall model quality, we combined two complementary perspectives: objective benchmark performance (MMLU) and subjective human preference (Chatbot Arena Elo scores). We weighted these at 80% MMLU and 20% Arena, prioritizing academic reasoning while still accounting for real-world user experience. This approach rewards models that demonstrate strong general knowledge and logical consistency, while acknowledging the practical value of conversational fluency and helpfulness.

Following standard academic grading, an LLM that scores an 87 receives a B+, one that scores a 72 receives a C−, and so on.
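
To make the weighting concrete, here is a minimal sketch of how such an 80/20 blend and letter-grade mapping could be computed. The Arena Elo normalization range below is an assumption for illustration only and is not part of Datasaur's published methodology:

```python
def blended_quality(mmlu_pct: float, arena_elo: float,
                    arena_min: float = 1200.0, arena_max: float = 1450.0) -> float:
    """Blend an MMLU percentage with a Chatbot Arena Elo score at 80/20.

    The Elo normalization range is an assumption for illustration only.
    """
    arena_pct = 100.0 * (arena_elo - arena_min) / (arena_max - arena_min)
    arena_pct = max(0.0, min(100.0, arena_pct))  # clamp to the 0-100 scale
    return 0.8 * mmlu_pct + 0.2 * arena_pct


def letter_grade(score: float) -> str:
    """Map a 0-100 blended score to a standard academic letter grade."""
    cutoffs = [(97, "A+"), (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
               (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (0, "F")]
    return next(grade for cutoff, grade in cutoffs if score >= cutoff)


# Example: a model with 87% MMLU and a 1,380 Elo blends to 84.0 -> "B"
print(letter_grade(blended_quality(mmlu_pct=87.0, arena_elo=1380)))
```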

Cost Scoring Rubric (SaaS Pricing Only)

Grade Price Range per 1 Million Tokens
A+ ≤ $0.30
A $0.31 – $0.50
A− $0.51 – $1.00
B+ $1.01 – $2.00
B $2.01 – $3.00
B− $3.01 – $4.00
C+ $4.01 – $5.00
C $5.01 – $6.00
C− > $6.00

Note: All cost grades reflect the pricing of standard SaaS offerings, not self-hosted or open-weight deployments. This ensures a consistent, fair comparison.
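
Applied in code, the rubric is a simple threshold lookup. A minimal sketch, with an illustrative function name that is not part of any Datasaur tooling:

```python
def cost_grade(price_per_million_tokens: float) -> str:
    """Map a SaaS price per 1M tokens (USD) to the cost grade in the rubric above."""
    brackets = [
        (0.30, "A+"), (0.50, "A"), (1.00, "A-"),
        (2.00, "B+"), (3.00, "B"), (4.00, "B-"),
        (5.00, "C+"), (6.00, "C"),
    ]
    for ceiling, grade in brackets:
        if price_per_million_tokens <= ceiling:
            return grade
    return "C-"  # anything above $6.00 per 1M tokens


print(cost_grade(4.40))  # ChatGPT-4o's listed price -> C+
print(cost_grade(0.27))  # Llama 4 Scout's listed price -> A+
```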

Speed Scoring Rubric

We grade model speed primarily on output tokens per second, as it directly impacts how quickly users receive full responses—especially for long-form content. High output speed ensures smooth user experience and reflects a model’s real-world efficiency.

Grade TPS Range
A+ ≥ 140
A 130 – 139
A− 120 – 129
B+ 110 – 119
B 100 – 109
B− 90 – 99
C+ 75 – 89
C 60 – 74
C− < 60
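
To put these throughput tiers in perspective, here is a rough back-of-the-envelope sketch of how output speed translates into wait time for a long response (the 1,000-token response length is an arbitrary example):

```python
def response_time_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate time to stream a full response, ignoring time-to-first-token."""
    return output_tokens / tokens_per_second


# A 1,000-token answer at the A+ threshold (140 TPS) vs. the C- boundary (60 TPS):
print(f"{response_time_seconds(1000, 140):.1f}s")  # ~7.1s
print(f"{response_time_seconds(1000, 60):.1f}s")   # ~16.7s
```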

Privacy Scoring Rubric

Grade Description
A Fully self-hostable; no third-party data exposure. Total control over inference, storage, and data retention.
B Cloud-hosted with strong guarantees (e.g., no data retention, opt-out controls, SOC 2 or ISO 27001 compliance, encryption at rest/in transit).
C Cloud-only and lacks clear data controls or guarantees; shared infrastructure with potential retention risk.

The 2025 LLM Scorecard

Model Privacy Quality Cost Speed
ChatGPT‑4o B B+ C+ B+
ChatGPT‑4.1 B B+ B− A
Claude 3.7 Sonnet B C C C+
Claude Opus 4 B B+* C− C−
Llama 3.3 (70B) A C A− B−
Llama 4 Scout A D+ A+ A−
Gemini 2.5 Pro B A+ B− A+
Mistral Small 3.1 A D+ A+ A−

Model Summaries

ChatGPT‑4o (OpenAI)

  • Privacy: B
    Cloud-only, but OpenAI does not use API data for training, encrypts all data, and supports SOC 2 compliance and opt-out settings for data usage.
  • Quality: A−
    Strong scores, slightly behind GPT-4.1.
  • Cost: C+
    $4.40 per 1M tokens.
  • Speed: B+
    112.5 tokens per second.

ChatGPT‑4.1 (OpenAI)

  • Privacy: B
    Same as 4o; OpenAI provides encryption, retention limits, and user-level privacy controls but is not self-hostable.
  • Quality: A−
    High MMLU benchmark and user-rated fluency.
  • Cost: B−
    $3.50 per 1M tokens.
  • Speed: A
    131 tokens per second.

Claude 3.7 Sonnet (Anthropic)

  • Privacy: B
    Cloud-only but backed by SOC 2 certification, 30-day default retention, no training on inputs, and strong encryption protocols.
  • Quality: B
    Excellent at long-form and nuanced completions.
  • Cost: C
    Premium API cost, priced around $6 per 1M tokens.
  • Speed: C+
    78 tokens per second.

Llama 3.3 (70B) (Meta)

  • Privacy: A 
    Fully open-weight; self-hostable.
  • Quality: C
    Lower user preference and academic scores.
  • Cost: A−
    Output token price: $0.70 per 1M tokens.
  • Speed: B−
    Around 96 tokens per second.

Llama 4 Scout (Meta)

  • Privacy: A
    Self-hostable mixture-of-experts model.
  • Quality: B
    Respectable MMLU and user-feedback scores, though below the other major models.
  • Cost: A+
    $0.27 per 1M tokens.
  • Speed: A−
    121 tokens per second.

Gemini 2.5 Pro (Google DeepMind)

  • Privacy: B
    Cloud-hosted via Google Cloud with opt-out data caching, no training on prompts by default, and enterprise-grade encryption and compliance.
  • Quality: A
    Frontier model on MMLU and user feedback.
  • Cost: B−
    $3.44 per 1M tokens.
  • Speed: A+
    148 tokens per second.

Mistral Small 3.1 (Mistral)

  • Privacy: A
    Apache 2.0 licensed; fully deployable.
  • Quality: C
    Impressive for its size, but trails larger models in both MMLU and feedback.
  • Cost: A+
    $0.15 per 1M tokens.
  • Speed: A−
    ~125 tokens per second.

Claude Opus 4 (Anthropic)

  • Privacy: B
    Cloud-hosted by Anthropic with no training on user inputs by default, 30-day data deletion policy, encryption in transit and at rest, and optional enterprise zero-retention agreements.
  • Quality: B+*
    Strong MMLU performance (87.4%), but not yet ranked on Chatbot Arena, so the blended score is incomplete.
  • Cost: C−
    ~$30 per 1M tokens.
  • Speed: C−
    ~54 tokens per second.

Interpreting This Scorecard

Depending on what you prioritize, consider these models:
  • Privacy & self‑hosting – Llama 3.3, Llama 4 Scout, Mistral Small 3.1
  • Best raw quality – ChatGPT‑4.1, ChatGPT‑4o, Gemini 2.5 Pro, Claude Opus 4
  • Lowest cost to scale – Llama 3.3, Llama 4 Scout, Mistral Small 3.1
  • Fastest real‑time output – Gemini 2.5 Pro, Mistral Small 3.1, ChatGPT‑4.1
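
One way to make this kind of shortlisting reproducible is to treat the scorecard as data and rank models by whichever pillar matters most to you. A minimal sketch, with grades transcribed from the scorecard table above and illustrative helper names:

```python
# Grades from the 2025 LLM Scorecard table above (Privacy, Quality, Cost, Speed).
SCORECARD = {
    "ChatGPT-4o":        {"privacy": "B", "quality": "B+", "cost": "C+", "speed": "B+"},
    "ChatGPT-4.1":       {"privacy": "B", "quality": "B+", "cost": "B-", "speed": "A"},
    "Claude 3.7 Sonnet": {"privacy": "B", "quality": "C",  "cost": "C",  "speed": "C+"},
    "Claude Opus 4":     {"privacy": "B", "quality": "B+", "cost": "C-", "speed": "C-"},
    "Llama 3.3 (70B)":   {"privacy": "A", "quality": "C",  "cost": "A-", "speed": "B-"},
    "Llama 4 Scout":     {"privacy": "A", "quality": "D+", "cost": "A+", "speed": "A-"},
    "Gemini 2.5 Pro":    {"privacy": "B", "quality": "A+", "cost": "B-", "speed": "A+"},
    "Mistral Small 3.1": {"privacy": "A", "quality": "D+", "cost": "A+", "speed": "A-"},
}

# Rank grades from worst to best so they can be compared numerically.
GRADE_ORDER = ["D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def shortlist(pillar: str, top_n: int = 3) -> list[str]:
    """Return the top models for one pillar, best grade first (ties keep table order)."""
    ranked = sorted(SCORECARD,
                    key=lambda m: GRADE_ORDER.index(SCORECARD[m][pillar]),
                    reverse=True)
    return ranked[:top_n]


print(shortlist("privacy"))  # the self-hostable, open-weight models rise to the top
print(shortlist("cost"))     # Llama 4 Scout, Mistral Small 3.1, and other A-range models
```

Swapping the single-pillar key for a weighted combination of pillars is a natural next step when no one dimension dominates your requirements.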

You: The Nuance Beyond the Scorecard

This scorecard is a starting point, not a verdict. The best model depends on your goals—whether it's real-time speed, deployment control, or advanced reasoning. Open-weight models like Llama 4 offer efficiency and data ownership, while hosted giants like Gemini 2.5 Pro unlock massive 2M-token contexts for complex workloads. Use these scores to narrow the field—but let your infrastructure and use case make the final call.

Datasaur’s LLM Labs helps you move from paper grades to hands‑on proof. You can test and compare over 250 models in LLM Labs. Use Sandbox to run head‑to‑head prompts, tweak hyper‑parameters, and measure real‑time cost, latency, and accuracy—all against your own data. The best LLM is the one that excels for the most important stakeholder: you.
