What we review
We review the tools we actually use in client engagements or our own studio operations. The selection is opinionated, not exhaustive. If we haven't run a tool on real production work for at least a few weeks, it doesn't get a review — even if it's popular.
We don't accept submissions, pitches, or "have you considered reviewing X?" requests from vendors. We will sometimes add a tool because a paying client asked us to evaluate it for their stack.
How we test
For every reviewed tool:
- Real workload, real money. We pay for our own account on a publicly available plan and run the tool on real client or studio work. No vendor seats, no extended trials, no special pricing.
- Minimum two weeks of usage before we publish, longer for tools where the failure modes only show up at scale (rate limits, deliverability decay, agent loops).
- Cross-comparison on the same job. Where a tool has obvious competitors, we run the same task on at least one alternative and compare the result, the cost, and the friction.
- Read the docs cover-to-cover. Pricing footnotes, rate limits, deprecation policy, the contracts page. Most of the gotchas live there.
- Talk to people running it in production. When we know operators using the tool at meaningful scale, we ask them what's broken before we write the "what's not" section.
The 10-point scale, anchored
Every numeric rating maps to the same plain-English benchmark. The score isn't a feeling; it's a position on this ladder (sketched in code after the list).
- 9.5 – 10.0 — Category-defining. Best-in-class on every axis we test, including price-performance. We'd be surprised if a competitor displaced it within 12 months. (Rare. Two tools currently sit here.)
- 9.0 – 9.4 — Default pick. We reach for it first for the canonical use case in its category, and we'd recommend it without caveats to a client.
- 8.5 – 8.9 — Strong pick with one trade-off. The right answer for most teams in the category, but with one specific situation where we'd recommend a competitor instead. We name the situation in the "Watch out" section.
- 8.0 – 8.4 — Solid, with caveats. Works well, but at least one rough edge means it's not the default for everyone. Often the right answer for a specific niche or budget.
- 7.5 – 7.9 — Conditional. Ships value when the conditions are right (specific use case, specific scale, specific stack alignment), and is a clear miss outside those.
- 7.0 – 7.4 — Niche or transitional. Real value for a small audience, or a tool we expect to either improve or be displaced soon.
- Below 7.0 — Not published. If a tool we test scores below this, we generally don't publish the review: the AI tool market moves fast enough that the team is probably already fixing whatever broke it for us, and there's no signal in writing "this is bad" without a constructive replacement.
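For readers who prefer code to prose, here's the same ladder as a minimal sketch. It's an illustration of the bands above, not the scoring code we actually run, and the function name is made up for the example.

```js
// Illustrative only (not the site's scoring code); the cutoffs mirror the
// published ladder above.
function ratingBand(score) {
  if (score >= 9.5) return "Category-defining";
  if (score >= 9.0) return "Default pick";
  if (score >= 8.5) return "Strong pick with one trade-off";
  if (score >= 8.0) return "Solid, with caveats";
  if (score >= 7.5) return "Conditional";
  if (score >= 7.0) return "Niche or transitional";
  return "Not published";
}

ratingBand(8.7); // "Strong pick with one trade-off"
```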
What moves a score
Five inputs, weighted roughly equally (a sketch of how they roll up follows the list):
- Production reliability. Does it do what you asked, twice in a row, under real load? Schema adherence, instruction-following, retry rates, error handling.
- Price-performance for the canonical use case. Per-call, per-token, per-seat — whatever the tool charges on. We weigh it against the value delivered, not against the cheapest competitor in the category.
- Developer experience. API ergonomics, SDK quality, docs accuracy, time-to-first-call, rate-limit behaviour, observability.
- Honesty of the product surface. Does the marketing match the product? Are the limits and gotchas in the docs or hidden in support tickets? We discount tools where shipping the obvious thing requires fighting the abstraction.
- Stability over time. Deprecation cadence, breaking-change policy, contract terms. A clean v2 migration story beats a flashier feature.
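To make "weighted roughly equally" concrete, here's a small sketch of how five sub-scores could roll up into one rating. The numbers and property names are stand-ins; this shows the shape of the calculation, not a published formula.

```js
// Illustrative only; the sub-scores below are stand-ins, and "roughly equal"
// weighting is modeled here as exactly equal (a plain average).
const subScores = {
  productionReliability: 8.5,
  pricePerformance: 8.0,
  developerExperience: 9.0,
  productSurfaceHonesty: 8.5,
  stabilityOverTime: 8.0,
};

const values = Object.values(subScores);
const overall = values.reduce((sum, v) => sum + v, 0) / values.length;

console.log(overall.toFixed(1)); // "8.4"
```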
What does not affect a score
- Whether the vendor sponsors anything we do. They don't — see our disclosure page for the complete vendor-relationship list.
- Whether the vendor pushes back on a review. If they're factually right we correct it; we don't change opinions.
- How much we like the founders. Several people we know personally run tools reviewed on this site, and those tools do not all score 10/10.
- Brand recognition or category leader status. Being the famous one in a category doesn't earn a higher score; doing the job better than alternatives does.
- How recently the vendor shipped a feature. A new release that doesn't change real usage doesn't move the rating.
How often we re-score
Every review carries an UPDATED · YYYY-MM-DD stamp.
We re-test and re-publish when:
- The vendor ships a meaningful new version (usually a major model release or pricing change).
- We notice the rating no longer reflects how we actually use the tool.
- A reader emails us with a specific factual correction.
We don't re-score on a calendar. A tool that hasn't materially changed in 9 months keeps its rating; a tool that ships a new model tier gets re-evaluated within a few weeks.
How to flag a mistake
Email hello@pintoedai.com with the page URL and the specific claim you think is wrong. We respond within one business day. Factual errors get corrected fast; we'll discuss judgement calls on the merits but we don't edit opinions under pressure.
Where the data lives
Every review's structured rating, category, pricing summary, best-for and watch-out lines live in a single source-of-truth file at data/reviews-index.js. That file powers the reviews index, the all-reviews table, the comparison builder, and the inline review-vs-review widget on each page. If a number is stale, one fix in the index updates it everywhere.
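As an illustration, an entry in a file like that might look like this; the field names below are examples drawn from the fields just listed, not the actual schema.

```js
// Illustrative entry for a reviews index; field names are assumptions based
// on the fields described above, not a schema reference.
export const reviews = [
  {
    slug: "example-tool",            // review page URL segment (assumed)
    category: "Example category",
    rating: 8.6,                     // position on the 10-point ladder
    pricing: "From $X/mo, per-seat", // pricing summary line
    bestFor: "The canonical use case it nails",
    watchOut: "The one situation where a competitor wins",
    updated: "YYYY-MM-DD",           // the UPDATED stamp shown on the page
  },
];
```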