How the numbers work

The About page covers why this exists. This one is the how: where each figure comes from, how the history is rebuilt, and - just as important - what the suite doesn't tell you.

How a score is worked out

Every test runs against live AWS DynamoDB first. Whatever real DynamoDB does is recorded as the expected answer, and an emulator passes a test only if it gives that same answer. That's why DynamoDB sits at the top of every table at a flat 100%: it isn't scored against anything, it is the thing everything else is scored against.

The tests only ever look at observable behaviour. They drive the standard AWS SDK against the target's HTTP endpoint and assert on the response: its shape, the exact error, the order validation fires in. Nothing reaches inside the implementation. If your application would see it through the SDK, the suite checks it; if it wouldn't, the suite doesn't care about it.

Results are split into three tiers - Core, Complete, and Strict - so a single percentage can't hide a fatal gap behind a pile of passing edge cases. Each tier gets its own score, and the total rolls them together.

The percentage is correctness over the operations a target implements: passed divided by the sum of passed and failed, so skipped tests appear in neither the numerator nor the denominator. A skipped test is the target's feature-probe declining to run because the target doesn't implement that operation at all, so a skip is honest scope documentation, not a wrong answer, and it doesn't count against the score. A fail is a different thing: the operation is there, but it behaves differently from real DynamoDB, and that does count. Skips and fails are kept apart because they mean opposite things.
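As a rough illustration of that rule, here is a minimal Python sketch. The names `TierResult` and `correctness` are made up for this example and are not taken from the suite; returning `None` when nothing was attempted is also an assumption about how an empty tier might be handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierResult:
    """Hypothetical container for one tier's test counts."""
    passed: int
    failed: int
    skipped: int = 0


def correctness(result: TierResult) -> Optional[float]:
    """Return passed / (passed + failed) as a percentage.

    Skipped tests are deliberately left out of both the numerator and
    the denominator. Returns None when nothing was attempted (an
    assumption made for this sketch).
    """
    attempted = result.passed + result.failed
    if attempted == 0:
        return None
    return 100.0 * result.passed / attempted


# 90 passes, 10 fails, 50 skips: the skips do not move the score.
example = correctness(TierResult(passed=90, failed=10, skipped=50))
```

With these counts `example` is 90.0, exactly what it would be with zero skips.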

That leaves one gap to close: correctness alone says nothing about how much a target attempts. An emulator that implements a sliver and gets it right would score 100%. So every score on the site travels with a coverage figure - the operations implemented out of the total - and a narrow surface reads as narrow no matter how high its correctness.
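To make the pairing concrete, the sketch below computes a coverage figure as the share of operations a target implements. The function name, the set-based representation, and the short operation list are all assumptions for illustration; the suite may well derive the figure differently.

```python
def coverage(implemented: set, all_operations: set) -> float:
    """Percentage of the known operations that a target implements.

    Both arguments are sets of operation names. Anything in
    `implemented` that is not a known operation is ignored.
    """
    if not all_operations:
        raise ValueError("all_operations must not be empty")
    return 100.0 * len(implemented & all_operations) / len(all_operations)


# Hypothetical operation list, far shorter than the real API surface.
ALL_OPERATIONS = {
    "GetItem", "PutItem", "UpdateItem", "DeleteItem",
    "Query", "Scan", "BatchGetItem", "BatchWriteItem",
}

# A target that implements only two operations, and gets both right,
# would score 100% on correctness while covering a quarter of the surface.
narrow_target = {"GetItem", "PutItem"}
narrow_coverage = coverage(narrow_target, ALL_OPERATIONS)
```

Here `narrow_coverage` is 25.0, which is the number that stops a perfect correctness score from reading as a broad one.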

One consequence worth spelling out: the suite grows. It had 526 tests in March 2026 and over 600 by May. Raw counts from different runs aren't comparable, so every chart and every movement figure on this site is a percentage, never a count.

How runs and movement are reconstructed

The suite publishes each run's results as JSON in its repository, and it has done since the first run. That means the full history is sitting in the git log, and this site rebuilds the timeline from it: it reads every version of those result files, scores each one with the suite's own logic, and assembles the runs you browse here.

A "run" is defined by the timestamp stamped into each result file, grouped by date - not by commit. That distinction matters more than it sounds. A single commit often refreshes only some targets, and one commit can carry results that were actually produced in different runs, so grouping by commit would invent runs that never happened and stitch unrelated results together. Grouping by date is robust to both, even when one run's targets finish over an hour apart.
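A minimal sketch of date-based grouping follows. The field names (`target`, `timestamp`, `commit`), the ISO 8601 timestamps with an explicit offset, and the choice of UTC calendar days are assumptions for this example, not the suite's actual result schema.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_into_runs(result_files):
    """Group result records into runs keyed by the UTC date in each timestamp.

    The commit a record arrived in is ignored on purpose. If a target
    appears more than once on the same date, the latest record wins.
    """
    runs = defaultdict(dict)
    for result in result_files:
        stamped = datetime.fromisoformat(result["timestamp"]).astimezone(timezone.utc)
        day = stamped.date()
        current = runs[day].get(result["target"])
        if current is None or stamped > current[0]:
            runs[day][result["target"]] = (stamped, result)
    # Drop the parsed timestamps and return runs in chronological order.
    return {
        day: {target: record for target, (_, record) in by_target.items()}
        for day, by_target in sorted(runs.items())
    }


# Commit "c2" carries one result from each of two different runs.
results = [
    {"target": "emulator-a", "timestamp": "2026-03-14T09:00:00+00:00", "commit": "c1"},
    {"target": "emulator-b", "timestamp": "2026-03-14T10:30:00+00:00", "commit": "c2"},
    {"target": "emulator-a", "timestamp": "2026-03-21T09:00:00+00:00", "commit": "c2"},
]
runs = group_into_runs(results)
```

Grouping by commit would have produced a "c1" run with one target and a "c2" run mixing two dates. Grouping by date yields two runs: 14 March with both emulators, finished ninety minutes apart, and 21 March with one.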

When a target isn't re-tested in a run, its last measured result is carried forward and labelled as such, rather than dropped or silently restated as fresh. Movement compares a target against the previous run it was actually tested in, so the arrow always means "since last measured", never "since some run where nothing changed".
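The carry-forward and movement rules can be sketched as below. The data shapes and the name `build_timeline` are hypothetical, and showing no movement arrow on a carried-forward entry is an assumption made for this example.

```python
def build_timeline(runs, targets):
    """Build one row per run, carrying forward targets that were not re-tested.

    `runs` is a chronological list of (date, {target: score}) pairs.
    Movement is measured against the previous run in which the target
    was actually tested, and is None for a first measurement or a
    carried-forward entry.
    """
    last_measured = {}
    timeline = []
    for day, measured in runs:
        row = {}
        for target in targets:
            if target in measured:
                score = measured[target]
                previous = last_measured.get(target)
                movement = None if previous is None else score - previous
                row[target] = {"score": score, "carried_forward": False, "movement": movement}
                last_measured[target] = score
            elif target in last_measured:
                # Not re-tested in this run: reuse the last real score and label it.
                row[target] = {
                    "score": last_measured[target],
                    "carried_forward": True,
                    "movement": None,
                }
        timeline.append((day, row))
    return timeline


# Made-up scores: emulator-b is not re-tested in the second run.
timeline = build_timeline(
    [
        ("2026-03-01", {"emulator-a": 80.0, "emulator-b": 70.0}),
        ("2026-03-08", {"emulator-a": 85.0}),
        ("2026-03-15", {"emulator-a": 85.0, "emulator-b": 75.0}),
    ],
    targets=["emulator-a", "emulator-b"],
)
```

In the second run `emulator-b` appears with its earlier 70.0, flagged as carried forward. In the third run its movement is +5.0, measured against the first run where it was last tested, not against the carried-forward row in between.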

The scoring logic here is a direct port of the suite's own, pinned by a test that fails the build if the two ever disagree. That's deliberate. The single rule behind this whole site is that no figure is ever typed in by hand, because the moment the same number lives in two places it starts to drift, and a number that has quietly drifted is worse than no number at all.
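One way such a pinning test could look is sketched below. Both scoring functions are stand-ins written for this example, as are the fixture counts; the real test would call the suite's actual scoring code and the site's actual port.

```python
def suite_score(passed, failed):
    """Stand-in for the suite's own scoring logic (hypothetical)."""
    total = passed + failed
    return None if total == 0 else 100.0 * passed / total


def site_score(passed, failed):
    """Stand-in for the site's port of that logic (hypothetical)."""
    if passed + failed == 0:
        return None
    return passed * 100.0 / (passed + failed)


# Made-up (passed, failed) pairs, including the empty and all-fail cases.
FIXTURES = [(0, 0), (1, 0), (0, 1), (500, 26), (333, 667)]


def test_port_matches_suite():
    """Fail loudly if the port and the reference ever disagree."""
    for passed, failed in FIXTURES:
        expected = suite_score(passed, failed)
        actual = site_score(passed, failed)
        assert actual == expected, (
            f"port drifted for passed={passed}, failed={failed}: "
            f"{actual!r} != {expected!r}"
        )


test_port_matches_suite()
```

Run as part of the build, a check like this turns any divergence between the two implementations into a failure rather than a quietly wrong chart.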

Limitations

A score here is a useful signal, not a certificate. Worth keeping in mind:

  • It only tests what it tests. A behaviour with no test is a blind spot, not a pass. The suite's coverage of DynamoDB's behaviour is good and growing, but "100% Tier 1" means "100% of the Tier 1 (Core) tests that exist", not "every Core behaviour DynamoDB has".
  • Scores are tier-level, not per-operation. The suite records how many tests passed in each tier, not a stable per-operation pass/fail map. You can see that an emulator is weak on Tier 2; you can't read off exactly which operations from this site alone. The suite's repo has the detail.
  • Every result is a point in time. A score is tied to the specific version of the target tested on that date, against DynamoDB's behaviour on that date. Both move. An old run reflects an old version of both sides.
  • Behaviour only, nothing else. The suite says nothing about performance, scalability, durability, cost, or operational fit. An emulator can match DynamoDB's behaviour perfectly and still be the wrong tool for your job, or the right one despite a lower score.
  • Configuration matters. Targets are tested in a representative setup. A differently configured deployment may behave differently.

Trademarks and attribution

Amazon DynamoDB, DynamoDB, and AWS are trademarks of Amazon.com, Inc. or its affiliates. This is an independent project and is not affiliated with, endorsed by, or sponsored by Amazon, and nothing here grants any right to use those names or marks. DynamoDB Local, Dynalite, LocalStack, Ministack, Floci, ExtendDB, and every other target named on this site are the trademarks or property of their respective owners.

The conformance suite is the work of Martin Hicks and its contributors, released under the Apache License 2.0; see the NOTICE for the full attribution. This site reads that suite's published results and presents them; it's built and maintained by Martin Hicks.