Current rerun generated 2026-05-01T04:38:49.849Z (Apr 30, 2026, 11:38 PM Central)

Audrey MemoryGym Benchmark Evidence

This page shows the current MemoryGym Core rerun after an Audrey isolation fix. It replaces the stale synthetic Audrey chart and includes the before/after numbers.

Claim boundary: this is a current MemoryGym local benchmark artifact. It is not an official AMB, LoCoMo, LongMemEval, Papers with Code, or CodeSOTA leaderboard score.

Audrey HTTP Score

57.5%

Up from 34.3% before the tag isolation fix (+23.2 pts).

Audrey Hit Rate

85.7%

Expected memory appeared in top-k on 6 of 7 probes, up from 3 of 7.

Audrey P95 Recall

3.8 ms

P95 recall latency rose by 0.1 ms after the fix.
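The headline figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted on this page; the variable names are illustrative and are not part of MemoryGym or Audrey.

```python
# Figures quoted on this page.
score_before, score_after = 34.3, 57.5   # Audrey HTTP score, percent
hits_before, hits_after, probes = 3, 6, 7  # probes with the expected memory in top-k

# Score change in percentage points.
score_delta = round(score_after - score_before, 1)

# Hit rate is the share of probes where the expected memory appeared in top-k.
hit_rate_before = round(100 * hits_before / probes, 1)
hit_rate_after = round(100 * hits_after / probes, 1)

print(f"score delta: +{score_delta} pts")
print(f"hit rate: {hit_rate_before}% -> {hit_rate_after}%")
```

The derived 85.7% matches the reported hit rate, and the 3 of 7 baseline works out to 42.9%.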

Before And After

Audrey MemoryGym before and after chart

Current Graphs

MemoryGym score chart
MemoryGym hit rate chart
MemoryGym p95 recall latency chart

Adapter Results After Fix

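| Adapter | Score | Hit Rate | P95 Recall | Precision | Contamination Penalty |
| --- | --- | --- | --- | --- | --- |
| typed-semantic | 72.7% | 100.0% | 0.4 ms | 34.5% | 45.2% |
| hybrid | 72.7% | 100.0% | 0.3 ms | 34.5% | 45.2% |
| audrey-http | 57.5% | 85.7% | 3.8 ms | 29.8% | 50.0% |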

What Changed

Audrey recall previously treated multiple tags as an OR filter. MemoryGym passes three tags on recall: memorygym, the run id, and the scenario id. Because every row carries the memorygym tag, OR matching let memories from unrelated scenarios through. Audrey now requires all requested tags to match (AND semantics), which raised the Audrey HTTP score from 34.3% to 57.5%.
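The difference between the two filter semantics can be illustrated with a minimal sketch. This is not Audrey's implementation: the `Memory` class, the `recall_any` and `recall_all` functions, the `run-1` and `run-0` tag values, and the `stale-run-memory` id are all hypothetical, and the scenario names are used as tag values purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    id: str
    tags: frozenset  # tag strings attached to the memory


def recall_any(memories, requested):
    """OR semantics: keep a memory that shares at least one requested tag."""
    wanted = set(requested)
    return [m for m in memories if m.tags & wanted]


def recall_all(memories, requested):
    """AND semantics: keep a memory only if it carries every requested tag."""
    wanted = set(requested)
    return [m for m in memories if wanted <= m.tags]


# Hypothetical store: every row shares the "memorygym" tag.
store = [
    Memory("maya-workspace-current",
           frozenset({"memorygym", "run-1", "typed-profile-updates"})),
    Memory("audrey-benchmark-gate",
           frozenset({"memorygym", "run-1", "retrieval-context-routing"})),
    Memory("stale-run-memory",
           frozenset({"memorygym", "run-0", "typed-profile-updates"})),
]

query_tags = ["memorygym", "run-1", "typed-profile-updates"]
or_ids = [m.id for m in recall_any(store, query_tags)]
and_ids = [m.id for m in recall_all(store, query_tags)]

print("OR :", or_ids)   # all three rows match via the shared "memorygym" tag
print("AND:", and_ids)  # only the row from this run and this scenario
```

Under OR matching the shared tag alone is enough to admit a row, so the run id and scenario id provide no isolation. Under AND matching they do.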

Remaining Failures

The next benchmark work targets current-belief precedence and context-aware ranking. Audrey now usually retrieves the expected memory, but it still ranks stale or wrong-project memories above it.

| Probe | Query | Result | Returned IDs | Top Recall |
| --- | --- | --- | --- | --- |
| typed-profile-updates / current-workspace | Which collaboration environment should Maya use right now? | 0.0% score, 0.0% hit | maya-workspace-old, jonas-distractor-workspace, maya-pref-morning | Maya's active workspace used to be Atlas for the onboarding sprint. |
| typed-profile-updates / meeting-preference | How should Maya receive meeting preparation? | 70.7% score, 100.0% hit | maya-pref-morning, maya-workspace-old, maya-workspace-current | Maya prefers morning async notes before any meeting-heavy work. |
| retrieval-context-routing / audrey-release-gate | Which project requires a benchmark gate and doctor check for release validation? | 59.0% score, 100.0% hit | termivibe-first-contact, audrey-benchmark-gate, repopulse-trace-contract | TermiVibe first contact validation should cover config precedence, typed mode, no wake word mode, and telemetry persistence. |
| retrieval-context-routing / repopulse-trace | Which memory mentions pipeline version and dense rerank trace boundaries? | 57.0% score, 100.0% hit | audrey-benchmark-gate, repopulse-trace-contract, termivibe-first-contact | Audrey release validation includes build, typecheck, benchmark gate, doctor, demo, pack dry-run, and host smoke coverage. |
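The ranking problem shows up when the position of the expected memory is computed for each probe. The sketch below uses the returned IDs from the table above. The expected IDs are an assumption inferred from the probe names and the reported hit values; they are not listed on this page, and `rank_of` is a hypothetical helper.

```python
# probe name -> (returned IDs in rank order, ASSUMED expected ID)
probes = {
    "current-workspace": (
        ["maya-workspace-old", "jonas-distractor-workspace", "maya-pref-morning"],
        "maya-workspace-current",
    ),
    "meeting-preference": (
        ["maya-pref-morning", "maya-workspace-old", "maya-workspace-current"],
        "maya-pref-morning",
    ),
    "audrey-release-gate": (
        ["termivibe-first-contact", "audrey-benchmark-gate", "repopulse-trace-contract"],
        "audrey-benchmark-gate",
    ),
    "repopulse-trace": (
        ["audrey-benchmark-gate", "repopulse-trace-contract", "termivibe-first-contact"],
        "repopulse-trace-contract",
    ),
}


def rank_of(expected, returned):
    """Return the 1-based rank of the expected ID, or None if it was not returned."""
    return returned.index(expected) + 1 if expected in returned else None


ranks = {name: rank_of(expected, returned)
         for name, (returned, expected) in probes.items()}

for name, rank in ranks.items():
    print(f"{name}: {'miss' if rank is None else f'rank {rank}'}")
```

Under these assumed expected IDs, one probe misses entirely and two of the three hits land at rank 2 behind a wrong-project memory, which is consistent with hit values of 100.0% paired with scores below 60%.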

Submission Status

Already submitted/listed for review: MCP.Directory, MCP.so, Glama, CodeSOTA coverage request, Hugging Face dataset/Space, and AMB maintainer issue for the official provider path.

| Target | Status | URL / next action |
| --- | --- | --- |
| MCP.Directory | Submitted for review | Review promised within 24 hours. |
| MCP.so | Signed-in server record created and GitHub issue opened | mcpso issue 2198 |
| Glama | Submitted for review | Signed-in form completed. |
| CodeSOTA | Coverage request submitted, not a fake score | Waiting for editorial reply. |
| AMB | Provider/leaderboard issue opened and updated with this run | AMB issue 11 |
| Hugging Face | Dataset and Space updated with current evidence | Space and Dataset |

Files In This Evidence Bundle