Audrey MemoryGym Current Benchmark Evidence

Audrey HTTP Score

57.5%

Up from 34.3% before the tag isolation fix (+23.2 pts).

Audrey Hit Rate

85.7%

Expected memory appeared in top-k on 6 of 7 probes, up from 3 of 7.

Audrey P95 Recall

3.8 ms

P95 recall latency increased by 0.1 ms after the fix.
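The headline figures above follow from simple arithmetic on the probe counts and scores. A minimal sketch, assuming hit rate is hits divided by probes and the score change is a plain difference in percentage points; the function names are illustrative and not part of MemoryGym or Audrey:

```python
def hit_rate(hits: int, probes: int) -> float:
    """Percentage of probes whose expected memory appeared in top-k."""
    return 100.0 * hits / probes


def point_delta(after: float, before: float) -> float:
    """Change between two percentage scores, in percentage points."""
    return after - before


after_rate = round(hit_rate(6, 7), 1)             # 6 of 7 probes after the fix
before_rate = round(hit_rate(3, 7), 1)            # 3 of 7 probes before the fix
score_delta = round(point_delta(57.5, 34.3), 1)   # Audrey HTTP score change

print(after_rate, before_rate, score_delta)  # 85.7 42.9 23.2
```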

Before And After

Current Graphs

Adapter Results After Fix

Adapter	Score	Hit Rate	P95 Recall	Precision	Contamination Penalty
typed-semantic	72.7%	100.0%	0.4 ms	34.5%	45.2%
hybrid	72.7%	100.0%	0.3 ms	34.5%	45.2%
audrey-http	57.5%	85.7%	3.8 ms	29.8%	50.0%

What Changed

Audrey recall previously treated multiple tags as an OR filter. MemoryGym passes three tags on every recall: memorygym, the run id, and the scenario id. Because every row carries the memorygym tag, OR matching let memories from unrelated scenarios through. Audrey now requires all requested tags to match (AND semantics), which raised the Audrey HTTP score from 34.3% to 57.5%.
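The difference between the two filter semantics can be sketched in a few lines. This is an illustration only, not Audrey's implementation: the tag values run-0 and run-1, the use of a scenario name as a tag, and the memory old-run-memory are hypothetical.

```python
def matches_any(memory_tags, requested_tags):
    """OR semantics: a single shared tag is enough to match."""
    return bool(set(memory_tags) & set(requested_tags))


def matches_all(memory_tags, requested_tags):
    """AND semantics: every requested tag must be present on the memory."""
    return set(requested_tags) <= set(memory_tags)


# Hypothetical rows; each one carries the shared "memorygym" tag.
memories = [
    {"id": "maya-workspace-current",
     "tags": ["memorygym", "run-1", "typed-profile-updates"]},
    {"id": "audrey-benchmark-gate",
     "tags": ["memorygym", "run-1", "retrieval-context-routing"]},
    {"id": "old-run-memory",
     "tags": ["memorygym", "run-0", "typed-profile-updates"]},
]

requested = ["memorygym", "run-1", "typed-profile-updates"]

or_ids = [m["id"] for m in memories if matches_any(m["tags"], requested)]
and_ids = [m["id"] for m in memories if matches_all(m["tags"], requested)]

print(or_ids)   # all three rows leak through via the shared "memorygym" tag
print(and_ids)  # only the row from the requested run and scenario
```

Under OR matching the shared memorygym tag alone is enough to admit every row, which is the contamination described above. Under AND matching only the row tagged with the requested run and scenario survives.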

Remaining Failures

The next benchmark work targets current-belief precedence and context-aware ranking. Audrey now usually finds the expected memory (6 of 7 probes), but still ranks stale or wrong-project memories too high.

Probe	Query	Score	Hit	Returned IDs	Top Recall
typed-profile-updates / current-workspace	Which collaboration environment should Maya use right now?	0.0%	0.0%	maya-workspace-old, jonas-distractor-workspace, maya-pref-morning	Maya's active workspace used to be Atlas for the onboarding sprint.
typed-profile-updates / meeting-preference	How should Maya receive meeting preparation?	70.7%	100.0%	maya-pref-morning, maya-workspace-old, maya-workspace-current	Maya prefers morning async notes before any meeting-heavy work.
retrieval-context-routing / audrey-release-gate	Which project requires a benchmark gate and doctor check for release validation?	59.0%	100.0%	termivibe-first-contact, audrey-benchmark-gate, repopulse-trace-contract	TermiVibe first contact validation should cover config precedence, typed mode, no wake word mode, and telemetry persistence.
retrieval-context-routing / repopulse-trace	Which memory mentions pipeline version and dense rerank trace boundaries?	57.0%	100.0%	audrey-benchmark-gate, repopulse-trace-contract, termivibe-first-contact	Audrey release validation includes build, typecheck, benchmark gate, doctor, demo, pack dry-run, and host smoke coverage.
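One possible shape for current-belief precedence is a re-rank step that demotes memories known to be superseded. The sketch below is not Audrey's planned design. The superseded_by field, the similarity scores, and the assumption that maya-workspace-current replaces maya-workspace-old are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    id: str
    score: float                         # hypothetical retrieval similarity
    superseded_by: Optional[str] = None  # hypothetical link to a newer memory


def prefer_current(results: List[Memory]) -> List[Memory]:
    """Rank non-superseded memories first, then by descending score."""
    return sorted(results, key=lambda m: (m.superseded_by is not None, -m.score))


# Made-up scores in which the stale memory narrowly wins on similarity alone.
results = [
    Memory("maya-workspace-old", 0.91, superseded_by="maya-workspace-current"),
    Memory("maya-workspace-current", 0.88),
    Memory("jonas-distractor-workspace", 0.80),
]

reranked_ids = [m.id for m in prefer_current(results)]
print(reranked_ids)
# ['maya-workspace-current', 'jonas-distractor-workspace', 'maya-workspace-old']
```

This only addresses stale memories. Wrong-project results such as the TermiVibe memory topping the audrey-release-gate probe would need a separate context-aware signal.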

Submission Status

Already submitted or listed for review: MCP.Directory, MCP.so, Glama, a CodeSOTA coverage request, the Hugging Face dataset and Space, and an AMB maintainer issue for the official provider path.

Target	Status	URL / next action
MCP.Directory	Submitted for review	Review promised within 24 hours.
MCP.so	Signed-in server record created and GitHub issue opened	mcpso issue 2198
Glama	Submitted for review	Signed-in form completed.
CodeSOTA	Coverage request submitted (no fabricated score)	Waiting for editorial reply.
AMB	Provider/leaderboard issue opened and updated with this run	AMB issue 11
Hugging Face	Dataset and Space updated with current evidence	Space and Dataset

Files In This Evidence Bundle

memorygym-before-tag-filter.json
memorygym-after-tag-filter.json
memorygym-current-run.json
memorygym-current-summary.json
memorygym-before-after.svg
memorygym-score.svg
memorygym-hit-rate.svg
memorygym-p95-recall-latency.svg
audrey-benchmark-report.png