Objective
After locking the F1 cheap screen (basin-finder-02-f1-cheap-screen) and the F2 medium refiner (basin-finder-03-f2-medium-refiner), the cascade still has one free design choice: how to select the top-N candidates that get promoted to expensive local optimization. This experiment tests whether staged narrowing, diversity constraints, exploration quotas, or hybrids of these improve on the trivial baseline of “rank by F2 and take top N”.
Setup
- 5 policy families: P0 (baseline), P1 (staged narrowing), P2 (diversity-constrained), P3 (diversity + exploration), P4 (staged + diversity).
- 28 configurations across width grid M1 in {500, 1000}, M2 in {50, 100}, N in {25, 50}.
- All comparisons are offline: frozen F1-0b scores, locked F2-1 scores on top-1000, strong benchmark scores on top-300 (extended from Exp 03’s top-100 during Phase 1A).
- Diversity uses frozen 10-cluster labels from Exp 01; exploration uses fixed seeds with a 2-slot quota (N=25) or 4-slot quota (N=50) drawn from F2 ranks N+1..3N.
- Acceptance rule: a policy is accepted only if its top-N recall against the strong benchmark exceeds that of P0.
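The recall metric and the P0 baseline can be sketched as follows. This is a minimal illustration, not the experiment's code: the function names (`top_indices`, `p0_select`, `elite_recall`) are hypothetical, higher scores are assumed to be better, and the elite set size is passed separately from the promotion width because the results quote top-25 recall at N=50. The toy arrays are invented and are not the frozen score tables.

```python
import numpy as np

def top_indices(scores, n, lower_is_better=False):
    """Return indices of the n best entries of `scores` (stable on ties)."""
    scores = np.asarray(scores, dtype=float)
    key = scores if lower_is_better else -scores
    return np.argsort(key, kind="stable")[:n]

def p0_select(f2_scores, n):
    """P0 baseline: rank by F2 and take the top n (higher F2 assumed better)."""
    return top_indices(f2_scores, n)

def elite_recall(promoted, benchmark_scores, elite_size, lower_is_better=False):
    """Fraction of the benchmark's top `elite_size` that appears in `promoted`."""
    elite = set(top_indices(benchmark_scores, elite_size, lower_is_better).tolist())
    return len(elite & set(np.asarray(promoted).tolist())) / elite_size

# Toy data (hypothetical, not the experiment's score tables).
f2 = np.array([9.0, 8.0, 7.0, 6.0, 5.0, 4.0])
bench = np.array([9.0, 3.0, 8.0, 7.5, 6.5, 1.0])

promoted = p0_select(f2, 3)
recall = elite_recall(promoted, bench, elite_size=3)
print(promoted.tolist(), round(recall, 3))  # [0, 1, 2] 0.667
```

Under the acceptance rule, a candidate policy would need its `elite_recall` to be strictly greater than the value P0 obtains at the same N.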
Procedure
- Score each policy configuration on the frozen score tables and compute top-N recall and best benchmark NLL.
- An auditor independently reimplements all 28 configurations in Python and runs 5 adversarial attacks (e.g., shuffling cluster labels, changing exploration seeds).
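One of the attacks named above, changing exploration seeds, might look like the sketch below. Several things here are assumptions rather than the auditor's actual procedure: the exploration slots are assumed to replace the lowest-ranked slots of the F2 top N, the pass criterion is assumed to be "no seed lifts recall above P0", and the scores are synthetic stand-ins. Only the quota size (2 slots at N=25) and the draw range (F2 ranks N+1..3N) come from the setup.

```python
import numpy as np

def explore_select(f2_scores, n, quota, seed):
    """Keep the top (n - quota) by F2, then fill `quota` slots at random
    from F2 ranks n+1..3n (1-based). Slot replacement is an assumption."""
    order = np.argsort(-np.asarray(f2_scores, dtype=float), kind="stable")
    core = order[: n - quota]
    pool = order[n : 3 * n]  # 0-based slice covering ranks n+1 .. 3n
    rng = np.random.default_rng(seed)
    extra = rng.choice(pool, size=quota, replace=False)
    return np.concatenate([core, extra])

def elite_recall(promoted, benchmark_scores, elite_size):
    """Fraction of the benchmark's top `elite_size` found in `promoted`."""
    order = np.argsort(-np.asarray(benchmark_scores, dtype=float), kind="stable")
    elite = set(order[:elite_size].tolist())
    return len(elite & set(np.asarray(promoted).tolist())) / elite_size

# Synthetic stand-ins: F2 is the benchmark plus small noise.
data_rng = np.random.default_rng(0)
bench = data_rng.normal(size=300)
f2 = bench + 0.05 * data_rng.normal(size=300)

n, quota = 25, 2
p0 = np.argsort(-f2, kind="stable")[:n]
p0_recall = elite_recall(p0, bench, n)

# Sweep seeds and record the recall each one produces.
seed_recalls = [
    elite_recall(explore_select(f2, n, quota, seed), bench, n)
    for seed in range(20)
]
# Assumed pass criterion: the attack passes if no seed beats P0.
attack_passes = max(seed_recalls) <= p0_recall
print(p0_recall, min(seed_recalls), max(seed_recalls), attack_passes)
```

On synthetic data the outcome of `attack_passes` is not guaranteed either way; the point is only the shape of the check, which reruns the same policy under many seeds and compares the best case against the baseline.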
Results
- No policy beats P0. P0 recall = 0.800 at N=25 and 1.000 at N=50. At N=25, the best non-P0 configurations (P1 and P4 at M1=500) tie P0 at 0.800, and the worst (P2 with k=15 clusters) reaches 0.560.
- F2-benchmark Spearman = 0.986 — F2 already approximates the benchmark ranking, so reranking adds nothing.
- Diversity constraints hurt, and recall falls monotonically as the cluster count k grows (k=8 → 0.760, k=10 → 0.680, k=15 → 0.560). The benchmark top-25 concentrates in 3 of 10 clusters; forcing per-cluster representation displaces real elites.
- Staged narrowing P1 is algebraically redundant: when N << M2 it equals P0 exactly.
- N=50 is a generous budget — all policies hit 100% top-25 recall there.
- Audit: fragility audit (2026-03-28) confirms 8 checks and 5 adversarial attacks pass; Tier A 55/55, Tier B 9/9 — null result is structurally robust.
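The redundancy of staged narrowing can be illustrated with a small sketch. The exact definition of P1 is not spelled out here, so this assumes the narrowing to M2 and the final cut to N both rank by the same F2 score, and it omits the F1 stage entirely. Under that assumption, nested top-k cuts on one score collapse to a single top-N cut whenever N <= M2. The helper names and the synthetic scores are hypothetical.

```python
import numpy as np

def top_k(ids, scores, k):
    """Keep the k ids with the highest score, in descending score order."""
    ids = np.asarray(ids)
    order = np.argsort(-scores[ids], kind="stable")
    return ids[order[:k]]

def staged_select(f2, pool, m2, n):
    """Assumed P1: narrow the pool to M2 by F2, then to N by F2 again."""
    return top_k(top_k(pool, f2, m2), f2, n)

rng = np.random.default_rng(1)
f2 = rng.normal(size=1000)  # synthetic stand-in for F2 scores on the top-1000
pool = np.arange(1000)

# Compare staged selection with plain top-N over the width grid.
identical = all(
    np.array_equal(staged_select(f2, pool, m2, n), top_k(pool, f2, n))
    for m2 in (50, 100)
    for n in (25, 50)
)
print(identical)  # True
```

If the intermediate stage ranked by a different score than the final stage, the two selections could diverge, so the equality shown depends on the single-score assumption.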
Analysis
The null result is not a failure to design a good policy. It is a finding that, because F2-1 already tracks the benchmark at Spearman 0.986, the elite ranking is fixed at the F2 stage and the promotion step is essentially a no-op. The binding constraint is F2 quality, not policy choice. The diversity result is itself useful: the benchmark elites concentrate in a few clusters, so any future policy that assumes a balanced cluster distribution will lose recall on this archive.
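The displacement effect can be reproduced on a toy archive. The sketch below is not the P2 policy itself: it assumes a simple per-cluster cap of ceil(N / k) applied while walking the ranking, and it uses one invented score vector as both F2 and benchmark, with the top 25 candidates placed in 3 of 10 clusters to mimic the concentration described above.

```python
import math
import numpy as np

def capped_select(scores, clusters, n, n_clusters):
    """Walk candidates in descending score order, accepting one only if its
    cluster holds fewer than ceil(n / n_clusters) picks (assumed rule)."""
    cap = math.ceil(n / n_clusters)
    counts = {}
    chosen = []
    for i in np.argsort(-scores, kind="stable"):
        c = int(clusters[i])
        if counts.get(c, 0) < cap:
            chosen.append(int(i))
            counts[c] = counts.get(c, 0) + 1
        if len(chosen) == n:
            break
    return chosen

# Toy archive: index i has score 100 - i, so indices 0..24 are the elites.
scores = np.arange(100, 0, -1, dtype=float)
# Elites sit in clusters 0..2; everything else is spread over clusters 3..9.
clusters = np.array([i % 3 if i < 25 else 3 + (i - 25) % 7 for i in range(100)])

n, k = 25, 10
elite = set(range(n))
p0 = np.argsort(-scores, kind="stable")[:n].tolist()
diverse = capped_select(scores, clusters, n, k)

recall_p0 = len(elite & set(p0)) / n
recall_diverse = len(elite & set(diverse)) / n
print(recall_p0, recall_diverse)  # 1.0 0.36
```

With a cap of 3 per cluster, only 9 of the 25 elites fit into their 3 clusters, and the remaining 16 slots go to lower-ranked candidates from the other clusters. The toy numbers are far harsher than the measured ones and only show the mechanism.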
Claim updates
- forward-solution-unique-fundamental-ree-with-nbc: tested_by, strength weak. The null result corroborates that the cascade’s elite set is determined by the forward-solution likelihood ranking, not by selection policy; this is relevant only as an architecture observation.
Follow-up
- Locks P0 (top-N by F2-1 score) as the promotion policy entering basin-finder-05-global-driver.
- Open: would diversity help if the archive had more uniformly distributed basins? Not testable on this archive.
- See also: basin-finder-complete-program-summary.