Objective

Lock the medium-tier refiner that sits between the F1 cheap screen (basin-finder-02-f1-cheap-screen) and the strong 6D benchmark. The refiner must close the F1→benchmark ranking gap (F1-0b is near-random on F1-promoted candidates: Spearman 0.05) at a runtime that lets the cascade actually use it.

Setup

  • Model / parameterization: 54-D constrained MSRE in phi-space.
  • Evaluator: 5 candidate F2 screens (varying mc_R in {100, 300}, sparse vs full time-points, pure MC vs F1/MC hybrid F2-5).
  • Comparator: F2-0 (no F2: direct F1→benchmark promotion) and the strong benchmark hamilton_loglik_cap_mc at mc_R=1000, mc_H_burn=150.
  • Acceptance rule: runtime hard cap 3 s/point (“not-F2” threshold 5 s), Spearman vs benchmark, top-K recall at promotion budget 25.
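The two ranking metrics in the acceptance rule can be sketched as follows. This is a minimal illustration, not the experiment's scoring script: it assumes higher scores are better, that top-K recall means the share of the benchmark's top-K points that land inside the screen's promoted set of size `budget`, and the function names and synthetic scores are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ordinal ranks (0 = smallest); ties are broken by position, not averaged."""
    x = np.asarray(x, dtype=float)
    r = np.empty(len(x), dtype=float)
    r[np.argsort(x, kind="stable")] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def top_k_recall(screen, benchmark, k, budget=25):
    """Share of the benchmark's top-k points found in the screen's top-`budget`.

    Assumption: this is how "top-K recall at promotion budget 25" is defined.
    """
    promoted = set(np.argsort(-np.asarray(screen), kind="stable")[:budget].tolist())
    best = np.argsort(-np.asarray(benchmark), kind="stable")[:k].tolist()
    return sum(i in promoted for i in best) / k

# Synthetic scores for illustration only (not data from the experiment).
rng = np.random.default_rng(0)
benchmark = rng.normal(size=100)
screen = benchmark + 0.05 * rng.normal(size=100)

rho = spearman(screen, benchmark)
recall_10 = top_k_recall(screen, benchmark, k=10)
recall_25 = top_k_recall(screen, benchmark, k=25)
```

Under this definition, a top-25 recall of 0.96 corresponds to 24 of the benchmark's top 25 points being promoted.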

Procedure

  • Evaluate all candidates on the top-100 of the F1-conditioned 1000-point pool.
  • 3-seed robustness sweep: pairwise Spearman 0.994-0.995, top-25 overlap 92-96%.
  • Pool-depth audit extension: re-score on top-200 to check that the result is not a top-100 artifact.
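The seed-robustness step compares every pair of seeds on rank agreement and on top-25 overlap. A minimal sketch of that pairwise comparison is below, assuming each seed yields one score per point over the same pool, that higher is better, and that scores have no ties; `seed_sweep` and the synthetic data are hypothetical.

```python
from itertools import combinations
import numpy as np

def rank_corr(a, b):
    # Spearman via Pearson on ordinal ranks (assumes no tied scores).
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def top_overlap(a, b, k):
    # Fraction of points shared by the two top-k sets (higher score = better).
    ta = set(np.argsort(-a)[:k].tolist())
    tb = set(np.argsort(-b)[:k].tolist())
    return len(ta & tb) / k

def seed_sweep(scores_by_seed, k=25):
    # scores_by_seed: hypothetical dict mapping seed -> score array over the same points.
    out = {}
    for s, t in combinations(sorted(scores_by_seed), 2):
        a, b = scores_by_seed[s], scores_by_seed[t]
        out[(s, t)] = (rank_corr(a, b), top_overlap(a, b, k))
    return out

# Synthetic example: three seeds perturbing one underlying score vector.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
scores = {seed: base + 0.02 * rng.normal(size=100) for seed in (1, 2, 3)}
sweep = seed_sweep(scores)
```

With three seeds there are three pairs, so the reported ranges (Spearman 0.994-0.995, overlap 92-96%) would be the minimum and maximum over those three entries.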

Results

  • F2-1 (mc_R=100, full-sample): Spearman 0.996, top-10 recall 1.00, top-25 recall 0.96, median 2.24 s/point — meets the 3 s cap; F2-2 (mc_R=300) is at 0.999 Spearman but 6.72 s/point and fails the cap.
  • F2 is essential, not optional: the no-F2 baseline F2-0 captures only 40% of benchmark top-10.
  • mc_R=100 vs mc_R=300: Spearman 0.996 vs 0.999. Tripling the MC paths buys 0.003 in rank correlation at 3x the runtime (6.72 vs 2.24 s/point), so the extra paths are wasted compute at the F2 tier.
  • Sparse-time variants offer no speed advantage (MC rebuild dominates runtime).
  • Signal-blending hybrid F2-5 degrades performance (mixing near-zero F1 signal with strong F2 dilutes both).
  • Audit: the fragility audit reproduces all primary metrics, and the alt-methods script extension confirms top-200 robustness (Spearman 0.981, perfect recall). Both original caveats (seed sensitivity, narrow pool) are closed by the extended robustness checks. The auditor caught an invperm(sortperm(sortperm(x))) bug in the Spearman implementation; recall metrics were unaffected.
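To see why the invperm(sortperm(sortperm(x))) composition is a problem, the sketch below mirrors the Julia calls with numpy (0-based indices). It relies only on the standard behaviour of those functions: sorting the sort permutation yields ranks, and inverting the ranks returns the sort permutation again, so the triple composition gives sort order rather than ranks. How exactly this entered the original Spearman code is not stated in the report; the `invperm` helper here is a stand-in written for illustration.

```python
import numpy as np

def invperm(p):
    """Inverse of a 0-based permutation p, mirroring Julia's invperm."""
    inv = np.empty_like(p)
    inv[p] = np.arange(len(p))
    return inv

x = np.array([0.3, 2.0, -1.0, 0.9, 1.5])

order = np.argsort(x)      # like sortperm(x): indices that sort x
ranks = np.argsort(order)  # like sortperm(sortperm(x)): rank of each element
triple = invperm(ranks)    # like invperm(sortperm(sortperm(x)))

# The triple composition undoes the ranking step and returns the sort order.
same_as_order = bool(np.array_equal(triple, order))
same_as_ranks = bool(np.array_equal(triple, ranks))
```

A correlation computed on `triple` is therefore not a Spearman coefficient. Metrics built directly from top-K index sets do not pass through this step, which is consistent with the recall figures being unaffected.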

Analysis

The locked F2-1 closes the F1→benchmark ranking gap at ~10x the F1 cost while remaining ~10x cheaper than the benchmark. Strong-eval savings of 89.7% mean the cascade can actually afford to use it. The runtime tail (max 5.32 s/point on outliers) is the only operational caveat and is flagged for Experiment 04’s promotion policy.
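The runtime check behind that caveat can be sketched as a small report over per-point timings. This assumes the 3 s cap is applied to the median (as the F2-1 result line suggests) and that the 5 s "not-F2" threshold is used to count tail points; `runtime_report` and the timing list are hypothetical, with only the 2.24 s median and 5.32 s maximum taken from the report.

```python
import statistics

HARD_CAP_S = 3.0  # runtime hard cap from the acceptance rule
NOT_F2_S = 5.0    # "not-F2" threshold from the acceptance rule

def runtime_report(seconds_per_point):
    """Summarize per-point runtimes against the cap and the tail threshold."""
    med = statistics.median(seconds_per_point)
    return {
        "median_s": med,
        "max_s": max(seconds_per_point),
        "meets_cap": med <= HARD_CAP_S,  # assumption: cap applies to the median
        "tail_points": sum(t > NOT_F2_S for t in seconds_per_point),
    }

# Illustrative timings only; not the measured per-point distribution.
timings = [2.1, 2.2, 2.24, 2.3, 5.32]
report = runtime_report(timings)
```

A screen can meet the median cap while still producing individual points above the 5 s threshold, which is the situation a promotion policy would need to handle.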

Claim updates

  • forward-solution-unique-fundamental-ree-with-nbc: tested_by, strength moderate. F2-1 ranks parameter vectors via the full 6-D forward-solution likelihood with MC cap rates and matches the strong benchmark at top-K, validating that the forward-solution evaluator can be used as a refiner stage in a cascade.
  • asset-pricing-no-bubble-condition-regime: tested_by, strength moderate. The refiner uses MC cap-rate pricing inside the no-bubble feasibility region; high Spearman is empirical evidence that the NBC-feasible interior is well-ordered by the production likelihood.

Follow-up