Objective
Build the complete robustness testing infrastructure needed to validate estimation results before publication: RBPF justification, variance stability, MCMC diagnostics, out-of-sample forecasts, model comparison (2 vs 4 regimes), MAP robustness, residual diagnostics, prior sensitivity, economic plausibility, and ESS bottleneck analysis.
Setup
- Model: full 55-parameter CRE asset pricing model with RBPF likelihood
- Evaluator: RBPF at N=3000 particles (~0.35 s/eval, log-likelihood std ~0.64)
- Comparators: bootstrap PF (Task 1), 2-regime restricted model (Task 5)
- Acceptance: all self-tests pass; infrastructure ready for full-run execution
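The per-evaluation noise figure above (std ~0.64 at N=3000) is the quantity Task 2 tracks across seeds and particle counts. A minimal sketch of that check, using a hypothetical `rbpf_loglik` stand-in (a fixed value plus Monte Carlo noise scaling like 1/sqrt(N); the real evaluator is the 55-parameter model's RBPF):

```python
import numpy as np

# Hypothetical stand-in for the RBPF log-likelihood evaluator: a fixed "true"
# value plus Monte Carlo noise whose scale shrinks like 1/sqrt(N particles).
def rbpf_loglik(theta, n_particles, rng):
    mc_std = 0.64 * np.sqrt(3000 / n_particles)  # calibrated so std ~0.64 at N=3000
    return -1234.5 + rng.normal(scale=mc_std)

def loglik_std(n_particles, n_reps=50, seed=0):
    """Std of repeated log-likelihood evaluations at a fixed parameter point."""
    rng = np.random.default_rng(seed)
    evals = [rbpf_loglik(None, n_particles, rng) for _ in range(n_reps)]
    return float(np.std(evals, ddof=1))
```

The stability criterion is that this std stays roughly constant across seeds at fixed N and shrinks as N grows.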
Procedure
- Task 1: RBPF vs bootstrap PF head-to-head (efficiency comparison)
- Task 2: Variance stability across seeds and particle counts
- Tasks 3, 6, 11: MCMC infrastructure (Gelman-Rubin, ESS, trace plots, posterior summaries)
- Tasks 4, 5, 8, 10: Model fit (out-of-sample forecasts, 2-regime LRT/BIC, residual diagnostics, economic plausibility)
- Tasks 7, 9: Estimate stability (MAP robustness across N/seeds/starts, prior sensitivity)
- Task 12: Integration (self-tests pass, regression tests 37/37)
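The MCMC diagnostics in Tasks 3, 6, and 11 reduce to two standard computations; a self-contained sketch (toy implementations, not the project's actual diagnostic module):

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for chains shaped (n_chains, n_draws); values near 1 indicate mixing."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled posterior-variance estimate
    return np.sqrt(var_plus / W)

def ess(x, max_lag=200):
    """Effective sample size of one chain from its autocorrelation function."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    # Sum positive-lag autocorrelations until they first turn non-positive.
    s = 0.0
    for k in range(1, min(max_lag, n)):
        if rho[k] <= 0:
            break
        s += rho[k]
    return n / (1 + 2 * s)
```

An iid chain gives R-hat near 1 and ESS near the draw count; a strongly autocorrelated chain gives an ESS far below it, which is what the ESS bottleneck analysis quantifies per parameter.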
Results
- RBPF is 100x to 300,000x more efficient than the bootstrap PF (bootstrap ESS collapses to 2-3 particles regardless of N)
- All 4 self-test blocks pass (residual diagnostics, ESS analysis, two-regime round-trip, posterior diagnostics)
- Phase E.1 regression tests: 37/37 pass
- 17 economic plausibility checks across 5 categories implemented
- Infrastructure ready; heavy-compute experiments (Tasks 4, 5, 7, 9) await MAP completion
- No separate audit session
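The ESS-collapse result can be illustrated with the standard particle-weight formula ESS = 1/Σ wᵢ². The weight distributions below are stylized stand-ins (not the model's actual weights): sharply varying log-weights for the bootstrap PF's blind 6D proposals under high-precision observations, versus a handful of regime-level likelihood values for the RBPF:

```python
import numpy as np

def weight_ess(log_w):
    """ESS = 1 / sum(w_i^2) for self-normalized particle weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

rng = np.random.default_rng(0)
N = 3000
# Bootstrap PF (stylized): blind proposals over the full continuous state meet
# a very sharp likelihood, so log-weights spread over hundreds of log units.
log_w_bootstrap = -0.5 * rng.chisquare(df=6, size=N) * 100.0
# RBPF (stylized): only the discrete regime is sampled; log-weights take one of
# a few regime-likelihood values, so they stay comparable across particles.
log_w_rbpf = rng.choice([-1.0, -1.3, -1.7, -2.2], size=N)
```

Under these assumptions the bootstrap ESS degenerates to a handful of particles while the RBPF ESS stays at a large fraction of N, matching the 2-3 vs thousands pattern reported above.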
Analysis
This is an infrastructure experiment, not a numerical result. The key finding is that the RBPF’s advantage over the bootstrap PF is structural: particles only need to track regimes (a discrete 4-state variable) rather than the full 6D continuous state, so Kalman filtering handles the high-precision macro observations exactly. The robustness suite is designed to answer the five reviewer questions: why this filter, are estimates stable, does the model fit, how prior-sensitive, and is it economically sensible.
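The structural point can be made concrete with a toy 1D Rao-Blackwellized step (a sketch only: scalar state, hypothetical parameters `a`, `q`, `r`, regime drifts `mus`; the actual model is 6D with 55 parameters). Particles carry just the discrete regime plus Kalman sufficient statistics, and the continuous state is marginalized exactly:

```python
import numpy as np

def rbpf_step(regimes, means, variances, y, P_trans, mus,
              a=0.95, q=0.1, r=0.01, rng=None):
    """One RBPF update for a toy switching model
    x_t = a*x_{t-1} + mus[s_t] + eps,  y_t = x_t + eta.
    Particles track only the discrete regime s_t; the continuous state is
    handled exactly by a per-particle Kalman filter."""
    rng = rng or np.random.default_rng()
    N = len(regimes)
    # 1. Sample the next regime per particle from the transition matrix.
    new_regimes = np.array([rng.choice(len(mus), p=P_trans[s]) for s in regimes])
    # 2. Kalman predict (exact, per particle).
    m_pred = a * means + mus[new_regimes]
    v_pred = a * a * variances + q
    # 3. Weight by the predictive likelihood N(y; m_pred, v_pred + r).
    s2 = v_pred + r
    log_w = -0.5 * (np.log(2 * np.pi * s2) + (y - m_pred) ** 2 / s2)
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    # 4. Kalman update (exact, per particle).
    k = v_pred / s2
    m_post = m_pred + k * (y - m_pred)
    v_post = (1 - k) * v_pred
    # 5. Multinomial resampling by weight.
    idx = rng.choice(N, size=N, p=w)
    return new_regimes[idx], m_post[idx], v_post[idx]
```

Because the weights depend only on which regime was sampled, precise observations tighten the Kalman posteriors instead of degenerating the particle weights.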
Claim updates
No specific claim updated (infrastructure experiment).
Follow-up
- Run full-mode experiments once MAP is finalized (FULL_RUN=true, ~12+ hours)
- 2-regime restricted model LRT/BIC comparison is the key model-selection deliverable
- Prior sensitivity analysis will identify which parameters are prior-driven vs data-driven
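The 2-vs-4-regime comparison reduces to a standard nested-model calculation; a sketch with hypothetical log-likelihoods and a hypothetical restricted-model parameter count (only the 55-parameter full-model count comes from this log, and note the usual caveat that boundary restrictions can break the chi-squared regularity conditions for the LRT):

```python
import numpy as np
from scipy.stats import chi2

def lrt_and_bic(ll_restricted, ll_full, k_restricted, k_full, n_obs):
    """Likelihood-ratio test and BIC comparison for nested models."""
    lr_stat = 2.0 * (ll_full - ll_restricted)
    df = k_full - k_restricted
    # Caveat: chi-squared reference may be invalid if the restriction sits on
    # the parameter-space boundary (common for regime-switching models).
    p_value = chi2.sf(lr_stat, df)
    bic = lambda ll, k: -2.0 * ll + k * np.log(n_obs)
    return lr_stat, p_value, bic(ll_restricted, k_restricted), bic(ll_full, k_full)
```

With the full-run likelihoods plugged in, the LRT answers whether the extra regimes are statistically justified and the BIC penalizes the 4-regime model's larger parameter count.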