We operate two competitive intelligence systems built on a semantic layer architecture. The SBPI (Structural Brand Power Index) tracks 17 micro-drama companies across 5 scoring dimensions, producing weekly composite scores stored as RDF triples. A prediction engine forecasts directional movement for each company each week.
Experiment 2 applied Karpathy's autoresearch methodology to optimize the knowledge-graph-to-prediction interface, lifting directional accuracy from a 23.5% baseline to 69.9%. This report identifies 5 additional experiments that compound on that result.
Each experiment's optimized output feeds the next. The chain progresses from safety infrastructure through feature expansion to cross-domain generalization.
The interface between the knowledge representation layer and the prediction/reasoning layer contains tunable parameters that have disproportionate impact on output quality. These parameters are almost always set by intuition and rarely optimized systematically. Karpathy's autoresearch framing treats this systematic optimization as automated research that compounds over time.
- Phase 1, ETL Load: `sbpi_to_rdf.py --all --validate`
- Phase 2, Accuracy Check: `prediction_engine.py --report`
- Phase 3, Prediction Gen: `prediction_engine.py --generate`
- Phase 4, Attestation: `attestation_upgrade.py --upgrade`
- Phase 5, Insights: `nightly-insights.py --schedule all`
- Phase 6, Optimization: `kg_interface_optimizer.py --nightly` ← experiments modify this phase
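The phase sequence above can be sketched as a small orchestrator. The script names come from the phase list; the `run_pipeline` wrapper and its `dry_run` flag are hypothetical illustrations, not part of the actual launchd job.

```python
import subprocess

# Phases in nightly execution order (script names from the pipeline above).
PHASES = [
    ("ETL Load",       ["sbpi_to_rdf.py", "--all", "--validate"]),
    ("Accuracy Check", ["prediction_engine.py", "--report"]),
    ("Prediction Gen", ["prediction_engine.py", "--generate"]),
    ("Attestation",    ["attestation_upgrade.py", "--upgrade"]),
    ("Insights",       ["nightly-insights.py", "--schedule", "all"]),
    ("Optimization",   ["kg_interface_optimizer.py", "--nightly"]),
]

def run_pipeline(dry_run=True):
    """Run each phase in order; a real run aborts the chain on first failure."""
    executed = []
    for name, cmd in PHASES:
        if not dry_run:
            subprocess.run(["python"] + cmd, check=True)  # check=True stops the chain
        executed.append((name, " ".join(cmd)))
    return executed

plan = run_pipeline(dry_run=True)
```

A dry run just returns the execution plan, which is useful for verifying phase ordering before scheduling.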
| Parameter | Optimized Value | Default | Change |
|---|---|---|---|
| direction_threshold | 1.295 | 0.500 | +159% |
| confidence_base | 0.443 | 0.600 | -26% |
| mean_reversion_rate | 0.257 | 0.100 | +157% |
| anomaly_contributes | true | false | flipped |
| divergence_weight | 0.180 | 0.000 | new |
| tier_proximity_weight | 0.096 | 0.000 | new |
| magnitude_thresh_1 | 3.020 | 3.000 | +1% |
| magnitude_thresh_2 | 5.076 | 5.000 | +2% |
| consistency_thresh | 1.980 | 2.000 | -1% |
| magnitude_bonus_1 | 0.120 | 0.100 | +20% |
| magnitude_bonus_2 | 0.136 | 0.100 | +36% |
| consistency_bonus | 0.040 | 0.050 | -20% |
Experiments 3-4 expand this to 18 parameters by adding dimension weights and temporal decay.
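A minimal sketch of the 12-parameter interface search space implied by the table. The optimized values and defaults come from the table; the range bounds and the `validate` helper are illustrative assumptions, not the actual bounds used by `kg_interface_optimizer.py`.

```python
# name: (low, high, default) -- ranges are assumed for illustration only.
SEARCH_SPACE = {
    "direction_threshold":   (0.1, 3.0, 0.500),
    "confidence_base":       (0.2, 0.9, 0.600),
    "mean_reversion_rate":   (0.0, 0.5, 0.100),
    "divergence_weight":     (0.0, 0.5, 0.000),
    "tier_proximity_weight": (0.0, 0.5, 0.000),
    "magnitude_thresh_1":    (2.0, 4.0, 3.000),
    "magnitude_thresh_2":    (4.0, 6.0, 5.000),
    "consistency_thresh":    (1.0, 3.0, 2.000),
    "magnitude_bonus_1":     (0.0, 0.3, 0.100),
    "magnitude_bonus_2":     (0.0, 0.3, 0.100),
    "consistency_bonus":     (0.0, 0.2, 0.050),
}
BOOL_PARAMS = {"anomaly_contributes": False}  # categorical flip from the table

def validate(config):
    """Check that a candidate config stays inside the declared ranges."""
    return all(lo <= config[name] <= hi
               for name, (lo, hi, _default) in SEARCH_SPACE.items())

# Optimized values from the table above.
optimized = {
    "direction_threshold": 1.295, "confidence_base": 0.443,
    "mean_reversion_rate": 0.257, "divergence_weight": 0.180,
    "tier_proximity_weight": 0.096, "magnitude_thresh_1": 3.020,
    "magnitude_thresh_2": 5.076, "consistency_thresh": 1.980,
    "magnitude_bonus_1": 0.120, "magnitude_bonus_2": 0.136,
    "consistency_bonus": 0.040,
}
```

In a TPE study, each continuous entry would map to one `suggest_float` call and the boolean to a categorical suggestion.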
| | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 |
|---|---|---|---|---|---|
| Exp 1 | — | Synergy | Safe | Safe | Required |
| Exp 2 | Synergy | — | Safe | Synergy | Required |
| Exp 3 | Safe | Safe | — | Confound | Partial |
| Exp 4 | Safe | Synergy | Confound | — | Transfer |
| Exp 5 | Required | Required | Partial | Transfer | — |
Schneider, Bischl, Feurer (2025). "Overtuning in Hyperparameter Optimization." arXiv:2506.19540. AutoML-Conf 2025.
Karwowski et al. (2023). "Goodhart's Law in Reinforcement Learning." arXiv:2310.09144. ICLR 2024.
| Paper Parameter | SBPI Parameter | Notes |
|---|---|---|
| HPO budget (# trials) | 30 TPE trials per nightly run | Fewer trials reduce overtuning risk |
| Validation set size | 51 company-week observations | Below their "small-data" threshold |
| Proxy metric | Directional accuracy | "Predict stable everywhere" is the exploit |
| True objective | Accuracy + Brier + MAE | Joint prediction quality |
| Optimization pressure | 30 trials / 51 obs = 0.59 | Should be < 0.3 for safe HPO |
| Early stopping threshold | Accuracy-Brier divergence | Stop when proxy-true gap widens |
| Metric | Degenerate config detection rate |
| Direction | Protective (prevents regression) |
| Magnitude | ~10% of nightly runs produce overtuned configs. Prevents 2-5pp accuracy drops. |
| Confidence | High — our data regime is exactly the danger zone |
| Min Data | 2 weeks (current data sufficient) |
| Code Change | ~30 lines, modification to Phase 6 |
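The Exp 1 guard can be sketched in a few lines. The function names and prediction-label format are hypothetical; the < 0.3 safe-pressure threshold and the >50% stable-rate trigger come from the tables above.

```python
def optimization_pressure(n_trials, n_obs):
    """Ratio of HPO trials to validation observations (safe zone: < 0.3)."""
    return n_trials / n_obs

def stable_everywhere_guard(predictions, max_stable_rate=0.5):
    """Reject configs whose predictions collapse to 'stable' for most companies.

    Returns True if the config passes (stable rate at or below the cap).
    """
    stable = sum(1 for p in predictions if p == "stable")
    return stable / len(predictions) <= max_stable_rate
```

At the current regime, `optimization_pressure(30, 51)` is about 0.59, roughly double the safe threshold, which is why the guard runs on every nightly config.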
Barker, Bell, Thomas, Carr, Andrews, Bhatt (2025). "Faster, Cheaper, Better: Multi-Objective HPO for LLM and RAG Systems." arXiv:2502.18635. ICLR 2025 Workshop.
| Paper Parameter | SBPI Parameter | Notes |
|---|---|---|
| Obj 3: Safety/Faithfulness | Brier Score (calibration) | "Is the system's confidence trustworthy?" |
| Obj 4: Alignment/Helpfulness | Directional Accuracy | "Did it produce the right answer?" |
| (No analog) | MAE (magnitude) | Added: our system predicts magnitude |
| 9 hyperparameters | 12 SBPI interface params | Same concept, different count |
| qLogNEHVI sampler | Optuna MOTPESampler | Built-in equivalent |
| Metric | Brier score improvement + accuracy maintained |
| Direction | Brier decreases 10-20%; accuracy stable or +3-8% |
| Magnitude | Pareto configs beat single-objective by 15-30% on secondary metrics |
| Confidence | Med-High — noise-aware variant matches our small-sample regime |
| Min Data | 4 weeks (W13, late March) |
| Code Change | ~40 lines in kg_interface_optimizer.py |
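A minimal sketch of the Brier calibration objective, assuming three-class directional predictions; the `up`/`stable`/`down` labels and the dict-of-probabilities format are assumptions about the prediction engine's output.

```python
def brier_score(probs, outcomes, classes=("up", "stable", "down")):
    """Mean multi-class Brier score; lower means better-calibrated confidence.

    probs: list of dicts mapping class label -> predicted probability.
    outcomes: list of realized class labels, same length as probs.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((p[c] - (1.0 if c == y else 0.0)) ** 2 for c in classes)
    return total / len(outcomes)
```

A perfectly confident correct prediction scores 0.0; a uniform 1/3-each prediction scores 2/3 regardless of outcome, which is the kind of gap the MOTPE run trades off against accuracy.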
Lu et al. (2025). "Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting." arXiv:2509.11452.
Wakayama, Sugasawa (2024). "Ensemble Prediction via Covariate-dependent Stacking." arXiv:2408.09755.
| Paper Parameter | SBPI Parameter | Notes |
|---|---|---|
| 3 reward objectives — Lu | 5 SBPI dimensions | Fixed scalarization → learned adaptive |
| Hypervolume-guided weights — Lu | TPE over wᵢ ∈ [0.05, 0.40] | Start with TPE for pipeline consistency |
| Covariates — Wakayama | Company tier, composite, divergence | ReelShort vs FlexTV get different weights |
| Fixed stacking — Wakayama | Current 0.25/0.20/0.20/0.20/0.15 | The baseline both papers beat |
| Metric | Directional accuracy via better composite discrimination |
| Direction | +5-15% from static learned weights; +3-8% additional from covariate-dependent weights |
| Magnitude | Lu: Pareto-dominant in 6.1 fewer steps. Wakayama: 11% over equal-weight. |
| Confidence | Med-High — two papers confirm learned beats intuition |
| Min Data | 6 weeks (W16, mid April) |
| Code Change | ~80 lines, expands search space 12 → 16 params |
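The constrained weighting scheme (4 free weights, a 5th pinned so all sum to 1, each within [0.05, 0.40]) might look like this. The dimension names are placeholders, not the real SBPI dimensions.

```python
# Placeholder names; the real five SBPI dimensions differ.
DIMENSIONS = ["content", "distribution", "monetization", "community", "momentum"]

def composite(scores, free_weights, lo=0.05, hi=0.40):
    """Weighted composite score with the 5th weight constrained to sum to 1.

    scores: dict mapping dimension -> score.
    free_weights: 4 TPE-sampled weights; the 5th is derived.
    """
    w5 = 1.0 - sum(free_weights)
    weights = list(free_weights) + [w5]
    if any(w < lo or w > hi for w in weights):
        raise ValueError("weights outside [0.05, 0.40]")
    return sum(w * scores[d] for w, d in zip(weights, DIMENSIONS))
```

The current fixed stacking 0.25/0.20/0.20/0.20/0.15 is just one point in this space; constraining each weight to [0.05, 0.40] is the mitigation for the weight-collapse risk noted later.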
Gastinger et al. (2024). "History Repeats Itself: A Baseline for TKG Forecasting." arXiv:2404.16726.
Liao, Jia, Li, Ma, Tresp (2023). "GenTKG: Generative Forecasting on Temporal Knowledge Graphs." arXiv:2310.07793.
| Paper Parameter | SBPI Parameter | Notes |
|---|---|---|
| Exponential decay rate (λ) | temporal_decay_rate [0.1, 0.9] | λ=0.5 means each prior week has half the weight |
| Lookback window (k) | temporal_lookback [2, 8] | Currently hardcoded to 2 |
| Relation-specific weights | Dimension-specific decay rates | Community stickier than Distribution |
| TLogic temporal rules | SPARQL temporal queries | Gap: their rules are learned; ours hand-written |
| Metric | Directional accuracy improvement |
| Direction | +8-15% relative (Gastinger: 24% MRR improvement over non-temporal) |
| Magnitude | Adding decay over 4-8 weeks vs fixed 2-week equal-weight |
| Confidence | Medium — depends on whether dynamics are decay-like or regime-shift |
| Min Data | 8 weeks (W18, late April); currently blocked pending data |
| Code Change | ~120 lines + new SPARQL query, expands to 18 params |
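The decay parameterization can be sketched as follows. The function names are hypothetical; the semantics (λ = 0.5 gives each prior week half the weight of the next) match the table.

```python
def decay_weights(decay_rate, lookback):
    """Normalized exponential weights over `lookback` weeks, most recent first."""
    raw = [decay_rate ** k for k in range(lookback)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_trend(deltas, decay_rate=0.5, lookback=4):
    """Decay-weighted average of recent week-over-week score deltas.

    deltas: most-recent-first list of composite score changes.
    """
    w = decay_weights(decay_rate, min(lookback, len(deltas)))
    return sum(wi * d for wi, d in zip(w, deltas))
```

Setting `decay_rate` near 1 with `lookback=2` recovers something close to the current hardcoded equal-weight 2-week window, so the search space contains the existing behavior as a baseline.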
Zeng, Maus, Jones, et al. (2025). "BOLT: Large Scale Multi-Task Bayesian Optimization with LLMs." arXiv:2503.08131.
| Paper Parameter | SBPI Parameter | Notes |
|---|---|---|
| Source tasks (~1,500 configs) | Micro-drama vertical (1 task, 30 trials) | Scale gap: 1 vs 1,500. But 30 trials provide a trajectory. |
| Target task | AI Agent vertical (same 12-param space) | Identical parameter names, different domain |
| LLM warm-start | Optuna enqueue_trial() | Seed study with best config + neighbors |
| 21% one-shot improvement | Trials-to-convergence reduction | Expect 40-60% fewer trials needed |
| Metric | Trials-to-convergence (primary), ceiling accuracy (secondary) |
| Direction | 40-60% fewer trials; 5-10% higher ceiling accuracy |
| Magnitude | Zeng et al.: 21% one-shot. Full warm-start stronger. |
| Confidence | Med-Low — interface params likely transfer; dimension weights likely don't |
| Min Data | 4+ weeks AI Agent data; both pipelines operational |
| Code Change | ~200 lines (new script + optimizer mods + SPARQL query) |
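The warm-start seeding could be sketched as below: take the source-optimal config plus jittered neighbors, which would then be passed to Optuna's `enqueue_trial` before the target study begins. The jitter scheme and function name are assumptions.

```python
import random

def warm_start_candidates(best_config, n_neighbors=4, jitter=0.10, seed=0):
    """Seed configs for the target vertical: source-optimal config + neighbors.

    Continuous params are perturbed by up to ±jitter; booleans pass through.
    """
    rng = random.Random(seed)
    candidates = [dict(best_config)]
    for _ in range(n_neighbors):
        neighbor = {
            k: v * (1 + rng.uniform(-jitter, jitter)) if isinstance(v, float) else v
            for k, v in best_config.items()
        }
        candidates.append(neighbor)
    return candidates
```

Each candidate would be queued via `study.enqueue_trial(params)` so the TPE sampler evaluates the transferred region first, which is where the expected 40-60% trial reduction would come from.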
| Risk | Trigger | Detection | Mitigation |
|---|---|---|---|
| "Predict stable everywhere" | Low direction_threshold | Stable rate >50% | Exp 1 Guard + Exp 2 Brier |
| Weight collapse | Weights converge to 0/1 | Entropy H < 0.5 | Constrain [0.05, 0.40] |
| Temporal overfitting | High lookback + low decay | Train/holdout divergence | Exp 1 detector on temporal params |
| Transfer poisoning | Overtuned source config | Exp 1 must clear source | Reject if Guard fails |
| Pareto collapse | MOTPE converges to 1 point | Hypervolume < 3 solutions | Increase trials or ranges |
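The weight-collapse detector from the table (entropy H < 0.5) is a one-liner; the nat-based entropy and function names are assumptions.

```python
import math

def weight_entropy(weights):
    """Shannon entropy (nats) of a weight vector; low entropy signals collapse."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def weights_collapsed(weights, threshold=0.5):
    """Flag the 'weight collapse' risk: entropy below the H < 0.5 trigger."""
    return weight_entropy(weights) < threshold
```

Equal weights over 5 dimensions give H = ln 5 ≈ 1.61, well above the trigger; a near-degenerate vector like 0.96/0.01/0.01/0.01/0.01 falls to roughly 0.22 and would be flagged.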
Experiments are sequenced by data requirements and dependency chain. Each experiment must stabilize before the next begins to enable clean attribution of improvements.
| Week | Date (approx) | Data Points | Unlocks |
|---|---|---|---|
| W10-W12 | Now | 51 obs (3 weeks × 17 companies) | Exp 1 |
| W13 | Late March | 68 obs | Exp 2 |
| W14-W15 | Early April | 85-102 obs | — |
| W16 | Mid April | 102 obs | Exp 3 |
| W17 | Late April | 119 obs | — |
| W18 | Late April | 136 obs | Exp 4 |
| W20+ | May | 170+ obs | Exp 5 (with AI Agent data) |
- Baseline (Exp 2): 12 parameters (interface tuning only)
- After Exp 3: 16 parameters (+ 4 dimension weights, 5th constrained)
- After Exp 4: 18 parameters (+ temporal_decay_rate + temporal_lookback)
- Transfer (Exp 5): 18 params transferred; dimension weights tested separately
Cost estimates are based on observed token usage from Experiments 1-2 and this research session. The SBPI optimization pipeline runs locally (Python + Optuna + pyoxigraph) with zero API cost per trial. The primary cost driver is Claude Code development time to implement each experiment.
| Item | Tokens | Tool Calls | Duration |
|---|---|---|---|
| Exp 2: KG Interface Optimizer (implementation) | ~180K | ~45 | ~25 min |
| Exp 2: TPE optimization run (30 trials) | 0 (local Python) | 0 | ~90 sec |
| This session: research (4 parallel agents) | ~362K | ~154 | ~8 min (parallel) |
| Research phase total | ~542K | ~199 | ~33 min |
| Phase | Tokens | % of Total | Notes |
|---|---|---|---|
| Research (this session) | ~542K | 42% | 4 parallel research agents + codebase exploration |
| Implementation (Exps 1-5) | ~780K | 58% | Code writing, testing, validation |
| Nightly pipeline execution | 0 | 0% | All local Python (Optuna + pyoxigraph) |
| Total | ~1.32M | 100% | |
Token costs assume Claude Opus for implementation. Using Sonnet for routine code changes would reduce implementation costs by ~60%. The nightly optimization loop is pure Python — no LLM API calls per TPE trial.
The autoresearch pipeline is asymptotically free after implementation. Each nightly run executes 30 TPE trials purely in Python (Optuna sampler + pyoxigraph SPARQL queries + numpy computation). Zero API calls per optimization cycle. The only ongoing cost is the electricity to run the launchd agent at 6:13 AM daily.
This is the core value proposition of the Karpathy autoresearch framing: front-load the research and implementation cost, then let the optimization compound nightly at zero marginal cost.
10 papers verified via web fetch. All arXiv IDs confirmed to resolve to real papers with matching titles and authors.
4 specialized agents ran in parallel for approximately 8 minutes:
| Agent | Focus | Tokens | Tool Calls | Duration |
|---|---|---|---|---|
| sbpi-explorer | Full codebase mapping | 160K | 49 | 7.4 min |
| paper-agent-1 | Dimension weighting + ensemble calibration | 81K | 49 | 7.6 min |
| paper-agent-2 | Cross-domain transfer + temporal KG | 64K | 32 | 7.3 min |
| paper-agent-3 | Multi-objective eval + adversarial stability | 57K | 24 | 4.7 min |
| Total | | 362K | 154 | ~8 min (parallel) |
Each paper agent searched arXiv, Semantic Scholar, and Google Scholar, then verified every paper by fetching the actual arXiv abstract page. Papers with unverifiable IDs were rejected.