Autoresearch Expansion — SBPI Stack-Ranking Systems

Executive Summary

We operate two competitive intelligence systems built on a semantic layer architecture. The SBPI (Structural Brand Power Index) tracks 17 micro-drama companies across 5 scoring dimensions, producing weekly composite scores stored as RDF triples. A prediction engine forecasts directional movement for each company each week.

Experiment 2 applied Markovick et al.'s autoresearch methodology to optimize the knowledge graph to prediction interface, recovering 69.9% directional accuracy from a 23.5% baseline. This report identifies 5 additional experiments that compound on that result.

69.9%

Current Accuracy

Optimized Params

Papers Verified

New Experiments

~470

Lines of Code

12 wk

Full Execution

The Compounding Chain

Each experiment's optimized output feeds the next. The chain progresses from safety infrastructure through feature expansion to cross-domain generalization.

EXP 1

Goodhart Guard

safety

→

EXP 2

MOTPE

objectives

→

EXP 3

Dim Weights

features

→

EXP 4

Temporal Decay

depth

→

EXP 5

Cross-Vertical

generalization

Core Insight

The interface between the knowledge representation layer and the prediction/reasoning layer contains tunable parameters that have disproportionate impact on output quality. These parameters are almost always set by intuition, never systematically optimized. Karpathy's autoresearch framing treats this systematic optimization as automated research that compounds over time.

System Architecture

Nightly Pipeline (6 Phases)

Phase 1: ETL Load          sbpi_to_rdf.py --all --validate
Phase 2: Accuracy Check     prediction_engine.py --report
Phase 3: Prediction Gen     prediction_engine.py --generate
Phase 4: Attestation        attestation_upgrade.py --upgrade
Phase 5: Insights           nightly-insights.py --schedule all
Phase 6: Optimization       kg_interface_optimizer.py --nightly  ← experiments modify this

Current 12-Parameter Search Space

Parameter	Optimized Value	Default	Change
`direction_threshold`	1.295	0.500	+159%
`confidence_base`	0.443	0.600	-26%
`mean_reversion_rate`	0.257	0.100	+157%
`anomaly_contributes`	true	false	flipped
`divergence_weight`	0.180	0.000	new
`tier_proximity_weight`	0.096	0.000	new
`magnitude_thresh_1`	3.020	3.000	+1%
`magnitude_thresh_2`	5.076	5.000	+2%
`consistency_thresh`	1.980	2.000	-1%
`magnitude_bonus_1`	0.120	0.100	+20%
`magnitude_bonus_2`	0.136	0.100	+36%
`consistency_bonus`	0.040	0.050	-20%

Experiments 3-4 expand this to 18 parameters by adding dimension weights and temporal decay.

Interference Matrix

	Exp 1	Exp 2	Exp 3	Exp 4	Exp 5
Exp 1	—	Synergy	Safe	Safe	Required
Exp 2	Synergy	—	Safe	Synergy	Required
Exp 3	Safe	Safe	—	Confound	Partial
Exp 4	Safe	Synergy	Confound	—	Transfer
Exp 5	Required	Required	Partial	Transfer	—

Experiment Specifications

EXP 1 Goodhart Guard — Overtuning Detection and Early Stopping

The nightly TPE loop running 30 trials on 3 weeks of data (~51 company-week observations) is in the high-risk zone for overtuning. Implementing early stopping and default-baseline comparison will detect and prevent degenerate configurations before they enter production.

Paper Analogs

Schneider, Bischl, Feurer (2025). "Overtuning in Hyperparameter Optimization." arXiv:2506.19540. AutoML-Conf 2025.
Karwowski et al. (2023). "Goodhart's Law in Reinforcement Learning." arXiv:2310.09144. ICLR 2024.

Parameter Mapping

Paper Parameter	SBPI Parameter	Notes
HPO budget (# trials)	30 TPE trials per nightly run	Fewer trials reduces overtuning
Validation set size	51 company-week observations	Below their "small-data" threshold
Proxy metric	Directional accuracy	"Predict stable everywhere" is the exploit
True objective	Accuracy + Brier + MAE	Joint prediction quality
Optimization pressure	30 trials / 51 obs = 0.59	Should be < 0.3 for safe HPO
Early stopping threshold	Accuracy-Brier divergence	Stop when proxy-true gap widens

Expected Signal

Metric	Degenerate config detection rate
Direction	Protective (prevents regression)
Magnitude	~10% of nightly runs produce overtuned configs. Prevents 2-5pp accuracy drops.
Confidence	High — our data regime is exactly the danger zone
Min Data	2 weeks (current data sufficient)
Code Change	~30 lines, modification to Phase 6

EXP 2 Multi-Objective Interface Optimization (MOTPE)

Replacing single-objective TPE with multi-objective TPE over accuracy + Brier score + MAE will produce Pareto-optimal configurations that generalize better and resist the "predict stable everywhere" degenerate solution.

Paper Analog

Barker, Bell, Thomas, Carr, Andrews, Bhatt (2025). "Faster, Cheaper, Better: Multi-Objective HPO for LLM and RAG Systems." arXiv:2502.18635. ICLR 2025 Workshop.

Parameter Mapping

Paper Parameter	SBPI Parameter	Notes
Obj 3: Safety/Faithfulness	Brier Score (calibration)	"Is the system's confidence trustworthy?"
Obj 4: Alignment/Helpfulness	Directional Accuracy	"Did it produce the right answer?"
(No analog)	MAE (magnitude)	Added: our system predicts magnitude
9 hyperparameters	12 SBPI interface params	Same concept, different count
qLogNEHVI sampler	Optuna MOTPESampler	Built-in equivalent

Expected Signal

Metric	Brier score improvement + accuracy maintained
Direction	Brier decreases 10-20%; accuracy stable or +3-8%
Magnitude	Pareto configs beat single-objective by 15-30% on secondary metrics
Confidence	Med-High — noise-aware variant matches our small-sample regime
Min Data	4 weeks (W13, late March)
Code Change	~40 lines in kg_interface_optimizer.py

EXP 3 Dimension Weight Optimization

The 5 SBPI dimension weights (currently 0.25/0.20/0.20/0.20/0.15, set by domain intuition) are an unoptimized interface. Including them in the TPE search space will improve composite score quality and downstream prediction accuracy. Phase 2 extends to per-company covariate-dependent weights.

Paper Analogs

Lu et al. (2025). "Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting." arXiv:2509.11452.
Wakayama, Sugasawa (2024). "Ensemble Prediction via Covariate-dependent Stacking." arXiv:2408.09755.

Parameter Mapping

Paper Parameter	SBPI Parameter	Notes
3 reward objectives — Lu	5 SBPI dimensions	Fixed scalarization → learned adaptive
Hypervolume-guided weights — Lu	TPE over w_i ∈ [0.05, 0.40]	Start with TPE for pipeline consistency
Covariates — Wakayama	Company tier, composite, divergence	ReelShort vs FlexTV get different weights
Fixed stacking — Wakayama	Current 0.25/0.20/0.20/0.20/0.15	The baseline both papers beat

Expected Signal

Metric	Directional accuracy via better composite discrimination
Direction	+5-15% static weights; +3-8% additional from covariate-dependent
Magnitude	Lu: Pareto-dominant in 6.1 fewer steps. Wakayama: 11% over equal-weight.
Confidence	Med-High — two papers confirm learned beats intuition
Min Data	6 weeks (W16, mid April)
Code Change	~80 lines, expands search space 12 → 16 params

EXP 4 Temporal Decay Signal Architecture

Adding exponential temporal decay weighting — where recent weeks contribute more to predictions than older weeks — will capture the recency bias inherent in competitive dynamics and improve directional accuracy by 8-15%.

Paper Analogs

Gastinger et al. (2024). "History Repeats Itself: A Baseline for TKG Forecasting." arXiv:2404.16726.
Liao, Jia, Li, Ma, Tresp (2023). "GenTKG: Generative Forecasting on Temporal Knowledge Graphs." arXiv:2310.07793.

Parameter Mapping

Paper Parameter	SBPI Parameter	Notes
Exponential decay rate (λ)	`temporal_decay_rate` [0.1, 0.9]	λ=0.5 means each prior week has half the weight
Lookback window (k)	`temporal_lookback` [2, 8]	Currently hardcoded to 2
Relation-specific weights	Dimension-specific decay rates	Community stickier than Distribution
TLogic temporal rules	SPARQL temporal queries	Gap: their rules are learned; ours hand-written

Expected Signal

Metric	Directional accuracy improvement
Direction	+8-15% relative (Gastinger: 24% MRR improvement over non-temporal)
Magnitude	Adding decay over 4-8 weeks vs fixed 2-week equal-weight
Confidence	Medium — depends on whether dynamics are decay-like or regime-shift
Min Data	8 weeks (W18, late April) Blocked
Code Change	~120 lines + new SPARQL query, expands to 18 params

EXP 5 Cross-Vertical Configuration Transfer

Warm-starting the AI Agent vertical's TPE optimization from the micro-drama vertical's optimized config will converge faster (fewer trials) and to a better optimum than cold-start initialization.

Paper Analog

Zeng, Maus, Jones, et al. (2025). "BOLT: Large Scale Multi-Task Bayesian Optimization with LLMs." arXiv:2503.08131.

Parameter Mapping

Paper Parameter	SBPI Parameter	Notes
Source tasks (~1,500 configs)	Micro-drama vertical (1 task, 30 trials)	Scale gap: 1 vs 1,500. But 30 trials provide a trajectory.
Target task	AI Agent vertical (same 12-param space)	Identical parameter names, different domain
LLM warm-start	Optuna enqueue_trial()	Seed study with best config + neighbors
21% one-shot improvement	Trials-to-convergence reduction	Expect 40-60% fewer trials needed

Expected Signal

Metric	Trials-to-convergence (primary), ceiling accuracy (secondary)
Direction	40-60% fewer trials; 5-10% higher ceiling accuracy
Magnitude	Zeng et al.: 21% one-shot. Full warm-start stronger.
Confidence	Med-Low — interface params likely transfer; dimension weights likely don't
Min Data	4+ weeks AI Agent data; both pipelines operational
Code Change	~200 lines (new script + optimizer mods + SPARQL query)

Degenerate Solution Risk Map

Risk	Trigger	Detection	Mitigation
"Predict stable everywhere"	Low direction_threshold	Stable rate >50%	Exp 1 Guard + Exp 2 Brier
Weight collapse	Weights converge to 0/1	Entropy H < 0.5	Constrain [0.05, 0.40]
Temporal overfitting	High lookback + low decay	Train/holdout divergence	Exp 1 detector on temporal params
Transfer poisoning	Overtuned source config	Exp 1 must clear source	Reject if Guard fails
Pareto collapse	MOTPE converges to 1 point	Hypervolume < 3 solutions	Increase trials or ranges

Implementation Timeline

Experiments are sequenced by data requirements and dependency chain. Each experiment must stabilize before the next begins to enable clean attribution of improvements.

WEEK 0-1 — Now (W10-W12 data)

Exp 1: Goodhart Guard

Safety net for all subsequent experiments. Adds overtuning detection, default baseline comparison, and early stopping to Phase 6. ~30 lines, no new data needed. Can start immediately.

WEEK 1-2 — Late March (W13 data loads)

Exp 2: Multi-Objective TPE

Replace single-objective TPE with MOTPE (accuracy + Brier + MAE). Changes optimization landscape for all future experiments. Requires 4 weeks data for calibration evaluation. ~40 lines.

WEEK 4-6 — Mid April (W16 data loads)

Exp 3: Dimension Weight Optimization

Expand search space from 12 to 16 parameters. Learn optimal dimension weights under multi-objective optimization. Requires 6 weeks for sufficient transition pairs. ~80 lines. Must freeze before Exp 4.

WEEK 8-10 — Late April (W18 data loads)

Exp 4: Temporal Decay Signals

Add exponential decay and variable lookback window. Expands to 18 parameters. Requires 8 weeks of history for decay patterns to differentiate from equal-weight. ~120 lines + new SPARQL query.

WEEK 10-12 — May (AI Agent pipeline operational)

Exp 5: Cross-Vertical Transfer

Capstone experiment. Transfer optimized config from micro-drama to AI agent vertical. Tests which parameters are universal vs domain-specific. ~200 lines (new script). Requires stable source config.

Data Accumulation Schedule

Week	Date (approx)	Data Points	Unlocks
W10-W12	Now	51 obs (3 weeks x 17 co.)	Exp 1
W13	Late March	68 obs	Exp 2
W14-W15	Early April	85-102 obs	—
W16	Mid April	102 obs	Exp 3
W17	Late April	119 obs	—
W18	Late April	136 obs	Exp 4
W20+	May	170+ obs	Exp 5 (with AI Agent data)

Search Space Evolution

Baseline (Exp 2):     12 parameters — interface tuning only
After Exp 3:          16 parameters — + 4 dimension weights (5th constrained)
After Exp 4:          18 parameters — + temporal_decay_rate + temporal_lookback
Transfer (Exp 5):     18 params transferred, dimension weights tested separately

Cost Model

Cost estimates are based on observed token usage from Experiments 1-2 and this research session. The SBPI optimization pipeline runs locally (Python + Optuna + pyoxigraph) with zero API cost per trial. The primary cost driver is Claude Code development time to implement each experiment.

Prior Experiment Costs (Observed)

Item	Tokens	Tool Calls	Duration
Exp 2: KG Interface Optimizer (implementation)	~180K	~45	~25 min
Exp 2: TPE optimization run (30 trials)	0 (local Python)	0	~90 sec
This session: research (4 parallel agents)	~362K	~154	~8 min (parallel)
Research phase total	~542K	~199	~33 min

Implementation Cost Estimates (per experiment)

Exp 1: Goodhart Guard ~80K ~30 $0 (local)

Exp 2: MOTPE ~100K ~40 $0 (local)

Exp 3: Dimension Weights ~150K ~80 $0 (local)

Exp 4: Temporal Decay ~200K ~120 $0 (local)

Exp 5: Cross-Vertical Transfer ~250K ~200 $0 (local)

Total Implementation ~780K ~470 $0/night

Total Project Cost Estimate

~1.3M

Total Tokens (research + impl)

Ongoing Nightly Cost

~470

Lines of New Code

12 wk

Calendar Time

Cost Breakdown by Phase

Phase	Tokens	% of Total	Notes
Research (this session)	~542K	42%	4 parallel research agents + codebase exploration
Implementation (Exps 1-5)	~780K	58%	Code writing, testing, validation
Nightly pipeline execution	0	0%	All local Python (Optuna + pyoxigraph)
Total	~1.32M	100%

Token costs assume Claude Opus for implementation. Using Sonnet for routine code changes would reduce implementation costs by ~60%. The nightly optimization loop is pure Python — no LLM API calls per TPE trial.

Key Cost Insight

The autoresearch pipeline is asymptotically free after implementation. Each nightly run executes 30 TPE trials purely in Python (Optuna sampler + pyoxigraph SPARQL queries + numpy computation). Zero API calls per optimization cycle. The only ongoing cost is the electricity to run the launchd agent at 6:13 AM daily.

This is the core value proposition of the Karpathy autoresearch framing: front-load the research and implementation cost, then let the optimization compound nightly at zero marginal cost.

Paper Bibliography

10 papers verified via web fetch. All arXiv IDs confirmed to resolve to real papers with matching titles and authors.

1. Overtuning in Hyperparameter Optimization

Lennart Schneider, Bernd Bischl, Matthias Feurer

arXiv:2506.19540 · AutoML-Conf 2025

Used in: Exp 1 (Goodhart Guard) — ~10% of HPO runs produce overtuned configs in small-data regimes

2. Goodhart's Law in Reinforcement Learning

Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse

arXiv:2310.09144 · ICLR 2024

Used in: Exp 1 (Goodhart Guard) — provides early stopping criterion with provable regret bounds

3. Faster, Cheaper, Better: Multi-Objective HPO for LLM and RAG Systems

Matthew Barker, Andrew Bell, Evan Thomas, James Carr, Thomas Andrews, Umang Bhatt

arXiv:2502.18635 · ICLR 2025 Workshop

Used in: Exp 2 (MOTPE) — noise-aware multi-objective BO outperforms single-objective in small samples

4. History Repeats Itself: A Baseline for Temporal KG Forecasting

Julia Gastinger, Christian Meilicke, Federico Errica, Timo Sztyler, Anett Schuelke, Heiner Stuckenschmidt

arXiv:2404.16726 · 2024

Used in: Exp 4 (Temporal Decay) — 2-param exponential decay beats 9 of 11 complex methods

5. BOLT: Large Scale Multi-Task Bayesian Optimization with LLMs

Yimeng Zeng, Natalie Maus, Haydn Thomas Jones, Jeffrey Tao, et al.

arXiv:2503.08131 · 2025

Used in: Exp 5 (Cross-Vertical Transfer) — 21% improvement from warm-start across tasks

6. GenTKG: Generative Forecasting on Temporal Knowledge Graphs

Ruotong Liao, Xu Jia, Yangzhe Li, Yunpu Ma, Volker Tresp

arXiv:2310.07793 · 2023 (revised April 2024)

Used in: Exp 4 (Temporal Decay) — few-shot TKG forecasting with 16 training samples

7. Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting

Yiming Lu, Zhepeng Wang, Shenghao Li, Xiao Liu, Chenyu Yu, Qianyi Yin, Zhengwei Shi, Zhengyu Zhang, Mingrui Jiang

arXiv:2509.11452 · 2025

Used in: Exp 3 (Dimension Weights) — hypervolume-guided adaptive weights find Pareto-dominant solutions

8. Multi-layer Stack Ensembles for Time Series Forecasting

Niklas Bosch, Oleksandr Shchur, Nick Erickson, Michael Bohlke-Schneider, Caner Turkmen

arXiv:2511.15350 · 2025

Future Exp 6 candidate — learned stacking (Elo 1306) vs fixed averaging (Elo 1000)

9. Ensemble Prediction via Covariate-dependent Stacking

Tomoya Wakayama, Shonosuke Sugasawa

arXiv:2408.09755 · 2024

Used in: Exp 3 Phase 2 — per-company signal weights (11% improvement over equal weighting)

10. Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning

Markovick et al.

arXiv:2505.24478v1 · 2025

Foundation paper — Experiments 1-2 baseline methodology

Research Methodology

How This Research Was Conducted

4 specialized agents ran in parallel for approximately 8 minutes:

Agent	Focus	Tokens	Tool Calls	Duration
sbpi-explorer	Full codebase mapping	160K	49	7.4 min
paper-agent-1	Dimension weighting + ensemble calibration	81K	49	7.6 min
paper-agent-2	Cross-domain transfer + temporal KG	64K	32	7.3 min
paper-agent-3	Multi-objective eval + adversarial stability	57K	24	4.7 min
Total		362K	154	~8 min (parallel)

Each paper agent searched arXiv, Semantic Scholar, and Google Scholar, then verified every paper by fetching the actual arXiv abstract page. Papers with unverifiable IDs were rejected.