Executive Summary

We operate two competitive intelligence systems built on a semantic layer architecture. The SBPI (Structural Brand Power Index) tracks 17 micro-drama companies across 5 scoring dimensions, producing weekly composite scores stored as RDF triples. A prediction engine forecasts directional movement for each company each week.

Experiment 2 applied Markovick et al.'s autoresearch methodology to optimize the knowledge-graph-to-prediction interface, raising directional accuracy from a 23.5% baseline to 69.9%. This report identifies 5 additional experiments that compound on that result.

Current accuracy: 69.9%
Optimized parameters: 12
Papers verified: 10
New experiments: 5
Lines of code: ~470
Full execution: 12 weeks

The Compounding Chain

Each experiment's optimized output feeds the next. The chain progresses from safety infrastructure through feature expansion to cross-domain generalization.

EXP 1 Goodhart Guard (safety) → EXP 2 MOTPE (objectives) → EXP 3 Dim Weights (features) → EXP 4 Temporal Decay (depth) → EXP 5 Cross-Vertical (generalization)

Core Insight

The interface between the knowledge representation layer and the prediction/reasoning layer contains tunable parameters with disproportionate impact on output quality. These parameters are almost always set by intuition and almost never systematically optimized. Karpathy's autoresearch framing treats this systematic optimization as automated research that compounds over time.

System Architecture

Nightly Pipeline (6 Phases)

Phase 1: ETL Load           sbpi_to_rdf.py --all --validate
Phase 2: Accuracy Check     prediction_engine.py --report
Phase 3: Prediction Gen     prediction_engine.py --generate
Phase 4: Attestation        attestation_upgrade.py --upgrade
Phase 5: Insights           nightly-insights.py --schedule all
Phase 6: Optimization       kg_interface_optimizer.py --nightly   ← experiments modify this

Current 12-Parameter Search Space

Parameter | Optimized Value | Default | Change
direction_threshold | 1.295 | 0.500 | +159%
confidence_base | 0.443 | 0.600 | -26%
mean_reversion_rate | 0.257 | 0.100 | +157%
anomaly_contributes | true | false | flipped
divergence_weight | 0.180 | 0.000 | new
tier_proximity_weight | 0.096 | 0.000 | new
magnitude_thresh_1 | 3.020 | 3.000 | +1%
magnitude_thresh_2 | 5.076 | 5.000 | +2%
consistency_thresh | 1.980 | 2.000 | -1%
magnitude_bonus_1 | 0.120 | 0.100 | +20%
magnitude_bonus_2 | 0.136 | 0.100 | +36%
consistency_bonus | 0.040 | 0.050 | -20%

Experiments 3-4 expand this to 18 parameters by adding dimension weights and temporal decay.
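
As a sketch of how Phase 6 consumes this table: each parameter becomes one TPE suggestion per trial. The bounds below are illustrative rather than the production ranges, and run_backtest is a hypothetical stand-in for the backtest that replays predictions over the accumulated weeks.

import optuna

def objective(trial: optuna.Trial) -> float:
    # Illustrative subset of the 12-parameter interface search space;
    # the bounds are assumptions, not the production ranges.
    params = {
        "direction_threshold": trial.suggest_float("direction_threshold", 0.1, 2.0),
        "confidence_base": trial.suggest_float("confidence_base", 0.2, 0.9),
        "mean_reversion_rate": trial.suggest_float("mean_reversion_rate", 0.0, 0.5),
        "anomaly_contributes": trial.suggest_categorical("anomaly_contributes", [True, False]),
        # ...the remaining 8 parameters follow the same pattern
    }
    return run_backtest(params).directional_accuracy  # hypothetical helper; maximized

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=30)  # the 30-trial nightly budget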

Interference Matrix

         Exp 1      Exp 2      Exp 3      Exp 4      Exp 5
Exp 1    ·          Synergy    Safe       Safe       Required
Exp 2    Synergy    ·          Safe       Synergy    Required
Exp 3    Safe       Safe       ·          Confound   Partial
Exp 4    Safe       Synergy    Confound   ·          Transfer
Exp 5    Required   Required   Partial    Transfer   ·

Experiment Specifications

EXP 1 Goodhart Guard — Overtuning Detection and Early Stopping
The nightly TPE loop running 30 trials on 3 weeks of data (~51 company-week observations) is in the high-risk zone for overtuning. Implementing early stopping and default-baseline comparison will detect and prevent degenerate configurations before they enter production.

Paper Analogs

Schneider, Bischl, Feurer (2025). "Overtuning in Hyperparameter Optimization." arXiv:2506.19540. AutoML-Conf 2025.
Karwowski et al. (2023). "Goodhart's Law in Reinforcement Learning." arXiv:2310.09144. ICLR 2024.

Parameter Mapping
Paper Parameter | SBPI Parameter | Notes
HPO budget (# trials) | 30 TPE trials per nightly run | Fewer trials reduce overtuning
Validation set size | 51 company-week observations | Below their "small-data" threshold
Proxy metric | Directional accuracy | "Predict stable everywhere" is the exploit
True objective | Accuracy + Brier + MAE | Joint prediction quality
Optimization pressure | 30 trials / 51 obs = 0.59 | Should be < 0.3 for safe HPO
Early stopping threshold | Accuracy-Brier divergence | Stop when the proxy-true gap widens

Expected Signal

Metric | Degenerate config detection rate
Direction | Protective (prevents regression)
Magnitude | ~10% of nightly runs produce overtuned configs; prevents 2-5pp accuracy drops
Confidence | High — our data regime is exactly the danger zone
Min Data | 2 weeks (current data sufficient)
Code Change | ~30 lines, modification to Phase 6
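
A minimal sketch of the guard logic, assuming hypothetical evaluate() and DEFAULT_PARAMS helpers. It rejects a candidate configuration that cannot beat the hand-set defaults on held-out weeks, or whose accuracy gain comes with worsening calibration (the accuracy-Brier divergence above).

def goodhart_guard(candidate_params, holdout_weeks, gap_limit=0.15):
    # evaluate() is a hypothetical helper that replays predictions for the
    # given weeks and returns accuracy and Brier score for a config.
    cand = evaluate(candidate_params, holdout_weeks)
    base = evaluate(DEFAULT_PARAMS, holdout_weeks)

    # Default-baseline comparison: an "optimized" config that cannot beat
    # the defaults out of sample is overtuned, not improved.
    if cand.accuracy <= base.accuracy:
        return False

    # Proxy-true divergence: accuracy rising while calibration worsens is
    # the "predict stable everywhere" exploit surfacing.
    if (cand.accuracy - base.accuracy) > gap_limit and cand.brier > base.brier:
        return False

    return True
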
EXP 2 Multi-Objective Interface Optimization (MOTPE)
Replacing single-objective TPE with multi-objective TPE over accuracy + Brier score + MAE will produce Pareto-optimal configurations that generalize better and resist the "predict stable everywhere" degenerate solution.

Paper Analog

Barker, Bell, Thomas, Carr, Andrews, Bhatt (2025). "Faster, Cheaper, Better: Multi-Objective HPO for LLM and RAG Systems." arXiv:2502.18635. ICLR 2025 Workshop.

Parameter Mapping
Paper Parameter | SBPI Parameter | Notes
Obj 3: Safety/Faithfulness | Brier score (calibration) | "Is the system's confidence trustworthy?"
Obj 4: Alignment/Helpfulness | Directional accuracy | "Did it produce the right answer?"
(No analog) | MAE (magnitude) | Added because our system predicts magnitude
9 hyperparameters | 12 SBPI interface params | Same concept, different count
qLogNEHVI sampler | Optuna MOTPESampler | Built-in equivalent

Expected Signal

Metric | Brier score improvement with accuracy maintained
Direction | Brier decreases 10-20%; accuracy stable or +3-8%
Magnitude | Pareto configs beat single-objective by 15-30% on secondary metrics
Confidence | Med-High — noise-aware variant matches our small-sample regime
Min Data | 4 weeks (W13, late March)
Code Change | ~40 lines in kg_interface_optimizer.py
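
A sketch of the swap in kg_interface_optimizer.py, reusing the hypothetical run_backtest and suggest_interface_params helpers from the single-objective sketch. Recent Optuna releases fold multi-objective TPE into TPESampler; MOTPESampler is the older explicit variant.

import optuna

def objectives(trial):
    params = suggest_interface_params(trial)  # hypothetical: the 12-param space above
    result = run_backtest(params)             # hypothetical backtest helper
    # Accuracy is maximized; Brier and MAE are minimized.
    return result.directional_accuracy, result.brier, result.mae

study = optuna.create_study(
    directions=["maximize", "minimize", "minimize"],
    sampler=optuna.samplers.TPESampler(seed=42),  # multi-objective capable in recent Optuna
)
study.optimize(objectives, n_trials=30)
pareto_configs = study.best_trials  # a Pareto front, not a single best trial
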
EXP 3 Dimension Weight Optimization
The 5 SBPI dimension weights (currently 0.25/0.20/0.20/0.20/0.15, set by domain intuition) are an unoptimized interface. Including them in the TPE search space will improve composite score quality and downstream prediction accuracy. Phase 2 extends to per-company covariate-dependent weights.

Paper Analogs

Lu et al. (2025). "Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting." arXiv:2509.11452.
Wakayama, Sugasawa (2024). "Ensemble Prediction via Covariate-dependent Stacking." arXiv:2408.09755.

Parameter Mapping
Paper Parameter | SBPI Parameter | Notes
3 reward objectives (Lu) | 5 SBPI dimensions | Fixed scalarization → learned adaptive weights
Hypervolume-guided weights (Lu) | TPE over w_i ∈ [0.05, 0.40] | Start with TPE for pipeline consistency
Covariates (Wakayama) | Company tier, composite, divergence | ReelShort vs FlexTV get different weights
Fixed stacking (Wakayama) | Current 0.25/0.20/0.20/0.20/0.15 | The baseline both papers beat

Expected Signal

Metric | Directional accuracy via better composite discrimination
Direction | +5-15% from static weights; +3-8% additional from covariate-dependent weights
Magnitude | Lu: Pareto-dominant in 6.1 fewer steps; Wakayama: 11% over equal-weight
Confidence | Med-High — two papers confirm that learned weights beat intuition
Min Data | 6 weeks (W16, mid April)
Code Change | ~80 lines; expands search space 12 → 16 params
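
A sketch of the weight expansion, using hypothetical dimension names (only the 0.25/0.20/0.20/0.20/0.15 defaults appear in this report). Four weights are searched in [0.05, 0.40] and the fifth is derived so the vector sums to 1, which is why the space grows to 16 parameters rather than 17 and also bounds the weight-collapse risk noted in the risk map below.

import optuna

def suggest_dimension_weights(trial):
    # Hypothetical dimension names; search 4 weights, derive the 5th.
    dims = ["dim_a", "dim_b", "dim_c", "dim_d"]
    weights = {d: trial.suggest_float(f"w_{d}", 0.05, 0.40) for d in dims}
    weights["dim_e"] = 1.0 - sum(weights.values())
    # Enforce the same bounds on the derived weight; prune otherwise.
    if not 0.05 <= weights["dim_e"] <= 0.40:
        raise optuna.TrialPruned()
    return weights
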
EXP 4 Temporal Decay Signal Architecture
Adding exponential temporal decay weighting — where recent weeks contribute more to predictions than older weeks — will capture the recency bias inherent in competitive dynamics and improve directional accuracy by 8-15%.

Paper Analogs

Gastinger et al. (2024). "History Repeats Itself: A Baseline for TKG Forecasting." arXiv:2404.16726.
Liao, Jia, Li, Ma, Tresp (2023). "GenTKG: Generative Forecasting on Temporal Knowledge Graphs." arXiv:2310.07793.

Parameter Mapping
Paper Parameter | SBPI Parameter | Notes
Exponential decay rate (λ) | temporal_decay_rate ∈ [0.1, 0.9] | λ = 0.5 means each prior week has half the weight
Lookback window (k) | temporal_lookback ∈ [2, 8] | Currently hardcoded to 2
Relation-specific weights | Dimension-specific decay rates | Community is stickier than Distribution
TLogic temporal rules | SPARQL temporal queries | Gap: their rules are learned; ours are hand-written

Expected Signal

Metric | Directional accuracy improvement
Direction | +8-15% relative (Gastinger: 24% MRR improvement over non-temporal)
Magnitude | Decay over 4-8 weeks vs the fixed 2-week equal weighting
Confidence | Medium — depends on whether dynamics are decay-like or regime-shift
Min Data | 8 weeks (W18, late April); currently blocked
Code Change | ~120 lines + new SPARQL query; expands to 18 params
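
A minimal sketch of the decay weighting over weekly composite deltas, assuming they arrive as a list ordered oldest to newest:

import numpy as np

def decayed_trend(weekly_deltas, decay_rate=0.5, lookback=4):
    # weekly_deltas: composite-score changes, oldest to newest.
    # decay_rate (λ): each week back contributes λ times the next week's weight.
    # lookback: how many recent weeks enter the signal (currently hardcoded to 2).
    window = np.asarray(weekly_deltas[-lookback:], dtype=float)
    # Newest week gets weight λ^0 = 1, the week before λ^1, and so on.
    weights = decay_rate ** np.arange(len(window))[::-1]
    return float(np.sum(window * weights) / np.sum(weights))
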
EXP 5 Cross-Vertical Configuration Transfer
Warm-starting the AI Agent vertical's TPE optimization from the micro-drama vertical's optimized config will converge faster (fewer trials) and to a better optimum than cold-start initialization.

Paper Analog

Zeng, Maus, Jones, et al. (2025). "BOLT: Large Scale Multi-Task Bayesian Optimization with LLMs." arXiv:2503.08131.

Parameter Mapping
Paper Parameter | SBPI Parameter | Notes
Source tasks (~1,500 configs) | Micro-drama vertical (1 task, 30 trials) | Scale gap: 1 vs 1,500, but 30 trials provide a trajectory
Target task | AI Agent vertical (same 12-param space) | Identical parameter names, different domain
LLM warm-start | Optuna enqueue_trial() | Seed the study with the best config + neighbors
21% one-shot improvement | Trials-to-convergence reduction | Expect 40-60% fewer trials needed

Expected Signal

Metric | Trials-to-convergence (primary), ceiling accuracy (secondary)
Direction | 40-60% fewer trials; 5-10% higher ceiling accuracy
Magnitude | Zeng et al.: 21% one-shot improvement; a full warm-start should be stronger
Confidence | Med-Low — interface params likely transfer; dimension weights likely don't
Min Data | 4+ weeks of AI Agent data; both pipelines operational
Code Change | ~200 lines (new script + optimizer mods + SPARQL query)
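
Optuna's enqueue_trial() (a real API) makes the warm-start mechanical. A sketch, where best_microdrama_params is the assumed output of the source vertical's study, ai_agent_objective is a hypothetical target objective, and the jittered neighbors are an illustrative choice:

import random
import optuna

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=42))

# Seed the AI Agent study with the micro-drama optimum...
study.enqueue_trial(best_microdrama_params)

# ...plus a few jittered neighbors, so TPE starts with local structure
# around the transferred optimum rather than a single point.
for _ in range(4):
    neighbor = {
        k: v * random.uniform(0.9, 1.1) if isinstance(v, float) else v
        for k, v in best_microdrama_params.items()
    }
    study.enqueue_trial(neighbor)

study.optimize(ai_agent_objective, n_trials=30)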

Degenerate Solution Risk Map

Risk | Trigger | Detection | Mitigation
"Predict stable everywhere" | Low direction_threshold | Stable rate > 50% | Exp 1 Guard + Exp 2 Brier
Weight collapse | Weights converge to 0/1 | Entropy H < 0.5 | Constrain to [0.05, 0.40]
Temporal overfitting | High lookback + low decay | Train/holdout divergence | Exp 1 detector on temporal params
Transfer poisoning | Overtuned source config | Exp 1 must clear source | Reject if Guard fails
Pareto collapse | MOTPE converges to 1 point | Pareto front has < 3 solutions | Increase trials or ranges

Implementation Timeline

Experiments are sequenced by data requirements and dependency chain. Each experiment must stabilize before the next begins to enable clean attribution of improvements.

WEEK 0-1 — Now (W10-W12 data)
Exp 1: Goodhart Guard
Safety net for all subsequent experiments. Adds overtuning detection, default baseline comparison, and early stopping to Phase 6. ~30 lines, no new data needed. Can start immediately.
WEEK 1-2 — Late March (W13 data loads)
Exp 2: Multi-Objective TPE
Replace single-objective TPE with MOTPE (accuracy + Brier + MAE). Changes optimization landscape for all future experiments. Requires 4 weeks data for calibration evaluation. ~40 lines.
WEEK 4-6 — Mid April (W16 data loads)
Exp 3: Dimension Weight Optimization
Expand search space from 12 to 16 parameters. Learn optimal dimension weights under multi-objective optimization. Requires 6 weeks for sufficient transition pairs. ~80 lines. Must freeze before Exp 4.
WEEK 8-10 — Late April (W18 data loads)
Exp 4: Temporal Decay Signals
Add exponential decay and variable lookback window. Expands to 18 parameters. Requires 8 weeks of history for decay patterns to differentiate from equal-weight. ~120 lines + new SPARQL query.
WEEK 10-12 — May (AI Agent pipeline operational)
Exp 5: Cross-Vertical Transfer
Capstone experiment. Transfer optimized config from micro-drama to AI agent vertical. Tests which parameters are universal vs domain-specific. ~200 lines (new script). Requires stable source config.

Data Accumulation Schedule

Week | Date (approx) | Data Points | Unlocks
W10-W12 | Now | 51 obs (3 weeks × 17 companies) | Exp 1
W13 | Late March | 68 obs | Exp 2
W14-W15 | Early April | 85-102 obs | ·
W16 | Mid April | 102 obs | Exp 3
W17 | Late April | 119 obs | ·
W18 | Late April | 136 obs | Exp 4
W20+ | May | 170+ obs | Exp 5 (with AI Agent data)

Search Space Evolution

Baseline (Exp 2):     12 parameters — interface tuning only
After Exp 3:          16 parameters — + 4 dimension weights (5th constrained)
After Exp 4:          18 parameters — + temporal_decay_rate + temporal_lookback
Transfer (Exp 5):     18 params transferred, dimension weights tested separately

Cost Model

Cost estimates are based on observed token usage from Experiments 1-2 and this research session. The SBPI optimization pipeline runs locally (Python + Optuna + pyoxigraph) with zero API cost per trial. The primary cost driver is Claude Code development time to implement each experiment.

Prior Experiment Costs (Observed)

Item | Tokens | Tool Calls | Duration
Exp 2: KG Interface Optimizer (implementation) | ~180K | ~45 | ~25 min
Exp 2: TPE optimization run (30 trials) | 0 (local Python) | 0 | ~90 sec
This session: research (4 parallel agents) | ~362K | ~154 | ~8 min (parallel)
Research phase total | ~542K | ~199 | ~33 min

Implementation Cost Estimates (per experiment)

Experiment | Dev Tokens | Code Lines | Nightly Cost
Exp 1: Goodhart Guard | ~80K | ~30 | $0 (local)
Exp 2: MOTPE | ~100K | ~40 | $0 (local)
Exp 3: Dimension Weights | ~150K | ~80 | $0 (local)
Exp 4: Temporal Decay | ~200K | ~120 | $0 (local)
Exp 5: Cross-Vertical Transfer | ~250K | ~200 | $0 (local)
Total Implementation | ~780K | ~470 | $0/night

Total Project Cost Estimate

Total tokens (research + implementation): ~1.3M
Ongoing nightly cost: $0
Lines of new code: ~470
Calendar time: 12 weeks

Cost Breakdown by Phase

Phase | Tokens | % of Total | Notes
Research (this session) | ~542K | 42% | 4 parallel research agents + codebase exploration
Implementation (Exps 1-5) | ~780K | 58% | Code writing, testing, validation
Nightly pipeline execution | 0 | 0% | All local Python (Optuna + pyoxigraph)
Total | ~1.32M | 100% | ·

Token costs assume Claude Opus for implementation. Using Sonnet for routine code changes would reduce implementation costs by ~60%. The nightly optimization loop is pure Python — no LLM API calls per TPE trial.

Key Cost Insight

The autoresearch pipeline is asymptotically free after implementation. Each nightly run executes 30 TPE trials purely in Python (Optuna sampler + pyoxigraph SPARQL queries + numpy computation). Zero API calls per optimization cycle. The only ongoing cost is the electricity to run the launchd agent at 6:13 AM daily.

This is the core value proposition of the Karpathy autoresearch framing: front-load the research and implementation cost, then let the optimization compound nightly at zero marginal cost.

Paper Bibliography

10 papers verified via web fetch. All arXiv IDs confirmed to resolve to real papers with matching titles and authors.

1. Overtuning in Hyperparameter Optimization
Lennart Schneider, Bernd Bischl, Matthias Feurer
arXiv:2506.19540 · AutoML-Conf 2025
Used in: Exp 1 (Goodhart Guard) — ~10% of HPO runs produce overtuned configs in small-data regimes
2. Goodhart's Law in Reinforcement Learning
Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse
arXiv:2310.09144 · ICLR 2024
Used in: Exp 1 (Goodhart Guard) — provides early stopping criterion with provable regret bounds
3. Faster, Cheaper, Better: Multi-Objective HPO for LLM and RAG Systems
Matthew Barker, Andrew Bell, Evan Thomas, James Carr, Thomas Andrews, Umang Bhatt
arXiv:2502.18635 · ICLR 2025 Workshop
Used in: Exp 2 (MOTPE) — noise-aware multi-objective BO outperforms single-objective in small samples
4. History Repeats Itself: A Baseline for Temporal KG Forecasting
Julia Gastinger, Christian Meilicke, Federico Errica, Timo Sztyler, Anett Schuelke, Heiner Stuckenschmidt
arXiv:2404.16726 · 2024
Used in: Exp 4 (Temporal Decay) — 2-param exponential decay beats 9 of 11 complex methods
5. BOLT: Large Scale Multi-Task Bayesian Optimization with LLMs
Yimeng Zeng, Natalie Maus, Haydn Thomas Jones, Jeffrey Tao, et al.
arXiv:2503.08131 · 2025
Used in: Exp 5 (Cross-Vertical Transfer) — 21% improvement from warm-start across tasks
6. GenTKG: Generative Forecasting on Temporal Knowledge Graphs
Ruotong Liao, Xu Jia, Yangzhe Li, Yunpu Ma, Volker Tresp
arXiv:2310.07793 · 2023 (revised April 2024)
Used in: Exp 4 (Temporal Decay) — few-shot TKG forecasting with 16 training samples
7. Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
Yiming Lu, Zhepeng Wang, Shenghao Li, Xiao Liu, Chenyu Yu, Qianyi Yin, Zhengwei Shi, Zhengyu Zhang, Mingrui Jiang
arXiv:2509.11452 · 2025
Used in: Exp 3 (Dimension Weights) — hypervolume-guided adaptive weights find Pareto-dominant solutions
8. Multi-layer Stack Ensembles for Time Series Forecasting
Niklas Bosch, Oleksandr Shchur, Nick Erickson, Michael Bohlke-Schneider, Caner Turkmen
arXiv:2511.15350 · 2025
Future Exp 6 candidate — learned stacking (Elo 1306) vs fixed averaging (Elo 1000)
9. Ensemble Prediction via Covariate-dependent Stacking
Tomoya Wakayama, Shonosuke Sugasawa
arXiv:2408.09755 · 2024
Used in: Exp 3 Phase 2 — per-company signal weights (11% improvement over equal weighting)
10. Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning
Markovick et al.
arXiv:2505.24478v1 · 2025
Foundation paper — Experiments 1-2 baseline methodology

Research Methodology

How This Research Was Conducted

4 specialized agents ran in parallel for approximately 8 minutes:

Agent | Focus | Tokens | Tool Calls | Duration
sbpi-explorer | Full codebase mapping | 160K | 49 | 7.4 min
paper-agent-1 | Dimension weighting + ensemble calibration | 81K | 49 | 7.6 min
paper-agent-2 | Cross-domain transfer + temporal KG | 64K | 32 | 7.3 min
paper-agent-3 | Multi-objective eval + adversarial stability | 57K | 24 | 4.7 min
Total | · | 362K | 154 | ~8 min (parallel)

Each paper agent searched arXiv, Semantic Scholar, and Google Scholar, then verified every paper by fetching the actual arXiv abstract page. Papers with unverifiable IDs were rejected.