Benchmarking personal intelligence through target-specific forecasting under partial observation.
TargetSpace-Bench evaluates whether AI systems can form, update, and prospectively test calibrated models of specific targets over time.
A person is not a profile, a preference list, a demographic vector, or a memory store. A person is a lived trajectory: an unfolding stream of observable attention, context, constraint, pressure, and emerging goals from which the next action follows.
TargetSpace-Bench asks whether a model can infer that trajectory from observable longitudinal traces well enough to forecast where it turns next — which commitment slips, which priority is displaced, which obligation is engaged or avoided.
What this is not. TargetSpace makes no claim about consciousness, qualia, private subjective states, or mind-reading, and a system accesses none of these. It evaluates only predictive alignment with future, externally observable target states — sealed forecasts scored against what actually happens.
Most evaluations reward predicting what usually happens, or replaying a target's routine. Both can look skilled while revealing nothing specific to this target.
Each layer removes one way a forecast can look target-specific without being so. The two anchors — R2 and the permutation gate — are the contribution; the rest are adopted from established practice.
Only the sealed forecast and the resolved outcome are scored — never the system's internal explanation or representation.
A benchmark instance is a tuple. The system outputs a calibrated distribution over a defined answer space; a deterministic rule resolves it later.
Target: a specific, partially observed adaptive system tracked over time (here, a consenting individual).
History: only information timestamped at or before t. Strict walk-forward; the future never informs the past.
Question: an organizer-issued question about a future target state, not chosen by the entrant.
Answer space: a discrete set of outcomes over which the system emits a probability distribution.
Resolution: a future time r > t with a deterministic, externally observable resolution rule.
Forecast: sealed before the outcome exists, then graded in bits against R1 and R2.
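The instance components above can be sketched in a few lines of Python. This is a minimal illustration, not the project's harness: the names `Instance`, `visible_history`, `log_score_bits`, and `lift_bits` are hypothetical, events are assumed to be `(timestamp, payload)` pairs, and "graded in bits" is read here as the base-2 log score of the realized outcome, with lift taken as the difference from a baseline forecast such as R1 or R2.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical container for one benchmark instance."""
    target_id: str
    t: int                  # forecast time
    r: int                  # resolution time
    question: str
    answer_space: tuple     # discrete outcomes

    def __post_init__(self):
        # Resolution must lie strictly in the future of the forecast time.
        if self.r <= self.t:
            raise ValueError("resolution time r must satisfy r > t")


def visible_history(events, t):
    """Walk-forward filter: keep only events timestamped at or before t."""
    return [e for e in events if e[0] <= t]


def log_score_bits(forecast, outcome):
    """Base-2 log of the probability the forecast gave the realized outcome."""
    return math.log2(forecast[outcome])


def lift_bits(forecast, baseline, outcome):
    """Bits gained by the forecast over a baseline on the same outcome."""
    return log_score_bits(forecast, outcome) - log_score_bits(baseline, outcome)


# Illustrative values only (assumed, not taken from the benchmark).
inst = Instance("target-001", t=10, r=12,
                question="Will the review be completed?",
                answer_space=("completes", "defers"))
events = [(9, "calendar"), (10, "message"), (11, "future event")]
model = {"completes": 0.2, "defers": 0.8}
routine = {"completes": 0.8, "defers": 0.2}   # stands in for a baseline

print(visible_history(events, inst.t))         # the t=11 event is excluded
print(lift_bits(model, routine, "defers"))     # approximately 2 bits
```

A lift of b bits corresponds to assigning 2^b times the baseline's probability to the realized outcome, so a forecast that puts 0.8 where the baseline put 0.2 gains about 2 bits.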
The flagship application is personal intelligence: forecasting a specific person's future target-state transitions. It runs under strict governance.
Consenting adults only. The harness runs where the data lives; only sealed forecasts and resolved outcomes leave the client.
Reporting is aggregate-only. There is no public raw personal dataset, and none is planned.
Commitment — active vs. abandoned.
Priority — maintained vs. displaced.
Task — continuation vs. switch.
Response behaviour — reply, defer, or drop.
Meeting / event — realization vs. no-show.
Obligation — engagement vs. avoidance.
Every tier is scored against the same sealed targets and the same R1/R2 controls. Lift over R2, as a function of tier, is the read-out.
Higher tiers are more sensitive and require stronger consent, local/on-device processing, federation, and governance. Richer evidence is not automatically better — the benchmark measures whether it adds target-specific predictive lift over R2.
The current implementation is a minimal, deterministic synthetic harness that exercises the scoring spine end to end. It demonstrates the mechanics; it is not evidence about real people.
It includes synthetic target histories, walk-forward forecast instances, R1/R2 baselines, log and Brier scoring, a calibration diagnostic, a permutation specificity check, and evidence-tier reporting.
It includes no human data and no real personal intelligence, and it offers no evidence that passive observation helps, no cross-domain validation, and no safety validation. The pre-pilot paper reports no empirical results.
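For readers unfamiliar with the scoring components named above, here is a minimal sketch of a Brier score and a binned calibration diagnostic. The source does not specify how the harness implements either, so the function names `brier_score` and `calibration_table`, the multi-class form of the Brier score, and the equal-width binning are all assumptions made for illustration.

```python
def brier_score(forecast, outcome):
    """Multi-class Brier score: squared error against the one-hot outcome.

    `forecast` maps each outcome label to a probability. Lower is better.
    """
    return sum((p - (1.0 if label == outcome else 0.0)) ** 2
               for label, p in forecast.items())


def calibration_table(probs, hits, n_bins=5):
    """Compare stated probability with observed frequency, per bin.

    `probs` are probabilities assigned to an event, `hits` are 1 if the
    event occurred and 0 otherwise. Returns one row per non-empty bin:
    (mean stated probability, observed frequency, count).
    """
    bins = [[] for _ in range(n_bins)]
    for p, h in zip(probs, hits):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, h))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(h for _, h in members) / len(members)
            rows.append((mean_p, freq, len(members)))
    return rows


# Synthetic, illustrative inputs only.
print(brier_score({"completes": 0.2, "defers": 0.8}, "defers"))
for row in calibration_table([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1]):
    print(row)
```

A well-calibrated forecaster produces rows whose first two entries are close to each other in every bin. Bins with few members are noisy, so the count is reported alongside.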
A recurring Wednesday review sits on the calendar, but passive signals show attention redirected to an urgent dependency. Will the review be completed? (all numbers illustrative)
Outcome: defers. The model gains ≈ +2.2 bits over R1 and ≈ +2.8 bits over R2 — skill the routine did not contain. Scored against a different person who kept their review, the same forecast earns ≈ −1.6 bits: the skill collapses under permutation, confirming it is specific to this target.
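The permutation logic in this example can be sketched as follows. This is one plausible reading, not the benchmark's defined procedure: forecasts are scored against their own targets' outcomes, then against outcomes reassigned at random across targets, and the two mean log scores are compared. The function names, the number of permutations, and the synthetic data are all hypothetical.

```python
import math
import random


def mean_log_score_bits(forecasts, outcomes):
    """Mean base-2 log score of forecasts paired position-by-position with outcomes."""
    total = sum(math.log2(f[o]) for f, o in zip(forecasts, outcomes))
    return total / len(outcomes)


def permutation_check(forecasts, outcomes, n_perm=200, seed=0):
    """Compare the matched score with scores under shuffled target assignment.

    Returns (matched score, mean permuted score, permutation p-value).
    Forecasts must give every outcome a non-zero probability.
    """
    rng = random.Random(seed)            # fixed seed keeps the check deterministic
    matched = mean_log_score_bits(forecasts, outcomes)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(outcomes)
        rng.shuffle(shuffled)            # reassign outcomes across targets
        permuted.append(mean_log_score_bits(forecasts, shuffled))
    mean_permuted = sum(permuted) / n_perm
    # Share of shuffles that score at least as well as the true pairing.
    at_least = sum(1 for s in permuted if s >= matched)
    p_value = (1 + at_least) / (1 + n_perm)
    return matched, mean_permuted, p_value


# Synthetic, illustrative data: one sealed forecast and one outcome per target.
forecasts = [
    {"completes": 0.2, "defers": 0.8},
    {"completes": 0.2, "defers": 0.8},
    {"completes": 0.8, "defers": 0.2},
    {"completes": 0.8, "defers": 0.2},
]
outcomes = ["defers", "defers", "completes", "completes"]

matched, mean_permuted, p_value = permutation_check(forecasts, outcomes)
print(matched, mean_permuted, p_value)
```

If the skill is specific to each target, the matched score should sit clearly above the permuted scores. If the forecasts only encode what usually happens, shuffling the targets changes little and the gap shrinks toward zero.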
Yuri Andrade Sylvester · Independent Researcher
AI systems increasingly claim to know, remember, personalize to, and act on behalf of particular users, yet current evaluations cannot establish whether a system has formed a durable, calibrated, prospectively useful model of an individual. We call this missing capability personal intelligence. The premise is that a person is not a profile to retrieve but a lived trajectory to forecast — inferred only from observable longitudinal traces. The benchmark scores sealed, proper-scored forecasts of a person's future target-state transitions under partial observation, certified by a stack of controls: a population-prior baseline (R1), a strong own-routine baseline (R2), a calibration gate, and a permutation specificity test, with an evidence-tier ablation and an architecture-neutral grid. The adopted ingredients are cited; the contribution is their conjunction, anchored on R2 and the permutation gate. This is a pre-pilot proposal: no empirical results are reported, only the personal track is implemented (synthetic), and raw data is never public.
@misc{targetspace2026,
  title  = {TargetSpace: Benchmarking Personal Intelligence by Target-Specific Forecasting Under Partial Observation},
  author = {Sylvester, Yuri Andrade},
  year   = {2026},
  note   = {Preprint, v8.1},
  url    = {https://targetspace.org}
}
Pure Python standard library, deterministic, runs in under a second:
python targetspace_synthetic_demo.py
Pre-pilot. The reference implementation and a participant-data harness remain in preparation. Issues and contributions welcome via GitHub.
TargetSpace does not claim access to consciousness, qualia, or private subjective states. It measures predictive alignment from observable longitudinal traces only.
Forecasting skill is not deployment legitimacy and confers no licence to act on a person. Benchmark validity is kept strictly separate from deployment legitimacy.
Raw personal data is not public and should not be centralized. Federation keeps raw data under the participant's control.
Any future human pilot requires informed consent, privacy controls, data minimization, local/federated processing, retention limits, bystander handling, and ethics review.