---
title: "Roadmap for Paper-Faithful Simulation Workflows"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Roadmap for Paper-Faithful Simulation Workflows}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Purpose

This vignette is a planning note for collaborators. It records where the
`nalanda` package currently stands, how the package could be extended toward the
simulation strategy described by Hewitt, Ashokkumar, Ghezae, and Willer
[@hewitt2024predicting; @hewitt2024supplement], and which implementation steps
seem most important for the next phase of work.

The immediate aim is not to claim that `nalanda` already reproduces the design
of these papers. Rather, the goal is to identify a realistic path for building
a user-facing workflow that supports:

1. paper-faithful simulation of survey experiments,
2. the existing pre/post chapter workflow already implemented in `nalanda`,
3. control-versus-treatment chapter comparisons when some books act as control
   conditions, and
4. future extensions for cumulative reading designs across multiple chapters.

# Current package status

At present, `nalanda` already supports several useful pieces of the broader
simulation agenda:

1. A two-turn pre/post workflow for chapter interventions via
   `run_ai_on_chapters()`, where an identity-conditioned baseline is collected
   before exposure to a chapter and a post-reading measure is collected after
   exposure.
2. A one-turn workflow via `run_ai_on_chapters_one_turn()`, where identity
   context, chapter text, and the outcome question are presented in a single
   prompt.
3. A prompt-first multi-turn interface via `simulate_treatment()`, which is
   flexible enough to support more customized simulation sequences.
4. Summary helpers that keep raw model output separate from derived metrics.

This means the package already contains the core execution machinery needed for
prompt construction, repeated simulation, structured extraction, and summary
pipelines. The main gap is not basic infrastructure but a paper-faithful
experimental abstraction layer.

It is also useful to separate simulation design from statistical analysis.
`nalanda` already supports workflows that can later be used for group
comparisons, including cases where some books act as control conditions. That
does not mean `nalanda` itself needs to become the main home for inferential
contrast estimation or hypothesis testing. Those tasks may still belong
downstream in other tools, including `rempsyc`.

# What the papers add

The Hewitt et al. workflow differs from the current chapter workflow in a few
important respects [@hewitt2024predicting; @hewitt2024supplement]:

1. The main design is condition-based rather than pre/post. The model simulates
   responses to each experimental condition, with group comparisons then
   performed downstream.
2. Prompts are built from a bank of introductory variants rather than a single
   fixed wording.
3. Simulations include demographically described participants rather than only
   broad identity labels.
4. Predictions are averaged over many prompts as an ensemble strategy.
5. Raw predicted treatment effects are useful for ranking conditions, but
   absolute effect magnitudes appear to benefit from linear calibration. In the
   primary survey archive, the paper estimates a shrinkage factor of
   approximately 0.56 [@hewitt2024supplement].

Taken together, these papers suggest that `nalanda` should support at least
three closely related simulation families:

1. **condition-labeled post-only simulations**, where outputs are ready for
   later group comparisons but inferential contrasts can remain downstream;
2. **baseline -> exposure -> post outcome simulations**, where within-unit
   change metrics are native to the package; and
3. **cumulative exposure simulations**, where multiple chapters or
   interventions are allowed to build on one another over time.

These are simulation families, not analysis families. The package does not need
to choose only one family, and it does not need to absorb every downstream
statistical task. The stronger design goal is to share infrastructure across
them.

# Recommended implementation steps

The table below reflects my current view of the most useful staged plan. Impact
scores are on a 1 to 10 scale, where 10 means the step is especially important
for scientific usefulness and for alignment with the published papers.

| Step | Description | Difficulty | Impact |
|:--|:--|:--|:--:|
| 1 | Build a paper-faithful prompt layer, including reusable prompt bank objects and prompt constructors for survey experiments and book-based designs. | Low to medium | 10 |
| 2 | Add a condition-based simulation wrapper that runs control and treatment conditions, stores condition labels and simulation metadata, and returns outputs ready for downstream comparison. | Medium to high | 10 |
| 3 | Add a descriptive summary and calibration layer for package-native metrics, while leaving formal inferential contrasts to downstream tools. | Medium | 8 |
| 4 | Add demographic profile infrastructure, including profile samplers and weighted profile sets, so users can simulate subgroup-specific or population-matched runs. | Medium | 7 |
| 5 | Add ensemble controls that formalize how many prompt variants are used, how they are sampled, and how outputs are pooled. | Low to medium | 8 |
| 6 | Extend the framework to cumulative chapter designs, where earlier chapters can remain in memory or be summarized forward into later prompts. | High | 8 |

In practice, I would group these steps into three phases:

1. Phase 1: prompt layer plus ensemble controls.
2. Phase 2: condition-based experimental wrapper plus descriptive summaries and
   calibration helpers.
3. Phase 3: richer demographic sampling and cumulative exposure designs.

# Why these steps matter

## Step 1: Prompt layer

This is the highest-leverage near-term step because it creates a common
language for all downstream workflows. The supplement describes a structured
prompting strategy with an introductory sentence, a study-setting description,
participant information, treatment content, and the outcome question
[@hewitt2024supplement]. That same structure can be reused for chapter
simulations even when the design is not identical.

For the current package, this would let us move away from hard-coding prompts in
user scripts and toward explicit prompt templates that are inspectable,
versionable, and easier to document.

## Step 2: Condition-based simulation wrapper

This is the step that would bring `nalanda` closest to the paper's core design.
The current one-turn workflow already has much of the required mechanics, but it
is organized around books and chapters rather than experimental conditions. A
dedicated wrapper should make conditions, control groups, and outcomes
first-class objects. The resulting outputs can then be handed off to downstream
tools for mean comparisons, contrasts, or other inferential analyses when
needed.
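
As a rough sketch of what "first-class" conditions could look like, the block
below builds an explicit design grid. The `conditions` table, its column names,
and `simulate_one()` are hypothetical placeholders and do not exist in
`nalanda`. The placeholder returns a constant instead of calling a model.

```r
# Hypothetical condition table: one control and two treatment chapters.
conditions <- data.frame(
  condition_id = c("control", "chapter_a", "chapter_b"),
  is_control = c(TRUE, FALSE, FALSE),
  stimulus_text = c("", "Text of chapter A.", "Text of chapter B."),
  stringsAsFactors = FALSE
)

# Cross conditions with prompt variants so every design cell is explicit.
design <- expand.grid(
  condition_id = conditions$condition_id,
  prompt_variant = 1:2,
  stringsAsFactors = FALSE
)
design <- merge(design, conditions, by = "condition_id")

# Placeholder for a model call: always returns 4 (illustration only).
simulate_one <- function(stimulus_text, prompt_variant) {
  4L
}

# One simulated response per design row, stored next to its condition label.
design$response <- mapply(
  simulate_one, design$stimulus_text, design$prompt_variant,
  USE.NAMES = FALSE
)
```

Because condition labels and the control flag travel with every row, the
resulting table can be handed to a downstream tool without further reshaping.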

## Step 3: Descriptive summaries and calibration

This step should be narrower than a full contrast-analysis framework. The role
of `nalanda` here is to produce package-native summaries that are useful for
inspection, plotting, and workflow handoff. For pre/post designs, that includes
metrics such as within-unit deltas. For post-only condition-labeled designs,
that includes condition-level summaries and calibration helpers. The papers show
strong rank-order prediction, but they also argue that absolute effect
magnitudes are systematically overstated without calibration
[@hewitt2024predicting; @hewitt2024supplement]. That makes calibration worth
supporting, even if inferential testing remains outside the package.
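
For post-only designs, a condition-level summary might be as small as the
sketch below. The function name `summarise_conditions()` and the column names
are assumptions, and the toy responses are made up for illustration.

```r
# Toy simulated responses (fabricated for illustration, not model output).
sims <- data.frame(
  condition = c("control", "control", "chapter_a", "chapter_a"),
  response = c(3, 5, 5, 7)
)

# Hypothetical helper: mean, SD, and n per condition, with no inference.
summarise_conditions <- function(data) {
  out <- aggregate(response ~ condition, data = data, FUN = mean)
  names(out)[2] <- "mean_response"
  out$sd_response <- aggregate(response ~ condition, data = data, FUN = sd)$response
  out$n <- aggregate(response ~ condition, data = data, FUN = length)$response
  out
}

condition_summary <- summarise_conditions(sims)
condition_summary
```

The output stays descriptive on purpose. Contrasts between conditions would
still be estimated downstream.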

## Step 4: Demographic profile infrastructure

This matters, but I do not think it should block the earlier steps. The
subgroup analysis in the supplement suggests that matched demographic prompts
gave only small or no predictive advantages for gender and ethnicity, with
somewhat more benefit for party [@hewitt2024supplement]. Demographic
conditioning is therefore worth supporting, but it is not the first dependency.

## Step 5: Ensemble controls

The supplement explicitly reports that predictive accuracy improved as the
number of prompts in the ensemble increased [@hewitt2024supplement]. For that
reason, ensemble prompting should not remain an implicit user choice. It should
be represented as a documented object or argument in the package API.

## Step 6: Cumulative chapter designs

This step is especially relevant for the book project, even though it goes
beyond the paper's main experimental setup. The existing pre/post framework is a
natural base for cumulative designs because it already keeps the logic of
before/after change separate from prompt execution. The hard part is deciding
how accumulated reading history should enter later prompts.
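
To make the two options concrete, here is a minimal sketch that builds the
reading history preceding each chapter, either in full or in compressed form.
`build_history()` and `summarise_text()` are hypothetical names, and the
summarizer simply truncates text as a stand-in for whatever compression is
eventually chosen.

```r
# Hypothetical chapter texts for illustration.
chapters <- c(
  ch1 = "Chapter one text.",
  ch2 = "Chapter two text.",
  ch3 = "Chapter three text."
)

# Placeholder summarizer: truncates text. A real version might call a model.
summarise_text <- function(text, max_chars = 12L) {
  substr(text, 1L, max_chars)
}

# Build the history that would precede each chapter's prompt.
build_history <- function(chapters, strategy = c("full", "summary")) {
  strategy <- match.arg(strategy)
  prior <- if (strategy == "full") {
    chapters
  } else {
    vapply(chapters, summarise_text, character(1))
  }
  vapply(seq_along(chapters), function(i) {
    # Chapter i sees everything read before it; chapter 1 sees nothing.
    paste(prior[seq_len(i - 1L)], collapse = " ")
  }, character(1))
}

history_full <- build_history(chapters, "full")
history_summary <- build_history(chapters, "summary")
```

Keeping the strategy as an explicit argument would let both variants be
compared on the same chapters before committing to one.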

# What can be applied directly to the existing chapter workflow?

Even though the paper's design is not identical to the chapter workflow,
several ideas transfer well.

## Transferable immediately

1. **Prompt standardization.** The package would benefit from prompt templates
   that separate intro text, study framing, identity or profile information,
   chapter text, and outcome questions.
2. **Prompt ensembles.** Rather than relying on one canonical chapter prompt, we
   could average over several introductory phrasings or framing variants.
3. **Optional richer participant context.** The current identity-based context
   could be extended to richer demographic profiles, especially in cases where
   subgroup interpretation matters.
4. **Separation of raw and calibrated results.** The package already tends to
   preserve raw outputs and compute metrics downstream. That is a good fit for
   calibration as well.

## Transferable with design adaptation

1. **Condition-ready chapter outputs.** A chapter can be treated as a treatment
   condition and stored alongside a no-reading control, placebo chapter, or
   alternative chapter, with the resulting outputs passed downstream for
   comparison.
2. **Megastudy-style ranking.** Sets of chapters or chapter framings could be
   compared as candidate interventions, much as the papers compare many
   treatments within one study.
3. **Cumulative exposure.** Later chapters could be modeled as interventions
   delivered after earlier ones, with prior material either preserved in memory
   or compressed into an accumulated summary state.

## Less transferable without stronger validation

1. **The exact 0.56 calibration factor.** This number was estimated for the
   paper's primary archive of U.S. survey experiments and should not be assumed
   to transfer automatically to chapter-level reading interventions.
2. **Claims about subgroup benefits from demographic matching.** The paper's
   subgroup findings are informative, but chapter interventions may produce
   different patterns of heterogeneity.

# Where should calibration happen?

My current recommendation is:

1. keep raw simulated responses unchanged,
2. compute only package-native descriptive summaries in `nalanda`,
3. leave formal group comparisons and inferential contrasts to downstream
   tools, and
4. optionally add calibration helpers that work on summary outputs.

In practice, that means calibration should not be applied inside the low-level
simulation functions themselves.

## Why not pre-adjust inside simulation functions?

This would make the raw model output harder to inspect, harder to compare across
calibration schemes, and harder to validate later. It would also blur the line
between model execution and statistical post-processing.

## Why not leave calibration entirely to user scripts?

That is flexible, but it is easy to do inconsistently. If calibration is part
of the recommended workflow, the package should provide a standard path for it.

## Recommended compromise

For metrics that `nalanda` already owns conceptually, summary functions should
be able to compute both raw and optional adjusted outputs. For example, if a
summary function computes `delta_outgroup`, a calibrated variant could appear as
`adjusted_delta_outgroup` when a calibration factor is supplied.

This has several advantages:

1. raw outputs remain accessible,
2. calibration remains explicit,
3. multiple calibration schemes can coexist,
4. user scripts remain simpler and less error-prone.
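
The raw-plus-adjusted pattern could be sketched as follows. The function
`summarise_delta()` and the input columns `pre_outgroup` and `post_outgroup`
are assumptions made for illustration, and the adjustment is assumed to be a
simple multiplicative scaling.

```r
# Hypothetical summary helper: computes a raw pre/post delta and, only when a
# calibration factor is supplied, an adjusted column next to it.
summarise_delta <- function(data, calibration = NULL) {
  out <- data
  out$delta_outgroup <- out$post_outgroup - out$pre_outgroup
  if (!is.null(calibration)) {
    # Assumption: calibration is a single multiplicative factor.
    out$adjusted_delta_outgroup <- calibration * out$delta_outgroup
  }
  out
}

# Toy values for illustration only.
toy <- data.frame(pre_outgroup = c(40, 55), post_outgroup = c(50, 50))

raw_only <- summarise_delta(toy)
with_adjustment <- summarise_delta(toy, calibration = 0.5)
```

With `calibration = NULL`, the output contains only the raw delta, so the
adjusted column never appears unless the user asks for it.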

For future condition-based workflows, a better boundary may be to provide a
small helper that adjusts already-computed effect columns, regardless of where
those effects were estimated. In other words, `nalanda` does not need to own
contrast estimation in order to support calibration.

Such a helper could work on a user-supplied column and append:

1. `adjusted_effect`
2. `calibration_factor`
3. `calibration_source`

The package default should probably be `calibration = NULL`, with named presets
available for known settings. A preset corresponding to the Hewitt et al.
primary archive could reasonably use `0.56`, but that should be framed as a
setting-specific option rather than a universal package default.
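
A minimal sketch of such a helper is shown below. The names
`calibrate_effects()` and `calibration_presets`, and the preset label
`hewitt_primary_archive`, are hypothetical. The adjustment is again assumed to
be a multiplicative scaling of an effect column that was estimated elsewhere.

```r
# Named presets; the single entry mirrors the factor discussed in the text.
calibration_presets <- c(hewitt_primary_archive = 0.56)

# Hypothetical helper: scales an existing effect column and records provenance.
calibrate_effects <- function(data, effect_col, calibration = NULL) {
  if (is.null(calibration)) {
    return(data)  # default: leave the data untouched
  }
  if (is.character(calibration)) {
    stopifnot(calibration %in% names(calibration_presets))
    source_label <- calibration
    factor_value <- unname(calibration_presets[calibration])
  } else {
    source_label <- "user_supplied"
    factor_value <- calibration
  }
  data$adjusted_effect <- data[[effect_col]] * factor_value
  data$calibration_factor <- factor_value
  data$calibration_source <- source_label
  data
}

# Toy effects, as if estimated by a downstream tool.
effects <- data.frame(condition = c("a", "b"), effect = c(0.50, -0.25))
calibrated <- calibrate_effects(effects, "effect",
                                calibration = "hewitt_primary_archive")
```

Recording the factor and its source in the output keeps each adjusted value
traceable to the calibration scheme that produced it.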

# Proposed object designs

## `prompt_bank`

The `prompt_bank` object would formalize the reusable prompt pieces that are
currently spread across ad hoc strings in scripts. Conceptually, it should be a
named list or tibble-backed object with a small, inspectable schema.

At minimum, a `prompt_bank` should contain:

1. `intro_variants`: short opening instructions or framing sentences.
2. `setting_template`: the general study description, such as survey context or
   reading-task context.
3. `profile_template`: a template for participant description, including
   placeholders for identity or demographic fields.
4. `stimulus_template`: a wrapper describing how the treatment text or chapter
   text is introduced.
5. `outcome_template`: the question and response-scale wording.
6. `scenario`: a label such as `"survey_experiment"`, `"book_prepost"`, or
   `"book_cumulative"`.
7. `metadata`: version, source paper, and notes.

In practice, one useful design would be:

```r
prompt_bank <- list(
  intro_variants = c(
    "You will be asked to predict how people respond to various messages.",
    "Can reading a message affect people's attitudes and actions?"
  ),
  setting_template =
    "Social scientists often conduct research studies using online surveys. The text below is from one such survey conducted on a large, diverse population of research participants.",
  profile_template =
    "Participant X is a {ideology}, {age}, {ethnicity}, {gender} participant with {education}. Politically, Participant X identifies as '{party}'.",
  stimulus_template =
    "Please read the material below. {stimulus_text}",
  outcome_template =
    "{outcome_text} Please choose a number from {scale_low} to {scale_high}.",
  scenario = "survey_experiment",
  metadata = list(source = "hewitt_ashokkumar_2024")
)
```

For book workflows, a related prompt bank might swap in a chapter-specific
`setting_template` while keeping the same overall structure.
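
To show how the pieces might be assembled, the sketch below fills the
`{placeholder}` fields and joins the sections into one prompt. The helpers
`fill_template()` and `build_prompt()` are hypothetical, and the reduced bank
uses shortened wording so that the example stays small.

```r
# Replace each {name} placeholder with the matching entry of `values`.
fill_template <- function(template, values) {
  for (name in names(values)) {
    template <- gsub(
      paste0("{", name, "}"), as.character(values[[name]]),
      template, fixed = TRUE
    )
  }
  template
}

# A reduced bank for illustration; field names follow the schema listed above.
bank <- list(
  intro_variants = c("Intro A.", "Intro B."),
  setting_template = "This text is from an online survey.",
  profile_template = "Participant X is a {age}, {gender} participant.",
  stimulus_template = "Please read the material below. {stimulus_text}",
  outcome_template =
    "{outcome_text} Please choose a number from {scale_low} to {scale_high}."
)

# Hypothetical constructor: assembles one prompt for one intro variant.
build_prompt <- function(bank, values, variant = 1L) {
  parts <- c(
    bank$intro_variants[[variant]],
    bank$setting_template,
    fill_template(bank$profile_template, values),
    fill_template(bank$stimulus_template, values),
    fill_template(bank$outcome_template, values)
  )
  paste(parts, collapse = "\n\n")
}

prompt <- build_prompt(
  bank,
  values = list(
    age = "30-39", gender = "female",
    stimulus_text = "Chapter text goes here.",
    outcome_text = "How warm do you feel toward the group?",
    scale_low = 1, scale_high = 7
  ),
  variant = 2L
)
cat(prompt)
```

A production version would probably also check for placeholders that remain
unfilled, so that a missing field fails loudly instead of reaching the model.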

## `ensemble_size`

I am imagining `ensemble_size` as more than a bare integer, even if the user
API initially accepts an integer. Internally, it would be useful to represent
the ensemble settings as a small control object.

At minimum, this object should capture:

1. `n`: number of prompt variants to use per condition or chapter.
2. `method`: whether prompts are sampled randomly, cycled deterministically, or
   exhaustively enumerated.
3. `replace`: whether prompt variants may repeat.
4. `weights`: optional prompt weights if some variants are meant to count more.
5. `pooling`: whether outputs are averaged at the response level, condition
   mean level, or effect level.
6. `seed`: a seed strategy for reproducibility.

Conceptually:

```r
ensemble_size <- list(
  n = 8L,
  method = "sample",
  replace = TRUE,
  weights = NULL,
  pooling = "effect",
  seed = 42L
)
```

For an early implementation, it would be enough to let users pass
`ensemble_size = 1`, `4`, or `8`, while storing the richer object internally.
That would keep the public API simple but leave room to expand.
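
One way to reconcile the simple public argument with the richer internal
object is sketched below. Both `as_ensemble_control()` and `draw_variants()`
are hypothetical names. Only the random and cycling methods are shown, and
exhaustive enumeration is left out.

```r
# Hypothetical normalizer: accepts a bare integer or a list and returns the
# full control object with defaults filled in.
as_ensemble_control <- function(x) {
  defaults <- list(
    n = 1L, method = "sample", replace = TRUE,
    weights = NULL, pooling = "effect", seed = NULL
  )
  if (is.numeric(x)) x <- list(n = as.integer(x))
  utils::modifyList(defaults, x)
}

# Draw prompt-variant indices according to the control object.
draw_variants <- function(n_variants, control) {
  if (control$method == "cycle") {
    # Deterministic cycling through the available variants.
    return(rep_len(seq_len(n_variants), control$n))
  }
  # Note: setting the seed here changes the global RNG state.
  if (!is.null(control$seed)) set.seed(control$seed)
  sample.int(n_variants, size = control$n, replace = control$replace,
             prob = control$weights)
}

# A bare integer is promoted to the full control object.
control <- as_ensemble_control(8)
sampled <- draw_variants(n_variants = 3L, control)

# An explicit list overrides only the fields it names.
cycled <- draw_variants(3L, as_ensemble_control(list(n = 5L, method = "cycle")))
```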

## `demographic_profiles`

The `demographic_profiles` object should represent either a fixed set of
profiles or a sampling frame from which profiles are drawn. This is important
because the papers do not only vary wording; they also vary the participant
being simulated [@hewitt2024predicting].

At minimum, each profile should be able to store:

1. `profile_id`
2. `gender`
3. `age`
4. `ethnicity`
5. `education`
6. `ideology`
7. `party`
8. `weight`
9. `label`

Conceptually:

```r
demographic_profiles <- tibble::tibble(
  profile_id = c("p1", "p2"),
  gender = c("Female", "Male"),
  age = c("30-39", "Over 60"),
  ethnicity = c("White", "Black"),
  education = c("College", "Some college"),
  ideology = c("Conservative", "Moderate"),
  party = c("Strong Republican", "Lean Democrat"),
  weight = c(0.5, 0.5),
  label = c("profile_1", "profile_2")
)
```

For `nalanda`, this object could support at least three modes:

1. **Identity-only mode**, close to the current package design.
2. **Fixed-profile mode**, for exact reproducibility with a specified profile
   set.
3. **Weighted-sampling mode**, for approximating a target population.

For chapter work, the immediate value may be greatest for party or ideology,
with richer demographic fields becoming more important when studying subgroup
heterogeneity or matching a target population.
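
The fixed-profile and weighted-sampling modes could share one small helper, as
in the sketch below. `draw_profiles()` is a hypothetical name, and the profile
table is reduced to three columns with made-up weights.

```r
# Two illustrative profiles with unequal weights (values are made up).
profiles <- data.frame(
  profile_id = c("p1", "p2"),
  party = c("Strong Republican", "Lean Democrat"),
  weight = c(0.8, 0.2),
  stringsAsFactors = FALSE
)

# Hypothetical sampler covering the fixed-profile and weighted-sampling modes.
draw_profiles <- function(profiles, n = nrow(profiles),
                          mode = c("fixed", "weighted"), seed = NULL) {
  mode <- match.arg(mode)
  if (mode == "fixed") {
    return(profiles)  # exact reproducibility: use the set as given
  }
  if (!is.null(seed)) set.seed(seed)
  # Sample rows with replacement, proportional to the weight column.
  rows <- sample.int(nrow(profiles), size = n, replace = TRUE,
                     prob = profiles$weight)
  profiles[rows, , drop = FALSE]
}

drawn <- draw_profiles(profiles, n = 10, mode = "weighted", seed = 1)
table(drawn$profile_id)
```

Identity-only mode would not need this helper at all, because it bypasses the
profile table and keeps the current identity labels.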

# Proposed next steps

If the goal is to move one small piece at a time, my current order of work
would be:

1. implement a `prompt_bank` constructor and prompt-building helpers,
2. expose ensemble controls in a minimal but explicit form,
3. design a condition-based simulation wrapper around the existing one-turn
   execution logic,
4. add summary functions that return raw and optional adjusted package-native
   metrics,
5. add a small calibration helper for user-supplied effect columns,
6. add demographic profile objects and sampling helpers,
7. revisit cumulative chapter designs after the first three pieces are stable.

The main reason for this order is that prompt standardization and package-native
summaries are useful immediately for the existing chapter workflow, whereas
population-matched demographic simulation and cumulative exposure are likely to
require more validation work.

# References