This vignette is a planning note for collaborators. It records where
the nalanda package currently stands, how the package could
be extended toward the simulation strategy described by Hewitt,
Ashokkumar, Ghezae, and Willer (Hewitt et al.
2024a, 2024b), and which implementation steps seem most important
for the next phase of work.
The immediate aim is not to claim that nalanda already
reproduces the design of these papers. Rather, the goal is to identify a
realistic path for building a user-facing workflow that supports:
nalanda,

At present, nalanda already supports several useful pieces of the broader simulation agenda:

- run_ai_on_chapters(), where an identity-conditioned baseline is collected before exposure to a chapter and a post-reading measure is collected after exposure.
- run_ai_on_chapters_one_turn(), where identity context, chapter text, and the outcome question are presented in a single prompt.
- simulate_treatment(), which is flexible enough to support more customized simulation sequences.

This means the package already contains the core execution machinery needed for prompt construction, repeated simulation, structured extraction, and summary pipelines. The main gap is not basic infrastructure. The gap is a paper-faithful experimental abstraction layer.
It is also useful to separate simulation design from statistical
analysis. nalanda already supports workflows that can later
be used for group comparisons, including cases where some books act as
control conditions. That does not mean nalanda itself needs
to become the main home for inferential contrast estimation or
hypothesis testing. Those tasks may still belong downstream in other
tools, including rempsyc.
The Hewitt et al. workflow differs from the current chapter workflow in a few important respects (Hewitt et al. 2024a, 2024b):
Taken together, these papers suggest that nalanda should
support at least three closely related simulation families:
- condition-labeled post-only simulations, where outputs are ready for later group comparisons but inferential contrasts can remain downstream;
- baseline -> exposure -> post outcome simulations, where within-unit change metrics are native to the package; and
- cumulative exposure simulations, where multiple chapters or interventions are allowed to build on one another over time.

These are simulation families, not analysis families. The package does not need to choose only one family, and it does not need to absorb every downstream statistical task. The stronger design goal is to share infrastructure across them.
The table below reflects my current view of the most useful staged plan. Impact scores are on a 1 to 10 scale, where 10 means the step is especially important for scientific usefulness and for alignment with the published papers.
| Step | Description | Difficulty | Impact |
|---|---|---|---|
| 1 | Build a paper-faithful prompt layer, including reusable prompt bank objects and prompt constructors for survey experiments and book-based designs. | Low to medium | 10 |
| 2 | Add a condition-based simulation wrapper that runs control and treatment conditions, stores condition labels and simulation metadata, and returns outputs ready for downstream comparison. | Medium to high | 10 |
| 3 | Add a descriptive summary and calibration layer for package-native metrics, while leaving formal inferential contrasts to downstream tools. | Medium | 8 |
| 4 | Add demographic profile infrastructure, including profile samplers and weighted profile sets, so users can simulate subgroup-specific or population-matched runs. | Medium | 7 |
| 5 | Add ensemble controls that formalize how many prompt variants are used, how they are sampled, and how outputs are pooled. | Low to medium | 8 |
| 6 | Extend the framework to cumulative chapter designs, where earlier chapters can remain in memory or be summarized forward into later prompts. | High | 8 |
I would still group these into three practical phases:
This is the highest-leverage near-term step because it creates a common language for all downstream workflows. The supplement describes a structured prompting strategy with an introductory sentence, a study-setting description, participant information, treatment content, and the outcome question (Hewitt et al. 2024b). That same structure can be reused for chapter simulations even when the design is not identical.
For the current package, this would let us move away from hard-coding prompts in user scripts and toward explicit prompt templates that are inspectable, versionable, and easier to document.
This is the step that would bring nalanda closest to the
paper’s core design. The current one-turn workflow already has much of
the required mechanics, but it is organized around books and chapters
rather than experimental conditions. A dedicated wrapper should make
conditions, control groups, and outcomes first-class objects. The
resulting outputs can then be handed off to downstream tools for mean
comparisons, contrasts, or other inferential analyses when needed.
This step should be narrower than a full contrast-analysis framework.
The role of nalanda here is to produce package-native
summaries that are useful for inspection, plotting, and workflow
handoff. For pre/post designs, that includes metrics such as within-unit
deltas. For post-only condition-labeled designs, that includes
condition-level summaries and calibration helpers. The papers show
strong rank-order prediction, but they also argue that absolute effect
magnitudes are systematically overstated without calibration (Hewitt et al. 2024a, 2024b). That makes
calibration worth supporting, even if inferential testing remains
outside the package.
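
As a rough illustration of what package-native summaries could mean here, the sketch below computes a within-unit delta and a condition-level mean using base R only. The column names (unit_id, condition, pre, post) and all numbers are made up for illustration; they are assumptions, not the output format of any existing nalanda function.

```r
# Illustrative data only: column names and values are hypothetical.
runs <- data.frame(
  unit_id   = 1:4,
  condition = c("control", "control", "treatment", "treatment"),
  pre       = c(50, 60, 55, 45),
  post      = c(52, 60, 65, 57)
)

# Within-unit change for pre/post designs.
runs$delta <- runs$post - runs$pre

# Descriptive condition-level summary; no inferential contrast is computed.
condition_summary <- aggregate(delta ~ condition, data = runs, FUN = mean)
print(condition_summary)
```

The summary stops at descriptive means on purpose, so that contrasts and tests can still be run downstream on the same table.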
This matters, but I do not think it should block the earlier steps. The subgroup analysis in the supplement suggests that matched demographic prompts gave only small or no predictive advantages for gender and ethnicity, with somewhat more benefit for party (Hewitt et al. 2024b). That makes demographic conditioning important, but not the first dependency.
The supplement explicitly reports that predictive accuracy improved as the number of prompts in the ensemble increased (Hewitt et al. 2024b). For that reason, ensemble prompting should not remain an implicit user choice. It should be represented as a documented object or argument in the package API.
This step is especially relevant for the book project, even though it goes beyond the paper’s main experimental setup. The existing pre/post framework is a natural base for cumulative designs because it already keeps the logic of before/after change separate from prompt execution. The hard part is deciding how accumulated reading history should enter later prompts.
Even though the paper’s design is not identical to the chapter workflow, several ideas transfer well.
My current recommendation is:
nalanda,

In practice, that means calibration should not be applied inside the low-level simulation functions themselves.
Calibrating there would make the raw model output harder to inspect, harder to compare across calibration schemes, and harder to validate later. It would also blur the line between model execution and statistical post-processing.
Leaving calibration entirely to users' own downstream scripts is flexible, but it is easy to do inconsistently. If calibration is part of the recommended workflow, the package should provide a standard path for it.
For metrics that nalanda already owns conceptually,
summary functions should be able to compute both raw and optional
adjusted outputs. For example, if a summary function computes
delta_outgroup, a calibrated variant could appear as
adjusted_delta_outgroup when a calibration factor is
supplied.
This has several advantages:
For future condition-based workflows, a better boundary may be to
provide a small helper that adjusts already-computed effect columns,
regardless of where those effects were estimated. In other words,
nalanda does not need to own contrast estimation in order
to support calibration.
Such a helper could work on a user-supplied column and append:

- adjusted_effect
- calibration_factor
- calibration_source

The package default should probably be calibration = NULL, with named presets available for known settings. A preset corresponding to the Hewitt et al. primary archive could reasonably use 0.56, but that should be framed as a setting-specific option rather than a universal package default.
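
A minimal sketch of such a helper follows. The function name adjust_effects(), the preset name hewitt_2024_primary, and the "user_supplied" label are hypothetical. The sketch also assumes that calibration is a simple multiplicative factor applied to the effect column, which the text above implies but does not spell out.

```r
# Hypothetical helper, not part of nalanda.
# Assumption: calibration is a single multiplicative shrinkage factor.
calibration_presets <- c(hewitt_2024_primary = 0.56)  # value mentioned in the text

adjust_effects <- function(data, effect_col, calibration = NULL) {
  stopifnot(is.data.frame(data), effect_col %in% names(data))

  # Default: no calibration, raw effects are returned untouched.
  if (is.null(calibration)) {
    return(data)
  }

  if (is.character(calibration)) {
    # Named preset.
    stopifnot(length(calibration) == 1L,
              calibration %in% names(calibration_presets))
    factor_value <- unname(calibration_presets[calibration])
    source_label <- calibration
  } else {
    # User-supplied numeric factor.
    stopifnot(is.numeric(calibration), length(calibration) == 1L)
    factor_value <- calibration
    source_label <- "user_supplied"
  }

  # Append the three columns; the raw effect column is never overwritten.
  data$adjusted_effect    <- data[[effect_col]] * factor_value
  data$calibration_factor <- factor_value
  data$calibration_source <- source_label
  data
}

effects  <- data.frame(condition = c("book_a", "book_b"), effect = c(0.50, -0.20))
adjusted <- adjust_effects(effects, "effect", calibration = "hewitt_2024_primary")
print(adjusted)
```

Because the helper only reads an existing column, it works the same way whether the effects were estimated inside nalanda or in a downstream tool.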
prompt_bank

The prompt_bank object would formalize the reusable prompt pieces that are currently spread across ad hoc strings in scripts. Conceptually, it should be a named list or tibble-backed object with a small, inspectable schema.

At minimum, a prompt_bank should contain:

- intro_variants: short opening instructions or framing sentences.
- setting_template: the general study description, such as survey context or reading-task context.
- profile_template: a template for participant description, including placeholders for identity or demographic fields.
- stimulus_template: a wrapper describing how the treatment text or chapter text is introduced.
- outcome_template: the question and response-scale wording.
- scenario: a label such as "survey_experiment", "book_prepost", or "book_cumulative".
- metadata: version, source paper, and notes.

In practice, one useful design would be:

    prompt_bank <- list(
      intro_variants = c(
        "You will be asked to predict how people respond to various messages.",
        "Can reading a message affect people's attitudes and actions?"
      ),
      setting_template =
        "Social scientists often conduct research studies using online surveys. The text below is from one such survey conducted on a large, diverse population of research participants.",
      profile_template =
        "Participant X is a {ideology}, {age}, {ethnicity}, {gender} participant with {education}. Politically, Participant X identifies as '{party}'.",
      stimulus_template =
        "Please read the material below. {stimulus_text}",
      outcome_template =
        "{outcome_text} Please choose a number from {scale_low} to {scale_high}.",
      scenario = "survey_experiment",
      metadata = list(source = "hewitt_ashokkumar_2024")
    )

For book workflows, a related prompt bank might swap in a chapter-specific setting_template while keeping the same overall structure.
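
One way a prompt bank of this shape could be turned into a full prompt is sketched here. The helpers fill_template() and build_prompt() are hypothetical and not exported by nalanda, the shortened bank and the outcome question are placeholders, and the placeholder filling uses base R gsub() rather than any particular templating package.

```r
# Hypothetical helpers; nalanda does not currently provide these.

# Replace each {name} placeholder with the matching value.
fill_template <- function(template, values) {
  for (name in names(values)) {
    template <- gsub(paste0("{", name, "}"), as.character(values[[name]]),
                     template, fixed = TRUE)
  }
  template
}

# Assemble the prompt pieces in a fixed order:
# intro, setting, profile, stimulus, outcome.
build_prompt <- function(bank, profile, stimulus_text, outcome, intro_index = 1L) {
  parts <- c(
    bank$intro_variants[[intro_index]],
    bank$setting_template,
    fill_template(bank$profile_template, profile),
    fill_template(bank$stimulus_template, list(stimulus_text = stimulus_text)),
    fill_template(bank$outcome_template, outcome)
  )
  paste(parts, collapse = "\n\n")
}

# Shortened, illustrative bank with the same field names as above.
bank <- list(
  intro_variants    = "You will be asked to predict how people respond to various messages.",
  setting_template  = "The text below is from an online survey.",
  profile_template  = "Participant X is a {gender} participant who identifies as '{party}'.",
  stimulus_template = "Please read the material below. {stimulus_text}",
  outcome_template  = "{outcome_text} Please choose a number from {scale_low} to {scale_high}."
)

prompt <- build_prompt(
  bank,
  profile       = list(gender = "female", party = "Lean Democrat"),
  stimulus_text = "Chapter text goes here.",
  outcome       = list(outcome_text = "Placeholder outcome question?",
                       scale_low = 0, scale_high = 100)
)
cat(prompt)
```

Keeping the assembly order in one function is what would make prompts inspectable and versionable: a change to the bank changes every prompt built from it in the same way.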
ensemble_size

I am imagining ensemble_size as more than a bare integer, even if the user API initially accepts an integer. Internally, it would be useful to represent the ensemble settings as a small control object.

At minimum, this object should capture:

- n: number of prompt variants to use per condition or chapter.
- method: whether prompts are sampled randomly, cycled deterministically, or exhaustively enumerated.
- replace: whether prompt variants may repeat.
- weights: optional prompt weights if some variants are meant to count more.
- pooling: whether outputs are averaged at the response level, condition mean level, or effect level.
- seed: a seed strategy for reproducibility.

Conceptually:

    ensemble_size <- list(
      n = 8L,
      method = "sample",
      replace = TRUE,
      weights = NULL,
      pooling = "effect",
      seed = 42L
    )

For an early implementation, it would be enough to let users pass ensemble_size = 1, 4, or 8, while storing the richer object internally. That would keep the public API simple but leave room to expand.
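
The idea of accepting a bare integer while storing a richer object could look roughly like the sketch below. Both as_ensemble_control() and pick_variants() are hypothetical names, the defaults are assumptions, and only the "sample" and "cycle" methods are handled.

```r
# Hypothetical constructor: accept a bare integer or a partial settings list
# and always return the full control object.
as_ensemble_control <- function(ensemble_size = 1L) {
  defaults <- list(n = 1L, method = "sample", replace = TRUE,
                   weights = NULL, pooling = "effect", seed = NULL)
  if (is.numeric(ensemble_size) && length(ensemble_size) == 1L) {
    ensemble_size <- list(n = as.integer(ensemble_size))
  }
  stopifnot(is.list(ensemble_size))
  control <- utils::modifyList(defaults, ensemble_size)
  stopifnot(control$n >= 1L)
  control
}

# Hypothetical selector: choose which prompt variants to run.
pick_variants <- function(variants, control) {
  # Note: set.seed() changes the global RNG state.
  if (!is.null(control$seed)) set.seed(control$seed)
  if (control$method == "cycle") {
    # Deterministic cycling through the variants.
    return(rep_len(variants, control$n))
  }
  # Random sampling, optionally weighted.
  idx <- sample.int(length(variants), size = control$n,
                    replace = control$replace, prob = control$weights)
  variants[idx]
}

control <- as_ensemble_control(4)
picks   <- pick_variants(c("intro A", "intro B"), control)
cycled  <- pick_variants(c("intro A", "intro B"),
                         as_ensemble_control(list(n = 3L, method = "cycle")))
```

Storing the normalized control object alongside the simulation output would also record how each ensemble was drawn.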
demographic_profiles

The demographic_profiles object should represent either a fixed set of profiles or a sampling frame from which profiles are drawn. This is important because the papers do not only vary wording; they also vary the participant being simulated (Hewitt et al. 2024a).

At minimum, each profile should be able to store:

- profile_id
- gender
- age
- ethnicity
- education
- ideology
- party
- weight
- label

Conceptually:

    demographic_profiles <- tibble::tibble(
      profile_id = c("p1", "p2"),
      gender = c("Female", "Male"),
      age = c("30-39", "Over 60"),
      ethnicity = c("White", "Black"),
      education = c("College", "Some college"),
      ideology = c("Conservative", "Moderate"),
      party = c("Strong Republican", "Lean Democrat"),
      weight = c(0.5, 0.5),
      label = c("profile_1", "profile_2")
    )

For nalanda, this object could support at least three modes:
For chapter work, the immediate value may be greatest for party or ideology, with richer demographic fields becoming more important when studying subgroup heterogeneity or matching a target population.
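
One plausible use of the weight column is weighted sampling of profiles for a population-matched run. The sketch below is an assumption about how that might work, not a description of an existing mode: sample_profiles() is a hypothetical name, and a base data.frame with a reduced set of columns stands in for the tibble above.

```r
# Reduced, illustrative profile set (base data.frame to avoid dependencies).
profiles <- data.frame(
  profile_id = c("p1", "p2"),
  party      = c("Strong Republican", "Lean Democrat"),
  weight     = c(0.5, 0.5)
)

# Hypothetical sampler: draw n profiles with probability proportional to weight.
sample_profiles <- function(profiles, n, seed = NULL) {
  stopifnot(is.data.frame(profiles),
            all(profiles$weight >= 0), sum(profiles$weight) > 0)
  # Note: set.seed() changes the global RNG state.
  if (!is.null(seed)) set.seed(seed)
  rows <- sample.int(nrow(profiles), size = n, replace = TRUE,
                     prob = profiles$weight)
  drawn <- profiles[rows, , drop = FALSE]
  rownames(drawn) <- NULL
  drawn
}

drawn <- sample_profiles(profiles, n = 10, seed = 1)
table(drawn$profile_id)
```

Sampling with replacement means the drawn set approximates the weights only in expectation, so small runs can be unbalanced.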
If the goal is to move one small piece at a time, my current order of work would be:

1. prompt_bank constructor and prompt-building helpers,

The main reason for this order is that prompt standardization and package-native summaries are useful immediately for the existing chapter workflow, whereas population-matched demographic simulation and cumulative exposure are likely to require more validation work.