This vignette is for users who want a practical introduction to the
simulation approach of Hewitt, Ashokkumar, Ghezae, and Willer (2024a, 2024b), and who want to
understand what the nalanda package can already do in that
spirit.
The short version is:
- raw simulated effects were too large on average, and the supplement reports a shrinkage coefficient of about 0.56 in that specific setting (Hewitt et al. 2024b);
- nalanda already supports several related simulation workflows, but users still need to do some parts manually.

The central question of the papers is whether a language model can be used to simulate how people would respond in real social science experiments (Hewitt et al. 2024a). Rather than asking the model to guess an effect size directly, the authors prompted it to act like many hypothetical survey participants, each exposed to a study condition and then asked the outcome question.
The broad workflow was:
The main finding was that model-based predictions were strongly correlated with the real treatment effects in their primary archive of U.S. survey experiments (Hewitt et al. 2024a). In plain terms, the model was often good at telling which interventions would work better than others, even if it was not perfect at recovering the exact numeric size of the effects.
The supplement is especially useful because it clarifies what parts of the procedure mattered most in practice (Hewitt et al. 2024b).
The prompts were not just a bare stimulus plus question. They included:
This matters because users trying to apply the same general approach should not assume that any single ad hoc prompt will behave like the published method.
The supplement reports that averaging over more prompt variants improved accuracy (Hewitt et al. 2024b). In practice, this means users should avoid treating a single prompt wording as decisive whenever cost permits.
The main paper describes prompting the model with specific participant profiles, including fields such as gender, age, race, education, ideology, and party (Hewitt et al. 2024a). The supplement suggests that matched demographic profiles gave only small or modest gains in some subgroup analyses, but they are still part of the paper’s design (Hewitt et al. 2024b).
The model was good at predicting relative effects, but its raw effect
estimates were too large on average in the primary archive. The
supplement reports a shrinkage coefficient of about 0.56,
meaning that raw predicted effects in that setting should be multiplied
by 0.56 to improve absolute calibration (Hewitt et al. 2024b).
This is one of the most important practical takeaways from the paper.
If you want to use these papers as a guide, the safest practical lessons are:
- treat the 0.56 factor as a useful paper-specific calibration, not as a universal law for every possible application.

nalanda

nalanda was not originally built as a line-by-line
reproduction of Hewitt et al. It was built around chapter-based
simulations for questions such as whether books or book chapters shift
attitudes. Even so, there is substantial overlap in spirit and
implementation.
At the moment, nalanda already supports:
- run_ai_on_chapters()
- run_ai_on_chapters_one_turn()
- simulate_treatment()
- compute_run_ai_metrics() and compute_run_ai_metrics_one_turn()

What nalanda does not yet fully provide out of the
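box:

- management of alternative prompt wordings;
- averaging over several prompt variants;
- participant profiles beyond identity-only conditioning;
- treatment versus control contrasts;
- calibration of downstream effect estimates.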
nalanda today

Below is a practical way to use the package while staying close to the lessons from Hewitt et al.
This is the most native current workflow in nalanda.
library(nalanda)

res <- run_ai_on_chapters(
  book_texts = my_book_texts,
  groups = c("Democrat", "Republican"),
  context_text = "You are simulating an American adult who politically identifies as a {identity}.",
  question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
  n_simulations = 20,
  temperature = 0,
  model = "gemini-2.5-flash-lite"
)

chapter_metrics <- compute_run_ai_metrics(res)

This gives you package-native pre/post summaries such as deltas and gap changes. That is already useful for understanding whether a chapter seems to shift the simulated participant before versus after exposure.
If you want something closer to the paper’s single-prompt logic, use the one-turn interface.
res_one_turn <- run_ai_on_chapters_one_turn(
  book_texts = my_book_texts,
  groups = c("Democrat", "Republican"),
  context_text = "You are simulating an American adult who politically identifies as a {identity}.",
  question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
  n_simulations = 20,
  temperature = 0,
  model = "gemini-2.5-flash-lite"
)

one_turn_metrics <- compute_run_ai_metrics_one_turn(res_one_turn)

This is not identical to the Hewitt et al. survey design, but it is closer to a post-only prompt structure where the chapter serves as the intervention text.
Some users will have treatment books and control books.
nalanda can already help generate the simulated outcomes
for each condition, even if the formal between-condition comparison
happens elsewhere.
One practical workflow is:
- label each book as control, treatment_a, or treatment_b.

Conceptually:
chapter_metrics$condition <- c("control", "treatment", "treatment")
# then analyze in your preferred downstream workflow
# for example with dplyr summaries, rempsyc helpers, or regression models

This is an important design point: nalanda can own the
simulation workflow without needing to own every inferential
comparison.
At this stage, users who want to approximate the Hewitt et al. method more closely need to handle several parts themselves.
Right now, users still need to manage alternative prompt wordings in their own scripts. A good practical habit is to write down:
Even if these are stored as plain character vectors in a script, that is better than repeatedly editing one long prompt string by hand.
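As a minimal sketch of that habit, the variants can live in named character vectors that are crossed into a grid. The alternative wordings below are illustrative assumptions, not the prompts used by Hewitt et al.; only the first wording of each vector and the {identity} and {group} placeholders come from the examples earlier in this vignette.

```r
# Hypothetical prompt variants; only v1 of each vector appears in this vignette.
context_variants <- c(
  v1 = "You are simulating an American adult who politically identifies as a {identity}.",
  v2 = "Answer as an American adult whose political identity is {identity}.",
  v3 = "Imagine you are a {identity} living in the United States."
)

question_variants <- c(
  v1 = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
  v2 = "Using a 0 to 100 scale, how warm are your feelings towards {group}s?"
)

# Cross the two vectors so that every context/question pair is listed once.
prompt_grid <- expand.grid(
  context_id = names(context_variants),
  question_id = names(question_variants),
  stringsAsFactors = FALSE
)
prompt_grid$context_text <- unname(context_variants[prompt_grid$context_id])
prompt_grid$question_text <- unname(question_variants[prompt_grid$question_id])

nrow(prompt_grid)  # 6 combinations (3 contexts x 2 questions)
```

Keeping an identifier for each wording makes it straightforward to record which variant produced which simulated responses.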
If you want to average over several prompt variants, you currently need to do this by running several simulations and combining them yourself. This is important if you want to remain closer to the published workflow.
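The combining step can be sketched in base R. This example assumes you have already collected one estimated shift per prompt variant and chapter into a single data frame; the variant_results object, its column names, and its values are hypothetical and do not reflect the actual structure returned by nalanda.

```r
# Hypothetical per-variant results: one row per prompt variant and chapter.
# In practice each block of rows would come from a separate simulation run;
# the numbers here are made up purely to show the mechanics.
variant_results <- data.frame(
  variant = rep(c("v1", "v2", "v3"), each = 2),
  chapter = rep(c("ch1", "ch2"), times = 3),
  delta   = c(4, -2, 6, 0, 5, -1)
)

# Average the estimated shift across prompt variants within each chapter.
ensemble <- aggregate(delta ~ chapter, data = variant_results, FUN = mean)

# Keep the between-variant spread to see how sensitive each chapter is to wording.
ensemble$delta_sd <- aggregate(delta ~ chapter, data = variant_results, FUN = sd)$delta

ensemble
```

Reporting the spread alongside the mean shows whether a conclusion depends on one particular wording.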
Users who want more than identity-only conditioning should currently define their own participant profiles in a data frame or list and loop over them.
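One possible shape for this, sketched under assumptions: the profile fields, their levels, and the sentence template below are illustrative, and whether the nalanda functions accept a fully spelled-out context string in place of the {identity} template is something to check against the package documentation.

```r
# Hypothetical participant profiles; field names and levels are illustrative.
profiles <- expand.grid(
  party     = c("Democrat", "Republican"),
  age       = c(25, 45, 65),
  education = c("high school diploma", "college degree"),
  stringsAsFactors = FALSE
)

# Turn each profile row into a context string for the simulated participant.
profiles$context_text <- sprintf(
  "You are simulating a %d-year-old American adult with a %s who politically identifies as a %s.",
  as.integer(profiles$age), profiles$education, profiles$party
)

# Loop over profiles and collect one result per profile.
results <- vector("list", nrow(profiles))
for (i in seq_len(nrow(profiles))) {
  # Placeholder: swap in your own simulation call here, using
  # profiles$context_text[i] as the context for that run.
  results[[i]] <- list(profile_id = i, context_text = profiles$context_text[i])
}

nrow(profiles)  # 12 profiles (2 x 3 x 2)
```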
If your design involves treatment versus control comparisons,
nalanda can help produce the simulated outcomes, but you
will currently need to estimate contrasts yourself using your preferred
analysis workflow.
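A minimal base R sketch of such a contrast follows. It assumes the simulated outcomes have been gathered into a data frame with one row per simulated participant; the sim_outcomes object, its column names, and its values are hypothetical.

```r
# Hypothetical simulated outcomes: one row per simulated participant.
# Column names and values are made up for illustration.
sim_outcomes <- data.frame(
  condition = rep(c("control", "treatment_a", "treatment_b"), each = 4),
  outcome   = c(50, 52, 48, 50,  55, 57, 53, 55,  49, 51, 47, 49)
)

# Mean outcome per condition.
condition_means <- tapply(sim_outcomes$outcome, sim_outcomes$condition, mean)

# Treatment-minus-control contrasts.
contrasts_vs_control <- condition_means[c("treatment_a", "treatment_b")] -
  condition_means[["control"]]

# The same contrasts from a regression with control as the reference level.
sim_outcomes$condition <- relevel(factor(sim_outcomes$condition), ref = "control")
fit <- lm(outcome ~ condition, data = sim_outcomes)
coef(fit)
```

The regression form is convenient when you later want standard errors or covariates, while the difference in means is easier to audit by hand.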
If you estimate an effect column downstream and want to apply the paper’s primary-archive calibration, you should currently do so yourself and document it clearly.
For example:
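The sketch below assumes the downstream effect estimates already sit in a data frame. The effects object, its column names, and its values are hypothetical; only the 0.56 coefficient comes from the supplement (Hewitt et al. 2024b).

```r
# Hypothetical downstream effect estimates (treatment minus control).
effects <- data.frame(
  contrast   = c("treatment_a", "treatment_b"),
  raw_effect = c(5, -1)
)

# Shrinkage coefficient reported for the primary archive in Hewitt et al. (2024b).
shrinkage <- 0.56

# Keep the raw estimate and add the calibrated one in a clearly named column.
effects$calibrated_effect <- effects$raw_effect * shrinkage

effects
```

Storing both columns side by side keeps the calibration step visible to anyone reading the results.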
0.56 factor

This is the single most tempting number to overgeneralize from the papers, so it is worth being explicit.
Using 0.56 is most defensible when:
You should be more cautious when:
For most users, the best current practice is:
- 0.56,

If you want a practical, cautious workflow inspired by Hewitt et al., the following is a reasonable current recipe.

- nalanda;

If cost and time allow, improve on the minimal workflow by:
nalanda may support later

Planned or plausible future extensions include:
These features would make it easier for users to get closer to the Hewitt et al. workflow without writing as much scaffolding themselves.
For most current users, the best way to understand the Hewitt et al. papers is to think of them as a guide to disciplined simulation rather than as a recipe that transfers mechanically to every design.
The main practical lessons are:
- apply the 0.56 correction cautiously and transparently.