---
title: "Understanding Hewitt et al. and Using Nalanda Today"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Understanding Hewitt et al. and Using Nalanda Today}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Overview

This vignette is for users who want a practical introduction to the simulation approach in Hewitt, Ashokkumar, Ghezae, and Willer [@hewitt2024predicting; @hewitt2024supplement], and who want to understand what the `nalanda` package can already do in that spirit.

The short version is:

1. the papers show that large language models can often predict the direction and relative size of social science experimental effects surprisingly well;
2. this works best for text-based survey experiments;
3. the method uses careful prompting, demographic conditioning, and averaging over multiple prompts;
4. raw model-predicted effects were too large on average in their main archive, so the authors recommend scaling raw effect estimates by a factor of about `0.56` in that specific setting [@hewitt2024supplement];
5. `nalanda` already supports several related simulation workflows, but users still need to do some parts manually.

# What the papers did, in simple language

The central question of the papers is whether a language model can be used to simulate how people would respond in real social science experiments [@hewitt2024predicting]. Rather than asking the model to guess an effect size directly, the authors prompted the model to act like many hypothetical survey participants, each exposed to a study condition and then asked the outcome question.

The broad workflow was:

1. describe the study setting,
2. describe a hypothetical participant,
3. show the treatment text,
4. ask the outcome question on the original response scale,
5. repeat this many times across conditions, people, and prompt variants, and
6. compare average responses across conditions.

The main finding was that model-based predictions were strongly correlated with the real treatment effects in their primary archive of U.S. survey experiments [@hewitt2024predicting]. In plain terms, the model was often good at telling which interventions would work better than others, even if it was not perfect at recovering the exact numeric size of the effects.

# What the supplement adds

The supplement is especially useful because it clarifies which parts of the procedure mattered most in practice [@hewitt2024supplement].

## 1. Prompting strategy matters

The prompts were not just a bare stimulus plus question. They included:

1. an introductory framing sentence,
2. a short description of the study context,
3. a description of the hypothetical participant,
4. the experimental stimulus, and
5. the outcome question and scale.

This matters because users trying to apply the same general approach should not assume that any single ad hoc prompt will behave like the published method.

## 2. Ensemble prompting matters

The supplement reports that averaging over more prompt variants improved accuracy [@hewitt2024supplement]. In practice, this means users should avoid treating a single prompt wording as decisive whenever cost permits.

## 3. Demographic conditioning is part of the method

The main paper describes prompting the model with specific participant profiles, including fields such as gender, age, race, education, ideology, and party [@hewitt2024predicting]. The supplement suggests that matched demographic profiles gave only small or modest gains in some subgroup analyses, but they are still part of the paper's design [@hewitt2024supplement].

## 4. Absolute effect sizes need caution

The model was good at predicting relative effects, but its raw effect estimates were too large on average in the primary archive.
The supplement reports a shrinkage coefficient of about `0.56`, meaning that raw predicted effects in that setting should be multiplied by `0.56` to improve absolute calibration [@hewitt2024supplement]. This is one of the most important practical take-aways from the paper.

# Recommended take-aways for users

If you want to use these papers as a guide, the safest practical lessons are:

1. focus first on relative comparisons and ranking, not only exact effect-size recovery;
2. use multiple prompt variants rather than relying on one wording;
3. be explicit about who is being simulated;
4. preserve the original response scale in the prompt;
5. keep raw outputs and calibrated outputs separate;
6. treat the `0.56` factor as a useful paper-specific calibration, not as a universal law for every possible application.

# How this relates to `nalanda`

`nalanda` was not originally built as a line-by-line reproduction of Hewitt et al. It was built around chapter-based simulations for questions such as whether books or book chapters shift attitudes. Even so, there is substantial overlap in spirit and implementation.

At the moment, `nalanda` already supports:

1. **Pre/post chapter simulations** with `run_ai_on_chapters()`.
2. **Post-only one-turn simulations** with `run_ai_on_chapters_one_turn()`.
3. **Custom prompt sequences** with `simulate_treatment()`.
4. **Summary functions for package-native outputs**, such as `compute_run_ai_metrics()` and `compute_run_ai_metrics_one_turn()`.

What `nalanda` does not yet fully provide out of the box:

1. a built-in prompt bank matching the paper's exact strategy,
2. a first-class ensemble prompting object,
3. a first-class demographic profile object,
4. a dedicated condition-based experimental wrapper modeled directly on the Hewitt et al. survey workflow,
5. built-in between-condition inferential contrasts, and
6. a standard calibration helper for externally estimated effect columns.

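
Until such helpers exist, the last two gaps in the list above can be covered with a few lines of base R. The sketch below is illustrative only: the data frame `sim`, its column names, the response values, and the helper `calibrate_effect()` are all hypothetical and are not part of `nalanda`. It pools simulated responses within each condition, takes the treatment-minus-control difference, and stores a calibrated copy next to the raw effect instead of overwriting it.

```r
# Hypothetical simulated responses on a 0-100 scale (made-up values),
# one row per simulated participant and prompt variant.
sim <- data.frame(
  condition = c("control", "control", "control", "control",
                "treatment", "treatment", "treatment", "treatment"),
  prompt_variant = c("a", "b", "a", "b", "a", "b", "a", "b"),
  response = c(40, 44, 42, 46, 50, 56, 52, 54)
)

# Average within each condition, pooling over prompt variants.
mean_control <- mean(sim$response[sim$condition == "control"])
mean_treatment <- mean(sim$response[sim$condition == "treatment"])

# Raw effect: treatment mean minus control mean.
raw_effect <- mean_treatment - mean_control

# Hypothetical helper (not a nalanda function): scale a raw effect by a
# calibration factor. The default 0.56 is the paper-specific value and is
# an assumption that may not transfer to other designs.
calibrate_effect <- function(effect, factor = 0.56) {
  effect * factor
}

# Keep raw and adjusted values side by side and record the source.
effects <- data.frame(
  raw_effect = raw_effect,
  adjusted_effect = calibrate_effect(raw_effect),
  calibration_source = "Hewitt et al. 2024 primary survey archive"
)
effects
```

With these made-up values the raw effect is 10 points and the adjusted effect is 5.6 points. Keeping both columns makes it easy to report the raw result and to drop or change the calibration factor later.
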

# A concrete way to use `nalanda` today

Below is a practical way to use the package while staying close to the lessons from Hewitt et al.

## Scenario 1: Pre/post chapter simulations

This is the most native current workflow in `nalanda`.

```{r, eval = FALSE}
library(nalanda)

res <- run_ai_on_chapters(
  book_texts = my_book_texts,
  groups = c("Democrat", "Republican"),
  context_text = "You are simulating an American adult who politically identifies as a {identity}.",
  question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
  n_simulations = 20,
  temperature = 0,
  model = "gemini-2.5-flash-lite"
)

chapter_metrics <- compute_run_ai_metrics(res)
```

This gives you package-native pre/post summaries such as deltas and gap changes. That is already useful for understanding whether a chapter seems to shift the simulated participant's responses between the pre-exposure and post-exposure measurements.

## Scenario 2: Post-only simulations

If you want something closer to the paper's single-prompt logic, use the one-turn interface.

```{r, eval = FALSE}
res_one_turn <- run_ai_on_chapters_one_turn(
  book_texts = my_book_texts,
  groups = c("Democrat", "Republican"),
  context_text = "You are simulating an American adult who politically identifies as a {identity}.",
  question_text = "On a scale from 0 to 100, how warmly do you feel towards {group}s?",
  n_simulations = 20,
  temperature = 0,
  model = "gemini-2.5-flash-lite"
)

one_turn_metrics <- compute_run_ai_metrics_one_turn(res_one_turn)
```

This is not identical to the Hewitt et al. survey design, but it is closer to a post-only prompt structure where the chapter serves as the intervention text.

## Scenario 3: Control-versus-treatment chapter comparisons

Some users will have treatment books and control books. `nalanda` can already help generate the simulated outcomes for each condition, even if the formal between-condition comparison happens elsewhere.

One practical workflow is:

1. run the simulation separately or jointly on all chapters or books,
2. attach a condition label such as `control`, `treatment_a`, or `treatment_b`,
3. compute the package-native summaries,
4. pass the resulting data frame to your preferred downstream tool for mean comparisons, contrasts, regression, or meta-analytic summaries.

Conceptually:

```{r, eval = FALSE}
# Illustrative only: supply one condition label per row of chapter_metrics
chapter_metrics$condition <- c("control", "treatment", "treatment")

# then analyze in your preferred downstream workflow
# for example with dplyr summaries, rempsyc helpers, or regression models
```

This is an important design point: `nalanda` can own the simulation workflow without needing to own every inferential comparison.

# What users should currently do themselves

At this stage, users who want to approximate the Hewitt et al. method more closely need to handle several parts themselves.

## 1. Build or manage a prompt bank manually

Right now, users still need to manage alternative prompt wordings in their own scripts. A good practical habit is to write down:

1. multiple introductory variants,
2. the study-setting text,
3. the participant-profile text,
4. the intervention text wrapper, and
5. the outcome question.

Even if these are stored as plain character vectors in a script, that is better than repeatedly editing one long prompt string by hand.

## 2. Run ensembles manually

If you want to average over several prompt variants, you currently need to do this by running several simulations and combining them yourself. This is important if you want to remain closer to the published workflow.

## 3. Manage demographic profiles manually

Users who want more than identity-only conditioning should currently define their own participant profiles in a data frame or list and loop over them.

## 4. Estimate between-condition contrasts downstream

If your design involves treatment versus control comparisons, `nalanda` can help produce the simulated outcomes, but you will currently need to estimate contrasts yourself using your preferred analysis workflow.

## 5. Apply effect calibration explicitly

If you estimate an effect column downstream and want to apply the paper's primary-archive calibration, you should currently do so yourself and document it clearly.

For example:

```{r, eval = FALSE}
results$adjusted_effect <- results$raw_effect * 0.56
results$calibration_source <- "Hewitt et al. 2024 primary survey archive"
```

# How to think about the `0.56` factor

This is the single most tempting number to overgeneralize from the papers, so it is worth being explicit.

## When it is reasonable to use it

Using `0.56` is most defensible when:

1. your application is fairly close to the paper's primary setting,
2. your outcome is a text-based survey-style response,
3. you are estimating condition differences on the original response scale, and
4. you want a rough calibration of absolute effect magnitudes.

## When to be cautious

You should be more cautious when:

1. your design is very different from their survey archive,
2. your intervention is a long book chapter rather than a short survey treatment,
3. your outcome is behavioral or cumulative rather than a direct survey item,
4. your analysis focuses on subgroup heterogeneity, or
5. you are working outside the kind of U.S. survey setting used in the paper.

## Best current practice

For most users, the best current practice is:

1. report raw results,
2. if useful, add a separate adjusted result using `0.56`,
3. label clearly where the adjustment came from, and
4. do not overwrite the raw values.

# A simple recommended workflow for users

If you want a practical, cautious workflow inspired by Hewitt et al., the following is a reasonable current recipe.

## Minimal workflow

1. choose a clear outcome question and scale;
2. decide whether your design is pre/post or post-only;
3. define at least a small set of prompt variants;
4. run multiple simulations per chapter or condition;
5. compute package-native summary metrics in `nalanda`;
6. if needed, estimate between-condition comparisons downstream;
7. optionally create a separate calibrated version of any effect column.

## Better workflow

If cost and time allow, improve on the minimal workflow by:

1. using several prompt variants rather than one,
2. simulating several participant profiles rather than one generic identity,
3. checking whether conclusions are stable across prompts and profiles,
4. comparing raw and adjusted effects side by side, and
5. treating conclusions as stronger when rank-order patterns are robust.

# What `nalanda` may support later

Planned or plausible future extensions include:

1. built-in prompt bank objects,
2. first-class ensemble controls,
3. first-class demographic profile objects,
4. a condition-based experimental simulation wrapper,
5. small calibration helpers, and
6. cumulative chapter simulation designs.

These features would make it easier for users to get closer to the Hewitt et al. workflow without writing as much scaffolding themselves.

# Final practical advice

For most current users, the best way to understand the Hewitt et al. papers is to think of them as a guide to disciplined simulation rather than as a recipe that transfers mechanically to every design.

The main practical lessons are:

1. prompt carefully,
2. average across prompts when possible,
3. be explicit about who is being simulated,
4. distinguish simulation from inference,
5. keep raw and adjusted outputs separate, and
6. use the `0.56` correction cautiously and transparently.

# References