This vignette shows how to use nalanda for the kind of
workflow described by Rathje et al. (2024): apply a simple prompt to
many short texts, ask for a numeric response, and compare model outputs
to human annotations.
The goal here is not to reproduce every benchmark in the paper, but to give a simple getting-started pattern you can adapt for:
language column.
As in the other live nalanda workflows, it is easiest to
set model routing once at the top of your script.
library(nalanda)
options(
nalanda.integration = "gpt-5-mini",
nalanda.base_url = "https://ai-gateway.apps.cloud.rt.nyu.edu/v1/"
)
# In some Portkey/gateway setups the route slug is not the provider name.
# Verify the route with ellmer::models_portkey() or use a fully-qualified
# model string such as "@gpt-5-mini/gpt-5-mini" if that is the route that works
# in your gateway.
The paper works row-wise over tweets or headlines.
run_text_analysis() uses the same pattern: one row per text.
texts <- tibble::tibble(
id = 1:4,
language = c("English", "English", "Hindi", "Simplified Chinese"),
text = c(
"I love this new community project.",
"This policy announcement is fine, I guess.",
"\u092f\u0939 \u0916\u092c\u0930 \u092c\u0939\u0941\u0924 \u0905\u091a\u094d\u091b\u0940 \u0939\u0948\u0964",
"\u6211\u4e0d\u559c\u6b22\u4ed6\u4eec\u5904\u7406\u8fd9\u4e2a\u95ee\u9898\u7684\u65b9\u5f0f\u3002"
),
human_sentiment = c(1, 2, 1, 3)
)
texts
#> # A tibble: 4 × 4
#> id language text human_sentiment
#> <int> <chr> <chr> <dbl>
#> 1 1 English I love this new community project. 1
#> 2 2 English This policy announcement is fine, I … 2
#> 3 3 Hindi यह खबर बहुत अच्छी है। 1
#> 4 4 Simplified Chinese 我不喜欢他们处理这个问题的方式。 3
Here the human labels follow the same coding style used in the paper:
- 1 = positive
- 2 = neutral
- 3 = negative
The screenshot tutorial shows a very direct prompt. You can build the
same kind of prompt with make_annotation_prompt().
sentiment_prompt <- make_annotation_prompt(
question = "Is the sentiment of this {language} text positive, neutral, or negative?",
labels = c("positive", "neutral", "negative")
)
cat(sentiment_prompt)
#> Is the sentiment of this {language} text positive, neutral, or negative?
#> Answer only with a number: 1 if positive, 2 if neutral, 3 if negative
#> Here is the text:
#> {text}
This returns a prompt template, not a final prompt. The
{language} and {text} placeholders will be
filled separately for each row.
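To make the template idea concrete, here is a minimal base R sketch of per-row placeholder filling. The helper `fill_template()` is hypothetical and written only for illustration; it is not nalanda's implementation, which may handle placeholders differently.

```r
# Illustrative only: a hand-rolled placeholder filler, not part of nalanda.
fill_template <- function(template, row) {
  out <- template
  for (name in names(row)) {
    # fixed = TRUE treats "{name}" as a literal string, not a regular expression
    out <- gsub(paste0("{", name, "}"), as.character(row[[name]]), out, fixed = TRUE)
  }
  out
}

# The template text printed above, rebuilt by hand
template <- paste(
  "Is the sentiment of this {language} text positive, neutral, or negative?",
  "Answer only with a number: 1 if positive, 2 if neutral, 3 if negative",
  "Here is the text:",
  "{text}",
  sep = "\n"
)

rows <- data.frame(
  language = c("English", "English"),
  text = c(
    "I love this new community project.",
    "This policy announcement is fine, I guess."
  ),
  stringsAsFactors = FALSE
)

# One filled prompt per row
filled <- vapply(
  seq_len(nrow(rows)),
  function(i) fill_template(template, rows[i, , drop = FALSE]),
  character(1)
)
cat(filled[1])
```

Each element of `filled` is a complete prompt for one row, which is the sense in which the template is "filled separately for each row".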
Now apply the prompt to every row with
run_text_analysis(). The result schema is defined with
ellmer just like in the other nalanda
workflows.
res <- run_text_analysis(
data = texts,
id_col = "id",
text_col = "text",
prompt = sentiment_prompt,
response_type = ellmer::type_object(
gpt = ellmer::type_number()
),
n_simulations = 1,
temperature = 0,
model = "gpt-5-mini"
)
The important differences from the older chapter-based functions are:
{column_name}, and
Each row of the result corresponds to one text and one simulation run.
| id | language | sim | human_sentiment | gpt | text |
|---|---|---|---|---|---|
| 1 | English | 1 | 1 | 1 | I love this new community project. |
| 2 | English | 1 | 2 | 2 | This policy announcement is fine, I guess. |
| 3 | Hindi | 1 | 1 | 1 | यह खबर बहुत अच्छी है। |
| 4 | Simplified Chinese | 1 | 3 | 3 | 我不喜欢他们处理这个问题的方式。 |
This is the same basic structure as the screenshot workflow, but the parsing is already handled for you because the response is extracted as a structured numeric field.
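Because the response arrives as a numeric code, it can be convenient to map it back to readable labels before inspecting disagreements. The sketch below uses a toy data frame, `res_toy`, that copies the values from the table above; the names `res_toy`, `code_labels`, and `confusion` are introduced here for illustration and are not part of nalanda.

```r
# Toy stand-in for the result table above (values copied from it).
res_toy <- data.frame(
  id = 1:4,
  human_sentiment = c(1, 2, 1, 3),
  gpt = c(1, 2, 1, 3)
)

# Coding scheme used in this vignette: 1 = positive, 2 = neutral, 3 = negative
code_labels <- c("positive", "neutral", "negative")

res_toy$human_label <- factor(res_toy$human_sentiment, levels = 1:3, labels = code_labels)
res_toy$gpt_label <- factor(res_toy$gpt, levels = 1:3, labels = code_labels)

# Cross-tabulate human labels against model labels
confusion <- table(human = res_toy$human_label, model = res_toy$gpt_label)
confusion
```

Off-diagonal cells in `confusion` would show where the model and the human annotators disagree; with these four rows every count falls on the diagonal.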
Rathje et al. compare GPT output to human annotations with metrics
such as accuracy, macro F1, and Spearman correlations.
evaluate_text_analysis() provides a simple package-native
version of that step.
scores <- evaluate_text_analysis(
res,
truth_col = "human_sentiment",
estimate_col = "gpt",
by = "language",
metric = c("accuracy", "macro_precision", "macro_recall", "macro_f1")
)
scores
| language | n | accuracy | macro_precision | macro_recall | macro_f1 |
|---|---|---|---|---|---|
| English | 2 | 1 | 1 | 1 | 1 |
| Hindi | 1 | 1 | 1 | 1 | 1 |
| Simplified Chinese | 1 | 1 | 1 | 1 | 1 |
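For readers who want to see what these metrics measure, here is a base R sketch of accuracy and macro F1 computed by hand. It is an illustration under stated assumptions, not the code behind evaluate_text_analysis(): classes are taken as the union of codes seen in either vector, and a class with no true positives scores 0. The `estimate` vector contains one deliberately wrong, hypothetical prediction so that the scores are not all 1.

```r
# Hand-rolled metrics for illustration; conventions may differ from nalanda's.
accuracy <- function(truth, estimate) {
  mean(truth == estimate)
}

macro_f1 <- function(truth, estimate) {
  classes <- sort(unique(c(truth, estimate)))
  f1_per_class <- vapply(classes, function(k) {
    tp <- sum(truth == k & estimate == k)  # true positives for class k
    fp <- sum(truth != k & estimate == k)  # false positives for class k
    fn <- sum(truth == k & estimate != k)  # false negatives for class k
    if (tp == 0) return(0)                 # assumed convention: no hits means F1 of 0
    precision <- tp / (tp + fp)
    recall <- tp / (tp + fn)
    2 * precision * recall / (precision + recall)
  }, numeric(1))
  # Macro averaging: unweighted mean over classes
  mean(f1_per_class)
}

truth <- c(1, 2, 1, 3)
estimate <- c(1, 2, 2, 3)  # hypothetical: the third text is misclassified

accuracy(truth, estimate)
macro_f1(truth, estimate)
```

Macro averaging weights every class equally, so a rare class that the model handles badly lowers the score more than it would lower plain accuracy.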
For Likert-style tasks, switch the metric set to something like:
The paper also evaluates headline sentiment and emotions on 1 to 7 scales. That prompt style is also supported.
likert_prompt <- make_annotation_prompt(
question = "How negative or positive is this headline on a 1 to 7 scale?",
scale = c(1, 7),
anchors = c("very negative", "very positive"),
text_label = "Here is the headline:"
)
cat(likert_prompt)
#> How negative or positive is this headline on a 1 to 7 scale?
#> Answer only with a number, with 1 being "very negative" and 7 being "very positive".
#> Here is the headline:
#> {text}
The live call looks the same, except the response field now represents a Likert rating instead of a class code.
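Once ratings are on a 1 to 7 scale, rank-based agreement is usually more informative than exact-match accuracy. The base R sketch below computes a Spearman correlation, which the paper reports, plus a mean absolute error added here only as an extra illustration. Both rating vectors are hypothetical values invented for the example, not results from the paper or from a live call.

```r
# Hypothetical 1-7 ratings for six headlines (not real data)
human_rating <- c(2, 5, 6, 1, 4, 7)
model_rating <- c(3, 5, 7, 1, 3, 6)

# Spearman correlation compares the rank order of the two sets of ratings
rho <- cor(human_rating, model_rating, method = "spearman")

# Mean absolute error: average distance in scale points
mae <- mean(abs(human_rating - model_rating))

rho
mae
```

Spearman correlation only asks whether the model orders the headlines the way humans do, so it is unaffected by a model that is consistently one point too high; the mean absolute error would pick that up.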
The paper also checks whether repeated runs produce similar outputs.
To do that, increase n_simulations.
res_repeated <- run_text_analysis(
data = texts,
id_col = "id",
text_col = "text",
prompt = sentiment_prompt,
response_type = ellmer::type_object(
gpt = ellmer::type_number()
),
n_simulations = 2,
temperature = 0,
model = "gpt-5-mini"
)
Then compare run 1 and run 2 with
evaluate_text_analysis() after reshaping the results into
one column per run.
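The reshaping step can be sketched in base R as follows. The data frame `long` is a toy stand-in that assumes the result has `id`, `sim`, and `gpt` columns, as in the result table shown earlier; the run 2 values are hypothetical and include one deliberate disagreement. The resulting column names `gpt.1` and `gpt.2` come from base `reshape()`, not from nalanda.

```r
# Toy long-format result: one row per text and simulation run
long <- data.frame(
  id = rep(1:4, times = 2),
  sim = rep(1:2, each = 4),
  gpt = c(1, 2, 1, 3,   # run 1
          1, 3, 1, 3)   # run 2 (hypothetical values)
)

# One row per text, one column per run: gpt.1 and gpt.2
wide <- reshape(
  long,
  idvar = "id",
  timevar = "sim",
  v.names = "gpt",
  direction = "wide"
)

# Share of texts that received the same code in both runs
agreement <- mean(wide$gpt.1 == wide$gpt.2)
agreement
```

With the data in this shape, one run's column can presumably serve as truth_col and the other as estimate_col in evaluate_text_analysis(), though that usage is an assumption based on the arguments shown earlier.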
Use this vignette’s workflow when:
Use the chapter-oriented workflows when your unit is still a book chapter and you care about pre/post changes across simulated identities.
Rathje, S., Mirea, D. M., Sucholutsky, I., Marjieh, R., Robertson, C. E., & Van Bavel, J. J. (2024). GPT is an effective tool for multilingual psychological text analysis. Proceedings of the National Academy of Sciences, 121(34), e2308950121. https://doi.org/10.1073/pnas.2308950121