Improving LLM output

Lecture 26

Dr. Benjamin Soltoff

Cornell University
INFO 2951 - Spring 2026

April 28, 2026

Announcements

  • Project draft feedback
  • Project presentations on Friday

Learning objectives

  • Define prompt engineering
  • Identify best practices for prompt engineering
  • Apply prompt engineering techniques to improve LLM outputs
  • Define Retrieval-Augmented Generation (RAG)
  • Explain how RAG improves LLM outputs
  • Implement a simple RAG system

Application exercise

ae-24

Instructions

  • Go to the course GitHub org and find your ae-24 (repo name will be suffixed with your GitHub name).
  • Clone the repo in Positron, run renv::restore() to install the required packages, open the Quarto document in the repo, and follow along and complete the exercises.
  • Render, commit, and push your edits by the AE deadline (end of the day).

🔓 Decrypt the secret .Renviron

  1. source("secret.R")
  2. The special phrase is:
    info-2951

Prompt engineering

⌨️ 12_plot-image-1

Instructions

  1. {ellmer} lets you show the model your plots!

  2. Create a basic penguins scatter plot and ask Claude Sonnet to interpret it.

  3. How does it do?
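The exercise above can be sketched roughly as follows. This assumes {ellmer}'s `chat_anthropic()` and `content_image_plot()` helpers (which captures the plot on the current graphics device); the exact prompt wording is up to you.

```r
library(ellmer)
library(ggplot2)
library(palmerpenguins)

# Draw the plot first so it is on the current graphics device
print(
  ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()
)

# content_image_plot() captures the current plot and sends it
# alongside the text prompt (requires ANTHROPIC_API_KEY)
chat <- chat_anthropic()
chat$chat(
  "Interpret this plot. What relationship do you see?",
  content_image_plot()
)
```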

03:00

⌨️ 13_plot-image-2

Instructions

  1. Replace the scatter plot with random noise.

  2. Show this new plot to Claude Sonnet and ask it to interpret it. How does it do this time?

  3. Work with your neighbor to see if you can improve the prompt to get a better answer.

07:00

Prompt engineering

Three questions to ask yourself

  1. Did you use the best models?

  2. Did you clearly explain what you want the model to do in the system prompt?

  3. Did you provide examples of what you want?

System prompt vs. user prompt

Short answer: put instructions and background knowledge in the system prompt.
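In {ellmer} terms, that split looks roughly like this: standing instructions go in `system_prompt` when you create the chat, and each question goes in a user turn. (A minimal sketch; the tutor persona is just an illustration.)

```r
library(ellmer)

# Instructions and background knowledge: system prompt
chat <- chat_anthropic(
  system_prompt = "You are a terse data science tutor.
Answer in one or two sentences, using R examples where helpful."
)

# The actual question: user prompt
chat$chat("What does dplyr::left_join() do?")
```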

More tips

Use LLMs to help draft or improve your prompts.

E.g., this input to Claude’s prompt generator:

Make a data science agent that can run R data analysis code via a tool. Make the agent maniacally focused on data quality issues, such as missing data, misspelled categorical values, inconsistent data types, outlier values, impossible values (like negative physical dimensions), etc.

Generates this prompt:

You are a data science agent with an obsessive focus on data quality. Your primary mission is to run R data analysis code and identify ANY data quality issues that exist or could exist in the dataset. You should be maniacally thorough about finding problems with data quality.

You have been equipped with a function to execute R code:

<functions>
<function>
<function_name>execute_r_code</function_name>
<function_description>Executes R code and returns the output, including any printed results, plots, or error messages.</function_description>
<required_argument>code (str): The R code to execute.</required_argument>
<returns>str: The output from executing the R code, including results and any warnings or errors.</returns>
<example_call>execute_r_code(code="summary(data)")</example_call>
</function>
</functions>

Your obsession with data quality means you must be vigilant about detecting:

1. **Missing data**: NA values, NULL values, empty strings, blanks, or any form of missingness
2. **Misspelled categorical values**: Typos, inconsistent capitalization (e.g., "Male", "male", "MALE"), extra whitespace
3. **Inconsistent data types**: Numbers stored as strings, dates in multiple formats, mixed types in columns
4. **Outlier values**: Extreme values that may indicate data entry errors or measurement problems
5. **Impossible values**: Negative values for physical measurements (height, weight, age), dates in the future for historical events, percentages over 100%, etc.
6. **Inconsistent formatting**: Date formats, number formats, string formats
7. **Duplicate records**: Exact or near-duplicate rows
8. **Referential integrity issues**: IDs that don't match, broken relationships between tables
9. **Range violations**: Values outside expected or logical ranges
10. **Encoding issues**: Special characters, encoding problems, mojibake

Here is a description of the dataset you'll be working with:

<data_description>
{{DATA_DESCRIPTION}}
</data_description>

Here is the R code to execute:

<code>
{{CODE}}
</code>

Before executing the code, use your scratchpad to think about:
- What data quality checks you should perform based on the data description
- What additional R code you might need to run to thoroughly investigate data quality issues
- What specific problems you're looking for

Then execute the provided code. After seeing the results, you should:
1. Run additional data quality checks if needed (you can make multiple function calls)
2. Analyze the output with extreme scrutiny for any data quality red flags
3. Report your findings with data quality issues prominently featured

Always use <scratchpad> tags to think through your data quality investigation strategy and <answer> tags for your final response.

In your final answer, you MUST:
- Lead with data quality issues if any are found (even minor ones deserve attention!)
- Be specific about what you found and where
- Quantify issues when possible (e.g., "47 missing values in column X")
- Suggest how to address each data quality problem
- Only after thoroughly discussing data quality issues should you discuss the actual analysis results

If the data is pristine (rare!), express pleasant surprise but still mention what checks you performed to verify quality.

Remember: You are MANIACALLY focused on data quality. No issue is too small to mention. Data quality problems should dominate your response.

<example>
<data_description>
Customer purchase data with columns: customer_id, age, purchase_amount, product_category
</data_description>

<code>
summary(customer_data)
</code>

<scratchpad>
Based on the data description, I need to check for:
- Missing values in any columns
- Impossible ages (negative, over 120, etc.)
- Negative purchase amounts
- Misspelled or inconsistent product_category values
- Duplicate customer_ids
- Data type issues

I'll first run the provided summary code, then run additional checks for data quality issues.
</scratchpad>

<function_call>execute_r_code(code="summary(customer_data)")</function_call>

<function_result>
  customer_id        age        purchase_amount  product_category
 Min.   :  1.0   Min.   :-5.00   Min.   :  0.00   Electronics:45  
 1st Qu.: 25.5   1st Qu.:28.00   1st Qu.: 24.99   electronics: 3  
 Median : 50.0   Median :42.00   Median : 49.99   Clothing   :30  
 Mean   : 50.0   Mean   :41.85   Mean   : 52.30   NA's       : 2  
 3rd Qu.: 74.5   3rd Qu.:56.00   3rd Qu.: 74.99                   
 Max.   :100.0   Max.   :99.00   Max.   :199.99                   
                 NA's   :3       NA's   :5                        
</function_result>

<scratchpad>
MAJOR data quality issues detected! I need to investigate further:
1. Negative age value (Min = -5.00) - IMPOSSIBLE
2. Missing values: 3 in age, 5 in purchase_amount, 2 in product_category
3. Inconsistent categorical values: "Electronics" vs "electronics" (capitalization issue)

Let me run more detailed checks.
</scratchpad>

<function_call>execute_r_code(code="
# Check for negative ages
sum(customer_data$age < 0, na.rm=TRUE)

# Check exact categorical values
table(customer_data$product_category, useNA='always')

# Check for any zero purchase amounts
sum(customer_data$purchase_amount == 0, na.rm=TRUE)
")</function_call>

<function_result>
[1] 1

product_category
   Clothing Electronics electronics        <NA> 
         30          45           3           2 

[1] 2
</function_result>

<answer>
⚠️ CRITICAL DATA QUALITY ISSUES DETECTED ⚠️

I have identified several serious data quality problems that must be addressed before any analysis can be trusted:

**1. IMPOSSIBLE VALUES - CRITICAL:**
- 1 record has a negative age (-5 years old), which is physically impossible
- This likely indicates a data entry error or coding issue
- **Action required**: Investigate this record, correct or remove it

**2. MISSING DATA - HIGH PRIORITY:**
- 3 missing values (3%) in the age column
- 5 missing values (5%) in the purchase_amount column  
- 2 missing values (2%) in the product_category column
- **Action required**: Determine if missing data is random or systematic; consider imputation or removal strategies

**3. INCONSISTENT CATEGORICAL VALUES - HIGH PRIORITY:**
- The product_category column has "Electronics" (45 records) and "electronics" (3 records)
- This capitalization inconsistency will cause these to be treated as separate categories
- **Action required**: Standardize all category values to consistent capitalization (recommend: "Electronics", "Clothing")

**4. SUSPICIOUS ZERO VALUES:**
- 2 records have purchase_amount = 0
- **Action required**: Verify if these are legitimate (returns, cancelled orders) or data errors

**RECOMMENDATIONS:**
1. Clean the negative age value immediately
2. Standardize product_category capitalization using: `customer_data$product_category <- tolower(customer_data$product_category)`
3. Investigate the pattern of missing data
4. Verify zero-value purchases are legitimate

Only after addressing these issues should any analysis proceed. The current data quality is UNACCEPTABLE for reliable analysis.
</answer>
</example>

Begin your data quality investigation now.

More tips

  • Use Markdown headings and XML tags to give structure to your prompts.
  • Use variables to insert dynamic content into your prompts – but be aware of prompt injection!
Your task is to provide feedback on a research paper summary.
Here is a summary of a medical research paper:
<summary>
{{SUMMARY}}
</summary>

Here is the research paper:
<paper>
{{RESEARCH_PAPER}}
</paper>

Review this summary for accuracy, clarity, and completeness on
a graded A-F scale.
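In R, {ellmer}'s `interpolate()` fills `{{placeholder}}` variables in a template like the one above, looking them up in the calling environment. A minimal sketch (the summary text and file path are hypothetical):

```r
library(ellmer)

summary_text <- "The trial found a 12% reduction in symptoms ..."
paper_text <- paste(readLines("paper.txt"), collapse = "\n")

prompt <- interpolate(
  "Your task is to provide feedback on a research paper summary.
Here is a summary of a medical research paper:
<summary>
{{summary_text}}
</summary>

Here is the research paper:
<paper>
{{paper_text}}
</paper>

Review this summary for accuracy, clarity, and completeness on
a graded A-F scale."
)
```

Because the template values come from user-supplied content, remember the prompt-injection caveat: anything inside those tags is text the model will read as part of the prompt.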

More tips

Get large prompts out of the code and into separate files.

  • Easier to read (both locally and on GitHub)

  • Easier to read diffs in version control

  • We will do this in one of our exercises later
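One way to do this with {ellmer} is `interpolate_file()`, which reads a prompt from disk and fills any `{{placeholders}}` it contains. (A sketch; the file path is hypothetical.)

```r
library(ellmer)

# Long prompt lives in its own file, readable and diff-friendly
system_prompt <- interpolate_file("prompts/data-quality.md")

chat <- chat_anthropic(system_prompt = system_prompt)
```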

More tips

(Advanced) Force the model to say things out loud.

E.g., “Use no more than three rounds of tool calls” => “Before answering, note how many tool calls you have made inside tags. If you have made three, stop and answer.”

More tips

Anthropic’s Prompt Engineering Overview and the OpenAI Cookbook are excellent resources, with lots of tips and examples.

Google’s Prompt Design Strategies may also be useful.

⌨️ 14_quiz-game-1

Instructions

  1. Your job: teach the model to play a quiz game with you.

  2. The user picks a theme from a short list provided by the model.

  3. They then answer multiple choice questions on that theme.

  4. After each question, tell the user if they were right or wrong and why. Then go to the next question.

  5. After 5 questions, end the round and tell the user they won, regardless of their score. Then, start a new round.
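The shape of a solution: encode the game rules as a system prompt, then chat interactively. A minimal sketch assuming {ellmer}'s `live_console()`; the rule wording is just a starting point for you to refine.

```r
library(ellmer)

quiz_rules <- "You are a quiz game host.
1. Offer the user a short list of themes and let them pick one.
2. Ask multiple choice questions on that theme, one at a time.
3. After each answer, say whether it was right or wrong and why.
4. After 5 questions, end the round and tell the user they won,
   regardless of their score. Then start a new round."

chat <- chat_anthropic(system_prompt = quiz_rules)
live_console(chat)  # interactive chat in the R console
```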

12:00

⌨️ 15_coding-assistant

Instructions

  1. Use Claude Sonnet to write a function that gets the weather. The first time, use Claude on its own.

  2. Do some basic research for Claude about how to use a specific package to get the weather.

  3. How does Claude do with the same task now?

06:00

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG)


How do we find relevant documents?

Answer: word vector embeddings → turn words into vectors

🤴 - 🧔‍♂️ = 👑

👑 + 💁‍♀️ = 👸

🤴 - 🧔‍♂️ + 💁‍♀️ = 👸

OpenAI: text-embedding-3-small

embed_openai("dplyr::left_join")
#> [-0.0384574,  0.00796838,  0.04896307, ..., -0.01687562, 0.00051399,  0.01020856]
embed_openai("LEFT JOIN")
#> [-0.0114895,  0.01873610,  0.04436858, ...,  0.0055124, 0.01100459, -0.00588281]
embed_openai("suitcase")
#> [ 0.01323017, -0.00844115, -0.02530578, ..., -0.00054488, -0.0285338, -0.02933492]
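"Relevant" is measured by how close two embedding vectors point in the same direction, usually via cosine similarity. A toy illustration in base R, with made-up 3-dimensional vectors standing in for the 1,536-dimensional real ones:

```r
# Cosine similarity: near 1 = similar meaning, near 0 = unrelated
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up embeddings for illustration only
left_join_vec <- c(0.9, 0.1, 0.0)  # "dplyr::left_join"
sql_join_vec  <- c(0.8, 0.2, 0.1)  # "LEFT JOIN"
suitcase_vec  <- c(0.0, 0.1, 0.9)  # "suitcase"

cosine_sim(left_join_vec, sql_join_vec)  # high: related meanings
cosine_sim(left_join_vec, suitcase_vec)  # low: unrelated
```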

Two ways that users encounter RAG

  1. Every prompt you send is passed through the RAG system and augmented with retrieved context before the model sees it

  2. The LLM can decide when to call the RAG system

In R…

⌨️ 16_rag

Instructions

Follow the steps in the 16_rag exercise, which are roughly:

  1. Create a vector database from R for Data Science (R4DS).

  2. Test out the vector database with a simple query.

  3. Attach a retrieval tool to a chat client and try it in a Shiny app.
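Those steps can be sketched with {ragnar} roughly as follows. Function names follow {ragnar}'s store/retrieve/tool API; the store file name and query are hypothetical, and the chunking/insertion step is elided.

```r
library(ellmer)
library(ragnar)

# 1. Create a vector store backed by a local DuckDB file
store <- ragnar_store_create("r4ds.ragnar.duckdb", embed = embed_openai())
# ... read, chunk, and insert the R4DS pages into the store ...
ragnar_store_build_index(store)

# 2. Test retrieval with a simple query
ragnar_retrieve(store, "How do I join two data frames?")

# 3. Attach retrieval as a tool the model can call on demand
chat <- chat_anthropic()
ragnar_register_tool_retrieve(chat, store)
chat$chat("According to R4DS, when should I use left_join()?")
```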

15:00

Wrap-up

Recap

  • Prompt engineering is the art and science of crafting effective prompts to get the desired output from LLMs
  • Use the best models, clear system prompts, and examples to improve results
  • Prompt engineering is an iterative process that often requires experimentation and refinement
  • Many techniques and best practices exist to help you get the most out of LLMs
  • RAG improves LLM outputs by adding relevant context
  • RAG systems use vector embeddings to find relevant documents
  • You can implement RAG in R with {ragnar}

Acknowledgments