How to test a GenAI chatbot

March 18, 2025

At People’s Law School, we developed a dataset of 42 legal questions to test our generative AI chatbot. We're now publishing the testing dataset to assist others on their GenAI journeys. Here, we document our experience using the dataset to help ensure our chatbot's responses were accurate and helpful.

Context

At People's Law School, we provide public legal education to British Columbians. Among our offerings is Beagle+, a chatbot that uses artificial intelligence to guide people to relevant, high-quality legal information drawn from our People’s Law School websites. In 2024, we relaunched the chatbot as Beagle+, powered by GPT-4 and using retrieval-augmented generation (RAG).

Testing approach

As we developed Beagle+, we tested many configurations, featuring different GPT models and three iterations of our system prompt wording. We also experimented with the number and size of the content chunks retrieved for RAG, as well as (in one configuration) lowering the model temperature setting.

The published dataset includes seven testing configurations:

  • GPT-3.5-turbo + prompt v1 (September 28, 2023) used GPT-3.5-turbo-16k, a model temperature of 0.8, 10 content chunks of ~200 words, and the first iteration of our system prompt

  • GPT-4 + prompt v1 (October 5, 2023) used GPT-4, a model temperature of 0.8, three content chunks of a single web page each, and the first iteration of our system prompt

  • GPT-3.5-turbo + prompt v2 (October 27, 2023) used GPT-3.5-turbo, a model temperature of 0.8, five content chunks of ~200 words, and the second iteration of our system prompt (where we added “Think step-by-step”)

  • GPT-4 + prompt v2 (October 27, 2023) used GPT-4, a model temperature of 0.8, five content chunks of ~200 words, and the second iteration of our system prompt 

  • GPT-4 + prompt v2 + < temp (November 3, 2023) used GPT-4, a model temperature of 0.6, five content chunks of ~200 words, and the second iteration of our system prompt 

  • GPT-4 + prompt v3 (November 6, 2023) used GPT-4, a model temperature of 0.8, five content chunks of ~200 words, and the third iteration of our system prompt (where we added “If content is provided, link to it inline with your answer” and “Write at a grade 8 level”)

  • GPT-4-turbo + prompt v3 (November 9, 2023) used GPT-4-turbo, a model temperature of 0.8, five content chunks of ~200 words, and the third iteration of our system prompt 
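To make the moving parts concrete, here is a minimal sketch of how one such configuration assembles a chat request: system prompt, retrieved RAG chunks as context, then the user's question. This is a hypothetical reconstruction for illustration only, not our production implementation; the prompt wording, chunk contents, and question below are placeholders.

```python
# Hypothetical sketch of one Beagle+ test configuration.
# The prompt text, chunks, and question are placeholders.

def build_request(system_prompt, chunks, question,
                  model="gpt-4", temperature=0.8):
    """Assemble a chat-completion request: system prompt,
    retrieved RAG chunks as context, then the user's question."""
    context = "\n\n".join(chunks)  # e.g. five chunks of ~200 words each
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Relevant content:\n{context}"},
        {"role": "user", "content": question},
    ]
    return {"model": model, "temperature": temperature, "messages": messages}

request = build_request(
    system_prompt=("You are Beagle+. Think step-by-step. "
                   "Write at a grade 8 level."),
    chunks=["Placeholder chunk about powers of attorney...",
            "Placeholder chunk about revoking a power of attorney..."],
    question="What can I do if an attorney is misusing their power?",
)
```

The resulting dictionary is what would be sent to the model's chat API; swapping the model name, temperature, chunk count, or prompt text reproduces the different configurations listed above.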

We tested these Beagle+ configurations using a dataset of 42 test questions. (Why 42? Fans of The Hitchhiker's Guide to the Galaxy will recognize the number.) These were questions from real people, asked in Beagle 1.0 conversations or through other People’s Law School channels. We chose challenging questions, not the softball kind that GenAI handles well pretty much every time. The questions fell into five buckets:

  • 8 nuanced, high-risk questions that are answered very well on our websites 

  • 8 nuanced questions that are not answered at all on our websites 

  • 9 nuanced questions that aren't answered on our websites, but that concern a topic that *is* covered on our websites 

  • 5 questions that require up-to-date knowledge of the law

  • 12 questions sourced from recent Beagle 1.0 conversations, rounding out the topic mix to reflect both common problems and the broader diversity of problems people bring to us

For each question, we developed an ideal response, as well as key points that an ideal response would include.
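One simple way to represent a test item like this is a small record type. The field names and sample values below are our own illustration; the published dataset's actual column names and content may differ.

```python
from dataclasses import dataclass, field

@dataclass
class TestQuestion:
    """One item in a 42-question test dataset (illustrative field names)."""
    question: str
    bucket: str                 # which of the five question buckets it belongs to
    ideal_response: str         # the full ideal response developed for the question
    key_points: list = field(default_factory=list)  # points an ideal response must cover

# A hypothetical example item (placeholder text, not from the dataset)
item = TestQuestion(
    question="Someone with power of attorney is misusing my mother's money. What can she do?",
    bucket="nuanced, high-risk, answered very well on our websites",
    ideal_response="Placeholder ideal response text...",
    key_points=["revoking the power of attorney",
                "where to report suspected abuse"],
)
```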

Here’s an example of an ideal response:

Example of ideal response to test question on power of attorney abuse

During our testing journey, our team of three reviewers (all lawyers) assessed every response from Beagle+ across two dimensions:

  • Safety: We assessed the response as safe, unsafe, or very unsafe. We were looking at whether the response was legally accurate on points that affect someone’s legal rights or steps they might take. If a detail was a bit off, like the name of an agency, that didn’t qualify the response as unsafe.

  • Value: We assessed the response as very valuable, valuable, or not valuable. Here, we were looking at the tone, helpfulness, and language used to empower the user to take a next step in resolving or avoiding a legal issue. 

Our reviewers added comments explaining their assessments.
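With every response labelled on both dimensions, totalling up a configuration's results is a simple count. A sketch, using made-up labels rather than our actual review data:

```python
from collections import Counter

# Hypothetical reviewer labels for one configuration (not real review data)
safety_labels = ["safe", "safe", "unsafe", "safe", "very unsafe", "safe"]
value_labels = ["valuable", "very valuable", "not valuable",
                "valuable", "not valuable", "valuable"]

# Tally how many responses received each label
safety_totals = Counter(safety_labels)
value_totals = Counter(value_labels)
```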

Here’s an example response with reviewer comments from early in the testing journey, again featuring the same question as above:

Example 1 of test response to question on power of attorney abuse

Here’s another example response with reviewer comments from later in the testing journey, featuring the same question:

Example 2 of test response to question on power of attorney abuse

Testing results

The responses to the 42 test questions for each of the seven configurations are in the published dataset. Here are the totals:

| Testing configuration | Safe | Unsafe | Very unsafe | Very valuable | Valuable | Not valuable |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Ideal response | 42 | 0 | 0 | 42 | 0 | 0 |
| GPT-3.5-turbo + prompt v1 | 35 | 6 | 1 | 1 | 30 | 11 |
| GPT-4 + prompt v1 | 37 | 5 | 0 | 0 | 30 | 12 |
| GPT-3.5-turbo + prompt v2 | 35 | 6 | 1 | 1 | 33 | 8 |
| GPT-4 + prompt v2 | 41 | 0 | 1 | 1 | 31 | 10 |
| GPT-4 + prompt v2 + < temp | 39 | 1 | 2 | 3 | 26 | 13 |
| GPT-4 + prompt v3 | 40 | 2 | 0 | 10 | 26 | 6 |
| GPT-4-turbo + prompt v3 | 41 | 0 | 1 | 25 | 14 | 3 |

As we progressed through the testing configurations, the responses became more consistently safe.

Chart showing Beagle+ testing for safety

As well, as we progressed through the testing configurations, the responses became more consistently valuable.

Chart showing Beagle+ testing for value
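The trend in both charts can be checked directly from the published totals. The numbers below are copied from the results above; the rate calculation is just the count divided by 42.

```python
# "Safe" and "very valuable" counts (out of 42) per configuration,
# copied from the published totals above
safe = {
    "GPT-3.5-turbo + prompt v1": 35,
    "GPT-4 + prompt v1": 37,
    "GPT-3.5-turbo + prompt v2": 35,
    "GPT-4 + prompt v2": 41,
    "GPT-4 + prompt v2 + < temp": 39,
    "GPT-4 + prompt v3": 40,
    "GPT-4-turbo + prompt v3": 41,
}
very_valuable = {
    "GPT-3.5-turbo + prompt v1": 1,
    "GPT-4-turbo + prompt v3": 25,
}

first_safe_rate = safe["GPT-3.5-turbo + prompt v1"] / 42   # ~83% safe
final_safe_rate = safe["GPT-4-turbo + prompt v3"] / 42     # ~98% safe
```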

The result is a dataset of varied question types, assessed across the dimensions of safety (are the responses legally accurate?) and value (are they actually helpful to a user?), for anyone looking to test a GenAI chatbot that provides legal help. It also supports our team at People’s Law School in constantly, and consistently, refining Beagle+ to provide accurate and helpful answers to those looking to resolve or avoid everyday legal problems in British Columbia.

