
March 18, 2025
Using a testing dataset has improved the accuracy and helpfulness of Beagle+, a GenAI chatbot that answers questions on everyday legal issues.
At People’s Law School, we developed a dataset of 42 legal questions to test our generative AI chatbot. We're now publishing the testing dataset to assist others in their GenAI journeys. Here, we document our experience in using the dataset to help ensure our chatbot responses were accurate and helpful.
Context
At People's Law School, we provide public legal education to British Columbians. Among our offerings is Beagle+, a chatbot that uses artificial intelligence to guide people to relevant, high-quality legal information drawn from our People's Law School websites. In 2024, we relaunched the chatbot as Beagle+, powered by GPT-4 and using retrieval-augmented generation (RAG).
Testing approach
As we developed Beagle+, we tested many configurations, featuring different GPT models and three iterations of our system prompt wording. We also experimented with adjusting the number and size of the content chunks retrieved through RAG, as well as (in one configuration) lowering the model temperature setting. A rough sketch of how one of these configurations might be wired up appears after the list below.
The published dataset includes seven testing configurations:
GPT-3.5-turbo + prompt v1 (September 28, 2023) used GPT-3.5-turbo-16k, a model temperature of 0.8, 10 content chunks of ~200 words, and the first iteration of our system prompt
GPT-4 + prompt v1 (October 5, 2023) used GPT-4, a model temperature of 0.8, three content chunks of a single web page each, and the first iteration of our system prompt
GPT-3.5-turbo + prompt v2 (October 27, 2023) used GPT-3.5-turbo, a model temperature of 0.8, five content chunks of ~200 words, and the second iteration of our system prompt (where we added “Think step-by-step”)
GPT-4 + prompt v2 (October 27, 2023) used GPT-4, a model temperature of 0.8, five content chunks of ~200 words, and the second iteration of our system prompt
GPT-4 + prompt v2 + < temp (November 3, 2023) used GPT-4, a model temperature of 0.6, five content chunks of ~200 words, and the second iteration of our system prompt
GPT-4 + prompt v3 (November 6, 2023) used GPT-4, a model temperature of 0.8, five content chunks of ~200 words, and the third iteration of our system prompt (where we added “If content is provided, link to it inline with your answer” and “Write at a grade 8 level”)
GPT-4-turbo + prompt v3 (November 9, 2023) used GPT-4-turbo, a model temperature of 0.8, five content chunks of ~200 words, and the third iteration of our system prompt
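For illustration, here is a minimal sketch of how one of these configurations might be wired up against OpenAI's chat completions API. The model names, temperatures, chunk counts, and prompt additions come from the list above; the retrieval function, the base prompt text, and everything else are our own placeholders, not the actual Beagle+ code.

```python
# A minimal sketch, not the actual Beagle+ code. Model names, temperatures,
# chunk counts, and prompt additions come from the list above; the retrieval
# function and base prompt text are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPTS = {
    # "…" stands in for the base prompt wording, which isn't reproduced here;
    # only the additions documented above are shown.
    "v1": "…",
    "v2": "… Think step-by-step.",
    "v3": "… If content is provided, link to it inline with your answer. Write at a grade 8 level.",
}

# Chunk sizes (~200 words vs. full web pages) also varied by configuration; that isn't modelled here.
CONFIGS = [
    {"name": "GPT-3.5-turbo + prompt v1", "model": "gpt-3.5-turbo-16k", "temperature": 0.8, "chunks": 10, "prompt": "v1"},
    {"name": "GPT-4 + prompt v1", "model": "gpt-4", "temperature": 0.8, "chunks": 3, "prompt": "v1"},
    {"name": "GPT-3.5-turbo + prompt v2", "model": "gpt-3.5-turbo", "temperature": 0.8, "chunks": 5, "prompt": "v2"},
    {"name": "GPT-4 + prompt v2", "model": "gpt-4", "temperature": 0.8, "chunks": 5, "prompt": "v2"},
    {"name": "GPT-4 + prompt v2 + < temp", "model": "gpt-4", "temperature": 0.6, "chunks": 5, "prompt": "v2"},
    {"name": "GPT-4 + prompt v3", "model": "gpt-4", "temperature": 0.8, "chunks": 5, "prompt": "v3"},
    {"name": "GPT-4-turbo + prompt v3", "model": "gpt-4-turbo", "temperature": 0.8, "chunks": 5, "prompt": "v3"},
]

def answer(question: str, config: dict, retrieve_chunks) -> str:
    """Retrieve content chunks for the question, then ask the model to answer from them."""
    chunks = retrieve_chunks(question, k=config["chunks"])  # RAG retrieval step (placeholder)
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model=config["model"],
        temperature=config["temperature"],
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS[config["prompt"]]},
            {"role": "user", "content": f"Content:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```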
We tested these Beagle+ configurations using a dataset of 42 test questions. (Why 42? The answer might be found here.) These were questions from real people, asked in Beagle 1.0 conversations or through other People’s Law School channels. The questions were challenging ones (not softball questions, which GenAI handles well pretty much every time). The questions fell into five buckets:
8 nuanced, high-risk questions that are answered very well on our websites
8 nuanced questions that are not answered at all on our websites
9 nuanced questions that aren't answered on our websites, but that concern a topic that *is* covered on our websites
5 questions that require up-to-date knowledge of the law
12 questions sourced from recent Beagle 1.0 conversations, rounding out the topic mix to more fully reflect both common problems and the diversity of problems
For each question, we developed an ideal response, as well as key points that an ideal response would include.
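For anyone assembling a similar dataset, here is a rough sketch of how a single test question could be structured. The five bucket labels mirror the categories above; the field names are our own suggestions, not the column names in the published dataset.

```python
# A rough sketch of one test-question record, assuming a simple dict/dataclass layout.
# Field names are illustrative; the published dataset may use different ones.
from dataclasses import dataclass, field

BUCKETS = [
    "nuanced, high-risk; answered very well on our websites",
    "nuanced; not answered at all on our websites",
    "nuanced; not answered, but the topic is covered on our websites",
    "requires up-to-date knowledge of the law",
    "sourced from recent Beagle 1.0 conversations",
]

@dataclass
class TestQuestion:
    question_id: int
    question: str                 # the question as a real person asked it
    bucket: str                   # one of BUCKETS
    ideal_response: str           # the response our reviewers would want to see
    key_points: list[str] = field(default_factory=list)  # points an ideal response should include
```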
Here’s an example of an ideal response:

During our testing journey, our team of three reviewers (all lawyers) assessed every response from Beagle+ across two dimensions:
Safety: We assessed the response as safe, unsafe, or very unsafe. We were looking at whether the response was legally accurate on points that affect someone's legal rights or the steps they might take. If a detail was a bit off, like the name of an agency, that alone didn't make the response unsafe.
Value: We assessed the response as very valuable, valuable, or not valuable. Here, we were looking at the tone, helpfulness, and language used to empower the user to take a next step in resolving or avoiding a legal issue.
Our reviewers added comments explaining their assessments.
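To make the two dimensions concrete, here is one way a single assessment might be recorded. The rating scales come directly from the descriptions above; the field and enum names are our own illustration, not part of the published dataset.

```python
# One way to record a single review, assuming simple enums for the two scales.
# Field and enum names are illustrative only.
from dataclasses import dataclass
from enum import Enum

class Safety(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"
    VERY_UNSAFE = "very unsafe"

class Value(Enum):
    VERY_VALUABLE = "very valuable"
    VALUABLE = "valuable"
    NOT_VALUABLE = "not valuable"

@dataclass
class Review:
    question_id: int
    configuration: str    # e.g. "GPT-4 + prompt v3"
    reviewer: str
    safety: Safety        # legally accurate on points affecting rights or next steps?
    value: Value          # tone, helpfulness, language that empowers a next step?
    comment: str = ""     # the reviewer's explanation of the assessment
```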
Here’s an example response with reviewer comments from early in the testing journey, again featuring the same question as above:

Here’s another example response with reviewer comments from later in the testing journey, featuring the same question:

Testing results
The responses to the 42 test questions for each of the seven configurations are in the published dataset. Here are the totals:
Testing configuration | Safe | Unsafe | Very unsafe | Very valuable | Valuable | Not valuable |
---|---|---|---|---|---|---|
Ideal response | 42 | 0 | 0 | 42 | 0 | 0 |
GPT-3.5-turbo + prompt v1 | 35 | 6 | 1 | 1 | 30 | 11 |
GPT-4 + prompt v1 | 37 | 5 | 0 | 0 | 30 | 12 |
GPT-3.5-turbo + prompt v2 | 35 | 6 | 1 | 1 | 33 | 8 |
GPT-4 + prompt v2 | 41 | 0 | 1 | 1 | 31 | 10 |
GPT-4 + prompt v2 + < temp | 39 | 1 | 2 | 3 | 26 | 13 |
GPT-4 + prompt v3 | 40 | 2 | 0 | 10 | 26 | 6 |
GPT-4-turbo + prompt v3 | 41 | 0 | 1 | 25 | 14 | 3 |
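These totals can be reproduced from per-question assessments with a simple tally. Here is a minimal sketch, assuming one consolidated (configuration, safety, value) record per question; that record format is our assumption, not necessarily how the published dataset is laid out.

```python
# A minimal tally from per-question assessments to the totals in the table above.
from collections import Counter

def totals_by_configuration(assessments):
    """assessments: iterable of (configuration, safety_rating, value_rating) tuples."""
    safety, value = Counter(), Counter()
    for config, safety_rating, value_rating in assessments:
        safety[(config, safety_rating)] += 1
        value[(config, value_rating)] += 1
    return safety, value

# With the published dataset loaded into that shape, we'd expect, for example:
# safety[("GPT-4-turbo + prompt v3", "safe")] == 41
# value[("GPT-4-turbo + prompt v3", "very valuable")] == 25
```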
As we progressed through the testing configurations, the responses became more consistently safe.

As well, as we progressed through the testing configurations, the responses became more consistently valuable.

With its variety of question types, assessed across the dimensions of safety (are the responses legally accurate?) and value (are they actually helpful to a user?), the dataset offers a starting point for anyone looking to test a GenAI chatbot that provides legal help. It also helps our team at People's Law School continually refine Beagle+ so it provides accurate and helpful answers to those looking to resolve or avoid everyday legal problems in British Columbia.