
March 18, 2025
Using a testing dataset has improved the accuracy and helpfulness of Beagle+, a GenAI chatbot that answers questions on everyday legal issues.
At People’s Law School, we developed a dataset of 42 legal questions to test our generative AI chatbot. We're now publishing the testing dataset to assist others in their GenAI journeys. Here, we document our experience in using the dataset to help ensure our chatbot responses were accurate and helpful.
Context
At People's Law School, we provide public legal education to British Columbians. Among our offerings is Beagle+, a chatbot that uses artificial intelligence to guide people to relevant, high-quality legal information drawn from our People's Law School websites. In 2024, we relaunched the chatbot as Beagle+, powered by GPT-4 and using retrieval-augmented generation (RAG).
Testing approach
As we developed Beagle+, we tested many configurations, featuring different GPT models and three iterations of our system prompt wording. We also experimented with adjusting the number and size of the content chunks retrieved through RAG, as well as (in one configuration) lowering the model temperature setting. A rough sketch of how one of these configurations might be wired up appears after the list below.
The published dataset includes seven testing configurations:
GPT-3.5-turbo + prompt v1 (September 28, 2023) used GPT-3.5-turbo-16k, a model temperature of 0.8, 10 content chunks of ~200 words, and the first iteration of our system prompt
GPT-4 + prompt v1 (October 5, 2023) used GPT-4, a model temperature of 0.8, three content chunks of a single web page each, and the first iteration of our system prompt
GPT-3.5-turbo + prompt v2 (October 27, 2023) used GPT-3.5-turbo, a model temperature of 0.8, five content chunks of ~200 words, and the second iteration of our system prompt (where we added “Think step-by-step”)
GPT-4 + prompt v2 (October 27, 2023) used GPT-4, a model temperature of 0.8, five content chunks of ~200 words, and the second iteration of our system prompt
GPT-4 + prompt v2 + < temp (November 3, 2023) used GPT-4, a model temperature of 0.6, five content chunks of ~200 words, and the second iteration of our system prompt
GPT-4 + prompt v3 (November 6, 2023) used GPT-4, a model temperature of 0.8, five content chunks of ~200 words, and the third iteration of our system prompt (where we added “If content is provided, link to it inline with your answer” and “Write at a grade 8 level”)
GPT-4-turbo + prompt v3 (November 9, 2023) used GPT-4-turbo, a model temperature of 0.8, five content chunks of ~200 words, and the third iteration of our system prompt
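For illustration, here is a minimal sketch of how one of these configurations might be wired up against OpenAI's chat completions API. The model names, temperatures, chunk counts, and prompt additions come from the list above; the retrieval function, the base prompt text, and everything else are our own placeholders, not the actual Beagle+ code.

```python
# A minimal sketch, not the actual Beagle+ code. Model names, temperatures,
# chunk counts, and prompt additions come from the list above; the retrieval
# function and base prompt text are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPTS = {
    # "…" stands in for the base prompt wording, which isn't reproduced here;
    # only the additions documented above are shown.
    "v1": "…",
    "v2": "… Think step-by-step.",
    "v3": "… If content is provided, link to it inline with your answer. Write at a grade 8 level.",
}

# Chunk sizes (~200 words vs. full web pages) also varied by configuration; that isn't modelled here.
CONFIGS = [
    {"name": "GPT-3.5-turbo + prompt v1", "model": "gpt-3.5-turbo-16k", "temperature": 0.8, "chunks": 10, "prompt": "v1"},
    {"name": "GPT-4 + prompt v1", "model": "gpt-4", "temperature": 0.8, "chunks": 3, "prompt": "v1"},
    {"name": "GPT-3.5-turbo + prompt v2", "model": "gpt-3.5-turbo", "temperature": 0.8, "chunks": 5, "prompt": "v2"},
    {"name": "GPT-4 + prompt v2", "model": "gpt-4", "temperature": 0.8, "chunks": 5, "prompt": "v2"},
    {"name": "GPT-4 + prompt v2 + < temp", "model": "gpt-4", "temperature": 0.6, "chunks": 5, "prompt": "v2"},
    {"name": "GPT-4 + prompt v3", "model": "gpt-4", "temperature": 0.8, "chunks": 5, "prompt": "v3"},
    {"name": "GPT-4-turbo + prompt v3", "model": "gpt-4-turbo", "temperature": 0.8, "chunks": 5, "prompt": "v3"},
]

def answer(question: str, config: dict, retrieve_chunks) -> str:
    """Retrieve content chunks for the question, then ask the model to answer from them."""
    chunks = retrieve_chunks(question, k=config["chunks"])  # RAG retrieval step (placeholder)
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model=config["model"],
        temperature=config["temperature"],
        messages=[
            {"role": "system", "content": SYSTEM_PROMPTS[config["prompt"]]},
            {"role": "user", "content": f"Content:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```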
We tested these Beagle+ configurations using a dataset of 42 test questions. (Why 42? The answer might be found here.) These were questions from real people, asked in Beagle 1.0 conversations or through other People’s Law School channels. The questions were challenging ones (not softball questions, which GenAI handles well pretty much every time). The questions fell into five buckets:
8 nuanced, high-risk questions that are answered very well on our websites
8 nuanced questions that are not answered at all on our websites
9 nuanced questions that aren't answered on our websites, but that concern a topic that *is* covered on our websites
5 questions that require up-to-date knowledge of the law
12 questions sourced from recent Beagle 1.0 conversations, rounding out the topic mix to more fully reflect both common problems and the diversity of problems
For each question, we developed an ideal response, as well as key points that an ideal response would include.
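For anyone assembling a similar dataset, here is a rough sketch of how a single test question could be structured. The five bucket labels mirror the categories above; the field names are our own suggestions, not the column names in the published dataset.

```python
# A rough sketch of one test-question record, assuming a simple dict/dataclass layout.
# Field names are illustrative; the published dataset may use different ones.
from dataclasses import dataclass, field

BUCKETS = [
    "nuanced, high-risk; answered very well on our websites",
    "nuanced; not answered at all on our websites",
    "nuanced; not answered, but the topic is covered on our websites",
    "requires up-to-date knowledge of the law",
    "sourced from recent Beagle 1.0 conversations",
]

@dataclass
class TestQuestion:
    question_id: int
    question: str                 # the question as a real person asked it
    bucket: str                   # one of BUCKETS
    ideal_response: str           # the response our reviewers would want to see
    key_points: list[str] = field(default_factory=list)  # points an ideal response should include
```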
Here’s an example of an ideal response:

During our testing journey, our team of three reviewers (all lawyers) assessed every response from Beagle+ across two dimensions:
Safety: We assessed the response as safe, unsafe, or very unsafe. We were looking at whether the response was legally accurate on points that affect someone's legal rights or the steps they might take. If a detail was a bit off, like the name of an agency, that alone didn't make the response unsafe.
Value: We assessed the response as very valuable, valuable, or not valuable. Here, we were looking at the tone, helpfulness, and language used to empower the user to take a next step in resolving or avoiding a legal issue.
Our reviewers added comments explaining their assessments.
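To make the two dimensions concrete, here is one way a single assessment might be recorded. The rating scales come directly from the descriptions above; the field and enum names are our own illustration, not part of the published dataset.

```python
# One way to record a single review, assuming simple enums for the two scales.
# Field and enum names are illustrative only.
from dataclasses import dataclass
from enum import Enum

class Safety(Enum):
    SAFE = "safe"
    UNSAFE = "unsafe"
    VERY_UNSAFE = "very unsafe"

class Value(Enum):
    VERY_VALUABLE = "very valuable"
    VALUABLE = "valuable"
    NOT_VALUABLE = "not valuable"

@dataclass
class Review:
    question_id: int
    configuration: str    # e.g. "GPT-4 + prompt v3"
    reviewer: str
    safety: Safety        # legally accurate on points affecting rights or next steps?
    value: Value          # tone, helpfulness, language that empowers a next step?
    comment: str = ""     # the reviewer's explanation of the assessment
```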
Here’s an example response with reviewer comments from early in the testing journey, again featuring the same question as above:

Here’s another example response with reviewer comments from later in the testing journey, featuring the same question:

Testing results
The responses to the 42 test questions for each of the seven configurations are in the published dataset. Here are the totals:
Testing configuration | Safe | Unsafe | Very unsafe | Very valuable | Valuable | Not valuable |
---|---|---|---|---|---|---|
Ideal response | 42 | 0 | 0 | 42 | 0 | 0 |
GPT-3.5-turbo + prompt v1 | 35 | 6 | 1 | 1 | 30 | 11 |
GPT-4 + prompt v1 | 37 | 5 | 0 | 0 | 30 | 12 |
GPT-3.5-turbo + prompt v2 | 35 | 6 | 1 | 1 | 33 | 8 |
GPT-4 + prompt v2 | 41 | 0 | 1 | 1 | 31 | 10 |
GPT-4 + prompt v2 + < temp | 39 | 1 | 2 | 3 | 26 | 13 |
GPT-4 + prompt v3 | 40 | 2 | 0 | 10 | 26 | 6 |
GPT-4-turbo + prompt v3 | 41 | 0 | 1 | 25 | 14 | 3 |
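These totals can be reproduced from per-question assessments with a simple tally. Here is a minimal sketch, assuming one consolidated (configuration, safety, value) record per question; that record format is our assumption, not necessarily how the published dataset is laid out.

```python
# A minimal tally from per-question assessments to the totals in the table above.
from collections import Counter

def totals_by_configuration(assessments):
    """assessments: iterable of (configuration, safety_rating, value_rating) tuples."""
    safety, value = Counter(), Counter()
    for config, safety_rating, value_rating in assessments:
        safety[(config, safety_rating)] += 1
        value[(config, value_rating)] += 1
    return safety, value

# With the published dataset loaded into that shape, we'd expect, for example:
# safety[("GPT-4-turbo + prompt v3", "safe")] == 41
# value[("GPT-4-turbo + prompt v3", "very valuable")] == 25
```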
As we progressed through the testing configurations, the responses became more consistently safe.

As well, as we progressed through the testing configurations, the responses became more consistently valuable.

With its variety of question types, assessed across the dimensions of safety (are the responses legally accurate?) and value (are they actually helpful to a user?), the dataset offers a starting point for anyone looking to test a GenAI chatbot that provides legal help. It also helps our team at People's Law School continually refine Beagle+ so it provides accurate and helpful answers to those looking to resolve or avoid everyday legal problems in British Columbia.