Datasets For Education

July 30, 2024

Datasets are kind of like the bricks of an AI model. They provide shape, form and structure. Put simply, they are examples of what you want AI conversations to look like.

Datasets for Education is a new, open initiative (everything produced will be released under the MIT open-source license) to:

  1. Learn and grow our understanding of how AI systems work by creating one of their constituent parts (and, of course, training lots of models)
  2. Critically explore the conversational and pedagogical techniques employed by default in different language models
  3. Evaluate what effective pedagogy could look like through LLMs
  4. Build automated “data synthesis pipelines” and generate datasets

This isn’t a task that requires extensive technical skills. Rather, the ideal participant is someone who:

  • Is a passionate educator
  • Has extensive experience using LLMs (ChatGPT is enough)
  • Has found the output from LLMs occasionally unsatisfactory given certain contexts

If that sounds like you, read on.

What can I do to help?

There are two main goals to this initiative.

  1. To create datasets that reflect better pedagogical techniques than the existing commercial models
  2. To critically evaluate what “better pedagogy” looks like in the context of LLMs

To be honest, goal 2 is the more important of the two.

I hope this becomes a forum in which we critically discuss the different contexts in which students might be using AI systems and, given those contexts, how the AI should act.

Finetuning and shaping AI models is not a simple undertaking, but it is one that (I believe) requires the input of educators. My hope is to kickstart that process.

So how can you help?

First of all, engage in the discussion. Think, share and brainstorm what your “ideal” AI tutor would sound like. What questions should it ask? What tone should it use? When should it do rewrites and edits? When should it make suggestions? How should it scaffold?

Second, choose a specific scenario, and create a few examples of how your ideal AI tutor would work with a learner.

Third (and I’ll provide a lot of support with this step), set up a data workflow that can take some input (like a standard, a topic, content, etc.) and generate output.

Fourth, scan some of the output and ensure it looks good.
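The workflow in steps three and four can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: `generate_example` is a hypothetical stand-in for a real LLM API call, and the field names are just one common convention. The output format, JSONL (one JSON object per line), is what most dataset tooling on HuggingFace expects.

```python
import json

# Hypothetical stand-in for an LLM call. A real pipeline would send the
# prompt to a model and collect its reply; here we fabricate a placeholder.
def generate_example(standard: str, topic: str) -> dict:
    prompt = f"As a tutor, help a student with {topic} (standard: {standard})."
    response = f"Let's start with what you already know about {topic}."
    return {"conversations": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]}

def build_dataset(inputs: list[tuple[str, str]], path: str) -> None:
    # Write one JSON object per line (JSONL).
    with open(path, "w") as f:
        for standard, topic in inputs:
            f.write(json.dumps(generate_example(standard, topic)) + "\n")

# Step three: generate output from structured inputs.
build_dataset([("CCSS.MATH.7.EE", "solving two-step equations")], "sft_data.jsonl")
```

Step four, scanning the output, is then just a matter of reading the JSONL file back and spot-checking that each conversation sounds like the tutor you want.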

Building our understanding:

In the next few sections I provide a very surface-level introduction to datasets, AI model training, and more.

What is a dataset?

A dataset is the content that is used to train an AI model.

There are two kinds of datasets we are going to concern ourselves with.

  1. Exemplar Datasets – These datasets are used for Supervised Fine-Tuning (SFT). They show the AI system how you would like it to respond. The Open Assistant Guanaco dataset is a good example of this.
  2. Good/Bad Example Datasets – These datasets are used for Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO) and other training strategies. They show the AI “good” and “bad” response examples. Intel’s Orca DPO pairs are a good example of this.
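To make the two kinds concrete, here are miniature records of each. These are illustrative examples I made up for this post; the field names (`conversations`, `prompt`, `chosen`, `rejected`) follow common HuggingFace conventions, but exact schemas vary from dataset to dataset.

```python
import json

# 1. Exemplar (SFT) record: a single "ideal" tutoring exchange.
sft_record = {
    "conversations": [
        {"role": "user",
         "content": "I don't get why 3(x + 2) = 15 means x = 3."},
        {"role": "assistant",
         "content": "What do you think happens if we divide both sides by 3 first?"},
    ]
}

# 2. Good/Bad (DPO) record: the same prompt with a preferred response
#    (Socratic, scaffolded) and a rejected one (just gives the answer).
dpo_record = {
    "prompt": "I don't get why 3(x + 2) = 15 means x = 3.",
    "chosen": "Let's work through it together. What could we do to both sides to simplify?",
    "rejected": "Divide by 3 to get x + 2 = 5, then subtract 2, so x = 3.",
}

print(json.dumps(sft_record))
print(json.dumps(dpo_record))
```

Notice that the DPO "rejected" answer isn't factually wrong; it's pedagogically weaker, which is exactly the kind of preference this initiative wants to capture.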

What kind of training will we do?

We are going to build datasets and use these datasets to train LoRA models.

Very simply, LoRA (Low-Rank Adaptation) models are a small collection of weights (less than 1/100th the size of the full AI model) that can be trained at a much lower cost, yet can still have a massive effect on the behaviour of the base model (the model we are finetuning).
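The "less than 1/100th the size" claim is easy to sanity-check with back-of-the-envelope arithmetic. LoRA replaces the update to a weight matrix W with a product of two thin matrices, B @ A, of rank r. The numbers below are illustrative, not from any particular model:

```python
# Assume a d x d weight matrix in the base model and a LoRA rank of r.
d = 4096   # hidden size typical of a ~7B-parameter model's layers
r = 8      # a common LoRA rank

full_params = d * d       # parameters in the original weight matrix
lora_params = 2 * d * r   # A is (r x d) and B is (d x r)

ratio = lora_params / full_params
print(f"LoRA trains {ratio:.2%} of this layer's weights")  # 0.39%
```

At rank 8 on a 4096-wide layer, the LoRA weights are roughly 1/256th of the original, comfortably under the 1/100th figure; higher ranks trade size for capacity.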

Who can use the models?

The datasets created will be uploaded to HuggingFace and freely available for anyone to use.

Any models I create, I also plan to upload to HuggingFace, which means they will be downloadable and runnable locally.
