How to Generate Multiple-Choice Questions with AI

Nuria Lopez

Niya Bond

Video transcript
– Hi, everyone. I’m Niya Bond, a Faculty Developer here at OneHE. And I’m thrilled to be joining you today with Nuria Lopez, who’s going to be teaching us about using Copilot to generate multiple-choice questions and much more. Nuria is a Learning Consultant at the Teaching and Learning Unit of the Copenhagen Business School in Denmark. Prior to that, Nuria had a 20-year teaching career in higher education in languages and academic writing, and she’s primarily involved in faculty development, prioritizing the implementation of evidence-based practices. And today, we’re gonna talk about how to implement AI-based practices. Thank you so much for joining us today, Nuria.
– Of course, thank you.
– I’m so excited. I haven’t used Copilot myself yet, so I’m really excited to learn about this new tool and how to engage with it to develop multiple-choice questions.
– Yeah, so that’s a good point to begin with I think. I’m going to be using Copilot just because this is the tool that we have the license for at my institution, but everything that we are going to be discussing and going through would be the same if you are using, for example, ChatGPT, which might be more common. So, there are not major changes there.
– Wonderful, thank you. I think that’s also an important point you just made: use what your institution has, right? Become familiar with it, but also there are transferable skills, like you just said, that can go to other technologies.
– Exactly, yes.
– All right, well, I’ll turn it over to you to kind of walk us through the magic that you make in this program.
– Okay. So, this is the Copilot interface. And maybe the first thing that we need to keep in mind is that with Copilot, we have three conversation styles that we can use to interact with the tool: More Creative, More Balanced, and More Precise. According to Microsoft, the More Creative style would produce more imaginative answers, More Precise would be more concise, and More Balanced would be something in the middle. I’m going to be using the More Precise conversation style, but my recommendation is that people should try all three of them with the same prompt, check what the output is, and see what works best for them. Because in my experience, even the difference between the conversation styles is not consistent. Sometimes you might use one prompt and check the three styles, and you might get good output from each of them, or from two of them, so maybe you can mix and match the output a little bit. But my personal preference is More Precise, so that is the one that I’m going to be using. And the other thing that I would like to say before I put in the first prompt is that, in my experience, it’s best to use one prompt for creating the questions in the first place. And once the questions are the type of questions that we really want, high-quality multiple-choice questions, then we ask the tool to generate the feedback for both the correct answer and the incorrect answers. I think that if we use just one prompt to ask the tool to generate questions and feedback, we end up with a lot of output that we will need to revise and refine anyway, and then it becomes a little bit complicated to manage and to engage the tool in the further improvements that we want to ask for. So-
– I so appreciate that point, thank you. If I understand it, it’s not typically a one-and-done with this technology. Correct? Yeah.
– No, no, exactly. And that is something very important to keep in mind: this is an ongoing conversation between us humans and the tool, and our human intervention is going to be really, really essential to end up with good multiple-choice questions, as I think we will see in the demonstration. So, I have one prompt ready that I’m going to copy and paste. This prompt has some of the features that are generally recommended for prompting, like asking the tool to act as a person or take on a role. So, I have said, “You are teaching a university course on education,” so I’m giving the tool the role of a teacher. It is also quite specific about the format that I’m looking for. So, I mentioned, “Create two multiple-choice questions for students in this course.” I also tell the tool that I want three alternatives, so three possible answers: one correct and two incorrect answers. And I also want the output to indicate which one is the correct answer. And then, the final part of the prompt is where I have included research-based guidelines for writing high-quality multiple-choice questions. So, there is a short paragraph with guidelines about how to write the stems of the questions. And then, the last paragraph contains guidelines to write the alternatives. The alternatives are both the right answer and the incorrect answers. All those guidelines come from research in assessment. That is what I think can help us end up with output that aligns with research recommendations for multiple-choice questions. So, let’s see what we get. I always begin by asking for two or three questions instead of, for example, 10. And the reason is that, as I said before, it’s quite likely that we will need to refine the output. So, it’s a way of controlling the amount of output that we have to deal with. It’s easier if we ask for two questions, work with those, and then continue and ask for two new questions, and so on and so forth.
– [Niya] Okay.
– So, it has come up with question 1 and question 2, and they are about online learning environments, which is what I had asked for. I said that I wanted questions about student engagement in online learning environments. And the questions are phrased as a scenario, using, for example, “Mr. Smith notices that some students are not actively participating in the discussion forums,” and that is something that I had also specified in the prompt. I forgot to say that the prompt that I suggest in the resource can be easily adapted. So, if you want to add, for example, more information about the learning outcomes, or you want to provide the tool with a sample multiple-choice question, you can also include that. So, I think that we can use four main criteria to evaluate the output that we get here, and I think that is an essential step. Generative AI tools do produce output that sometimes is incorrect or not high quality, and this is where we have to use our human knowledge and human intervention just to make sure that we end up with good multiple-choice questions. In the resource, I suggest four criteria. The first is relevance: just to make sure that the questions are exactly about the topics that you wanted them to be about, that they have the right level of complexity, and that they focus on the learning outcomes that you wanted them to focus on. So, that is one thing. The other thing is to check for accuracy. So, just to make sure that there is no incorrect information, and always to make sure that there is one correct answer, if that is what you have indicated in the prompt, of course. Then, we need to check for plausibility. And this is an essential step, because one of the features of effective multiple-choice questions is that all the alternatives, all the possible answers, are plausible. That is what is really going to make the students put effort into discriminating one answer from the others. So, we need to check that all the answers are plausible and that we don’t have any answer that is so obviously incorrect that you can disregard it very quickly. And finally, we would check for clarity, just to make sure that the questions are easy to read and that there is no confusion in the way they are written, because we don’t want to be testing students’ reading ability, we just want to focus on the learning outcomes related to the topic. So, in this case, for example, in question 1, I can see straight away that the first answer is “ignore the issue and continue with the course as planned.” I think that straight away, you don’t need to know anything about online learning or student engagement to disregard that answer as incorrect. So, that means that the plausibility of that alternative is very, very low, and it’s not good to have that sort of alternative in the question. And then, we have option B, which is the correct one, and the tool has indicated it in bold. And the third alternative, it’s not great, but it could be included. So, after evaluating the output for question 1, I would now ask the tool to give me a better alternative for option A, which is the one that I think is not plausible, that is obviously incorrect. So, I would concentrate on question 1 until I get a question that I like. And then, I would do the same with question 2. But before I do that, I also want to say that sometimes there are things that we are not fully happy about in the questions, but they are not things that we need to continue prompting the tool to solve.
We can just change that ourselves. So, for example, sometimes the correct answer is much longer than the incorrect answers, and that goes against the recommendations for writing effective multiple-choice questions: all the answers, if possible, should be more or less the same length. But that is something that I don’t think it’s worth, you know, prompting again for; we can just shorten the correct answer ourselves. Or, for example, in this case, the question is framed as “Mr. Smith notices”; I would probably use “you notice” to make it more personal for the reader. But then again, that is not a change that I would ask the tool to make, because I can very easily change that myself. But in terms of coming up with a better option A, I think that I would prompt again to adjust that. And again, this is a very common case, so that is why I have the prompt ready, because plausibility is one of the main weaknesses that these tools have. So, now I tell the tool, “In question 1, can you substitute alternative A for a more plausible alternative,” and let’s see what it comes up with. So, it repeats the question. It has done a little bit more than just changing the first alternative; that can also happen. But now I feel that the three alternatives are good. They provide good competition among them, and we don’t have any option that is obviously incorrect, so I think that now the question could work. So, I would now prompt the tool to produce the feedback for all the questions, so that when students answer the question, they get feedback both if they choose the right answer and if they choose an incorrect answer. So-
– Okay, so students will get feedback regardless of whether they choose A, B, or C, they’ll get some kind of feedback?
– Exactly, yes, yes.
– Okay.
– So, now I prompt for that, and I say that I want feedback explaining why the answer is correct or incorrect. So, I ask that, and hopefully now we get feedback for all three alternatives. And the feedback that the tool provides tends to be quite good; I don’t generally find that it needs a lot of revision. I would probably try to make it a little bit more concise, but again, I would do that myself. That is something that can be done very quickly, and I don’t think it’s necessary to continue prompting just to reduce the number of words in each piece of feedback. So, now I would have one question that I would like to use, plus the feedback for the three alternatives. And now, I would go back to question 2 and do the same. I would evaluate the output: I would check if the question is relevant and accurate, if the alternatives are plausible, and if the phrasing of the question is clear. And then, I would make a decision about whether I want changes that I can do myself or whether I would prompt again for the tool to make those changes.
– So, that human intervention continues to be important.
– All the time, all the time. In the resource, I also include two criteria to evaluate the feedback. One is accuracy, of course: that all the information in the feedback provided to students is accurate. And the other is conciseness: that we don’t end up with very lengthy feedback, which I think is not appropriate for a multiple-choice question. Students need to move on to the next question quite quickly, so you don’t want feedback that is much, much longer than the question itself. So, those are the two things to look for in the feedback. I think that it’s clear that coming up with plausible answers is the main weakness of the tool. I had the same experience when I was using ChatGPT. Almost all the time, you find one or two possible answers that are not really good, because they don’t compete with the right answer. They are obviously wrong, and so they are not helping in any way to improve the quality of the question. So, that is something that you can only improve, I think, if you have the knowledge yourself about what a good multiple-choice question is, and if you are also very aware of what learning outcomes you want to assess by using these questions. And of course, the knowledge about your students, the prior knowledge that they have, the level they are at, that is also essential, because many of the questions might be slightly changed by you depending on that information. And of course, you can provide a lot of information in the prompt, but there are always things that come just from the knowledge that you have about your students and what phase of their learning they are in. And therefore, yeah, your human intervention is absolutely essential, even if you use Copilot or ChatGPT.
– I have a question for you. So, I find this fascinating. You clearly find value in it as an educator, but across the educational ecosystem there are mixed feelings about these technologies and tools, right? So, I can see someone maybe who’s a little more skeptical saying, “Well, if you have to go back and correct things anyways, if you’re still the one determining whether it’s quality, whether it meets those four criteria you noted, what is the value or what’s the benefit?” So, I’m hoping you can share with us as someone who’s actively engaged, why do you do this? What’s the benefit?
– The benefit is that you can get ideas for the questions much, much faster than if you do this without the tool. Writing multiple-choice questions is such a time-consuming activity that being able to ask the tool, “Can you give me five multiple-choice questions about this topic?” is a real help. And then, you can also become more specific, right? I mentioned online learning first of all, but then I specified that I wanted the questions to be about student engagement in online learning, and it came up with two scenarios that would have taken me much, much longer to think about. So, that is the benefit. But I do understand what you say, and I think it’s a very good point, because you also need to learn when to stop interacting with the tool. And as I said, there are many things that we can improve ourselves in the output that we first get from the tool, and if that is the case, that is going to be more productive. So, at some point we stop prompting, we just make the adjustments ourselves, we finalise the questions, and that’s it. But I think it’s a help to come up with ideas and scenarios, and it’s also very useful for producing the feedback, for example. That is why I would suggest that people use two prompts: if you use two different prompts, by the time you ask for the feedback, you already have a question that you like, one that you think is going to be a good question for your students. So, that means that the output with the feedback tends to be quite, quite good. You might need to make it a little bit shorter, of course, and check for accuracy, but in general, it will be feedback that you can use. And that, again, is one of the things that is very time-consuming, particularly if you want, as I do, to include feedback for both correct and incorrect answers. So, those are the two main advantages, I would say: ideas to begin with and the production of feedback.
– Wonderful. And I’m wondering if you have one suggestion for someone who does wanna play around, whether it’s Copilot or ChatGPT, is there a place they should start or is it literally just jump in and start prompting and refining those prompts?
– I would start with a sample prompt that they like. It can be the one I suggest, but it can be any other. And I would make sure that once they find a prompt that they think works for them, they save it; they can even create a prompt library with those prompts that work really well for them, because otherwise it’s very easy to forget. So you can work with a sample prompt and adapt it to your preferences. If not, you can try to create your own prompt from scratch. That is also possible, but of course it takes a little bit of time, and there are lots of places where you can find recommendations for prompting. I mentioned some before: you give the tool a role to fulfill, you give some context, and you are specific about the format that you want the output in, that sort of thing. And in this case, I think it is very, very important to include those research-based guidelines about what makes a good question: the stem of the question, that is, the question in itself, and also what makes good alternatives, so all the possible answers that you have for the question.
– Wonderful. Thank you so much. I love the tip about the prompt library, I hadn’t even thought of that, but what a wonderful resource to have to just return to and keep helping you.
– It saves a lot of time because you might think that you’ll remember next time.
– Yeah.
– But I think that maybe you don’t, because sometimes you begin interacting with the tool and then you keep asking and asking and asking, and of course, you lose track of, you know, the prompt that you wanted to use in the first place. So, the advice is always: once you get one that works for you, save it, yeah, and keep it.
– Wonderful. Well, this has been so enlightening. I’ve really enjoyed this. I appreciate you building in practicality that was so easy to understand and just following the steps along with you. Sometimes I think that these technologies can be a little confusing or hard to use, but you showed that it’s actually easier than I thought and even fun, so I really appreciate that.
– That’s wonderful. But I think that the main thing to remember is how important we humans are. And I think that is something that has not been emphasized enough about multiple-choice questions since ChatGPT became widely available. Multiple-choice questions have been mentioned all the time as one of the tasks that could be very much facilitated by these tools. But I don’t think we have been so good at discussing that, yes, that is the case, but only if we also use our human intervention. Otherwise, you know, it’s not that easy to get good multiple-choice questions just by prompting. So, the evaluation of the output and AI literacy are absolutely essential.
– I appreciate you ending with that point. I think that’ll bring relief to some who are worried about the place of humans in all of this technology. But in many ways it’s like tech that’s come before: calculators, and word processing, and all of those things, they all require human intervention in the end.
– Yes, exactly, exactly.
– Well, thank you so much for being here with us today. It’s been a pleasure chatting with you and learning how to use this cool technology.
– Thank you. Thank you for having me.
In this video Niya Bond, Faculty Developer at OneHE, talks to Nuria Lopez, Learning Consultant at the Teaching and Learning Unit of Copenhagen Business School, Denmark, about using generative AI (GenAI) tools, such as Microsoft’s Copilot, to design effective multiple-choice questions (MCQs). Nuria demonstrates a series of research-based guidelines that can be applied when using any GenAI tool to create high-quality MCQs.
MCQs can be useful to support retrieval practice, help students identify gaps in knowledge, provide automated feedback, and give educators important information regarding students’ understanding. MCQs are best used in formative assessment, as they can give students and teachers feedback on student learning, with time to act on it and improve learning before summative (final) assessment takes place.
What makes an effective MCQ?
- The question focuses on a single clear learning outcome.
- All incorrect answers are “competitive” (i.e., they are plausible and therefore able to “compete” with the correct answer).
- The answers “All of the above” and “None of the above” are avoided as options (Haladyna et al., 2002; Little and Bjork, 2015; Shank, 2021).
Using a GenAI tool to design MCQs should be understood as an ongoing collaboration between you and the tool: iterative prompting and continuous evaluation of output will be needed to refine the questions. There are three stages of interaction with the GenAI tool:
- Initial prompting
- Evaluation of output
- Continued interaction
What should you include in your initial prompt?
- You can interact with Copilot using three different conversation styles: More Creative (“imaginative answers”), More Precise (“concise answers”), and More Balanced (“balance between comprehensive information and brevity”). However, the differences among styles are not always noticeable or consistent when using specific prompts, as is the case with MCQs. Explore the different styles and compare outputs to decide which one works best for you (you might also find that you can use outputs from different styles).
- Copilot will produce the best results when tasks are structured. Use one prompt to generate the question(s) and, once these are fully refined according to your needs, a different one to generate feedback for the answers.
- General recommendations for prompting include asking the tool to take on a role (e.g., “You are/Act as a professor”) and being specific about the task and context. For creating MCQs specifically, it is also important to include evidence-informed guidance on how to write effective MCQs and effective feedback (as in the final lines of the sample prompts below).
Sample prompt to generate MCQs
You are teaching a university course on [topic]. This is a [level/type of course, e.g., introductory, postgraduate] course. Create [number] multiple-choice questions for students in this course. The questions must be about [topics/concepts], focusing specifically on [learning outcomes, e.g., the differences between X and Y, the application of X, causes/consequences of Y, OR subtopics].
The questions must have 3 alternatives: one correct answer and two incorrect answers [ask for a higher number of alternatives if relevant]. Indicate which one is the correct answer.
Follow these guidelines to write the stems of the questions: the stem addresses one single learning outcome, uses clear and concise wording, and avoids negative phrases and ambiguous vocabulary.
Follow these guidelines to write the alternatives: all alternatives must be plausible answers, have a similar length, be parallel in grammatical form, and avoid repeating phrases or words from the stem. Also, alternatives should not include obviously wrong answers, “All of the above” or “None of the above.”
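If you find yourself reusing the sample prompt with different course details, it can be kept as a simple template. Below is a minimal sketch in Python; the placeholder names (topic, level, number, subtopics, focus) and the example values are illustrative only, and the printed text is what you would paste into Copilot or ChatGPT.

```python
# A minimal sketch: the sample MCQ prompt above as a reusable Python template.
# Placeholder names (topic, level, number, subtopics, focus) and the example
# values are illustrative; adapt them to your own course.

MCQ_PROMPT_TEMPLATE = """\
You are teaching a university course on {topic}. This is a {level} course. \
Create {number} multiple-choice questions for students in this course. \
The questions must be about {subtopics}, focusing specifically on {focus}.

The questions must have 3 alternatives: one correct answer and two incorrect \
answers. Indicate which one is the correct answer.

Follow these guidelines to write the stems of the questions: the stem addresses \
one single learning outcome, uses clear and concise wording, and avoids negative \
phrases and ambiguous vocabulary.

Follow these guidelines to write the alternatives: all alternatives must be \
plausible answers, have a similar length, be parallel in grammatical form, and \
avoid repeating phrases or words from the stem. Also, alternatives should not \
include obviously wrong answers, "All of the above" or "None of the above".\
"""

# Example values based on the video demonstration.
prompt = MCQ_PROMPT_TEMPLATE.format(
    topic="education",
    level="introductory",
    number=2,
    subtopics="online learning environments",
    focus="student engagement in discussion forums",
)
print(prompt)  # paste the printed text into the chat interface
```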
Sample prompt to generate feedback for MCQs
Provide feedback for the following multiple-choice question(s): [paste question(s)].
Use these guidelines: provide feedback for all the alternatives, concisely explaining why the answer is correct or incorrect; provide feedback that is focused only on the question; make sure the feedback to the incorrect answers does not reveal the correct answer.
If using sample prompts like the ones above, consider how to adapt them to meet your specific requirements and preferences. For example, you might want to provide more detailed information about the topics or learning outcomes, ask for a scenario to be used to frame the questions, or include a sample MCQ.
As you perfect your prompting skills with practice, create a prompt library, a document where you save those prompts that have worked particularly well for you. It will be useful to have them to hand the next time you need them.
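A prompt library can be as simple as a document, but if you prefer something you can search and reuse programmatically, the sketch below stores named prompts in a small JSON file. The file name (prompt_library.json) and the prompt names are hypothetical examples, not part of the original resource.

```python
# A minimal sketch of a prompt library saved as a JSON file on disk.
# The file name and prompt names are hypothetical examples.
import json
from pathlib import Path

LIBRARY_PATH = Path("prompt_library.json")

prompt_library = {
    "mcq_generation": "You are teaching a university course on [topic]. ...",
    "mcq_feedback": "Provide feedback for the following multiple-choice question(s): ...",
    "mcq_refine_distractor": "In question [N], can you substitute alternative [X] with a more plausible alternative?",
}

# Save the prompts so they are easy to find next time.
LIBRARY_PATH.write_text(json.dumps(prompt_library, indent=2), encoding="utf-8")

# Later: load the library and reuse a saved prompt.
library = json.loads(LIBRARY_PATH.read_text(encoding="utf-8"))
print(library["mcq_generation"])
```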
How do I know that the output is good quality?
- Despite the ever-expanding capabilities of GenAI tools, their output needs evaluating. A good place to start is to assess the output against the requirements specified in your initial prompt. See below for examples of criteria to evaluate MCQs and feedback generated by Copilot.
- When assessing Copilot-generated output, keep in mind that the tool’s main weaknesses are its limited ability to produce high-quality distractors (i.e., plausible incorrect answers that “compete” with the correct answer) and the occasional generation of content-related inaccuracies.
Criteria to evaluate AI-generated MCQs
- Relevance: Have relevant aspects of the topic(s) been included in the questions? Do the questions have the appropriate level of complexity and focus on the specified learning outcomes?
- Accuracy: Has correct information been provided in the questions? For each question, is there one alternative that answers the question correctly?
- Plausibility: Are distractors (incorrect answers) plausible answers, therefore appropriately “competing” with the correct option?
- Clarity: Are both the question and the answers clearly phrased, following the guidelines provided in the prompt (e.g., avoid “All of the above”/“None of the above”)?
Criteria to evaluate feedback
- Accuracy: Does the feedback correctly explain why the answers are correct or incorrect, without revealing the correct answer in the feedback provided in the incorrect answers?
- Conciseness: Is the feedback brief and clearly focused on what has been asked in the question?
What should I keep in mind for further interactions with the GenAI tool?
- The results of your evaluation will determine how you continue your interaction with the tool. Think of this phase as a dialogue where you provide additional guidance and/or constraints to refine the output (e.g., corrections of inaccuracies, improvements in the plausibility of the distractors).
- As with initial prompting, it is important to be specific about the changes being requested to obtain the desired adjustments in the output. For example: “In Question 2, can you substitute option B with a more plausible alternative?” or “In Question 3, options A and B overlap and could both be seen as correct answers; could you suggest a different option for one of them?” (a scripted version of this kind of follow-up is sketched after this list).
- Each new prompt will generate new outputs that need to be checked and possibly further refined. It is through this continued interaction with the tool that you will normally identify its main advantages and pitfalls, allowing you to decide how you can use it most productively. Do not underestimate the importance of your human intervention: it will be essential to ensure that the final MCQs are accurate, reliable, unbiased, and above all, appropriately aligned with the learning outcomes your students are working towards.
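In the video this dialogue happens in Copilot’s chat window, and nothing more than the chat interface is needed. If you would rather script the same iterative workflow against a chat model that exposes an API (for example, ChatGPT via the openai Python package), a rough sketch under those assumptions is shown below; the model name is a placeholder, an API key is assumed to be configured in your environment, and the shortened prompts stand in for the full sample prompts above.

```python
# A rough sketch of the iterative two-prompt workflow (generate questions,
# refine a distractor, then ask for feedback) as a scripted chat conversation.
# Assumes the openai Python package and an API key in the environment;
# the model name is a placeholder, and the prompts are shortened stand-ins
# for the full sample prompts in this resource.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; use whichever chat model you have access to


def ask(messages: list[dict]) -> str:
    """Send the conversation so far and append the assistant's reply to it."""
    response = client.chat.completions.create(model=MODEL, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


mcq_prompt = (
    "You are teaching a university course on education. Create 2 multiple-choice "
    "questions about student engagement in online learning environments, each with "
    "3 alternatives (one correct, two incorrect). Indicate the correct answer."
)

conversation = [{"role": "user", "content": mcq_prompt}]
questions = ask(conversation)

# Evaluate the output yourself (relevance, accuracy, plausibility, clarity),
# then request targeted refinements, e.g. a more plausible distractor.
conversation.append({
    "role": "user",
    "content": "In question 1, can you substitute alternative A with a more plausible alternative?",
})
refined_questions = ask(conversation)

# Once the questions are the ones you want, ask for the feedback in a second step.
conversation.append({
    "role": "user",
    "content": (
        "Provide feedback for all the alternatives, concisely explaining why each "
        "answer is correct or incorrect, without revealing the correct answer in "
        "the feedback for the incorrect answers."
    ),
})
feedback = ask(conversation)
print(refined_questions, feedback, sep="\n\n")
```

Whether the prompts are scripted or typed into a chat window, the evaluation steps described above remain a human task.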
What should I consider when using AI to generate MCQs?
- Think of working with AI as an ongoing conversation – there will be give and take
- Human intervention is always essential with these tools and technologies
- Using these technologies intentionally can help educators save time and maintain quality
What should I consider when prompting multiple-choice questions?
- Suggest the AI take on a specific role, such as a teacher
- Be specific about the format
- Request 3 alternatives (unless it makes sense for the question to have more because of the topic or the learning outcome), and ask for the correct answer to be identified
- Rely on research-based guidelines for question generation and presentation
How do I know if AI output is good?
- Relevance – specificity to topic and complexity
- Accuracy – no incorrect information
- Plausibility – all alternatives, especially the incorrect ones, are plausible and make sense in the context – this is the one that AI struggles with the most, and where human intervention is most important
- Clarity – easy to read; not confusing
Related Resources:
Articles
- Haladyna, T.M., Downing, S.M. and Rodriguez, M.C. (2002) A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Applied Measurement in Education, 15(3), 309-33.
- Little, J.L. and Bjork, E.L. (2015) Optimizing Multiple-Choice Tests as Tools for Learning. Memory & Cognition, 43, 14-26.
- Little, J.L., Frickey, E.A. and Fung, A.K. (2019) The Role of Retrieval in Answering Multiple-Choice Questions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(8), 1473-1485.
- Rodriguez, M.C. (2005) Three Options are Optimal for Multiple-Choice Items: A Meta-Analysis of 80 Years of Research. Educational Measurement: Issues and Practice, 24(2), 3-13.
- Ryan, A. et al. (2020) Beyond Right or Wrong: More Effective Feedback for Formative Multiple-Choice Tests. Perspectives on Medical Education, 9(5), 307-313.
Book
- Shank, P. (2021) Write Better Multiple-Choice Questions to Assess Learning, Learning Peaks LLC.
Podcasts
- Teaching in Higher Ed Podcast – Episode 155 – Learning and Assessing with Multiple Choice Questions.
- The Mind Tools L&D Podcast – Episode 310 – Questions, questions, questions (Discussion of key takeaways from Patti Shank’s book, ‘Write Better Multiple-Choice Questions to Assess Learning’).