Measuring the success of conversational UIs

Brielle Nickoloff
8 min read · Apr 14, 2020


Hundreds of questions were submitted by attendees of Botmock’s AMA about becoming a conversation designer or developer. Five industry experts with backgrounds in design, engineering, and product development answered as many as we could in a live roundtable discussion. If you would like to watch the full recording, you can do so here.

The following CUI experts joined us to answer your questions:

Grace Hughes, Content Design @ Accenture’s Innovation Hub - The Dock, Dublin

Sachit Mishra, Product Manager, Google Assistant

Gabrielle Moskey, Senior Voice Experience Designer @ a national insurance company

Matthew Portillo, Conversational Experience Design, IPsoft Amelia

Brielle Nickoloff, Product Manager, Botmock (panel moderator)

In Part II, we covered the following topics:

  • How do you evaluate and critique conversational interfaces?
  • How do you make design decisions transparent and arguable? What have you found helpful in facilitating a design stand-up for your interfaces where everyone can see the decisions you’ve made and offer feedback?
  • What are the main metrics for evaluating the performance of a conversational AI?
  • How do you design and develop for failure?
  • What UXR methods do you use in the context of conversation design and development?

What are the main metrics for evaluating the performance of a conversational AI?

Sachit:

I work specifically on the Google Assistant developer platform which enables other developers to build voice experiences for Assistant, and when I’m evaluating the potential success of a conversational experience, one key factor I always consider is the situational context of the user. How likely is the user to opt to use a voice assistant to get something done for them in that context? Is it really 10x better than just pulling out their phone and using an app, or accessing that information on a laptop? Also, I always try to think about where the user actually is located. In the home? In the car? How are they going to discover this experience? How are we going to get them through the entire journey, all the way to where we actually get something done for them? We’re going to lose them along the way if the whole experience isn’t tight.

Retention is also super important when thinking through metrics. It’s something that’s really hard to achieve with chatbots and voice experiences compared to modalities like mobile apps and websites. We measure things like the number of times a user comes back to an experience over the course of 7 days, or 28 days. We also consider week-to-week retention: are they coming back week over week? The metrics you want to use depend a lot on the use case, since not every type of metric applies to every use case, but these are generally the ones my team cares most about.

Another one I always consider is the no-match or fallback rate. During any given conversation, a user might say something that wasn’t accounted for, and we call that a no-match or fallback. We want to know how often that’s happening in any given conversation. There are also more general metrics, like how many active users you have per day or per month. By active user I mean someone who is engaging with the experience in some meaningful way. You can look at how many users are signing into your service on the voice interface. If you have user accounts, you can check whether users are signing in as much via the voice or chat version as they are on a graphical modality.
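
As a rough sketch of how a couple of these numbers could be computed from a conversation log, here is a minimal example in Python; the event fields, the "fallback" label, and the retention window are illustrative assumptions, not Google Assistant’s actual schema:

```python
from datetime import timedelta

# Each event is one conversational turn, e.g.:
# {"user_id": "u1", "timestamp": <datetime>, "intent": "transfer_funds"}
FALLBACK_INTENT = "fallback"  # hypothetical label for no-match turns


def fallback_rate(events):
    """Share of turns the NLU could not match to any intent."""
    if not events:
        return 0.0
    misses = sum(1 for e in events if e["intent"] == FALLBACK_INTENT)
    return misses / len(events)


def day_n_retention(events, n=7):
    """Fraction of users who come back within n days of their first turn."""
    first_seen, returned = {}, set()
    for e in sorted(events, key=lambda e: e["timestamp"]):
        uid, ts = e["user_id"], e["timestamp"]
        if uid not in first_seen:
            first_seen[uid] = ts
        elif ts - first_seen[uid] <= timedelta(days=n):
            returned.add(uid)
    return len(returned) / len(first_seen) if first_seen else 0.0
```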

In short, before we start building out an entire experience, we try to fully understand the actual value to the user. Will the product streamline a process, or is that process already being done efficiently in a different way?

Matt:

Sachit spoke to a lot of metrics that can help evaluate the success of an experience post-launch, once it’s actually in users’ hands, so I’ll describe some of the metrics we use while we’re actually designing for any given use case. At IPsoft, we tend to make the broad assumption that users aren’t necessarily excited to speak with a chatbot, and they don’t find it novel that a computer wants to speak to them. Users want to get to a resolution in the most efficient way possible. We look at how much time it takes for a user to complete a specific task, and how many turns of conversation it actually takes. If a user has to go through a really lengthy conversational process to complete something they could likely do somewhere else, that usually frustrates them and it’s not going to be a good experience; the return rate would also take a hit.

We also look at whether there’s an opportunity for actual task completion. Was the user actually able to make that transfer from checking to savings? Did the user have to go back and say what account it’s transferred from, what account it’s transferred to, how much it is, when it should be scheduled for, and so on? Those chains of questions will fatigue a user. So, in summary, we’re always balancing whether a task can be completed successfully while minimizing both time and conversational turns.
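
As an illustration of tracking those two design-time metrics together, here is a minimal sketch; the session fields and numbers below are hypothetical, not IPsoft’s actual data model:

```python
# Hypothetical session records; the fields are illustrative only.
sessions = [
    {"task": "transfer_funds", "turns": 5,  "seconds": 42,  "completed": True},
    {"task": "transfer_funds", "turns": 12, "seconds": 110, "completed": False},
    {"task": "transfer_funds", "turns": 6,  "seconds": 51,  "completed": True},
]

completed = [s for s in sessions if s["completed"]]
completion_rate = len(completed) / len(sessions)
avg_turns = sum(s["turns"] for s in completed) / len(completed)
avg_seconds = sum(s["seconds"] for s in completed) / len(completed)

print(f"completion rate: {completion_rate:.0%}  "
      f"avg turns: {avg_turns:.1f}  avg time: {avg_seconds:.0f}s")
```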

As far as making design decisions transparent, we do a lot of design reviews, which is one of my favorite parts of the process. After I put together what I think is a good conversational experience, one that gets the user through the conversation efficiently so they’re able to do what they came in to do, I put it in front of my design colleagues. That includes UX experts, other conversational experience designers, a user researcher, language recognition designers, and implementation engineers. We all get into a room, I show a very simple prototype, often just a table in a Google Slides deck, and we talk about it. We bring our perspectives together and completely pick it apart. Every time I go into that room I feel pretty good about the design, and every time I come out I’m thinking to myself, ‘Oh wow, I didn’t see that, I didn’t see that other thing. So glad these people can show me my blind spots.’

Through all of this, time is a very important factor to consider when designing an experience. How much time will the user have to spend interacting with the experience to get to a resolution? How many turns of conversation does it take to complete a specific task?

Gabrielle:

Testing is important to do before going into development and implementation. As designers we should try to push the limits of the experience and observe how it breaks. Doing this will identify the weak spots and shine a light on where the improvements need to be made.

When designing a conversational experience you want to start the process as you would for any other UX project. Start by conducting your user interviews, do some competitor research, understand best practices and isolate your problem statement.

Then we can proceed with organizing a content repository (it can be a simple text document or spreadsheet) that outlines the conversational steps of the experience. From there, we can create an informed flow chart or prototype that can later be tested and adjusted.

Since we want to conduct a good amount of usability testing before we go into the development phase, testing a live application usually isn’t doable. For testing at the early stages, what I do is create audio files of the steps of the experience and have users navigate the flow with me pressing play to prompt the responses as they advance through the conversation. This type of testing will help you dial in the flow of the conversation (test every time you adjust the flow). If finding people to test with is difficult, just ask a friend or colleague and play a pretend game: give them the situational context of the experience, then have them go through the flow of the conversation as you play the part of the conversational UI.

After the conversational experience is implemented, it’s important to keep testing the interface in real life during development and production. Live usability tests alongside surveys will provide a lot of information regarding the efficiency and intuitiveness of the conversational experience.

How do you design and develop for failure?

Gabrielle:

It’s impossible to create a ‘perfect’ chatbot that has an intent for everything, or a voice assistant that can answer anything. Even the most well-designed conversational interactions are susceptible to leading a user to dead ends, which of course leads to errors and fallbacks, or those “sorry, I didn’t quite get that” cases. For this reason, it is critical to incorporate error handling strategies and make sure they’re conducive to a good user experience.

To make a system error-ready, we’re not going to just respond, “Sorry, I don’t know.” Instead, we can actually do something helpful, either by asking the user to rephrase, or by offering a disambiguation (“Oh, I think you meant this; is this what you wanted?”), as long as we have medium confidence in what we think they said. We can also provide a response that gives options covering related topics, and then ask if those were helpful. You want to ensure that if someone gets off track, you can get them back on track. If you have a long flow, you want to make sure that, especially if a user is in a high-stress situation [like asking their mobile Alexa app for insurance help if they’ve just been in a car crash], the user wouldn’t get halfway through the flow, make a mistake, and then have to start all over. You’d need a way for the system to save the progress the user has already made. A system that could do that would be well equipped with error handling strategies. To get there, you have to try to break the system. You have to think, ‘what are these people going to say?’, ‘how might they get off track?’, ‘how can we get them back on track?’. This is important for both VUIs and chatbots.
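
A minimal sketch of what that escalating strategy might look like in code; the session structure, confidence thresholds, and wording are assumptions for illustration, not any particular platform’s implementation:

```python
# Escalating no-match handling: rephrase, disambiguate, then hand off
# while keeping the slots the user has already filled.
MAX_REPROMPTS = 2


def handle_no_match(session, best_guess=None, confidence=0.0):
    session["reprompts"] = session.get("reprompts", 0) + 1

    # Medium confidence: offer a disambiguation instead of a bare apology.
    if best_guess is not None and 0.4 <= confidence < 0.7:
        return f"I think you meant '{best_guess}'. Is that what you wanted?"

    # First few misses: ask the user to rephrase; filled slots stay filled.
    if session["reprompts"] <= MAX_REPROMPTS:
        return "Sorry, I didn't quite get that. Could you say it another way?"

    # Repeated misses: hand off instead of looping, carrying the user's
    # progress forward so they don't have to start the flow over.
    saved = ", ".join(f"{k}: {v}" for k, v in session.get("slots", {}).items())
    return ("Let me get you to someone who can help. "
            f"I'll pass along what you've told me so far ({saved}).")
```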

On the developer side, it’s important to do regression testing: check the whole system every time you add something new. If you add a new intent, you might break other intents; maybe the new one is very similar to some of the others and it will confuse the system. Whenever you add something new, test your bot again to make sure everything is still being answered accurately.
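
One lightweight way to automate that check might look like the sketch below, where `classify` is a stand-in for whatever NLU call your platform exposes and the utterances and intent names are just examples:

```python
# Intent regression set: rerun this after every new intent you add.
REGRESSION_SET = [
    ("move 50 dollars from checking to savings", "transfer_funds"),
    ("what's my balance",                        "check_balance"),
    ("never mind, cancel that",                  "cancel_transfer"),
]


def check_intent_regressions(classify):
    """Return (utterance, expected, got) triples that no longer resolve correctly."""
    failures = []
    for utterance, expected in REGRESSION_SET:
        got = classify(utterance)
        if got != expected:
            failures.append((utterance, expected, got))
    return failures

# Example usage: failures = check_intent_regressions(my_nlu.classify)
```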

The Botmock team is excited that more of our community members are involved than ever before. There will be more panelists and attendees from diverse backgrounds contributing to these events in the future, so keep an eye out!

Further resources mentioned by panelists:

Content Design, Sarah Richards

Talk: The Science of Conversation, Elisabeth Stokoe

This article is an expansion of the first topic covered in Botmock’s first conversation design and development AMA session. You can also read recaps of the other two sections, Becoming a conversation designer or developer and The Technical Elements of Conversation Design.

Written by Brielle Nickoloff

Cofounder & Head of Product @Botmock
