A year ago, I wrote about how excited we were about generative AI’s potential, but how cautious we had to be before bringing any new technology into a healthcare setting. I advocated proceeding with humility, guardrails, and good science.

Three months later, we kicked off BUILD (NCT05948670), the first known registered study to explore how a reduced set of Woebot features augmented with large language models (LLMs) compares with the same features in Woebot as it exists today. We’ve just completed that study, and its full results will soon be published in a peer-reviewed journal. However, I want to share a preview of what we found today. While we believe conversational agents are the future of tech-enabled mental healthcare delivery at scale, the results of the BUILD study let us examine the user experience in a way that tends to be absent from the current hype.

ChatGPT has already introduced tens of millions of people to conversational agents, or chatbots: software that seeks to replicate human discourse. However, not all chatbots are created equal, and anything built for emotional support must be beautifully designed and always in service to the user. Conversational agents may have special utility in this context because conversation is a much more natural way to engage when people are feeling low. That moment is often the hardest one in which to reach out for help, yet it is also a key moment in which to engage with therapeutic skills.

We also believe in the centrality of conversation given the approach on which Woebot is based: Cognitive Behavioral Therapy (CBT), which examines our thoughts and what we say in order to reveal the negative mindset in which mood disorders take root. The language we use matters; the theory of “Cognitive Specificity” from CBT’s creator, Aaron Beck, holds that language reveals the ways in which our thinking is distorted along specific emotional lines. A great CBT practitioner guides an individual through the process of reconstructing their mindset: systematically identifying negative thoughts and the distortions within them, and ultimately rewriting the script. Woebot uses the same techniques, and because the focus is on the individual’s thoughts (not Woebot’s), success does not depend on perfect understanding.

We’ve spent years perfecting the rules of engagement for conversational agents in this context. We’ve observed firsthand that balancing appropriate guidance with self-determination (keeping the person in the driver’s seat) is the key to engagement. We’ve gained an understanding of how and when to balance the levers of credibility and humility, quirkiness and warmth, empathy and accountability. We’ve also studied working alliance, considered a foundational aspect of all healthcare delivery and a necessary condition for change. Our theory is that, while not at the level of human connection, the encounter with an AI-based conversational agent affords a unique opportunity to speed up the formation of working alliance and engagement, precisely because it is AI.

Let’s get back to BUILD: our primary outcome for the study was user experience as measured by user satisfaction. Our exploratory outcomes included symptom change, working alliance, attitudes toward AI, and a variety of other engagement metrics, as well as safety events and conversation performance.

We tested 11 different LLMs to understand their limitations within our conversational model, and to develop both a technical architecture for BUILD and the guardrails necessary to ensure participants would never interact directly with an LLM. We built a study app to deliver both experiences and randomized 160 people to talk to the bot for two weeks, a ridiculously short time in psychotherapy. Users were blinded to the condition to which they had been assigned. The study was IRB-approved, and all users provided informed consent.
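
To make the guardrail idea concrete, here is a minimal, hypothetical sketch of the general pattern: an LLM is consulted only behind the scenes (for example, to label the user’s message), every word shown to the participant comes from a pre-written, clinician-approved library, and a rules-based classifier serves as the fallback. All names and logic below are illustrative assumptions, not our proprietary implementation.

```python
# Hypothetical "LLM behind guardrails" sketch: the model only labels the
# user's message; user-facing text comes exclusively from a vetted library.
# Names and logic are illustrative, not Woebot's actual implementation.

ALLOWED_INTENTS = {"low_mood", "relationship_problem", "anxiety", "unknown"}

APPROVED_RESPONSES = {  # pre-written, clinician-approved copy only
    "low_mood": "It sounds like you're feeling low. Is that right?",
    "relationship_problem": "It sounds like a relationship is weighing on you. Did I get that right?",
    "anxiety": "It sounds like worry might be part of this. Am I hearing you correctly?",
    "unknown": "I want to make sure I understand. Can you tell me a bit more?",
}

def classify_with_llm(message: str) -> str:
    """Placeholder for a constrained LLM call that returns one intent label."""
    raise NotImplementedError

def classify_with_rules(message: str) -> str:
    """Placeholder for a rules-based classifier used as the fallback."""
    return "unknown"

def respond(message: str) -> str:
    try:
        intent = classify_with_llm(message)
    except Exception:
        intent = classify_with_rules(message)
    if intent not in ALLOWED_INTENTS:      # reject anything off the allow-list
        intent = classify_with_rules(message)
    return APPROVED_RESPONSES[intent]      # only vetted text reaches the user
```

The point of this pattern is that the LLM can fail, drift, or be swapped out without any of that ever being visible to the participant.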

What did we find? Across the board, we observed broadly similar user satisfaction and working alliance scores across the two arms. Participants did not seem to mind which condition they were in, even though a majority in both groups thought they were in the generative AI condition. This absence of differences in user experience came despite significant differences in the quality and accuracy of the LLMs: one key algorithm, for example, was roughly twice as accurate as our typical classifier at conceptualizing the nature of the problem the individual described.

We also found that while both groups reduced self-reported symptoms, there was no difference in clinical outcomes between the arms, although this finding is better described as an observation, since the study was underpowered to detect a statistically significant difference on this metric.

So, what’s going on here? We offer a few thoughts for consideration.

It might be that LLM infusion doesn’t go far enough, or that, despite increasing exposure, people are not yet able to distinguish generative from rules-based AI. The observation that the majority of people in both conditions believed they were in the generative condition lends support to this idea. It may also be that the minds of people in the rules-based condition led them to experience an imperfect conversation as nonetheless empathic. For me, this is quite satisfying, because I’ve always believed that the way Woebot checks the accuracy of his intent classification with users (e.g., “It sounds like there’s a couple of things here, feeling low and problems with relationships, is that true? Am I hearing you correctly?”) demonstrates great empathy. It’s also a beautiful way to retain the user’s self-determination, and to guard against bias.
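
For readers who like to see the mechanics, here is a deliberately simplified, illustrative sketch of that reflect-and-confirm pattern; the function name and wording are assumptions made for the example, not Woebot’s scripts.

```python
# Illustrative reflect-and-confirm sketch: mirror the detected topics back to
# the user and ask for confirmation before any technique is suggested.
# Wording and names are assumptions for the example, not Woebot's scripts.

def reflect_and_confirm(detected_topics: list[str]) -> str:
    if not detected_topics:
        return "I want to make sure I understand. What feels most important right now?"
    if len(detected_topics) == 1:
        summary = f"It sounds like the main thing here is {detected_topics[0]}."
    else:
        summary = ("It sounds like there are a couple of things here: "
                   + " and ".join(detected_topics) + ".")
    return summary + " Is that true? Am I hearing you correctly?"

# reflect_and_confirm(["feeling low", "problems with relationships"])
# lets the user accept or correct the labels, keeping them in the driver's seat.
```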

The psychological processes we built for the study are already highly refined, honed by experienced clinicians over several decades, and are unlikely to benefit from added intelligence, because their efficacy depends on the user’s engagement. We suspect that LLMs will help with engagement rather than with the psychological techniques themselves, though they will almost certainly help us better target strategies toward precision mental health, so that we select the right technique for the right person at the right time. This is exactly what we’re testing in our next study, enrollment for which has just begun.

A big plus: the architecture itself was a definite win for the team. Our use of generative AI sat within a set of proprietary guardrails designed to mitigate problematic response patterns from LLMs (notably hallucinations), and I’m happy to say we observed no hallucinations and recorded no serious adverse events.

Interestingly, while more than half of the research participants had never used ChatGPT, we found that the majority of people in both conditions said they were significantly more comfortable with the role of AI in mental healthcare as a result of their participation in the study. That may be because we’ve built Woebot for symptom improvement, not for attention, which could be a key to a genuinely positive experience. Or it may be that, for now, generative AI is best used to augment well-structured conversational agents, especially those that apply science and proven design principles toward an objective function grounded in human health. Either way, it’s a good sign of growing literacy about the role of agents like Woebot.

We’ll continue to learn and share what we discover.

Editor’s Note, March 2024: Our research into LLMs is exploratory. We do not currently use LLMs to generate responses to users; all text is developed by our conversational writers in collaboration with our clinical experts.