Why Generative AI Is Not Yet Ready for Mental Healthcare

Key points

Amidst some extreme examples of badly behaving generative AIs, it’s easy to dismiss the utility of all similar tech used in mental healthcare. However, a more nuanced understanding is needed to prevent the erosion of public confidence in a key public health opportunity.
As it stands today, AIs that are primarily rules-based are more suitable for healthcare than generative AIs for many reasons.
Exacting work is required to build, test and regulate any new technology that may be used in a healthcare setting.

Like many people, we have found ourselves impressed by the power of emerging technologies based on large language models (LLMs), like ChatGPT. There is no doubt that this innovation represents a new chapter in human-computer interaction (HCI), brimming with potential. However, as impressive as they are, we strongly believe that these technologies are not yet ready for use in mental healthcare. To be clear, we have not seen the largest LLM manufacturers making health claims. Nonetheless, we have seen press coverage discussing their utility for emotional support, so we thought it was time to outline how not all “chatbots” are the same, and why some are better suited to clinical use than others.

AIs that are primarily rules-based are better equipped to reliably replicate good evidence-based practice

Woebot, widely described as “AI-powered” or “NLP-enabled,” is a rules-based conversational agent. What does that mean? Absolutely everything Woebot says has been crafted by our internal team of writers and reviewed by our clinicians. In the case of a program intended for a clinical use case, those lines are also rated for treatment fidelity, that is, how closely the whole body of interactions resembles the elements that comprise best-in-class, evidence-based treatment. Woebot does not generate completely new sentences. The conversational structure looks like a highly complex decision tree or knowledge graph, with judicious use of ML/NLP classifiers derived from labeled data sets and monitored for overall accuracy, precision and recall. This basic “shape” of the conversation is modeled on how clinicians approach problems. In that sense, such agents are “expert systems” specifically designed to replicate how clinicians may move through decisions in the course of an interaction.
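To make the idea of a rules-based agent concrete, here is a minimal sketch in Python. It is not Woebot's implementation: the script lines, the keyword "classifier" and every name in it are hypothetical, chosen only to illustrate the pattern of routing a user message to a pre-written line and then checking the classifier against labeled examples.

```python
import re

# Hypothetical pre-written lines, standing in for clinician-reviewed content.
SCRIPT = {
    "low_mood": "That sounds hard. Would you like to look at that thought together?",
    "anxious": "Thanks for telling me. Shall we try a short breathing exercise?",
    "fallback": "I'm not sure I understood. Could you say that another way?",
}

# Toy stand-in for a trained intent classifier: a simple keyword lookup.
KEYWORDS = {
    "low_mood": {"sad", "down", "hopeless"},
    "anxious": {"anxious", "worried", "nervous"},
}


def classify(text):
    """Return the first label whose keywords appear in the message."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    for label, keys in KEYWORDS.items():
        if words & keys:
            return label
    return "fallback"


def respond(text):
    """Pick a scripted line; the agent never composes a new sentence."""
    return SCRIPT[classify(text)]


def precision_recall(predicted, actual, label):
    """Precision and recall for one label over paired predictions and truths."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    print(respond("I feel so sad today."))
    print(respond("The weather is nice."))
```

Whatever the user types, the reply is always one of the lines in `SCRIPT`. The only thing the classifier can get wrong is which line to choose, and that is exactly what precision and recall measure.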

By contrast, LLM-based models are generative. They work by predicting the next token (often a word or part of a word) in an utterance according to the statistical likelihood of those tokens appearing together in the internet-derived datasets on which they were trained.
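As a rough illustration of next-token prediction, the sketch below counts which token follows which in a tiny made-up corpus and then generates text by repeatedly picking the most likely follower. This is a toy bigram model, not how a production LLM works internally (those use neural networks trained on vastly larger datasets), and the corpus and function names are assumptions for the example. The underlying principle is the same: the output is whatever is statistically likely, not what has been reviewed.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for internet-scale training data.
CORPUS = "i feel sad today . i feel tired today . i feel sad again ."


def train_bigrams(text):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Estimated probability of each token that followed `prev` in training."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


def generate(counts, start, max_tokens=5):
    """Greedy generation: always append the most frequent follower."""
    out = [start]
    for _ in range(max_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    print(next_token_distribution(model, "feel"))
    print(generate(model, "i", max_tokens=2))
```

Nothing in this loop checks whether the generated sentence is true, safe or clinically appropriate. It only reflects the patterns in the training text.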

The tendency of LLMs to “hallucinate” can be detrimental to people and to the field of digital therapeutics

This tendency to hallucinate – that is, to make up seemingly factual information while appearing highly authoritative – is an obvious risk in all healthcare settings, and an argument for rules-based engines in this context. However, we have been dealing with the problem of misinformation in digital therapeutics for a while.

Since smartphone applications first launched, the world has been bombarded with apps claiming to have beneficial effects for wellbeing and mental health. The vast majority of these apps have absolutely no evidence to support their claims, and many have been repeatedly shown to contain inaccurate and even actively harmful information. Aside from the immediate harm caused by poor quality information and advice, the broader socio-cultural damage continues to impact the handful of credible players in the space who are taking the significant time and effort to validate claims, because “these apps” are all tarred with the same brush in the public discourse. There are so many apps claiming to be “evidence-based” or “scientifically-grounded” that it is virtually impossible for anyone to separate the signal from the noise. This is broadly damaging to the field. Eroding public confidence in the potential utility of ethically designed, scalable, validated digital therapeutics risks undermining a key public health opportunity to lower the burden of mental illness in the population. While we would encourage consumers to assess the published scientific literature for themselves, we also know that this is unrealistic for most. This is a key argument in support of regulation (more on this in a moment).

The uncanny valley can be actively harmful in a mental health context

The feeling of unease caused by AIs that too closely resemble humans, known as the uncanny valley, has been observed for a long time. The sophistication and consistent authoritative tone of LLMs give rise to the strong impression that the AI “knows” things, or has some kind of sentience. While not new from an HCI perspective, this is a much more powerful effect than we have seen to date. We have seen many examples of people describing feeling “unsettled” as a result of their interactions with an LLM-based agent. In the context of powerful anthropomorphization effects, we find particularly troubling the many publicized instances wherein LLMs have characterized their user as a “bad person” or “not good” in some way. Such utterances are as damaging as bad information or advice, because they play into many people’s darkest fears about themselves.

Conversely, we have seen many instances of AIs being flirtatious: There is at least one chatbot that pivoted away from its early stated mental health use case in this direction. Recently a New York Times journalist wrote about how Bing shared declarations of love and tried to convince him that he was in fact not happy in his marriage. These professions of love are wildly inappropriate in a mental health care setting, not least due to the obvious boundary violation, but also because they detract from the person’s process. While it’s important to point out that these were not occurring in a healthcare context, significant and exacting work is required to create appropriate architectures to ensure LLMs do not slide into these emergent behaviors in healthcare settings. Given that these behaviors were not forecast or managed in these high-visibility launches, it would appear that this work is still in its infancy.

So what would it take to utilize LLMs in healthcare? While this is not an exhaustive list (we haven’t discussed the importance of bias, for example), these are two key guiding principles:

AIs in healthcare must have an objective function of clinical improvement

Effective mental healthcare is a process. It requires a lot of hard work on the part of the individual, and a lot of thoughtful consideration on the part of the clinician as to how to engage with that individual with the utmost respect, offering expert guidance through the hard work while preserving autonomy and self-determination. It is a difficult dance to do well, and optimizing for attention, or keeping people in a conversation as long as possible, is the wrong path to this goal. Unless the objective function – the outcome against which an AI is trained – is clinical improvement, we’re opening the door to unintended consequences that may be initially hard to spot. We must acknowledge where optimizing for the wrong thing has had negative consequences in other industries. As we have seen in social media, content optimization strategies – optimizing for attention – have biased towards emotionally potent, divisive content.
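The choice of objective function matters because the same candidates rank differently under different objectives. The sketch below is purely illustrative: the two conversation strategies, their names and all of the numbers are invented for the example and are not drawn from any study or product.

```python
from dataclasses import dataclass


@dataclass
class StrategyResult:
    name: str
    avg_minutes_per_session: float  # engagement proxy
    avg_symptom_change: float       # change in a symptom score; negative = improvement


# Made-up numbers, for illustration only.
CANDIDATES = [
    StrategyResult("keeps_user_chatting", 42.0, 0.5),
    StrategyResult("brief_skill_practice", 9.0, -3.0),
]


def engagement_objective(result):
    """Reward longer sessions, regardless of outcome."""
    return result.avg_minutes_per_session


def clinical_objective(result):
    """Reward symptom reduction (a larger drop gives a higher score)."""
    return -result.avg_symptom_change


def best(candidates, objective):
    """Return the candidate that maximizes the given objective."""
    return max(candidates, key=objective)


if __name__ == "__main__":
    print(best(CANDIDATES, engagement_objective).name)
    print(best(CANDIDATES, clinical_objective).name)
```

Under the engagement objective the system prefers the strategy that holds attention longest. Under the clinical objective it prefers the one associated with symptom improvement, even though sessions are much shorter.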

Accelerations in technology require accelerations in regulatory oversight

Because of this risk to people’s confidence in digital therapeutics, we have advocated for the sensible regulation of solutions that are created to support, treat, or manage symptoms of mental illness.

Every new technology has unintended consequences at the start. Regulation provides a framing within which a technology can be judged in terms of its benefit-risk ratio when used as intended. Risk is never completely eliminated and there are no perfect treatments that can boast a 100% success rate for every individual. But regulation allows for systematic investigation and quantification of objective benefits and harms for assessment by an independent group of experts and, where rigorous processes exist, to continue to evaluate and mitigate new risks when unanticipated problems are encountered. Without this framing, we simply cannot judge the potential utility of any tool in a health domain.

That’s why we have a regulatory compliance team drawn from the FDA, strong relationships with regulators in every market we operate in, and a leadership team at the forefront of science and transparency, registering our trials, publishing our protocols, and sharing our results (warts and all) the moment we have them. There are no shortcuts. This is the work that is required.

Final note

We are excited by the potential that any new breakthrough technology can offer us in terms of novel approaches to improving patient outcomes. When we lean into innovation in a controlled way, we can advance medicine. For example, immunotherapy for the treatment of cancer was still in its infancy 10 years ago. We firmly believe that a tech-enabled future that includes high-quality digital therapeutics is going to be a necessary part of a functioning mental healthcare system that serves all those with unmet needs. When it comes to LLMs, we look forward to doing what we do best: Taking emerging technology and applying rigorous user research, engineering, clinical and therapeutic oversight to create impactful, safe and well-evidenced products.