Central to our guiding principles is the use of empathic AI to measure and improve human well-being. We see this as an urgent need that makes the development of empathic AI worthwhile. As AI gets smarter, it becomes increasingly essential to ensure that it learns to accomplish its objectives using methods aligned with human well-being. The principal goal of these guidelines is to ensure that empathic AI is used to improve human emotional well-being (and not the contrary).
Given that well-being is key to the ethical deployment of empathic AI, it is critical that we define human well-being and identify ways it can be measured. First, a definition: well-being is the experience of living a good life. It is the experience of living a life that is, on balance, characterized by comfort, health, happiness, fulfillment, and a desired level of variety or richness in emotional experience. This brings to the fore three key tenets of measuring and optimizing for well-being:
No single measure of well-being is perfect, but many are adequate.
No single measure of well-being reigns supreme in the eyes of scientists, philosophers, or poets. Yet it is not impossible to measure well-being. There are many valid metrics to choose from, just as there are many ways to be happy. Developers of empathic AI should strive to optimize for the most contextually appropriate measure. They should also ensure that improvements in one measurement or facet of well-being do not come at the expense of decreases in others. It is therefore essential to use multiple complementary measures, particularly in high-risk applications.
When optimizing for well-being, preserve emotional richness.
Well-being is not just about feeling "positive." We are not always better off smiling. We like feeling different emotions in different contexts—horror felt in response to a horror movie is, to many, a sought-after experience. We also enjoy experiencing a variety of positive emotions, from awe to amusement to love. But this should not discourage developers from training AI to increase positive emotions on balance. It should just be trained to do so without reducing the overall variety of emotions we experience and express.
Algorithms should be tested for their causal effects on well-being.
Developers should take observational measures of well-being into consideration to ensure algorithms are designed in a manner likely to improve well-being. But they should be aware that observational measures of well-being do not allow for causal inference. Before deploying an algorithm, standard experimental tests—such as A/B tests—should be used to evaluate its causal effects on well-being. Such tests should adhere to applicable standards of research ethics.
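As one illustration of the experimental testing described above, the sketch below compares a self-reported well-being score between the control and treatment arms of an A/B test using a two-sided permutation test. This is a minimal sketch under stated assumptions, not a prescribed method: the function name `permutation_test`, the choice of a permutation test, and the idea of a single numeric score per user are all illustrative choices that do not come from these guidelines.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    `control` and `treatment` are hypothetical lists of per-user
    well-being scores (for example, a self-report rating), one list
    per arm of a randomized A/B test.
    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)

    pooled = list(control) + list(treatment)
    n_control = len(control)

    # Count how often a random relabeling of users produces a
    # difference at least as large as the one actually observed.
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    control_scores = [3, 4, 3, 5, 4, 3, 4, 4]
    treatment_scores = [6, 7, 6, 8, 7, 6, 7, 7]
    diff, p = permutation_test(control_scores, treatment_scores)
    print(f"difference in means: {diff:.2f}, p-value: {p:.4f}")
```

In keeping with the first tenet, a test like this would be run separately for each complementary measure, so that a gain on one measure can be checked against possible losses on the others.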
With these tenets in mind, we summarize widely accepted measures of human well-being. Because well-being is subjective, it is most directly measured using self-report instruments, but many objective proxies can also be measured reliably and unobtrusively. The tables below cover a range of self-report measures and objective proxies.
Tier 1: Day-to-Day Measures
Present emotional experience
Self-reported positive emotions (e.g., admiration, adoration, aesthetic appreciation, amusement, awe, calmness, contentment, ecstasy, excitement, interest, joy, love, pride, romance, satisfaction, or triumph)
Lowness in self-reported negative emotions (e.g., anxiety, boredom, confusion, contempt, disappointment, disgust, distress, doubt, fear, guilt, horror, pain, regret, sadness, shame, or tiredness)
Given tradeoffs between immediate and long-term effects on emotions, measure emotions in the days or weeks following use of the application if possible.
Given that negative emotions are not always undesirable (e.g., horror in response to horror movies), measure emotional experience alongside other self-report measures or objective proxies of well-being.
To obtain reliable measurements, consider ecological momentary assessment (EMA) best practices.
Mental health survey instruments
Individual mental health assessments
Community and relationship mental health assessments
Most relevant to health-related applications.
Where possible, use well-validated instruments to assess mental health, as effect sizes can then be compared across many interventions.
Tier 2: Longer Term Measures
Past emotional experience
Feeling positive emotions in the past day, week, or month
Feeling fewer negative emotions in the past day, week, or month
Given that people can be inaccurate in recalling past experiences, measure alongside present emotional experience.
Feeling satisfied when reflecting upon one’s life as a whole
Feeling that life is close to ideal
Having few regrets
Satisfaction with Life Scale
Consider pairing intermittent measures of life satisfaction with more frequent well-being measures that are more variable and responsive to changes, such as present emotional experience.
Feeling motivated to help others
Recently helped others
High in assessments of empathy
Low in assessments of narcissism, Machiavellianism, and psychopathy
The relevance of altruism depends on the specific benefits and risks of the application. It is particularly relevant for platforms that influence social decision-making.
Tier 3: Less Direct Measures
Level of satisfaction with a product or service
Likelihood of recommending a product or service to others
Caveat: User satisfaction is less directly linked to the ethics of the application than other well-being measures (e.g., consider user satisfaction with tobacco products).
Repeated informed consent
Informed consent solicited after a comprehensive summary of potential risks is reiterated to the user, such that consent can be viewed as a judgment that the benefits exceed the potential risks
Informed consent may be solicited regularly to ensure that users understand potential risks.
Ongoing consent is essential in early stages of testing, when data is insufficient to weigh benefits and risks.
Tier 1: Day-to-Day Measures
Positive facial expressions, vocal utterances (e.g., laughter), speech prosody, language, or emotion-related body movement appropriate to a given social context
Lowness in negative facial expressions, vocal utterances, speech prosody, language, or emotion-related body movement that are not expected in a given social context (for exceptions, consider expressions of horror in response to a horror movie or sadness at a funeral)
Assessing the appropriateness of expressive behavior to a given social context depends on nuanced measurement. For instance, a triumphant shout may be appropriate on a football field but not in a comedy club (valence alone is often insufficient).
Expressive behavior can be measured more frequently and passively than self-report and is less subject to self-enhancement bias and demand characteristics.
To weigh behaviors appropriately, measure alongside more intermittent self-report measures.
Antisocial behavior (e.g., physical harm, hate speech, environmental pollution)
Where feasible, reductions in harmful behavior can be used as an objective proxy to accompany self-report measures of well-being.
Tier 2: Longer Term Measures
Physical health indicators
Low rate of illness, injury, and/or mortality
Healthy levels of inflammatory cytokines, cholesterol, blood pressure, and other objective indicators of physical health
Where feasible, physical health indicators, particularly those closely linked to emotional well-being (e.g., sleep quality, inflammatory cytokines), can be used as objective proxies to accompany self-report measures.
Academic performance metrics
Career advancement metrics
Metrics of goal attainment are most relevant when they represent users’ direct goals in deploying empathic AI.
Negative life outcomes
Where feasible, reductions in negative life outcomes can be used as a strong objective proxy to accompany self-report measures of well-being.
Agreement with facts that are backed by near-universal scientific or expert consensus
False beliefs are of most concern for applications that control the spread of information, such as search engines and social media, and can be costly to individual and societal well-being.
Tier 3: Less Direct Measures
Emotion-related physiology
Physiological changes associated with emotion in a given context, such as pupil dilation, sweating, and heart rate variability
Physiological indicators of emotion such as autonomic activity are highly context dependent and must be interpreted with care.
The following are recommended standards for ongoing measurement of well-being within an application or condition of an A/B test, by risk level and number of users. We recommend discontinuing any application of empathic AI whose benefits do not substantially outweigh its costs for the well-being of users and other affected parties.
Low-risk area AND used for <30 min. per day on average.
Medium-risk area OR used for 30 min. to 1 hour per day on average.
High-risk area OR used for >1 hour per day on average.
<100 daily users
100-1K daily users
1K-10K daily users
10K-100K daily users
100K-1M daily users
1M+ daily users
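The row and column definitions above can be read as a simple lookup. The sketch below shows one way to assign an application to a risk row and a user-count column. It rests on assumptions that the text does not settle: that the overall tier is the higher of the area tier and the usage tier, that exactly 30 and exactly 60 minutes count as medium, and that each user bucket includes its lower bound. All function and constant names are hypothetical, and the sketch returns only the labels, not the measurement standard recommended for each cell.

```python
from bisect import bisect_right

# Ordered from least to most severe.
RISK_ORDER = ["low", "medium", "high"]

# Assumed bucket boundaries (lower bound inclusive).
USER_BUCKET_EDGES = [100, 1_000, 10_000, 100_000, 1_000_000]
USER_BUCKET_LABELS = ["<100", "100-1K", "1K-10K", "10K-100K", "100K-1M", "1M+"]


def usage_tier(avg_minutes_per_day):
    """Map average daily use to a tier: <30 min, 30-60 min, >60 min."""
    if avg_minutes_per_day < 30:
        return "low"
    if avg_minutes_per_day <= 60:
        return "medium"
    return "high"


def risk_tier(area_risk, avg_minutes_per_day):
    """Return the higher of the area tier and the usage tier.

    This matches the AND for the low row (both must be low) and the
    OR for the medium and high rows (either condition is enough).
    """
    if area_risk not in RISK_ORDER:
        raise ValueError(f"unknown area risk: {area_risk!r}")
    return max(area_risk, usage_tier(avg_minutes_per_day), key=RISK_ORDER.index)


def user_bucket(daily_users):
    """Return the label of the daily-user column for a given count."""
    return USER_BUCKET_LABELS[bisect_right(USER_BUCKET_EDGES, daily_users)]


if __name__ == "__main__":
    print(risk_tier("low", 45), user_bucket(5_000))
```

Taking the higher of the two tiers is a conservative reading: an application in a low-risk area still moves up a row once average daily use reaches 30 minutes.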