Measuring Well-being

Well-being

Central to our guiding principles is the use of empathic AI to measure and improve human well-being. We see this as an urgent need that makes the development of empathic AI worthwhile. As AI gets smarter, it becomes increasingly essential to ensure that it learns to accomplish its objectives using methods aligned with human well-being. The principal goal of these guidelines is to ensure that empathic AI is used to improve human emotional well-being (and not the contrary).

Guiding Principles

Given that well-being is key to the ethical deployment of empathic AI, it is critical that we define human well-being and identify ways it can be measured. First, a definition: well-being is the experience of living a good life. It is the experience of a life that is, on balance, characterized by comfort, health, happiness, fulfillment, and a desired level of variety or richness in emotional experience. This brings to the fore three key tenets of measuring and optimizing for well-being:

1

No single measure of well-being is perfect, but many are adequate.

No single measure of well-being reigns supreme in the eyes of scientists, philosophers, or poets. Yet it is not impossible to measure well-being. There are many valid metrics to choose from, just as there are many ways to be happy. Developers of empathic AI should strive to optimize for the most contextually appropriate measure. They should also ensure that improvements in one measurement or facet of well-being do not come at the expense of others. It is therefore essential to use multiple complementary measures, particularly in high-risk applications.
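As a minimal sketch of what checking complementary measures might look like, the following compares each measure against its baseline and reports any that declined. The measure names, the scores, and the assumption that a higher score is better on every measure are illustrative only and are not part of these guidelines.

```python
def regressed_measures(baseline, current, tolerance=0.0):
    """Return the names of measures whose score fell by more than `tolerance`.

    Assumes (hypothetically) that every measure is scored so that
    higher values indicate better well-being.
    """
    return sorted(
        name
        for name, before in baseline.items()
        if current[name] < before - tolerance
    )


# Hypothetical average scores before and after a change to an application.
baseline = {"positive_emotion": 3.1, "life_satisfaction": 4.0, "sleep_quality": 3.5}
current = {"positive_emotion": 3.6, "life_satisfaction": 4.0, "sleep_quality": 3.0}

declined = regressed_measures(baseline, current)
print(declined)  # ['sleep_quality']
```

Here the gain in positive emotion is accompanied by a drop in sleep quality, which is the kind of tradeoff a single measure would miss.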

2

When optimizing for well-being, preserve emotional richness.

Well-being is not just about feeling "positive." We are not always better off smiling. We value different emotions in different contexts: horror felt in response to a horror movie is, to many, a sought-after experience. We also enjoy experiencing a variety of positive emotions, from awe to amusement to love. This should not discourage developers from training AI to increase positive emotions on balance, but AI should be trained to do so without reducing the overall variety of emotions we experience and express.
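One possible way to keep an eye on emotional variety is to summarize the spread of reported emotions with Shannon entropy. This is an illustrative assumption rather than a measure these guidelines prescribe, and the function name and emotion labels below are hypothetical.

```python
from collections import Counter
from math import log2


def emotional_variety(reports):
    """Shannon entropy (in bits) of a list of reported emotion labels.

    0 means a single emotion was reported every time; larger values
    mean reports are spread more evenly across more emotions.
    """
    counts = Counter(reports)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical self-reports from one user over two periods.
before = ["joy", "awe", "amusement", "love"]
after = ["joy", "joy", "joy", "joy"]

print(emotional_variety(before))  # 2.0
print(emotional_variety(after) < emotional_variety(before))  # True
```

In this toy case the later period contains only a positive emotion, yet the variety index collapses to zero, which is the pattern this tenet warns against.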

3

Algorithms should be tested for their causal effects on well-being.

Developers should take observational measures of well-being into consideration to ensure algorithms are designed in a manner likely to improve well-being. But they should be aware that observational measures of well-being do not allow for causal inference. Before deploying an algorithm, standard experimental tests—such as A/B tests—should be used to evaluate its causal effects on well-being. Such tests should adhere to applicable standards of research ethics.
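As a minimal sketch of how such an experimental comparison might be analyzed, the following permutation test compares a well-being score between users randomly assigned to a control condition and a treatment condition. The function name, the scores, and the choice of a permutation test are illustrative assumptions; any standard analysis suited to the measure could be substituted.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean scores.

    Returns (observed_difference, p_value), where the difference is
    mean(treatment) - mean(control).
    """
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random, as if the algorithm had no effect.
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical self-reported well-being scores (1-10 scale).
control_scores = [5, 6, 5, 4, 6, 5, 5, 6]
treatment_scores = [7, 8, 7, 6, 8, 7, 7, 8]

effect, p = permutation_test(control_scores, treatment_scores)
print(f"difference in means: {effect:.2f}, p = {p:.4f}")
```

Because users are randomly assigned to conditions, a difference that random relabeling rarely reproduces can be attributed to the algorithm rather than to pre-existing differences between users.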

With these tenets in mind, we provide a summary of widely accepted measures of human well-being. Given that well-being is subjective, it is most directly measured using self-report instruments. But there are also many objective proxies of well-being that can be measured reliably and unobtrusively. Both kinds of measures are summarized in the tables below.

Self-Report Measures and Objective Proxies

Self-Report Measures

Category
Measures
Best Practices

Tier 1: Day-to-Day Measures

Present emotional experience

Self-reported positive emotions (e.g., admiration, adoration, aesthetic appreciation, amusement, awe, calmness, contentment, ecstasy, excitement, interest, joy, love, pride, romance, satisfaction, or triumph)

Low levels of self-reported negative emotions (e.g., anxiety, boredom, confusion, contempt, disappointment, disgust, distress, doubt, fear, guilt, horror, pain, regret, sadness, shame, or tiredness)

Given tradeoffs between immediate and long-term effects on emotions, measure emotions in the days or weeks following use of the application if possible.

Given that negative emotions are not always undesirable (e.g., horror in response to horror movies), measure emotional experience alongside other self-report measures or objective proxies of well-being.

To obtain reliable measurements, consider ecological momentary assessment (EMA) best practices.

Mental health survey instruments

Individual mental health assessments

Community and relationship mental health assessments

Most relevant to health-related applications.

Where possible, use well-validated instruments to assess mental health, as effect sizes can then be compared across many interventions.

Tier 2: Longer-Term Measures

Past emotional experience

Feeling positive emotions in the past day, week, or month

Feeling fewer negative emotions in the past day, week, or month

Given that people can be inaccurate in recalling past experiences, measure alongside present emotional experience.

Life satisfaction

Feeling satisfied when reflecting upon one’s life as a whole

Feeling that life is close to ideal

Having few regrets

Satisfaction with Life Scale

Consider pairing intermittent measures of life satisfaction with more frequent well-being measures that are more variable and responsive to changes, such as present emotional experience.

Altruism

Feeling motivated to help others

Having recently helped others

High scores on assessments of empathy

Low scores on assessments of narcissism, Machiavellianism, and psychopathy

The relevance of altruism depends on the specific benefits and risks of the application. It is particularly relevant for platforms that influence social decision-making.

Tier 3: Less Direct Measures

User satisfaction

Level of satisfaction with a product or service

Likelihood of recommending a product or service to others

Caveat: User satisfaction is less directly linked to the ethics of the application than other well-being measures (e.g., consider user satisfaction with tobacco products).

Repeated informed consent

Informed consent solicited after a comprehensive summary of potential risks is reiterated to the user, such that consent can be viewed as a judgment that the benefits exceed the potential risks

Informed consent may be solicited regularly to ensure that users understand potential risks.

Ongoing consent is essential in early stages of testing, when data is insufficient to weigh benefits and risks.

Objective Proxies

Category
Measures
Best Practices

Tier 1: Day-to-Day Measures

Expressive behavior

Positive facial expressions, vocal utterances (e.g., laughter), speech prosody, language, or emotion-related body movement appropriate to a given social context

Low levels of negative facial expressions, vocal utterances, speech prosody, language, or emotion-related body movement that are not expected in a given social context (for exceptions, consider expressions of horror in response to a horror movie or sadness at a funeral)

Assessing the appropriateness of expressive behavior to a given social context depends on nuanced measurement. For instance, a triumphant shout may be appropriate on a football field but not in a comedy club (valence alone is often insufficient).

Expressive behavior can be measured more frequently and passively than self-report and is less subject to self-enhancement bias and demand characteristics.

To weigh behaviors appropriately, measure alongside more intermittent self-report measures.

Harmful behaviors

Self-harm

Substance abuse

Antisocial behavior (e.g., physical harm, hate speech, environmental pollution)

Where feasible, reductions in harmful behavior can be used as an objective proxy to accompany self-report measures of well-being.

Tier 2: Longer-Term Measures

Physical health indicators

Low rate of illness, injury, and/or mortality

Sleep quality

Healthy levels of inflammatory cytokines, cholesterol, blood pressure, and other objective indicators of physical health

Where feasible, physical health indicators, particularly those closely linked to emotional well-being (e.g., sleep quality, inflammatory cytokines), can be used as objective proxies to accompany self-report measures.

Goal attainment

Academic performance metrics

Career advancement metrics

Metrics of goal attainment are most relevant when they represent users’ direct goals in deploying empathic AI.

Negative life outcomes

Incarceration rates

Job loss

Poverty rates

Where feasible, reductions in negative life outcomes can be used as a strong objective proxy to accompany self-report measures of well-being.

Accurate beliefs

Agreement with facts that are backed by near-universal scientific or expert consensus

False beliefs are of most concern for applications that control the spread of information, such as search engines and social media, and can be costly to individual and societal well-being.

Tier 3: Less Direct Measures

Emotion-related physiology

Physiological changes associated with emotion in a given context, such as pupil dilation, sweating, and heart rate variability

Physiological indicators of emotion, such as autonomic activity, are highly context-dependent and must be interpreted with care.

Standards of Measurement

The following are recommended standards for ongoing measurement of well-being within an application or condition of an A/B test, by risk level and number of users. We recommend discontinuing any application of empathic AI whose benefits do not substantially outweigh its costs for the well-being of users and other affected parties.

Low Risk

Low-risk area AND used for <30 min. per day on average.

<100 daily users: None (informed consent is sufficient).

100-1K daily users: Track at least one measure of well-being across a minimum of 20 randomly sampled users on at least a weekly basis.

1K-10K daily users: Track at least one Tier 1 measure of well-being across a minimum of 50 randomly sampled users on at least a weekly basis.

10K-100K daily users: Track at least two measures of well-being, including a Tier 1 measure, across a minimum of 200 randomly sampled users on at least a weekly basis.

100K-1M daily users: Track at least two measures of well-being, including a Tier 1 measure, across a minimum of 400 randomly sampled users on a daily basis.

1M+ daily users: Track at least two Tier 1 measures of well-being across a minimum of 2,000 randomly sampled users on a daily basis. Provide public access to anonymized results.

Medium Risk

Medium-risk area OR used for 30 min. to 1 hour per day on average.

<100 daily users: Track at least one measure of well-being in every user on at least a weekly basis.

100-1K daily users: Track at least one Tier 1 measure of well-being across a minimum of 40 randomly sampled users on at least a weekly basis.

1K-10K daily users: Track at least two measures of well-being, including a Tier 1 measure, across a minimum of 100 randomly sampled users on a daily basis.

10K-100K daily users: Track at least three measures of average well-being, including a Tier 1 measure, across a minimum of 300 randomly sampled users on a daily basis.

100K-1M daily users: Track at least three measures of average well-being, including two Tier 1 measures, across a minimum of 1,000 randomly sampled users on a daily basis. Provide public access to anonymized results.

1M+ daily users: Track at least three measures of average well-being, including two Tier 1 measures, across a minimum of 5,000 randomly sampled users on a daily basis. Provide public access to anonymized results.

High Risk

High-risk area OR used for >1 hour per day on average.

<100 daily users: Track at least one Tier 1 measure of well-being in every user on at least a weekly basis.

100-1K daily users: Track at least two measures of well-being, including a Tier 1 measure, in every user on at least a weekly basis.

1K-10K daily users: Track at least three measures of well-being, including a Tier 1 measure, across a minimum of 500 randomly sampled users on a daily basis.

10K-100K daily users: Track at least four measures of well-being, including two Tier 1 measures, across a minimum of 1,000 randomly sampled users on a daily basis. Provide public access to anonymized results.

100K-1M daily users: Track at least four measures of well-being, including two Tier 1 measures and another Tier 1 or 2 measure, across a minimum of 4,000 randomly sampled users on a daily basis. Provide public access to anonymized results.

1M+ daily users: Track at least four measures of well-being, including two Tier 1 measures and another Tier 1 or 2 measure, across a minimum of 10,000 randomly sampled users on a daily basis. Provide public access to anonymized results.
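
The risk levels and user-count ranges above can be read as a lookup grid. The sketch below is one interpretation, not an official implementation: it assumes the overall risk level is the higher of the area risk and the usage-time risk, that each user-count range excludes its upper bound, and that the requirements are listed by risk level in the order Low, Medium, High. It covers only the minimum number of tracked users, and all function and variable names are hypothetical.

```python
from bisect import bisect_right

RISK_ORDER = ["low", "medium", "high"]

# Upper bounds (exclusive, by assumption) of the daily-user ranges.
USER_BOUNDS = [100, 1_000, 10_000, 100_000, 1_000_000]
USER_LABELS = ["<100", "100-1K", "1K-10K", "10K-100K", "100K-1M", "1M+"]

# Minimum tracked users per range: None = informed consent only,
# "all" = every user, otherwise a count of randomly sampled users.
MIN_SAMPLE = {
    "low": [None, 20, 50, 200, 400, 2_000],
    "medium": ["all", 40, 100, 300, 1_000, 5_000],
    "high": ["all", "all", 500, 1_000, 4_000, 10_000],
}


def usage_risk(minutes_per_day):
    """Risk level implied by average daily usage time alone."""
    if minutes_per_day < 30:
        return "low"
    if minutes_per_day <= 60:
        return "medium"
    return "high"


def risk_level(area_risk, minutes_per_day):
    """Overall risk: the higher of the area risk and the usage risk."""
    return max(area_risk, usage_risk(minutes_per_day), key=RISK_ORDER.index)


def user_bucket(daily_users):
    """Label of the daily-user range that `daily_users` falls into."""
    return USER_LABELS[bisect_right(USER_BOUNDS, daily_users)]


def min_sample(area_risk, minutes_per_day, daily_users):
    """Minimum number of users whose well-being should be tracked."""
    level = risk_level(area_risk, minutes_per_day)
    return MIN_SAMPLE[level][bisect_right(USER_BOUNDS, daily_users)]


# A low-risk area used 45 minutes per day by 5,000 daily users.
print(risk_level("low", 45), user_bucket(5_000), min_sample("low", 45, 5_000))
```

Taking the higher of the two risk signals follows from the AND/OR wording of the definitions: an application is low risk only if both the area and the usage time are low.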
