Smartwatch- and smartphone-based remote assessment of brain health and detection of mild cognitive impairment

Overview
The Intuition brain health study (NCT05058950) was a prospective, observational and decentralized study in which all activities were mediated via a custom research iPhone application (Study App) and there were no in-person visits. The study was sponsored by Biogen, Inc., in collaboration with Apple, Inc. Scientific leadership for the design, analyses and communication of study results was guided by an external Scientific Committee comprising clinicians, researchers, and experts in ethics, technology and patient advocacy. The Scientific Committee members were engaged to provide independent scientific input and guidance for the duration of the study.
Participants aged 21 to 86 years residing in the United States were recruited using digital and online recruitment strategies directed by healthcare practitioners, researchers and the joint Apple–Biogen study team. Individuals were guided to the study website, where they viewed the Institutional Review Board (IRB)-approved study information and could elect to participate by providing e-consent through the Intuition Study App on the iPhone (Fig. 1). General eligibility required prospective participants to have an iPhone 8 or newer running the latest version of iOS and to be willing to wear a study-provisioned Apple Watch (Supplementary Fig. 1). As part of the compensation for participation, participants could later own the Apple Watch by completing study tasks. Three categories of participants were recruited into seven cohorts based on age and cognitive status (Supplementary Fig. 2): individuals presumed to be cognitively intact (Controls), individuals with prominent subjective cognitive complaints (SCC) and those self-reporting or known to have a medical diagnosis of mild cognitive impairment (MCI). The largest category was the three Control cohorts (n ≈ 18,000+), including at least 6,000 participants in Early and Middle Adulthood (Controls-EM) aged 21–59 years and 12,000 in Late Adulthood (Controls-L), the latter divided into those at low or high risk for cognitive decline based on prespecified risk factor criteria (Controls-L Low- and Controls-L High-Risk; Supplementary Table 1). The second category comprised those aged 50–86 years with concern for new decline in cognitive function compared with 1 year before study enrollment (SCC cohort), as defined by a prespecified threshold score on the validated 14-item CFI baseline screening questionnaire (total score ≥ 4) (n ≈ 2,000). The third category included three cohorts of MCI (n ≈ 2,000): those in early and middle adulthood (MCI-EM) aged 21–49 years who self-reported receiving a diagnosis of cognitive impairment, and those in late adulthood aged 50–86 years who either self-reported an MCI diagnosis (MCI) or were referred/identified by clinical sites or medical record review with documented and clinically confirmed MCI status (MCI-CC). For further details on cohorts and eligibility, see Supplementary Section 1. Although MCI was developed as a clinically diagnosed syndrome intended to identify those at risk of progression to dementia, we took a wider view and included those at risk of cognitive impairment at any adult age and of any cause. The purpose was to better understand varieties of typical and atypical cognitive health issues in real-world heterogeneous populations with medical, psychiatric and neurological comorbidities.
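To make the cohort structure concrete, the following is a minimal sketch of the assignment rules implied by the description above, assuming hypothetical field names and a simplified rule precedence (for example, the exact Controls-EM/Controls-L age boundary); the authoritative criteria are given in Supplementary Section 1 and Supplementary Table 1.

```python
# Hypothetical sketch of cohort assignment based on the criteria described
# above; field names, rule order and the Controls-EM/Controls-L age boundary
# are simplifications, not the study's actual implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScreenedParticipant:
    age: int                        # eligibility requires 21-86 years
    cfi14_total: float              # baseline CFI-14 total score
    self_reported_mci: bool
    clinically_confirmed_mci: bool  # documented via sites/record review
    high_risk: bool                 # prespecified risk factor criteria

def assign_cohort(p: ScreenedParticipant) -> Optional[str]:
    if not 21 <= p.age <= 86:
        return None  # outside the eligible age range
    if p.clinically_confirmed_mci and p.age >= 50:
        return "MCI-CC"
    if p.self_reported_mci:
        return "MCI" if p.age >= 50 else "MCI-EM"
    if p.age >= 50 and p.cfi14_total >= 4:
        return "SCC"  # prominent subjective cognitive complaints
    if p.age <= 59:
        return "Controls-EM"
    return "Controls-L High-Risk" if p.high_risk else "Controls-L Low-Risk"
```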
Participants were recruited regardless of whether they were preexisting Apple Watch users, and all cohorts were eligible and encouraged to order, pair and wear the study-provisioned Apple Watch after completing baseline enrollment. The digital engagement strategy deployed in the Study App was designed to fit alongside individuals’ daily activities and required, on average, about an hour per month to carry out study-related active tasks. Participants were asked to complete surveys and questionnaires on health and habits, perform cognitive assessments (for example, ‘memory and thinking activities’) both in and out of the Study App, and were provided with educational content intended to raise brain health awareness (Extended Data Figs. 7 and 8).
Participants were informed of the overarching aims of the study: to help researchers investigate the role an Apple Watch and iPhone could play in measuring changes in thinking and memory, and to study changes in brain health over time, which may occur normally as people age or could be an early indicator of certain forms of dementia, such as AD. With a decentralized framework, we surmised that the study addressed important challenges with the current paradigm of conducting clinical studies by recruiting and engaging participants in brain health research directly and by facilitating more patient-oriented approaches integrated into everyday life, which encouraged a broader, more diverse audience and democratized participation. The new study design advanced and aligned with the objectives of the 21st Century Cures Act according to the US Food and Drug Administration, which seeks to promote medical advances, evolve the traditional model of trial design and probe the value of real-world data to improve brain health outcomes. With the Intuition brain health study DCT framework, burdens to participate were minimized, with data collection and tracking seamlessly integrated into the devices of everyday life (for example, mobile apps and wearables). The core objectives of the Intuition study were to classify MCI using multimodal passive and interactive data collected from the Apple Watch and iPhone and to characterize cognitive trajectories in individuals at risk for longitudinal decline.
Study design
Intuition enrollment and study flow consisted of four stages described below and depicted in Fig. 1. The planned study observation window was 24 months per participant from the time of enrollment. The Intuition brain health study opened enrollment on 20 September 2021 and, following 24 months of data collection, was closed on 20 September 2023, earlier than planned, owing to changing program priorities at Biogen. With at least 12 months of data from 82.3% (n = 18,934) of participants, 92.9% of whom provided at least 4 h of daily Apple Watch wear data, the study collected adequate data to address the key objectives outlined in the study design and statistical analysis plan.
Recruitment stage
We worked with vendors and commercial, research and academic entities in each of these principal strategic areas to facilitate broad recruitment:
(1) Email campaigns, using both broad and demographically focused approaches;
(2) Word-of-mouth study referrals;
(3) Web search and social media advertisements;
(4) Community health and advocacy events;
(5) Intuition Study website and Apple App Store traffic;
(6) Referrals from study sites and identification by diagnostic codes from research, medical and/or claims databases (for example, MCI-CC).
Strategies were adaptive; for example, as targeted email campaigns filled demographically defined cohorts, we shifted email recruitment toward the demographic characteristics required for cohorts that remained open for enrollment. The study website contained a variety of IRB-approved materials about the study, eligibility criteria and expectations for study participation, and was updated with key information as the study proceeded. For additional information about recruitment approaches, including details about clinically confirmed MCI, see Supplementary Section 1.
Screening, e-consent and eligibility stage
Interested individuals were directed to the Apple App Store to download the Intuition Study App and initiate screening. The core initial screening eligibility criteria (Supplementary Fig. 1) were: age 21–86 years, primary residence in the United States for the study duration, educational attainment of eighth grade or higher, fluency in spoken and written English, use of an active iPhone 8 (released in Fall 2017) or newer running the latest iOS, access to Wi-Fi or hardwired internet with a desktop computing device (Mac or Windows) or iPad, willingness to wear an Apple Watch and an active email address for use in study communications. After demonstrating initial eligibility, potential participants advanced by providing IRB-approved e-consent, which included an explanation and overview of the data to be collected; participants confirmed their understanding of the study and acknowledged their willingness to participate. Email contact information was confirmed, and identity was verified. Next, participants provided self-reported responses to questions in the Study App that evaluated cognitive and risk factor status to determine potential cohort eligibility. Participants were given the ability to share the relevant Health Kit and Sensor Kit data streams and to receive study notifications before moving on to the onboarding stage.
Onboarding stage
Participants were oriented to the Study App, including the tasks, points and rewards, and profile sections. Next, an onboarding ‘new user experience’ ensued, which formally welcomed individuals, provided an overall study overview and explained why participation and contribution to research were important. A curriculum timeline for study activities was presented, including baseline surveys on health and habits, the baseline cognitive assessment (that is, the 30-min computerized CANTAB battery), Watch ordering and features, and educational material on cognition and brain health. To encourage Apple Watch wear, introductions were provided on how to pair the Watch once received, and participants were incentivized to engage in a weekly ‘Stand challenge’ and to set up sleep tracking. In addition to monthly CANTAB batteries, participants were asked every 3 months to perform 2 weeks of high-frequency ‘burst’ cognitive testing in the Study App on the iPhone. For these bursts, an introduction, assessment tutorial, scheduling and practice opportunities were offered.
Baseline enrollment stage and beyond
After completion of the onboarding stage, including the new user experience, participants advanced to baseline enrollment status by attempting the out-of-App CANTAB computerized cognitive battery on a personal computing device (for example, iPad, laptop or desktop computer). Completion of CANTAB triggered the Apple Watch provisioning and shipping process. Participants paired the Apple Watch with the iPhone, started sharing Watch-based passive study data and began to earn points for completing activities such as the weekly Stand challenges and sleep tracking. We deployed points for study task completion strategically to drive engagement over the course of the study and to serve as a mechanism for participant compensation. Completing baseline enrollment provided subjective and objective cognitive health data to define phenotypes for longitudinal comparison. Cycles of interactive data collection ensued (Extended Data Fig. 8), and passive data capture occurred in the background of typical daily device use. All participants accumulated and received points for their time participating in study-related activities and reached the ‘Watch Goal’ by completing approximately 40% of routine study activities. The ‘Watch Goal’ gave participants the opportunity to keep the Apple Watch after study completion or after voluntary withdrawal once the goal had been reached. Points could be redeemed in an ongoing fashion through the Study App for monetary rewards, up to US$280 of possible compensation with high adherence to study tasks.
Data sources and measurement approach
Interactive cognitive measurements
Extended Data Table 1 provides an overview of the types of data captured, with descriptions of source, activity and cadence of sampling. These cycles persisted over the study observation window (Extended Data Fig. 8). With the DCT approach there were no traditional brick-and-mortar site-based requirements. All study data were collected digitally and included Study App engagement information and participant-entered self-report data related to demographics; health, lifestyle and habits; global and mental health; and cognitive health. The overall interactive cognitive measurement approach (Supplementary Section 2) included six main areas:
(1) SCC surveys: in-app, biannual (CFI-14, E-Cog-12);
(2) Monthly CANTAB: out-of-app 30-min computerized battery;
(3) Quarterly Cam-Cog burst: in-app high-frequency testing for 2 weeks, three times daily;
(4) Quarterly language: in-app 5-min custom battery with recorded voice;
(5) Tele-research: out-of-app, event-based tele-visit triggered by prespecified criteria to evaluate cognitive health status and medical comorbidities and to perform a tele-MoCA;
(6) Context of cognition: in-app baseline, quarterly and biannual surveys.
To complement the subjective cognitive assessments, participants completed computerized neuropsychological tests monthly and quarterly. Monthly assessments included five tests from the CANTAB—a tool with over 30 years of application in neuropsychological and clinical research62. The five tests were the PRMi, PRMd, PAL, SWM and MTS tasks. These tests were chosen specifically for their broad coverage of key cognitive domains, including visual short-term episodic, recognition and working memory, as well as processing speed, complex attention and executive functioning. All tests selected were based on unique visuospatial stimuli, agnostic to language and/or culturally mediated effects. Unique stimuli (for example, parallel forms) were used in each assessment to obviate overt practice effects. Previous research indicates that CANTAB tests demonstrate orthogonality among outcome measures, suggesting they may capture distinct aspects of cognition63. In addition, correlational analyses between CANTAB and traditional paper-and-pencil neuropsychological assessments have identified moderate relationships and some overlapping cognitive domain structures, such as PAL/PRM outcomes with episodic learning/memory64. Moreover, the selected tests have clinical validity for measuring progressive changes associated with human aging65,66, exhibiting sensitivity to the early identification and trajectories of cognitive decline in both cross-sectional and longitudinal studies66,67. Specifically, the five CANTAB tests selected (PRMi, PRMd, PAL, SWM and MTS) have demonstrated clinical validity and performance differences across several neurodegenerative and neuropsychiatric disorders with moderate-to-high effect sizes, including MCI, AD and related dementias67,68,69, Parkinson’s disease70,71 and clinical depression72. Different therapeutic areas benefit from unique combinations of CANTAB tasks, with the five selected for our study chosen for their utility in stratifying age-associated cognitive changes from MCI and AD trajectories73.
The monthly assessments were deployed through web browsers on participants’ preferred personal computing devices, an approach shown to be ecologically valid for remote cognitive assessment and equivalent to in-clinic, supervised environments74,75. Furthermore, the tasks have multiple parallel/alternate forms to reduce learning effects and incorporate stepwise difficulty levels to mitigate floor and ceiling effects, particularly among clinical cohorts, thereby enhancing the sensitivity and specificity of the cognitive assessments. Once launched in the web browser, the CANTAB battery uses an English recorded voice for instructions, but the contents of the tests are all nonverbal and have been validated in non-English-speaking populations, such as Spanish speakers76,77.
In the quarterly Cambridge Cognition (Cam-Cog) burst assessment, participants were invited to complete an N-Back task (that is, 2-Back) in addition to the Digit Symbol Substitution Test (DSST) on their personal iPhones. These tests were captured in a high-frequency paradigm, allowing users to provide a snapshot of their cognition up to three times per day, across 2 weeks each quarter (see Extended Data Fig. 8 for further information). Emerging research has demonstrated the utility of smartphone testing for better approximating a user’s cognitive function, enhancing ecological validity as well as the clinical utility of performance features such as the learning curve and intra-individual variability in cohort stratification47,78. Further information on each cognitive task is given in the following paragraphs.
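As a minimal illustration of the burst-level performance features mentioned above (learning curves and intra-individual variability), the sketch below aggregates hypothetical session-level burst records per participant; the column names and feature definitions are assumptions, not the study's actual pipeline.

```python
# Sketch: summarizing high-frequency "burst" cognitive data per participant.
# Column names (participant_id, session_time, rt_ms, accuracy) are
# hypothetical; the study's actual data schema is not public.
import pandas as pd

def burst_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-participant burst features: mean performance, intra-individual
    variability (coefficient of variation of reaction time) and a crude
    learning signal (last minus first session accuracy)."""
    g = df.sort_values("session_time").groupby("participant_id")
    out = pd.DataFrame({
        "mean_rt_ms": g["rt_ms"].mean(),
        # Intra-individual variability of reaction time
        "rt_cov": g["rt_ms"].std() / g["rt_ms"].mean(),
        "mean_accuracy": g["accuracy"].mean(),
        # Simple learning-curve proxy across the 2-week burst
        "learning_delta": g["accuracy"].last() - g["accuracy"].first(),
    })
    return out.reset_index()

# Usage: features = burst_summary(burst_df)
```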
The PRM task assesses visual learning and recognition memory across two phases. The immediate phase begins with the sequential presentation of 18 abstract, nonsemantic images to learn. The task then proceeds to a forced-choice recognition test in which users must select between a previously seen pattern and a visually similar foil. Participants then perform the delayed recognition task at the end of the full assessment battery. Total percent correct is a key outcome measure for each task phase.
PAL assesses visual-spatial learning and episodic memory, requiring participants to recall the locations of abstract, nonsemantic images. The task increases in difficulty from 2-box patterns to 4-, 6-, 8- and 12-box patterns. Failing a level after four attempts terminates the task, preventing users from proceeding to higher difficulty levels. Total error counts are a key outcome measure for the task.
The SWM task assesses users’ working memory and executive function through the acquisition and manipulation of spatial information. Through a process of elimination, users must open boxes to collect hidden tokens, but a token never appears in the same box twice. Participants must find all the tokens, ideally without reopening any boxes that have previously contained one. The task increases in difficulty from 4 to 6, 8 and 12 tokens. Total error counts and strategy scores are key outcome measures for the task.
The MTS task assesses processing speed and complex attention by requiring participants to find the exact match of an abstract, nonsemantic target image from an array of visually similar variants and foils. Trials are pseudorandomized across difficulty levels of 1-, 2-, 4-, 6- and 8-pattern choices, requiring participants to scan each option and choose the correct response as quickly as possible. The task involves a speed–accuracy trade-off, with reaction times and accuracy measures being key outcomes.
For further descriptions, see ‘Cognition measurement approach’ in Supplementary Section 2.
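As a small illustration only, the sketch below derives the key outcome measures named in the task descriptions above (percent correct, total errors and reaction times) from hypothetical trial-level records; the actual CANTAB export schema and scoring rules differ in detail.

```python
# Sketch: deriving the key outcome measures named above from hypothetical
# trial-level CANTAB records. Column and value names are illustrative.
import pandas as pd

def key_outcomes(trials: pd.DataFrame) -> pd.DataFrame:
    """trials: one row per trial with columns participant_id,
    task ('PRM', 'PAL', 'SWM' or 'MTS'), correct (bool) and,
    for MTS, rt_ms (float; NaN elsewhere)."""
    g = trials.groupby(["participant_id", "task"])
    out = pd.DataFrame({
        "percent_correct": g["correct"].mean() * 100,            # PRM phases
        "total_errors": g["correct"].agg(lambda s: (~s).sum()),  # PAL/SWM
        "median_rt_ms": g["rt_ms"].median(),                     # MTS speed
    })
    return out.reset_index()
```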
Passive measurements
Multimodal data from the iPhone and Apple Watch can measure across a diverse array of human functions, such as sensorimotor, behavioral, physiologic and autonomic functions. These data streams are signals with varying sampling frequencies, from event-based sensing (for example, number of running workouts) to routinely scheduled sampling (for example, daily volume of calls received) and near-continuous high-frequency signals (100 Hz IMU accelerometer). The Sensor Kit system (https://developer.apple.com/documentation/sensorkit) collects information using various sensors on the iPhone and Watch, and computes features derived from sensor information using proprietary algorithms. The main areas covered by the Sensor Kit include device and application usage, keyboard metrics, message and phone use, sound and speech detection, facial metrics, odometer and locations, and the x–y–z coordinates of a body’s acceleration and angular velocity from triaxial accelerometer and gyroscope measurements. Health Kit records a variety of health metrics, both sensed and manually entered, from first- and third-party sources. Examples include physical activity (for example, exercise minutes, active calories burned, step counts, and so on), different types of workouts (for example, running or rowing), time spent and energy burned, walking speed and asymmetry, heart rate and variability, VO2 max, respiratory rate, oxygen saturation and sleep (https://developer.apple.com/documentation/healthkit).
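To illustrate how a near-continuous stream might be reduced to features that can sit alongside event-based and daily-sampled streams, the following is a sketch of aggregating a hypothetical 100 Hz triaxial accelerometer stream into daily summaries; the feature choices (a simple movement-intensity signal) are illustrative, not the study's actual derivations.

```python
# Sketch: reducing a 100 Hz triaxial accelerometer stream to daily features.
# Column names (t, ax, ay, az) and thresholds are hypothetical.
import numpy as np
import pandas as pd

def daily_accel_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: columns t (datetime64), ax, ay, az in g units at ~100 Hz."""
    df = df.set_index("t")
    # Vector magnitude of acceleration; subtracting 1 g crudely removes the
    # gravity component, leaving a movement-intensity signal.
    vm = np.sqrt(df["ax"]**2 + df["ay"]**2 + df["az"]**2)
    enmo = (vm - 1.0).clip(lower=0.0)
    daily = pd.DataFrame({
        "mean_enmo": enmo.resample("1D").mean(),
        "p95_enmo": enmo.resample("1D").quantile(0.95),
        # Seconds above an arbitrary activity threshold (100 samples = 1 s)
        "active_seconds": (enmo > 0.1).resample("1D").sum() / 100.0,
    })
    return daily.dropna()
```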
Ethics, privacy and data storage
The Intuition study was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki and complied with all applicable regulations and guidance, including but not limited to International Council for Harmonization (ICH) and Good Clinical Practice (GCP) guidelines. This study was approved by the IRB, Advarra (Study ID 285PI401, Board no. 00000971). All participants in the study provided informed consent electronically and remotely via the Study App.
Secure frameworks were developed to meet the security standards set forth in applicable law, including the deployment of technology and data security processes with vulnerability monitoring and penetration testing. Study data, including any protected health information, were stored in encrypted form at rest and in transit, following National Institute of Standards and Technology guidelines outlined by the Joint Task Force Transformation Initiative (2013).
The Study App used for enrollment, eligibility screening and active task administration, as well as the platform used for data collection and monitoring, employed physical, organizational and technical safeguards designed to protect the confidentiality, security and integrity of the data collected. For example, data were encrypted for transmission and storage following guidelines recommended by the US Department of Commerce National Institute of Standards and Technology Federal Information Processing Standard Publication 140-2, which outlines security standards for securing data containing health information.
Study oversight
Scientific, ethical and clinical leadership was guided by a Scientific Committee consisting of recognized leaders in the fields of clinical research, neurology, psychiatry and medicine, technology and wearable devices, real-world evidence and biostatistics, bioethics and patient advocacy. The key roles and responsibilities of the Committee were to oversee and provide input on the conduct of the trial, monitor study progress, provide guidance related to recruitment, retention and attrition, and contribute to data analyses and dissemination strategies for scientific results. The Committee members were engaged to provide independent scientific input and guidance for the duration of the study.
Study objectives
The co-primary objectives of the study were:
(1a) To develop and validate a classifier using multimodal passive sensor data and metrics derived from normal iPhone and Apple Watch usage to distinguish individuals with normal cognition from those with MCI.
(1b) To develop and validate a cognitive health score that tracks fluctuations in cognitive performance over time using multimodal passive sensor data and metrics derived from normal iPhone and Apple Watch usage.
The secondary objective of the study was:
(2) To develop a prediction model that uses multimodal passive sensor data and metrics derived from normal iPhone and Apple Watch usage to predict cognitive decline and/or conversion to MCI.
Sample sizes
Formal sample size calculation
Because the primary and secondary objectives of this study did not involve formal, prespecified hypothesis tests, traditional power analyses were not applicable. Two methods for calculating prediction model sample size were chosen: Hanley and McNeil79, to ensure precise estimation of the model AUROC; and Riley80, to guarantee precision in estimation of the overall outcome proportion (that is, MCI prevalence or incidence), low average prediction error and a low likelihood of model overfitting79,80.
For the Hanley method, sample sizes were computed to ensure an AUROC CI width of ≤0.05. For the approach based on Riley80, sample sizes were estimated according to the authors’ recommendations: to ensure a margin of error of ≤0.05 in the estimate of the outcome proportion, a mean absolute prediction error of ≤0.05 and small overfitting, defined by an expected shrinkage of predictor effects of 10% or less. Calculations assumed a range of AUC values and a study outcome proportion (MCI prevalence or incidence for the diagnostic or prognostic models, respectively) of 0.1.
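The Hanley and McNeil approach can be sketched as a search for the smallest case count whose estimated AUROC CI width meets the target. The sketch below assumes a two-sided 95% CI, an AUROC of 0.7 and an outcome proportion of 0.1, as in the text; because the study's exact calculation settings are not fully specified here, its output need not reproduce the reported figures.

```python
# Sketch of a Hanley-McNeil style sample-size search: find the smallest
# number of cases such that the 95% CI width for the AUROC is <= 0.05.
# Assumptions (two-sided 95% CI, prevalence 0.1, AUC 0.7) follow the text;
# the study's exact settings may differ.
import math

def auroc_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Standard error of the AUROC per Hanley & McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc**2 / (1.0 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)

def min_cases_for_ci_width(auc=0.7, prevalence=0.1, max_width=0.05, z=1.96):
    """Increase the case count until the CI width target is met."""
    n_pos = 10
    while True:
        n_neg = round(n_pos * (1 - prevalence) / prevalence)
        if 2 * z * auroc_se(auc, n_pos, n_neg) <= max_width:
            return n_pos, n_neg
        n_pos += 1

print(min_cases_for_ci_width())  # (required cases, corresponding controls)
```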
Estimated sample sizes for a range of assumed model AUCs are shown in Supplementary Table 9. With a conservative assumed model AUROC of 0.7, the number of MCI participants (or MCI converters for the prognostic model) to be included in the study was n = 360 based on Hanley and n = 564 based on Riley. A combined target MCI sample size of n = 722 from the population-based and clinically validated patient groups ensured sufficient numbers to develop a diagnostic model for the primary objective while holding out a large independent test set for model validation. After accounting for attrition and varied adherence (for example, 10–20%) to study protocol activities, the target MCI population size was set at n = 800–1,000 for those at risk for age-related neurodegenerative causes (that is, dementia). Similarly, for those in early and middle adulthood with variable reasons for cognitive impairment, we targeted enrollment of n = 800–1,000 for adequate model building and testing. For the control group sample sizes required for classifier development, we targeted N = 6,000 Controls-EM to pair with the MCI-EM and to account for the impact of attrition and variable adherence. To approach the secondary objective of the study and develop prognostic models of cognitive decline and MCI conversion, we set similar target sample sizes (n = 800–1,000) for anticipated Controls/SCC who might decline and/or convert to MCI. Based on epidemiologic projections, we estimated that with 1,500 Controls aged 50–59 years, 2,000 SCC, 6,000 Controls-L Low-Risk and 6,000 Controls-L High-Risk, approximately 500–550 participants would decline clinically each year. As with the diagnostic model, this number of estimated converters was sufficient to develop a well-powered prognostic model for MCI conversion while holding out a large independent test set. For further discussion of the epidemiologic calculations and references to previous studies and sample sizes for MCI classification, see Supplementary Section 3.
Statistical analyses
Overall approach
Because both the primary objective of a diagnostic MCI classifier and the secondary objective of a prognostic classifier primarily involve the use of baseline or near-baseline participant data—to predict current MCI status and future transition to MCI, respectively—the analytic approaches are similar. Participants will be split into a training dataset, used for all model development and tuning activities, and a testing dataset, set aside for independent model validation. Candidate statistical and machine learning models will be validated on the independent, held-out testing dataset of participants to ensure generalizability of the resulting model performance measures. In addition to assessing model performance, the importance of different features and sensor streams will be assessed to understand which sensor domains provide the most predictive utility for MCI classification and prognosis.
Special concerns must be accounted for in model development due to the nature of the study, with real-world, high-frequency data collection across a wide breadth of data modalities. These concerns include understanding and accounting for data missingness, choosing optimal sampling windows and temporal resolutions, and balancing interpretability with performance in what could be very complex models.
Intermediate steps are also necessary before achieving the primary and secondary objectives. These involve, but are not limited to: (1) assessment of adherence and data missingness across the active and passive data streams of the study; (2) characterization of the active unsupervised cognitive task data, including between- and across-task correlations, possible ceiling/floor effects and psychometric validation steps such as checking for expected demographic and clinical associations and assessing test–retest reliability and learning effects; (3) full examination of the passive data streams, many of which are exploratory in the context of cognition; and (4) data reduction steps, particularly for the high-volume passive sensor streams.
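For step (2), a minimal sketch of one such psychometric validation check, test–retest reliability with a simple learning-effect test between two administrations of the same unsupervised task, might look as follows; the column names are hypothetical.

```python
# Sketch: test-retest reliability and a crude practice-effect check between
# two administrations of an unsupervised task. Column names are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

def test_retest(df: pd.DataFrame, score_col: str = "score") -> dict:
    """df: one row per participant per administration, with columns
    participant_id, administration (1 or 2) and a score column."""
    wide = df.pivot(index="participant_id", columns="administration",
                    values=score_col).dropna()
    r, _ = stats.pearsonr(wide[1], wide[2])      # test-retest correlation
    # Paired t-test on administration 2 minus 1: a reliable positive shift
    # would suggest a practice/learning effect.
    t, p = stats.ttest_rel(wide[2], wide[1])
    return {"test_retest_r": r,
            "mean_practice_gain": float(np.mean(wide[2] - wide[1])),
            "practice_t": t, "practice_p": p}
```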
Initial approach to analyses and proof-of-concept MCI classifier
For this manuscript, we began MCI classification with readily interpretable statistical tests and modeling. For baseline characterization of cohorts by age and cognitive status, we applied analysis of variance and pairwise t-tests to group means and variances for selected key subjective (CFI/E-Cog) and objective (CANTAB) measures of cognition. For model building, we started with logistic regression in a subset of validated MCI (MCI-CC plus telehealth-confirmed MCI, N = 556) versus a large, diverse population aged 50–86 years with and without cognitive complaints (SCC plus Controls, N = 16,234). Predictor variables (total input feature space of N = 205) included core demographics of age, sex and education; baseline subjective and objective cognition as measured by CFI and E-Cog total and item-level scores; and N = 176 CANTAB outcomes based on the PRMi, PRMd, PAL, SWM and MTS assessments, all of which were scaled/standardized.

The proof-of-concept MCI classifier is a logistic regression model with ridge penalization (L2 regularization). The model incorporates all baseline CANTAB outcomes (objective cognitive performance measures), along with two subjective cognition surveys (CFI and E-Cog) and core demographic variables including age, sex and education level. The data were split into 80% for training and 20% for testing. To address class imbalance, the training data were resampled to a 3:1 majority-to-minority class ratio. The model was trained using 100 bootstrap resamples in the outer loop to enhance generalization and estimate the stability of the model. Within each bootstrap iteration, a grid search was employed in the inner loop to systematically explore a range of hyperparameters, specifically the regularization strength for ridge penalization, and identify the best-performing hyperparameter configuration. To further ensure robust evaluation, the inner loop applied stratified fivefold crossvalidation, which maintained class balance within each fold while testing different hyperparameter sets. This nested crossvalidation setup ensured that the hyperparameter tuning of the model was independent of the outer loop resampling, minimizing the risk of overfitting and optimizing performance on unseen data. Model accuracy and mean AUROC on the test dataset are reported with 95% CIs. A list of rank-ordered predictor coefficients supplements the results for model interpretability.
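The following is a minimal sketch of the proof-of-concept pipeline described above, using placeholder data; the hyperparameter grid, random seeds and helper names are assumptions rather than study settings.

```python
# Minimal sketch of the pipeline described above: ridge-penalized logistic
# regression, 80/20 split, 3:1 majority:minority undersampling of the
# training data, a 100-iteration bootstrap outer loop and an inner
# stratified 5-fold grid search over regularization strength.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder data; in the study, X held 205 scaled features (demographics,
# CFI/E-Cog scores and CANTAB outcomes) and y the MCI label.
X = rng.normal(size=(2000, 25))
y = (rng.random(2000) < 0.1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

def undersample_3to1(X, y, rng):
    """Resample to a 3:1 majority-to-minority class ratio."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep_neg = rng.choice(neg, size=min(len(neg), 3 * len(pos)), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

aucs = []
for b in range(100):  # outer bootstrap loop
    boot = rng.choice(len(y_tr), size=len(y_tr), replace=True)
    Xb, yb = undersample_3to1(X_tr[boot], y_tr[boot], rng)
    gs = GridSearchCV(
        LogisticRegression(penalty="l2", max_iter=5000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # C is inverse ridge strength
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=b),
        scoring="roc_auc",
    ).fit(Xb, yb)
    aucs.append(roc_auc_score(y_te, gs.best_estimator_.predict_proba(X_te)[:, 1]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"mean test AUROC {np.mean(aucs):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```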
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.