Comprehensive evaluation of the Functional Activities Questionnaire (FAQ) and its reliability and validity

The Functional Activities Questionnaire (FAQ) is a collateral-report measure of difficulties in activities of daily living (ADLs). Despite its widespread use, psychometric analyses have been limited in scope, piecemeal across samples, and confined primarily to classical test theory. This manuscript consolidated and expanded psychometric analyses using tools from generalizability and item response theories among 27,916 individuals from the National Alzheimer’s Coordinating Center database who completed the FAQ. Reliability was evaluated with internal consistency, test-retest, and generalizability analyses. Validity was assessed via convergence with neurocognitive measures, classification accuracy across impairment stages, and confirmatory factor and IRT analyses. Demographics did not meaningfully impact scores, and there was strong evidence for reliability (coefficients = 0.52-0.95), though coefficients were attenuated when range was restricted within diagnostic groups (e.g., normal cognition). There were strong correlations with neurocognitive measures (rs = −.30 to −.59), strong classification accuracy (AUCs = .81-.99), and a single-factor model had excellent fit. All items evidenced strong IRT discrimination and provided significant information regarding functional disability, albeit within a relatively restricted range. The FAQ is a reliable and valid measure of ADL concerns for use in clinical and research settings. It best assesses mild levels of functional difficulty, which is helpful in distinguishing normal cognition from mild cognitive impairment and dementia.

Keywords: Functional Activities Questionnaire (FAQ), functioning, activities of daily living (ADLs), psychometrics, reliability, validity, item response theory (IRT)

Assessment of Functioning in Aging and Dementia

Functional independence and the ability to participate in diverse meaningful activities are key to healthy aging. Activities of daily living (ADLs) can be grouped into domains, such as self-care, mobility, communication, learning/applying knowledge, domestic life, community and civic life, and interpersonal interactions/relationships (World Health Organization, 2001). ADLs can also be hierarchically organized into basic, instrumental, and advanced activities (Reuben & Solomon, 1989). Basic ADLs (BADLs) comprise activities that meet physical needs and sustain rudimentary autonomy, such as bathing, grooming, and toileting. Instrumental ADLs (IADLs) encompass activities that facilitate independent living in industrialized society, such as bill payment, health maintenance, meal preparation, transportation competence, and household chore completion. Advanced ADLs (AADLs) comprise “luxury” functioning that develops individual and community identity and enhances quality of life. Examples include occupational roles, hobbies, financial investments, and social engagement. Traditionally, BADLs and IADLs have been emphasized in medicine, and they also tend to be structured and routinized. AADLs are less routinized, tend to rely on higher levels of fluid cognition, and have most recently received attention in the fields of aging and neurodegeneration (Cornelis, Gorus, Van Schelvergem, & De Vriendt, 2019).

As emphasized in the WHO framework, ADLs can be limited by impairment in bodily structures and functions, which leads to physical and cognitive deficits (WHO, 2001). Neurodegenerative disease is a primary cause of functional limitations in the elderly. In fact, ADL dependence is fundamental to a diagnosis of dementia (i.e., major neurocognitive disorder) and guides cognitive staging across the diagnostic continuum (American Psychiatric Association, 2013; Jack et al., 2018; McKeith et al., 2017). While mild cognitive impairment (MCI) was originally conceptualized as a diagnostic stage without ADL changes, there is now a robust literature documenting functional limitations in MCI (Jekel et al., 2015). In fact, newer diagnostic criteria and staging frameworks allow MCI’s characteristic cognitive and neurobehavioral impairment to impact functioning (limited to AADLs and IADLs), although the individual should still be able to function independently, despite doing so with increased difficulty (Jack et al., 2018). Accurately assessing ADLs is also important to cognitive and lifestyle interventions, as targeting them can allow people to maintain functioning despite cognitive decline. Furthermore, baseline functional difficulty appears to moderate the response to cognitive remediation and brain health lifestyle interventions (Amofa et al., 2019; Denny et al., 2020).

Since assessment of functional impairment is critical to accurate diagnosis and effective intervention, several measures have been promulgated; these typically fall into three broad categories: self-report, collateral-report, and performance-based. Because limitations in awareness are a common feature of neurodegenerative conditions, collateral-report forms have emerged as a time-efficient method for measuring functional concerns; they typically show stronger relationships with cognition and characterize functioning more accurately than self-report measures (Gold, 2012).

Pfeffer’s Functional Activities Questionnaire

One of the most commonly used measures of functional status is the Functional Activities Questionnaire (FAQ; Lindeboom et al., 2003; Pfeffer et al., 1982), which has been included in large efforts such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and National Alzheimer’s Coordinating Center (NACC). Pfeffer and colleagues (1982) developed the measure in response to the focus on BADLs in contemporary measures, which did not adequately sample the range of “social functioning” in community-living elders (Pfeffer et al., 1982). As such, the measure’s ten items primarily assess IADLs and some AADLs from different ADL domains, including: bill payment, tax management, cooking, hobbies, tracking current events, traveling, and remembering appointments and medications. These items are typically presented to a close collateral, who rates the items on a 4-point ordinal scale, with higher scores indicating more difficulty.

The FAQ was designed to be part of a functional capacity index and was initially validated in a sample of 195 elders from a retirement community. In this sample, it showed moderate-to-strong convergent validity with another collateral report of BADLs (i.e., the Lawton), a cognitive screener (i.e., the Mini-Mental State Examination), and two brief measures of executive functioning/attention (i.e., Symbol Digit Modalities and Raven’s Progressive Matrices). Using a criterion of being completely dependent on two or more activities, equating to a cutoff of ≥6, the FAQ showed adequate sensitivity (.85) and specificity (.81) against a neurologist’s diagnosis of dementia. There was also a suggestion of criterion validity, as characterized by a strong correlation (r = −.83) with neurologist ratings of functioning. However, there is concern for possible criterion contamination, as not all neurologists were blinded to the same collateral sources.

Since then, several studies have analyzed the classification and predictive accuracy of the FAQ (Devanand et al., 2008; Jacinto, Brucki, Porto, Martins, & Nitrini, 2012; Mejia, Gutiérrez, Villa, & Ostrosky-Solís, 2004; Tabert et al., 2002; Teng et al., 2010; Yin et al., 2020). In differentiating dementia from normal cognition (NC) with a modified 11-item, Spanish FAQ, a cutoff of ≥7 was found to have lower sensitivity (.36-.71) than specificity (.79-.98) in a sample of Mexican elders with 0-9 years of education (Mejia et al., 2004). Among individuals with MCI and mild dementia selected from a previous NACC data freeze, a cutoff of ≥5 was found to have adequate sensitivity (.80) and specificity (.87; Teng et al., 2010). In a sample of Brazilian elders using a modified Portuguese FAQ, Jacinto and colleagues (2012) evaluated different diagnostic borderlands and found cutoffs of ≥2 for NC/cognitive impairment (sensitivity = .88, specificity = .90), ≥3 for NC/dementia (sensitivity = 1.00, specificity = .94), and ≥5 for MCI/dementia (sensitivity = .90, specificity = .63). In a sample of rural Chinese elders, a Mandarin FAQ cutoff of ≥6 had adequate specificity (.75) and sensitivity (.94) for differentiating NCs from dementia (Yin et al., 2020). However, to maintain similar accuracy, the cutoffs had to be modified for those older than 75 (≥11), for those with more education (≥8; a finding opposite to what one would expect), and among men (≥9). Using a modified dichotomized FAQ scoring (no problem vs. some problem), a cutoff of ≥2 had relatively low sensitivity (.61) and specificity (.57) for predicting conversion from MCI to dementia within two years (Tabert et al., 2002); however, including this dichotomized FAQ significantly improved the predictive accuracy of a multivariate model (Devanand et al., 2008).

To date, the only study after Pfeffer and colleagues (1982) to expressly evaluate reliability and validity did so in Brazil using a modified Portuguese FAQ, which includes three items that differ from the original measure (Assis, de Paula, Assis, de Moraes, & Malloy-Diniz, 2014). They found adequate internal consistency (.91) and convergent validity with another collateral ADL measure (.85 with its IADL subscale, but .23 with its BADL subscale). They also found small-to-moderate effects (rs = 0.24-0.57) with cognitive screeners and brief scales (e.g., Dementia Rating Scale, Clock Drawing Test, MMSE, Frontal Assessment Battery). Using a principal components analysis, they found a two-factor structure consisting of 1) more routinized ADLs and 2) ADLs that rely on episodic memory or social responsivity.

There have also been studies that discussed aspects of the FAQ’s reliability and validity as part of other research foci. In looking at self-report versions in English and Spanish, some studies have found strong internal consistency (.81-.86; Contador et al., 2020; Tappen et al., 2010). One study looked at test-retest reliability over the course of a year and found low-to-adequate correlations for NCs (.73), MCI (.81), and dementia (.63), and recommended that a change of 3.6 points on the FAQ be considered reliable change (Malek-Ahmadi et al., 2015). Several studies have found small-to-large effect sizes for the MMSE and FAQ (Albert et al., 1999; Contador et al., 2020; Devanand et al., 2008; Laks et al., 2005; Marshall et al., 2011; Tappen et al., 2010; Tekin, Fairbanks, O’Connor, Rosenberg, & Cummings, 2001). One other study looked at cognition in greater detail and found moderate correlations with measures of attention/executive functioning (i.e., Trail Making Test, Digit Symbol Modalities Test) and memory (i.e., Rey Auditory Verbal Learning Test; Marshall et al., 2011). In addition to the Assis and colleagues (2014) study, one other study evaluated the latent structure of the FAQ using a confirmatory factor analysis (CFA) approach, but only evaluated a 1-factor model and compared it across Hispanic and non-Hispanic Whites (Sayegh & Knight, 2014). They found adequate fit (RMSEAs = .08-.09; CFIs = .90-.94; TLIs = .87-.92) and configural and factorial invariance across both groups (i.e., same number of factors and item-factor loadings), although scalar invariance was not supported (i.e., different intercepts).

Current Study

Notably, none of the studies reviewed above applied methods from generalizability theory (GT) or item response theory (IRT) to the FAQ. GT was developed to supplement classical test theory (CTT; Price, 2017). CTT-based analyses assume that an observed test score is a combination of the individual’s “true” score and measurement error, although they do not provide information on the sources of error. GT uses multilevel statistics to parse the sources of error when there are multiple raters or time points, which then facilitates generalization to different settings and decisions. IRT models illuminate item-scale information as it relates to a latent construct. IRT assumes that a test score is a function of an individual’s latent ability and the test’s item characteristics. IRT can be used to determine the amount of information that a scale and its items provide regarding the latent construct, as well as item response-level differences in difficulty and discrimination.
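For orientation, the standard quantities behind these frameworks can be written as follows; these are generic textbook formulas offered only for illustration, not expressions taken from the FAQ literature.

```latex
% Classical test theory: an observed score is a true score plus error
X = T + E
% Generalizability theory: observed-score variance is partitioned across facets
% (persons p, items i, time/occasions t, their interactions, and residual error)
\sigma^2_X = \sigma^2_p + \sigma^2_i + \sigma^2_t
           + \sigma^2_{pi} + \sigma^2_{pt} + \sigma^2_{it} + \sigma^2_{pit,e}
% Two-parameter IRT response function: endorsement probability depends on the
% latent trait \theta_j, item discrimination a_i, and item difficulty b_i
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + e^{-a_i(\theta_j - b_i)}}
```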

An updated, consolidated, and expanded psychometric analysis would help researchers and clinicians determine whether the FAQ meets the needs of their research projects and clinics. It would also give them tools to fully leverage the information offered by the FAQ. Since psychometrics are sample-dependent, it is important that these analyses be conducted on a single sample that is large and well-characterized, to allow reasonable generalization to different settings. The current study provides information on reliability and validity using methods from CTT, GT, and IRT in a large, multicenter sample. We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study.

Method

Sample

NACC is funded by the National Institute on Aging (NIA) and comprises several Alzheimer’s Disease Research Centers (ADRCs) across the United States of America. The current sample includes data from 39 ADRCs, with data collected between September 2005 and December 2018. Individuals may have multiple visits, and those who had a FAQ at their first visit were selected (n = 27,916). The sample consisted primarily of well-educated (years of education: M = 15.21, SD = 3.31, range = 0-30) older adults (age: M = 71.39, SD = 10.56, range = 18-110), although there was a wide range of age and education. Other sample characteristics regarding sex, education group, ethnoracial group, and primary language are noted in Table 1.

Table 1.

Sample characteristics, FAQ descriptives, internal consistency, and omnibus effect sizes by group

| Group | n (%) | FAQ M (SD) | Cronbach α [99% CI] | η² [99% CI] |
| --- | --- | --- | --- | --- |
| Total | 27,916 | 6.85 (9.98) | 0.976 [.975, .977] | |
| Age | | | | 0.023 [.018, .027] |
| <60 | 3,399 (12.2) | 6.91 (9.90) | 0.976 [.974, .977] | |
| 60-69 | 8,095 (29.0) | 5.53 (9.17) | 0.976 [.975, .977] | |
| 70-79 | 10,044 (36.0) | 6.23 (9.51) | 0.975 [.974, .976] | |
| ≥80 | 6,378 (22.8) | 9.49 (11.17) | 0.977 [.976, .978] | |
| Sex | | | | .004 [.002, .006] |
| Man | 10,800 (38.7) | 7.67 (10.24) | 0.975 [.974, .976] | |
| Woman | 17,116 (61.3) | 6.34 (9.78) | 0.977 [.976, .978] | |
| Education | | | | .033 [.028, .038] |
| <12 | 1,857 (6.7) | 12.31 (11.76) | 0.976 [.974, .978] | |
| 12-15 | 10,196 (36.8) | 7.74 (10.37) | 0.976 [.975, .977] | |
| ≥16 | 15,640 (56.5) | 5.53 (9.11) | 0.975 [.974, .976] | |
| Ethnoracial Group | | | | .006 [.003, .008] |
| Non-Hispanic White | 20,830 (75.9) | 6.98 (9.94) | 0.975 [.974, .976] | |
| Black | 3,636 (13.3) | 5.21 (9.38) | 0.980 [.979, .981] | |
| Hispanic | 1,269 (4.7) | 8.70 (11.17) | 0.978 [.976, .980] | |
| Asian | 710 (2.6) | 6.11 (9.43) | 0.975 [.971, .978] | |
| Other | 963 (3.5) | 5.97 (9.46) | 0.973 [.970, .976] | |
| Language | | | | .007 [.005, .010] |
| English | 25,917 (92.9) | 6.66 (9.85) | 0.976 [.975, .977] | |
| Spanish | 1,281 (4.6) | 10.50 (11.61) | 0.979 [.977, .981] | |
| Chinese | 204 (0.7) | 5.50 (9.09) | 0.977 [.971, .982] | |
| Other | 481 (1.7) | 8.07 (10.55) | 0.978 [.974, .981] | |
| Clinician Diagnosis | | | | .682 [.675, .688] |
| Normal Cognition | 12,459 (44.6) | 0.34 (1.62) | 0.870 [.866, .874] | |
| Cognitive Difficulties below MCI Threshold | 1,272 (4.6) | 2.00 (4.10) | 0.907 [.897, .907] | |
| MCI | 5,343 (19.1) | 3.34 (4.87) | 0.897 [.892, .902] | |
| Dementia | 8,842 (31.7) | 18.84 (8.92) | 0.944 [.942, .946] | |
| CDR | | | | .820 [.816, .824] |
| No Impairment | 12,647 (45.3) | 0.27 (1.43) | 0.866 [.862, .870] | |
| Questionable/MCI | 8,601 (30.8) | 4.74 (5.77) | 0.903 [.899, .907] | |
| Mild Dementia | 3,847 (13.8) | 18.15 (6.38) | 0.870 [.862, .878] | |
| Moderate Dementia | 1,688 (6.0) | 26.10 (4.05) | 0.815 [.798, .831] | |
| Severe Dementia | 1,133 (4.1) | 29.39 (2.04) | 0.863 [.848, .877] | |

Note. FAQ: Functional Activities Questionnaire; CI: confidence interval; CDR: Clinical Dementia Rating; MCI: mild cognitive impairment.

NACC also collects some data on co-participants, who were slightly younger (M = 62.31, SD = 13.81), well-educated (M = 1.5, SD = 2.84), primarily women (64.8%; men = 35.2%), and non-Hispanic White (75.4%; Black American = 13.5%, Hispanic = 4.6%, Asian American = 2.9%, Other = 3.6%). A slight majority of co-participants were spouses (53%; child = 25.2%, friend = 10.7%, sibling = 5.2%), lived with the participant (59.6%), and had known the participant for many years (M = 39.74, SD = 17.25). Notably, information on the language in which the co-participant completed the FAQ was not available in this data freeze; NACC’s Uniform Data Set (UDS) has also been adapted into Spanish and Mandarin Chinese, and some co-participants may have completed the FAQ in one of these languages (Rascovsky, 2017). However, 92.9% of sample participants preferred English, and we presume most FAQs were administered in English. A sample of individuals with two visits was selected for reliability analyses (n = 16,262), with similar age (M = 71.48, SD = 10.16) and education (M = 15.88, SD = 7.08). Other demographics are summarized in Supplementary Table 1.

Measures

FAQ

As noted in the introduction, the FAQ is a 10-item, collateral-report scale with 4-point ordinal responses within each item (Pfeffer et al., 1982). Response categories are: 0, Normal; 1, Has difficulty but does by self; 2, Requires assistance; 3, Dependent. Item content is summarized in Table 5. The total score can range from 0 to 30, with higher scores indicating more functional difficulties.

Table 5.

Confirmatory factor analysis and item-response theory analyses

| Item Description | M (SD) | Factor Loading [99% CI] | a | b1 | b2 | b3 | Information |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bills | 0.861 (1.24) | 1.146 [1.139, 1.160] | 4.028 | 0.254 | 0.496 | 0.934 | 6.69 |
| Taxes | 0.925 (1.27) | 1.167 [1.157, 1.176] | 3.145 | 0.198 | 0.420 | 0.813 | 6.44 |
| Shopping | 0.678 (1.11) | 1.041 [1.028, 1.054] | 3.930 | 0.455 | 0.889 | 1.457 | 7.88 |
| Hobby | 0.572 (1.01) | .896 [.881, .911] | 3.445 | 0.584 | 1.161 | 1.680 | 8.20 |
| Stove Use | 0.471 (0.97) | .809 [.792, .826] | 3.505 | 0.879 | 1.365 | 1.721 | 7.26 |
| Meal Preparation | 0.653 (1.11) | 1.024 [1.010, 1.038] | 3.408 | 0.556 | 0.948 | 1.385 | 7.29 |
| Current Events | 0.588 (1.01) | .899 [.884, .913] | 3.375 | 0.538 | 1.136 | 1.721 | 8.46 |
| Paying Attention | 0.494 (0.90) | .766 [.751, .781] | 3.202 | 0.609 | 1.387 | 1.998 | 8.99 |
| Remembering Appointments/Medications | 0.799 (1.11) | 1.012 [1.000, 1.024] | 3.330 | 0.091 | 0.715 | 1.438 | 8.90 |
| Traveling outside neighborhood | 0.811 (1.20) | 1.102 [1.090, 1.114] | 3.099 | 0.274 | 0.688 | 1.080 | 7.21 |

Note. CI: confidence interval. a: unconstrained item discrimination parameters (constrained to 3.71 for b1-b3 and information estimation). b1-b3: item difficulty parameters. Factor loadings are unstandardized.

Cognition & Depressive Symptoms

NACC’s UDS includes measures of cognition, which have changed over time (Morris et al., 2006; Weintraub et al., 2018, 2009). To develop composite cognitive scores, only data from the first and second versions of the UDS battery were included. These UDS versions incorporated standardized neuropsychological measures including the Wechsler Memory Scale-Revised (WMS-R; Wechsler, 1987a) Logical Memory Immediate and Delayed Recall, WMS-R Digit Span, Wechsler Adult Intelligence Scale-Revised (WAIS-R; Wechsler, 1987b) Digit Symbol, Trail Making Test Parts A and B (Reitan & Wolfson, 1993), Boston Naming Test (30 odd-numbered items; Kaplan, Goodglass, & Weintraub, 1983), and semantic fluency (animals and vegetables; Morris et al., 1989). Cognitive composite scores for episodic memory (WMS-R Logical Memory Immediate and Delayed Recall), attention/working memory (WMS-R Digit Span Forward and Backward correct trials and longest span), language (semantic fluency, Boston Naming Test), and executive function (WAIS-R Digit Symbol, Trail Making Test Parts A and B) were created based on a published factor analysis (Hayden et al., 2011). Z-scores for each test were derived using the means and standard deviations of all cognitively unimpaired individuals at the baseline visit (see the diagnosis section below for information on the cognitively normal determination). Trail Making Test Parts A and B were multiplied by −1 so that higher scores indicate better performance. Z-scores within each domain were summed and averaged across the number of included measures to derive the composite scores.
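As a minimal sketch of this compositing approach, the following R code illustrates the z-scoring and domain averaging; the data, variable names, and normative subsetting are hypothetical stand-ins, not actual NACC UDS fields.

```r
# Illustrative sketch of cognitive compositing; simulated data and hypothetical names,
# not the NACC UDS variables.
set.seed(1)
dat <- data.frame(
  lm_immediate = rnorm(100, 13, 4), lm_delayed = rnorm(100, 11, 4),
  digit_symbol = rnorm(100, 45, 11),
  trails_a = rnorm(100, 35, 12), trails_b = rnorm(100, 90, 40),
  unimpaired = rbinom(100, 1, 0.5) == 1
)

# Normative means/SDs come from cognitively unimpaired participants at baseline
zscore <- function(x, ref) (x - mean(ref, na.rm = TRUE)) / sd(ref, na.rm = TRUE)
norms  <- dat[dat$unimpaired, ]

dat$lm_imm_z <- zscore(dat$lm_immediate, norms$lm_immediate)
dat$lm_del_z <- zscore(dat$lm_delayed,   norms$lm_delayed)
dat$dsym_z   <- zscore(dat$digit_symbol, norms$digit_symbol)
dat$tmta_z   <- -zscore(dat$trails_a, norms$trails_a)  # timed tests flipped so higher = better
dat$tmtb_z   <- -zscore(dat$trails_b, norms$trails_b)

# Domain composite = mean of the included measures' z-scores
dat$memory    <- rowMeans(dat[, c("lm_imm_z", "lm_del_z")])
dat$executive <- rowMeans(dat[, c("dsym_z", "tmta_z", "tmtb_z")])
```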

Depressive symptoms were assessed with the 15-item Geriatric Depression Scale (GDS-15; Yesavage & Sheikh, 1986). Participants with available data across the cognitive measures and the GDS-15 were included in the cognition and depressive symptoms sample. Missing data for each domain were as follows: attention, n = 153; executive function, n = 2,949; language, n = 350; depressive symptoms, n = 678. There were no missing data for the memory domain. Supplementary Table 1 displays demographic and clinical characteristics of those included in the cognition and depressive symptoms sample.

Clinical Diagnosis

The NACC has two diagnostic methods: clinician diagnosis and the Clinical Dementia Rating scale (CDR® Dementia Staging Instrument; Morris et al., 2006). The clinician diagnosis can be derived by a single clinician, multi-clinician discussion, or formal consensus panel, depending on the site. Categories include: NC, cognitive difficulties below the MCI threshold, MCI, and dementia. The CDR is a semi-structured interview with the participant and a collateral that queries cognitive and functional status and includes performance-based tasks for the participant (Morris, 1997). It is used to stage the level of cognitive impairment, with five levels: NC, questionable/MCI, mild dementia, moderate dementia, and severe dementia. Studies have shown CDR ratings to have validity and reliability, including in multicenter contexts (Morris et al., 1997; Morris, 1997). The number of people in each diagnostic category is noted in Table 1. Clinical diagnoses are allowed to use all available clinical data (which can include the FAQ), and the procedures are not uniform across ADRCs, which mirrors clinical practice but can lead to criterion contamination for the purposes of our study. As such, the CDR was selected as the diagnostic variable for classification analyses.

Procedure

Assumptions for different tests were evaluated. All analyses were run with SPSS (Version 26), R (Version 3.6.2), and JASP (Version 0.12.2). Given the relatively large sample size, we expected many analyses to be statistically significant and elected to emphasize effect sizes and 99% confidence intervals. An effect size’s 99% confidence interval that does not include zero is considered statistically significant (p < .01) for the coefficients used in this study. For eta-squared, we used the following rules of thumb: 0 - .01 (negligible), .01 - .06 (small), .06 - .14 (medium), >.14 (large). For Cohen’s d, we used: 0 - .20 (negligible), .20 - .50 (small), .50 - .80 (medium), >.80 (large). For correlation coefficients, we used: 0 - .10 (negligible), .10-.30 (small), .30-.50 (medium), .50-1.0 (large; Cohen, 1992).

Demographic Effects

The impact of five demographic variables on FAQ scores was explored: sex, age, education, ethnoracial group, and the participant’s preferred language. Age and education were also divided into ordinal groups for these analyses. Correlational relationships with age and education were evaluated with the nonparametric Kendall τ. Omnibus group contrasts were evaluated with the nonparametric Kruskal-Wallis H. Exploration of omnibus effect sizes and post-hoc contrasts utilized Brown-Forsythe and Tukey’s Honest Significant Difference (HSD) corrections with η², η²p, and Cohen’s d coefficients.
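A minimal sketch of these analyses in base R, using simulated data and hypothetical variable names rather than the NACC file (the Brown-Forsythe-corrected ANOVAs are omitted for brevity):

```r
# Sketch of the demographic analyses; `dat` is simulated, not NACC data.
set.seed(1)
dat <- data.frame(
  faq = rpois(500, 5),
  age = round(rnorm(500, 71, 10)),
  education = round(rnorm(500, 15, 3))
)
dat$age_group <- cut(dat$age, c(-Inf, 59, 69, 79, Inf),
                     labels = c("<60", "60-69", "70-79", ">=80"))

cor.test(dat$faq, dat$age, method = "kendall")   # Kendall tau for continuous age
kruskal.test(faq ~ age_group, data = dat)        # nonparametric omnibus contrast
TukeyHSD(aov(faq ~ age_group, data = dat))       # post-hoc pairwise contrasts
```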

Reliability & Generalizability Theory

Cronbach’s alpha was used as a measure of internal consistency. A subsample of individuals who also completed a second visit was selected for test-retest reliability analysis (n = 16,265). The zero-order correlation is the recommended CTT test-retest coefficient, and we also calculated the intraclass correlation coefficient (ICC) to quantify consistency across time. GT provides the proportion of variance explained by subject-level, time-level, and item-level differences. At the scale level, it provides generalizability coefficients; since the FAQ is not part of an item bank and only two time points were evaluated, we selected the generalizability of change coefficient (Rc; fixed items, fixed time). Since many clinicians may want normal-cognition information to calculate reliable change when re-evaluating a patient at 1-2 years, we also selected individuals with normal cognition (CDR = 0) at both time points who completed their second visit within two years (n = 1,709; age M = 69.97, SD = 11.17; education M = 16.14, SD = 5.67). Reliability analyses were calculated using the psych package (Revelle & Condon, 2019). We interpreted Cronbach’s alpha and reliability-generalizability coefficients as follows: ≤.50 (unacceptable), .51-.59 (poor), .60-.69 (questionable), .70-.79 (acceptable), .80-.89 (good), ≥.90 (excellent; George & Mallery, 2003). We interpreted ICC coefficients as follows: <.50 (poor), .50-.75 (moderate), .75-.90 (good), >.90 (excellent; Koo & Li, 2016).
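The following sketch illustrates how such coefficients can be obtained with the psych package. The data are simulated placeholders, and the multilevel generalizability call is indicated only schematically (see ?multilevel.reliability); the exact calls used in this study are not reproduced here.

```r
# Sketch of the reliability workflow with the psych package (simulated item data).
library(psych)

set.seed(1)
faq_items <- as.data.frame(matrix(sample(0:3, 200 * 10, replace = TRUE), ncol = 10))

alpha(faq_items)              # Cronbach's alpha (internal consistency)

# Test-retest: zero-order correlation and ICC on total scores at two visits
totals <- data.frame(t1 = rowSums(faq_items),
                     t2 = rowSums(faq_items) + sample(-2:2, 200, replace = TRUE))
cor(totals$t1, totals$t2)     # CTT test-retest coefficient
ICC(totals)                   # intraclass correlation coefficients

# Generalizability: variance components and coefficients (including Rc) can be
# estimated from long-format data (columns for id, time, and the 10 items), e.g.:
# multilevel.reliability(long_dat, grp = "id", Time = "time",
#                        items = paste0("faq", 1:10))
```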

Convergent Validity & Classification Accuracy

Convergent validity was evaluated by examining partial correlations between the FAQ and the cognitive composite domains and depressive symptoms, with adjustment for age, sex, education, and ethnoracial group. Separate hierarchical linear regression models were used to evaluate the amount of variance in the FAQ accounted for by cognitive domains and depressive symptoms. Step 1 of the models included age, sex, education, and ethnoracial group. The cognitive domain or depressive symptoms were added in step 2, and the change in R-squared was examined across the two steps for each model. To evaluate the unique contribution of each cognitive domain and depressive symptoms to the FAQ, a multiple regression was conducted that included all cognitive domains, depressive symptoms, age, sex, education, and ethnoracial group as independent variables.
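A sketch of one such hierarchical model (memory entered at step 2) using simulated data and hypothetical variable names:

```r
# Sketch of one hierarchical regression model; data are simulated and names hypothetical.
set.seed(1)
dat <- data.frame(
  faq = rpois(500, 5),
  age = rnorm(500, 71, 10),
  sex = factor(sample(c("M", "F"), 500, replace = TRUE)),
  education = rnorm(500, 15, 3),
  ethnoracial = factor(sample(c("NHW", "Black", "Hispanic"), 500, replace = TRUE)),
  memory = rnorm(500)
)

step1 <- lm(faq ~ age + sex + education + ethnoracial, data = dat)  # demographics only
step2 <- update(step1, . ~ . + memory)                              # add the cognitive domain

summary(step2)$r.squared - summary(step1)$r.squared  # change in R-squared for memory
confint(step2, level = 0.99)                         # 99% CIs for unstandardized Bs
```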

Receiver operating characteristic (ROC) curve analyses were used to assess the FAQ’s accuracy for detecting cognitive impairment and to elucidate cut-scores that maximized sensitivity and specificity for identifying both MCI and dementia. Two sets of ROC curve analyses were conducted. First, analyses compared the no cognitive impairment group with the impaired groups (i.e., MCI/dementia combined) to assess the FAQ’s accuracy for identifying the presence of cognitive impairment broadly. Next, three separate ROC analyses were conducted for the no impairment versus MCI groups, no impairment versus dementia groups, and MCI versus dementia groups in order to identify cut-scores that most accurately differentiated severity of cognitive impairment (i.e., MCI versus dementia). Classification accuracy was interpreted as follows: area under the curve (AUC) = .50-.69 (poor), .70-.79 (acceptable), .80-.89 (excellent), and ≥.90 (outstanding; Hosmer et al., 2013).
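A sketch of this type of ROC analysis using the pROC package is shown below; the data are simulated, whereas in the study itself the impairment indicator was derived from the CDR.

```r
# Sketch of an ROC analysis with the pROC package (simulated data).
library(pROC)

set.seed(1)
dat <- data.frame(impaired = rbinom(1000, 1, 0.5))
dat$faq <- rpois(1000, lambda = ifelse(dat$impaired == 1, 10, 1))

roc_fit <- roc(impaired ~ faq, data = dat, ci = TRUE)  # AUC with confidence interval
auc(roc_fit)
coords(roc_fit, x = "best", best.method = "youden",    # cut-score maximizing sens + spec
       ret = c("threshold", "sensitivity", "specificity"))
```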

Latent Structure & Item Response Theory

Since no exploratory factor analysis (EFA) had been conducted on the original FAQ, 300 individuals were randomly selected for exploratory analyses. Principal axis extraction was selected for the EFA since item responses are not normally distributed. The number of factors extracted was based on parallel analysis and the scree plot, with oblique rotation for multiple factors. If a multifactor model emerged, it would then be compared to a simple (i.e., uncorrelated residuals) 1-factor model with CFA using the entire sample. Given ordinal items/indicators, CFA with diagonally weighted least squares (DWLS) estimation was employed. Several fit indices were used, with criteria for good fit indicated in parentheses (Brown, 2015): chi-square (p > .05), goodness of fit index (GFI > .95), root mean square error of approximation (RMSEA < .06), standardized root mean square residual (SRMR < .08), comparative fit index (CFI > .90), and Tucker-Lewis Index (TLI > .95).
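A sketch of a single-factor CFA with DWLS estimation using the lavaan package follows; the item names (faq1-faq10) and simulated responses are placeholders, not the NACC data.

```r
# Sketch of a one-factor CFA with DWLS estimation for ordinal indicators (lavaan).
library(lavaan)

set.seed(1)
theta <- rnorm(500)  # simulate responses driven by a single latent trait
faq_items <- as.data.frame(
  sapply(1:10, function(i) findInterval(theta + rnorm(500, sd = 0.8), c(0.2, 0.9, 1.5)))
)
names(faq_items) <- paste0("faq", 1:10)

model <- 'disability =~ faq1 + faq2 + faq3 + faq4 + faq5 +
                        faq6 + faq7 + faq8 + faq9 + faq10'

fit <- cfa(model, data = faq_items, ordered = names(faq_items), estimator = "DWLS")
fitMeasures(fit, c("chisq", "pvalue", "gfi", "rmsea", "srmr", "cfi", "tli"))
```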

IRT analyses were run if a unidimensional model had adequate fit. A graded response model (GRM) for ordinal items was employed. Constrained and unconstrained GRMs were both run. A constrained GRM is analogous to a 1-parameter model in which the item discrimination parameter (a) is fixed based on the model data and item response difficulty parameters (bx) are freely estimated. An unconstrained GRM is analogous to a 2-parameter model, with both a and bx freely estimated. Discrimination parameters were interpreted as follows: .01-.34 (very low), .35-.64 (low), .65-1.34 (moderate), 1.35-1.69 (high), >1.70 (very high; Baker, 2001). The GRMs were tested for relative fit, and the better-fitting GRM was used to produce difficulty parameters and test-item information and characteristic curves (TIC, IIC, & ICC). Since FAQ responses are 4-point ordinal, each item has three difficulty parameters (b1-b3) and four ICCs, in contrast to Rasch and 1PL IRT models for dichotomous items, which provide a single difficulty parameter and ICC. The difficulty parameter corresponds to the latent construct continuum (θ); specifically, an individual at that level of functional disability has a .50 probability of endorsing that response. Note that statistical nomenclature conflicts with conceptual meaning here: greater latent “ability” (i.e., a higher θ) indicates more functional difficulty. IRT analyses were run with the ltm package (Rizopoulos, 2006).
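A sketch of the constrained versus unconstrained GRM comparison with the ltm package, again with simulated placeholder data rather than the FAQ responses:

```r
# Sketch of constrained vs. unconstrained graded response models with the ltm package.
library(ltm)

set.seed(1)
theta <- rnorm(500)  # simulated ordinal items driven by one latent trait
faq_items <- as.data.frame(
  sapply(1:10, function(i) findInterval(theta + rnorm(500, sd = 0.8), c(0.2, 0.9, 1.5)))
)

fit_uncon <- grm(faq_items)                      # discrimination (a) and difficulties (b) free
fit_con   <- grm(faq_items, constrained = TRUE)  # common discrimination across items
anova(fit_con, fit_uncon)                        # likelihood ratio test of relative fit

coef(fit_uncon)                                  # item parameters (a, b1-b3)
plot(fit_uncon, type = "IIC")                    # item information curves
plot(fit_uncon, type = "IIC", items = 0)         # test information curve
information(fit_uncon, range = c(0, 2))          # share of information within a theta range
```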

Results

Demographic Effects

The FAQ had a statistically significant but negligible-to-small correlation with age (Kendall τ = .09, 99% CI [.08, .10]) and a small correlation with education (Kendall τ = −.13, 99% CI [−.14, −.12]). When considering categorical demographics, all omnibus tests (including the Brown-Forsythe correction and Kruskal-Wallis H) and their contrasts were statistically significant. However, most effect sizes were not meaningful, with the exception of age and education groups, which had small effect sizes. Omnibus effect sizes are summarized in Table 1. Tukey HSD post-hoc contrasts were mostly significant and revealed different patterns among groups. For age, all contrasts were statistically significant, but the ≥80 group accounted for the meaningful differences, with small effect sizes compared to the <60 group (Cohen’s d = .241, 99% CI [.186, .296]), the 60-69 group (Cohen’s d = .393, 99% CI [.349, .437]), and the 70-79 group (Cohen’s d = .321, 99% CI [.280, .362]). For education, all contrasts were statistically significant, with the highest and lowest groups having the largest difference, a medium-to-large effect (Cohen’s d = .791, 99% CI [.727, .855]). The less than high school vs. some college contrast had a small effect (Cohen’s d = .431, 99% CI [.366, .496]), as did the some college vs. bachelor’s or higher contrast (Cohen’s d = .230, 99% CI [.197, .263]).

For ethnoracial groups, not all contrasts were statistically significant, with the Asian American group not differing statistically from the other groups. Other contrasts had negligible-to-small effects, primarily driven by the Hispanic group when compared to Black Americans (Cohen’s d = .353, 99% CI [.268, .437]), Asian Americans (Cohen’s d = .245, 99% CI [.124, .366]), and non-Hispanic Whites (NHWs; Cohen’s d = .171, 99% CI [.096, .246]). A negligible-to-small effect was also observed between NHWs and Black Americans (Cohen’s d = .180, 99% CI [.134, .226]). For language, not all contrasts were statistically significant, with differences driven by Spanish, which showed small effects when compared to Chinese (Cohen’s d = .443, 99% CI [.248, .638]) and English (Cohen’s d = .387, 99% CI [.313, .461]).

In contrast, diagnostic cognitive stages had large effect sizes, suggesting the FAQ may be helpful in distinguishing diagnostic groups. Given the large impact of disease staging, we ran post-hoc analyses of covariance with the FAQ and demographic variables, using the CDR sum as a covariate. The ANCOVAs were still statistically significant, but omnibus effect sizes were all negligible for sex (Kruskal-Wallis H = 190.01, p < .001, η²p < .001), age (Kruskal-Wallis H = 614.10, p < .001, η²p = .005), education (Kruskal-Wallis H = 898.13, p < .001, η²p = .003), ethnoracial group (Kruskal-Wallis H = 218.15, p < .001, η²p = .002), and language (Kruskal-Wallis H = 199.87, p < .001, η²p < .001). All post-hoc contrasts were now null-to-negligible for age (all |d|s < .08), education (all |d|s < .09), ethnoracial group (all |d|s < .06), and language (all |d|s < .02).

Reliability & Generalizability

Internal consistency coefficients are summarized in Table 1 . The FAQ had strong internal consistency overall and across all demographic groups. When the item-score range was restricted within diagnostic groups, internal consistency was attenuated but still considered good (αs = .82 - .94).

In the subsample of those with normal cognition, with a mean interval of 1.33 years, overall reliability was substantially attenuated. This appears attributable to the lack of variability within time points and the lack of differences between time points (Mann-Whitney U = 1.45e6, p = .31). Indeed, the multilevel variance analysis did not reveal meaningful variance at any level or interaction, with minimal total variance explained. Reliability results are summarized in Table 2.

Table 2.

Test-retest and generalizability information

| | Variance | Coefficient |
| --- | --- | --- |
| Total Sample | | |
| Subject | .89 (43%) | |
| Time | .01 (0%) | |
| Items | .02 (1%) | |
| Subject × Time | .09 (4%) | |
| Subject × Items | .07 (3%) | |
| Time × Items | .89 (43%) | |
| Residual | .11 (5%) | |
| Rc | | 0.89 |
| Interval Days M (SD) | | 1598 (1138) |
| Time 1 FAQ M (SD) | | 5.65 (9.23) |
| Time 2 FAQ M (SD) | | 6.99 (10.55) |
| r | | 0.93 |
| ICC | | 0.95 |
| Normal Cognition | | |
| Subject | .01 (15%) | |
| Time | .00 (0%) | |
| Items | .00 (0%) | |
| Subject × Time | .01 (20%) | |
| Subject × Items | .01 (16%) | |
| Time × Items | .01 (15%) | |
| Residual | .02 (34%) | |
| Rc | | 0.85 |
| Interval Days M (SD) | | 487 (142) |
| Time 1 FAQ M (SD) | | 0.19 (1.03) |
| Time 2 FAQ M (SD) | | 0.29 (1.61) |
| r | | 0.52 |
| ICC | | 0.67 |

Note. Total retest sample n = 16,265. Normal cognition retest sample n = 1,709. FAQ: Functional Activities Questionnaire; Rc: generalizability of change coefficient. M: mean. SD: standard deviation. ICC: intraclass correlation coefficient.

Convergent Validity & Classification Accuracy

Table 3.

Hierarchical linear regression evaluating the association between the Functional Activities Questionnaire with cognitive domains and depressive symptoms

| | Step 1 | Step 2: Memory | Step 2: Attention | Step 2: Executive Function | Step 2: Language | Step 2: Depressive Symptoms |
| --- | --- | --- | --- | --- | --- | --- |
| ΔR² | 0.043** | 0.298** | 0.082** | 0.327** | 0.290** | 0.052** |
| Age (B) | 0.079** [0.066, 0.093] | 0.004 [−0.007, 0.016] | 0.058** [0.045, 0.071] | −0.066** [−0.077, −0.054] | −0.026** [−0.038, −0.014] | 0.090** [0.077, 0.103] |
| Sex (B) | −1.160** [−1.452, −0.868] | 0.184 [−0.061, 0.430] | −0.960** [−1.239, −0.680] | −0.203* [−0.422, 0.035] | −0.244* [−0.489, 0.001] | −1.033** [−1.317, −0.749] |
| Education (B) | −0.322** [−0.369, −0.275] | −0.003 [−0.043, 0.038] | −0.150** [−0.196, −0.103] | 0.059** [0.020, 0.099] | 0.011 [−0.029, 0.052] | −0.257** [−0.303, −0.211] |
| Black (B) | −1.697** [−2.099, −1.295] | −2.264** [−2.598, −1.930] | −2.499** [−2.887, −2.111] | −3.686** [−4.017, −3.356] | −3.301** [−3.640, −2.962] | −1.582** [−1.973, −1.191] |
| Hispanic (B) | −0.024 [−0.714, 0.665] | −0.393 [−0.966, 0.179] | −1.294** [−1.959, −0.630] | −1.893** [−2.454, −1.333] | −1.309** [−1.886, −0.732] | −0.445 [−1.117, 0.226] |
| Asian (B) | −0.230 [−1.242, 0.781] | −0.734* [−1.574, 0.105] | −1.130* [−2.098, −0.161] | 0.038 [−0.782, 0.859] | −2.251** [−3.097, −1.405] | −0.286 [−1.269, 0.697] |
| Other Ethnoracial Group (B) | −0.184 [−1.132, 0.764] | −1.176** [−1.963, −0.389] | −1.646** [−2.557, −0.735] | −2.490** [−3.261, −1.718] | −2.116** [−2.909, −1.323] | −0.932* [−1.856, −0.008] |
| Cognitive Domain/Depressive Symptoms (B) | | −3.139** [−3.232, −3.047] | −2.410** [−2.566, −2.254] | −3.476** [−3.572, −3.381] | −3.778** [−2.909, −1.323] | 0.667** [0.611, 0.722] |

Note. Separate hierarchical linear regression models were conducted for each cognitive domain and depressive symptoms. Step 1 adjusted for age, sex, education, and ethnoracial group, and step 2 added the cognitive domain or depressive symptoms. ΔR²: change in R-squared; B: unstandardized beta; 99% confidence intervals in brackets.

Results of the ROC curve analyses are presented in Table 4. For the overall sample, ROC curve analyses yielded a significant area under the curve (AUC) of .894, suggestive of excellent classification accuracy (Hosmer et al., 2013). When the cognitively impaired sample was subdivided by degree of cognitive impairment, classification accuracy was strongest for patients with dementia (AUCs ≥ .954), indicating outstanding accuracy; however, classification accuracy remained excellent even for those with MCI (AUC = .814). For the overall sample, the FAQ reliably differentiated NCs from patients with cognitive impairment at an optimal cut-score of ≥2 (76% sensitivity/95% specificity). Similarly, an optimal cut-score of ≥2 produced 58% sensitivity/95% specificity for differentiating NCs from patients with MCI. In contrast, a slightly higher threshold of ≥5 was suggested for the contrast between NCs and those with dementia (98% sensitivity/99% specificity). Finally, the FAQ also differentiated patients with MCI from those with dementia at a cut-score of ≥14, which yielded 86% sensitivity/90% specificity.

Table 4.

Receiver operating characteristic curve analyses

| Cut-Score | Sensitivity (%) | Specificity (%) | PPV (BR .40) | NPV (BR .40) | PPV (BR .30) | NPV (BR .30) | PPV (BR .20) | NPV (BR .20) | PPV (BR .10) | NPV (BR .10) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NC (n = 12,647) vs. Cognitively Impaired (n = 15,269): FAQ AUC = .894***, 99% CI .888-.899 | | | | | | | | | | |
| ≥ 1 | 82.1 | 91.2 | 0.86 | 0.88 | 0.80 | 0.92 | 0.70 | 0.95 | 0.51 | 0.98 |
| ≥ 2 | 76.3 | 95.3 | 0.92 | 0.86 | 0.87 | 0.90 | 0.80 | 0.94 | 0.64 | 0.97 |
| ≥ 3 | 71.1 | 97.2 | 0.94 | 0.83 | 0.92 | 0.89 | 0.86 | 0.93 | 0.74 | 0.97 |
| ≥ 4 | 67.3 | 98.0 | 0.96 | 0.82 | 0.94 | 0.87 | 0.89 | 0.92 | 0.79 | 0.96 |
| ≥ 5 | 64.0 | 98.5 | 0.97 | 0.80 | 0.95 | 0.86 | 0.91 | 0.92 | 0.83 | 0.96 |
| ≥ 6 | 61.1 | 98.9 | 0.97 | 0.79 | 0.96 | 0.86 | 0.93 | 0.91 | 0.86 | 0.96 |
| ≥ 7 | 58.3 | 99.2 | 0.98 | 0.78 | 0.97 | 0.85 | 0.95 | 0.90 | 0.89 | 0.96 |
| ≥ 8 | 56.1 | 99.3 | 0.98 | 0.77 | 0.97 | 0.84 | 0.95 | 0.90 | 0.90 | 0.95 |
| ≥ 9 | 53.7 | 99.4 | 0.98 | 0.76 | 0.97 | 0.83 | 0.96 | 0.90 | 0.91 | 0.95 |
| ≥ 10 | 51.5 | 99.5 | 0.99 | 0.75 | 0.98 | 0.83 | 0.96 | 0.89 | 0.92 | 0.95 |
| NC (n = 12,647) vs. MCI (n = 8,601): FAQ AUC = .814***, 99% CI .806-.823 | | | | | | | | | | |
| ≥ 1 | 68.6 | 91.2 | 0.84 | 0.81 | 0.77 | 0.87 | 0.66 | 0.92 | 0.46 | 0.96 |
| ≥ 2 | 58.4 | 95.3 | 0.89 | 0.77 | 0.84 | 0.84 | 0.76 | 0.90 | 0.58 | 0.95 |
| ≥ 3 | 49.3 | 97.2 | 0.92 | 0.74 | 0.88 | 0.82 | 0.81 | 0.88 | 0.66 | 0.95 |
| ≥ 4 | 42.9 | 98.0 | 0.93 | 0.72 | 0.90 | 0.80 | 0.84 | 0.87 | 0.70 | 0.94 |
| ≥ 5 | 37.4 | 98.5 | 0.94 | 0.70 | 0.91 | 0.79 | 0.86 | 0.86 | 0.73 | 0.93 |
| ≥ 6 | 32.7 | 98.9 | 0.95 | 0.69 | 0.93 | 0.77 | 0.88 | 0.85 | 0.77 | 0.93 |
| ≥ 7 | 28.3 | 99.2 | 0.96 | 0.67 | 0.94 | 0.76 | 0.90 | 0.85 | 0.80 | 0.93 |
| ≥ 8 | 24.9 | 99.3 | 0.96 | 0.66 | 0.94 | 0.76 | 0.90 | 0.84 | 0.80 | 0.92 |
| ≥ 9 | 21.4 | 99.4 | 0.96 | 0.65 | 0.94 | 0.75 | 0.90 | 0.83 | 0.80 | 0.92 |
| ≥ 10 | 18.6 | 99.5 | 0.96 | 0.65 | 0.94 | 0.74 | 0.90 | 0.83 | 0.81 | 0.92 |
| NC (n = 12,647) vs. Dementia (n = 6,668): FAQ AUC = .996***, 99% CI .995-.998 | | | | | | | | | | |
| ≥ 1 | 99.6 | 91.2 | 0.88 | 1.00 | 0.83 | 1.00 | 0.74 | 1.00 | 0.56 | 1.00 |
| ≥ 2 | 99.5 | 95.3 | 0.93 | 1.00 | 0.90 | 1.00 | 0.84 | 1.00 | 0.70 | 1.00 |
| ≥ 3 | 99.1 | 97.2 | 0.96 | 0.99 | 0.94 | 1.00 | 0.90 | 1.00 | 0.80 | 1.00 |
| ≥ 4 | 98.8 | 98.0 | 0.97 | 0.99 | 0.95 | 0.99 | 0.93 | 1.00 | 0.85 | 1.00 |
| ≥ 5 | 98.3 | 98.5 | 0.98 | 0.99 | 0.97 | 0.99 | 0.94 | 1.00 | 0.88 | 1.00 |
| ≥ 6 | 97.6 | 98.9 | 0.98 | 0.98 | 0.97 | 0.99 | 0.96 | 0.99 | 0.91 | 1.00 |
| ≥ 7 | 97.0 | 99.2 | 0.99 | 0.98 | 0.98 | 0.99 | 0.97 | 0.99 | 0.93 | 1.00 |
| ≥ 8 | 96.2 | 99.3 | 0.99 | 0.98 | 0.98 | 0.98 | 0.97 | 0.99 | 0.94 | 1.00 |
| ≥ 9 | 95.3 | 99.4 | 0.99 | 0.97 | 0.99 | 0.98 | 0.98 | 0.99 | 0.95 | 0.99 |
| ≥ 10 | 94.0 | 99.5 | 0.99 | 0.96 | 0.99 | 0.97 | 0.98 | 0.99 | 0.95 | 0.99 |
| MCI (n = 8,601) vs. Dementia (n = 6,668): FAQ AUC = .954***, 99% CI .950-.958 | | | | | | | | | | |
| ≥ 1 | 99.6 | 31.4 | 0.49 | 0.99 | 0.38 | 0.99 | 0.27 | 1.00 | 0.14 | 1.00 |
| ≥ 2 | 99.5 | 41.6 | 0.53 | 0.99 | 0.42 | 0.99 | 0.30 | 1.00 | 0.16 | 1.00 |
| ≥ 3 | 99.1 | 50.7 | 0.57 | 0.99 | 0.46 | 0.99 | 0.33 | 1.00 | 0.18 | 1.00 |
| ≥ 4 | 98.8 | 57.1 | 0.61 | 0.99 | 0.50 | 0.99 | 0.37 | 0.99 | 0.20 | 1.00 |
| ≥ 5 | 98.3 | 62.6 | 0.64 | 0.98 | 0.53 | 0.99 | 0.40 | 0.99 | 0.23 | 1.00 |
| ≥ 6 | 97.6 | 67.3 | 0.67 | 0.98 | 0.56 | 0.98 | 0.43 | 0.99 | 0.25 | 1.00 |
| ≥ 7 | 97.0 | 71.7 | 0.70 | 0.97 | 0.59 | 0.98 | 0.46 | 0.99 | 0.28 | 1.00 |
| ≥ 8 | 96.2 | 75.1 | 0.72 | 0.97 | 0.62 | 0.98 | 0.49 | 0.99 | 0.30 | 0.99 |
| ≥ 9 | 95.3 | 78.6 | 0.75 | 0.96 | 0.66 | 0.98 | 0.53 | 0.99 | 0.33 | 0.99 |
| ≥ 10 | 94.0 | 81.4 | 0.77 | 0.95 | 0.68 | 0.97 | 0.56 | 0.98 | 0.36 | 0.99 |
| ≥ 11 | 92.4 | 83.8 | 0.79 | 0.94 | 0.71 | 0.96 | 0.59 | 0.98 | 0.39 | 0.99 |
| ≥ 12 | 91.1 | 86.0 | 0.81 | 0.94 | 0.74 | 0.96 | 0.62 | 0.97 | 0.42 | 0.99 |
| ≥ 13 | 88.8 | 88.2 | 0.83 | 0.92 | 0.76 | 0.95 | 0.65 | 0.97 | 0.46 | 0.99 |
| ≥ 14 | 86.4 | 90.4 | 0.86 | 0.91 | 0.79 | 0.94 | 0.69 | 0.96 | 0.50 | 0.98 |
| ≥ 15 | 84.0 | 91.9 | 0.87 | 0.90 | 0.82 | 0.93 | 0.72 | 0.96 | 0.54 | 0.98 |
| ≥ 16 | 81.4 | 93.3 | 0.89 | 0.88 | 0.84 | 0.92 | 0.75 | 0.95 | 0.57 | 0.98 |
| ≥ 17 | 77.9 | 94.4 | 0.90 | 0.86 | 0.86 | 0.91 | 0.78 | 0.94 | 0.61 | 0.97 |
| ≥ 18 | 74.3 | 95.3 | 0.91 | 0.85 | 0.87 | 0.90 | 0.80 | 0.94 | 0.64 | 0.97 |
| ≥ 19 | 70.3 | 96.4 | 0.93 | 0.83 | 0.89 | 0.88 | 0.83 | 0.93 | 0.68 | 0.97 |
| ≥ 20 | 66.5 | 97.0 | 0.94 | 0.81 | 0.90 | 0.87 | 0.85 | 0.92 | 0.71 | 0.96 |

Note. FAQ: Functional Activities Questionnaire; AUC: area under the curve; CI: confidence interval; NC: normal cognition; MCI: mild cognitive impairment; PPV: positive predictive value; NPV: negative predictive value.

Latent Structure & Item Responsiveness

Constrained and unconstrained GRMs were run. In the unconstrained GRM, discrimination parameters fell within a narrow range (a = 3.10-4.03), and the information criteria and likelihood ratio test indicated that the constrained model had better fit (constrained discrimination = 3.73); all items had very high discrimination properties. In evaluating difficulty parameters and ICCs, it is clear that items 2 and 1 (i.e., assembling tax/business records and paying bills, respectively) are associated with the least amount of overall functional difficulty. In contrast, items 8 and 5 (i.e., attending to/understanding television/literature and stove use, respectively) are associated with the most functional difficulty. Item 9 (i.e., remembering appointments/medications) appeared to have a wide difficulty range across its response categories. In particular, lower responses were not associated with significant functional disability; however, being completely dependent on this item is associated with a high degree of functional disability. With regard to overall information on functioning, items 8 and 9 (i.e., attending to/understanding television/literature and remembering appointments/medications, respectively) provide the most information regarding functional disability across their response categories. The overall test appears to provide a significant amount of information regarding functional disability (Total Information = 77.31). However, in evaluating the TIC, it provides this information within a narrow range of functional ability; namely, it is most informative for those with slight disability, and 83% of total test information is contained within the θ range of 0 to 2. IRT parameters are summarized in Table 5. ICCs for each response category are visualized in Figures 1-4. IICs are presented in Figure 5, and the TIC is visualized in Figure 6.

Figure 1. Item Characteristic Curves for Functional Activities Questionnaire response category 1 (normal). Higher values on the ability axis indicate more functional difficulty.

Figure 4. Item Characteristic Curves for Functional Activities Questionnaire response category 4 (dependent); “Stove Use” and “Current Events” have the same b and overlap. Higher values on the ability axis indicate more functional difficulty.

Figure 5. Item Information Curves. Higher values on the ability axis indicate more functional difficulty.

Figure 6. Test Information Curve. Higher values on the ability axis indicate more functional difficulty.

Discussion

The current manuscript aimed to determine the reliability and validity of the FAQ using updated and expanded analyses in a large multicenter sample. We found excellent reliability when considering both internal consistency and temporal stability in the entire sample. The reliability coefficients were attenuated when variance was restricted within diagnostic groups. However, there was at least good internal consistency across all diagnostic groups. Furthermore, in the NC group over 1.33 years, there was moderate consistency and good generalizability within the GT model. When considering the CTT coefficient, reliability was poor, although this is attributable to the lack of variance as elaborated by GT and observed in the means and standard deviations at both time points. In fact, simulation studies have shown that CTT’s test-retest coefficient can be drastically reduced with minuscule changes in scores that result in differences in ranks, despite the overall score stability appearing adequate (Duff, 2012).

We also found strong evidence for validity across convergent, criterion/classification, and structural aspects. There were strong correlations with neurocognitive testing and comparatively weaker correlations with depressive symptoms and demographics, suggestive of discriminant validity. There was also strong evidence for criterion validity and classification accuracy, with large differences between cognitive diagnoses and outstanding classification accuracy. Another finding from our study was that the range of appropriate cutoffs for distinguishing MCI from dementia may be wider than previously reported in the literature. Choosing an appropriate cutoff will depend on the base rate of dementia in the particular population the professional is working with, and on whether the professional wants to prioritize positive or negative predictive accuracy; we have provided sensitivity, specificity, PPV, and NPV for four base rates (10%-40%) in Table 4 to facilitate these decisions. There also appears to be strong structural validity, with a single factor (functional disability) having excellent fit. At the item level, all items had strong loadings on the factor and strong discrimination parameters.

Notably, some small-to-medium differences in FAQ scores were observed among demographic groups. The most striking difference was observed between those at the ends of the education spectrum. However, meaningful differences were also found among the other education groups, between the oldest adults and younger groups, and between Hispanic and Spanish-speaking participants and other groups. Of note, when controlling for disease stage, all contrasts shrank to the null-to-negligible range. This suggests that FAQ differences among demographic groups were accounted for by differences in disease severity at the first visit. This accords with several studies showing that advancing age and lower education are risk factors for dementia and increasing severity (Babulal et al., 2019). Furthermore, ethnoracial minorities are at higher risk due to a confluence of factors and may present to clinics at later disease stages (Babulal et al., 2019; O’Bryant et al., 2013). The FAQ’s sensitivity to disease severity and robustness against other demographic influences suggest that including the FAQ in evaluations may balance out demographic classification disparities observed on other tests and help maximize equity in assessment.

It is worth clarifying the construct assessed by the FAQ: despite its name and inspiration to sample “social functioning,” the FAQ is better thought of as a measure of functional disability. This is clear when evaluating items, where the lowest score indicates a lack of difficulty without considering the ability level. For example, consider two individuals whose hobby is gardening: one achieved master gardener status and reaps prize melons every season, and the other’s melon plants rarely provide edible fruit. If both can engage in their hobbies without difficulty, they would both be rated equally. This is also reflected in the ICCs and IICs, but most apparent in the TIC, wherein the FAQ seems to capture those within 0-2 SDs of functional disability (θ). It does not provide any meaningful information on positive functional ability (negative θ values), nor more severe ADL problems. These insights also explain why a prior study found low correlations between the FAQ and a BADL scale (Assis et al., 2014). Based on the aforementioned high internal consistency and relatively small difficulty range, one could argue for excluding items to reduce the length of the FAQ. While this could be statistically justified, there are other reasons to continue to use the FAQ in its current iteration. First, the FAQ’s ten items are not burdensome in and of themselves, and do not typically reduce evaluation efficiency as they can be completed by collateral while the patient undergoes neurocognitive testing. Furthermore, although many items are within the same hierarchical ADL range, they capture different varieties of ADLs that are consistent with the WHO ICF classification, despite predating this system by roughly two decades. Individuals may first have difficulty in one ADL relative to others, and this one ADL varies across individuals. This is reflected in GT analyses that revealed time × item interaction. Furthermore, having excess information for those with low levels of functional difficulty makes the FAQ uniquely appropriate for distinguishing NC, MCI, and dementia, which is reflected in the outstanding AUCs in our ROC analyses.

With the above information, clinicians and investigators may use the FAQ with confidence to assess functioning or discriminate between cognitive stages. Furthermore, when following individuals over time, the test-retest information can help determine whether there is reliable interval change. Though a full treatment is beyond the scope of this manuscript, there are multiple methods for calculating reliable change, and the information provided in Table 2 can be used to calculate these indices (Duff, 2012). For example, if an individual was reassessed a year and a half later, an increase in total FAQ score of ≥2 could be considered reliable change based on one method. However, each CDR stage had markedly different means, variances, and distributional properties, such that traditional reliable change methods may not be appropriate when considering how the FAQ may be used over time.
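As one illustration only (a Jacobson-Truax-style reliable change index, which is not the only defensible method), the normal cognition retest values in Table 2 (time 1 SD = 1.03, r = .52) yield the following:

```latex
% Standard error of measurement and of the difference, using Table 2 (normal cognition)
SE_M      = SD_1\sqrt{1 - r_{12}} = 1.03\sqrt{1 - .52} \approx 0.71
SE_{diff} = \sqrt{2\,SE_M^2} \approx 1.01
% 95% reliable change threshold
RC_{95}   = 1.96 \times SE_{diff} \approx 1.98 \;\Rightarrow\; \text{a change of 2 or more FAQ points}
```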

Our results regarding convergence with cognitive domains align with prior findings reporting that the executive functioning domain has a strong relationship to functional abilities (González, Soble, Marceaux, & McCoy, 2016; Yam, Gross, Prindle, & Marsiske, 2014). Executive function abilities, which capture planning and problem-solving skills, are fundamental to goal-directed behaviors. In our study, the memory and language domains also demonstrated large correlations with the FAQ. Several studies have reported associations between memory and functional status (Mansbach & Mace, 2019; Yam & Marsiske, 2013). However, the relatively robust association with the language domain was not anticipated. The language domain in our study included semantic fluency measures, which have been associated with executive function skills such as updating ability (Shao, Janse, Visser, & Meyer, 2014). Thus, it is possible that the observed association between the language domain and the FAQ may be explained, at least in part, by skills that are pertinent to executive function. In addition, unlike most previous studies examining self-report or performance-based outcomes, our study examined a functional status measure completed by a collateral source. This discrepancy may have influenced the results as language-based functional impairments may be more readily observed in interactions with collaterals.

The IRT findings for Item 9 (i.e., remembering appointments/medications) and its wide range of difficulty across response categories are curious, although they could reflect the moderating impact of compensatory strategies. For example, individuals who have some difficulty with recalling appointments may still have overall strong functional ability if they can write things down and benefit from reminders. However, if they are completely dependent on others to remember, it may reflect a lack of compensatory strategies or more severe disease; both of which are associated with greater functional disability. These, and other IRT insights, can shed light on ADLs and functioning as they relate to aging and neurodegeneration.

In addition to considering the information provided in our study, clinicians and investigators may want to consider the characteristics of the raters who provide collateral information. A recent study using the NACC database found that collateral sources who cohabitate with the individual, have close relationships (i.e., paid caregiver, spouse, adult child), or have higher education are more likely to provide higher ratings on the FAQ, although the effect sizes were negligible-to-small (Hackett, Mis, Drabick, & Giovannetti, 2020). Our study also has some limitations. Primarily, this dataset is not from an epidemiological study and is not representative of the United States population at large. Although a wide range of individuals was included, the sample was predominantly older, White, highly educated, and English-speaking. Additionally, the NACC dataset does not include information on some influential social and psychological factors, such as household income, occupational prestige, self-reported anxiety symptoms, and the languages in which certain forms were completed, so our analyses were unable to explore associations with these variables. Future studies may want to explore the different neurodegenerative causes of functional disability. Furthermore, using differential item functioning and structural invariance techniques in subgroups could help determine whether there are item-level differences across demographic groups. However, this is beyond the scope of the current manuscript, which was primarily focused on the scale as a whole and found no meaningful differences across most demographic groups at the scale level. Other datasets with more representative sampling may be better suited to this task. The current study provides a strong empirical foundation from which our understanding of neurodegenerative disability can grow.