
Evaluation of the psychometric properties of self-reported measures of alcohol consumption: a COSMIN systematic review



To review studies about the reliability and validity of self-reported alcohol consumption measures among adults, an area which needs updating to reflect current research.


Databases (PUBMED (1966-present), MEDLINE (1946-present), EMBASE (1947-present), Cumulative Index of Nursing and Allied Health Literature (CINAHL) (1937-present), PsycINFO (1887-present) and Social Science Citation Index (1976-present)) were searched systematically for studies from inception to 11th August 2017. Pairs of independent reviewers screened study titles, abstracts and full texts with high agreement and a third author resolved disagreements. A comprehensive quality assessment was conducted of the reported psychometric properties of measures of alcohol consumption using the COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) to derive ratings of poor, fair, good or excellent for each checklist item relating to each psychometric property.


Twenty-eight studies met inclusion criteria and, collectively, they investigated twenty-one short-term recall measures, fourteen quantity-frequency measures and eleven graduated-frequency measures. All measures demonstrated adequate/good test-retest reliability and convergent validity. Quantity-frequency measures demonstrated adequate/good criterion validity; graduated-frequency and short-term recall measures demonstrated adequate/good divergent validity. Quantity-frequency measures and short-term recall measures demonstrated adequate/good hypothesis validity; short-term recall measures demonstrated adequate construct validity. Methodological quality varied within and between studies.


It was difficult to discern conclusively which measure was the most reliable and valid given that no study assessed all psychometric properties and the included studies varied in the psychometric properties that they selected to assess. However, when the results from the range of studies were considered and summed, they tended to indicate that the quantity-frequency measure compared to the other two measures performed best in psychometric terms and, therefore, it is likely to produce the most reliable and valid assessment of alcohol consumption in population surveys.


Alcohol use and associated consequences are a major public health problem, described as the third leading risk factor for poor health globally [1]. Recently, revised guidelines from the UK (United Kingdom) Chief Medical Officers advised adults about the likely harmful health effects of drinking more than 14 units/week [2], which is approximately six 175 ml glasses of (13%) wine, six 568 ml pints of (4%) lager or ale or (4.5%) cider or fourteen 25 ml measures of (40%) spirits (1 unit is 10 ml or 8 g of pure alcohol) in the UK [3]. The Global Burden of Disease Survey identified alcohol as a top five risk factor for non-communicable disease in the UK [4]. It is important that reliable and valid measures are used to monitor and assess alcohol misuse and related problems and, in turn, to inform public health strategies.
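The unit arithmetic behind the 14 units/week guideline can be illustrated with a minimal sketch. It assumes only the definition given above (1 UK unit is 10 ml of pure alcohol); the function name `uk_units` and the dictionary of examples are hypothetical and introduced here purely for illustration.

```python
def uk_units(volume_ml: float, abv_percent: float) -> float:
    """Return UK alcohol units for one drink (1 unit = 10 ml of pure alcohol)."""
    pure_alcohol_ml = volume_ml * abv_percent / 100
    return pure_alcohol_ml / 10


# Weekly totals for the drink examples quoted in the text.
weekly_examples = {
    "six 175 ml glasses of 13% wine": 6 * uk_units(175, 13),
    "six 568 ml pints of 4% lager": 6 * uk_units(568, 4),
    "fourteen 25 ml measures of 40% spirits": 14 * uk_units(25, 40),
}

for label, units in weekly_examples.items():
    print(f"{label}: {units:.2f} units")
```

Under this definition the three examples come to roughly 13.65, 13.63 and 14 units respectively, which is why the guideline text describes them as approximate equivalents of the weekly limit.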

Our initial scoping exercise indicated that data about alcohol intake tends to be collected in surveys using one or more of the following three types of self-report questionnaires: Quantity-frequency measures ask questions about ‘usual’ alcohol drinking to estimate the frequency (e.g. number of days per week) and volume of alcohol consumed (e.g. ‘how many (cans/bottles/ glasses) were consumed on a typical drinking day’ [5,6,7]). Graduated-frequency questionnaires measure the volume of consumed alcohol by grouping the number of drinks per occasion into graduated categories, beginning typically with the highest amount consumed by a respondent and decreasing in pre-set categories (e.g. ‘During the last 12-months, how often did you have 12 or more drinks of any kind of alcoholic beverage in a single day?’ ‘During the last 12 months, how often did you have at least 8 but less than 12 drinks of any kind of alcoholic beverage in a single day?’ [8, 9]). Short-term recall measures ask respondents to recall the alcohol that they consumed within a predetermined timeframe such as during the previous week or the last 24-h (e.g. the ‘Yesterday’ method) or using a diary to record all alcohol consumption over a period of time [10, 11].
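As a rough illustration of how responses to the first two question types might be turned into a consumption estimate, the sketch below multiplies frequency by quantity for a quantity-frequency response and sums days multiplied by category midpoints for a graduated-frequency response. This scoring is an assumed simplification and is not taken from the reviewed studies; the function names, the category bounds (including the cap of 16 drinks placed on the open-ended "12 or more" category) and the example responses are all hypothetical.

```python
from typing import Dict, Tuple


def qf_weekly_drinks(drinking_days_per_week: float, drinks_per_drinking_day: float) -> float:
    """Quantity-frequency estimate: usual frequency multiplied by usual quantity."""
    return drinking_days_per_week * drinks_per_drinking_day


def gf_annual_drinks(days_by_category: Dict[Tuple[float, float], float]) -> float:
    """Graduated-frequency estimate over 12 months.

    Each key is an assumed (lower, upper) bound on drinks per day and each
    value is the number of days in the last 12 months spent in that category.
    The category midpoint stands in for the unknown exact amount.
    """
    return sum(days * (low + high) / 2 for (low, high), days in days_by_category.items())


# Hypothetical respondent: drinks on 3 days a week, 2 drinks per drinking day.
print(qf_weekly_drinks(3, 2))

# Hypothetical respondent: 2 days at 12+ drinks (capped at 16 by assumption),
# 10 days at 8 to under 12 drinks, 50 days at 1 to under 8 drinks.
example_response = {(12, 16): 2, (8, 12): 10, (1, 8): 50}
print(gf_annual_drinks(example_response))
```

The midpoint rule is only one possible convention. Real instruments differ in how they treat the open-ended top category, which is one reason estimates from different measure types are not directly interchangeable.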

There is a need to ensure that survey instruments discern accurately alcohol consumption in order to identify the population of drinkers who consume over 14 units of alcohol per week [2], or misuse alcohol. In this review alcohol misuse is defined as ‘drinking excessively – more than the lower-risk limits of alcohol consumption’ [12]. Gmel [13] conducted a literature review of self-report measures (the quantity-frequency, graduated-frequency and short-term recall measures) compared to biological tests (i.e. blood alcohol concentration) using studies published in this field since 2004; and Feunekes [14] conducted a systematic review of studies published 1984–1999 on the capacity of the quantity frequency, extended quantity frequency, retrospective diary, prospective diary, and 24-h recall measures, respectively, to classify individuals according to their alcohol intake. These previous reviews are outdated and not in keeping with advances in survey methodology and design concerning alcohol research or with public health guideline changes (such as the reduction in alcohol guidelines in the UK [2]). This paper presents the results of a systematic review of all relevant research evidence regarding the reliability and validity of different types of survey measures of self-reported alcohol consumption in the adult population. Reliability and validity in this review are defined by the COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) methodology [15]. COSMIN provided an iterative way of assessing the psychometric properties of included measures. The review adds to previous research by providing the first COSMIN-type review of alcohol intake measures as well as providing an updated review of the alcohol consumption measures. This review addressed the following questions:

Are self-reporting measures (the quantity-frequency, graduated-frequency and short term recall measures) reliable and valid in their assessment of alcohol consumption for the general population? If so, which of the self-reporting measures are most reliable and valid? Which measure most accurately identifies levels of alcohol consumption? The use of a reliable and valid measure in alcohol survey research will enhance the rigour and comparability of studies.


The review was reported in accordance with PRISMA guidelines (see checklist attached as Additional file 1) [16]. No protocol exists for this review. Study authors searched PUBMED (1966-present), MEDLINE (1946-present), EMBASE (1947-present), CINAHL (1937-present), PsycINFO (1887-present) and SSCI (1976-present) from their inception to 11th August 2017 for peer-reviewed articles. Search terms were based on a COSMIN search filter to identify studies of psychometric properties, combined with terms relevant to alcohol intake measures (Fig. 1).

Fig. 1

Search strategy; List of free text terms and medical subject headings searched for using the conjunctions ‘AND’ or ‘OR’ to find articles which met the inclusion criteria using the online bibliographic databases

Eligibility criteria

Papers were included if they were English language peer-reviewed studies that evaluated the reliability or validity of survey measures of alcohol consumption that were ‘self-completed’ by adults aged ≥18 years via telephone, paper, computer or interview. Studies were included if they assessed the reliability or validity of self-report alcohol consumption measures (the quantity-frequency, graduated-frequency or short term recall measures or any variation of these measures). Studies were excluded if they did not focus on reliability or validity, were reviews of the literature or study participants had a mental or alcohol disorder diagnosis, were in receipt of treatment for alcohol misuse or were being cared for in a care institution. The review focused upon evaluating the psychometric properties of alcohol consumption measurement for the general drinking population; previous research indicates that people with an alcohol use disorder diagnosis tend to self-report differently from other drinkers (see discussion [17]). Studies were excluded also if they measured self-reported alcohol consumption using other methods only (biological testing or self-reporting alcohol tests).

Titles were exported to RefWorks and duplicates were removed; titles and then suitable abstracts were screened and examined independently by HMcK, CT and MD. Cases of disagreement over study inclusion were resolved via review and discussion. Data collection from eligible studies involved extracting information about population characteristics, measures, results and COSMIN quality ratings onto an Excel spreadsheet (see Table 2). This was completed by HMcK and checked by other reviewers. Reference lists of literature reviews and citation lists of included studies were searched for relevant papers. The search strategy identified 806 studies after duplicate removal; 478 remained following examination of abstracts and 28 papers were included following full-text review (Fig. 2).

Fig. 2

PRISMA flow diagram [16]; Flowchart depicting the process of searching, selecting and sifting studies according to eligibility criteria. The search stages were identification, screening, eligibility and inclusion

Quality assessment

Pairs of independent reviewers applied the well-validated COSMIN checklist to assess the methodological quality of included studies. Definitions of the psychometric properties are provided by COSMIN (see Table 1). Information (e.g. coefficients) on psychometric properties reported on each measure by included studies was assessed using the quality criteria COSMIN checklist created by Terwee [18], which generated ratings of good, moderate or poor. An additional methodological quality score was calculated for each psychometric property checklist using the ‘worst score counts’ method, where the lowest rating of any of the items in an individual psychometric property checklist is taken as the overall score for that property [19]. Risk of bias (where evidence reported by studies may not be trustworthy [20]) was accounted for by assessing methodological quality of studies. It is important to note that the review reported the properties that were recorded in the original articles and that most articles did not assess or report the full range of properties recommended by COSMIN.
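The ‘worst score counts’ rule described above can be sketched in a few lines. This is a minimal illustration, assuming the four-point scale of poor, fair, good and excellent used for checklist items; the function name and the example item ratings are hypothetical and do not come from any included study.

```python
# Ordered from lowest to highest methodological quality.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def worst_score_counts(item_ratings):
    """Return the lowest item rating, taken as the overall score for the property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical checklist for one psychometric property in one study.
example_items = ["excellent", "good", "fair", "excellent"]
print(worst_score_counts(example_items))  # fair
```

A single low-rated item therefore determines the overall rating for a property, regardless of how well the remaining items score.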

Table 1 COSMIN definitions of domains, measurement properties, and aspects of measurement properties [18]


Table 2 presents the characteristics and results from the 28 papers that met inclusion criteria. It acts as a summary of the content from Tables S1 and S2, which are included as Additional files 2 and 3. Included studies reported drinks/alcohol measures in standard sizes for the country of publication (see Additional file 2: Table S1). Some studies included beverage-specific measures. Studies were conducted in the USA (n = 18), Australia (n = 4), Canada (n = 2), Finland (n = 2), UK (n = 1) and the Netherlands (n = 1). Collectively, the studies investigated short-term recall measures (n = 21), quantity-frequency measures (n = 14) and graduated-frequency measures (n = 11). Convergent validity (n = 15), criterion validity (n = 14), test-retest reliability (n = 10), predictive validity (n = 9), inter-rater reliability (n = 5), hypothesis validity (n = 4), construct validity (n = 2), divergent validity (n = 2), and structural validity (n = 1) were assessed across the studies. Some studies assessed the psychometric properties of more than one measure and measure type but no study assessed all COSMIN psychometric properties.

Table 2 Summary of characteristics and psychometric properties for included studies

Methodological quality assessment

There was wide variation in methodological quality ratings for each psychometric property (as presented and discussed below).

Quantity-frequency measures achieved criterion validity ratings of excellent (n = 1), fair (n = 1) and poor (n = 2). Test-retest reliability quality ratings were good (n = 1), fair (n = 1) and poor (n = 2), with inter-rater reliability rated fair (n = 1) and poor (n = 1). Convergent validity ratings were good (n = 1) and fair (n = 2). Hypothesis validity was rated good (n = 1) and fair (n = 1). Predictive validity was rated excellent (n = 1) and structural validity fair (n = 1).

The graduated-frequency measures achieved convergent validity ratings of good (n = 2) and fair (n = 3). Test-retest reliability ratings were rated fair (n = 2) and good (n = 1) and inter-rater reliability was also rated fair (n = 1). Criterion validity was rated good (n = 1), fair (n = 1) and poor (n = 1). Predictive validity was rated excellent (n = 1), good (n = 1) and fair (n = 1). Divergent validity was rated fair (n = 1). Construct validity was rated fair (n = 1).

The criterion validity ratings for the short-term recall measures were excellent (n = 1), good (n = 1), fair (n = 1) and poor (n = 4). Convergent validity was rated good (n = 2) and fair (n = 5). Predictive validity was rated excellent (n = 1), good (n = 1), fair (n = 2) and poor (n = 1). Test-retest reliability scores were rated fair (n = 3), with inter-rater reliability also rated fair (n = 1). Hypothesis validity was rated good (n = 1) and fair (n = 1). Divergent validity was rated fair (n = 1) and construct validity was rated poor (n = 1).

Test-retest reliability

Quantity-frequency and graduated-frequency measures completed by a Finnish population sample [11] and a computer and paper administered quantity-frequency measure demonstrated good test-retest reliabilities [6]. Moderate test-retest reliabilities were reported for a quantity-frequency measure administered to a general population sample [21] and for quantity-frequency and short-term recall measures in an Australian general sample of twins [22]. Good test-retest reliability was reported in an undergraduate student population sample for a graduated-frequency measure [10] and in a general population [23]. Test-retest reliability of a daily intake short-term recall measure was good for an older adult sample [24]. Moderate test-retest reliability was reported for a short-term recall measure of ≥5 drinks consumed per drinking occasion [25]. In an older population sample, inter-rater reliability was good for quantity-frequency and short-term recall measures [26] though poor inter-rater reliability was reported in a study administering a weekly quantity-frequency measure to over 65-year olds [7] and for the graduated-frequency and short-term recall measures in a general population [27] (for detailed results see Table 2).

Criterion validity

Studies of quantity-frequency measures administered to the general population sample [28,29,30] and a quantity-frequency and short-term recall measure [31] demonstrated good criterion validity. An annual graduated-frequency measure and previous 24 h short-term recall measure administered in a general population sample indicated good criterion validity for ‘heavy drinkers’. Poor validity was reported for moderate drinkers in this study (due perhaps to the fact that consumers of lower levels of alcohol may drink irregularly and not within the 24-h before administration of the short-term recall measure) [27]. An undergraduate student sample completed two graduated-frequency measures and a short-term recall measure with moderate criterion validity [32]. Short-term recall spousal reports that were used as a criterion or standard to validate alcohol intake in an older sample reported good criterion validity [24]. A short-term recall measure administered to an undergraduate student sample had poor criterion validity [33] though other studies of the short-term recall measure [34] and the short-term recall and graduated-frequency measures [9] reported good criterion validity (see Table 2).

Construct validity

Poor construct validity was found for a 30-day graduated-frequency measure completed by an undergraduate sample (age range 18–20 years) [35]. A short-term recall measure compared with the MAST measure on two separate occasions in a sample of older adults reported poor to moderate construct validity [24] (see Table 2).

Hypothesis validity

Good hypothesis validity was reported for a quantity-frequency measure compared to a short-term recall measure in an older adult population sample [26] and for a quantity-frequency measure compared to a short-term measure in a general population sample [36] (see Table 2).

Predictive validity

One study of a graduated-frequency and short-term recall measure that was completed by an undergraduate student sample demonstrated adequate to good predictive validity [9] whilst another (albeit small sample size) study of the same measures in an undergraduate student sample (age range 18–20 years) recorded poor predictive validity [32]. A general population study found poor predictive validity for the three measures [37] though measured against unstandardized indicators of alcohol-related mortality, morbidity and harm. A short-term recall measure achieved good or adequate prediction properties regarding heavy drinking (≥5 drinks per occasion) for samples aged 18–39 [25] and for a general population [38] (see Table 2).

Convergent validity

Moderate to good convergent validity was found in a general population sample for a two-week beverage-specific quantity-frequency measure, a graduated-frequency and short-term recall measure [39]. Similarly, adequate or good convergent validity was recorded for the three types of measures of alcohol intake in a cohort of 20 to 63-year olds [11] and in a general population [37]. A graduated-frequency and short-term recall measure demonstrated good convergent validity in undergraduate student samples [8, 10]. A short-term recall measure completed by undergraduate student samples reported adequate to good convergent validity [40]. Also, adequate convergent validity was found for short-term recall measures in a male population sample [41] (see Table 2). Only one study referred to divergent validity of the graduated-frequency and short-term recall measures and only in terms of a negative correlation in an undergraduate student sample between religiosity and alcohol consumption [10] (see Table 2). Similarly, only one study referred explicitly to structural validity: a 30-day quantity-frequency measure that was used to collect data on alcohol consumption in a general population reported poor validity [42] (see Table 2).

Overall, the review found that only a relatively small number of studies investigated the COSMIN psychometric domains of each type of measure. Furthermore, the hypothesis validity or structural validity of the graduated-frequency measure was not investigated at all nor was the structural validity of the short-term recall measure. Divergent validity or construct validity were not assessed for the quantity-frequency measure.


Psychometric property ratings for measure types

Each type of measure appeared to have good criterion validity according to COSMIN methodology. Several different reference standards or criteria were used in the included studies to measure alcohol consumption (e.g. [9, 29]). The appropriateness of using peers [34], spousal reports [24] and short-term recall measures [31] as criterion standards is questionable and perhaps it is unsurprising that these studies reported a low quality rating (despite reporting good content validity). Currently, there is no gold standard for the measurement of alcohol consumption. Most countries use some standard unit of measurement (e.g. one drink, one unit) but there is a lack of consensus and no internationally accepted definition, thereby posing difficulties for the conduct of comparative analyses. Biological markers of alcohol consumption should be used more frequently to support and validate findings from self-reporting measures, as these methods are not subject to sampling errors or researcher or participant bias [14]. However, these measures are also not without risk of error. Alcohol abstinence in the 24 h prior to breath-, blood- or urine-ethanol measurement has been shown to produce low results even for heavy drinkers [43]. More research is needed to find a gold standard for alcohol consumption measurement.

Construct validity was poor for graduated-frequency and short-term recall measures, and not assessed for quantity-frequency measures. Structural validity was assessed for the quantity-frequency measure only, and this construct validity-related property was deemed to be poor. Only one study investigated the predictive validity of the quantity-frequency measure and it found that the validity was poor. Poor predictive validity results suggest the measure may not be valid in predicting the measurement of future alcohol intake among the general population or in predicting the measurement of drinking trajectories and alcohol-related consequences. The study was conducted with good methodological quality and received a good COSMIN score.

In contrast, the graduated-frequency and short-term recall measures achieved mixed results including predicting with variable accuracy the outcomes of alcohol-related morbidity and mortality and alcohol dependence. There were several studies of the convergent validity of each measure and generally this property was deemed to be moderate to good.

Test-retest results tended to indicate that similar outcome-assessments of alcohol consumption were found when the quantity-frequency measure, graduated-frequency measure and the short-term recall measure were re-administered. Mixed results were reported for inter-rater reliability of quantity-frequency and short-term recall measures, with poor inter-rater reliability found when the graduated-frequency measure was applied. In particular, there appeared to be difficulty obtaining good agreement between raters regarding the measurement of consumed beer, wine and liquor respectively [27], and between self-report tests (AUDIT (Alcohol Use Disorders Identification Test [44]) and CAGE (Cut down, Annoyed, Guilty, Eye-opener) [45]) and a quantity-frequency measure when research assistants interviewed participants using a face-to-face predetermined appointment schedule [7]. It is important to note that these studies achieved only fair or poor COSMIN ratings. Indeed, many of the reported poor psychometric properties may be due to poorly conducted studies as indicated by poor COSMIN ratings [6, 21, 31]. Variation between types of psychometric properties for the same measure (e.g. high validity for one property and low for another property) may be due to differences in study design and methodological quality.

Discrepancies between COSMIN ratings and psychometric properties

There were some studies in which there were discrepancies between COSMIN ratings of the quality of a psychometric property and the performance of a measure. For example, one study [6] reported good test-retest reliability for a typical weekly quantity-frequency measure but the methodological quality of a particular aspect of the study was rated poor because the method of administering the (computer or paper) measure of consumption was not consistent across time-points. Reasons for poor methodological quality ratings using the COSMIN checklist included inappropriate time intervals between measure administrations, ambiguity over management of missing responses, lack of assurance that patients remained stable between measure administrations, inadequate sample size and choice of inappropriate statistical methods (e.g. reporting Spearman’s correlation coefficients [46] over kappa values for test-retest reliability).
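To make the point about statistical methods concrete, the sketch below computes Cohen's kappa for test-retest agreement on a categorical drinking classification. It is a minimal, self-contained illustration and not the analysis used in any included study; the function name, the two drinking categories and the example responses are hypothetical.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Chance-corrected agreement between two administrations of a categorical measure."""
    if len(first) != len(second) or not first:
        raise ValueError("both administrations need the same non-zero number of responses")
    n = len(first)
    # Proportion of respondents classified identically at both time-points.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from the marginal category frequencies.
    counts_first, counts_second = Counter(first), Counter(second)
    categories = counts_first.keys() | counts_second.keys()
    expected = sum(counts_first[c] * counts_second[c] for c in categories) / n ** 2
    if expected == 1:
        return 1.0  # every response falls in one category at both time-points
    return (observed - expected) / (1 - expected)


# Hypothetical classifications of eight respondents at test and retest.
test_round = ["low", "low", "high", "high", "low", "high", "low", "low"]
retest_round = ["low", "low", "high", "low", "low", "high", "low", "high"]
print(round(cohens_kappa(test_round, retest_round), 3))
```

In this example six of eight respondents are classified identically (75% raw agreement), yet kappa is about 0.47 once chance agreement is removed. A rank correlation such as Spearman's coefficient measures monotonic association rather than agreement, so it can remain high even when the second administration is systematically shifted.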

Issues with self-reporting alcohol consumption

Self-reported alcohol consumption is difficult to measure accurately due to the influence of social desirability and memory issues and these factors were alluded to in many included studies (e.g. [25, 27, 32, 35]). Possible solutions to these challenges include using more anonymised interview types, randomised response techniques, checking responses using more than one alcohol measure and using memory aids (interviewer prompts, calendars or diaries) [47]. Also, population-based survey research about alcohol consumption and drinking habits is particularly problematic when the sample includes alcoholics because of uncertainty about whether or not participants are sober when interviewed, difficulty recalling consumption due to the effect of alcohol on memory and increased alcohol tolerance in frequently heavy drinkers [48]. These issues pose challenges for the reliable and valid assessment of alcohol consumption in surveys. Potential solutions include factoring in more complex survey questions requiring greater reflection on alcohol intake (if respondents are asked to consider the timing, type of beverage drunk and episodic heavy drinking, their responses should be more considered) [17], use of a breathalyser before measure administration to ensure participants are alcohol-free [49] and creating an environment that is conducive to confidentiality and honest disclosure of alcohol consumption [48, 50]. These potential solutions may be incorporated into population-based survey collection of alcohol consumption data in order to afford greater confidence in the drinking status of participants and significant assurance that responses reflect consumption accurately.

Comparison with previous reviews

Generally, the measures did not appear to vary significantly across population age and sex groupings. The assessment of the amount of alcohol consumed appeared to exert some influence on the psychometric performance of self-report measures. Parker [27] reported good concurrent validity using a short-term recall measure though for heavy drinkers only. Gmel [13] found that the graduated-frequency measure over-reported alcohol intake, whereas the beverage-specific quantity-frequency measure provided a more accurate measure of consumption. The Feunekes review recommended that the quantity and frequency of alcohol consumption should be prioritised and assessed separately for specific types of alcoholic beverages [14] and beverage-specific quantity-frequency measures performed accurately and reliably though only in relation to the consumption of lower levels of alcohol [26, 28]. The use of a ‘diary’ format with a predetermined timeframe (that afforded individuals an opportunity to record all alcohol consumption in a format of their choice; and usually in the format of a short-term recall measure) had good psychometric properties [24, 29]. This finding may suggest that the use of an ‘actual’ time period instead of the ‘usual’ timeframes in quantity-frequency and graduated-frequency measures [51] may add to the reliability and validity of assessments of alcohol consumption. However, both reviews found that the quantity-frequency measure performed with most reliability and validity and was the measure with the highest concordance with the short-term recall ‘diary’ measure [22, 29, 33, 38].

Recommendations for improved reliability and validity

The review findings suggest that the reliability and validity of self-reporting alcohol consumption measures may be improved in various ways. For example, computerised or automated modes of administration rather than an interviewer-based mode might facilitate greater privacy and assure more candid reporting [52]. Longer timeframes may be more desirable as they tend to capture less frequent drinkers (i.e. weekly, monthly or annual recall) and questions which involve specified timeframes (i.e. last week, last year) over ‘usual’ reference frames require respondents to focus their recall. Beverage-specific questions and questions that ask respondents to group responses into graduated categories may encourage a more thorough consideration of their alcohol consumption and, in turn, produce more accurate reporting. It is worth considering that the self-report measures themselves are outdated as they focus only upon frequency and volume of alcohol. It may be worthwhile to instead use self-report tests to assess alcohol consumption which take into account symptoms of alcohol addiction/dependence as well. Using review findings, the advantages and disadvantages of each measure type are summarised (Table 3).

Table 3 Summary table of the advantages and disadvantages of the quantity-frequency, graduated-frequency and short-term recall measures

Limitations and strengths

The review found wide variation in the structure, content and format of quantity-frequency, graduated-frequency and short-term recall measures. For example, time-period referents ranged from 24-h recall to alcohol intake over the previous year and alcohol consumption was assessed in terms of units (standardised to the country of each sample of respondents), grams of alcohol, typical sizes of sold drinks and beverage-specific drinks. The included studies from various multidisciplinary databases covered a range of locations, cultures and populations and these factors were taken into account in the analytical comparisons of measures of alcohol consumption. It is important to note that a proportion of the review studies focused on undergraduate student populations (e.g. [8, 10, 34, 40]). Arguably, students may be atypical with respect to the general population [53] and their alcohol consumption patterns may have limited read-across to the general population particularly the population of older people. Some psychometric properties were not assessed including measurement error, cross-cultural validity, internal consistency and responsiveness. All studies were in the English language (in keeping with COSMIN manual guidelines) and it is possible that important studies in other languages may have been missed. The review adhered to the COSMIN manual [15] and whilst the COSMIN method adds rigour to the exercise of psychometric assessment, arguably, a limitation is the use of the ‘worst score counts’ which means that despite attaining higher quality scores on some items, the lowest score of an item list is taken as the overall quality rating (e.g. [28, 31]). Furthermore, studies of poor design quality were included in the review due to the overall lack of studies that met initial eligibility criteria.

Nevertheless, the review was completed in a methodologically robust fashion as per the COSMIN approach which has transparent, tested and validated resources such as a manual, search filters and a quality appraisal tool [15]. Particular strengths include the use of extensive search terms and having two reviewers search the literature.


The studies of quantity-frequency measures indicated good/adequate psychometric properties for test-retest reliability, criterion validity, convergent validity and hypothesis validity; predictive and structural validity were rated as poor and inter-rater reliability reported mixed results. Regarding graduated-frequency measures, good/adequate psychometric properties were reported for test-retest reliability, convergent validity and divergent validity; criterion validity and predictive validity reported mixed results and construct validity and inter-rater reliability were reported as poor. Short-term recall measures achieved good/adequate psychometric properties for test-retest reliability, convergent validity, hypothesis validity, construct validity and divergent validity. Criterion validity, predictive validity and inter-rater reliability reported mixed results. The review findings add to previously published alcohol self-report literature by providing an updated appraisal of research on measures of alcohol consumption and indicate that a combination of aspects of the various measures may enhance the reliable and valid assessment of patterns of drinking.

It is difficult to discern which of the existing measures is the most reliable and valid, given the absence of any assessment of certain psychometric properties and the mixed results of the studies included in the review. Arguably, when the results from the range of studies are considered and summed, they indicate that the quantity-frequency measure performed best in psychometric terms compared with the other two measures and, therefore, is likely to produce the most reliable and valid assessment of alcohol consumption in population surveys. The results also indicated that the measures which performed with good reliability and validity were those that assessed beverage-specific alcohol consumption, used actual timeframes and asked about episodes of binge drinking, and that quantity-frequency measures appeared to be the ‘best’ questionnaire type currently available for measuring self-reported alcohol consumption. Clearly, there is a need for more focused psychometric studies of measures of alcohol consumption, including head-to-head comparisons in population-based and community surveys. Comparison with previous reviews [13, 14] is difficult because they did not employ the COSMIN methodology to appraise studies. Overall, the findings appear to be in keeping with the Gmel review [13], which found that a beverage-specific quantity-frequency measure recorded alcohol consumption more reliably, and with the Feunekes review [14], which reported that the most accurate measurement of alcohol intake was provided by quantity-frequency and short-term recall measures.



AUDIT: Alcohol Use Disorders Identification Test [44]

CAGE: Cut down, Annoyed, Guilty, Eye-opener (test for problem alcohol use) [45]

COSMIN: COnsensus-based Standards for the selection of health Measurement INstruments [15]

DSM: Diagnostic and Statistical Manual of Mental Disorders [56]

MAST: Michigan Alcoholism Screening Test [55]

DSM-III-R: Diagnostic and Statistical Manual of Mental Disorders, revised 3rd edition

DSM-IV: Diagnostic and Statistical Manual of Mental Disorders, 4th edition

UK: United Kingdom


  1. World Health Organisation. Global strategy to reduce the harmful use of alcohol. World Health Organisation; 1 May 2010. Available: [Accessed 18 July 2017].

  2. Department of Health. Health risks from alcohol: new guidelines. 8 January 2016. Available: [Accessed 1 Aug 2017].

  3. DrinkAware. What is an alcohol unit? DrinkAware; 16 January 2016. Available: [Accessed 21 Dec 2017].

  4. Murray C, Richards M, Newton JN, Fenton KA, Anderson HR, Atkinson C, Bennett D, Bernabe E, Blencowe H, Bourne R, Braithwaite T, Brayne C, Bruge T, Brugha TS, Burney P, Dherani M, Dolk H, Edmond K, Ezzati M, Fleming ND, Fleming ND, Freedman G, Gunnell D, Hay RJ, Hutchings SJ, LOhno S, Lozano R, Lyons RA, Marcenes W, Magnavi M, Newton CR, Pearce N, Pope D, Rushton L, Salomon JA, Shibuya K, Wang T, Wang T, Williams HC, Woolf AD, Lopez AD, Davis A. UK health performance: findings of the global burden of disease study 2010. Lancet. 2013;381(9871):997–1020.

  5. Dawson D. Methodological issues in measuring alcohol use. Alcohol Res Health. 2003;27(1):18–28.

  6. Bonevski B, Campbell E, Sanson-Fisher R. The validity and reliability of an interactive computer tobacco and alcohol use survey in general practice. Addict Behav. 2010;35(1):492–8.

  7. Reid M, Tinetti M, O'Connor P, Kosten T, Concato J. Measuring alcohol consumption among older adults: a comparison of available methods. Am J Addictions. 2003;12(3):211–9.

  8. O'Hare T. Measuring alcohol consumption: a comparison of the retrospective diary and the quantity-frequency methods in a college drinking survey. J Stud Alcohol. 1991;52(5):500–2.

  9. O'Hare T. Comparing the QFI, the retrospective diary and binge drinking in college first offenders. J Alcohol Drug Educ. 1997;42(3):40–53.

  10. Dollinger S, Malmquist D. Reliability and validity of single-item self-reports: with special relevance to college students’ alcohol use, religiosity, study and social life. J Gen Psychol. 2009;136(3):231–41.

  11. Poikolainen K, Podkletnova I, Alho H. Accuracy of quantity-frequency and graduated frequency questionnaires in measuring alcohol intake: comparison with daily diary and commonly used laboratory markers. Alcohol Alcoholism. 2002;37(6):573–6.

  12. National Health Service. Alcohol misuse. National Health Service; 28 November 2015. Available: [Accessed 21 Dec 2017].

  13. Gmel G, Rehm J. Measuring alcohol consumption. Contemp Drug Probl. 2004;31(3):467–540.

  14. Feunekes G, van ’t Veer P, van Staveren WA, Kok FJ. Alcohol intake assessment: the sober facts. Am J Epidemiol. 1999;150(1):105–12.

  15. Mokkink L, Terwee C, Patrick D, Alonso J, Stratford P, Knol D, Bouter L, de Vet HC. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res. 2010;19(4):539–49.

  16. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.

  17. Toneatto T, Sobell M, Sobell L. Predictors of alcohol abusers’ inconsistent self-reports of their drinking and life events. Alcoholism Clin Exp Res. 1992;16:542–6.

  18. Terwee C, Bot S, de Boer M, van der Windt D, Knol D, Dekker J, Bouter L, de Vet H. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60(1):34–42.

  19. Mokkink L, Terwee C, Knol D, Stratford P, Alonso J, Patrick D, Bouter L, de Vet HC. The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol. 2010;10:22.

  20. Mokkink L, de Vet H, Prinsen C, Patrick D, Alonso J, Bouter L, Terwee C. COSMIN risk of bias checklist for systematic reviews of patient reported outcome measures. 12 December 2017. Available: [Accessed 21 Dec 2017].

  21. Hansell N, Agrawal A, Whitfield J, Morley K, Zhu G. Long-term stability and heritability of telephone interview measures of alcohol consumption and dependence. Twin Res Hum Genet. 2008;11(3):287–305.

  22. Whitfield J, Madden P, Neale M, Heath A, Martin N. The genetics of alcohol intake and of alcohol dependence. Alcoholism Clin Exp Res. 2004;28(8):1153–60.

  23. Gruenewald P, Johnson F. The stability and reliability of self-reported drinking measures. J Stud Alcohol. 2006;67(1):738–45.

  24. Chaikelson J, Arbuckle T, Lapidus S, Pushkar Gold D. Measurement of lifetime alcohol consumption. J Stud Alcohol. 1994;55(1):133–40.

  25. Greenfield T, Nayak M, Bond J, Kerr W, Ye Y. Test-retest reliability and validity of life-course alcohol consumption measures: the 2005 National Alcohol Survey Follow up. Alcoholism Clin Exp Res. 2014;38(9):2479–87.

  26. Crum R, Puddley I, Gee G, Fried L. Reproducibility of two approaches for assessing alcohol consumption among older adults. Addict Res Theory. 2002;10(4):373–85.

  27. Parker D, Derby C, Usner D, Gonzalez S, Lapane K, Carleton R. Self-reported alcohol intake using two different question formats in southeastern New England. Int J Epidemiol. 1996;25(4):770–4.

  28. Russell M, Welte J, Barnes G. Quantity-frequency measures of alcohol consumption: beverage-specific vs global questions. Br J Addict. 1991;86(1):409–17.

  29. Sander A, Witol A, Kreutzer J. Alcohol use after traumatic brain injury: concordance of patients’ and relatives’ reports. Alcohol Trauma Brain Inj. 1997;78(1):138–41.

  30. Cutler S, Wallace P, Haines A. Assessing alcohol consumption in general practice patients - a comparison between questionnaire and interview. Alcohol Alcoholism. 1988;23(6):441–50.

  31. Koppes L, Twisk J, Snel J, Kemper H. Concurrent validity of alcohol consumption measurement in a ‘healthy’ population; quantity-frequency questionnaire v. dietary history interview. Br J Nutr. 2002;88(1):427–34.

  32. Weingardt K, Baer J, Kivlahan D. Episodic heavy drinking among college students: methodological issues and longitudinal perspectives. Psychol Addict Behav. 1998;12(3):155–67.

  33. Read J, Kahler C, Strong D, Colder C. Development and preliminary validation of the young adult alcohol consequences questionnaire. J Stud Alcohol. 2006;67(1):169–77.

  34. Northcote J, Livingston M. Accuracy of self-reported drinking: observational verification of ‘last occasion’ drink estimates of young adults. Alcohol Alcoholism. 2011;46(6):709–13.

  35. McGinley J, Curran P. Validity counts with multiplying ordinal items defined by binned counts: an application to a quantity-frequency measure of alcohol use. Methodol (Gott). 2014;10(3):108–16.

  36. Tuunanen M, Aalto M, Seppa K. Mean-weekly alcohol questions are not recommended for clinical work. Alcohol Alcoholism. 2013;48(3):308–11.

  37. Rehm J, Greenfield T, Walsh G, Xic X, Robson L, Single E. Assessment methods for alcohol consumption, prevalence of high risk drinking and harm: a sensitivity analysis. Int J Epidemiol. 1999;28(1):219–24.

  38. Searles J, Perrine M, Mundt J, Helzer J. Self-report of drinking using touch-tone telephone: extending the limits of reliable daily contact. J Stud Alcohol. 1995;56(4):375–82.

  39. Hilton M. A comparison of a prospective diary and two summary recall techniques for recording alcohol consumption. Br J Addict. 1989;84(1):1085–92.

  40. LaBrie J, Penderson E, Earleywine M. A group-administered timeline Followback assessment of alcohol use. J Stud Alcohol. 2004;66(5):693–7.

  41. Searles J, Helzer J, Walter D. Comparison of drinking patterns measured by daily reports and timeline Followback. Psychol Addict Behav. 2000;14(3):277–86.

  42. Lennox R, Zarkin G, Bray J. Latent variable models of alcohol-related constructs. J Subst Abus. 1996;8(2):241–50.

  43. Sharpe P. Biochemical detection and monitoring of alcohol abuse and abstinence. Ann Clin Biochem. 2001;38:652–64.

  44. World Health Organisation. The alcohol use disorders identification test. Geneva: Department of Mental Health and Substance Dependence; 2001.

  45. Ewing J. Detecting alcoholism. The CAGE questionnaire. J Am Med Assoc. 1984;252(14):1905–7.

  46. Daniel WW. Applied nonparametric statistics. London: Houghton Mifflin; 1978.

  47. Bowling A. Mode of questionnaire administration can have serious effects on data quality. J Public Health. 2005;27(3):281–91.

  48. Sobell L, Sobell M. Alcohol consumption measures. 1 August 2004. Available: [Accessed 7 June 2017].

  49. Sobell L, Toneatto T, Sobell M. Behavioral assessment and treatment planning for alcohol, tobacco, and other drug problems: current status with an emphasis on clinical applications. Behav Ther. 1994;25:533–80.

  50. Midanik L. The validity of self-reported alcohol consumption and alcohol problems: a literature review. Addiction. 1982;77(4):357–82.

  51. Werch C. Quantity-frequency and diary measures of alcohol consumption for elderly drinkers. Int J Addict. 1989;24(9):859–65.

  52. Lucas R, Mullin P, Luna C, McInroy D. Psychiatrists and a computer as interrogators of patients with alcohol-related illnesses: a comparison. Br J Psychiatry. 1977;131:160–7.

  53. Slutske WS, Hunt-Carter EE, Nabors-Oberg RE, Sher KJ, Bucholz KK, Madden PAF, Anokhin A, Heath AC. Do college students drink more than their non-college-attending peers? Evidence from a population-based longitudinal female twin study. J Abnorm Psychol. 2004;113(4):530–40.

  54. Streiner DL, Norman GR, Cairney J. Health measurement scales: a practical guide to their development and use. Oxford: Oxford University Press; 2015.

  55. Selzer M. The Michigan alcoholism screening test: the quest for a new diagnostic instrument. Am J Psychiat. 1971;127(12):1653–8.

  56. American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 5th ed. Arlington: American Psychiatric Association; 2013.


Not applicable


This review was completed as part of a PhD which was funded by the Department of Employment and Learning Northern Ireland (DEL NI).

Availability of data and materials

All data generated or analysed during this study are included in this published article and in Additional files 2 and 3.

Author information



MD and DOR conceived of the study. HMcK and CT created the search strategy and HMcK conducted the search. HMcK, CT and MD reviewed studies for suitability against the inclusion criteria. HMcK extracted study information. MD and CT assisted in drafting the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hannah McKenna.

Ethics declarations

Authors’ information

The study was conducted at the Centre for Public Health, Queen’s University Belfast.

Ethics approval and consent to participate

All included studies involving the use of human participants were conducted with ethical approval and consent.

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA statement checklist [16]. Checklist for the minimum required items to be reported as part of a systematic review. (DOC 62 kb)

Additional file 2: Table S1.

Characteristics of included studies. A full description of the characteristics of each study which met the review inclusion criteria (n = 28). (DOCX 25 kb)

Additional file 3: Table S2.

Psychometric properties of included studies grouped into results reported by study authors and COSMIN quality ratings assigned by review authors (n = 28). (DOCX 41 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

McKenna, H., Treanor, C., O’Reilly, D. et al. Evaluation of the psychometric properties of self-reported measures of alcohol consumption: a COSMIN systematic review. Subst Abuse Treat Prev Policy 13, 6 (2018).
