Fred B. Bryant, Ph.D., Loyola University Chicago
When researchers or practitioners need to find a way to measure a particular variable or outcome in their work, they look for a measurement instrument that has both reliability and validity. But what exactly does it mean for a measurement instrument to be “reliable” or “valid,” why do reliability and validity matter, and how do you go about assessing the reliability and validity of a measurement tool?
The Meaning of Reliability
Imagine a medical researcher wants to evaluate the effectiveness of a six-month weight-loss program in helping people lose weight. The researcher first weighs clients at the start of the six-month program and then weighs them again at the end of the program. But imagine that the weighing scale the researcher uses to assess each client’s weight contains a damaged spring that leads it to produce unreliable readings that fluctuate wildly over time, even when the weight of the individual being measured is not actually changing at all. Because this weighing scale is not reliable, you cannot trust it to provide accurate measurements. And for this reason, the weighing scale would be a poor measure to use in trying to assess the effectiveness of the weight-loss program in helping people lose weight.
Why Reliability Matters
The more unreliable an assessment is, the less useful it will be in research or practice. A basic tenet of classical test theory in the field of psychometrics is that an observed score obtained from a measurement of a particular variable for a given individual is a function of two influences: (1) the actual true score of the individual on this variable, and (2) error. In other words, Observed Score = True Score + Error. The less error that exists in a measurement, the closer observed scores will be to true scores; when there is no error in measurement, observed scores will be perfectly reliable and will consistently reflect true scores. Because unreliable measurements are filled with error, they are likely to provide false, inconsistent assessments of what one wants to measure.
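To make this decomposition concrete, here is a minimal simulation of the classical test theory model in Python with NumPy. All of the numbers (sample size, true-score mean and spread, error levels) are invented for illustration: two repeated measurements of the same true scores are generated, and as the error component grows, the correlation between the two measurements, one common index of reliability, shrinks toward zero.

    # A sketch of Observed Score = True Score + Error (illustrative values only)
    import numpy as np

    rng = np.random.default_rng(seed=42)
    n = 1000                                  # hypothetical sample size
    true_scores = rng.normal(100, 15, n)      # each person's unchanging true score

    for error_sd in (0, 5, 15, 30):           # increasing amounts of measurement error
        # two independent measurements of the same true scores
        obs1 = true_scores + rng.normal(0, error_sd, n)
        obs2 = true_scores + rng.normal(0, error_sd, n)
        r = np.corrcoef(obs1, obs2)[0, 1]     # consistency between the two measurements
        print(f"error SD = {error_sd:2d}  ->  reliability is roughly {r:.2f}")

With no error, observed scores reproduce true scores exactly and the correlation is 1.0; as the error standard deviation approaches and exceeds the true-score standard deviation, the same instrument yields increasingly inconsistent readings of an unchanged quantity, much like the damaged weighing scale above.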
How to Assess the Reliability of Measurement
There are three primary methods that researchers typically use to evaluate how reliable a measurement instrument is.
1. Internal Consistency Reliability. If an instrument consists of multiple self-report items or questions, each of which is intended to measure the same underlying concept, then researchers can assess the degree to which respondents provide consistent responses across the full set of items. In this case, researchers typically compute a coefficient of internal consistency (such as coefficient alpha, a split-half correlation, or an item-total correlation) that indicates the degree to which respondents provide comparable responses across all of the items in the instrument. (A computational sketch of coefficient alpha appears in the first example following this list.)
2. Inter-Rater (or Inter-Observer) Reliability. If an instrument is designed to be used by observers or trained experts who provide ratings of an underlying concept in a sample of individuals, then researchers can assess the degree to which different raters give consistent ratings of the underlying concept for the same individual. In this case, researchers typically compute a coefficient of inter-rater agreement (such as coefficient kappa, Kendall’s tau, a Spearman rank-order correlation, or an intra-class correlation) that indicates the degree to which raters tend to agree in their assessments of the same individual. (A sketch of coefficient kappa appears in the second example following this list.)
3. Test-Retest Reliability. If an instrument can be administered to the same sample of individuals at two separate points in time, then researchers can assess the degree to which respondents provide stable, consistent responses to the same instrument over time. In this case, researchers typically compute a coefficient of test-retest reliability (such as the Pearson correlation coefficient) that indicates the degree to which individual responses are stable over time. If one can assume that the underlying concept being measured is a relatively stable trait that should not change across the two time points, then this correlation provides an assessment of the degree to which scores on the measurement instrument are influenced by random error. (A sketch of this computation appears in the third example following this list.)
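The first sketch below computes coefficient alpha (Cronbach’s alpha) for a hypothetical five-item scale using NumPy; the response matrix is fabricated purely for illustration, and the computation uses the standard formula based on item variances and total-score variance.

    # Sketch 1: internal consistency via coefficient alpha (invented data)
    import numpy as np

    # rows = respondents, columns = items (e.g., ratings on a 1-5 scale)
    items = np.array([
        [4, 5, 4, 5, 4],
        [2, 2, 3, 2, 2],
        [5, 4, 5, 5, 4],
        [3, 3, 2, 3, 3],
        [1, 2, 1, 1, 2],
        [4, 4, 4, 3, 4],
    ])

    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each separate item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of respondents' total scores
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(f"coefficient alpha = {alpha:.2f}")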
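The second sketch computes coefficient kappa (Cohen’s kappa) by hand for two raters’ categorical judgments of the same individuals; the ratings are hypothetical. Kappa corrects the raw percentage of agreement for the agreement the two raters would reach by chance alone.

    # Sketch 2: inter-rater agreement via Cohen's kappa (hypothetical ratings)
    import numpy as np

    rater1 = np.array(["yes", "yes", "no", "yes", "no", "no", "yes", "no"])
    rater2 = np.array(["yes", "no",  "no", "yes", "no", "yes", "yes", "no"])

    categories = np.unique(np.concatenate([rater1, rater2]))

    p_obs = np.mean(rater1 == rater2)          # observed proportion of agreement
    # chance agreement implied by each rater's marginal proportions
    p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    kappa = (p_obs - p_chance) / (1 - p_chance)
    print(f"observed agreement = {p_obs:.2f}, kappa = {kappa:.2f}")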
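The third sketch treats test-retest reliability as the Pearson correlation between the same respondents’ scores on the instrument at two time points, again with invented numbers.

    # Sketch 3: test-retest reliability as a Pearson correlation (invented scores)
    import numpy as np

    time1 = np.array([23, 31, 18, 27, 35, 22, 29, 25])   # scores at Time 1
    time2 = np.array([25, 30, 17, 29, 33, 24, 28, 26])   # same people at Time 2

    r = np.corrcoef(time1, time2)[0, 1]
    print(f"test-retest reliability r = {r:.2f}")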
The Meaning of Validity
Validity concerns whether a particular inference that one wishes to make is reasonable or correct, and it has many different meanings depending on the particular research context in which it is being considered. Within the context of measurement, reliability concerns how consistently an instrument measures a target concept, whereas validity refers to whether a particular instrument actually measures the target concept it is intended to measure. An instrument might be reliable in providing consistent responses across multiple items (internal consistency reliability), multiple raters (inter-rater reliability), or multiple time-points (test-retest reliability), but it might nevertheless measure a different underlying concept than it is assumed to assess.
As an example, consider a new multi-item self-report instrument that has been specifically designed to measure individual differences in empathy (conceptually defined as the ability to understand and vicariously experience other people’s feelings). Although research might demonstrate that scores on this instrument show high levels of internal consistency and test-retest reliability, it might also be the case that scores on the test are only weakly correlated with other existing measures of empathy and are actually more strongly correlated with measures of compassion (conceptually defined as motivation to help others who are suffering). In this case, the new empathy measure would be said to lack validity as a measure of empathy.
It is important to note that validity is not an intrinsic characteristic of an instrument, but rather is a context-specific property of the use of an instrument for a particular purpose in a particular setting with a particular population. For example, a given instrument might be valid for one particular use (e.g., as a measure of compassion for adults), but not for another (e.g., as a measure of compassion for children).
Why Validity Matters
If an assessment tool does not actually measure what it was designed to measure, then it provides an inaccurate result that has little or nothing to do with what one wants to measure. Invalid instruments provide misleading information that is worthless, or even harmful, in drawing meaningful conclusions.
How to Assess the Validity of Measurement
Although researchers can adopt many different strategies to validate the use of an instrument for a particular purpose, the most commonly used approaches to assessing measurement validity focus on how thoroughly (content validity) or accurately (construct validity) an instrument measures its intended target concept, or how useful it is in predicting important outcomes that should be related to what it supposedly measures (criterion validity).
1. Content Validity. Content validity refers to the degree to which an instrument covers all relevant aspects of the conceptual or behavioral domain it is intended to measure. Although the content validity of an instrument is sometimes assessed crudely in terms of a researcher’s subjective impression of how thoroughly its items cover the topics they are supposed to cover, researchers often use multivariate statistical methods (e.g., principal components analysis, or exploratory or confirmatory factor analysis) to obtain a more precise estimate of the degree to which an instrument’s items represent the full breadth of conceptual coverage. These more rigorous quantitative assessments of content validity can be used to test whether people’s responses to different subsets of the instrument’s items actually reflect the various conceptual facets of the target concept that an instrument should include. (A sketch of one such analysis appears in the first example following this list.)
2. Construct Validity. Construct validity refers to the degree to which an instrument actually measures what it is supposed to measure. The two most commonly used strategies for assessing the construct validity of an instrument involve correlating scores on the particular instrument with scores on other established instruments that have been shown to measure either (1) the same underlying concept that the instrument being validated is intended to measure (convergent validity) or (2) underlying concepts that are different from the concept that the instrument being validated is intended to measure (discriminant validity). Convergent and discriminant validity are often evaluated in relation to one another, by including within the same study multiple measures, some of which are hypothesized to demonstrate stronger relationships than others to the instrument being validated. The most typical approach to construct validation has traditionally been to compute Pearson correlation coefficients among scores on a battery of instruments and then inspect the resulting patterns of association for evidence that scores on the instrument being validated correlate more strongly with other measures of the same concept (convergent validity) than with measures of different concepts (discriminant validity). (A sketch of this approach appears in the second example following this list.)
3. Criterion Validity. Criterion validity refers to the degree to which an instrument can be used to predict an important external outcome (or criterion) associated with the concept it is supposed to measure. The validational criterion of interest may be an outcome, event, or behavior measured either in the past (retrospective validity), present (concurrent validity), or future (prospective validity). The assessment of an instrument’s criterion validity typically involves statistical analyses that quantify the degree to which individuals’ scores on the instrument can be used to accurately predict or classify their scores or values on the criterion measure. If the criterion is measured using a continuous, equal interval scale, then the most commonly used statistical method for assessing criterion validity is linear regression analysis; if the criterion is measured using a dichotomous (or multiple-category) nominal scale, then the most commonly used statistical method is binary (or multinomial) logistic regression. (A sketch of the continuous-criterion case appears in the third example following this list.)
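As a rough sketch of the quantitative side of content validity, the following example (with wholly simulated data) builds a six-item instrument from two underlying conceptual facets and inspects the eigenvalues of the inter-item correlation matrix, the starting point of a principal components analysis. Finding two eigenvalues well above one is consistent with the items spanning two distinct facets of the target concept.

    # Sketch 1: eigenvalues of the inter-item correlation matrix (simulated data)
    import numpy as np

    rng = np.random.default_rng(seed=1)
    n = 300
    facet_a = rng.normal(0, 1, n)              # first conceptual facet
    facet_b = rng.normal(0, 1, n)              # second conceptual facet
    items = np.column_stack([
        facet_a + rng.normal(0, 0.5, n),       # items 1-3 reflect facet A
        facet_a + rng.normal(0, 0.5, n),
        facet_a + rng.normal(0, 0.5, n),
        facet_b + rng.normal(0, 0.5, n),       # items 4-6 reflect facet B
        facet_b + rng.normal(0, 0.5, n),
        facet_b + rng.normal(0, 0.5, n),
    ])

    corr = np.corrcoef(items, rowvar=False)        # 6 x 6 inter-item correlations
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted largest first
    print("eigenvalues:", np.round(eigenvalues, 2))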
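For construct validity, here is a minimal simulated example in the spirit of the empathy/compassion illustration given earlier: scores on the new scale should correlate more strongly with an established measure of the same concept (convergent validity) than with a measure of a related but distinct concept (discriminant validity). All scores are fabricated.

    # Sketch 2: convergent vs. discriminant correlations (fabricated scores)
    import numpy as np

    rng = np.random.default_rng(seed=7)
    n = 200
    empathy = rng.normal(0, 1, n)                       # true empathy
    compassion = 0.3 * empathy + rng.normal(0, 1, n)    # related but distinct concept

    new_scale = empathy + rng.normal(0, 0.6, n)         # instrument being validated
    established_empathy = empathy + rng.normal(0, 0.6, n)
    compassion_measure = compassion + rng.normal(0, 0.6, n)

    r_conv = np.corrcoef(new_scale, established_empathy)[0, 1]
    r_disc = np.corrcoef(new_scale, compassion_measure)[0, 1]
    print(f"convergent r = {r_conv:.2f} (should be relatively high)")
    print(f"discriminant r = {r_disc:.2f} (should be noticeably lower)")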
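Finally, for criterion validity with a continuous, equal-interval criterion, the example below runs a bare-bones linear regression (ordinary least squares via NumPy) of a fabricated outcome on instrument scores, reporting the slope and the proportion of criterion variance the instrument explains.

    # Sketch 3: criterion validity via linear regression (fabricated data)
    import numpy as np

    rng = np.random.default_rng(seed=3)
    n = 150
    scores = rng.normal(50, 10, n)                  # instrument scores
    outcome = 2.0 * scores + rng.normal(0, 15, n)   # criterion the scores should predict

    X = np.column_stack([np.ones(n), scores])       # intercept plus predictor
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    predicted = X @ coef
    ss_res = np.sum((outcome - predicted) ** 2)
    ss_tot = np.sum((outcome - outcome.mean()) ** 2)
    print(f"slope = {coef[1]:.2f}, R-squared = {1 - ss_res / ss_tot:.2f}")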
How to Use the HaPI Database to Select Appropriate Measurement Instruments for Research
The Health and Psychosocial Instruments (HaPI) database provides primary bibliographic records and summary information for more than 50,000 unique behavior measurement tools (contained in records called “primary sources”) that have been employed in the health and psychosocial research literature. Users can search the electronic database by specifying combinations of key terms to identify relevant measures to administer in their research, and they can evaluate the reliability and validity of these tools across later studies (called “secondary sources” in HaPI) that have used the same measures. By examining reliability and validity evidence across the primary and secondary sources for an instrument, users can make informed, evidence-based decisions about which instruments to use in their own research. Through this process, the HaPI database can help users find the most appropriate instrument, one whose content most closely matches their conceptual needs and for which sufficient evidence of reliability and validity exists.
In addition to helping researchers identify the most appropriate tools for assessing key variables in experiments and surveys, the HaPI database is also invaluable in helping users develop and validate new measurement instruments. In particular, developers of new instruments can systematically use the database to identify relevant measures of: (a) similar, related concepts with which a new instrument should correlate in evaluating convergent validity; (b) distinct, unrelated concepts with which the new instrument should be uncorrelated in evaluating discriminant validity; and (c) important external outcomes that the new instrument should predict in evaluating criterion validity. Thus, the HaPI database is not only a powerful and versatile tool for identifying reliable and valid measurement instruments in health and psychosocial research, but also a rich and unparalleled resource for the development and validation of new measurement instruments.