Frontiers in Development Policy: A Primer on Emerging Issues
One commonly encountered pitfall is a lack of satisfactory global model fit in a confirmatory factor analysis conducted on a new sample after a satisfactory initial factor analysis on a previous sample. Lack of satisfactory fit offers the opportunity to identify additional underperforming items for removal. Modification indices, produced by Mplus and other structural equation modeling (SEM) programs, can also help identify items that need to be modified. Sometimes a higher-order factor structure, in which correlations among the original factors are explained by one or more higher-order factors, is needed.
A good example of best practice is seen in the work of Pushpanathan et al. They tested this using three different models: a unidimensional model (1-factor CFA); a 3-factor model (3-factor CFA) consisting of sub-scales measuring insomnia, motor symptoms and obstructive sleep apnea, and REM sleep behavior disorder; and a confirmatory bifactor model having a general factor and the same three sub-scales combined. The results of this study suggested that only the bifactor model, with a general factor and the three sub-scales combined, achieved satisfactory model fit.
Based on these results, the authors cautioned against the use of unidimensional total scale scores as a cardinal indicator of sleep in Parkinson's disease, but encouraged the examination of the scale's multidimensional subscales. Finalized items from the tests of dimensionality can be used to create scale scores for substantive analysis, including tests of reliability and validity. Scale scores can be calculated using unweighted or weighted procedures.
The unweighted approach involves summing standardized item scores or raw item scores, or computing the mean of raw item scores. In the weighted approach, for instance when using confirmatory factor analysis, structural equation models, or exploratory factor analysis, each factor produced reveals a statistically independent source of variation among a set of items. The contribution of each individual item to this factor is considered a weight, with the factor loading value representing the weight. The scores associated with each factor in a model then represent a composite scale score based on a weighted sum of the individual items using factor loadings. In general, it does not make much difference to the performance of the scale whether scale scores are computed from unweighted or weighted items.
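As a rough illustration of the two scoring approaches, the sketch below computes unweighted scores (sum and mean of raw items) and a loading-weighted sum for a few respondents. The response matrix and the factor loadings are hypothetical values invented for the example; in practice the loadings would come from a fitted factor model, and statistical packages may compute factor scores with more elaborate methods than this simple weighted sum.

```python
from statistics import mean


def unweighted_scores(responses):
    """Return a (sum, mean) pair of raw item scores for each respondent."""
    return [(sum(row), mean(row)) for row in responses]


def weighted_scores(responses, loadings):
    """Return a weighted sum per respondent, using factor loadings as weights."""
    return [sum(w * x for w, x in zip(loadings, row)) for row in responses]


# Hypothetical data: 3 respondents answering 4 items on a 1-5 scale.
responses = [
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 4, 4, 5],
]
# Hypothetical loadings, standing in for estimates from a one-factor model.
loadings = [0.8, 0.7, 0.6, 0.9]

sums_and_means = unweighted_scores(responses)
weighted = weighted_scores(responses, loadings)

for (total, avg), w in zip(sums_and_means, weighted):
    print(f"sum={total}  mean={avg:.2f}  weighted={w:.2f}")
```

In this toy data set the three scoring methods rank the respondents identically, which is in line with the remark that the choice often makes little practical difference.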
Reliability is the degree of consistency exhibited when a measurement is repeated under identical conditions. A number of standard statistics have been developed to assess the reliability of a scale, including Cronbach's alpha; ordinal alpha, which is specific to binary and ordinal scale items; test-retest reliability (coefficient of stability) (1, 2); McDonald's omega; Raykov's rho (2); Revelle's beta; split-half estimates; the Spearman-Brown formula; the alternate form method (coefficient of equivalence); and inter-observer reliability (1, 2).
Of these statistics, Cronbach's alpha and test-retest reliability are predominantly used to assess the reliability of scales (2). Cronbach's alpha assesses the internal consistency of the scale items, i. An alpha coefficient of 0. Cronbach's alpha has been the most common and seems to have received general approval; however, reliability statistics such as Raykov's rho, ordinal alpha, and Revelle's beta, which some argue improve on Cronbach's alpha, are beginning to gain acceptance.
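For readers who want to see how Cronbach's alpha is obtained, here is a minimal sketch using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and both data sets are hypothetical and exist only for illustration; the sketch uses sample variances and assumes complete data with no missing responses.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's total (summed) score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical case where every item orders respondents identically.
alpha_perfect = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])

# Hypothetical 4-item scale answered by 4 respondents.
alpha_example = cronbach_alpha([
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
])

print(round(alpha_perfect, 3), round(alpha_example, 3))
```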
An additional approach to testing reliability is test-retest reliability. Test-retest reliability, also known as the coefficient of stability, is used to assess the degree to which participants' performance is repeatable, i. Researchers vary in how they assess test-retest reliability. While some prefer to use the intraclass correlation coefficient, others use the Pearson product-moment correlation. In both cases, the higher the correlation, the higher the test-retest reliability, with values close to zero indicating low reliability.
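The Pearson option for test-retest reliability can be sketched as follows. The two score lists are hypothetical scale scores for the same five participants at two administrations, and `pearson_r` is a helper written for this example rather than a library function. An intraclass correlation would need a different computation that is not shown here.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scale scores for the same participants at two time points.
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 13, 14, 19, 21]

test_retest_r = pearson_r(time_1, time_2)
print(f"test-retest r = {test_retest_r:.3f}")
```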
In addition, study conditions could change values on the construct being measured over time (as in an intervention study, for example), which could lower the test-retest reliability. The work of Johnson et al.
As part of testing for reliability, the authors estimated internal consistency reliability for the ASES and its subscales using Raykov's rho (which produces a coefficient similar to alpha but with fewer assumptions and with confidence intervals); they then tested the temporal consistency of the ASES' factor structure. This was followed by a test-retest reliability assessment among the latent factors. The different approaches provided support for the reliability of the ASES scale. Other approaches found to be useful in supporting scale reliability include split-half estimates, the Spearman-Brown formula, the alternate form method (coefficient of equivalence), and inter-observer reliability (1, 2).
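The split-half estimate and the Spearman-Brown formula mentioned above can be combined in a short sketch. It assumes an odd/even split of the items (one of several possible splits) and applies the usual two-half correction, r_full = 2r / (1 + r). The data and function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full-length scale."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """Correlate odd-position and even-position item totals, then correct."""
    half_a = [sum(row[0::2]) for row in responses]  # items 1, 3, 5, ...
    half_b = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(half_a, half_b)
    return r_half, spearman_brown(r_half)


# Hypothetical 4-item scale answered by 4 respondents.
responses = [
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
]
r_half, r_full = split_half_reliability(responses)
print(f"half-test r = {r_half:.3f}, corrected = {r_full:.3f}")
```

The corrected value is higher than the half-test correlation because each half contains only half of the items, and shorter scales tend to be less reliable.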
Although it is discussed at length here in Step 9, validation is an ongoing process that starts with the identification and definition of the domain of study (Step 1) and continues to its generalizability with other constructs (Step 9). The validity of an instrument can be examined in numerous ways. The most common tests of validity are content validity (described in Step 2), which can be assessed before the instrument is administered to the target population, and criterion validity (predictive and concurrent) and construct validity (convergent, discriminant, differentiation by known groups, correlations), which are assessed after survey administration.
There are two forms of criterion validity: predictive criterion validity and concurrent criterion validity. For predictive criterion validity, the scale should be able to predict a behavior in the future.
An example is the ability of an exclusive breastfeeding social support scale to predict exclusive breastfeeding. Here, the mother's willingness to exclusively breastfeed occurs after social support has been given, i. Predictive validity can be estimated by examining the association between the scale scores and the criterion in question. Concurrent criterion validity is the extent to which test scores have a strong relationship with a criterion (gold standard) measurement made at the time of test administration or shortly afterward (2).
This can be estimated using the Pearson product-moment correlation or latent variable modeling. The work of Greca and Stone on the psychometric evaluation of the revised version of a social anxiety scale for children (SASC-R) provides a good example of the evaluation of concurrent validity. In this study, the authors collected data on an earlier validated version of the SASC scale consisting of 10 items, as well as on the revised version, SASC-R, which added 16 items. With a validity coefficient of 0. A limitation of concurrent validity is that this strategy does not work with small sample sizes because of their large sampling errors.
This reason may account for its omission in most validation studies. Four indicators of construct validity are relevant to scale development: convergent validity, discriminant validity, differentiation by known groups, and correlation analysis. Convergent validity is the extent to which a construct measured in different ways yields similar results.
This is best estimated through the multi-trait multi-method matrix (2), although in some cases researchers have used either latent variable modeling or the Pearson product-moment correlation based on Fisher's Z transformation. Evidence of convergent validity of a construct can be provided by the extent to which the newly developed scale correlates highly with other variables designed to measure the same construct (2). It can be invalidated by correlations that are too low or weak with other tests intended to measure the same construct.
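One common use of Fisher's Z transformation is to put a confidence interval around a Pearson correlation, which helps when judging whether a convergent validity coefficient is convincingly high. The sketch below assumes the usual large-sample approximation, z = atanh(r) with standard error 1/sqrt(n - 3). The correlation of 0.5 and the sample size of 103 are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_z_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson correlation."""
    if n <= 3:
        raise ValueError("the approximation needs more than three observations")
    z = atanh(r)                      # Fisher's Z transformation of r
    se = 1 / sqrt(n - 3)              # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Transform the interval limits back to the correlation scale.
    return tanh(z - crit * se), tanh(z + crit * se)


# Hypothetical correlation between a new scale and an established measure.
lower, upper = fisher_z_interval(r=0.5, n=103)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```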
Discriminant validity is the extent to which a measure is novel and not simply a reflection of some other construct. This is best estimated through the multi-trait multi-method matrix (2). Discriminant validity is indicated by predictably low or weak correlations between the measure of interest and other measures that are supposedly not measuring the same variable or concept. The newly developed construct can be invalidated by correlations that are too high with other tests intended to differ in their measurements. This approach is critical in differentiating the newly developed construct from rival alternatives.
Differentiation or comparison between known groups examines the distribution of a newly developed scale score over known binary items. This is premised on previous theoretical and empirical knowledge of the performance of the binary groups.
An example of best practice is seen in the work of Boateng et al. In this study, we compared the mean household water insecurity scores over households with or without E. Consistent with what we knew from the extant literature, we found households with E. This suggested our scale could discriminate between particular known groups.
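A known-groups comparison of mean scale scores can be sketched with Welch's t statistic, which does not assume equal variances in the two groups. This is one reasonable choice for illustration and not necessarily the test used in the study described above. The group scores are hypothetical, and a p-value would additionally require the t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return the mean difference, Welch's t statistic, and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error contributed by each group.
    va = variance(group_a) / na
    vb = variance(group_b) / nb
    diff = mean(group_a) - mean(group_b)
    t = diff / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return diff, t, df


# Hypothetical scale scores for two groups expected to differ.
exposed = [10, 12, 14, 16, 18]
unexposed = [4, 6, 8, 10, 12]

diff, t_stat, df = welch_t(exposed, unexposed)
print(f"mean difference = {diff}, t = {t_stat:.2f}, df = {df:.1f}")
```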
Although correlational analysis is frequently used by scholars, bivariate regression analysis is preferred to correlational analysis for quantifying validity. Regression analysis between scale scores and an indicator of the domain examined has a number of important advantages over correlational analysis. First, regression analysis quantifies the association in meaningful units, facilitating judgment of validity.
Second, regression analysis avoids confounding validity with the underlying variation in the sample and therefore the results from one sample are more applicable to other samples in which the underlying variation may differ. Third, regression analysis is preferred because the regression model can be used to examine discriminant validity by adding potential alternative measures.
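The second advantage can be made concrete with a small constructed example. Two hypothetical samples are built from the same underlying relationship (the criterion rises by 2 units per scale point, plus the same size of noise), but one sample covers a wide range of scale scores and the other a narrow range. The regression slope is the same in both, while the correlation drops in the narrow sample. All data and names are invented for the illustration.

```python
from math import sqrt
from statistics import mean


def slope_and_r(x, y):
    """Return the least-squares slope of y on x and the Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / sqrt(sxx * syy)


# Wide sample: scale scores from 0 to 8, noise of +1 or -1 on the criterion.
x_wide = [0, 2, 4, 6, 8] * 2
noise_wide = [1] * 5 + [-1] * 5
y_wide = [2 * x + e for x, e in zip(x_wide, noise_wide)]

# Narrow sample: scale scores from 3 to 5, same relationship and noise size.
x_narrow = [3, 4, 5] * 2
noise_narrow = [1] * 3 + [-1] * 3
y_narrow = [2 * x + e for x, e in zip(x_narrow, noise_narrow)]

slope_wide, r_wide = slope_and_r(x_wide, y_wide)
slope_narrow, r_narrow = slope_and_r(x_narrow, y_narrow)

print(f"wide:   slope = {slope_wide:.2f}, r = {r_wide:.3f}")
print(f"narrow: slope = {slope_narrow:.2f}, r = {r_narrow:.3f}")
```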
In addition to regression analysis, alternative techniques such as analysis of standard deviations of the differences between scores and the examination of intraclass correlation coefficients (ICC) have been recommended as viable options. Taken together, these methods make it possible to assess the validity of an adapted or a newly developed scale.
In addition to predictive validity, existing studies in fields such as health, social, and behavioral sciences have shown that scale validity is supported if at least two of the different forms of construct validity discussed in this section have been examined. Further information about establishing validity and constructing indicators from scales can be found in Frongillo et al.
In sum, we have sought to give an overview of the key steps in scale development and validation (Figure 1) as well as to help the reader understand how one might approach each step (Table 1). We have also given a basic introduction to the conceptual and methodological underpinnings of each step.
Because scale development is so complicated, this should be considered a primer, i. The technical literature and examples of rigorous scale development mentioned throughout will be important for readers to pursue. There are a number of matters not addressed here, including how to interpret scale output, the designation of cut-offs, when indices, rather than scales, are more appropriate, and principles for re-testing scales in new populations. Also, this review leans more toward the classical test theory approach to scale development; a comprehensive review of item response theory (IRT) modeling would be complementary.
We hope this review helps to ease readers into the literature, but space precludes consideration of all these topics. The necessity of the nine steps that we have outlined here (Table 1, Figure 1) will vary from study to study.
While studies focusing on developing scales de novo may use all nine steps, others, e. Resource constraints, including time, money, and participant attention and patience, are very real and must be acknowledged as additional limits to rigorous scale development. We cannot state which steps are the most important; difficult decisions about which steps to approach less rigorously can only be made by each scale developer, based on the purpose of the research, the proposed end-users of the scale, and the resources available.
It is our hope, however, that by outlining the general shape of the phases and steps in scale development, researchers will be able to purposively choose the steps that they will include, rather than omitting a step out of lack of knowledge. Well-designed scales are the foundation of much of our understanding of a range of phenomena, but ensuring that we accurately quantify what we purport to measure is not a simple matter.
By making scale development more approachable and transparent, we hope to facilitate the advancement of our understanding of a range of health, social, and behavioral outcomes. GB and SY developed the first draft of the scale development and validation manuscript.
All authors participated in the editing and critical revision of the manuscript and approved the final version of the manuscript for publication. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to acknowledge the importance of the works of several scholars of scale development and validation used in developing this primer, particularly Robert DeVellis, Tenko Raykov, George Marcoulides, David Streiner, and Betsy McCoach. We would also like to acknowledge the help of Josh Miller of Northwestern University for assisting with the design of Figure 1 and the development of Table 1, and we thank Zeina Jamuladdine for helpful comments on tests of unidimensionality. DeVellis RF.