Statistical Procedures for Scale Development
This paper defines what a scale of measurement is, outlines the steps in creating a scale, and briefly describes three statistical procedures for developing a sound instrument: Cronbach's alpha, factor analysis, and cluster analysis. Recognize that not all studies lend themselves to using scales for measurement, nor is it possible in some studies to carefully follow all the steps outlined below in constructing a scale prior to using it. However, violations of these steps often mean that your confidence in your measurement scale, and in the conclusions you draw from it, may not be warranted.

A summed rating scale is a collection of rated statements which, when added together, measure some underlying property they hold in common. A single item has a very limited range; a 1-5 Likert format, for example, only allows respondents to vary across five points. A scale, as a collection of items, enables respondents to vary across a wider range. For example, a 10-item scale with five Likert options for each item would allow respondents to range from 10 to 50 in their total score. This variability is similar to watching Grandma's Marathon, in which the runners are bunched at the beginning and one cannot distinguish better and poorer runners (as with a single item). However, an hour into the race, the runners are well spread out and gradations in runner performance are much more obvious (as with a scale).
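
The arithmetic behind the 10-50 range can be sketched in a few lines. The function names below are illustrative only and do not come from any scoring package.

```python
def total_score_range(n_items, low, high):
    """Return the (minimum, maximum) possible summed score for a scale."""
    return n_items * low, n_items * high


def summed_score(ratings):
    """Add one respondent's item ratings into a single scale score."""
    return sum(ratings)


# A single 1-5 item versus a 10-item scale with the same response format.
print(total_score_range(1, 1, 5))
print(total_score_range(10, 1, 5))

# One hypothetical respondent answering a 10-item scale.
print(summed_score([4, 5, 3, 4, 2, 5, 4, 3, 4, 5]))
```

The single item can only separate respondents into five groups, while the 10-item scale can place them anywhere from 10 to 50.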

A scale is based on the idea that several items all tap into or reflect some underlying property, called a latent or hidden structure--some quality of a person that must be inferred rather than directly observed. The items in a scale are not identical, but are related in that they measure different facets of this underlying structure.

Step 1: Define the construct to be measured. A scale that we will use for demonstration purposes here is one reflecting Work Locus of Control (Spector, 1992). This has to do with the degree to which a person views reinforcements (e.g., rewards and punishments) in the workplace as a result of personal effort (internal locus of control), or whether outcomes happen as a result of luck, fate, or powerful others (external locus of control). All items in the scale should reflect these facets of control.

The definition of the construct sets the limits for the scale. The literature search should yield definitions as well as the theory that underlies the construct being studied. For example, locus of control research is based on the early work of Rotter (1966). Internals believe they can control work reinforcements, while externals believe they cannot do so.

Step 2: Design the Scale. Likert-type items (pron. Lick-ert) are most commonly used for summated rating scales. These may vary by agreement, evaluation, or frequency. Agreement asks the respondent to note the degree to which s/he agrees with a statement: strongly agree, agree, uncertain, disagree, or strongly disagree. Evaluation asks for an evaluative rating of each item: poor, below average, average, above average, excellent. Frequency asks how often something occurs: rarely, seldom, occasionally, frequently, usually.

Response choices are usually bipolar (an extreme at each end), symmetrical (same descriptors on each pole), and may have a neutral or middle point. They typically provide five to seven response options, since too much information is lost with fewer than four options, and not much is gained with more than ten. An even number of options requires that a person make a "forced choice" in that there is no middle ground--a useful format if reducing fence-sitting is desired. However, if a middle choice (a neutral or uncertain position) is important, it should be included.

Step 3: Generate an item pool. The item pool is an initial collection of draft items. They can be constructed based on literature review, theory, other tests and inventories, observation, and conversations with others. Items that are rated are referred to as "stems" since they present a phrase or statement which the respondent completes by selecting a rating option. Some important considerations in constructing items include:

• Each item should express only one idea. Avoid "double-barreled" questions that ask a respondent to react to two or more parts of a statement, for example: a good leader is trustworthy and direct.
• Use statements that are both positive and negative, state them favorably and unfavorably, or reverse the direction of the options. This will reduce response bias of respondents and require them to slow down and read each item rather than responding the same to items written in one direction.
• Write to your audience's reading level and avoid jargon and trite expressions. Fry (1977) has developed a scheme for rating sentence reading level, and suggests that an average sentence at the sixth grade level has about 15 words and 24 syllables.
• Avoid sensitive wording that may bias respondents.
• Avoid items that rely on negation words such as "not." For example, "I am not satisfied with my performance evaluation" may risk the reader not noticing the "not." Double negatives may also confuse the reader, as in: "I am not dissatisfied with my performance evaluation."

The following items were developed for the Work Locus of Control Scale (Spector, 1992).
1. A job is what you make of it.
2. On most jobs, people can pretty much accomplish whatever they set out to accomplish.
3. If you know what you want out of a job, you can find a job that gives it to you.
4. If employees are unhappy with a decision made by their boss, they should do something about it.
5. Getting the job you want is mostly a matter of luck.
6. Making money is primarily a matter of good fortune.
7. Most people are capable of doing their jobs well if they make the effort.
8. In order to get a really good job, you need to have family members or friends in high places.
9. Promotions are usually a matter of good fortune.
10. When it comes to landing a really good job, who you know is more important than what you know.
11. Promotions are given to employees who perform well on the job.
12. To make a lot of money you have to know the right people.
13. It takes a lot of luck to be an outstanding employee on most jobs.
14. People who perform their jobs well generally get rewarded.
15. Most employees have more influence on their supervisors than they think they do.
16. The main difference between people who make a lot of money and people who make little money is luck.
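
Because the pool mixes internally worded items (e.g., item 1) with externally worded items (e.g., item 5), ratings on one group are normally reversed before summing so that all items point in the same direction. The sketch below shows one way to do this. It assumes a 1-6 agreement format, and the set of reverse-keyed items is an assumption made for illustration from the wording of the items, not a scoring key given in this paper.

```python
# Assumption for illustration: internally worded items are reversed so that
# a higher total reflects a more external locus of control.
REVERSED_ITEMS = {1, 2, 3, 4, 7, 11, 14, 15}


def reverse_score(rating, low=1, high=6):
    """Flip a rating on a low..high scale (6 becomes 1, 5 becomes 2, ...)."""
    return low + high - rating


def score_respondent(ratings, reversed_items=REVERSED_ITEMS, low=1, high=6):
    """Sum one respondent's ratings, given as {item number: rating}."""
    total = 0
    for item, rating in ratings.items():
        if not low <= rating <= high:
            raise ValueError(f"item {item}: rating {rating} is out of range")
        if item in reversed_items:
            rating = reverse_score(rating, low, high)
        total += rating
    return total


# A hypothetical respondent who marks "agree very much" (6) on every item.
yea_sayer = {item: 6 for item in range(1, 17)}
print(score_respondent(yea_sayer))
```

A respondent who agrees with everything lands in the middle of the 16-96 range rather than at the top, which is the protection against response bias that mixed wording is meant to provide.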

Step 4: Page layout. Generally instructions are located at the top of the page with other information (such as human subjects review statements), followed by the rating format, and finally the items. The rating options should be clearly stated and easily located on the page for reader reference. An example might also be given if the respondents are unfamiliar with such a rating scale. If a 1-5 scale is used, it is easy to place the response choices in a column next to each item. If there is more than one page, be sure to repeat the rating options on each page. Unless you are using a computerized scoring sheet (which some naive respondents may be unfamiliar with, thereby increasing your error or nonreturn rate), consider a layout that will make it easy for you to quickly see item numbers and enter data for analysis.

The following instructions were designed for the WLCS:

The following statements concern your beliefs about jobs in general. They do not refer to your present job. Please use the 1-6 point scale to rate the extent to which you agree with each statement.
1 = Disagree very much   2 = Disagree moderately   3 = Disagree slightly
4 = Agree slightly       5 = Agree moderately      6 = Agree very much

Step 5: Administer the scale. Prior to administering the scale to a sample of respondents, it can be submitted to a few experts (perhaps two to five) for their opinion and feedback on items and format. Although this feedback is not statistically treated, it can help identify potential problems in wording, format, and conceptualization before a sample is used. Sometimes the Delphi method can be used, in which experts are given several rounds of revised items until they reach agreement on the item composition.

To establish the measurement properties of a scale with any confidence in what is being measured, a fairly large sample of respondents must be obtained. You may sometimes be able to obtain only a small sample (e.g., 20), but this is probably too small for three reasons: parametric statistics (those using the normal curve, such as means, standard deviations, and correlations) cannot be applied with confidence, it is difficult to find statistical significance even when it is there, and changes in a few extreme scores can dramatically affect measures of central tendency. For some statistical procedures, such as a factor analysis, a large person-to-item ratio is also required.

Generally, the larger the number of items and the larger the number of factors anticipated, the more respondents should be included in the sample. For a factor analysis of a scale with 20 items, 100 respondents might be too few, but for a 90-item analysis 400 might be acceptable (DeVellis, 1991). Tinsley & Tinsley (1987) suggest about 5-10 respondents per item, up to a maximum of 300 respondents. Comrey (1988) suggested that a sample size of about 200 is usually acceptable since most scales have no more than 40 items. A larger number of respondents yields more stable results and supports generalization of the conclusions.
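
As a rough planning aid, the 5-10 respondents per item guideline attributed above to Tinsley & Tinsley (1987) can be turned into a small calculation. The function name and defaults are illustrative, and the result is a rule of thumb rather than a power analysis.

```python
def suggested_sample_range(n_items, low_ratio=5, high_ratio=10, cap=300):
    """Return (lower, upper) respondent counts for a given number of items.

    Applies a respondents-per-item ratio and caps both ends at `cap`.
    """
    lower = min(n_items * low_ratio, cap)
    upper = min(n_items * high_ratio, cap)
    return lower, upper


# The 16-item WLCS pool, a 40-item scale, and a 90-item pool.
for n_items in (16, 40, 90):
    print(n_items, suggested_sample_range(n_items))
```

For the 16-item pool this gives 80 to 160 respondents; by 90 items both ends have reached the 300-respondent cap.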

Step 6: Check the data. While this step states the obvious, far too often the data are simply entered and statistics run--errors included. Inspect the data for extreme scores ("outliers") that suggest that some respondents may have had an unusual response set. You must decide whether to include or discard them. Likewise, you may find some protocols (answer sheets) with many missing items, improperly marked items (poorly filled in computer score sheets), personal identifiers, evidence of confusion, or scores outside the range of possible scores. All these require some decision whether to include or discard. Again, a large sample allows you to discard protocols with problems. Run SPSS Frequencies on each item and check the distribution of each item. Highly skewed items may be poorly worded and may not discriminate well. Item intercorrelations also show which items are most related to each other. If you need 10 items, the most highly intercorrelated items may be selected as the best candidates while the others are dropped.
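
For readers working outside SPSS, the same checks can be sketched with the Python standard library. The responses below are made up for illustration, None marks a missing answer, and the 1-6 valid range is assumed from the WLCS instructions.

```python
from collections import Counter
from math import sqrt

VALID = range(1, 7)  # assumed 1-6 agreement scale


def problem_rows(rows, valid=VALID):
    """Return indexes of protocols with missing or out-of-range answers."""
    return [i for i, row in enumerate(rows)
            if any(v is None or v not in valid for v in row)]


def pearson(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data: six respondents (rows) by three items (columns).
rows = [
    [5, 6, 2],
    [4, 5, 3],
    [2, 1, 5],
    [6, 6, None],  # missing answer
    [3, 9, 4],     # 9 is outside the 1-6 range
    [1, 2, 6],
]

bad = problem_rows(rows)
clean = [row for i, row in enumerate(rows) if i not in bad]
columns = list(zip(*clean))  # one tuple of ratings per item

print("protocols to review:", bad)
print("item 1 frequencies:", sorted(Counter(columns[0]).items()))
print("r(item 1, item 2):", round(pearson(columns[0], columns[1]), 2))
print("r(item 1, item 3):", round(pearson(columns[0], columns[2]), 2))
```

A strongly negative intercorrelation, as between items 1 and 3 in this made-up data, usually signals an item that still needs to be reverse-scored rather than a bad item.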

Step 7: Coefficient alpha. Cronbach's (1951) coefficient alpha is a measure of the internal consistency of a scale. A high alpha is desirable since it reflects that the items are homogeneous and thereby are measuring the same underlying property. Alpha normally ranges in value from 0 to 1 (negative values can occur when items are not positively correlated with each other). It is interpreted as the proportion of variance in scale scores that is attributable to the property the items share. Based on this, DeVellis (1991) recommends treating an alpha below .60 as unacceptable; .60-.65 as undesirable; .65-.70 as minimally acceptable; .70-.80 as respectable; and .80-.90 as very good. An alpha much above .90 is excellent, and you should consider shortening the scale. When developing a scale to compare groups on some property, an alpha of .85 is recommended. Scales that will be used for diagnostic, employment, academic placement, or other important purposes should have higher reliabilities, in the .90s.
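
For readers who want to see what the coefficient computes, the usual formula is alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch follows, using only the standard library and made-up ratings; it is not the SPSS implementation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents' item ratings.

    rows: one list per respondent, each holding that person's k ratings.
    """
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up ratings: five respondents by three items that move together.
ratings = [
    [5, 6, 5],
    [4, 5, 4],
    [2, 1, 2],
    [1, 2, 1],
    [3, 3, 4],
]
print(round(cronbach_alpha(ratings), 2))
```

When respondents who rate one item high also rate the others high, the variance of the total scores is large relative to the summed item variances, and alpha approaches 1.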

Alpha can also be used to identify poor items that should be dropped, as well as items that contribute little to scale homogeneity and can be dropped to make the scale shorter. As an estimate of reliability, alpha increases with the number of items in the scale, so a longer scale will have a narrower confidence interval than a shorter scale. A longer scale will give similar alpha values across administrations, but alpha may be somewhat lower when the scale is administered to a sample different from the one on which it was developed.
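
The effect of scale length can be illustrated with the standardized form of alpha, which depends only on the number of items k and the average inter-item correlation: alpha = k * r / (1 + (k - 1) * r). The sketch below assumes the average correlation stays fixed at .30 as items are added, which is a simplification chosen for illustration.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from item count k and mean inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Same average inter-item correlation, increasingly long scales.
for k in (5, 10, 20):
    print(k, round(standardized_alpha(k, 0.30), 2))
```

With an average correlation of .30, alpha rises from about .68 with 5 items to about .81 with 10 and about .90 with 20, so added items help only as long as they correlate with the rest about as well as the existing ones do.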

Alpha is a conservative measure, and sets an upper limit on reliability (Nunnally, 1967). If it is too low, either the test is too short, or the items share little in common. If the latter is the case, there is nothing to be gained with other tests of reliability since they will all be lower than alpha, and you should generate a new item set. Coefficient alpha can be run in SPSS by requesting the Statistics/Reliability/Alpha option.

In Table 1, the effects of selectively removing items are shown. Not all 16 items from the scale are shown.

Table 1. Alpha if item removed

Step 1. Alpha = .72
Item:                     1    2    3    4    5    6    7    8    9   10
Alpha if item removed:  .68  .70  .71  .74  .75  .80  .71  .79  .68  .70

Step 2. Items 4, 5, 6, 8 removed. Alpha = .83
Item:                     1    2    3    7    9   10
Alpha if item removed:  .79  .81  .84  .82  .78  .81

Step 3. Item 3 removed. Alpha = .84
Item:                     1    2    7    9   10
Alpha if item removed:  .79  .80  .80  .79  .80

The result of removing items with lower correlations is to increase the homogeneity of items in the scale, enhance reliability, and increase confidence in the stability of the measure. Reducing the scale by deleting too many items can also lower alpha, and the alpha level should be monitored while tinkering with scale length and item composition.
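
The "alpha if item removed" statistic simply recomputes alpha with each item left out in turn. A minimal sketch with made-up ratings follows; the fourth item is deliberately unrelated to the other three, and none of the numbers reproduce Table 1.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents' item ratings."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_removed(rows):
    """Return {item number: alpha of the scale without that item}."""
    k = len(rows[0])
    result = {}
    for drop in range(k):
        reduced = [[v for j, v in enumerate(row) if j != drop] for row in rows]
        result[drop + 1] = cronbach_alpha(reduced)
    return result


# Made-up ratings: items 1-3 move together, item 4 does not.
ratings = [
    [5, 6, 5, 2],
    [4, 5, 4, 6],
    [2, 1, 2, 3],
    [1, 2, 1, 5],
    [3, 3, 4, 1],
]

print("alpha, all items:", round(cronbach_alpha(ratings), 2))
for item, alpha in alpha_if_item_removed(ratings).items():
    print("without item", item, round(alpha, 2))
```

The item whose removal raises alpha the most (item 4 here) is the first candidate for deletion; removing any of the other three lowers alpha instead.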

Step 8: Factor Analysis. Factor analysis is considered an advanced statistical procedure and deserves more coverage than is attempted here--only its relationship to scale development will be presented. Factor analysis can be used to validate unidimensional and multidimensional scales. For unidimensional scales like the WLCS, it can help identify whether there are subdimensions operating within the group of items, and verify whether the selected items empirically form the scale as intended.

The basic idea underlying factor analysis is that of reducing a number of items to a smaller number of explanatory units called factors. For example, a large scale of 60 items may actually be composed of five factors or separate constructs that explain the larger pool of items. Factor analysis derives factors by examining the pattern of correlations among the items, and groups each item together with those that are similar. Items that highly intercorrelate are presumed to measure the same underlying property. If all items are related, there should be a single factor. If more than one factor is identified, you can drop the items that form the additional factor or expand the number of related items to form more subscales.

SPSS can be used to conduct a factor analysis easily (though its interpretation is more complicated and difficult) by selecting Statistics/Data Reduction/Factor Analysis. The factor analysis can yield several kinds of important information. One is the "scree" test, named for the image of rocks piling up along the edge of a mountain. This test is used together with the eigenvalue rule, which states that only factors that explain more variance than the average amount explained by single items should be retained (again, it is an attempt to reduce items to a more basic and underlying factor). Figure 1 displays a scree plot of the factors in our WLCS.

Figure 1. Scree Plot of WLCS

The number of factors contained in the scale can be identified by noticing where the "elbow" or bend in the scree plot occurs. In the example above, factors 3-8 form a fairly straight line and lie under an eigenvalue of 1. Factors 1 and 2 break away from the rest and account for higher eigenvalues. Therefore, we would conclude that there are two factors operating in our item pool. To test that interpretation, we would then run a factor analysis.
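
The eigenvalue rule can be sketched as a simple filter. When the analysis is run on a correlation matrix, the eigenvalues sum to the number of items, so the average eigenvalue is 1 and a factor above 1 explains more than a single item's share of variance. The eigenvalues below are made up for an eight-item example and are not read from Figure 1.

```python
def retained_factors(eigenvalues, threshold=1.0):
    """Return the numbers of the factors whose eigenvalue exceeds threshold."""
    return [i + 1 for i, value in enumerate(eigenvalues) if value > threshold]


def variance_explained(eigenvalues):
    """Return each factor's share of the total variance."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# Hypothetical eigenvalues for eight items (they sum to 8).
eigenvalues = [3.1, 2.2, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print("factors retained:", retained_factors(eigenvalues))
for number, share in enumerate(variance_explained(eigenvalues), start=1):
    print("factor", number, round(share, 2))
```

With these made-up values the first two factors are retained and together account for roughly two-thirds of the variance, which mirrors the two-factor reading of the scree plot described above.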

A factor analysis produces a table that displays each item and its factor loading (correlation) with the underlying factors. The specific type of factor analysis and its settings will not be discussed here, but are covered in the help guide in SPSS. Table 2 shows typical factor analysis results for our items.

Table 2. Factor loadings (not all items shown)

Item No.   Factor 1   Factor 2
1            .86        .03
2            .82        .14
3            .77        .24
4            .65        .26
5            .34        .84
6            .22        .73
7            .21        .70
8            .13        .69

From this table we might conclude that our scale actually has two subscales within it, both of which measure WLC, but do so from different perspectives. We would then have the choice of keeping a single scale or having a total score as well as separate and perhaps more diagnostic subscales.
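
Reading subscales off a loading table can be sketched as assigning each item to the factor on which it loads most strongly. The loadings below are the ones reported in Table 2; the .40 cutoff for a meaningful loading is a common rule of thumb assumed here, not a value taken from this paper.

```python
# Factor loadings from Table 2: item number -> (Factor 1, Factor 2).
LOADINGS = {
    1: (0.86, 0.03),
    2: (0.82, 0.14),
    3: (0.77, 0.24),
    4: (0.65, 0.26),
    5: (0.34, 0.84),
    6: (0.22, 0.73),
    7: (0.21, 0.70),
    8: (0.13, 0.69),
}


def assign_items(loadings, cutoff=0.40):
    """Group items by the factor with the largest absolute loading.

    Items whose best loading falls below the cutoff are left unassigned.
    """
    subscales = {}
    for item, values in loadings.items():
        best = max(range(len(values)), key=lambda f: abs(values[f]))
        if abs(values[best]) >= cutoff:
            subscales.setdefault(best + 1, []).append(item)
    return subscales


print(assign_items(LOADINGS))
```

Items 1-4 fall on Factor 1 and items 5-8 on Factor 2, which is the two-subscale reading described above.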

Conclusion. Scale design and construction allow the researcher to tailor a measuring instrument to a specific property under investigation. Items are based on literature, observation, and other sources, and can be tested by expert opinion before final testing with respondents. The stability of the scale (reliability) and final item composition can be determined by Cronbach's coefficient alpha, and the presence of subscales established by factor analysis. SPSS provides a convenient interface for conducting these statistical analyses.

References

Comrey, A. L. (1988). Factor analytic methods of scale development in personality and clinical psychology. Journal of Consulting and Clinical Psychology, 56, 754-761.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage.

Fry, E. (1977). Fry's readability graph: Clarifications, validity, and extension to level 17. Journal of Reading, 21, 249.

Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.

Rotter, J. B. (1966). Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs, 80(1), Whole no. 609.

Spector, P. E. (1992). Summed rating scale construction. Newbury Park, CA: Sage.

Tinsley, H. E. A., & Tinsley, D. J. (1987). Uses of factor analysis in counseling psychology research. Journal of Counseling Psychology, 34, 414-424.
