Utilizing Generalizability Theory to Investigate the Reliability of Grades Assigned to Undergraduate Research Papers
Abstract
Background: Educational researchers have long espoused the virtues of writing for developing students' cognitive skills. However, research on the reliability of grades assigned to written papers is highly contradictory: some researchers conclude that such grades are very reliable, whereas others suggest they are so unreliable that randomly assigned grades would have been almost as useful.
Purpose: The primary purpose of the study was to investigate the reliability of grades assigned to written reports. The secondary purpose was to illustrate the use of Generalizability Theory, specifically the fully crossed two-facet model, for computing interrater reliability coefficients.
Setting: The participants for this study were 29 undergraduate students enrolled in an introductory-level course on Political Behavior in Spring 2011 at a Midwestern university.
Intervention: Not applicable.
Research Design: Students were randomly assigned to one of nine groups. Two-facet, fully crossed G-study and D-study designs were used in which two raters graded four assignments for each of the nine student groups, yielding 72 evaluations in total. The universe of admissible observations was treated as random for both raters and assignments, whereas the universe of generalization was treated as mixed (random for the two raters but fixed for the four assignments).
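The abstract does not reproduce the authors' computations, but the design it describes maps onto a standard fully crossed p × r × a (groups × raters × assignments) random-effects ANOVA. The following is a minimal sketch, assuming a complete score array with one score per cell and the usual expected-mean-square equations; the function name g_study and the use of Python/NumPy are this illustration's assumptions, not the authors' software.

```python
import numpy as np

def g_study(X):
    """Estimate the seven variance components of a fully crossed
    p x r x a G-study design (groups x raters x assignments, one
    score per cell) via ANOVA expected mean squares."""
    n_p, n_r, n_a = X.shape
    grand = X.mean()

    # Marginal means for each main effect and two-way interaction
    m_p = X.mean(axis=(1, 2))
    m_r = X.mean(axis=(0, 2))
    m_a = X.mean(axis=(0, 1))
    m_pr = X.mean(axis=2)
    m_pa = X.mean(axis=1)
    m_ra = X.mean(axis=0)

    # Sums of squares; the three-way term (confounded with error) is the residual
    ss_p = n_r * n_a * ((m_p - grand) ** 2).sum()
    ss_r = n_p * n_a * ((m_r - grand) ** 2).sum()
    ss_a = n_p * n_r * ((m_a - grand) ** 2).sum()
    ss_pr = n_a * ((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2).sum()
    ss_pa = n_r * ((m_pa - m_p[:, None] - m_a[None, :] + grand) ** 2).sum()
    ss_ra = n_p * ((m_ra - m_r[:, None] - m_a[None, :] + grand) ** 2).sum()
    ss_pra = ((X - grand) ** 2).sum() - ss_p - ss_r - ss_a - ss_pr - ss_pa - ss_ra

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_a = ss_a / (n_a - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    ms_pa = ss_pa / ((n_p - 1) * (n_a - 1))
    ms_ra = ss_ra / ((n_r - 1) * (n_a - 1))
    ms_pra = ss_pra / ((n_p - 1) * (n_r - 1) * (n_a - 1))

    # Solve the expected-mean-square equations; negative estimates truncate to 0
    vc = {"pra": ms_pra}
    vc["pr"] = max((ms_pr - ms_pra) / n_a, 0.0)
    vc["pa"] = max((ms_pa - ms_pra) / n_r, 0.0)
    vc["ra"] = max((ms_ra - ms_pra) / n_p, 0.0)
    vc["p"] = max((ms_p - ms_pr - ms_pa + ms_pra) / (n_r * n_a), 0.0)
    vc["r"] = max((ms_r - ms_pr - ms_ra + ms_pra) / (n_p * n_a), 0.0)
    vc["a"] = max((ms_a - ms_pa - ms_ra + ms_pra) / (n_p * n_r), 0.0)
    return vc
```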
Data Collection and Analysis: Each group was assigned a semester-long project consisting of four written reports: an annotated bibliography, a survey development report, a sampling design, and an analysis and final report. Four grading rubrics were developed and used to evaluate the quality of each written report. Two-facet generalizability analyses were conducted to assess interrater reliability using software developed by one of the authors.
Findings: This study found a very high interrater reliability coefficient (0.929) despite using only two raters, neither of whom received training in how to use the four grading rubrics.
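Building on the sketch above, the generalizability coefficient for the mixed D study the abstract describes (raters random, assignments fixed) follows the standard formula: with the assignment facet fixed, the group × assignment component joins universe-score variance, and only rater-linked components contribute to relative error. This is an illustration of that formula, not the authors' implementation, and the random array below is stand-in data, since the abstract does not provide the actual scores.

```python
def g_coefficient(vc, n_r, n_a):
    """Generalizability coefficient for a D study with raters random
    and assignments fixed, from G-study variance components vc."""
    universe = vc["p"] + vc["pa"] / n_a                    # universe-score variance
    rel_error = vc["pr"] / n_r + vc["pra"] / (n_r * n_a)   # relative error variance
    return universe / (universe + rel_error)

# Illustrative call with the study's dimensions (9 groups, 2 raters,
# 4 assignments); X is random stand-in data, not the study's scores.
rng = np.random.default_rng(0)
X = rng.normal(loc=80, scale=5, size=(9, 2, 4))
print(round(g_coefficient(g_study(X), n_r=2, n_a=4), 3))
```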