Utilizing Generalizability Theory to Investigate the Reliability of Grades Assigned to Undergraduate Research Papers


Mihaiela Ristei Gugiu
https://orcid.org/0000-0003-1554-3123
P. Cristian Gugiu
https://orcid.org/0000-0003-0022-287X
Robert Baldus

Abstract

Background: Educational researchers have long espoused the virtues of writing for developing students' cognitive skills. However, research on the reliability of the grades assigned to written papers is highly contradictory: some researchers conclude that assigned grades are very reliable, whereas others suggest that they are so unreliable that assigning grades at random would have been almost as useful.


Purpose: The primary purpose of the study was to investigate the reliability of grades assigned to written reports. The secondary purpose was to illustrate the use of Generalizability Theory, specifically the fully-crossed two-facet model, for computing interrater reliability coefficients.


Setting: The participants for this study were 29 undergraduate students enrolled in an introductory-level course on Political Behavior in Spring 2011 at a Midwest university.


Intervention: Not applicable.


Research Design: Students were randomly assigned to one of nine groups. Two-facet, fully crossed G-study and D-study designs were used in which two raters graded four assignments for nine student groups (72 evaluations in total). The universe of admissible observations was deemed to be random for both raters and assignments, whereas the universe of generalization was deemed to be mixed (random for the two raters but fixed for the four assignments).


Data Collection and Analysis: Groups completed a semester-long project consisting of an annotated bibliography, survey development, a sampling design, and an analysis and final report. Four grading rubrics were developed and used to evaluate the quality of each written report. Two-facet generalizability analyses were conducted to assess interrater reliability using software developed by one of the authors.


Findings: The study found a very high interrater reliability coefficient (0.929) even though the two raters received no training in the use of the four grading rubrics.
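The mixed-model G coefficient described above can be sketched with standard fully crossed three-way ANOVA variance-component estimators. The code below is an illustrative reconstruction, not the authors' software: it assumes a complete groups × raters × assignments score array, treats raters as random and assignments as fixed (matching the stated universe of generalization), and clamps negative variance-component estimates to zero. All function and variable names are hypothetical.

```python
import numpy as np

def g_study(X):
    """Variance components for a fully crossed p x r x a design
    (persons/groups x raters x assignments), one score per cell."""
    n_p, n_r, n_a = X.shape
    gm = X.mean()
    mp = X.mean(axis=(1, 2))   # group means
    mr = X.mean(axis=(0, 2))   # rater means
    ma = X.mean(axis=(0, 1))   # assignment means
    mpr = X.mean(axis=2)       # group x rater means
    mpa = X.mean(axis=1)       # group x assignment means
    mra = X.mean(axis=0)       # rater x assignment means

    # Mean squares from the usual three-way ANOVA decomposition
    ms = {}
    ms['p'] = n_r * n_a * np.sum((mp - gm) ** 2) / (n_p - 1)
    ms['r'] = n_p * n_a * np.sum((mr - gm) ** 2) / (n_r - 1)
    ms['a'] = n_p * n_r * np.sum((ma - gm) ** 2) / (n_a - 1)
    ms['pr'] = n_a * np.sum((mpr - mp[:, None] - mr[None, :] + gm) ** 2) \
        / ((n_p - 1) * (n_r - 1))
    ms['pa'] = n_r * np.sum((mpa - mp[:, None] - ma[None, :] + gm) ** 2) \
        / ((n_p - 1) * (n_a - 1))
    ms['ra'] = n_p * np.sum((mra - mr[:, None] - ma[None, :] + gm) ** 2) \
        / ((n_r - 1) * (n_a - 1))
    resid = (X - mpr[:, :, None] - mpa[:, None, :] - mra[None, :, :]
             + mp[:, None, None] + mr[None, :, None] + ma[None, None, :] - gm)
    ms['pra'] = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1) * (n_a - 1))

    # Expected-mean-square solutions, negatives clamped to zero
    var = {'pra': ms['pra']}
    var['pr'] = max((ms['pr'] - ms['pra']) / n_a, 0.0)
    var['pa'] = max((ms['pa'] - ms['pra']) / n_r, 0.0)
    var['ra'] = max((ms['ra'] - ms['pra']) / n_p, 0.0)
    var['p'] = max((ms['p'] - ms['pr'] - ms['pa'] + ms['pra']) / (n_r * n_a), 0.0)
    var['r'] = max((ms['r'] - ms['pr'] - ms['ra'] + ms['pra']) / (n_p * n_a), 0.0)
    var['a'] = max((ms['a'] - ms['pa'] - ms['ra'] + ms['pra']) / (n_p * n_r), 0.0)
    return var

def g_coefficient(var, n_r, n_a):
    """Relative G coefficient with raters random and assignments fixed:
    the group x assignment component joins the universe score, and only
    rater-linked components contribute to relative error."""
    universe = var['p'] + var['pa'] / n_a
    rel_err = var['pr'] / n_r + var['pra'] / (n_r * n_a)
    return universe / (universe + rel_err)
```

With the study's design (nine groups, two raters, four assignments), `g_study` would take a 9 × 2 × 4 array of rubric scores and `g_coefficient(var, n_r=2, n_a=4)` would yield the interrater reliability estimate; perfectly agreeing raters produce a coefficient of 1.0, and rater disagreement pulls it toward 0.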

Article Details

How to Cite
Gugiu, M. R., Gugiu, P. C., & Baldus, R. (2012). Utilizing Generalizability Theory to Investigate the Reliability of Grades Assigned to Undergraduate Research Papers. Journal of MultiDisciplinary Evaluation, 8(19), 26–40. https://doi.org/10.56645/jmde.v8i19.362
Section
Research on Evaluation Articles
