On the Effect of Using Different Scoring Methods for Two Versions of a Test

Vol. 5, No. 1 (2015)

Abstract

This article presents a study of the effect of a different scoring method on the construct of the Czech Maturita English examination. In particular, it focuses on the consistency of decisions made on the basis of the test results, and on the implications for test fairness and for the validity of interpretations of those results. Questions concerning construct validity, decision consistency and fairness are discussed by comparing the results of two versions of the same test that differ only in scoring. The findings show that rescoring changes the weights of the skills measured by the test, and thus changes the construct; decision consistency between the two differently scored versions was low, and therefore the interpretations of their results cannot be the same. In this particular case, the students tested did not change their test-taking strategies, as they believed the two versions to be equivalent and fair and were not aware of the possible consequences of rescoring. On the basis of these results, the article tentatively concludes that introducing a different scoring method may increase unreliability and lead to unfair decisions and judgements about students' ability.


Keywords:
fairness; construct; scoring method; equivalence
