Quality Assurance on Internal Attributes of a Good Language Assessment Device: Reliability, Validity, and Classical Item Analysis
By: Agus Eko Cahyono and Jumariati
A language assessment device is said to be good provided that it meets these attributes: reliability and validity. A test is reliable if its results are consistent and dependable across two or more administrations. Heaton (1988) and Brown and Abeywickrama (2010) note that the reliability of a test is affected by factors related to the student, the scoring, the test administration, and the test itself.

Student-related reliability concerns the condition of the student taking the test. Fatigue, anxiety, low motivation, and other physical and psychological factors can prevent a student from demonstrating his or her true ability on the test. Teachers therefore need to consider the condition of their students before
administering a test.

Rater reliability concerns the consistency of the scores given by one rater (intra-rater reliability) or by two or more raters (inter-rater reliability). Consistency is especially difficult to achieve in subjective tests such as essay writing, where rater fatigue may reduce reliability. One suggestion is for the rater to read through all the essays once and then cycle back through them to arrive at a sound judgment. When two or more raters are involved and their scores differ considerably, the scoring criteria probably need to be revised. Hughes (2003) suggests that raters be trained so that they interpret the scoring criteria in the same way.
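As an illustration, inter-rater consistency is often estimated by correlating the scores that two raters assign to the same set of scripts. The minimal Python sketch below uses entirely hypothetical scores and the standard library only; a coefficient near 1.0 suggests the raters are applying the criteria consistently.

import statistics

# A minimal sketch with hypothetical data: two raters mark the same
# seven essays, and we correlate their scores (Pearson r, Python 3.10+).
rater_a = [78, 85, 62, 90, 74, 68, 81]
rater_b = [75, 88, 60, 92, 70, 71, 79]

r = statistics.correlation(rater_a, rater_b)
print(f"Inter-rater correlation: {r:.2f}")  # near 1.0 suggests consistent scoring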
Test-administration reliability is determined by the conditions under which the test is given: the room, the seating arrangement, the temperature, and the quality of the test sheets or audio. Teachers should therefore carefully prepare a suitable room for the test and provide clear audio or legible copies of the test sheets.
Finally, test reliability relates directly to the test itself: clear instructions, unambiguous items, and a number of items that is balanced against the time allotted. Attending to these factors helps increase the reliability of a test.
Validity is equally important in determining the quality of a test. It is the extent to which a test measures what it is supposed to measure. Brown and Abeywickrama (2010:30) state that a valid test measures exactly what it proposes to measure, relies on the test-taker's performance, and offers meaningful information about the test-taker's ability.
There are several types of validity evidence. First, content validity concerns the content of the test, which should cover the materials taught or the instructional objectives. It also calls for direct testing, in which students perform the target skill directly; for instance, a writing test that asks students to produce a piece of writing.
Second, construct validity requires that a test reflect the concepts or theories underlying the ability students must perform. For example, a test of speaking ability requires students to use English orally with attention to fluency, intonation, and pronunciation, since the construct of speaking performance comprises those elements.
Third, face validity is the extent to which a test looks appropriate for measuring students' knowledge or abilities, based on the subjective judgment of the students as test-takers.
Fourth, empirical validity is achieved by comparing the results of the test with the results of other measures. Comparing them with the results of another existing, already-validated test yields what is known as concurrent validity. Another form of empirical validity evidence is obtained by comparing the test results with teachers' ratings given later; this is called predictive validity.
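Concurrent validity lends itself to the same kind of computation as the inter-rater sketch above: correlate scores on the new test with scores on an established measure. All figures below are hypothetical.

import statistics

# Hypothetical sketch of concurrent validity: correlate scores on the new
# test with scores on an established, already-validated test (Python 3.10+).
new_test = [70, 82, 65, 90, 58, 77]
established_test = [68, 85, 60, 93, 55, 80]

r = statistics.correlation(new_test, established_test)
print(f"Concurrent validity coefficient: {r:.2f}")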
The last type is consequential validity: the degree to which all the consequences produced by a test are considered, such as its accuracy in measuring the intended criteria, its effect on test-takers' preparation, and the interpretation and use of its results.
A classical item analysis is also carried out to ensure a test's quality, covering item difficulty and item discrimination; in addition, a multiple-choice test needs an analysis of its distractors.

Item difficulty analysis aims to find out how easy or difficult each item is for the test-takers. It is done by counting how many students answer each item correctly: the number of students who answer correctly is divided by the total number of students taking the test, yielding the item's difficulty index.
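As a minimal sketch with hypothetical data, the index for a single item can be computed as follows; values near 1 indicate an easy item and values near 0 a difficult one.

# Item difficulty for one item: 1 = correct, 0 = incorrect (hypothetical data).
answers = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

difficulty = sum(answers) / len(answers)
print(f"Item difficulty (p): {difficulty:.2f}")  # 0.70: a fairly easy item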
Meanwhile, item discrimination analysis aims to distinguish students who are able to answer an item correctly from those who are not. An item that every student answers correctly needs revision, because a good item discriminates high-achieving students from low-achieving ones.
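One common classical procedure (an assumption here, not prescribed by the sources above) is the upper/lower group method: rank students by total score, compare the item's success rate in the top and bottom groups, and take the difference as the discrimination index.

# Item discrimination via the upper/lower group method; the 27% split is
# one common convention. All data are hypothetical.
# Each row: (total test score, 1/0 response on the item being analysed).
students = [
    (95, 1), (90, 1), (88, 1), (85, 1), (80, 1),
    (60, 1), (55, 0), (50, 0), (45, 0), (40, 0),
]

students.sort(key=lambda s: s[0], reverse=True)   # rank by total score
k = max(1, round(len(students) * 0.27))           # size of each group

upper = [item for _, item in students[:k]]        # high achievers
lower = [item for _, item in students[-k:]]       # low achievers

# D = p_upper - p_lower; values near 0 (or negative) flag items to revise.
discrimination = sum(upper) / k - sum(lower) / k
print(f"Discrimination index (D): {discrimination:.2f}")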
Distractor analysis deals with how closely each distractor resembles the correct answer, that is, how plausibly it can draw students away from the correct alternative. If nobody chooses a distractor, its distracting power is low.
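A simple check, again with hypothetical responses, is to tally how often each option of an item is chosen; a distractor that attracts no test-takers is a candidate for replacement.

from collections import Counter

# Distractor analysis for one multiple-choice item (hypothetical responses).
key = "B"
responses = ["B", "B", "A", "B", "C", "B", "A", "B", "B", "B"]

counts = Counter(responses)
for option in "ABCD":
    role = "key" if option == key else "distractor"
    print(f"Option {option} ({role}): chosen by {counts[option]} student(s)")
# A distractor chosen by nobody (here, D) has low distracting power.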
In conclusion, there are three key factors in assuring a test's quality: reliability, validity, and item analysis. Test developers should take all three into consideration so that the tests they develop are truly meaningful.
References:
Brown, H. D. & Abeywickrama, P. 2010. Language Assessment: Principles and Classroom Practices. Second Edition. White Plains: Pearson Education, Inc.
Heaton, J. B. 1988. Writing English Language Tests. New York: Longman Inc.
Hughes, A. 2003. Testing for Language Teachers. Cambridge: Cambridge University Press.