Quality assurance on the internal attributes of a good assessment consists of
reliability, validity, and classical item analysis. Bachman (1990) states that
reliability and validity are the two essential qualities for the interpretation
and use of measures of language ability. In addition, classical item analysis is
also important. A summary of quality assurance on the internal attributes of a
good assessment is described as follows:
1. Reliability
Reliable means that the test can be trusted as a good test and can be used many
times and at different times. Johnson and Johnson (2002) mention that reliability
exists when students' performance remains the same on repeated measurement.
Reliability refers to the consistency of test scores: how consistent a particular
student's scores are from one testing to another. Weir (1993) states that a test
can be said to have high reliability if its results remain consistent when it is
re-administered to a group of students at different times. In short, a test is
reliable if it is consistent. There are four types of reliability. First,
Inter-Rater or Inter-Observer Reliability is used to assess the degree to which
different raters or observers give consistent estimates of the same phenomenon.
Second, Test-Retest Reliability is used to assess the consistency of a measure
from one time to another. Third, Parallel-Forms Reliability is used to assess the
consistency of the results of two tests constructed in the same way from the same
content domain. Fourth, Internal Consistency Reliability is used to assess the
consistency of results across items within a test. A brief sketch of how
test-retest and internal consistency reliability are commonly quantified is given
below.
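As an illustration only, and not a procedure taken from the authors cited above,
test-retest reliability is commonly estimated as the correlation between
students' scores on two administrations of the same test, and internal
consistency is often summarized with Cronbach's alpha. The short Python sketch
below uses hypothetical score data and assumed variable names:

import numpy as np

def test_retest_reliability(scores_time1, scores_time2):
    # Pearson correlation between the same students' scores on two occasions
    return np.corrcoef(scores_time1, scores_time2)[0, 1]

def cronbach_alpha(item_scores):
    # Internal consistency: item_scores is a (students x items) score matrix
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                         # number of items
    item_vars = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: five students who took the same test twice
time1 = [70, 85, 60, 90, 75]
time2 = [72, 83, 65, 88, 78]
print(round(test_retest_reliability(time1, time2), 2))

# Hypothetical data: five students by four dichotomously scored items
items = [[1, 1, 0, 1],
         [1, 1, 1, 1],
         [0, 1, 0, 0],
         [1, 0, 1, 1],
         [0, 0, 0, 1]]
print(round(cronbach_alpha(items), 2))

Values close to 1 indicate high consistency; how high is "high enough" depends on
the stakes of the test.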
2. Validity
Validity is the extent to which a test measures what it claims to measure.
Johnson and Johnson (2002) state that validity means the test actually measures
what it was designed to measure, all of what it was designed to measure, and
nothing but what it was designed to measure. A test has high validity if it is
able to measure what its objectives specify. According to Jack and Norman (1993),
validity is the most important idea to be considered when preparing or selecting
an instrument to use. There are four types of validity.
First, content validity is the extent to which a test measures a representative
sample of the subject content. Content validity here means that the test covers
the content of the curriculum in use. If the test materials are suitable to the
subject matter or curriculum, it can be concluded that the test has content
validity. In other words, if the test is not suitable to the subject matter or
curriculum, it has no content validity, and a test without content validity is no
longer considered a good test to give to the students.

Second, face validity is a property of a test intended to measure something: the
test is said to have face validity if it "looks like" it is going to measure what
it is supposed to measure. Brown (2004) states that face validity refers to the
degree to which a test looks right, and appears to measure the knowledge or
abilities it claims to measure, based on the subjective judgment of the examinees
who take it, the administrative personnel who decide on its use, and other
psychometrically unsophisticated observers.

Third, construct validity is a judgment based on the accumulation of correlations
from the numerous studies using the instrument being evaluated; measuring certain
specific characteristics in accordance with a theory of language behavior and
learning is construct validity.

Fourth, criterion-related validity means that a test has demonstrated its
effectiveness in predicting a criterion, or indicators of a construct. There are
two different types of criterion validity. Concurrent validity occurs when the
criterion measures are obtained at the same time as the test scores; this
indicates the extent to which the test scores accurately estimate an individual's
current standing on the criterion. Predictive validity occurs when the criterion
measures are obtained at a time after the test: the predictor scores are
collected first and the criterion data are collected at some later point, which
is appropriate for tests designed to assess a person's future status on a
criterion. A brief sketch of how a criterion-related validity coefficient is
often computed is given below.
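As a hedged illustration rather than a definition from the sources above,
criterion-related validity is often reported as a validity coefficient: the
correlation between scores on the test being validated and scores on the
criterion measure. The data and names below are hypothetical:

import numpy as np

test_scores = [55, 70, 65, 90, 80, 60]   # scores on the test being validated
criterion   = [58, 72, 60, 95, 82, 63]   # criterion scores: an established test
                                         # taken at the same time (concurrent) or
                                         # a later outcome such as course grades
                                         # (predictive)

validity_coefficient = np.corrcoef(test_scores, criterion)[0, 1]
print(round(validity_coefficient, 2))

Whether the coefficient is read as concurrent or predictive validity depends only
on when the criterion data were collected relative to the test.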
3. Classical Item Analysis
In analyzing test items, each item should fulfill the characteristics of a good
test: the test should be tried out, analyzed, and revised, and all items should
show the indicators of a good test. Brown (2004) states that item analysis simply
investigates how well the items on a test work with a particular group of
students. Item analysis is divided into the index of difficulty and the index of
discrimination. The index of difficulty shows how many students can answer an
item correctly; a good test item must be neither too difficult nor too easy for
the students. Andrew (1994) states that item difficulty refers to the ratio of
correct responses to total responses given to a test item. The index of
discrimination refers to an item's ability to differentiate between the upper
group and the lower group of students who answer the item correctly. Andrew
(1994) defines the item-discrimination index as a measure of how well an item
performs in separating the better students from the weaker ones; the index is
intended to distinguish respondents who know the most, or have the skills or
abilities being tested, from those who do not. The effectiveness of distracters
is also very important: distracters are analyzed to see whether they work as
expected. Brown (2004) states that distracter efficiency is one more important
measure of a multiple-choice item's value in a test, and one that is related to
item discrimination. A minimal sketch of these computations is given after this
paragraph.
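The following Python sketch illustrates one common way to compute these three
quantities; it assumes a small hypothetical data set in which each student's
chosen options are recorded and the upper and lower groups are simply the top and
bottom halves by total score (other cut-offs, such as the top and bottom 27%, are
also used in practice):

from collections import Counter

responses = [            # each row: one student's chosen options for three items
    ["A", "C", "B"],
    ["A", "C", "C"],
    ["B", "C", "B"],
    ["A", "D", "B"],
    ["A", "B", "D"],
    ["C", "C", "B"],
]
key = ["A", "C", "B"]    # correct option for each item

# Rank students by total score to form the upper and lower groups
scored = [(sum(r == k for r, k in zip(row, key)), row) for row in responses]
scored.sort(key=lambda s: s[0], reverse=True)
half = len(scored) // 2
upper = [row for _, row in scored[:half]]
lower = [row for _, row in scored[-half:]]

for i, correct in enumerate(key):
    # Index of difficulty: proportion of all students answering the item correctly
    p = sum(row[i] == correct for row in responses) / len(responses)
    # Index of discrimination: upper-group proportion minus lower-group proportion
    d = (sum(row[i] == correct for row in upper) / half
         - sum(row[i] == correct for row in lower) / half)
    # Distracter effectiveness: how often each wrong option was actually chosen
    distracters = Counter(row[i] for row in responses if row[i] != correct)
    print(f"item {i + 1}: difficulty={p:.2f}, discrimination={d:.2f}, "
          f"distracters={dict(distracters)}")

A difficulty index near 0 or 1 means the item was too hard or too easy, a
positive discrimination index means the better students answered the item
correctly more often than the weaker students, and a distracter that nobody
chooses is doing no work and should be revised.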
References
Bachman, L.F. 1990. Fundamental Considerations in Language Testing. New York:
Oxford University Press.
Brown, J.D. 2004. Testing in Language Programs: A Comprehensive Guide to English
Language Assessment. New York: McGraw-Hill Companies, Inc.
Andrew, C.D. 1994. Assessing Language Ability in the Classroom. USA: Heinle and
Heinle Publishers.
Jack, F.R. and Norman. 1993. How to Design and Evaluate Research in Education.
Singapore: McGraw-Hill, Inc.
Johnson, D.W. and Johnson, R.T. 2002. Meaningful Assessment: A Manageable and
Cooperative Process. USA: Allyn and Bacon.
Weir, Cyril. 1993. Understanding and Developing Language Tests. UK: Prentice Hall
International Ltd.