Saturday, 21 February 2015

Quality Assurance on Internal Attributes of a Good Language Assessment Device: Reliability, Validity, and Classical Item Analysis. Summary by Marwa & Erlik Widiyani Styati



Quality assurance on the internal attributes of a good assessment consists of reliability, validity, and classical item analysis. Bachman (1990) says that reliability and validity are the two essential qualities for the interpretation and use of measures of language ability. In addition, classical item analysis is also important. The quality assurance on the internal attributes of a good assessment is summarized as follows:
1.      Reliability
Reliable means that the test can be trusted as a good test and can be used many times and at different times. Johnson and Johnson (2002) mention that reliability exists when students’ performance remains the same on repeated measurement. Reliability refers to the consistency of test scores: how consistent a particular student’s scores are from one testing to another. Weir (1993) states that a test can be said to have high reliability if its results are consistent when it is administered repeatedly to a group of students at different times. In short, a test can be said to be reliable if it is consistent. The types of reliability are: First, Inter-Rater or Inter-Observer Reliability, which is used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon. Second, Test-Retest Reliability, which is used to assess the consistency of a measure from one time to another. Third, Parallel-Forms Reliability, which is used to assess the consistency of the results of two tests constructed in the same way from the same content domain. Fourth, Internal Consistency Reliability, which is used to assess the consistency of results across items within a test.
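To make the test-retest and internal-consistency types more concrete, the sketch below shows one common way to estimate them: test-retest reliability as the correlation between two administrations of the same test, and internal consistency as Cronbach's alpha computed over a student-by-item score matrix. This is only an illustrative Python sketch; the function names and the score values are assumptions, not data from the sources summarized here.

import statistics

def pearson_r(x, y):
    """Test-retest reliability: correlation between two administrations of a test."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(scores):
    """Internal consistency: alpha over a students-by-items matrix of item scores."""
    k = len(scores[0])  # number of items
    item_vars = [statistics.pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = statistics.pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: five students take the same test twice.
first = [78, 85, 62, 90, 70]
second = [80, 83, 65, 88, 72]
print(round(pearson_r(first, second), 2))  # close to 1.0 = highly consistent scores

# Hypothetical data: five students answering four dichotomous (0/1) items.
matrix = [[1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 1, 1], [0, 0, 0, 1]]
print(round(cronbach_alpha(matrix), 2))  # moderate internal consistency for this toy data

Inter-rater reliability is estimated in an analogous way, by correlating (or computing the agreement between) two raters' scores for the same set of performances.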

2.      Validity
Validity is the extent to which a test measures what it claims to measure. Johnson and Johnson (2002) state that validity means that the test actually measures what it was designed to measure, all of what it was designed to measure, and nothing but what it was designed to measure. A test has high validity if it is able to measure what its objectives state. Fraenkel and Wallen (1993) argue that validity is the most important idea to consider when preparing or selecting an instrument. There are four types of validity.
First, Content validity is the extent to which a test measures a representative sample of the subject content. Content validity here means that the test covers the content of the curriculum that is used. If the test materials are suitable to the subject matter or curriculum, it can be concluded that the test has content validity; if they are not, the test has no content validity. Thus, a test without content validity is no longer considered a good test to give to the students (a small coverage-check sketch follows this paragraph). Second, Face validity is a property of a test intended to measure something: the test is said to have face validity if it “looks like” it is going to measure what it is supposed to measure. Brown (2004) states that face validity refers to the degree to which a test looks right and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers. Third, Construct validity is a judgment based on the accumulation of correlations from numerous studies using the instrument being evaluated; measuring certain specific characteristics in accordance with a theory of language behavior and learning is construct validity. Fourth, Criterion-related validity means that a test has demonstrated its effectiveness in predicting a criterion or indicators of a construct. There are two types of criterion validity. Concurrent validity occurs when the criterion measures are obtained at the same time as the test scores; this indicates the extent to which the test scores accurately estimate an individual’s current state with regard to the criterion. Predictive validity occurs when the criterion measures are obtained at a time after the test: the predictor scores are collected first, and the criterion data are collected at some later point. This is appropriate for tests designed to assess a person’s future status on a criterion.
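As a small illustration of the content-validity idea above, the Python sketch below checks whether a set of test items covers every objective in the curriculum. The objective names and item tags are hypothetical; in practice, content validity also relies on expert judgment about how well each item represents its objective, not only on coverage counts.

# Curriculum objectives the test is supposed to sample (hypothetical labels).
curriculum_objectives = {"reading", "listening", "grammar", "vocabulary"}

# Each test item is tagged with the objective it is meant to measure.
test_items = {
    "item_01": "reading",
    "item_02": "grammar",
    "item_03": "reading",
    "item_04": "vocabulary",
}

covered = set(test_items.values())
missing = curriculum_objectives - covered        # objectives with no items at all
off_syllabus = covered - curriculum_objectives   # objectives tested but not in the curriculum

print("covered objectives:", sorted(covered))
print("objectives with no items:", sorted(missing))  # ['listening'] for this toy data
print("objectives tested but not in the curriculum:", sorted(off_syllabus))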
3. Classical Item Analysis
In analyzing test items, each item should fulfill the characteristics of a good test: the test should be tried out, analyzed, and revised, and all of its items should show the indicators of a good test. Brown (2004) states that item analysis is performed simply to investigate how well the items on a test work with a particular group of students. Item analysis is divided into the index of difficulty and the index of discrimination. The Index of Difficulty shows how many students can answer an item correctly; a good test item must be neither too difficult nor too easy for the students. Cohen (1994) states that item difficulty refers to the ratio of correct responses to total responses given to a test item. The Index of Discrimination refers to the ability of an item to differentiate between students in the upper group and students in the lower group who answer the item correctly. Cohen (1994) defines the item-discrimination index as telling how well an item performs in separating the better students from the weaker ones; the index is intended to distinguish respondents who know the most or have the skills or abilities being tested from those who do not. The Effectiveness of Distracters is also very important: it is applied to analyze whether the distracters in an item work as expected. Brown (2004) states that distracter efficiency is one more important measure of a multiple-choice item’s value in a test, and one that is related to item discrimination. A short sketch of these three measures follows below.
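The sketch below (with hypothetical responses and answer key, not the authors' data) shows how the three measures just described are typically computed for one multiple-choice item: the index of difficulty as the proportion of correct answers, the index of discrimination as the difference in correct proportions between the upper and lower scorer groups, and distracter effectiveness as the spread of wrong answers over the distracters.

from collections import Counter

key = "B"
# Responses of ten students, already sorted from the highest to the lowest total score.
responses = ["B", "B", "B", "A", "B", "C", "B", "A", "D", "C"]

# Index of difficulty: proportion of all students answering the item correctly.
difficulty = sum(r == key for r in responses) / len(responses)

# Index of discrimination: correct proportion in the upper half of scorers
# minus the correct proportion in the lower half.
half = len(responses) // 2
upper, lower = responses[:half], responses[half:]
discrimination = sum(r == key for r in upper) / half - sum(r == key for r in lower) / half

# Distracter effectiveness: a distracter chosen by nobody is not doing its job.
distracter_counts = Counter(r for r in responses if r != key)

print(f"index of difficulty:     {difficulty:.2f}")      # 0.50 for this toy data
print(f"index of discrimination: {discrimination:.2f}")  # 0.60 for this toy data
print("distracter choices:", dict(distracter_counts))    # {'A': 2, 'C': 2, 'D': 1}

An item answered correctly by nearly everyone (difficulty close to 1.0) or by nearly no one (close to 0.0), or one whose discrimination index is near zero or negative, is a candidate for revision, which is exactly the try-out, analyze, and revise cycle mentioned above.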

References
Bachman, L.F. 1990. Fundamental Considerations in Language Testing. New York: Oxford University Press.
Brown, J.D. 2004. Testing in Language Programs: A Comprehensive Guide to English Language Assessment. New York: McGraw-Hill Companies, Inc.
Cohen, A.D. 1994. Assessing Language Ability in the Classroom. USA: Heinle and Heinle Publishers.
Fraenkel, J.R. and Wallen, N.E. 1993. How to Design and Evaluate Research in Education. Singapore: McGraw-Hill, Inc.
Johnson, D.W. and Johnson, R.T. 2002. Meaningful Assessment: A Manageable and Cooperative Process. USA: Allyn and Bacon.
Weir, Cyril. 1993. Understanding and Developing Language Tests. UK: Prentice Hall International Ltd.
