Wednesday, 25 February 2015




The summary of Developing Standardized Tests of Language Proficiency
By: I.G.A. Lokita Purnamika Utami and Rina Sari

Standardized tests of language proficiency presuppose a comprehensive definition of proficiency. Swain (1990) relates proficiency assessment to three linguistic traits: grammar, discourse, and sociolinguistics, which can be measured through oral, multiple-choice, and written responses. Another definition of proficiency is offered by ACTFL, which takes a more holistic and unitary view, describing proficiency in four levels: superior, advanced, intermediate, and novice.

Saturday, 21 February 2015

Quality Assurance on Internal Attributes of a Good Language Assessment Device: Reliability, Validity, and Classical Item Analysis. Summary by Marwa & Erlik Widiyani Styati



Quality assurance on the internal attributes of a good assessment covers reliability, validity, and classical item analysis. Bachman (1990) states that reliability and validity are the two qualities essential to the interpretation and use of measures of language ability. In addition, classical item analysis is also important. A summary of quality assurance on the internal attributes of a good assessment is presented as follows:
1.      Reliability
Reliable means that the test can be trusted as a good test: it can be used many times and at different times. Johnson and Johnson (2002) mention that reliability exists when students’ performance remains the same on repeated measurement. Reliability thus refers to the consistency of test scores: how consistent a particular student’s test scores are from one testing to another. Weir (1993) states that a test can be said to have high reliability if its results are consistent when it is re-used many times with a group of students at different times. In short, a test is reliable if it is consistent. There are four types of reliability. First, inter-rater (or inter-observer) reliability is used to assess the degree to which different raters or observers give consistent estimates of the same phenomenon. Second, test-retest reliability is used to assess the consistency of a measure from one time to another. Third, parallel-forms reliability is used to assess the consistency of the results of two tests constructed in the same way from the same content domain. Fourth, internal consistency reliability is used to assess the consistency of results across items within a test.
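As a concrete illustration of the last type, internal consistency, here is a minimal sketch computing Cronbach’s alpha for a small set of dichotomously scored items. The responses are invented for illustration and are not drawn from any of the summarized sources.

```python
# A minimal sketch of internal consistency: Cronbach's alpha for a set
# of 0/1-scored items. All response data are invented for illustration.

responses = [      # rows = students, columns = items
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(responses[0])                                    # number of items
item_vars = [variance([row[i] for row in responses]) for i in range(k)]
total_var = variance([sum(row) for row in responses])    # total-score variance
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```

An alpha near 1 would suggest the items behave consistently; a low value, as in this tiny invented example, would suggest the items may not be probing the same construct.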

Tuesday, 17 February 2015

A Summary on “Parallel Tests and Equating: Theories, Principles, and Practice”
By: Agus Eko Cahyono and Jumariati

In the context of language testing, parallel tests are an important issue. Multiple test forms are said to be parallel when they are as equal to one another as possible in terms of test specifications, such as type, form, content, and purpose, and of statistical criteria, such as level of difficulty, discriminating power, and distracters. A common example is a school program that has two parallel test forms: one for the regular achievement test and the other for test-takers who need retesting. In this case, the forms must be parallel because their function is the same: to assess the achievement of the test-takers. A high-stakes test like the Ujian Nasional in Indonesian schools likewise needs to be parallel with regard to its function of assessing students’ learning achievement, even though it may be administered at different points in time throughout the country. Thus, the tests differ from one administration to another, but the forms remain equivalent. This implies that the assembly of multiple test forms should be designed very carefully to ensure fairness to each test-taker and, at the same time, to maintain the security of the tests.
In practice, there is still the possibility that the multiple test forms that have been developed are not equivalent; some differences in their statistical characteristics may remain. Therefore, equating methods are needed to address this problem. Equating parallel tests is an important issue in standardized testing in order to maintain fairness among test-takers sitting the test either at the same time or at different points in time. Kolen and Brennan (2004) define several requirements that must be met for forms to be considered equal. First is the equal construct requirement: the tests to be equated must measure the same construct; if their constructs differ, they cannot be equated. Second is the equal reliability requirement: the tests should yield equally reliable results. Third is the symmetry requirement: the equating transformation must be symmetrical. Fourth is the equity requirement: it should be a matter of indifference to each test-taker whether test form X or test form Y is administered. Finally, the population invariance requirement means that the equating function is the same regardless of the group of test-takers on which the equating is performed. These principles need to be taken into consideration whenever an equating is carried out.
In the practice of equating test forms, several methods are used: traditional equating methods, methods based on Item Response Theory (IRT), and computer-supported approaches such as the kernel method and Automated Test Assembly (ATA). Traditional equating is commonly carried out through a random groups design. In this design, test Form A is given to the first test-taker, test Form B to the second, test Form A to the third, and so forth. The results obtained by the test-takers working on Form A are compared to the results of those working on Form B, and a conclusion is drawn based on whether there is a difference between the two groups. If students taking Form B obtain higher scores than those taking Form A, we can conclude that Form B is easier than Form A and thus that the two forms are not equal (parallel).
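To make the random groups comparison concrete, the sketch below applies mean-sigma linear equating, one traditional method, to map a Form B score onto the Form A scale. The scores and the function name are illustrative assumptions, not data or code from Kolen and Brennan.

```python
import statistics

def linear_equate(score_b, scores_a, scores_b):
    """Map a Form B raw score onto the Form A scale by matching the
    means and standard deviations observed in two randomly
    equivalent groups (mean-sigma linear equating)."""
    mu_a, sd_a = statistics.mean(scores_a), statistics.pstdev(scores_a)
    mu_b, sd_b = statistics.mean(scores_b), statistics.pstdev(scores_b)
    return mu_a + (sd_a / sd_b) * (score_b - mu_b)

# Invented raw scores from two randomly equivalent groups.
group_a = [55, 60, 62, 70, 75, 80]   # test-takers who took Form A
group_b = [60, 64, 66, 73, 79, 85]   # test-takers who took Form B

# Form B appears easier (its group scored higher), so a Form B score
# of 70 maps to a somewhat lower score on the Form A scale:
print(round(linear_equate(70, group_a, group_b), 1))
```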
The use of computer technology such as ATA in assembling multiple test forms has lately been preferred by test assemblers because of its fast processing and its ability to draw on large item pools (Lin, 2008). With the development of ATA, pre-equated parallel test forms can be achieved more efficiently. The software processes the test criteria laid out in the test blueprint; these criteria fall into two groups: psychometric and non-psychometric attributes. Non-psychometric attributes include test content, test format, test length, item usage frequency, and item exclusion. Psychometric attributes deal with classical item statistics, IRT-based item parameter estimates, item response functions, or item information functions.
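As a rough, hypothetical illustration of how ATA software balances such attributes, the toy sketch below greedily selects items from a pool so that the assembled form’s mean classical difficulty stays close to a target value. Real ATA systems instead solve constrained optimization problems (e.g., integer programming) over many psychometric and non-psychometric attributes at once; every item name and number here is invented.

```python
# Toy automated test assembly: greedily pick items whose classical
# difficulty (proportion correct) keeps the form's mean difficulty
# close to a target. All item ids and p-values are invented.

item_pool = {
    "q1": 0.35, "q2": 0.50, "q3": 0.55, "q4": 0.60,
    "q5": 0.65, "q6": 0.70, "q7": 0.80, "q8": 0.90,
}

def assemble_form(pool, target_p, length):
    form, remaining = [], dict(pool)
    while len(form) < length:
        # Choose the item that keeps the running mean difficulty
        # of the form as close to the target as possible.
        best = min(
            remaining,
            key=lambda i: abs(
                (sum(pool[j] for j in form) + pool[i]) / (len(form) + 1)
                - target_p
            ),
        )
        form.append(best)
        del remaining[best]
    return form

print(assemble_form(item_pool, target_p=0.60, length=4))
```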
In conclusion, equating multiple test forms is a crucial method for ensuring the equality of the tests; it helps test designers guarantee fairness to each test-taker while maintaining the security of the test forms.


References:
Kolen, M. J., & Brennan, R. J. 2004. Test Equating: Methods and Practices (2nd ed.). New York: Springer-Verlag.

Lin, C.-J. 2008. Comparisons between Classical Test Theory and Item Response Theory in Automated Assembly of Parallel Test Forms. Journal of Technology, Learning, and Assessment, 6(8). Retrieved February 10, 2015, from http://www.jtla.org



Parallel Tests & Equating: Theory, Principles, and Practice
By: I.G.A. Lokita P. & Rina Sari

Equating is the strongest form of linking between the scores on two tests. Equating may be viewed as a form of scale aligning in which very strong requirements are placed on the tests being linked. The goal of equating is to produce a linkage between scores on two test forms such that the scores from each test form can be used as if they had come from the same test. Strong requirements must be put on the blueprints for the two tests and on the method used for linking scores in order to establish an effective equating. Among other things, the two tests must measure the same construct at almost the same level of difficulty and with the same degree of reliability.
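One common way to realize such a linkage is equipercentile equating: a score on Form X is mapped to the Form Y score that has the same percentile rank. The short sketch below illustrates the idea with invented score distributions; operational implementations smooth the distributions first (the NumPy calls here handle the interpolation).

```python
import numpy as np

def equipercentile_equate(x_score, x_scores, y_scores):
    """Sketch of equipercentile equating: find the percentile rank of
    x_score in the Form X distribution, then return the Form Y score
    at that same percentile (with linear interpolation)."""
    pr = np.mean(np.asarray(x_scores) <= x_score) * 100.0
    return float(np.percentile(y_scores, pr))

# Invented raw-score samples from the two forms.
form_x = [50, 55, 60, 65, 70, 75, 80]
form_y = [54, 58, 64, 69, 73, 79, 86]

print(round(equipercentile_equate(65, form_x, form_y), 1))
```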

Tuesday, 10 February 2015

Test-Wiseness: Definition, Types, and Implications, as well as Studies Related to Test-Wiseness
By: Agus Eko Cahyono and Jumariati

Test-wiseness, also called test-taking strategy, test-familiarity, or test-wisdom, is defined by Millman et al. (1965), as cited by Ferrier et al. (2011: 101), as “a subject's capacity to utilize the characteristics and formats of the test/or test-taking situation to receive a high score.” This is to say that test-wiseness is the test-taker’s ability to recognize and utilize cues in the test items or formats that can improve his or her score on the test. Furthermore, Millman et al. categorize the test-wiseness taxonomy into three domains: the test-takers, the test-constructor, and the test itself. In terms of the test-takers, four traits are involved: time-management strategies, minimizing mistakes, employing a guessing strategy, and using deductive reasoning. Meanwhile, the test-constructor domain deals with the advice given on what will be tested. Finally, the test itself may contain clues that help test-takers give the correct answer. It can be inferred that test-wiseness can affect the validity of a test, since a test-wise score does not reflect the actual ability of a test-taker. Some efforts can be made to reduce the test-wiseness effect, such as training all test-takers prior to the test or developing good test items that contain neither clues nor implausible distracters. In other words, the test should be free from test-wiseness-prone items so that every test-taker can demonstrate his or her actual ability. As Rogers and Bateson (1991) affirm, test-wiseness can affect one’s score on a test, and thus the score should not be misinterpreted as a measure of real performance.

Tuesday, 03 February 2015




TEST-WISENESS: DEFINITION, TYPES, AND IMPLICATIONS, AS WELL AS
SOME STUDIES RELATED TO TEST-WISENESS
By: I.G.A. Lokita Purnamika Utami
Rina Sari
Definition
Millman, Bishop, and Ebel (1965) defined test-wiseness as “a subject's capacity to utilize the characteristics and format of a test to receive a higher score, independent of the examinee's knowledge of the subject matter.”
Oakland and Weilert (1971) defined test-wiseness as “the ability to manifest test-taking skills which utilize the characteristics and formats of a test and/or test-taking situation in order to receive a score commensurate with the abilities being measured.”

TEST FORMS BASED ON APPROACHES AND THEIR APPLICATIONS IN STANDARDIZED TESTS (SUMMARY 3) BY MARWA & ERLIK WIDIYANI STYATI



Standardized tests are prepared for nationwide use, unlike teacher-made tests, which are developed by an individual teacher or a small group. A standardized test provides accurate and meaningful information about students. It is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a “standard” or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students. While different types of tests and assessments may be “standardized” in this way, the term is primarily associated with large-scale tests administered to sizeable populations of students, such as a multiple-choice test given to all the eighth-grade public-school students in a particular state.
There are two types of standardized tests: achievement tests and aptitude tests. Achievement tests focus on the knowledge and skills learned in school and may take the form of achievement batteries, diagnostic tests, or subject-specific tests. Aptitude tests focus on students’ maximum potential achievement and may measure general intellectual aptitude, aptitude to do well in college or in certain vocational training programs, reading aptitude, mechanical aptitude, or perceptual aptitude (Johnson and Johnson, 2002).
In addition to the familiar multiple-choice format, standardized tests can include true-false questions, short-answer questions, essay questions, or a mix of question types. While standardized tests were traditionally presented on paper and completed using pencils, and many still are, they are increasingly being administered on computers connected to online programs. While standardized tests may come in a variety of forms, multiple-choice and true-false formats are widely used for large-scale testing situations because computers can score them quickly, consistently, and inexpensively.
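The consistency of machine scoring can be pictured in a few lines of code: every answer sheet is compared with the same key under the same rule, so all scores are directly comparable. The answer key and responses below are invented for illustration.

```python
# Standard, consistent scoring of selected-response items: each test
# taker's answers are compared to one fixed key in exactly the same
# way. The key and responses are invented.

answer_key = ["B", "D", "A", "A", "C", "T"]   # last item is true/false

def score(responses, key=answer_key):
    return sum(given == correct for given, correct in zip(responses, key))

students = {
    "S1": ["B", "D", "A", "C", "C", "T"],
    "S2": ["B", "A", "A", "A", "C", "F"],
}
for name, resp in students.items():
    print(f"{name}: {score(resp)}/{len(answer_key)}")
```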
The procedure for developing a standardized test consists of constructing, pre-testing, analyzing, revising, and editing. Standardized tests are used to obtain information about students’ ability on a national scale. They are also used to place students according to their capability, to arrange individualized instruction, and to arrange remedial teaching if the test is given early. A standardized test is analyzed statistically, and its validity is established before it is used widely; it has been tried out on a proper sample of the population for whom it is intended. Most tests of this type are made up of items, each of which has its own characteristics and has been shown to contribute to the total performance of the test (Arikunto, 2003; Ebel as cited by Nurgiyantoro, 1995).
Standardized tests may be used for a wide variety of educational purposes. Therefore, the first thing testers have to do, according to Hughes, is to have a clear purpose for testing. In this sense, the author categorizes tests according to the types of information each test provides. The following are representative examples of the most common forms:
1.      Proficiency tests measure people’s language ability regardless of any training they have had in the language.
2.      Achievement tests are administered at the end of a course of study.
3.      Diagnostic tests are used to identify the test taker’s strengths and weaknesses.
4.      Placement tests are intended to place students at a certain level.
5.      Direct testing requires the test taker to perform exactly the skill that is being measured, while indirect testing measures the abilities that underlie that skill.
6.      Discrete point testing tests one element at a time; integrative testing, by contrast, requires the test taker to incorporate several language elements as he or she performs a task.
7.      Norm-referenced testing relates one test taker’s performance to that of other test takers, whereas criterion-referenced testing tells teachers what test takers can do with the language (see the sketch after this list).
8.      Objective testing requires no judgment on the part of the tester; subjective testing does.
9.      Computer adaptive testing offers an efficient way of collecting data on tests and test items by programming whatever information is desired.
10.      Communicative language testing focuses on measuring the test takers’ communicative language ability.
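To make the norm-referenced versus criterion-referenced contrast concrete, the sketch below interprets the same raw score both ways; the cohort scores and the cut score are hypothetical.

```python
# One raw score, two interpretations (all numbers invented):
# norm-referenced (relative to other test takers) and
# criterion-referenced (relative to a fixed cut score).

cohort = [45, 52, 58, 60, 63, 67, 70, 74, 81, 90]  # other test takers
cut_score = 65                                     # mastery criterion
raw = 67

percentile = 100 * sum(s < raw for s in cohort) / len(cohort)
print(f"norm-referenced: above {percentile:.0f}% of the cohort")
print(f"criterion-referenced: {'pass' if raw >= cut_score else 'fail'}")
```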
References
Arikunto, Suharsimi. 2003. Dasar-dasar Evaluasi Pendidikan. Jakarta: Bumi Aksara.
Hughes, Arthur. 1989. Testing for Language Teachers. Cambridge: Cambridge University Press.
Johnson, David W. and Johnson, Roger T. 2002. Meaningful Assessment: A Manageable and Cooperative Process. USA: Allyn and Bacon.
Nurgiyantoro, Burhan. 1995. Penilaian dalam Pengajaran Bahasa dan Sastra. Yogyakarta: BPFE.



Summary 4: Quality Assurance on Internal Attributes of a Good Language Assessment Device: Reliability, Validity, and Classical Item Analysis
By: I.G.A. Lokita Purnamika Utami & Rina Sari

Reliability is the degree to which an assessment tool produces stable and consistent results.
Types of Reliability
1. Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.
2. Parallel forms reliability is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
3. Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed. Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.
4. Internal consistency reliability is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.
a.      Average inter-item correlation is a subtype of internal consistency reliability.  It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients.  This final step yields the average inter-item correlation. 
b.      Split-half reliability is another subtype of internal consistency reliability. The process of obtaining split-half reliability begins by “splitting in half” all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two “sets” of items. The entire test is administered to a group of individuals, the total score for each “set” is computed, and finally the split-half reliability is obtained by determining the correlation between the two total “set” scores; a minimal sketch of this procedure appears after this list.
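Here is that minimal sketch of the split-half procedure, including the Spearman-Brown correction that is typically applied to estimate full-length reliability from the half-test correlation. The item responses are invented, and statistics.correlation requires Python 3.10 or later.

```python
import statistics

# Invented 0/1 item responses: rows are students, columns are items.
responses = [
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
]

# Split the test into odd- and even-numbered items and total each half.
odd_half = [sum(row[0::2]) for row in responses]
even_half = [sum(row[1::2]) for row in responses]

half_r = statistics.correlation(odd_half, even_half)  # Pearson r
# Spearman-Brown correction: estimated reliability of the full test.
full_r = 2 * half_r / (1 + half_r)
print(f"split-half r = {half_r:.2f}, corrected = {full_r:.2f}")
```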

Sunday, 01 February 2015




Quality Assurance on Internal Attributes of a Good Language Assessment Device: Reliability, Validity, and Classical Item Analysis
By: Agus Eko Cahyono and Jumariati

A language assessment device is said to be good provided that it has met these attributes: reliability and validity. Reliability of a test is achieved if the test results are consistent and dependable across two or more administrations under the same conditions. Heaton (1988) and Brown and Abeywickrama (2010) mention that the reliability of a test can be affected by the student, the scoring, the test administration, and the test itself. Student-related reliability deals with the conditions of the student taking the test: fatigue, anxiety, motivation, and other physical and psychological factors can hinder the student from performing his or her true ability on the test. Therefore, teachers need to consider the condition of the students before administering a test. Rater reliability deals with the consistency of the scores given by one rater (intra-rater reliability) or by two or more raters (inter-rater reliability). This is especially challenging for subjective tests such as essay writing, where the rater may become fatigued during scoring and reliability may suffer as a result. It is suggested that the rater read through the essays and then cycle back through them to arrive at a sound judgment. When two or more raters are involved and their scores differ considerably, the scoring criteria probably need to be revised. Hughes (2003) suggests that rater training is necessary so that raters interpret the scoring criteria in the same way. Test-administration reliability is determined by the condition of the room where the test is administered, the seating arrangement, the room temperature, and the quality of the test sheets or audio. Therefore, teachers should carefully prepare a suitable room for the test and provide clear audio and legible copies of the test sheets. Finally, test reliability relates directly to the test itself: clear instructions, unambiguous items, and a number of items balanced against the time allotted for the test. Attending to these factors may help increase the reliability of a test.
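As a simple illustration of the inter-rater consistency discussed above, two raters’ essay scores can be correlated; a low coefficient would signal that the scoring criteria need revision or that the raters need training, as Hughes (2003) suggests. The scores are invented, and statistics.correlation requires Python 3.10 or later.

```python
import statistics

# Invented essay scores given by two raters to the same six scripts.
rater_1 = [78, 65, 90, 55, 70, 84]
rater_2 = [75, 68, 88, 60, 72, 80]

# Pearson correlation as a simple index of inter-rater reliability.
r = statistics.correlation(rater_1, rater_2)
print(f"inter-rater reliability (Pearson r) = {r:.2f}")
```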