Senin, 04 Mei 2015

Requirements, Assumptions, & Estimation of Parameters as well as Instrument Development Based on IRT

By:
Marwa & Erlik Widiyani Styati

IRT has a number of advantages over CTT methods to assess test outcomes.  CTT statistics such as item difficulty (proportion of correct responses), item discrimination (corrected item-total correlation), and reliability are contingent on the sample of respondents to whom the questions were administered.  IRT item parameters are not dependent on the sample used to generate the parameters, and are assumed to be invariant (within a linear transformation) across divergent groups within a research population and across populations.
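To make the contrast concrete, the following is a minimal sketch, in Python with invented response data, of how the sample-dependent CTT statistics mentioned above are computed from a scored (0/1) response matrix:

```python
import numpy as np

# Hypothetical scored responses: rows are examinees, columns are items (1 = correct).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

# CTT item difficulty: the proportion of correct responses per item.
difficulty = responses.mean(axis=0)

# CTT item discrimination: the corrected item-total correlation, i.e. the
# correlation of each item with the total score over the remaining items.
total = responses.sum(axis=1)
discrimination = [
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
]

print("difficulty:", difficulty)
print("discrimination:", np.round(discrimination, 2))
```

Because both statistics are computed directly from this particular sample, a different group of examinees would generally yield different values; that sample dependence is exactly what the (approximately invariant) IRT item parameters are meant to avoid.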

Introduction to Item Response Theory (Comparison between CTT and IRT)

By:
Marwa & Erlik Widiyani Styati
Classical test theory (CTT) and item response theory (IRT) are widely perceived as representing two very different measurement frameworks. This post gives a brief review of the related theories; additional detail is provided elsewhere (Crocker & Algina, 1986; McKinley & Mills, 1989).

Issues in the Development of Authentic Assessment: Portfolio, Project, Extended Response, etc.

By:
Marwa & Erlik Widiyani Styati
A commonly advocated best practice for classroom assessment is to make the assessments authentic. Authentic is often used to mean mirroring real-world tasks or expectations. There is no consensus, however, on the actual definition of the term or the characteristics of an authentic classroom assessment. Sometimes the realistic component is not even an element of a researcher’s or practitioner’s meaning.
Simply testing an isolated skill or a retained fact does not effectively measure a student's capabilities. To accurately evaluate what a student has learned, an assessment method must examine his or her collective abilities. The term authentic assessment describes the multiple forms of assessment that reflect student learning, achievement, motivation, and attitudes on instructionally relevant classroom activities.
There are several challenges to using authentic assessment methods, including managing their time-intensive nature, ensuring curricular validity, and minimizing evaluator bias. Despite these challenges, efforts must be made to appropriately assess all LEP students and to welcome assessment strategies that can empower students to take control of their own learning and become independent thinkers and users of the English language.




REFERENCES
O’Malley, J. M. & Valdez Pierce, L. 1996. Authentic Assessment for English Language Learners: Practical Approaches for Teachers. New York: Addison-Wesley Publishing Company.




Issues in the Development of Non-Test Instruments (Rating scales, semantic differential scales, checklists, questionnaires and others)

By:
(Marwa & Erlik Widiyani Styati)
The rating scale is one type of enquiry form. Rating is a term applied to an expression of opinion or judgment regarding some situation, object, or character. Opinions are usually expressed on a scale of values. Rating techniques are devices by which such judgments may be quantified. The rating scale is a very useful device for assessing quality, especially when quality is difficult to measure objectively. For example, “How good was the performance?” is a question which can hardly be answered objectively. Rating scales record judgments or opinions and indicate degree: the different degrees of quality are arranged along a line, and that line is the scale.

Parallel Tests & Equating: Theory, Principles, and Practice

By:
(Marwa & Erlik Widiyani Styati)
Definition of Equating
The process of equating is used to obtain comparable scores when more than one test form is used in a test administration. As Petersen et al. (1989) point out, the process of equating “is used to ensure that scores resulting from the administration of the multiple forms can be used interchangeably.” They further argue that equating can be defined as an empirical procedure for establishing a relationship between raw scores on two test forms, which can then be used to express the scores on one form in terms of the scores on the other form. Angoff (1971) defined the equating of tests as a process “to convert the system of units of one form to the system of units of the other” so that scores obtained from one form can be compared directly with scores obtained from the other form.
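As an illustration only (the mean-sigma formula is standard, but the function and score data below are invented for this sketch), a linear equating function expresses a raw score from form X on the scale of form Y by matching the two forms' means and standard deviations:

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Mean-sigma linear equating: map raw score x from form X onto the
    scale of form Y by matching the two forms' means and standard deviations."""
    mu_x, sd_x = np.mean(scores_x), np.std(scores_x)
    mu_y, sd_y = np.mean(scores_y), np.std(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)

# Hypothetical raw scores from two randomly equivalent groups of examinees.
form_x = np.array([12, 15, 18, 20, 22, 25, 27])
form_y = np.array([14, 16, 19, 22, 24, 27, 30])

# A raw score of 20 on form X, expressed in form Y units:
print(round(linear_equate(20, form_x, form_y), 2))
```

This is the simplest family of equating functions; operational programs typically use equipercentile or IRT-based procedures under carefully chosen data collection designs.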

Minggu, 03 Mei 2015

Issues in the Development of Standardized Proficiency Tests (Language Skills and Components): Vocational: BULATS, TOEIC, & TOEP



By: Marwa and Erlik Widiyani Styati

The standardized tests TOEFL and IELTS were discussed in the previous post. Other standardized tests are BULATS, TOEIC, and TOEP. BULATS (Business Language Testing Service) is a multilingual set of workplace language assessments. It is used internationally for business and industry recruitment, for identifying and delivering training, for admission to study business-related courses, and for assessing the effectiveness of language courses and training. The language skills tested in BULATS are listening, reading, writing, and speaking. Topics in BULATS are drawn from real-life workplace situations. The test can be done without particular software, but it needs an internet connection. The listening and reading tests are taken online, and the online test also covers grammar and language knowledge. The test takes about an hour, and the result is available at the end of the test.

Issues in the Development of Standardized Proficiency Tests (Language Skills and Components): Academic: IELTS and TOEFL



 By: Marwa and Erlik Widiyani Styati

A standardized test is developed, administered, and scored using established procedures and guidelines. All students take the same test under the same conditions. Every student is given the same opportunity to determine the correct answers, and all scores are established and interpreted using the same criteria.

Jumat, 01 Mei 2015

Requirements, Assumptions, Estimation of Parameters
as well as Instrument Development Based on IRT (Item Response Theory)

I.G.A. Lokita Purnamika Utami & Rina Sari

Item Response Theory relates a student's ability and item characteristics to the probability of obtaining a particular score on an item. IRT models depend on item and person parameters, both of which have to be estimated. The end products are best estimates of the item parameters and of each person's ability.
Item parameters are discrimination (a), difficulty (b), and the "guessability" or pseudo-guessing parameter (c). Person parameters are the ability estimates, which may represent, for example, a person's intelligence or the strength of an attitude.
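These parameters come together in, for example, the three-parameter logistic (3PL) model, where the probability of a correct response is P(θ) = c + (1 − c) / (1 + e^(−a(θ − b))). A minimal sketch in Python, with illustrative parameter values:

```python
import math

def p_correct(theta, a, b, c):
    """3PL model: probability that a person with ability theta answers an item
    with discrimination a, difficulty b, and pseudo-guessing c correctly."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: moderately discriminating, of average difficulty,
# with a 20% chance of a correct guess.
for theta in (-2, 0, 2):
    print(theta, round(p_correct(theta, a=1.2, b=0.0, c=0.2), 2))
```

As θ moves from low to high ability, the probability rises from near the guessing floor c towards 1, tracing the characteristic S-shaped item response curve.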

Selasa, 28 April 2015



A Summary on the Requirements, Assumptions, and Estimations of Parameters as well as Instrument Development based on IRT
By: Agus Eko Cahyono and Jumariati

In IRT-based test development, we need to consider the requirements, assumptions, and estimations of the parameters of the test items. Large samples of examinees are required to accurately estimate the IRT item parameters, and longer tests provide more accurate θ (ability) estimates. To a lesser extent, increasing the test length can also improve the accuracy of the item parameter estimation. This results from either improved estimation of the θs or improved estimation of the shape of the θ distribution. In addition, increasing the number of examinees can somewhat improve the estimation of θ through improved estimation of the item parameters.


A Summary on the Introduction to Item Response Theory
By: Agus Eko Cahyono and Jumariati

In the practice of equating test forms, some methods are used, such as Item Response Theory (IRT) and Classical Test Theory (CTT) methods. CTT is a theory about test scores that introduces three concepts: test score (often called the observed score), true score, and error score. Within that theoretical framework, models of various forms have been formulated. For example, in what is often referred to as the "classical test model," a simple linear model is postulated linking the observable test score (X) to the sum of two unobservable (or latent) variables, true score (T) and error score (E); that is, X = T + E. Because for each examinee there are two unknowns in the equation, the equation is not solvable unless some simplifying assumptions are made. The assumptions in the classical test model are that (a) true scores and error scores are uncorrelated, (b) the average error score in the population of examinees is zero, and (c) error scores on parallel tests are uncorrelated. In this formulation, true score is defined as the difference between test score and error score.
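A small simulation, with invented population values, can make the classical model and its assumptions concrete: generate true scores T and uncorrelated, zero-mean error scores E, form X = T + E, and observe that reliability equals var(T) / var(X):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 100,000 examinees: latent true scores T
# and error scores E that satisfy the classical assumptions.
T = rng.normal(loc=50, scale=10, size=100_000)
E = rng.normal(loc=0, scale=5, size=100_000)   # mean error is zero (assumption b)
X = T + E                                      # classical test model: X = T + E

# T and E are generated independently, so they are uncorrelated (assumption a),
# and reliability is the ratio of true-score to observed-score variance.
print("corr(T, E):", round(np.corrcoef(T, E)[0, 1], 3))   # approximately 0
print("reliability:", round(T.var() / X.var(), 3))        # approximately 100/125 = 0.8
```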

Selasa, 21 April 2015

Introduction to Item response theory
I.G.A. Lokita Purnamika Utami
Rina Sari
Item analysis provides a way of measuring the quality of questions: seeing how appropriate they were for the respondents and how well they measured their ability or trait. It also provides a way of re-using items over and over again in different tests with prior knowledge of how they are going to perform, creating a population of questions with known properties (e.g., a test bank).

Issues in the Development of Authentic Assessment

by: I.G.A. Lokita Purnamika Utami & Rina Sari

WHAT IS IT? Performance assessment, also known as alternative or authentic assessment, is a form of testing that requires students to perform a task rather than select an answer from a ready-made list. For example, a student may be asked to explain historical events, generate scientific hypotheses, solve math problems, converse in a foreign language, or conduct research on an assigned topic. Experienced raters--either teachers or other trained staff--then judge the quality of the student's work based on an agreed-upon set of criteria. This new form of assessment is most widely used to directly assess writing ability based on text produced by students under test instructions.

Minggu, 19 April 2015

A Summary on the Issues on the Development of Authentic Assessment
By: Agus Eko Cahyono and Jumariati

There is a mismatch between measures of language competence and the actual communicative competence required in real-world communicative interaction (Duran, 1988; Kitao & Kitao, 1996; McNamara, 1996; O'Malley & Valdez Pierce, 1996; Spolsky, 1995). The authentic assessment movement is an attempt to achieve a more appropriate and valid representation of student communicative competencies than that derived from standardized objective tests. Authentic assessment is also called performance-based assessment (Meyer, 1992; Marzano, 1993), while Wiggins (1990) named it alternative assessment. Authentic assessment is an assessment that simulates, as far as possible, the authentic behavior which learners will need to enact in real situations. Some examples of authentic assessment are self- and peer-assessment, projects or exhibitions, observations, journals, and portfolios.

Selasa, 07 April 2015

Issues in the Development of Non-test Instruments (Rating Scales, Semantic Differential Scales, Checklists, Questionnaires, and others)
By: Agus Eko Cahyono and Jumariati

Rating scales are instruments used when the aspect of performance or the quality of a product varies from low to high, best to worst, good to bad, or on some other implicit continuum (Roid & Haladyna, 1982). The first step in constructing a rating scale is to define what aspects of performance are to be rated. Then, create the scale by employing one of these types: (1) simple numerical, (2) simple graphic, and (3) descriptive. A simple numerical scale uses numbers to rate the qualities, for instance from 1 (very poor), 2 (poor), 3 (fair), and 4 (good) to 5 (very good). This type is very efficient and probably the most popular one: numbers represent degrees, and the rater merely assigns a number to each object or performance being observed. The next type is the simple graphic scale, in which the rater is confronted with 3 to 7 terms representing the degrees. This type allows less chance for deviation among raters, but it is less efficient as it requires more time and more pages for writing the answer options. The last type describes the points on the rating scale more fully and is easy to use even by untrained raters. The disadvantage is that it takes more time to develop.
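As a toy illustration (the labels and ratings below are invented), a simple numerical scale can be written down directly, with numbers standing for degrees of quality and several raters' judgments averaged:

```python
# A simple numerical rating scale: numbers represent degrees of quality.
SCALE = {1: "very poor", 2: "poor", 3: "fair", 4: "good", 5: "very good"}

# Hypothetical ratings assigned by three raters to one performance.
ratings = [4, 5, 3]

average = sum(ratings) / len(ratings)
print(f"average rating: {average:.1f} ({SCALE[round(average)]})")
```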

Selasa, 31 Maret 2015

Issues in the development of standardized proficiency tests (language skills and components)
Vocational: BULATS, TOEIC, TOEP
Rojab Siti R. & Muhammad Yunus
1. BULATS
BULATS stands for Business Language Testing Service. This test is designed to help companies find out the level of language skills among their staff, trainees, or job applicants, and it assesses the language skills which are needed for the workplace and for students and employees on language courses.

Issues in the Development of Standardized Proficiency Tests (Language Skills and Components): Vocational: BULATS, TOEIC, & TOEP

Rojab Siti R & Muhammad Yunus

BULATS (BUSINESS LANGUAGE TESTING SERVICE)
The right language skills are an essential element of success in international business and industry. BULATS (Business Language Testing Service) is designed to help companies find out the language level of staff, trainees, and other workers in a business company. BULATS Online tests can be taken at any computer with a fast internet connection. No software needs to be downloaded or installed. A tutorial and a demonstration test are available for candidates to familiarise themselves with the task types. On-screen help guides are available throughout the tests, and information handbooks for candidates are also available.

Issues in the Development of Standardized Proficiency Tests for Vocational
Purposes: Language Skills and Components
By Muhammad Yunus & Rojab Siti R.

A standardized test is any form of test that (1) requires all test takers to answer the same questions, or a selection of questions from a common bank of questions, in the same way, and that (2) is scored in a “standard” or consistent manner, which makes it possible to compare the relative performance of individual students or groups of students.

Selasa, 24 Maret 2015

Issues in the Development of Standardized Proficiency Tests for Vocational
Purposes: Language Skills and Components
By: Agus Eko Cahyono and Jumariati

Standardized tests for vocational purposes are developed to assess test-takers’ ability to use English in business settings. Their development also follows careful design steps to ensure validity and reliability. The first type is the Test of English for International Communication (TOEIC), which is developed by the Educational Testing Service (ETS) to measure the ability to use English in business settings. The TOEIC listening and reading test consists of multiple-choice items. In the listening section, test-takers listen to picture descriptions, questions or statements, conversations, and monologues. In the reading section, comprehension of reading passages is measured. The contents of the listening and reading tests are materials related to today’s international market. This test, also known as the “paper and pencil test”, validates abilities to listen and read in English in everyday workplace scenarios. Each section of the TOEIC is scored on its own scale. To ensure reliable results, the test results are sent to a secure online system and then scored by trained and certified raters.

Issues in the Development of Standardized Proficiency Tests for Academic
Purposes: Language Skills and Components
By: Agus Eko Cahyono and Jumariati

Proficiency tests have been used widely to measure test-takers’ ability to use English for educational purposes, such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS). The TOEFL is designed by the Educational Testing Service (ETS), headquartered in Princeton, New Jersey. In its item development, the TOEFL has gone through careful construction and revision. Previously, it contained some test-wiseness items, as studies found (Allan, 1992; Yang, 2000), which led to those items being revised. This is seen as an effort to ensure the validity of the TOEFL.

Parallel Tests & Equating: Theory, Principles, and Practice

Summarized by:
Marwa & Erlik Widiyani Styati

A parallel test is a multiple-form test. Multiple test forms are parallel when they are as similar to one another as possible in terms of test form, content, and item statistics; that is, the forms are equivalent in content, cognitive demand, and test item format. Parallelism is an important issue in testing. Multiple test forms should be designed very carefully to ensure fairness to the test-takers. Parallel test forms should be stable and tried out before the test is administered. After the parallel forms are administered and scored, the scores on the different forms are linked, a process commonly called equating.

Selasa, 17 Maret 2015

Parallel Tests & Equating: Theory, Principles, and Practice
By Rojab Siti R. & Muhammad Yunus

Score equating is essential for any testing program that continually produces new editions of a test and for which the expectation is that scores from these editions have the same meaning over time. Different editions may be built to a common blueprint and be designed to measure the same constructs, but they almost invariably differ somewhat in their psychometric properties.

Selasa, 10 Maret 2015

STANDARDIZED PROFICIENCY TEST (PART II) : BULATS, TOEIC, TOEP
By: I.G.A. Lokita Purnamika Utami
Rina Sari
This summary is the continuation of the previous summary about developing standardized tests, which elaborated on the TOEFL and IELTS. This summary focuses on the other standardized tests: BULATS, TOEIC, and TOEP, which are related to job or business purposes. The last one can also be considered for academic purposes.

Sabtu, 07 Maret 2015

Test wiseness: Definition, Types, and Implications, as well as Studies related with Test Wiseness
By Rojab Siti R., & Muhammad Yunus

Gibb (1964) defined test-wiseness as the ability to respond advantageously to item clues in a multiple-choice setting and therefore to obtain credit without knowledge of the subject matter being tested. Test wiseness is also called test familiarity or test wisdom by Thorndike (1951: 569). It can lead to lower validity.
Types of test wiseness:

Selasa, 03 Maret 2015

Issues in the Development of Non-Test Instruments
 (Rating scales, semantic differential scales, checklists, questionnaires and others)

by: I.G.A. Lokita Purnamika Utami and Rina Sari

For the purpose of collecting new, relevant data for a research study, the investigator needs to select proper instruments, termed tools and techniques. The major tools of research can be classified into the broad categories of inquiry forms, observation, interviews, social measures, and psychological tests.
Among the inquiry forms are the rating scale, attitude scale, opinionnaire, questionnaire, checklist, and semantic differential scale. Observation and interviews are explained as techniques of data collection. Under psychological tests, aptitude tests and inventories are discussed.

TEST WISENESS: DEFINITION, TYPES, AND IMPLICATIONS, AS WELL AS SOME STUDIES RELATED WITH TEST WISENESS



Marwa & Erlik Widiyani Styati
Defining Test-wiseness
Test-wiseness (TW hereafter) is a skill that permits a test-taker to utilize the characteristics and formats of tests and/or the test-taking situation to receive a high score. Some researchers (e.g., Benson, 1988; Rogers & Bateson, 1991) believe that TW is a cognitive ability or a set of test-taking strategies that a test-taker can use to improve a test score regardless of the content area of the test. Bond (1981) distinguishes between test-wiseness and test-coaching: TW is independent of content areas, whereas test-coaching refers to “sustained instruction in the domain presumably being measured”.

Rabu, 25 Februari 2015

The summary of Developing standardized test of language proficiency:
By: I.G.A. Lokita Purnamika Utami and Rina Sari

Standardized tests for language proficiency presuppose a comprehensive definition of proficiency. Swain (1990) relates proficiency assessment to three linguistic traits: grammar, discourse, and sociolinguistics, which can be measured through oral, multiple-choice, and written responses. Another definition of proficiency is offered by the ACTFL, which offers a more holistic and unitary view: superior, advanced, intermediate, and novice.

Sabtu, 21 Februari 2015

Quality Assurance on Internal Attributes of a Good Language Assessment Device: Reliability, Validity, and Classical Item Analysis: Summary by Marwa & Erlik Widiyani Styati



Quality assurance on the internal attributes of a good assessment consists of reliability, validity, and classical item analysis. Bachman (1990) says that reliability and validity are the two essentials for the interpretation and use of measures of language ability. Besides these, classical item analysis is also important. Here, the summary of quality assurance on the internal attributes of a good assessment is described as follows:
1. Reliability
Reliable means that the test can be trusted as a good test: it can be used many times and at different times. Johnson and Johnson (2002) mention that reliability exists when students’ performance remains the same on repeated measurements. Reliability refers to the consistency of test scores: how consistent a particular student’s test scores are from one testing to another. Weir (1993) states that a test can be said to have high reliability if the results show consistency when it is re-used many times with a group of students at different times. A test can be said to be reliable if it is consistent. The types of reliability are: first, inter-rater or inter-observer reliability, which is used to assess the degree to which different raters or observers give consistent estimates of the same phenomenon; second, test-retest reliability, which is used to assess the consistency of a measure from one time to another; third, parallel-forms reliability, which is used to assess the consistency of the results of two tests constructed in the same way from the same content domain; and fourth, internal consistency reliability, which is used to assess the consistency of results across items within a test.
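Internal consistency is commonly quantified with a coefficient such as Cronbach's alpha, computed as k/(k−1) × (1 − Σ item variances / variance of total scores) for k items. A minimal sketch with invented item scores:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an examinees-by-items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical scores of five students on four test items.
scores = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
print(round(cronbach_alpha(scores), 2))   # about 0.94 for this toy data
```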

Selasa, 17 Februari 2015

A Summary on “Parallel Tests and Equating: Theories, Principles, and Practice”
By: Agus Eko Cahyono and Jumariati

In the context of language testing, parallel tests are an important issue. Multiple test forms are said to be parallel when they are as equal to one another as possible in terms of test specifications, such as type, form, content, and purpose, and of statistical criteria, such as level of difficulty, discriminating power, and distracters. A common example is a school program which has two parallel tests: one is the achievement test, while the other is for those test-takers who need retesting. In this case, the tests must be parallel because the function is the same, that is, to assess the achievement of the test-takers. A high-stakes test like the Ujian Nasional in Indonesian schools is designed to be parallel with regard to its function of assessing students’ learning achievement, in spite of administrations that may occur at different points in time throughout the country. Thus, the tests differ from one administration to another, but the forms are still similar (equal). This implies that the assembly of multiple test forms should be designed very carefully and properly to ensure fairness to each test-taker and, at the same time, to maintain the security of the tests.
In fact, there is still the possibility that the multiple test forms that have been developed are not similar; some differences in the statistical characteristics may still be found. Therefore, equating methods are needed to face this problem. Equating parallel tests is an important issue in standardized testing for maintaining fairness among test-takers taking the test either at the same time or at different points in time. In order for forms to be equal, Kolen and Brennan (2004) define several requirements that need to be met. First is the equal construct requirement, in which the tests to be equated must measure the same construct. If the tests’ constructs are different, they cannot be equated. Second is equal reliability, wherein the tests should yield equally reliable results. Third is symmetry, which means that the equating transformations must be symmetrical. Fourth is the equity requirement, which demands that it be a matter of indifference to each test-taker whether test form X or test form Y is administered. Finally, the population invariance requirement means that the equating relationship is the same regardless of the group of test-takers on which the equating is performed. These principles need to be taken into consideration whenever equating is carried out.
In the practice of equating test forms, several methods are used, such as traditional equating methods, methods based on Item Response Theory, and computer-based approaches like the Kernel method and Automated Test Assembly (ATA). Traditional equating is commonly done through a random groups design. In this design, test Model A is given to test-taker one, test Model B is given to test-taker two, test Model A to test-taker three, and so forth. The results obtained by the test-takers working on test Model A are compared to the results of those working on Model B. A conclusion is then drawn based on whether or not there is a difference between the two groups. If students taking Model B obtain higher scores than those taking Model A, we can conclude that test Model B is easier than Model A and thus the forms are not equal (parallel).
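A sketch of this random groups comparison, with simulated scores for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores from two randomly equivalent groups, one per test form.
model_a = rng.normal(loc=60, scale=8, size=500)   # group taking test Model A
model_b = rng.normal(loc=64, scale=8, size=500)   # group taking test Model B

diff = model_b.mean() - model_a.mean()
print(f"mean difference (B - A): {diff:.1f}")
# Illustrative decision rule; in practice a statistical test would be used.
if abs(diff) > 1:
    print("the forms differ in difficulty and are not parallel: equating is needed")
```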
The use of computer technology such as ATA in assembling multiple test forms has lately been preferred by test assemblers because of its fast processing and the abundance of item pools (Lin, 2008). With the development of ATA, pre-equated parallel test forms can be achieved more efficiently. The computer software processes the test criteria that have been laid out in the test blueprint; these criteria are separated into psychometric and non-psychometric attributes. Non-psychometric attributes include the test content, test format, test length, item usage frequency, and item exclusion. Meanwhile, psychometric attributes deal with classical item statistics, IRT-based item parameter estimates, item-response functions, or item information functions.
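The following deliberately simplified sketch (item pool and targets invented) conveys the idea behind automated assembly: select items so that a form satisfies a non-psychometric constraint (content coverage) while approaching a psychometric target (mean difficulty). Real ATA systems treat this as a formal optimization problem rather than solving it greedily:

```python
# Hypothetical item pool with one non-psychometric attribute (content area)
# and one psychometric attribute (classical difficulty).
pool = [
    {"id": 1, "content": "reading",   "difficulty": 0.40},
    {"id": 2, "content": "reading",   "difficulty": 0.60},
    {"id": 3, "content": "listening", "difficulty": 0.45},
    {"id": 4, "content": "listening", "difficulty": 0.55},
    {"id": 5, "content": "reading",   "difficulty": 0.50},
    {"id": 6, "content": "listening", "difficulty": 0.50},
]

def assemble(pool, per_content, target_difficulty):
    """Greedy assembly: for each content area, take the required number of
    items whose difficulty is closest to the target difficulty."""
    form = []
    for content, count in per_content.items():
        candidates = [item for item in pool if item["content"] == content]
        candidates.sort(key=lambda item: abs(item["difficulty"] - target_difficulty))
        form.extend(candidates[:count])
    return form

form = assemble(pool, per_content={"reading": 2, "listening": 2},
                target_difficulty=0.50)
print([item["id"] for item in form])   # ids of the assembled form
```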
In conclusion, equating multiple test forms is a crucial method of ensuring the equality of the tests; it can help test designers guarantee the fairness of the test to each test-taker and the security of the test forms.


References:
Kolen, M. J. & Brennan, R. J. 2004. Test Equating: Methods and Practices (2nd ed.). New York: Springer-Verlag.

Lin, C.-J. 2008. Comparisons between Classical Test Theory and Item Response Theory in Automated Assembly of Parallel Test Forms. Journal of Technology, Learning, and Assessment, 6(8). Retrieved February 10, 2015 from http://www.jtla.org

Parallel Tests & Equating: Theory, Principles, and Practice
by: I.G.A. Lokita P. & Rina Sari

Equating is the strongest form of linking between the scores on two tests. Equating may be viewed as a form of scale aligning in which very strong requirements are placed on the tests being linked. The goal of equating is to produce a linkage between scores on two test forms such that the scores from each test form can be used as if they had come from the same test. Strong requirements must be put on the blueprints for the two tests and on the method used for linking scores in order to establish an effective equating. Among other things, the two tests must measure the same construct at almost the same level of difficulty and with the same degree of reliability.