Wednesday, December 28, 2016

TYPES OF LANGUAGE TESTING: NORM-REFERENCED AND CRITERION-REFERENCED TESTS




Introduction
A test is an instrument, often administered on paper or on a computer, intended to measure a test-taker's or respondent's (often a student's) knowledge, skills, aptitudes, or standing in other areas (e.g., beliefs). Tests are often used in education, professional certification, counseling, psychology, the military, and many other fields. One such kind is the language test. Not all language tests are of the same kind: they differ mainly in terms of design (method) and purpose. In terms of method, a broad distinction can be made between paper-and-pencil language tests and performance tests.
The need to assess the outcome of learning has led to the development and elaboration of different test formats. Testing language has traditionally taken the form of testing knowledge about language, usually the testing of knowledge of vocabulary and grammar. Stern (1983, p. 340) notes that "if the ultimate objective of language teaching is effective language learning, then our main concern must be the learning outcome". In the same line of thought, Wigglesworth (2008, p. 111) further adds that "In the assessment of languages, tasks are designed to measure learners' productive language skills through performances which allow candidates to demonstrate the kinds of language skills that may be required in a real world context." This is because a "specific purpose language test is one in which test content and methods are derived from an analysis of a specific purposes target language use situation, so that test tasks and content are authentically representative of tasks in the target situation" (Douglas, 2000, p. 19).
Thus, the issue of authenticity is central to the assessment of language for specific functions. This is another way of saying that testing is a socially situated activity, although the social aspects have been relatively under-explored (Wigglesworth, 2008). Language tests differ with respect to how they are designed and what they are for; in other words, with respect to test method and test purpose. In terms of method, we can broadly distinguish traditional paper-and-pencil language tests from performance tests.
Paper-and-pencil tests are typically used for the assessment of:
·         separate components of language (grammar, vocabulary …)
·         receptive understanding (listening & reading comprehension)
In performance tests, language skills are assessed in an act of communication, e.g. tests of speaking and writing, where:
  • extended samples of speech/writing are elicited
  • the samples are judged by trained markers
  • a common rating procedure is used
There are several common types of tests, namely: objective and subjective tests; direct and indirect tests; discrete-point and integrative tests; aptitude tests; achievement tests; proficiency tests; norm-referenced and criterion-referenced tests; and speed and power tests.

A.   Types of Language Tests

1.      Objective vs. Subjective Tests
An objective test is a psychological test that measures an individual's characteristics in a way that is independent of rater bias or the examiner's own beliefs, usually by administering a bank of questions that are marked and compared against exacting, completely standardized scoring mechanisms, much in the same way that standardized examinations are administered. Objective tests are often contrasted with projective tests, which are sensitive to rater or examiner beliefs. Objective tests tend to be more reliable and valid than projective tests; however, they are still subject to the willingness of the subject to be open about his/her personality, and as such can sometimes be poorly representative of the subject's true personality. Projective tests purportedly expose certain aspects of the personality of individuals that are impossible to measure by means of an objective test, and are claimed to be better at uncovering "protected" or unconscious personality traits or features.
An objective test is built by following a rigorous protocol which includes the following steps (a small item-analysis sketch follows this list):
  • Making decisions on nature, goal, target population, and power.
  • Creating a bank of questions.
  • Estimating the validity of the questions, by means of statistical procedures and/or the judgement of experts in the field.
  • Designing a format of application (a clear, easy-to-answer questionnaire, or an interview, etc.).
  • Detecting which questions are better in terms of discrimination, clarity, and ease of response, upon application to a pilot sample.
  • Applying the revised questionnaire or interview to a sample.
  • Using appropriate statistical procedures to establish norms for the test.
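To make the pilot-analysis steps concrete, here is a minimal Python sketch of how item difficulty and a crude discrimination index might be computed from pilot responses. The answer key and response data are invented for illustration; this is a sketch of the idea, not a prescribed procedure.

# Minimal item-analysis sketch for an objective-test pilot (hypothetical data).
# Each row is one examinee's answers; the key holds the correct options.
key = ["B", "D", "A", "C"]                      # hypothetical answer key
responses = [                                    # hypothetical pilot sample
    ["B", "D", "A", "C"],
    ["B", "D", "B", "C"],
    ["A", "D", "A", "B"],
    ["B", "C", "C", "C"],
    ["A", "B", "B", "A"],
    ["B", "D", "A", "A"],
]

# Score each examinee against the key (1 = correct, 0 = incorrect).
scored = [[int(ans == k) for ans, k in zip(row, key)] for row in responses]
totals = [sum(row) for row in scored]

n = len(scored)
ranked = sorted(range(n), key=lambda j: totals[j], reverse=True)
upper, lower = ranked[: n // 2], ranked[n - n // 2:]   # top and bottom scorers

for i in range(len(key)):
    # Difficulty (p-value): proportion of examinees answering the item correctly.
    p = sum(row[i] for row in scored) / n
    # Crude discrimination (upper-lower index): correct-rate among high scorers
    # minus correct-rate among low scorers.
    d = (sum(scored[j][i] for j in upper) / len(upper)
         - sum(scored[j][i] for j in lower) / len(lower))
    print(f"item {i + 1}: difficulty p = {p:.2f}, discrimination D = {d:.2f}")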
Usually these types of tests are distinguished on the basis of the manner in which they are scored. An objective test is said to be one that may be scored by comparing examinee responses with an established set of acceptable responses or a scoring key. A common example would be a multiple-choice recognition test.
Conversely, a subjective test is said to require scoring by opinionated judgement, hopefully based on insight and expertise, on the part of the scorer. An example might be the scoring of a free written composition for the presence of creativity, in a situation where no operational definitions of creativity are provided and where there is only one rater.
2.      Direct vs. Indirect Tests
A test is said to be direct when it actually requires the candidate to demonstrate ability in the skill being sampled. It is a performance test. For example, if we wanted to find out whether someone could drive a vehicle, we would test this most effectively by actually asking him to drive it. In language terms, if we wanted to test whether someone could write an academic essay, we would ask him to do just that. In terms of spoken interaction, we would require candidates to participate in oral activities that replicated as closely as possible [and this is the problem] all aspects of real-life language use, including time constraints, dealing with multiple interlocutors, and ambient noise. Attempts to reproduce aspects of real life within tests have led to some interesting scenarios.
Direct tests try to introduce authentic tasks which model the student's future real-life use of language. Such tests include:
·        Role-playing.
·        Information gap tasks.
·        Reading authentic texts, listening to authentic texts.
·        Writing letters, reports, form filling and note taking.
·        Summarising.
Direct tests are task oriented rather than test oriented; they require the ability to use language in real situations, and they therefore should have a good formative effect on your future teaching methods and help you with curriculum writing. However, they do call for skill and judgment on the part of the teacher.
An indirect test measures the ability or knowledge that underlies the skill we are trying to sample in our test. So, for example, you might test someone on the Highway Code in order to determine whether he is a safe and law-abiding driver [as is now done as part of the UK driving test]. An example from language learning might be to test the learners' pronunciation ability by asking them to match words that rhyme with each other. For example:
One of these words sounds different from the others. Underline it.
door
law
though
pore
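As an aside, an item like this can even be machine-scored. The sketch below is a hedged illustration that compares the words' final vowel sounds using the third-party Python package pronouncing (a wrapper around the CMU Pronouncing Dictionary); the package choice and the simplified rhyme criterion (last vowel only) are assumptions for illustration, not part of the original item.

import pronouncing  # pip install pronouncing (CMU Pronouncing Dictionary wrapper)
from collections import Counter

words = ["door", "law", "though", "pore"]

def final_vowel(word):
    # Take the first dictionary pronunciation and return its last vowel sound.
    # (A simplification: true rhyme also involves the consonants that follow.)
    phones = pronouncing.phones_for_word(word)
    if not phones:
        return None
    vowels = [p.rstrip("012") for p in phones[0].split() if p[-1].isdigit()]
    return vowels[-1] if vowels else None

vowel_of = {w: final_vowel(w) for w in words}   # door/law/pore -> AO, though -> OW
counts = Counter(vowel_of.values())
odd = [w for w, v in vowel_of.items() if counts[v] == 1]
print("odd one out:", odd)                       # ['though']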

Indirect testing makes no attempt to measure the way language is used in real life, but proceeds by means of analogy. Some examples that you may have used are:
·        Most, if not all, of the discrete-point tests discussed in the next section.
·        Cloze tests
·        Dictation (unless on a specific office skills course)
Indirect tests have the big advantage of being very 'test-like'. They are popular with some teachers and most administrators because they can be easily administered and scored; they also produce measurable results and have a high degree of reliability.
3.      Discrete-Point vs. Integrative Tests
Discrete-point tests are based on an analytical view of language, whereby language is divided up so that its components may be tested. Discrete-point tests aim to achieve a high reliability factor by testing a large number of discrete items. From these separate parts, you can form an opinion which is then applied to the language as a whole. You may recognise some of the following discrete-point tests:
1.      Phoneme recognition.
2.      Yes/No, True/ False answers.
3.      Spelling.
4.      Word completion.
5.      Grammar items.
6.      Most multiple choice tests.
Discrete-point testing assumes that language knowledge can be divided into a number of independent facts: elements of grammar, vocabulary, spelling and punctuation, pronunciation, intonation and stress. These can be tested by pure items (usually multiple-choice recognition tasks). Discrete-point tests are designed to measure knowledge or performance in a very restricted area of the target language. Thus a test of the ability to use correctly the perfect tenses of English verbs, or to supply correct prepositions in a cloze passage, may be termed a discrete-point test.
Integrative tests, on the other hand, are said to tap a greater variety of language abilities concurrently and therefore may have less diagnostic and remedial-guidance value and greater value in measuring overall language proficiency.
Such tests usually require the testees to demonstrate simultaneous control over several aspects of language, just as they would in real language use situations. Examples of Integrative tests that you may be familiar with include:
1.      Cloze tests
2.      Dictation
3.      Translation
4.      Essays and other coherent writing tasks
5.      Oral interviews and conversation
6.      Reading, or other extended samples of real text.
Integrative testing argues that any realistic language use requires the coordination of many kinds of knowledge in one linguistic event, and so uses items which combine those kinds of knowledge, like comprehension tasks, dictation, speaking and listening.
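To illustrate one integrative format from the list above, here is a minimal fixed-ratio cloze sketch that blanks out every nth word of a passage and keeps an answer key. The sample passage and the deletion rate (n = 5) are arbitrary choices for illustration.

def make_cloze(text, n=5):
    # Blank out every nth word and record the deleted words as an answer key.
    words = text.split()
    key = {}
    for i in range(n - 1, len(words), n):
        key[len(key) + 1] = words[i]
        words[i] = f"({len(key)}) ______"
    return " ".join(words), key

passage = ("Language tests differ with respect to how they are designed "
           "and what they are for, that is, with respect to test method "
           "and test purpose.")

cloze_text, answer_key = make_cloze(passage)
print(cloze_text)
print(answer_key)   # mapping of blank number -> deleted word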
Discrete-point testing risks ignoring the systematic relationship between language elements; integrative testing risks ignoring accuracy of linguistic detail.
Frequently an attempt is made to achieve the best of all possible worlds through the construction and use of test batteries composed of discrete-point subtests for diagnostic purposes, but which provide a total score that is considered to reflect overall language proficiency. The comparative success or failure of such attempts can be determined empirically by reference to data from test administrations. Farhady (1979) presents evidence that "There are no statistically revealing differences" between discrete-point and integrative tests.
4.      Aptitude, Achievement, and Proficiency Tests
An aptitude is an innate, acquired, or developed component of a competency (the others being knowledge, understanding and attitude) to do a certain kind of work at a certain level. Aptitudes may be physical or mental. The innate nature of aptitude is in contrast to achievement, which represents knowledge or ability that is gained.
Aptitude tests are most often used to measure the suitability of a candidate for a specific program of instruction or a particular kind of employment. For this reason, these tests are sometimes treated as synonymous with intelligence tests. A language aptitude test is designed to measure the students' probable performance in a foreign language which they have not yet started to learn.
Aptitude tests generally seek to predict the students’ probable strengths and weaknesses in learning a foreign language by measuring performance in an artificial language.
An achievement test is a test of developed skill or knowledge. The most common type of achievement test is a standardized test developed to measure skills and knowledge learned in a given grade level, usually through planned instruction, such as training or classroom instruction. Achievement test scores are often used in an educational system to determine the level of instruction for which a student is prepared. High achievement scores usually indicate mastery of grade-level material and readiness for advanced instruction. Low achievement scores can indicate the need for remediation or repeating a course grade.
Achievement tests reflect a student's ability and willingness to learn and show, on a percentage basis, how much of the training and materials presented in a particular class were absorbed. A score of 90% on an achievement test would indicate that the student had understood and carefully covered about 90% of what was presented in a particular class. A score of 40% would indicate that the student had only accomplished 40% of the class goals.
Another type of test, which measures overall language proficiency, is called a proficiency test. This is a test that globally measures how much of a language the student has acquired over a period of time from all sources. It may represent a few months of study or years of study and use of the language. Think of a proficiency test as showing the "tip of an iceberg." A scientist can measure the tip of an iceberg and calculate, with a great deal of accuracy, how much ice is under the water. A language proficiency test looks at a carefully selected group of language items, and the results determine how much of the whole language is probably understood. The score is not a "percentage" of anything. It is a number that provides useful information based on consistent results.
5.      Speed test vs. Power test
A speed test is one in which the items are so easy that, given unlimited time, every person taking the test might be expected to get every item correct. In a speed test the scope of the questions is limited and the methods you need to use to answer them are clear. Taken individually, the questions appear relatively straightforward. Speed tests are concerned with how many questions you can answer correctly in the allotted time.
For example:
139 + 235
A) 372
B) 374
C) 376
D) 437

A power test, by definition, is one that allows sufficient time for every person to finish, but contains items difficult enough that few if any examinees are expected to get every item correct. A power test will present a smaller number of more complex questions. The methods you need to use to answer these questions are not obvious, and working out how to answer the question is the difficult part. Once you have determined this, arriving at the correct answer is usually relatively straightforward.

For example:
Below are the sales figures for 3 different types of network server over 3 months.

Server     January           February          March
           Units    Value    Units    Value    Units    Value
ZXC43      32       480      40       600      48       720
ZXC53      45       585      45       585      45       585
ZXC63      12       240      14       280      18       340

In which month was the sales value highest?

A) January
B) February
C) March

What is the unit cost of server type ZXC53?

A) 12
B) 13
C) 14
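As a quick sketch, both answers can be checked mechanically from the table above; the data structure below simply transcribes the table.

# Transcription of the sales table: server -> (units, value) per month.
sales = {
    "ZXC43": [(32, 480), (40, 600), (48, 720)],
    "ZXC53": [(45, 585), (45, 585), (45, 585)],
    "ZXC63": [(12, 240), (14, 280), (18, 340)],
}
months = ["January", "February", "March"]

# Q1: total sales value per month; the maximum identifies the answer.
totals = [sum(rows[m][1] for rows in sales.values()) for m in range(3)]
print(dict(zip(months, totals)))                       # January 1305, February 1465, March 1645
print("highest:", months[totals.index(max(totals))])   # March (answer C)

# Q2: unit cost = value / units, for any month of ZXC53.
units, value = sales["ZXC53"][0]
print("ZXC53 unit cost:", value / units)               # 585 / 45 = 13.0 (answer B)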
 
In summary, speed tests contain more items than power tests although they have approximately the same time limit. Speed tests tend to be used in selection at the administrative and clerical level, while power tests tend to be used more at the graduate, professional or managerial level. Although this is not always the case, performance on speed tests tends to correlate with performance on power tests: candidates who do well on one often do well on the other.
B.   Norm-referenced test vs. Criterion-referenced test
1.      Norm-Referenced Tests
According to Paul (1995), norm-referenced tests are formal assessments and have specific properties that allow a meaningful comparison of performance among children. These properties include clear administration and scoring criteria; validity, reliability, standardization, central tendency, standard error of measurement, and variability measures; and norm-referenced scores. Mills and Hambleton (1980) stated that norm-referenced assessments are constructed to facilitate comparisons among individuals in relation to the performance of the normative group. A standardized test is also used when comparing a child to the norm. Standardization is defined as the process of administering a test under uniform conditions to each child who is tested (Montgomery & Connolly, 1987).
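Of the properties listed above, the standard error of measurement has a standard classical-test-theory form, given here for reference:

\mathrm{SEM} = s_x \sqrt{1 - r_{xx}}

where s_x is the standard deviation of the observed scores and r_{xx} is the test's reliability coefficient. Intuitively, the more reliable the test, the less an observed score is expected to wander from the true score.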
There are advantages and disadvantages for using norm-referenced tests. Norm-referenced tests will provide evidence regarding the existence of a problem, suggest a need for further assessment, and/or help document a need for the initiation or continuation of therapy (McCauley & Swisher, 1984). Montgomery and Connolly (1987) reported that norm-referenced tests were designed to delineate differences among individuals and used for diagnostic and placement purposes. Johnson and Martin (1980) concluded that norm-referenced tests spread out individuals along a continuum of performance in order to detect deviations from the average.
McCauley and Swisher (1984) noted disadvantages to norm-referenced tests if misused. A misused norm-referenced test can lead to (a) a mistaken understanding of an individual’s problem, (b) an inappropriate and fruitless therapy program, and (c) an inaccurate conclusion regarding efficacy of therapy. Another disadvantage for norm-referenced tests is that the comparison of a test taker’s score to the relative norms involves a comparison of estimated, rather than absolute, or true values.
Besides the disadvantages mentioned above, McCauley and Swisher (1984) reported four specific problems in the use of norm-referenced tests. The first problem is using age-equivalent scores as test summaries. "This problem concerns the relation of age-equivalent scores and the raw scores on which they are based" (Salvia & Ysseldyke, 1981, p. 67). With most norm-referenced tests, similar differences in age-equivalent scores are the result of smaller and smaller differences in raw scores (McCauley & Swisher, 1984). Such a score is not necessarily based directly on evidence collected for children at that chronological age and can serve as a basis of misinterpretation. A second problem is profile analysis. McCauley and Swisher (1984) stated that the scores to be compared in a profile, on norm-referenced tests, are only estimates of the ideal or true scores one would obtain if the scores were free from measurement error. The third problem is treating performance on individual test items as an indication of deficit: the small number of items on a norm-referenced test cannot adequately sample all of the specific forms and developmental levels that might be appropriate. The fourth problem is using repeated testing as a means of assessing progress; the result can be underestimation or overestimation of change, since individuals are able to learn the items on the test. These problems demonstrate that norm-referenced tests provide incomplete and possibly misleading information for the formulation of language objectives and language analyses.
2.      Criterion-Referenced Tests
Paul (1995) proposed that criterion-referenced tests are procedures devised to examine a particular form of communicative behavior. Criterion-referenced tests make no reference to other children's achievement but only determine whether the child can attain a certain level of performance. Montgomery and Connolly (1987) stated that criterion-referenced tests document individual performance in relation to a domain of information or a specific set of skills. Therefore, criterion-referenced tests are designed to measure changes in successive performance in an individual. Criterion-referenced tests are used specifically for program planning and evaluating; however, they can also be standardized.

Much like norm-referenced tests, criterion-referenced tests have their own advantages and disadvantages. One advantage of criterion-referenced tests is their scoring procedures. This type of test is based on absolute rather than relative standards. Its primary use is to measure mastery of specific skills and test items, based on known performance objectives associated with the tasks of interest. Criterion-referenced tests are sensitive to, and can be used to measure, the effects of instruction, based on task analysis related directly to instructional objectives. Sensitivity is defined as the accuracy with which the test identifies children with language impairment as language impaired (Merrell & Plante, 1997). The ability to tie the test directly to the program objectives is another benefit of criterion-referenced tests. Freeman and Miller (2001) reported that criterion-referenced tests were consistently rated as the most useful assessment tool, both for understanding the child's abilities and needs, and for planning teaching responses to them. This assessment tool refers directly to the curriculum, and is likely to be considered comprehensible and relevant.
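In the usual screening terminology, the sensitivity defined above can be written as the proportion of truly impaired children whom the test correctly flags:

\text{sensitivity} = \frac{TP}{TP + FN}

where TP (true positives) counts impaired children the test identifies, and FN (false negatives) counts impaired children the test misses.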
Although there are a number of advantages to criterion-referenced tests, there are a few disadvantages that need to be mentioned. One disadvantage is the inability to assign age levels if the test is not normed or administered in a standardized manner. MacTurk and Neisworth (1978) stated that another disadvantage of criterion-referenced tests is the lack of comparative interpretability.
3.      Similarities between Norm-Referenced and Criterion-Referenced Tests
Even though norm-referenced and criterion-referenced tests have many differences, there are a few similarities. For example, criterion-referenced and norm-referenced tests should demonstrate the same inter-rater and test-retest reliability (Montgomery & Connolly, 1987). Issues of validity, such as content, concurrent, and predictive validity, should also be similar between the two tests when administered.
4.       Differences between Norm-Referenced and Criterion-Referenced Tests
McCauley (1996) summarized the differences between norm-referenced and criterion-referenced tests in a simple way. The first difference is the fundamental purpose of the two tests: the fundamental purpose of norm-referenced tests is to rank individuals, whereas the fundamental purpose of criterion-referenced tests is to distinguish specific levels of performance. A second difference is test planning: norm-referenced tests address broad content, while criterion-referenced tests address a clearly specified domain. Lastly, a third difference is how the individual's performance is summarized: with norm-referenced tests, performance is summarized meaningfully by using percentile ranks and standard scores, whereas criterion-referenced test performance is summarized meaningfully by using raw scores.
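As a hedged sketch of this last difference, the snippet below interprets the same invented raw score both ways: norm-referencing converts it into a standing within the group (a z-score and a percentile rank), while criterion-referencing compares it with a preset cut score. All numbers are hypothetical.

# Same invented raw scores, interpreted two ways (all numbers hypothetical).
import statistics

scores = [12, 15, 18, 20, 22, 25, 27, 30]   # hypothetical raw scores for the group
student = 22
cut_score = 24                               # hypothetical mastery criterion

# Norm-referenced: standing relative to the group.
mean, sd = statistics.mean(scores), statistics.stdev(scores)
z = (student - mean) / sd
percentile = 100 * sum(s <= student for s in scores) / len(scores)
print(f"norm-referenced: z = {z:.2f}, percentile rank ~ {percentile:.0f}")

# Criterion-referenced: comparison with the preset standard; the group is irrelevant.
print("criterion-referenced:", "mastery" if student >= cut_score else "non-mastery")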
Many educators and members of the public fail to grasp the distinctions between criterion-referenced and norm-referenced testing. It is common to hear the two types of testing referred to as if they served the same purposes or shared the same characteristics. Much confusion can be eliminated if the basic differences are understood. The following comparison is adapted from: Popham, J. W. (1975). Educational evaluation. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Purpose
·        Criterion-referenced tests: to determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
·        Norm-referenced tests: to rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Content
·        Criterion-referenced tests: measure specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
·        Norm-referenced tests: measure broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Item characteristics
·        Criterion-referenced tests: each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
·        Norm-referenced tests: each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Score interpretation
·        Criterion-referenced tests: each individual is compared with a preset standard for acceptable achievement; the performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. Student achievement is reported for individual skills.
·        Norm-referenced tests: each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine. Student achievement is reported for broad skill areas, although some norm-referenced tests do report student achievement for individual skills.

References

Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Buck, G. (1989). Written tests of pronunciation: Do they work? ELT Journal, 43, 50-56.
Corbett, H. D., & Wilson, B. L. (1991). Testing, reform and rebellion. Norwood, NJ: Ablex.
Popham, J. W. (1975). Educational evaluation. Englewood Cliffs, NJ: Prentice-Hall.
Romberg, T. A., Wilson, L., & Khaketla, M. (1991). The alignment of six standardized tests with NCTM Standards. Unpublished paper, University of Wisconsin-Madison. Cited in J. K. Stenmark (Ed.), Mathematics assessment: Myths, models, good questions, and practical suggestions. Reston, VA: NCTM.
Stenmark, J. K. (Ed.). (1991). Mathematics assessment: Myths, models, good questions, and practical suggestions. Reston, VA: National Council of Teachers of Mathematics.
Stiggins, R. J. (1994). Student-centered classroom assessment. New York: Merrill.
U.S. Congress, Office of Technology Assessment. (1992). Testing in America's schools: Asking the right questions (OTA-SET-519). Washington, DC: U.S. Government Printing Office.
