
The JLTA Code of Good Testing Practice

Basic considerations for good testing practice in all situations

1. The test developer's understanding of just what the test, and each sub-part of it, is supposed to measure (its construct) must be clearly stated.

2. All tests, regardless of their purpose or use, must be valid and reliable to the degree necessary to allow the decisions based on their results to be fair to the test takers.

Validity refers to the accuracy of the inferences that are drawn from the test results. If, for example, the test purports to measure the ability to use English in business communication, the test is valid to the degree that it does in fact measure that ability. However, the ability to use English in business communication is a construct, and the test developer must spell out just what that construct is or what it consists of. The test can only be valid if the test construct is a complete and accurate picture of the skill or ability it is supposed to measure. To summarize Messick (1996), construct validity depends on the degree to which positive answers can be given to the following five questions.

1. Do our construct and our implementation of it include all and only the necessary elements?

2. Do we have these elements correctly weighted?

3. Do these elements interact in the same way in the test task and in real world performance?

4. Does our scoring scheme evaluate the test performance in the same way real-world performance is evaluated?

5. Is there anything about our test that will cause the test takers, or a portion of them, to perform in a less than optimal fashion?

Reliability refers to the consistency of the test results; Messick's sixth question addresses this issue (one widely used way of estimating reliability is sketched after the list below).

6. Are the results generalizable?

  a. Are the results comparable across time?

  b. Are the results comparable across settings?
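One widely used estimate of internal consistency, and thus of one facet of reliability, is Cronbach's alpha. The Python sketch below is purely illustrative and is not part of the Code; the response matrix is invented for demonstration.

    # Sketch: Cronbach's alpha as one common internal-consistency estimate.
    # The response matrix is invented for illustration.

    def cronbach_alpha(scores):
        """scores: one list of item scores per test taker."""
        k = len(scores[0])  # number of items

        def var(xs):
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

        item_vars = [var([row[i] for row in scores]) for i in range(k)]
        total_var = var([sum(row) for row in scores])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    # Five test takers, four dichotomously scored items (1 = correct).
    responses = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
    print(f"alpha = {cronbach_alpha(responses):.2f}")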

Responsibilities of test designers and test writers

1. A test designer must begin by deciding on the construct to be measured before deciding how that construct is to be operationalized.

2. Once the test tasks have been decided, their specifications should be spelled out in detail.

3. The work of the item writers needs to be edited before the items are pretested. If pretesting is not possible, the items should be analysed after the test has been administered but before the results are reported (see the sketch following this list). Malfunctioning or misfitting items should not be included in the calculation of individual test takers' reported scores.

4. Grading check sheets or rubrics must be prepared for test tasks requiring hand scoring. These check sheets or rubrics must be tried out to demonstrate that they permit reliable evaluation of the test takers' performance.

5. Those doing the grading should be trained for the task, and both inter- and intra-rater reliability should be calculated and published (rater agreement is also illustrated in the sketch following this list).

6. Test materials should be kept in a safe place and handled in such a way that no test taker is allowed to gain an unfair advantage over the other test takers.

7. Care must be taken to assure that all test takers are treated in the same way in the administration of the test.

8. Grading procedures must be carefully followed and score processing routines checked to make certain that no mistakes have been made.

9. The test results should be reported in a way that allows the test taker and other stakeholders to draw the correct inferences from them.
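The following Python sketch illustrates two of the routine computations mentioned in items 3 and 5 above: classical item statistics (difficulty and point-biserial discrimination) for flagging malfunctioning items, and Cohen's kappa as one simple inter-rater agreement index. All data are invented for demonstration; an operational programme would normally rely on dedicated psychometric software.

    # Sketch of two routine checks from items 3 and 5 above; all data invented.
    import statistics

    def item_stats(matrix):
        """Classical difficulty (p) and point-biserial discrimination for
        each item in a 0/1 response matrix (rows = test takers)."""
        totals = [sum(row) for row in matrix]
        mean_t = statistics.mean(totals)
        sd_t = statistics.pstdev(totals)
        n = len(matrix)
        results = []
        for i in range(len(matrix[0])):
            col = [row[i] for row in matrix]
            p = sum(col) / n  # proportion answering the item correctly
            mean_correct = statistics.mean([t for t, x in zip(totals, col) if x == 1])
            # Point-biserial: correlation between the item and the total score.
            rpb = (mean_correct - mean_t) / sd_t * (p / (1 - p)) ** 0.5
            results.append((p, rpb))
        return results

    def cohens_kappa(r1, r2):
        """Chance-corrected agreement between two raters' categorical grades."""
        n = len(r1)
        po = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
        pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in set(r1) | set(r2))
        return (po - pe) / (1 - pe)

    # Six test takers, three dichotomous items (1 = correct).
    responses = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1], [1, 0, 1]]
    for i, (p, rpb) in enumerate(item_stats(responses), start=1):
        print(f"item {i}: difficulty = {p:.2f}, point-biserial = {rpb:.2f}")

    # Two raters grading the same six essays on an A-C scale.
    print(f"kappa = {cohens_kappa(list('ABBCAB'), list('ABCCAA')):.2f}")

Items with near-zero or negative point-biserial values are the natural candidates for exclusion under item 3.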

Obligations of institutions preparing or administering high stakes exams

Institutions (schools, companies, certification bodies, etc.) developing and administering entrance, certification, or other high stakes examinations must utilize test designers and item writers who are well versed in current language testing theory and practice and have native or near-native competence in the language being tested. Items written by non-native speakers of the language being tested must be checked by competent native speakers.

Responsibilities to test takers and related stakeholders

Before the test is administered

The institution should provide all potential test takers with adequate information about the nature of the test; the construct (or constructs) the test is attempting to measure, ideally including any evidence and arguments showing that the test tasks do in fact measure what they are claimed to measure; the way the test will be graded; and how the results will be reported.

At the time of administration

The institution shall provide facilities for the administration of the test that do not disadvantage any test taker. Test administration materials should be carefully prepared, and proctors trained and supervised, so that each administration of the test is uniform, assuring that all test takers receive the same instructions, the same time to do the test, and the same access to any permitted aids. If something occurs that calls into question the uniformity of the administration, the problem should be identified and any remedial action to be taken to offset the negative impact on the affected test takers should be promptly announced.

At the time of scoring

The institution shall take the steps necessary to see that each test taker's exam paper is graded accurately and the result correctly placed in the database used in the assessment. There should be ongoing quality control checks to assure that the scoring process is working as intended.

Other considerations

If a decision must be made on candidates who did not all take the same test or the same form of a test, care must be taken to assure that the different measures used are in fact comparable. Equivalence must be demonstrated statistically.
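As an illustration of what demonstrating equivalence statistically can involve, the Python sketch below applies linear (mean-sigma) equating, one of the simplest equating methods. The score distributions are invented; operational equating would normally rest on a common-item or common-person design and far larger samples.

    # Minimal sketch of linear (mean-sigma) equating between two test forms.
    # All scores are invented for illustration.
    import statistics

    form_x = [55, 60, 62, 65, 70, 72, 75, 80]  # scores on Form X
    form_y = [50, 54, 58, 60, 63, 66, 70, 74]  # scores on the harder Form Y

    mx, sx = statistics.mean(form_x), statistics.pstdev(form_x)
    my, sy = statistics.mean(form_y), statistics.pstdev(form_y)

    def y_to_x(score_y):
        """Place a Form Y score on the Form X scale by matching means and SDs."""
        return mx + (sx / sy) * (score_y - my)

    for y in (50, 62, 74):
        print(f"Form Y score {y} -> Form X equivalent {y_to_x(y):.1f}")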

If more than one form of the test is used, inter-form reliability estimates should be published as soon as they are available.

Obligations of those preparing and administering commercially available exams

In addition to the obligations placed on any test designer and on those preparing high stakes examinations, developers and sellers of commercially available examinations must:

1. Make a clear statement as to which groups the test is appropriate for and for which groups it is not.

2. Make a clear statement of the construct the test is designed to measure in terms a layperson can understand.

3. Publish validity and reliability estimates for the test along with sufficient explanation to allow potential users to decide if the test is suitable in their situation.

4. Report the results in a form that allows test users to draw correct inferences from them and that makes the results difficult to misinterpret.

5. Refrain from making any false or misleading claims about the test.

6. Produce a test manual available to the public which:

  1. Explains the relevant measurement concepts so that they can be understood by non-specialists.

2. Reports evidence of the reliability and validity of the test.

  3. Describes the scoring procedure and, if multiple forms exist, steps taken to assure consistency of results across forms.

  4. Explains the proper interpretation of test results and any limitations on their accuracy.

Responsibilities of users of test results

Persons who utilize test results for decision making must:

1. Use results from a test that is sufficiently reliable and valid to allow fair decisions to be made.

2. Make certain that the test construct is relevant to the decision to be made.

3. Clearly understand the limitations of the test results on which they will base their decision.

4. Take into consideration the standard error of measurement (SEM) of the device that provides the data for their decision (a worked example follows this list).

5. Be prepared to explain and provide evidence of the fairness and accuracy of their decision making process.
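To make item 4 concrete: the SEM can be derived from a test's standard deviation and reliability as SEM = SD x sqrt(1 - reliability). The figures in the Python sketch below are invented for illustration.

    # Sketch: standard error of measurement and an approximate score band.
    # The SD and reliability figures are invented for illustration.
    import math

    sd = 12.0            # standard deviation of observed scores
    reliability = 0.90   # e.g. a published alpha or inter-form estimate

    sem = sd * math.sqrt(1 - reliability)
    observed = 65
    low, high = observed - 1.96 * sem, observed + 1.96 * sem
    print(f"SEM = {sem:.2f}")
    print(f"observed score {observed}: approximate 95% band {low:.1f} to {high:.1f}")

A decision that hinges on a cut score falling inside that band rests on weaker evidence than the bare score suggests.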

Special considerations

In norm-referenced testing

1. The characteristics of the population on which the test was normed must be reported so that test users can determine if this group is appropriate as a standard to which their test takers can be compared.

In criterion-referenced testing

1. The appropriateness of the criterion must be confirmed by experts in the area being tested.

2. Since correlation-based indices are unsuitable for criterion-referenced tests, whose score variance is often restricted around the cut score, methods appropriate for such test data, such as classification-consistency indices, must be used (one such method is sketched below).
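One family of methods appropriate for criterion-referenced data asks how consistently test takers are classified as masters or non-masters across two administrations or parallel forms, rather than correlating scores. A minimal Python sketch, with invented scores and an invented cut score:

    # Sketch: classification consistency for a criterion-referenced test.
    # Scores and cut score are invented for illustration.

    form1 = [72, 64, 81, 55, 90, 68, 74, 59]  # scores on the first form
    form2 = [70, 61, 85, 58, 88, 71, 69, 62]  # same test takers, second form
    cut = 70                                  # mastery cut score

    masters1 = [s >= cut for s in form1]
    masters2 = [s >= cut for s in form2]

    # p0: proportion of test takers given the same classification both times.
    p0 = sum(a == b for a, b in zip(masters1, masters2)) / len(masters1)
    print(f"classification agreement p0 = {p0:.2f}")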

In computer adaptive testing

1. The sample sizes must be large enough to assure the stability of the item response theory (IRT) estimates (the model being estimated is sketched after this list).

2. Test takers and other stakeholders must be informed of the rationale of computer adaptive testing and given advice on test taking strategies for such tests.
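For readers unfamiliar with the machinery behind item 1, the quantities being estimated are the item and ability parameters of an IRT model. The simplest case, the Rasch model, is sketched below in Python with invented parameter values.

    # Sketch: the Rasch model, the simplest IRT model used in adaptive testing.
    # Ability and difficulty values are invented for illustration.
    import math

    def p_correct(theta, b):
        """Probability that a test taker of ability theta answers an item
        of difficulty b correctly under the Rasch model."""
        return 1 / (1 + math.exp(-(theta - b)))

    # An adaptive test selects the next item near the current ability estimate,
    # where the response is most informative (p close to 0.5).
    for b in (-1.0, 0.0, 1.0):
        print(f"ability 0.0 vs difficulty {b:+.1f}: p = {p_correct(0.0, b):.2f}")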
