The original study
Each teacher assessed ten learners in this way.
In the second phase a small subset of these learners (the ones who had been assessed in the first phase) were video-recorded while they performed various communicative tasks in English. The video recordings were shown at a "rating conference" attended by all 100 teachers who had participated in the first phase.
The teachers were asked to rate each speaker, again using questionnaires made up of descriptors. These questionnaires could only comprise descriptors relating to speaking skills, of course, and they were applied to only a small number of learners, but this time each learner was assessed by all the teachers in the study, whereas in the first phase they had been assessed only by their own teacher.
The questionnaire responses from both phases were analysed using FACETS, a computer program that implements the multi-faceted Rasch statistical model. This is normally used to analyse test results, and thereby to calibrate test items (i.e. to place them on a scale of difficulty). In this case, however, it was used to scale the descriptors. Each descriptor was placed on a scale of difficulty depending on which learners had been judged to meet the criterion.
The point of the second phase of data collection - the video-based rating conference - was to scale the teachers. Each was assigned a scale value indicating their degree of harshness or leniency as a judge, which made it possible to correct for this factor when scaling the descriptors.
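To make the idea of correcting for rater severity concrete, here is a minimal sketch of the standard dichotomous form of a many-facet Rasch model, in which the log-odds of a "yes" judgment are the learner's ability minus the descriptor's difficulty minus the rater's severity. It assumes the simple yes/no judgment described above. The function name and all the numbers are hypothetical, chosen for illustration only; this is not the FACETS specification used in the study.

```python
import math


def p_meets_criterion(ability, difficulty, severity=0.0):
    """Probability that a rater judges a learner to meet a descriptor.

    All three arguments are on the same logit scale (illustrative values only).
    """
    logit = ability - difficulty - severity
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical learner and descriptor, judged by three hypothetical raters.
learner_ability = 1.0
descriptor_difficulty = 0.5

for label, severity in [("lenient", -1.0), ("neutral", 0.0), ("harsh", 1.0)]:
    p = p_meets_criterion(learner_ability, descriptor_difficulty, severity)
    print(f"{label:8s} severity={severity:+.1f}  P(yes)={p:.2f}")
```

In this sketch the same learner and descriptor yield a lower probability of a "yes" from a harsher rater. That is why estimating each teacher's severity first allows the descriptor difficulties to be estimated without being distorted by who happened to do the judging.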
The original research which led to the calibration of the CEFR descriptors was undertaken by Brian North and Günther Schneider in 1994 and 1995 in the context of a project funded by the Swiss National Science Foundation.
To begin with, they amassed a pool of language proficiency descriptors (CAN DO statements), drawing on existing scales. The pool was refined in the course of a series of teachers' workshops. Where teachers couldn't agree as to what aspect of proficiency a descriptor related to, it was rejected. Also, a number of near duplicates were removed or amalgamated.
The remaining descriptors were subjected to a process of statistical calibration. This culminated in the creation of the now familiar CEFR scale, and the placement of the descriptors in their respective bands on that scale. It is this process that I aim to - partially - replicate.
The calibration exercise undertaken by North and Schneider was based on judgments made by 100 teachers. A crucial aspect of their method was that they did not ask the teachers to estimate the difficulty level of the descriptors, or to place them in rank order of difficulty. Instead they asked teachers to actually use the descriptors to assess real learners.
They did this in two phases. In the first phase teachers used questionnaires made up of descriptors to assess the proficiency level of particular students whom they were currently teaching. So, for each descriptor the teacher had to indicate whether or not the student in question met the criterion expressed in the descriptor (i.e. CAN he or she DO it or not?).