When preparing for REF 2021, every academic department needs an experienced senior academic to read through all the department’s possible submissions and rate them on the REF scale.
There may be difficulty distinguishing papers at the lower end, but 4* papers will stand out and be clear to all, ensuring that the best outputs get their due credit in the exercise. Right? Wrong!
Academics might feel that they know what makes a 4* or a 1* paper, but when we compare the judgements of one experienced academic with another, there is actually considerable variation. Some raters are lenient, some are harsh, and some are idiosyncratic. Contrary to expectations, there is greater agreement at the lower end of the rating scale than there is at the top.
Steve Higgins and I have analysed all the pre-submission ratings by 23 senior academic staff of the papers written by 42 fellow academics in one department for the past REF. What we found confirmed that there is significant variation in both the severity and the consistency of judgements.
Nonetheless, we were able to identify a few assessors who were very consistent; these were the ideal raters. If these consistent raters were a bit lenient or a bit harsh, their results could be adjusted accordingly. Throughout the exercise the aim was not to aspire to some hypothetical true score, but to estimate the likely judgement of a fallible REF panel.
We found, contrary to expectation, that the best raters were able to place papers unambiguously onto the scale running from unclassified to 4*. In other words, even within our own subject of education, with the numerous different disciplines that feed into it, one clear construct (quality) can be identified. We also found that there was little point in trying to distinguish fractions of grades; academics were only able to judge whole points on the scale (unclassified, 1*, 2*, 3* and 4*).
We also found, again surprisingly, that it was harder for assessors to pick out a 4* paper than a 1* paper. Put another way, there were bigger errors of measurement around the more highly rated submissions, whilst agreement was pretty tight at the lower end. More ratings were needed to ensure confident estimates at the top end.
Our chosen analysis technique (Rasch modelling) put each paper and each rater onto an equal-interval scale. This scale showed that the ‘range’ of a 1* paper (i.e. between unclassified and 2*) was much less than the range of a 3* paper (from 2* to 4*). As a consequence, it would seem easier for the author of a potential 1* paper to upgrade it to a 3* paper than it would be to upgrade a 2* paper to a 4*.
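For readers curious about what putting papers and raters on one scale involves: the study itself reports a many-facet Rasch analysis, and the toy sketch below, written in Python on simulated data, only illustrates the general idea. The numbers of papers and raters echo the study, but the simulated quality, severity and threshold values, and the simple gradient-ascent fitting, are assumptions of the illustration rather than anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_papers, n_raters, n_cats = 42, 23, 5              # grades 0 (unclassified) to 4 (4*)

# Simulated "true" values -- purely illustrative, not figures from the study.
true_quality = rng.normal(0.0, 1.0, n_papers)       # paper quality on the latent scale
true_severity = rng.normal(0.0, 0.5, n_raters)      # positive = harsher rater
true_tau = np.array([0.0, -1.5, -0.5, 0.5, 1.5])    # category thresholds (tau_0 fixed at 0)

def category_probs(quality, severity, tau):
    """Rating-scale (many-facet Rasch) probabilities for every paper x rater cell."""
    step = quality[:, None, None] - severity[None, :, None] - tau[None, None, :]
    logits = np.cumsum(step, axis=2)                 # cumulative adjacent-category logits
    logits -= logits.max(axis=2, keepdims=True)      # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=2, keepdims=True)

# Simulate an observed ratings matrix: ratings[p, r] = grade given by rater r to paper p.
probs = category_probs(true_quality, true_severity, true_tau)
ratings = np.array([[rng.choice(n_cats, p=probs[p, r]) for r in range(n_raters)]
                    for p in range(n_papers)])

# Joint maximum-likelihood estimation by plain gradient ascent.
quality = np.zeros(n_papers)
severity = np.zeros(n_raters)
tau = np.zeros(n_cats)
one_hot = np.eye(n_cats)[ratings]                    # indicator of the observed grade
k = np.arange(n_cats)
for _ in range(3000):
    p_hat = category_probs(quality, severity, tau)
    resid = one_hot - p_hat                          # observed minus expected indicators
    grad_q = (resid * k).sum(axis=(1, 2)) / n_raters        # observed minus expected grade
    grad_s = -(resid * k).sum(axis=(0, 2)) / n_papers
    cum_resid = resid[..., ::-1].cumsum(axis=2)[..., ::-1]  # 1[x >= j] - P(X >= j)
    grad_t = -cum_resid.sum(axis=(0, 1)) / (n_papers * n_raters)
    grad_t[0] = 0.0                                  # tau_0 stays fixed
    quality += 0.1 * grad_q
    severity += 0.1 * grad_s
    tau += 0.1 * grad_t
    severity -= severity.mean()                      # identifiability: centre rater severity

print("recovered paper quality,  r =", round(np.corrcoef(quality, true_quality)[0, 1], 2))
print("recovered rater severity, r =", round(np.corrcoef(severity, true_severity)[0, 1], 2))
```

The point of the sketch is simply that, once papers and raters sit on the same latent scale, a harsh or lenient but consistent rater shows up as a large yet stable severity estimate that can be adjusted for, which is exactly the use described above.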
We also tried to see whether there were features in the raters’ profiles which might help predict who was likely to provide either consistent or idiosyncratic judgements. We failed. Holding a senior academic position, editing a well-respected journal, or having a strong publication record was no indication of more ‘accurate’ ratings, in the sense that none of these guaranteed ratings which were in agreement with others. Of course, such raters might be “right” in some sense, but not in the sense of second-guessing the REF panels’ final judgement.
Our analysis has clear implications for departments and others preparing for the next REF. The first is that having only one rater for each possible submission is not a good idea. The second is that one can expect a few idiosyncratic raters in a large academic department, and that they are unlikely to be spotted from their experience or seniority alone.
The third lesson is for the REF panels themselves. In order to select the best REF panel members, there should be an exercise in which candidates are asked to rate papers that have already been rated by several other experts, to assess whether their judgements are of sufficient quality and consistency.
This is a summary of Tymms, P., & Higgins, S. (2017), ‘Judging research papers for research excellence’, Studies in Higher Education, 1-13.
We had four different people assess our papers prior to 2014, two of them external Russell Group professors. Two of my papers got the full range of scores: 1, 2, 3 and 4 stars! The externals varied the most: one gave both papers one star, the other gave both four stars. I ended up using neither paper.