The key issue of the size and nature of the ‘subject instances’ that will underpin the Subject Level TEF has not been given enough attention. In fact the viability of the entire exercise turns on this apparently technical question.
Subject instances large enough to produce reliable results will usually be too heterogeneous in nature for those results to be useful, while instances homogeneous enough to be meaningful will be too small to yield statistically reliable data.
A short but necessary lesson in statistical logic
Random variation plagues the measurement of small groups, and is compounded when we want to make comparisons between these groups. This is because our real interest isn’t the groups themselves, but what they might tell us about future similar groups. Thus, if I evaluate a hospital unit by counting patient outcomes, my interest is in predicting the prospects of future patients. The more past patients facing similar conditions I can measure, the more likely they are to be representative of future patients. Indeed, this number is the only guide I have, since I cannot ever know just what ‘representativeness’ might comprise.
This is where the paradox of randomness strikes. If I can measure 1,000 people, I can make estimates with a margin of error of only a few percentage points. The ‘noise’ of variation is small compared to the strength of any ‘signal’ about their experience. Indeed almost everything we know about modern economy and society comes from this logic, perfected in Britain a century ago.
Conversely, measurements of small numbers of people, no matter how carefully made, and no matter how ‘typical’ we might hope these people to be, tell us disappointingly little, because random variation swamps any signal. With 100 people my margin of error rises to around +/- 10 percentage points. With 15 people the noise becomes as strong as the signal. With fewer, noise is about all we get.
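These figures follow from the standard formula for the 95% margin of error of a sample proportion, 1.96 × √(p(1 − p)/n). A minimal sketch, taking the worst case of p = 0.5:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error, in percentage points, for a sample proportion."""
    return 100 * z * math.sqrt(p * (1 - p) / n)

for n in (1000, 100, 15):
    print(f"n = {n:>4}: +/- {margin_of_error(n):.1f} percentage points")
# n = 1000 gives roughly +/- 3 points; n = 100 roughly +/- 10;
# n = 15 roughly +/- 25, i.e. noise as strong as most plausible signals
```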
Why this matters for the subject level TEF
UK universities currently offer about 37,000 undergraduate degree programmes. Over half a million students enter annually, giving a mean course cohort of around 15 students.
The Subject Level TEF therefore proposes to measure 35 CAH2 ‘subject instances’ rather than degree courses. This raises the average size of units to about 115 students, and sidesteps the problem that universities draw disciplinary boundaries in diverse ways.
Four key problems
Unfortunately this will not fix the problem, for four reasons. First, we know that for many of the NSS metrics on which TEF relies, the variance of student opinion and behaviour within the same department and provider is large compared to that between departments and providers. Analysing early NSS satisfaction data, Marsh and Cheung (2008) found that most of the variance in the results occurs at the individual level, with only a little over 10% attributable to some combination of provider and department. This raises the size of unit needed to distinguish underlying unit performance from random noise.
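The consequence of such a low between-unit share of variance can be made concrete with the standard Spearman–Brown relation for the reliability of a group mean, n·ICC / (1 + (n − 1)·ICC). A sketch, assuming an intraclass correlation of about 0.10 in line with the Marsh and Cheung figure (the cohort sizes are the illustrative ones used in this post):

```python
def group_mean_reliability(n, icc):
    """Reliability of a unit's observed mean as an estimate of its true mean,
    given cohort size n and intraclass correlation icc."""
    return n * icc / (1 + (n - 1) * icc)

for n in (15, 115, 500):
    print(f"n = {n:>3}: reliability = {group_mean_reliability(n, 0.10):.2f}")
# reliability rises with cohort size; at the mean course cohort of about 15,
# more than a third of the observed variation in unit means is still noise
```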
Second, ‘subject instances’ will vary in size. Many will still be smaller than any threshold needed to find any signal amongst the noise. But if the aim of the TEF is to inform stakeholders, there will be a perverse incentive to report some signal, lest providers disappear from league tables or other commentaries. Thus a highly likely unintended consequence of the subject level TEF will be the rolling up of specialist departments or niche degrees into larger ‘one size fits all’ programmes.
Third, random variation also hobbles the business of making comparisons between units. The larger the number of potential comparisons, the greater the risk that any individual one is merely the result of random variation. With a couple of hundred universities this can be managed, although even here it is hard to do more than identify a handful of providers that do better or worse than the average. However this has not stopped the widespread abuse of such metrics to construct league tables or ‘identify’ failing units by assuming the numbers to be far more reliable than they in fact are.
But what comparisons might students make when choosing between several thousand degree courses? We do not know. Although statistical techniques exist to mitigate this problem, it is difficult to envisage any system that could avoid conflating random variation with real differences on the one hand, but do more than distinguish the brilliant from the abysmal on the other.
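The scale of the multiple-comparison problem can be illustrated by simulation. The sketch below uses hypothetical figures (4,000 subject instances of 115 students each, all with an identical underlying satisfaction rate of 80%) and counts how many units would nevertheless be flagged as ‘significantly’ different from that rate at the conventional 5% level, purely by chance:

```python
import math
import random

random.seed(1)
TRUE_RATE, N_UNITS, COHORT = 0.80, 4000, 115

# standard error of a cohort's satisfaction rate around the true rate
se = math.sqrt(TRUE_RATE * (1 - TRUE_RATE) / COHORT)

flagged = 0
for _ in range(N_UNITS):
    # every unit has identical underlying performance; only sampling varies
    rate = sum(random.random() < TRUE_RATE for _ in range(COHORT)) / COHORT
    if abs(rate - TRUE_RATE) > 1.96 * se:
        flagged += 1

print(f"{flagged} of {N_UNITS} identical units flagged as 'different'")
# a few percent of units are flagged despite identical performance
```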
Finally, subject instances are a queer fish. The DfE consultation document asserts that not only the 35 CAH2 subjects but also the seven ‘broad’ subject groupings into which they will be sorted ‘are likely to have similar teaching practices, teaching quality and student outcomes.’ No evidence is offered to support this questionable claim. Is teaching in Computer Science and Civil Engineering similar, or Maths and Agriculture, or Architecture and Politics, or Archaeology and French language? The unit that is of interest to most students is the degree course, or the subject area or department teaching it. This is also typically the lowest unit through which university governance and compliance processes operate, and for good reason: the demands of teaching organisation, delivery and assessment are typically subject specific.
What next?
In order to judge the potential viability of the Subject level TEF, DfE ought therefore to supply some basic information, including:
- the distribution of subject instance sizes
- NSS and DLHE metric variance between and within subject instances
- the associated standard errors and their means of calculation
- the mitigation strategy for dealing with multiple comparisons.
The Office for National Statistics asked for an independent review of benchmarking. None has yet taken place. We also need an account of benchmarking that stakeholders can understand. Without one, results based on benchmarking are likely to be abused by appraisers in the same way as previous performance indicators.
The Scylla and Charybdis of the subject level TEF is that aggregating students into groups large enough to make meaningful statistical analysis possible debases the validity of that analysis: it treats disparate groups of students, with a variety of educational experiences, studying different subjects, located in disparate units of university governance, as if they were homogeneous. Randomness is not something the Office for Students or the Department for Education can change. Without a robust account of how they intend to deal with it, the prospects for a viable subject level TEF look poor.
‘With a couple of hundred universities this can be managed, although even here it is hard to do more than identify a handful of providers that do better or worse than the average.’
This is the nub, isn’t it? For years the published data have shown that a small group of institutions consistently do worse than the average; but rather than acting on those data we have chosen to prioritise finding a way to rank the great majority which are about average.
Very informative post, thanks John.
Excellent assessment. DfE and OfS now need to respond to this. My guess is they won’t, because they can’t without losing face.