A Bad Week for Algorithms? - Not Quite

Posted 2020-08-26 by Tom O'Connell


In March, the government announced that this year’s exams, along with most planned events for 2020, were cancelled. However, unlike most annulled events, which could be postponed or ferried off to cyber-space, A-Levels and GCSEs would still require a result, with or without exams being sat.

To provide a solution, the Office of Qualifications and Examinations Regulation (Ofqual) asked teachers to predict the grades students would have achieved. It was not believed that this alone would give universities and employers confidence in the integrity of the results. This is where Ofqual’s now-infamous algorithm made its entrance, aiming to restore credibility to those predictions. So, what went wrong?

First, when evaluating the suitability of big data as a means to an objective, it’s crucial that your data is just that - big. However, Ofqual’s experiment did not meet this qualification. If anything, it was the opposite, using just three years of historical exam data - in some cases, where grading systems had been revamped, there was only one year’s worth of viable results available.

This is not a process of obtaining as much information on a student as possible to determine their grade, something we would typically see in the commercial application of data science. It wasn’t powered by machine learning; it was simply a basic statistical model with contradictory in-built constraints on how many pupils could attain certain grades.

Some of the data the model relied on, such as a student’s previous results, was missing. This was due to a host of reasons, including having moved to the UK recently, or something as simple as a name change. Compounding the liability of a limited data set, the approach itself was complex and inconsistent. The algorithm, when certain criteria were met, followed nine steps:

1. Look at historic grades in the subject at the school
2. Understand how prior attainment maps to final results across England
3. Predict the achievement of previous students based on this mapping
4. Predict the achievement of current students in the same way
5. Work out the proportion of students that can be matched to their prior attainment
6. Create a target set of grades
7. Assign rough grades to students based on their rank
8. Assign marks to students based on their rough grade
9. Work out national grade boundaries and final grades

However, this process was only applied if there were more than 15 students taking exams in any given subject. If this criterion was not met, steps 1-7 were skipped and grades were based on the teacher’s original prediction.
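As a rough illustration of that branching, the sketch below captures the 15-student threshold and the fallback to teacher predictions. It is only a minimal sketch based on the description above: the function and variable names are hypothetical, and the full statistical process (steps 1-9) is stood in for by a callable.

```python
# Minimal sketch of the cohort-size branching described above.
# Names and data structures are hypothetical; this is not Ofqual's actual code.

SMALL_COHORT_LIMIT = 15  # cohorts of 15 or fewer kept the teacher's predictions


def grade_cohort(teacher_predictions, statistical_process):
    """Choose between teacher predictions and the statistical process.

    teacher_predictions: dict mapping student -> teacher-predicted grade
    statistical_process: callable implementing steps 1-9 for larger cohorts
    """
    if len(teacher_predictions) <= SMALL_COHORT_LIMIT:
        # Small cohort: steps 1-7 are skipped, teacher predictions stand.
        return dict(teacher_predictions)
    # Larger cohort: run the full statistical standardisation.
    return statistical_process(teacher_predictions)
```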

Once applied, this algorithm downgraded 40% of teacher-predicted A-Level grades, leading to nationwide controversy and the eventual ditching of the algorithm, reverting to the original teacher-predicted grades.

There were myriad problems with this process that led to its failure. By using a national average, excellent schools were penalised and exceptional students at average-performing schools suffered. Independent schools benefited from typically smaller class sizes, bypassing the algorithm entirely and receiving teachers’ predicted grades.

The process was over-reliant on the ranking system. Teachers were asked to rank all the students in one cohort from best to worst. This meant that if a student was ranked 10th and predicted a B, but in the previous three years the students ranked 10th all achieved a D, that student would likely be awarded a D.
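To make that concrete, here is a toy example of rank-based assignment: the school’s historical grades, sorted best to worst, act as the target distribution, and the student ranked nth simply inherits the nth grade. All names and grades below are invented for illustration.

```python
# Toy illustration of rank-based grade assignment; all data is invented.

# Historical grades at the school in this subject, sorted best to worst
# (standing in for the "target set of grades" from step 6).
historical_grades = ["A*", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "E"]

# The teacher's ranking of this year's cohort, best to worst.
ranked_students = ["Asha", "Ben", "Chloe", "Dan", "Ella", "Femi",
                   "Gita", "Hugo", "Ira", "Jack", "Kay", "Liam"]

# The student ranked 10th (Jack) inherits the 10th historical grade, a D,
# regardless of the B the teacher predicted.
assigned = dict(zip(ranked_students, historical_grades))
print(assigned["Jack"])  # -> D
```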

A lack of consistent data, inconsistent application, and grades heavily influenced by students a pupil had never met created a surge of panic, stress, and confusion. So, what could they have done differently?

Ofqual tested its algorithm to see if it could accurately predict the 2019 exam results. This failed - for some subjects the model was only 40% accurate. This should’ve set off alarm bells for Ofqual; they were working with the actual final rankings of cohorts, whereas the data this would be applied to for 2020 was a prediction of that ranking. If your model fails to score well against facts, what do you expect it to do against assumptions you’re already sceptical of?
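A backtest of that kind reduces to comparing the grades the model would have assigned for 2019 against the grades actually awarded, subject by subject. The sketch below uses invented data and hypothetical names; the roughly 40% accuracy quoted above is the sort of figure such a check produces.

```python
# Sketch of a per-subject backtest against a past year's real grades.
# Data structures and figures are invented for illustration.

def subject_accuracy(model_grades, actual_grades):
    """Fraction of students whose model-assigned grade matches the grade awarded."""
    if not model_grades:
        return 0.0
    matches = sum(1 for student, grade in model_grades.items()
                  if actual_grades.get(student) == grade)
    return matches / len(model_grades)


# Made-up 2019 results for one subject: only 2 of 5 model grades match.
model_2019 = {"s1": "A", "s2": "B", "s3": "C", "s4": "D", "s5": "B"}
actual_2019 = {"s1": "A", "s2": "C", "s3": "C", "s4": "C", "s5": "A"}

print(f"Backtest accuracy: {subject_accuracy(model_2019, actual_2019):.0%}")  # -> 40%
```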

Ofqual rushed the process to reach the earliest plausible result it could. Data analysis relies on humans in the loop, ensuring synergy between people and machines; this process instead appeared to feed an algorithm scraps of sporadic intelligence and tell it to undermine the most critical cornerstone of education: teachers.

Data and AI shouldn’t be used as an off-the-shelf replacement for human activity. Rather, they rely on diligent work from experts in the field to ensure they are appropriate, ethical, and accurate. They help us enhance our abilities and transcend our limitations in fields such as healthcare, where we can diagnose conditions difficult to detect with the human eye, and security, where algorithms can spot unusual patterns of behaviour to prevent fraud.

When the initial tests failed to achieve high accuracy scores, Ofqual should’ve questioned both the quality of the data and the process they were using. They should’ve tested as many different methodologies as required until they consistently produced accurate results against several sets of previous grades. Instead, they ploughed ahead on the assumption that the findings would, for the most part, be accepted.

With more time, more data, rigorous testing by experts, and better methodology, this algorithm could have produced more viable results. In addition, a mitigation process for anomalies, clearly laid out before publication, might have offset the outrage.

So, a bad week for algorithms? No, just for those using them incorrectly.
