Education Workforce Council



Dan Davies - ‘Is teacher assessment good enough for Successful Futures?’


Whilst acknowledging its shortcomings and widespread dissatisfaction leading to a Welsh Government review, Graham Donaldson has affirmed in Successful Futures that teacher assessment ‘should remain as the main vehicle for assessment before qualifications’ (Recommendation 39). This represents continuity with the Welsh Government policy of trusting primary teachers’ professional judgement, as when it abolished compulsory national tests for 11-year-olds in 2005. At the time this marked a divergence from assessment policy in England, where SATs in English and Maths at the end of Key Stage 2 have remained in place and will continue until the new National Curriculum is fully implemented.

However, even in England it has been acknowledged that national testing in science was having a detrimental effect on teaching and learning in the final year of primary school, leading to its abolition in 2010. This mirrors a wider movement towards teacher assessment for primary science in a number of OECD countries (e.g. Finland, Australia, New Zealand, Scotland, Northern Ireland). Sahlberg (2011) cites three main reasons why Finnish education relies on teacher assessment rather than external testing: 1. the progress of each pupil is judged more against his or her individual development and abilities than against statistical indicators; 2. assessment is embedded in the teaching and learning process and is used to improve both teachers' and pupils' work throughout the academic year; 3. academic performance and social development are seen as a responsibility of the school, not external assessors. This rationale reflects the democratic, highly professionalised role that education plays in the Finnish state. Whilst there have been some concerns over the reliability of teacher judgements, Finnish pupils' performance on TIMSS and PISA tests appears to bear out the effectiveness of the system overall, whilst the centrality of assessment design to the teacher's role in Finland ensures that it receives a great deal of attention in teacher education and professional development.

Elsewhere, however, the increasing reliance on teacher assessment raises a number of issues. The first is the extent to which evidence of pupil learning collected in the classroom for the formative purposes of supporting their future learning can legitimately be used to summarise attainment against external criteria. Another is the extent to which teachers’ ‘assessment literacy’ is sufficient to ensure that their assessments meet key criteria for effectiveness, including validity, reliability and manageability. Whilst we might argue that teacher assessment has greater validity than testing because it can be based on a wider range of evidence, teachers can be consciously or unconsciously influenced by ‘construct-irrelevant’ pupil characteristics (e.g. gender, ethnicity, socioeconomic status). Potentially its greatest weakness lies in inter-rater reliability, which concerns whether the same judgement would be made on the same evidence by different teachers. External testing using standardised instruments can be argued to produce results of greater consistency, whose reliability is measurable. Whilst few studies have attempted to assign coefficients of reliability to teacher judgements, it is widely acknowledged in the countries listed above that the most effective way to improve reliability of teacher assessment is through consensus moderation.
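To make the idea of a reliability coefficient concrete, the sketch below computes Cohen's kappa for two teachers who have each assigned a level to the same pieces of work. Kappa is only one of several possible coefficients and is an assumption here, not a statistic named in the studies referred to above; the function name and the sample levels are hypothetical and invented purely for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement.

    Returns 1.0 for perfect agreement and roughly 0 when agreement is no
    better than would be expected by chance.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must judge the same non-empty set of work")

    n = len(ratings_a)

    # Proportion of pieces of work on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each rater's own level frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical level throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical levels awarded by two teachers to the same eight pieces of work.
teacher_a = [4, 4, 5, 3, 4, 5, 3, 4]
teacher_b = [4, 5, 5, 3, 4, 4, 3, 4]
print(round(cohens_kappa(teacher_a, teacher_b), 2))
```

In this invented example the teachers agree on six of the eight pieces, but because some agreement would occur by chance the coefficient is noticeably lower than the raw agreement rate.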

Whilst there are statutory requirements for cluster group moderation when transferring assessment information between primary and secondary schools in Wales, moderation between primary teachers within schools does not appear to be common practice. Some jurisdictions such as Queensland also employ external moderation and exemplification of criteria to support teachers’ judgements, though studies of moderation processes have found that it takes up to three years to achieve acceptable inter-rater reliability through such approaches. An alternative may be to adopt an approach such as Adaptive Comparative Judgement, where teachers are asked to compare pairs of samples of pupils’ work, making the simple decision as to which is ‘better’. The samples are loaded into an online ‘pairs engine’ which, through presenting a number of teachers with a series of pair comparisons, gradually sorts the samples of work into a rank order with a reliability coefficient greater than 0.9 (comparable with SATs). This is an approach I have used in research projects involving both primary and secondary teachers assessing science, and it is based on the psychological principle that humans are better at making comparative judgements between two things than rating them against external criteria.
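The way a set of ‘which is better?’ decisions can be turned into a rank order can be sketched in a few lines of code. The example below is a simplified illustration, not the pairs engine itself: it assumes a Bradley-Terry model fitted by a standard iterative update, it does not choose the next pair adaptively, and it does not compute a reliability coefficient. The function name, the small smoothing value and the sample judgements are all hypothetical.

```python
def rank_from_pairs(n_samples, judgements, iterations=200, smoothing=0.1):
    """Estimate a quality score for each sample from pairwise judgements.

    judgements is a list of (winner, loser) index pairs, one per decision.
    The smoothing value acts as a small number of imagined wins and losses
    against a notional 'average' sample of score 1.0, which keeps scores
    finite for work that wins or loses every comparison (an assumption of
    this sketch, not part of any published engine).
    """
    wins = [0.0] * n_samples
    pair_counts = {}
    for winner, loser in judgements:
        wins[winner] += 1.0
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    scores = [1.0] * n_samples
    for _ in range(iterations):
        # Start each denominator with the imagined comparisons.
        denominators = [2.0 * smoothing / (scores[i] + 1.0) for i in range(n_samples)]
        for (i, j), count in pair_counts.items():
            shared = count / (scores[i] + scores[j])
            denominators[i] += shared
            denominators[j] += shared
        # Bradley-Terry update: wins divided by comparison-weighted exposure.
        scores = [(wins[i] + smoothing) / denominators[i] for i in range(n_samples)]

    ranking = sorted(range(n_samples), key=lambda i: scores[i], reverse=True)
    return scores, ranking


# Hypothetical judgements on four samples of work (numbered 0 to 3).
judgements = [(2, 0), (2, 3), (2, 1), (0, 3), (0, 1), (3, 1), (2, 0), (0, 3)]
scores, ranking = rank_from_pairs(4, judgements)
print(ranking)  # best sample first
```

Each judge only ever answers the simple comparative question; the rank order emerges from pooling many such decisions across judges, which is the feature of the approach that the paragraph above describes.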

Whichever approach we use to improve the reliability of teacher assessment, there will inevitably be a trade-off between reliability and validity, since the wider range of evidence required to represent fully the constructs within a field of learning may be more difficult to judge consistently than the relatively narrow dataset obtained through a single assessment instrument. This trade-off also relates to the manageability of teacher assessment; schemes in England such as Assessing Pupil Progress (APP) have been criticised because the workload they place on teachers is disproportionate to the validity of the emerging data, leading to the potential collapse of the system in many schools. Clearly a balance between and optimisation of validity, reliability and manageability is required for any effective approach to teacher assessment. Perhaps, however, we need to give the last word on this subject to the pupils. In a survey of 1000 pupils in primary and secondary schools in England and Wales, Murphy et al. (2012) found that the most popular suggestion (30%) for ‘ideal’ science assessment in Welsh schools was ‘frequent, end-of-topic tests’. This compares with only 17% in English schools, suggesting a somewhat disconcerting greater preference for testing amongst Welsh pupils, despite the prevalence of teacher assessment in Wales. Donaldson acknowledges that tests have their place in the assessment of Successful Futures, but the centrality of teacher assessment clearly needs further scrutiny.

Dan Davies, Dean of Cardiff School of Education, Cardiff Metropolitan University


Murphy, C., Lundy, L., Emerson, L. and Kerr, K. (2012) Children’s perceptions of primary science assessment in England and Wales, British Educational Research Journal, 39(3): 585–606.

Sahlberg, P. (2011) Lessons from Finland, Education Digest, 77(3): 18–24.


Dan Davies joined Cardiff Metropolitan University as Dean of the Cardiff School of Education in January 2015, having previously spent 16 years at Bath Spa University in a variety of roles, including Head of Research in the School of Education, Assistant Dean and Head of Primary Education. He taught in London primary schools before working for The Design Council and Goldsmiths’ College, where he gained his PhD. He is Professor of Science and Technology Education and has published widely in the field. In the field of assessment he worked on project e-scape (e-solutions for creative assessment in portfolio environments) from 2007-10 and directed the TAPS (Teacher Assessment in Primary Science) project from 2013-14.