Investigating how scores vary between observers, between animals and over time aids the design of valid sampling methodologies for measuring animal welfare. Locomotion scores (0 to 5 scale) were collected: i) from 154 sows in one herd, with three to five observers scoring on each occasion and each sow scored on up to ten occasions over a 19-month period; and ii) for 123 of these sows, additionally prior to farrowing and at weaning. The distribution of scores was highly skewed towards low scores (0: 84.8%, 1: 9.5%, 2: 4.0%, 3+: 1.7%). Sows showed moderate consistency in their scores over time, and later-parity sows had higher scores, but there was no effect of stage in the reproductive cycle (days pregnant, pre-farrowing, post-weaning). This suggests that infrequent visits to a farm (eg annual) might provide an accurate estimate of the extent of lameness if a representative range of parities were sampled, although a larger study of more farms would be required to investigate this. Three types of agreement between observers (absolute differences, matching and association) were assessed: i) analysis of absolute differences between observers showed that the farm manager scored lower than researchers/technicians; ii) exact matching approaches suggested fair or good agreement, with agreement poorest for mild gait abnormalities (score 1 ‘stiff’) and improving when scores were combined into ‘sound’ (0-1) and ‘lame’ (2-5) categories; and iii) measures of association suggested moderate agreement. Inter-observer reliability improved over time until the 5th scoring event. To improve inter-observer agreement, observer training/practice and the use of fewer categories are recommended, and inter-observer agreement should be checked regularly.
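
The effect of combining categories on exact-matching agreement can be illustrated with a minimal sketch. The abstract does not name the statistics used, so the choice of mean signed difference, percentage agreement and Cohen's kappa here is an assumption, as are all function names. The two score lists are invented for illustration and are not data from the study.

```python
from collections import Counter


def mean_signed_difference(a, b):
    """Mean of (a - b); a negative value means observer a scores lower."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


def percent_agreement(a, b):
    """Proportion of animals given exactly the same score by both observers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected exact agreement between two observers."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance from each observer's marginal frequencies.
    p_expected = sum(
        counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both observers used a single identical category throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def collapse(scores, lame_threshold=2):
    """Combine the 0 to 5 scale into 'sound' (0-1) and 'lame' (2-5)."""
    return ["lame" if s >= lame_threshold else "sound" for s in scores]


# Hypothetical scores for ten sows from two observers (not study data).
observer_a = [0, 0, 1, 0, 2, 0, 1, 3, 0, 0]
observer_b = [0, 1, 0, 0, 2, 0, 1, 2, 0, 0]

bias = mean_signed_difference(observer_a, observer_b)
raw_agreement = percent_agreement(observer_a, observer_b)
raw_kappa = cohen_kappa(observer_a, observer_b)
binary_kappa = cohen_kappa(collapse(observer_a), collapse(observer_b))

print(f"mean signed difference: {bias:.2f}")
print(f"exact agreement (0 to 5 scale): {raw_agreement:.2f}")
print(f"kappa (0 to 5 scale): {raw_kappa:.2f}")
print(f"kappa (sound/lame): {binary_kappa:.2f}")
```

In this invented example the observers disagree only between adjacent scores, and two of the three disagreements are between 0 and 1, so collapsing to two categories removes every mismatch. Real data would not behave so neatly, and a weighted kappa or a rank correlation might be preferred for ordinal scores.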