Welfare issues relevant to equids working in developing countries may differ greatly from those of sport and companion equids in developed countries. In this study, we test the observer reliability of a working equine welfare assessment, demonstrating how the prevalence of certain observations reduces reliability ratings. The assessment included behaviour, general health, wounds, and limb and foot pathologies. In Study 1, agreement between five observers and their trainer (the ‘gold standard’) was assessed using 80 horses and 80 donkeys in India. Intra-observer agreement was later tested on 40 of each species. Study 2 took place in Egypt, using nine observers, their trainer, 30 horses and 30 donkeys, with some scoring systems adjusted and observers given more detailed guidelines than in Study 1. Percentage agreements, Fleiss' kappa (with a weighted version for ordinal scores) and prevalence indices were calculated for each variable. Reliability was similar across both studies, but was significantly poorer for donkeys than for horses. Age, sex, certain wounds and (for horses alone) body condition consistently attained clinically useful reliability. Hoof horn quality, point-of-hock lesions, mucous membrane abnormalities, limb-tether lesions, and skin tenting showed poor reliability. Reporting the prevalence index alongside the percentage agreement showed that, for many variables, the populations were too homogeneous for conclusive reliability ratings. Suggestions are made for improving the scoring systems that showed poor reliability, but future testing will require deliberate selection of a more diverse equine population. This could prove challenging given that, in both populations of horses and donkeys studied here, many pathologies apparently showed 90-100% prevalence.
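The effect of prevalence on reliability ratings can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the study: a single binary variable, only two raters (so Cohen's kappa stands in for the multi-rater Fleiss' kappa reported here), and invented counts for 80 animals. The function name `agreement_stats` and the cell labels `a`, `b`, `c`, `d` are hypothetical. The prevalence index is computed using the common two-by-two definition, |a - d| / N.

```python
def agreement_stats(a, b, c, d):
    """Agreement statistics for two raters scoring one binary variable.

    a: both raters score 'present'      b: rater 1 only scores 'present'
    c: rater 2 only scores 'present'    d: both raters score 'absent'
    Returns (percentage agreement as a proportion, Cohen's kappa, prevalence index).
    """
    n = a + b + c + d
    observed = (a + d) / n                      # proportion of exact agreement
    p1_present = (a + b) / n                    # rater 1 marginal for 'present'
    p2_present = (a + c) / n                    # rater 2 marginal for 'present'
    # Agreement expected by chance from the two raters' marginals
    expected = p1_present * p2_present + (1 - p1_present) * (1 - p2_present)
    if expected == 1:
        kappa = float("nan")                    # undefined when no variation exists
    else:
        kappa = (observed - expected) / (1 - expected)
    prevalence_index = abs(a - d) / n           # 0 = balanced, 1 = fully one-sided
    return observed, kappa, prevalence_index


# Hypothetical counts, not data from the study.
balanced = agreement_stats(40, 5, 5, 30)   # pathology present in roughly half
skewed = agreement_stats(74, 3, 3, 0)      # pathology present in almost all animals

for label, stats in (("balanced", balanced), ("skewed", skewed)):
    agreement, kappa, pi = stats
    print(f"{label}: agreement={agreement:.3f} kappa={kappa:.3f} prevalence index={pi:.3f}")
```

In this invented example the skewed population gives the higher percentage agreement but a kappa close to zero, because almost all of the agreement is expected by chance when nearly every animal shows the pathology. A high prevalence index reported alongside the percentage agreement flags such variables as inconclusive rather than unreliable.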