We develop a formal framework for accumulating evidence across studies and apply it to lay theoretical foundations for replication. Our primary contribution is to characterize the relationship between replication and distinct formulations of external validity. Whereas conventional wisdom holds that replication facilitates learning about external validity, we show that this is not, in general, the case. Our results demonstrate how comparisons of the magnitude or sign of empirical findings map onto distinct concepts of external validity. Without careful attention to the research designs of the constituent studies, however, replication can mislead efforts to assess external validity. We show that two studies must have essentially the same research design, i.e., be harmonized, in order for their estimates to provide information about any form of external validity. It follows that even minor design differences between a study and its replication can introduce a discrepancy that is typically overlooked, a problem that becomes more pronounced as the number of studies increases. We conclude by outlining a design-driven approach to replication that responds to the issues our framework identifies and details how a research agenda can manage them productively.