A directed graph is simply an abstraction of a network of one-way roads between a set of cities. When one travels in such a network, one may return to the starting point. In this chapter, we are interested in special networks that exclude the possibility of ever returning to the starting point or to any city we have already visited. We refer to such a network as acyclic.
Suppose we are traveling in an acyclic network of one-way roads. By definition, on each trip, we may visit each city at most once. A very natural question arises: what is the maximum number of cities we can visit? In this chapter, we present a simple and efficient algorithm that finds the longest sequence of cities we can visit in the special case of acyclic networks of one-way roads.
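For illustration, here is a minimal Python sketch of the underlying idea: process the cities in a topological order and keep, for each city, the length of the longest route ending there. The dictionary encoding and the city names are assumptions made for the example, not taken from the text, and the chapter's own algorithm may differ in detail.

```python
from collections import deque

def longest_route(graph):
    """Maximum number of cities visitable on one trip in an acyclic
    one-way road network, given as {city: [cities reachable by one road]}."""
    # Count incoming roads for each city.
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1

    # Best route (counted in cities) ending at each city,
    # filled in topological order (Kahn's algorithm).
    best = {u: 1 for u in graph}
    queue = deque(u for u in graph if indegree[u] == 0)
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            best[v] = max(best[v], best[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return max(best.values())

# Example: routes a -> b -> d and a -> c -> d both visit three cities.
print(longest_route({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}))  # 3
```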
Acyclic directed graphs are also an abstraction of assembly instructions of an airplane model. The vertices in this case are not cities but assembly tasks (glue two parts together, paint a part, etc.). An edge from task u to task v is not a one-way road but a dependence indicating that before task v is started, one must complete task u. Given such assembly instructions, we would like to find an ordering of the tasks that obeys the dependencies. A single worker can then assemble the airplane model by completing one task at a time according to this ordering. Such an ordering is called a topological ordering. We present a simple and efficient algorithm for topological ordering.
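The sketch below illustrates topological ordering on a toy set of assembly tasks using a depth-first traversal; the task names and the dictionary encoding of dependences are illustrative assumptions, and the chapter's own algorithm may proceed differently.

```python
def topological_order(tasks):
    """Return an ordering of assembly tasks that respects all dependences,
    given tasks as {task: [tasks that directly depend on it]}.
    An edge u -> v means u must be completed before v starts."""
    order, done, on_path = [], set(), set()

    def visit(u):
        if u in done:
            return
        if u in on_path:
            raise ValueError("dependences contain a cycle")
        on_path.add(u)
        for v in tasks[u]:
            visit(v)
        on_path.discard(u)
        done.add(u)
        order.append(u)          # u is appended after everything it precedes

    for u in tasks:
        visit(u)
    return order[::-1]           # reverse post-order is a topological order

# Example: gluing must precede painting and sanding; both precede assembly.
deps = {'glue': ['paint', 'sand'], 'paint': ['assemble'],
        'sand': ['assemble'], 'assemble': []}
print(topological_order(deps))   # e.g. ['glue', 'sand', 'paint', 'assemble']
```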
In this chapter we describe a specification of a simple microprocessor called the simplified DLX. The microprocessor is specified by an instruction set architecture (ISA). The ISA is a simple programming language, called machine language, that defines manipulations of data as well as control of the program.
The simplified DLX is a stored-program computer. This term means that both the data and the instructions are stored in the same memory. In 1945, John von Neumann proposed how to build such a computer. This proposal was influenced by the concept of a universal Turing machine. Modern computers are based on the same principles but include many techniques for speeding up the execution of programs. These techniques include cache memories, pipelining, running instructions in parallel and even out of order, predicting branches, and so on. These topics are discussed in books on computer architecture.
WHY USE ABSTRACTIONS?
The term architecture, according to the Collins Dictionary, means the art of planning, designing, and constructing buildings. Computer architecture refers to computers instead of buildings. Computers are rather complicated; even a very simple microprocessor is built from tens of thousands of gates and an operating system spans thousands of lines of instructions. To simplify things, people focus at a given time on certain aspects of computers and ignore other aspects. For example, the hardware designer ignores questions such as which programs will be executed by the computer.
The term digital circuit refers to a device that works in a binary world. In the binary world, the only values are zeros and ones. In other words, the inputs of a digital circuit are zeros and ones, and the outputs of a digital circuit are zeros and ones. Digital circuits are usually implemented by electronic devices and operate in the real world. In the real world, there are no zeros and ones; instead, what matters is the voltages of inputs and outputs. Since voltages refer to energy, they are continuous (unless quantum physics is used). So we have a gap between the continuous real world and the two-valued binary world. One should not regard this gap as absurd. Digital circuits are only an abstraction of electronic devices. In this chapter, we explain this abstraction, called the digital abstraction.
In the digital abstraction, one interprets voltage values as binary values. The advantages of the digital model cannot be overstated; this model enables one to focus on the digital behavior of a circuit, to ignore analog and transient phenomena, and to easily build larger, more complex circuits out of small circuits. The digital model together with a simple set of rules, called design rules, enable logic designers to design complex digital circuits consisting of millions of gates that operate as expected.
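As a toy illustration of this interpretation, the function below maps a voltage to a binary value using two thresholds with a forbidden region in between; the specific threshold values are assumptions chosen for the example, not values from the text.

```python
def digital_value(voltage, v_low=0.8, v_high=2.0):
    """Interpret an analog voltage under the digital abstraction.
    Voltages at or below v_low read as 0, at or above v_high as 1;
    anything in the forbidden region in between has no digital value."""
    if voltage <= v_low:
        return 0
    if voltage >= v_high:
        return 1
    return None   # forbidden zone: the abstraction does not apply

for v in (0.2, 1.3, 3.1):
    print(v, "->", digital_value(v))   # 0, None (undefined), 1
```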
TRANSISTORS
The basic building blocks of digital electronic circuits are transistors. The hierarchy starts with transistors, from which gates are built.
Ever since the beginning of computer systems, the demand for more performance has been the most important driving force for evolution in computer architecture. In particular, many important applications demand more performance than a single (serial) processor core can provide, and historically have pushed parallel architecture technology. A good example is numerical programs used in computer simulation to analyze and solve problems in science and engineering, such as climate modeling, weather forecasting, or computer-aided design. Another example is commercial systems in which a large pool of independent queries must be executed to meet the growing demands of the information age. Over the years, another driving force for parallel architectures has been the fear of impending technological barriers that would eventually stall the performance growth of serial computers. From the beginning, these two forces have fueled a keen interest in multiprocessor architecture research. While scientific computing needs made these research efforts relevant early on in the marketplace, multiprocessor technology hit the mainstream with the shift to multi-core computers at the beginning of the twenty-first century.
This chapter is devoted to design principles of multiprocessor systems. It focuses on two multiprocessor architectural styles: shared-memory and message-passing multiprocessor systems. Both styles use multiple processors with the goal of achieving a linear speedup of computational power with the number of processors. However, they differ in the method by which the processors exchange data. Processors in shared-memory multiprocessors share the same address space and can exchange data through shared-memory locations by regular load and store instructions.
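A rough software analogy of the two styles (not the hardware mechanisms themselves) is sketched below: in the first, threads communicate through a shared location guarded by a lock, the counterpart of loads and stores to a shared address; in the second, they exchange explicit messages through a queue.

```python
import threading, queue

# Shared-memory style: threads communicate through a shared location.
shared = {'total': 0}
lock = threading.Lock()

def add_shared(n):
    for _ in range(n):
        with lock:
            shared['total'] += 1

# Message-passing style: threads exchange explicit messages instead of
# touching a common address space.
mailbox = queue.Queue()

def producer(n):
    for _ in range(n):
        mailbox.put(1)
    mailbox.put(None)            # end-of-stream marker

def consumer(result):
    while (msg := mailbox.get()) is not None:
        result['total'] += msg

received = {'total': 0}
workers = [threading.Thread(target=add_shared, args=(1000,)) for _ in range(2)]
workers += [threading.Thread(target=producer, args=(2000,)),
            threading.Thread(target=consumer, args=(received,))]
for t in workers: t.start()
for t in workers: t.join()
print(shared['total'], received['total'])   # 2000 2000
```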
This chapter is dedicated to the correct and reliable communication of values in shared-memory multiprocessors. Relevant correctness properties of the memory system of shared-memory multiprocessors include coherence, the memory consistency model (henceforth also referred to as the memory model), and the reliable execution of synchronization primitives. Since chip multiprocessors are designed as shared-memory multi-core systems, this chapter targets correctness issues not only in symmetric multiprocessors (SMPs) or large-scale cache coherent distributed shared-memory systems (cc-NUMAs and COMAs) covered in Chapter 5, but also in chip multiprocessors with core multi-threading (CMPs) covered in Chapter 8.
The correctness of a shared-memory multi-threaded program must be independent of the relative execution speed of its threads, because of the numerous unpredictable events that can disrupt the execution of any one thread, such as DVFS (dynamic voltage and frequency scaling), thermal emergencies, conflicts for hardware and software resources, interrupts, exceptions, kernel activity, thread scheduling, data allocation delays, and interactions with other running programs. If a multi-threaded program is written for a dedicated machine in which timing is highly predictable and the program is written in a way that takes timing into account for its correctness (such as, possibly, in real-time systems), many conclusions of this chapter should be revised. In other words, the target software throughout this chapter is portable shared-memory multi-threaded programs written for general-purpose or multi-purpose machines and includes the operating system kernel.
In this chapter we sharpen our focus on thread-level parallelism within a single die. Parallelism within a die comes in different forms. Within a single core, multiple threads can be executed to improve resource utilization, an approach called core multi-threading. There are three approaches to core multi-threading, depending on how and when instructions are fetched from multiple ready threads: block multi-threading, interleaved multi-threading, and simultaneous multi-threading. We show the hardware additions and modifications necessary for each of these three multi-threading approaches to work within the contexts of traditional (single-threaded) in-order and out-of-order processors. We use example-driven approaches to show the performance advantages of finer-grain multi-threading over coarse-grain multi-threading. These performance advantages come at additional hardware cost.
The next paradigm to provide on-die parallelism is exploiting multiple cores on the same chip. Chip multiprocessors (CMPs) are fast becoming ubiquitous in all walks of computing, from cell phones to datacenter servers. We explain the fundamental advantages of CMPs over traditional shared-memory multiprocessors (SMPs), mostly borne from the fact that all cores are tightly integrated on a single die by on-die interconnects. We describe three on-die interconnect topologies common today for building CMPs. When all cores on a CMP are identical, the CMP is said to be homogeneous. The cores in heterogeneous CMPs differ in their capabilities. We describe various heterogeneous CMP designs and the gamut of performance and functionality they make possible.
Given the widening gaps between processor speed, main memory (DRAM) speed, and secondary memory (disk) speed, it has become more and more difficult in recent years to feed data and instructions at the speed required by the processor while providing the ever-expanding memory space expected by modern applications. Modern systems rely on a memory hierarchy based on speed, size, and cost, as illustrated in Figure 4.1. Left of the dotted line is the cache hierarchy. Right of the dotted line is the virtual memory hierarchy, which may include a disk cache (not shown).
It has been observed over the years that the speed gap between the processor (clocked at multiple gigahertz and executing multiple instructions per clock) and main memory (with access times in the tens or even hundreds of nanoseconds) is growing exponentially. This problem is commonly referred to as the memory wall. A hierarchy of multiple levels of caches with various sizes and access times is employed to bridge the speed gap. Moreover, caches at every level are becoming more and more complex to help reduce or hide the latency of cache misses. To support out-of-order (OoO), dynamically scheduled processors, which may have more than ten memory accesses pending at any time, modern lockup-free (non-blocking) caches are capable of handling multiple cache hits and misses at a time. Furthermore, data and instructions are prefetched in caches before they are needed. In this chapter we describe these enhancements to cache designs.
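One standard way to quantify how well a cache hierarchy bridges this speed gap is the average memory access time; the sketch below uses illustrative hit times, miss rates, and miss penalties that are assumptions for the example, not figures from the text.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one cache level, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: a 2-cycle L1 backed by a 12-cycle L2, which in
# turn misses to a 200-cycle main memory.
l2 = amat(hit_time=12, miss_rate=0.20, miss_penalty=200)   # 52 cycles
l1 = amat(hit_time=2,  miss_rate=0.05, miss_penalty=l2)    # 4.6 cycles
print(l1, l2)
```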
For the past 20 years we have lived through the information revolution, powered by the explosive growth of semiconductor integration and of the internet. The exponential performance improvement of semiconductor devices was predicted by Moore's law as early as the 1960s. There are several formulations of Moore's law. One of them is directed at the computing power of microprocessors. Moore's law predicts that the computing power of microprocessors will double every 18–24 months at constant cost so that their cost-effectiveness (the ratio between performance and cost) will grow at an exponential rate. It has been observed that the computing power of entire systems also grows at the same pace. This law has endured the test of time and still remains valid today. This law will be tested repeatedly, both now and in the future, as many people today see strong evidence that the “end of the ride” is near, mostly because the miniaturization of CMOS technology is fast reaching its limit, the so-called CMOS endpoint.
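As a quick numeric illustration of the two common doubling periods, the sketch below compounds them over a decade; the ten-year horizon is an arbitrary choice made for the example.

```python
def moore_growth(years, doubling_months):
    """Performance multiplier after `years`, if performance doubles
    every `doubling_months` months at constant cost."""
    return 2 ** (years * 12 / doubling_months)

# Over one decade, the 18- and 24-month formulations differ substantially.
print(round(moore_growth(10, 18)))   # ~102x
print(round(moore_growth(10, 24)))   # 32x
```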
Besides semiconductor technology, improved chip designs have also fueled the phenomenal performance growth of microprocessors over the years. Historically, with each new process generation, the logic switching speed and the amount of on-chip logic have both increased dramatically. Faster switching speeds lead to higher clock rates. Aggressive chip designs also contribute to higher clock rates by improving the design of circuits or by pipelining the steps in the execution of an instruction.
Computer architecture is a fast evolving field, mostly because it is driven by rapidly changing technologies. We have all been accustomed to phenomenal improvements in the speed and reliability of computing systems since the mid 1990s, mostly due to technology improvements, faster clock rates, and deeper pipelines. These improvements have had a deep impact on society by bringing high-performance computing to the masses, by enabling the internet revolution and by fostering huge productivity gains in all human activities. We are in the midst of an information revolution of the same caliber as the industrial revolution of the eighteenth century, and few would deny that this revolution has been fueled by advances in technology and microprocessor architecture.
Unfortunately, these rapid improvements in computing systems may not be sustainable in the future. Pipeline depths have reached their useful limit, and frequency cannot be cranked up forever because of power constraints. As technology evolves and on-chip feature sizes shrink, reliability, complexity, and power/energy issues have become prime considerations in computer design, besides traditional measures such as cost, area, and performance. These trends have ushered in a renaissance of parallel processing and parallel architectures, because they offer a clear path – some would say the only path – to solving all current and foreseeable problems in architecture. A widespread belief today is that, unless we can unleash and harness the power of parallel processing, the computing landscape will be very different very soon, and this dramatic change will have profound societal impacts.
Modern computer systems are becoming increasingly complex as more devices and functionalities are integrated. Throughout the entire design cycle of systems, simulation is a crucial tool for computer architecture researchers to evaluate novel ideas and explore the design space. Compared with hardware prototyping and analytic modeling, simulation strikes a better balance between accuracy, cost, flexibility, and complexity. As the design complexity of state-of-the-art microprocessors keeps growing and manufacturing costs skyrocket, computer architecture simulation has become critical.
Simulations are pervasive in computer architecture research and design and affect the productivity of these activities to a great extent. Productivity is impacted at two levels: (1) the time and effort spent on developing the simulator, and (2) the time spent running simulations with representative benchmarks. The dramatic growth of the integration density of microprocessor chips provides computer architects with abundant on-chip real estate to enhance computing power with more complex architectural designs. In addition, power and reliability have turned into critical design constraints. Building a simulation infrastructure that allows a designer to consider performance, power, and reliability in a single unified framework leads to significant cost and delays in simulator development. Another direct consequence of a complex infrastructure is that simulation itself slows down, increasing the turnaround time for each design space exploration. Simulation slowdown is becoming particularly acute as computer architecture moves into the chip multiprocessor (CMP) era. The current approach of simulating CMPs with growing numbers of cores in a single thread is not scalable and cannot be sustained over time.
The processor and its instruction set are the fundamental components of any architecture because they drive its functionality. In some sense the processor is the “brain” of a computer system, and therefore understanding how processors work is essential to understanding the workings of a multiprocessor.
This chapter first covers instruction sets, including exceptions. Exceptions, which can be seen as a software extension to the processor instruction set, are an integral component of the instruction set architecture definition and must be adhered to. They impose constraints on processor architecture. Without the need to support exceptions, processors and multiprocessors could be much more efficient but would forgo the flexibility and convenience provided by software extensions to the instruction set in various contexts. A basic instruction set is used throughout the book. This instruction set is broadly inspired by the MIPS instruction set, a rather simple instruction set. We adopt the MIPS instruction set because the fundamental concepts of processor organizations are easier to explain and grasp with simple instruction sets. However, we also explain extensions required for more complex instruction sets, such as the Intel x86, as the need arises.
Since this book is about parallel architectures, we do not cover architectures that execute instructions one at a time. Thus the starting point is the 5-stage pipeline, which concurrently processes up to five instructions in every clock cycle. The 5-stage pipeline is a static pipeline in the sense that the order of instruction execution (or the schedule of instruction execution) is dictated by the compiler, an order commonly referred to as the program, thread, or process order, and the hardware makes no attempt to re-order the execution of instructions dynamically.
Technology has always played the most important role in the evolution of computer architecture over time and will continue to do so for the foreseeable future. Technological evolution has fostered rapid innovations in chip architecture. We give three examples motivated by performance, power, and reliability. In the past, architectural designs were dictated by performance/cost tradeoffs. Several well-known architectural discoveries resulted from the uneven progress of different technological parameters. For instance, caches were invented during the era when processor speed grew much faster than main memory speed. Recently, power has become a primary design constraint. Since the invention of the microprocessor, the amount of chip real estate has soared relentlessly, enabling an exponential rise of clock frequencies and ever more complex hardware designs. However, as the supply voltage approached its lower limit and power consumption became a primary concern, chip architecture shifted from high-frequency uniprocessor designs to chip multiprocessor architectures in order to contain power growth. This shift from uniprocessor to multiprocessor microarchitectures is a disruptive event caused by the evolution of technology. Finally, for decades processor reliability was a concern primarily for high-end server systems. As transistor feature sizes have shrunk over time, transistors have become more susceptible to transient faults. Hence radiation-hardened architectures have been developed to protect computer systems from single-event upsets causing soft errors.
These examples of the impact of technology on computer design demonstrate that it is critical for a reader of this book to understand the basic technological parameters and features, and their scaling with each process generation.
Interconnection networks are an important component of every computer system. Central to the design of a high-performance parallel computer is the elimination of serializing bottlenecks that can cripple the exploitation of parallelism at any level. Instruction-level and thread-level parallelism across processor cores demands a memory system that can feed the processor with instructions and data at high speed through deep cache memory hierarchies. However, even with a modest miss rate of one percent and a 100-cycle miss penalty, half of the execution time can be spent bringing instructions and data from memory to processors. It is imperative to keep the latency to move instructions and data between main memory and the cache hierarchy short.
It is also important that memory bandwidth be sufficient. If the memory bandwidth is not sufficient, contention among memory requests elongates the memory-access latency, which, in turn, may affect instruction execution time and throughput. For example, consider a nonblocking cache that has N outstanding misses. If the bus connecting the cache to memory can only transfer one block every T cycles, it takes N × T cycles to service the N misses as opposed to T cycles if the bus can transfer N blocks in parallel.
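A back-of-the-envelope check of both claims is sketched below, under the assumptions (made for the example, not stated in the text) of one memory access per instruction and one cycle per instruction when every access hits in the cache.

```python
# Half the execution time in misses: with an assumed baseline of 1 cycle
# per instruction and 1 memory access per instruction, a 1% miss rate and
# a 100-cycle miss penalty add 1 stall cycle per instruction.
cpi_base, accesses_per_instr, miss_rate, miss_penalty = 1.0, 1.0, 0.01, 100
stall_cycles = accesses_per_instr * miss_rate * miss_penalty   # 1.0 cycle
print(stall_cycles / (cpi_base + stall_cycles))                # 0.5, i.e. half the time

# Serialized versus parallel transfer of N outstanding misses over a bus
# that delivers one block every T cycles.
N, T = 8, 20
print(N * T, "cycles serialized vs", T, "cycles if N blocks move in parallel")
```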
The role of interconnection networks is to transfer information between computer components in general, and between memory and processors in particular. This is important for all parallel computers, whether they are on a single processor chip – a chip multiprocessor or multi-core – or built from multiple processor chips connected to form a large-scale parallel computer.
Improve design efficiency and reduce costs with this practical guide to formal and simulation-based functional verification. Giving you a theoretical and practical understanding of the key issues involved, expert authors including Wayne Wolf and Dan Gajski explain both formal techniques (model checking, equivalence checking) and simulation-based techniques (coverage metrics, test generation). You get insights into practical issues including hardware verification languages (HVLs) and system-level debugging. The foundations of formal and simulation-based techniques are covered too, as are more recent research advances including transaction-level modeling and assertion-based verification, plus the theoretical underpinnings of verification, including the use of decision diagrams and Boolean satisfiability (SAT).
This new, expanded textbook describes all phases of a modern compiler: lexical analysis, parsing, abstract syntax, semantic actions, intermediate representations, instruction selection via tree matching, dataflow analysis, graph-coloring register allocation, and runtime systems. It includes good coverage of current techniques in code generation and register allocation, as well as functional and object-oriented languages, which are missing from most books. In addition, more advanced chapters are now included so that it can be used as the basis for a two-semester or graduate course. The most accepted and successful techniques are described in a concise way, rather than as an exhaustive catalog of every possible variant. Detailed descriptions of the interfaces between modules of a compiler are illustrated with actual C header files. The first part of the book, Fundamentals of Compilation, is suitable for a one-semester first course in compiler design. The second part, Advanced Topics, which includes the advanced chapters, covers the compilation of object-oriented and functional languages, garbage collection, loop optimizations, SSA form, loop scheduling, and optimization for cache-memory hierarchies.
Device testing represents the single largest manufacturing expense in the semiconductor industry, costing over $40 billion a year. The most comprehensive and wide-ranging book of its kind, Testing of Digital Systems covers everything you need to know about this vitally important subject. Starting right from the basics, the authors take the reader through automatic test pattern generation, design for testability, and built-in self-test of digital circuits before moving on to more advanced topics such as IDDQ testing, functional testing, delay fault testing, memory testing, and fault diagnosis. The book includes detailed treatment of the latest techniques, including test generation for various fault models, discussion of testing techniques at different levels of the integrated circuit hierarchy, and a chapter on system-on-a-chip test synthesis. Written for students and engineers, it is both an excellent senior/graduate level textbook and a valuable reference.