Software datapaths and pipelines - page 3

Now let's do some magic. Here is the same drawing for reference:

//  A --->+------+    +----+  10->+----------+  17->+------+
//        |ADDER1|--->|SQRT|----->|MULTIPLIER|----->|ADDER2| ---> C
//  B --->+------+    +----+      +----------+      +------+

In a traditional hardware datapath, there would generally be master-slave flip-flop registers between each device or pipeline stage. A clock pulse's first edge locks the previous device's output into its following register, and after a short time to allow the data to settle in, a second clock edge lets the data in each register move on to the next device's input. Then, after a delay long enough for all devices to complete their computations, the cycle repeats: the first edge locks all the data (now completed), a second edge shortly thereafter moves it on to the next stage, and so on.

We are going to define four registers, T1 thru T4, as temp registers. These will require no backend device at all, but serve merely as one-word read/write storage. We shall use these as our pipeline registers. Now we change the main loop to the following:

    // phase I: lock in the output of all 4 devices
    adder1 => T1
    squareRooter => T2
    multiplier => T3
    adder2 => T4
    // phase II: move the locked data forward to the next device in the pipe
    A => adder1
    B => adder1
    T1 => squareRooter
    T2 => multiplier ; (10) => multiplier
    T3 => adder2 ; (17) => adder2
    T4 => C
    counter => repeat

This is a true pipeline. At the end of phase II, ALL FOUR DEVICES ARE RUNNING SIMULTANEOUSLY.

The first four times through this loop, adder1, squareRooter, etc., have random "garbage" in them. We'd size the A, B and C arrays 1005 long, with the first 5 items in C unused while the pipeline is being initialized, and the last 5 items in A and B unused in order to empty the pipeline at the end. (This is standard procedure on any pipeline, WIZ or not.)

Each time through, the first instruction, "adder1 => T1", will have to wait.
(We only just gave it its two operands during the previous loop's second phase. It will likely not have had time to complete the addition.) But while this instruction is waiting, ALL three other devices are also running, ALL having been started in the previous loop's second phase. Assuming they all take approximately equal time, then once "adder1 => T1" is done, the other three devices will also be done, and the following three instructions will proceed with no extra wait.

And even if the delays are unequal, however long we wait on any one device reduces the wait on the rest by that exact amount. So no matter what the various delays are, and no matter the order in which we access the devices, the total delay of the loop will equal the time of the single device with the longest delay. As long as we do all the writes to the temp registers first and all the reads from them second (i.e., in two phases), the total delay of the loop will be equal to the delay of the slowest device.

And so we will be pumping data from A and B through to C at a rate of one datum per loop, each loop taking only the time of the slowest of the four devices. THIS IS A TRUE PIPELINE!

Importantly, note that a hardware pipeline must be clocked for the worst possible case of all four devices. But on a WIZ, given that each device generates a proper "ready" signal, this pipeline's speed is data dependent. A series of zeroes might pump through very rapidly.
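
The two-phase loop and the timing argument above can be checked with a small software model. The following is only a sketch in Python, not WIZ code: the function names run_pipeline and loop_time and the delay figures are made up for illustration, None stands in for the start-up "garbage", and the time taken by the move instructions themselves is assumed to be zero.

```python
import math

def run_pipeline(a, b):
    """Model the two-phase loop; returns the C array (None = start-up garbage)."""
    c = [None] * len(a)
    # Current output of each device; nothing valid yet.
    adder1 = square_rooter = multiplier = adder2 = None
    for i in range(len(a)):
        # phase I: lock in the output of all 4 devices
        t1, t2, t3, t4 = adder1, square_rooter, multiplier, adder2
        # phase II: move the locked data forward to the next device
        adder1 = a[i] + b[i]
        square_rooter = None if t1 is None else math.sqrt(t1)
        multiplier = None if t2 is None else t2 * 10
        adder2 = None if t3 is None else t3 + 17
        c[i] = t4
    return c

def loop_time(delays, read_order):
    """Duration of one loop, assuming all devices were started together at
    time 0 and each phase I read blocks until its device is ready."""
    now = 0.0
    for device in read_order:
        # Time already spent waiting on earlier devices counts toward this one.
        now = max(now, delays[device])
    return now

# Data flow: results appear in C four loops after their operands went in.
a = [16.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]
b = [9.0, 0.0, 7.0, 0.0, 0.0, 0.0, 0.0]
c = run_pipeline(a, b)
print(c[:4])   # [None, None, None, None]
print(c[4:7])  # [67.0, 17.0, 57.0]

# Timing: hypothetical, unequal delays; the read order does not matter.
delays = {"adder1": 3.0, "squareRooter": 9.0, "multiplier": 5.0, "adder2": 3.0}
print(loop_time(delays, ["adder1", "squareRooter", "multiplier", "adder2"]))  # 9.0
print(loop_time(delays, ["squareRooter", "adder2", "adder1", "multiplier"]))  # 9.0
```

In this model the first four entries of C are garbage and the last four entries of A and B never reach C, which is why the arrays need some padding at each end. The loop time comes out as the largest single delay whichever device is read first.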