## Memory II, FSMs, Pipelining 6.205 Fall 2025 #### Administrative - Week 4 due last night - Week 5 out after class - More video and now camera too - Final project details by tomorrow ## Memory Example! In the first part of week 05 you're going to be displaying popcat on your FPGA over video. popcat - We need to store popcat - Popcat is a 256X256 24 bit color image. - How to encode with memories? #### Popcat - We could build a *24bit-wide* memory that has 256x256 (65,536) entries (deep) in it (one for each pixel) - Math: 256x256x24 = 1,572,864 bits - That's greater than >50% of our memory on the FPGA....all for one popcat #### Strategy - Images are large and take up lots of memory - Want to save space, and store image using less memory - Many images don't express every one of the 2<sup>24</sup> "true" colors. - Why waste the space storing an unused possibility? - So pick the N most popular and only display them: - You can encode each pixel using $ceil(log_2(N))$ bits (save space) - Then use a color table to look up what full color (24 bit value) that corresponds to! #### Color Lookup Table - So use two memories: - 8-wide memory with 65536 entries for each pixel encoding one of 256 colors (using 8 bits) - 24-wide memory with 256 entries entries encoding those colors Colors are still 24 bit, but the pixels are encoded using only 8 bits red, green, blue – 24 bit ### Storage Savings 24 bit - 16M colors 256 X 256 image @ 24 bits per pixel is: 256 X 256 X 24 bits = 1.572 Mbits (196.6 kBytes) 8 bit - 256 colors 256 X 256 image with 8 bit color map: 256 X 256 X 8 bits + 256 X 24 = 0.530432 Mbits (66.304 kBytes) # If need more savings...keep going as needed 24 bit - 16M colors 4 bit – 16 colors 8 bit - 256 colors 2 bit – 4 colors 6 bit - 16 colors 1 bit - 2 colors # Color Lookup Table Open up image.mem #### Open up palette.mem ### Additional Tricks Can be Played • Dithering, in particular can help with this problem of using limited colors more wisely, but we'll go into that in a future week. (form of "noise shaping") ## What about Memory Latency? - Yes What about latency. - These things we just described are memories! - Memory of any scale usually has latency involved with it - Xilinx BRAM has how many cycles latency on a read? - 1 technically, though... - 2 cycles recommended, dare I say mandated in 6.205 - Lab/Week 5 will investigate ## FSM Modularity #### Toward FSM Modularity Consider the following abstract FSM: - Suppose that each set of states $a_x...d_x$ is a "sub-FSM" that produces exactly the same outputs. - Can we simplify the FSM by removing equivalent states? No! The outputs may be the same, but the next-state transitions are not. This situation closely resembles a procedure call or function call in software...how can we apply this concept to FSMs? Acknowledgements: Rex Min #### The Major/Minor FSM Abstraction - Subtasks are encapsulated in minor FSMs with common reset and clock - Simple communication abstraction: - START: tells the minor FSM to begin operation (the call) - BUSY: tells the major FSM whether the minor is done (the return) - The major/minor abstraction is great for... - Modular designs (always a good thing) - Tasks that occur often but in different contexts - Tasks that require a variable/unknown period of time - Event-driven systems #### Inside the *Major* FSM #### Variations: - Usually don't need both Step 1 and Step 3 - One cycle "done" signal instead of multi-cycle "busy" #### Optimizing the *Minor* FSM Good idea: de-assert BUSY one cycle early with caveats!!! #### Bad idea #1: T<sub>4</sub> may not immediately return to T<sub>1</sub> #### Bad idea #2: **BUSY** never asserts! So make sure you if you do this, that last state always happens and always happens for one cycle #### A Four-FSM Example #### Four-FSM Sample Waveform ### Alternative to Busy Signals - As an alternative to busy signals sometimes just having a single-cycle "valid" signals is sufficient. - You have an implied "business" until valid shows up - If the downstream systems involved are stateful enough to be able to keep track of various system's this can work - Or you can do both. Depends on your design Alternative to Busy Signals (Single-cycle asserts) #### A Divider - The Divider from when we first talked about FSMs is an example of a system which might be a minor FSM in part of a larger major's algorithm - Many things need division, but it would suck to have to rewrite it repeatedly. - We want you to get practice with that in Week 5's lab ``` module divider #(parameter WIDTH = 32) input wire clk, input wire rst, input wire[WIDTH-1:0] dividend, input wire[WIDTH-1:0] divisor, input wire data in valid. output logic[WIDTH-1:0] quotient, output logic[WIDTH-1:0] remainder, output logic data_out_valid, output logic error, output logic busy logic [WIDTH-1:0] quotient_g; logic [WIDTH-1:0] dividend_h; logic [WIDTH-1:0] divisor_h; enum {RESTING, DIVIDING} state: always_ff @(posedge clk)begin quotient_q <= 0; dividend h <= 0; divisor h <= 0; remainder <= 0; busy <= 1'b0; error <= 1'h0: state <= RESTING: data_out_valid <= 1'b0; case (state) RESTING: begin if (data_in_valid)begin state <= DIVIDING;</pre> quotient_g <= 0; dividend h <= dividend: divisor_h <= divisor; busy <= 1'b1; error <= 1'b0; data_out_valid <= 1'b0; DIVIDING: begin if (dividend_h<=0)begin state <= RESTING; //similar to return statement remainder <= dividend h: quotient <= quotient_g; busy <= 1'b0; //tell outside world i'm done data out valid <= 1'b1; //good stuff! end else if (divisor_h==0)begin state <= RESTING: remainder <= 0; quotient <= 0: busy <= 1'b0: //tell outside world i'm done error <= 1'b1: //ERROR data_out_valid <= 1'b1; //valid ERROR end else if (dividend_h < divisor_h) begin state <= RESTING: remainder <= dividend h: quotient <= quotient q; busy <= 1'b0; error <= 1'b0: data_out_valid <= 1'b1; //good stuff! end else begin //state staying in. state <= DIVIDING;</pre> quotient_g <= quotient_g + 1'b1; dividend_h <= dividend_h-divisor_h; 75 endmodule ``` # Center of Mass Calculation in Lab05 - You will write a center-of-mass calculator that is best thought of as an FSM. - For each frame of video: - Sum the x location and y location of every active pixel you come across - Keep track of how many pixels you've encountered - At end of frame (or beginning of next one, divide the two sums by the number of active pixels - This will give an average X,Y - Division takes time! - Need to create a major/minor FSM system #### Lab 05 Center of Mass - C.O.M. will be in a "data collection state" during the active portion of a video frame - When the frame's active part is done, it needs to calculate the average x,y position of the "hot" pixels it has observed. - To divide, the C.O.M. module hands the values it needs divided off to two separate dividers. - C.O.M. waits on them monitoring their BUSY signals - They can do division separately (in parallel) - When done, they report back to the C.O.M with their result - C.O.M. reports to outside world its calculation ## Pipelining How to make sure signals are balanced going through a sequence of operations. #### Performance Metrics - Latency (L): - time between arrival of new input and generation of corresponding output. - For purely combinational circuits this is just t<sub>PD</sub>. - Throughput (T): - Rate at which new outputs appear. - For purely combinational circuits this is just 1/t<sub>PD</sub> or 1/L. #### Performance of Combinational F & G are "idle", just holding their outputs stable while H performs its computation ## Retiming: A useful transform Propagation delays indicated by numbers: ## Retiming: A useful transform - Add in Flops - Run clock as fast as possible given the presence of the flops Assuming ideal registers: i.e., $t_{PD} = 0$ , $t_{SFTLIP} = 0$ ### Pipeline Diagrams The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle. ## Pipeline Conventions • a K-Stage Pipeline ("K-pipeline") is an acyclic circuit having exactly K registers on every path from an input to an output. • a COMBINATIONAL CIRCUIT is thus a 0-stage pipeline. ### Pipeline Conventions #### CONVENTION: Every pipeline stage, hence every K-Stage pipeline, has a register on its OUTPUT (not on its input). #### • ALWAYS: The CLOCK common to all registers must have a period sufficient to cover propagation over combinational paths PLUS (input) register t<sub>PD</sub> PLUS (output) register t<sub>SFTUP</sub>. $$t_{PD,reg1} + t_{PD,logic} + t_{SETUP,reg2} \le t_{CLK}$$ The LATENCY of a K-pipeline is K times the period of the clock common to all registers. The THROUGHPUT of a K-pipeline is the frequency of the clock. ## ILL-formed Pipeline For what value of K is the following circuit a K-Pipeline? \_\_\_\_none #### Problem: Successive inputs get mixed: e.g., $B(A(X_{i+1}), Y_i)$ . This happened because some paths from inputs to outputs have 2 registers, and some have only 1! This CAN'T HAPPEN on a well-formed K pipeline! ## Pipelining #### Let's say we want $t_{clk}$ to be 8ns ## Pipelining Let's say we want $t_{clk}$ to be 8ns - Step 1: Add a register on the output. - Step 2: From register. Draw a contour backwards that includes as much of the circuit that will fit inside required period. Add registers. Make sure path for every signal is balanced - Repeat until satisfied with T. Look for redundant registers #### STRATEGY: Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS). Assuming this interfaces with other modules that have registered outputs the input will chain will be ok (<= 8ns) #### Another Pipeline Example | | LATENCY | THROUGHPUT | |---------|---------|------------| | 0-pipe: | 4ns | 1/4ns | | 1-pipe: | 4ns | 1/4ns | | 2-pipe: | 4ns | 1/2ns | | 3-pipe: | 6ns | 1/2ns | #### **OBSERVATIONS:** - 1-pipeline improves neither L or T. - T improved by breaking long combinational paths, allowing faster clock. - Too many stages cost L, don't improve T. - Back-to-back registers are often required to keep pipeline well-formed. Better throughput here means we can run at higher clock rate #### Pipelining in Verilog ``` logic x,y,z; //G and F are purely combinational modules. G myg (.d_in(x), .d_out(y)); F myf (.d_in(y), .d_out(z)); ``` ``` logic x,y1,y2,z1,z2; //G and F are purely combinational modules. G myg (.d_in(x), .d_out(y1)); F myf (.d_in(y2), .d_out(z1)); always_ff @(posedge clk)begin y2 <= y1; z2 <= z1; end </pre> ``` ## How often should you be adding FlipFlops in your FPGA? - This comes with experience and getting to know your system. - AND by using the feedback that the tool gives you after place and route. - Most of what you want to do really is some form of math. - So knowing how much math you can do in a clock cycle is useful # The Complexity of Math Operations Let's look at some basic math circuits: ### "Full Adder" building block The "half adder" circuit has only the A and B inputs (no carry) Full adders handle carry bits | A | В | CI | S | СО | |---|---|----|---|----| | 0 | 0 | 0 | 0 | 0 | | 0 | 0 | 1 | 1 | 0 | | 0 | 1 | 0 | 1 | 0 | | 0 | 1 | 1 | 0 | 1 | | 1 | 0 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 | 1 | | 1 | 1 | 0 | 0 | 1 | | 1 | 1 | 1 | 1 | 1 | #### Adder: a circuit that does addition Here's an example of binary addition as one might do it by "hand": Adding two N-bit numbers produces an (N+1)-bit result $\begin{array}{c|c} & & & \text{Carries from previous column} \\ & & 1101 \\ & & & 1101 \\ & & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & \\ & & & \\ & & & \\ & & & \\ & &$ If we build a circuit that implements one column: we can quickly build a circuit to add two 4-bit numbers... #### Subtraction: A-B = A + (-B) So let's build an arithmetic unit that does both addition and subtraction. Operation selected by control input: ### "Full Adder" building block The "half adder" circuit has only the A and B inputs What's the critical path through this circuit? | A | В | CI | ធ | CO | |---|---|----|---|----| | 0 | 0 | 0 | 0 | 0 | | 0 | 0 | 1 | 1 | 0 | | 0 | 1 | 0 | 1 | 0 | | 0 | 1 | 1 | 0 | 1 | | 1 | 0 | 0 | 1 | 0 | | 1 | 0 | 1 | 0 | 1 | | 1 | 1 | 0 | 0 | 1 | | 1 | 1 | 1 | 1 | 1 | t<sub>pd</sub> dictated by carry path! Can also rewrite the carry path as: $c_{out} = (a \& c_{in}) | (b \& c_{in}) | (a \& b)$ #### Speed: t<sub>PD</sub> of Ripple-carry Adder $$C_O = AB + AC_I + BC_I$$ Worst-case path: carry propagation from LSB to MSB, e.g., when adding 11...111 to 00...001. $$t_{PD} = (N-1)*(t_{PD,OR} + t_{PD,AND}) + t_{PD,XOR} \approx \Theta(N)$$ CI to CO $$CI_{N-1} \text{ to } S_{N-1}$$ $$t_{adder} = (N-1)t_{carry} + t_{sum}$$ Θ(N) is read "order N": means that the latency of our adder grows at worst in proportion to the number of bits in the operands. #### The Carry Path Becomes Limiting Solution is the Carry-Look-ahead Adder: $$c_1 = G_0 + P_0.c_0$$ $$c_2 = G_1 + P_1.G_0 + P_1.P_0.c_0$$ $$c_3 = G_2 + P_2.G_1 + P_2.P_1.G_0 + P_2.P_1.P_0.c_0$$ $$c_4 = G_3 + P_3.G_2 + P_3.P_2.G_1 + P_3.P_2.P_1.G_0 + P_3.P_2.P_1.P_0.c_0$$ Can do some factoring/redesign and cut-down on tpd of the carry path $$G_i = a_i.b_i$$ $$P_i = a_i \oplus b_i$$ https://www.ece.uvic.ca/~fayez/courses/ceng465/lab\_465/project1/adders.pdf #### Logic Slices in FPGA Can Add/Subtract Can synthesize the addition of two 4 bit numbers with fast carry > $A_3A_2A_1A_0$ $+B_{3}B_{2}B_{1}B_{0}$ **Fast Carry-Chain** Series 7 Logic Slice #### Add/Subtract on the FPGA - + and can be done combinationally very quickly: - 32 bit add can be done in a clock cycle (<10 ns) pretty easily - Several smaller adds (A+B+C+D) can be done in clock cycle as well (10 ns) - CLBs (the generic function generators, of which we have a lot) are capable of being chained together to allow large adds. # Slices can stack to give more bits $A_{11}A_{10}A_{9}A_{8}A_{7}A_{6}A_{5}A_{4}A_{3}A_{2}A_{1}A_{0} \\ +B_{11}B_{10}B_{9}B_{8}B_{7}B_{6}B_{5}B_{4}B_{3}B_{2}B_{1}B_{0}$ • Can Actually see everything that get's synthesized... • Zoom in... • Zooooom in... Zoooooom in. Individual wire highlighted in white Connecting one fast carry chain to the next up in the column • Zoooooommm...in..... • Zoooooooooooommm...in..... Individual LUT 64 bit program • For that 6input LUT #### + or - in Verilog Generally + or – on its own will get synthesized using logic slices unless specified Very large additions or subtractions may start to take too long!\* But doing a couple 32 bit adds in a 10 ns cycle should be possible... <sup>\*&</sup>quot;too" is really with respect to a clock. If you're running on a 10 MHz clock, then things are different! ### But also the stuff around it matters too! - Keep track of the stuff before and after your math. - If you have a ton of if/else/ifs...or if you have a super-deeply nested if/if/if/ chain, all that stuff requires logic too. #### Also Case Statements are Good If/elses and even parallel if's as shown on the previous page get encoded as priority logic https://www.kevnugent.com/2020/10/22/verilog-blogpost\_002/ #### Also Case Statements are Good • If logic can be structured without priority, then do it! Can yield simpler underlying logic. #### The stuff around it matters too! - Keep track of the stuff before and after your math. - If you have a ton of if/else/ifs...or if you have a super-deeply nested if/if/if/ chain, all that stuff requires logic too. - Also think about the stuff being used to calculate the if/else stuff. #### I potentially violate timing! #### Example... ``` logic [31:0] a,b,y,z,q,s,t,r; always_ff @(posedge clk)begin if (b >q;)begin end end always_comb begin q = s + t; end always_comb begin t = r > 98?r + 100:a + 11; end ``` Path that needs to be calculated #### Multiplication on the FPGA - Multiplication can be done on the FPGA on 2's complement numbers - Takes more time: - Depending on size of operands may/may not be doable in one clock cycle - Where possible try to get away with bit shifts and adds. #### Multiplications with shifts - <<1 is multiply by 2</p> - >>1 is divide by 2 - Can do a lot with this if get creative ``` logic [7:0] x; logic [7:0] y; //want this to be seven times X assign y = (x<<2) + (x<<1) + x;</pre> ``` Vivado can be pretty good at figuring these things out for you, but largely only for constants. #### Generic Digital Multiplication In base 2 multiplication these are all very simple calculations done with XOR Some really cool factoring can be done to make the overall propagation delay of a multiplier relatively short, though there's a lot of logic in it\* <sup>\*</sup>Lecture on Multiplier architectures: https://inst.eecs.berkeley.edu/~eecs151/sp18/files/Lecture21.pdf #### **DSP Blocks** - Add-then-multiply is a common operation chain in many things, particularly Digital Signal Processing - FPGA has dedicated hardware multiplier modules called DSP48 blocks on it - 150 of them on our FPGA - Capable of single-cycle multiplies - Can get inferred from using \* in your Verilog that isn't a power of 2: - x\*y, for example, will likely will result in DSP getting used - May take a full clock cycle so would need to budget timing accordingly - Can infer multiple for larger bit multiplies #### DSP48 Slice (High Level) Figure 1-1: Basic DSP48E1 Slice Functionality Much of the benefit/speed of this module comes from the hardwired internal routing, keeping it very fast. This device is not as generalized as a LUT/logic cell. It can only do a subset of math operations. Located equally-spaced over the device like BRAMs https://www.xilinx.com/support/documentation/user\_guides/ug479\_7Series\_DSP48E1.pdf #### How much multiply can one do? - At 100 MHz on these boards, I'd aim for maybe maybe one 32 bit multiply per clock cycle (it'll use several DSP blocks to achieve that) - Anything more is pushing it - If you run out of DSP blocks, it'll revert to using the generic logic...and this will become a harder problem to satisfy - Smaller multiply-adds you can maybe get away with in one clock cycle. #### Division - The outlier in the + \* / set... - Division is a significantly harder math operation to do compared to multiplication - Where possible try to avoid - Try to divide by powers of 2 (use right shift)! - If you can't avoid we must do it. (week 05) #### Lab 05 Center of Mass - C.O.M. will be in a "data collection state" during the active portion of a video frame - When the frame's active part is done, it needs to calculate the average x,y position of the "hot" pixels it has observed. - To divide, the C.O.M. module hands the values it needs divided off to two separate dividers. - C.O.M. waits on them monitoring their BUSY signals - They can do division separately (in parallel) - When done, they report back to the C.O.M with their result - C.O.M. reports to outside world its calculation #### One "Bad" Attempt at Division - In previous lecture looked at \*what\* this actually builds - We can ask Vivado to synthesize division logic for us, and it actually will do it. - This code constrains the act of division to having to exist between two clock edges.: ``` module top level( input wire clk 100mhz, //clock @ 100 mhz input wire [15:0] sw, //switches input wire btnc, //btnc (used for reset) input wire btnu, //btnc (used for reset) input wire btnl, //btnc (used for reset) output logic [15:0] led //just here for the funs logic old_btnl; logic old btnu; logic old btnc; logic [15:0] guotient; logic [15:0] dividend; logic [15:0] divisor; assign led = quotient; always_ff @(posedge clk_100mhz)begin old btnl <= btnl; old btnu <= btnu; old btnc <= btnc; end always_ff @(posedge clk_100mhz)begin if (btnu & ~old btnu)begin quotient<= dividend/divisor; //divide</pre> end if (btnc & ~old btnc)begin dividend <= sw; //divide //load dividend</pre> if (btnl & ~old_btnl)begin divisor <= sw; //divide //load divisor</pre> end end endmodule ``` ### Circuit Built: ### Build the Bad Divider #### Violates timing! ``` Phase 22 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=-21.399| TNS=-129.552| WHS=0.090 | THS=0.000 | ``` | Site Type | Used | Fixed | Prohibited | +<br> Available | ++<br> Util% | |-----------------------------------------|------------|---------|------------|------------------|---------------| | Slice | +<br> 100 | l 0 | <br> 0 | +<br>l 15850 | +<br> 0.63 | | SLICEL | l 89 | 0 | | l | | | I SLICEM | l 11 | 0 | | | l l | | LUT as Logic | 1 274 | 0 | 0 | l 63400 | l 0.43 l | | l using 05 output only | 0 | | | | l l | | I using 06 output only | 1 274 | | | l | l l | | l using 05 and 06 | 0 | | | l | l l | | I LUT as Memory | 0 | 0 | 0 | 19000 | l 0.00 l | | l LUT as Distributed RAM | 0 | l 0 | | l | l I | | l LUT as Shift Register | 0 | 0 | | l | l l | | Slice Registers | l 55 | l 0 | 0 | 126800 | l 0.04 l | | I Register driven from within the Slice | l 16 | l | | l | l l | | Register driven from outside the Slice | l 39 | l | | l | l l | | LUT in front of the register is unused | l 26 | | | l | l l | | LUT in front of the register is used | l 13 | | | l | l l | | Unique Control Sets | l 4 | | 0 | l 15850 | l 0.03 l | | + | + | <b></b> | <b></b> | + | ++ | ### Now Do It Again With 32 bits: ``` if (pmod_pin & ~old_pmod_pin) begin quotient <= dividend/divisor; end</pre> ``` \*See lecture code for full implementation and build. (divider0) ### Build the Stupider Divider Phase 20 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=-72.004| TNS=-1004.354| WHS=0.227 | THS=0.000 | Phase 20 Post Router Timing | Checksum: 1d10fc4d8 #### 2. Slice Logic Distribution ----- | <b>4</b> | | | L | L | | |------------------------------------------|-----------|---------|------------|-----------|------------| | Site Type | l Used | Fixed | Prohibited | Available | Util% | | Slice | <br> 301 | <br> 0 | l 0 | l 8150 | <br>l 3.69 | | SLICEL | 1 225 | 0 | [ | [ | | | l SLICEM | l 76 | l 0 1 | | | l | | LUT as Logic | l 944 | 0 | l 0 | l 32600 | 1 2.90 | | l using 05 output only | l 0 | | | | l | | l using 06 output only | l 922 | | | | l | | l using 05 and 06 | l 22 | | | | l | | LUT as Memory | l 0 | l 0 | l 0 | l 9600 | 0.00 | | l LUT as Distributed RAM | l 0 | l 0 1 | | | l | | l LUT as Shift Register | l 0 | l 0 1 | | | l | | Slice Registers | l 131 | l 0 | l 0 | l 65200 | 0.20 | | I Register driven from within the Slice | l 67 | | | | l | | I Register driven from outside the Slice | l 64 | | | | l | | I LUT in front of the register is unused | l 28 | | | | | | LUT in front of the register is used | l 36 | | | | | | Unique Control Sets | 1 7 | | l 0 | l 8150 | 0.09 | | + | + | | + | + | + | ### A Better Divider? <sup>\*</sup>See lecture 5 code for full implementation and build. (divider0) that is an FSM. ### So conclusions - +, -, \* can be done in a clock cycle with exceptions - Watch out for flow-control logic...that can start to stack up - / will never happen in one clock cycle. Accept that. - Similar other things like square root, cosine, etc...those need clock cycles...or if you absolutely need those in one/two cycles, you do a lookup table of pre-computed values (takes huge amounts of memory) ## Now...Back on pipelining...Preview... • If we have time... A white blip • In lab 05, early on you may see an artifact on popcat # (Two registers coming from delay in memory access/read) - Monitor drawing based on v\_sync, h\_sync, blank, - But what image rom is giving it is 5 clock cycles behind - At start of PopCat nothing in the "pipeline" yet ### How to Fix? Delay the other signals so everybody is the same Turn the whole thing into a 5-stage pipeline! ## Pipelining - Pipeline in Verilog! - Make sure other things are protected too! ``` logic hs_pip[4:0]; logic vs_pip[4:0]; logic b_pip[4:0]; always_ff@(posedge clk_in)begin hs_pip[0] <= hsync_in;</pre> vs_pip[0] <= vsync_in;</pre> b pip[0] <= blank in;</pre> for (int i=1; i<5; i = i+1)begin hs_pip[i] <= hs_pip[i-1]; vs_pip[i] <= vs_pip[i-1];</pre> b pip[i] <= b pip[i-1];</pre> end end assign hsync_out = hs_pip[4]; assign vsync_out = vs_pip[4]; assign blank out = b pip[4]; ``` ## Final Project Ideas Things with video and/or related topics are very "relevant" to FPGAs You have to move and process very large amounts of data with demanding timing. • This is something software often cannot on its own. # Live Pong # Glow Trails # DigiEyes ### PacMan Extreme # Final Project Info released by tomorrow - Start Teaming! - Teams of 2 or 3!