



### Pipelining



6.205

All happy popcats are alike; each unhappy popcat is unhappy in its own way

-Leo Tolstoy Anna Karenina (1877)

#### Admin

- Week 05: due tomorrow
- Week 06 out on Thursday
- Final project teaming preferences due tonight! See piazza. No extensions

### Performance Metrics



- Latency (L):
  - time between arrival of new input and generation of corresponding output.
  - For purely combinational circuits this is just t<sub>PD</sub>.
  - For sequential circuits, it is the number of flops you travel through times the clock period
- Throughput (T):
  - Rate at which new outputs appear.
  - For purely combinational circuits this is just  $1/t_{PD}$  or 1/L.
  - For fully-pipelined circuits it is 1/1

#### Division

- The outlier in the + \* / set...
- Division is a significantly harder math operation to do compared to multiplication
- Where possible try to avoid
- Try to divide by powers of 2 (use right shift)!
- If you can't avoid we must do it.

#### One "Bad" Attempt at Division

- In previous lectures looked at \*what\* this actually builds
- We can ask Vivado to synthesize division logic for us, and it actually will do it.
- This code constrains the act of division to having to exist between two clock edges.:

```
module top level(
      input wire clk 100mhz, //clock @ 100 mhz
      input wire [15:0] sw, //switches
      input wire btnc, //btnc (used for reset)
      input wire btnu, //btnc (used for reset)
      input wire btnl, //btnc (used for reset)
      output logic [15:0] led //just here for the funs
      );
  logic old_btnl;
  logic old btnu;
  logic old btnc;
  logic [15:0] guotient;
  logic [15:0] dividend;
  logic [15:0] divisor;
  assign led = guotient;
  always_ff @(posedge clk_100mhz)begin
    old btnl <= btnl;</pre>
    old btnu <= btnu;</pre>
    old btnc <= btnc;
  end
  always_ff @(posedge clk_100mhz)begin
    if (btnu & ~old btnu)begin
      quotient<= dividend/divisor; //divide</pre>
    end
    if (btnc & ~old btnc)begin
      dividend <= sw; //divide //load dividend</pre>
    end
    if (btnl & ~old_btnl)begin
      divisor <= sw; //divide //load divisor
```

end end endmodule

#### Circuit Built:



#### Build the Stupid Divider

Violates timing!

| Phase 22 Post Router Timing                     |                |             | 1             |           | _ |
|-------------------------------------------------|----------------|-------------|---------------|-----------|---|
| INFO: [Route 35-20] Post Routing<br>  THS=0.000 | Timing Summary | WNS=-21.399 | TNS=-129.5521 | WHS=0.090 | I |
|                                                 |                |             |               |           |   |

| Site Type                              | l Used     | Fixed | Prohibited | Available | Util% |
|----------------------------------------|------------|-------|------------|-----------|-------|
| Slice                                  | +<br>  100 | 0     | 0          | 15850     | 0.63  |
| I SLICEL                               | 89         | 0     |            |           | I     |
| I SLICEM                               | 11         | 0     |            |           |       |
| LUT as Logic                           | 274        | 0     | 0          | l 63400   | 0.43  |
| l using 05 output only                 | 0          |       |            |           |       |
| l using O6 output only                 | 274        |       |            |           |       |
| l using 05 and 06                      | 0          |       |            |           |       |
| l LUT as Memory                        | 0          | 0     | 0          | l 19000   | 0.00  |
| I LUT as Distributed RAM               | 0          | 0     |            |           | I     |
| l LUT as Shift Register                | 0          | 0     |            |           | I     |
| Slice Registers                        | l 55       | 0     | 0          | l 126800  | 0.04  |
| Register driven from within the Slice  | l 16       |       |            |           |       |
| Register driven from outside the Slice | 39         |       |            |           | I     |
| LUT in front of the register is unused |            |       |            |           |       |
| LUT in front of the register is used   | 13         |       |            |           |       |
| Unique Control Sets                    | 4          |       | 0          | l 15850   | 0.03  |

#### Now Do same Thing With 32 bits:



\*See lecture code for full implementation and build. (divider0)

#### Build the Stupider Divider

Phase 23 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=-72.962| TNS=-1017.985| WHS=0.166 | THS=0.000 |

Phase 23 Post Router Timing | Checksum: 23fd227b7

2. Slice Logic Distribution. (from post\_place\_util.rpt)

-----

| L                                      | L          |          |            | L         | L     |
|----------------------------------------|------------|----------|------------|-----------|-------|
| Site Type                              | Used I     | Fixed    | Prohibited | Available | Util% |
| Slice                                  | +<br>  303 | +<br>ا 0 | 0          | 8150      | 3.72  |
| SLICEL                                 | 202        | 0        |            |           |       |
| SLICEM                                 | 101        | 0        |            |           |       |
| LUT as Logic                           | 941        | 0        | 0          | 32600     | 2.89  |
| l using 05 output only                 | 0          | I        |            |           |       |
| l using O6 output only                 | 919        | I        |            |           |       |
| l using 05 and 06                      | 22         | I        |            |           |       |
| l LUT as Memory                        | 0          | 0        | 0          | 9600      | 0.00  |
| l LUT as Distributed RAM               | 0          | 0        |            |           |       |
| l using 05 output only                 | 0          | I        |            |           |       |
| l using O6 output only                 | 0          | I        |            |           |       |
| l using 05 and 06                      | 0          | I        |            |           |       |
| l LUT as Shift Register                | 0          | 0        |            |           |       |
| l using 05 output only                 | 0          | I        |            |           |       |
| l using O6 output only                 | 0          |          |            |           |       |
| l using 05 and 06                      | 0          |          |            |           |       |
| Slice Registers                        | 126        | 0        | 0          | 65200     | 0.19  |
| Register driven from within the Slice  | 51         | I        |            |           |       |
| Register driven from outside the Slice | 75         | I        |            |           |       |
| LUT in front of the register is unused | 42         | I        |            |           |       |
| LUT in front of the register is used   | 33         | I        |            |           |       |
| Unique Control Sets                    | 7          | I        | 0          | 8150      | 0.09  |

10/8/24



\*See lecture code for full implementation and build. (divider0)

Division (an example of an algorithm that takes an unknown amount of time)

```
def divider (dividend, divisor):
    count = 0
    if divisor==0:
        return -1
    while dividend>=divisor:
        dividend -= divisor
        count += 1
    return (count, dividend)
```

Super efficient divider \s

### A Divider (#1)

```
def divider (dividend, divisor):
    count = 0
    if dividend <=0:
        return (0,divisor)
    if divisor==0:
        return -1
    while dividend>=divisor:
        dividend -= divisor
        count += 1
    return (count, dividend)
```

- This is a Verilog FSM example of the division algorithm above which will run an unknown number of times given a set of inputs
- This is how the functionality of a while loop could be developed in your modules
- Will not handle negative, or 0 or other things
- Give you this 32 bit one in week05

```
module divider #(parameter WIDTH = 32) (input wire clk_in,
                input wire rst_in,
                input wire[WIDTH-1:0] dividend in.
                input wire[WIDTH-1:0] divisor_in,
                input wire data_valid_in,
                output logic[WIDTH-1:0] quotient_out,
                output logic[WIDTH-1:0] remainder_out,
                output logic data_valid_out,
                output logic error_out,
                output logic busy_out);
 logic [WIDTH-1:0] quotient, dividend;
 logic [WIDTH-1:0] divisor;
 enum {RESTING, DIVIDING} state;
 always_ff @(posedge clk_in)begin
   if (rst_in)begin
     quotient <= 0;</pre>
     dividend <= 0:
     divisor <= 0;
     remainder_out <= 0;</pre>
     busy out <= 1'b0;</pre>
     error_out <= 1'b0;
     state <= RESTING;</pre>
     data_valid_out <= 1'b0;</pre>
   end else begin
     case (state)
        RESTING: begin
         if (data_valid_in)begin
            state <= DIVIDING;</pre>
            quotient <= 0;</pre>
            dividend <= dividend_in;</pre>
            divisor <= divisor_in;
            busy_out <= 1'b1;</pre>
           error_out <= 1'b0;
          end
         data_valid_out <= 1'b0;</pre>
        end
       DIVIDING: begin
         if (dividend<=0)begin
            state <= RESTING; //similar to return statement</pre>
            remainder_out <= dividend;</pre>
            quotient out <= quotient:
            busy_out <= 1'b0; //tell outside world i'm done</pre>
            error_out <= 1'b0;
            data valid out <= 1'b1: //good stuff!</pre>
          end else if (divisor==0)begin
            state <= RESTING;</pre>
            remainder_out <= 0;</pre>
            quotient out <= ∅:</pre>
            busy_out <= 1'b0; //tell outside world i'm done</pre>
            error_out <= 1'b1; //ERROR
            data valid out <= 1'b1: //valid ERROF</pre>
          end else if (dividend < divisor) begin
            state <= RESTING;</pre>
            remainder_out <= dividend;</pre>
            auotient out <= auotient;
            busy_out <= 1'b0;</pre>
            error out <= 1'b0;
            data_valid_out <= 1'b1; //good stuff!</pre>
          end else begin
            //state staying in.
           state <= DIVIDING;</pre>
            quotient <= quotient + 1'b1:</pre>
            dividend <= dividend-divisor;
          end
       end
     endcase
   end
 end
endmodule
```

#### Build divider1

Phase 12 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=4.533 | TNS=0.000 | WHS=0.164 | THS=0.000 |

#### 2. Slice Logic Distribution

-----

| +                                        | +         | +4    | +          | +         | ++    |
|------------------------------------------|-----------|-------|------------|-----------|-------|
| I Site Type                              | Used      | Fixed | Prohibited | Available | Util% |
| Slice                                    | +<br>I 52 | 0     | 0          | 8150      | 0.64  |
| SLICEL                                   | 39        | 0     |            |           |       |
| SLICEM                                   | 13        | 0     |            |           |       |
| LUT as Logic                             | 140       | 0     | 0          | 32600     | 0.43  |
| I using OS output only                   | 0         |       |            |           |       |
| I using 06 output only                   | 107       |       |            |           |       |
| l using 05 and 06                        | 33        |       |            |           |       |
| LUT as Memory                            | 0         | 0     | 0          | 9600      | 0.00  |
| LUT as Distributed RAM                   | 0         | 0     |            |           |       |
| <pre>using 05 output only</pre>          | 0         |       |            |           |       |
| I using 06 output only                   | 0         |       |            |           |       |
| l using 05 and 06                        | 0         |       |            |           |       |
| I LUT as Shift Register                  | 0         | 0     |            |           |       |
| <pre>using 05 output only</pre>          | 0         |       |            |           |       |
| I using 06 output only                   | 0         |       |            |           |       |
| l using 05 and 06                        | 0         |       |            |           |       |
| Slice Registers                          | 192       | 0     | 0          | 65200     | 0.29  |
| Register driven from within the Slice    | 85        |       |            |           |       |
| I Register driven from outside the Slice | 107       |       |            |           |       |
| I LUT in front of the register is unused | 49        |       |            |           |       |
| I LUT in front of the register is used   | 58        |       |            | l         |       |
| Unique Control Sets                      | 9         |       | 0          | l 8150    | 0.11  |
| +                                        | +         | +4    | +          | +         | ++    |

## For divider1, what is the Good, the Bad, the Ugly?

- What are some nice features?
- What are some not-nice features?



#### Aside...



Original Italian poster 1967



Americanized poster with the Ugly and the Bad characters swapped



American DVD menu with the artwork appropriately reordered to match American name of movie

## For divider1, though...what are good and bad?

- Good:
- •
- Meets timing
- Resource usage is maybe small?
- Bad:
- ...
- Blocking Implementation (low-throughput)
- Variable throughput

#### So How to Fix...?

#### A Better Algorithm?

- This can't be how computers actually do division in real-life right?
- No there are actual algorithms that are base-2 friendly that we can use instead.
- Further more, there are algorithms that operate in a fixed number of cycles.

#### Divider (Fixed # of Steps)

- Assume the Dividend (A) and the divisor (B) have N bits.
- Build a sequential circuit that processes a single subtraction at a time and *then cycle the circuit N times*.
- This circuit works on unsigned operands; for signed operands one can remember the signs, make operands positive, then correct sign of result.

```
Init: P←0
Load A and B
Repeat N times {
   shift P,A left one bit
   temp = P-B
   if (temp >= 0){
      P←temp
      A_{LSB} \leftarrow 1
   }else{
        A_{LSB} \leftarrow 0
   }
}
Done: Q in A, R in P
```

#### Divider (Fixed # of Steps)

Assume the Dividend (A) and the divisor (B) have N bits. we can build a sequential circuit that processes a single subtraction at a time and *then cycle the circuit N times*. This circuit works on unsigned operands; for signed operands one can remember the signs  $Init: P \leftarrow 0$  Load A and B





#### Divider

| Р    | Α                  | P-B | 7/3 0111/11 B=0011                |
|------|--------------------|-----|-----------------------------------|
| 0000 | 0111               |     | Initial value                     |
| 0000 | 1110               |     | Shift                             |
| 0000 |                    | -3  | Subtract                          |
| 0000 | 111 <mark>0</mark> |     | Restore, set A <sub>Isb</sub> = 0 |
| 0001 | 1100               |     | Shift                             |
| 0001 |                    | -2  | Subtract                          |
| 0001 | 110 <mark>0</mark> |     | Restore, set A <sub>Isb</sub> = 0 |
| 0011 | 1000               |     | Shift                             |
| 0011 |                    | 0   | Subtract                          |
| 0000 | 100 <mark>1</mark> |     | Subtact, set A <sub>lsb</sub> = 1 |
| 0001 | 0010               |     | Shift                             |
| 0001 |                    | -2  | Subtract                          |
| 0001 | 0010               |     | Restore, set A <sub>lsb</sub> = 0 |
| R    | Q                  |     |                                   |



Init: P $\leftarrow 0$ Load A and B
Repeat N times {
 shift P,A left one bit
 temp = P-B
 if (temp >= 0){
 P $\leftarrow$ temp
 A<sub>LSB</sub> $\leftarrow 1$  }else{
 A<sub>LSB</sub> $\leftarrow 0$  }
}

Done: Q in A, R in P

#### divider2

- This is an FSM implementation of the "smarter" algorithm just shown:
- Latency:
  - 32 clock cycles (one for each bit)
- Throughput:
  - 1/32 clock cycles
- This is "blocking" implementation, meaning that when it is running it cannot accept new inputs.
- Even with some sort of FIFO, this will never process more than 1 division per 32 cycles.
- Simulate to verify it works.

```
module divider2 #(parameter WIDTH = 32) (input wire clk_in,
                 input wire rst in.
                 input wire[WIDTH-1:0] dividend_in,
                 input wire[WIDTH-1:0] divisor_in,
                 input wire data_valid_in,
                 output logic[WIDTH-1:0] guotient out.
                 output logic[WIDTH-1:0] remainder_out,
                 output logic data valid out,
                 output logic error out,
                output logic busy_out);
 logic [WIDTH-1:0] quotient, dividend;
 logic [WIDTH-1:0] divisor;
  logic [5:0] count;
 logic [31:0] p;
 enum {RESTING, DIVIDING} state;
 always_ff @(posedge clk_in)begin
   if (rst_in)begin
     quotient <= 0;</pre>
     dividend <= 0:
     divisor <= 0;
      remainder_out <= 0;</pre>
      busy_out <= 1'b0;</pre>
     error out <= 1'b0;
     state <= RESTING;</pre>
     data_valid_out <= 1'b0;</pre>
     count <= 0;</pre>
    end else begin
     case (state)
       RESTING: begin
         if (data_valid_in)begin
            state <= DIVIDING;</pre>
            auotient <= 0;</pre>
            dividend <= dividend in:
            divisor <= divisor in:
            busy_out <= 1'b1;</pre>
            error_out <= 1'b0;
            count <= 31;//load all up</pre>
            p <= 0;
          end
          data_valid_out <= 1'b0;</pre>
        end
       DIVIDING: begin
         if (count==0)begin
            state <= RESTING:</pre>
            if ({p[30:0],dividend[31]}>=divisor[31:0])begin
              remainder_out <= {p[30:0],dividend[31]} - divisor[31:0];</pre>
               quotient_out <= {dividend[30:0],1'b1};</pre>
            end else begin
              remainder_out <= {p[30:0],dividend[31]};</pre>
              quotient_out <= {dividend[30:0],1'b0};</pre>
             end
            busy_out <= 1'b0; //tell outside world i'm done</pre>
            error_out <= 1'b0;
            data_valid_out <= 1'b1; //good stuff!</pre>
           end else begin
            if ({p[30:0],dividend[31]}>=divisor[31:0])begin
              p <= {p[30:0],dividend[31]} - divisor[31:0];</pre>
              dividend <= {dividend[30:0],1'b1};</pre>
             end else begin
              p <= {p[30:0],dividend[31]};</pre>
              dividend <= {dividend[30:0],1'b0};</pre>
            end
           count <= count-1:
          end
       end
     endcase
   end
 end
endmodul
```

#### Build divider2:

Phase 12 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=5.214 TNS=0.000 | WHS=0.167 | THS=0.000

2. Slice Logic Distribution

| Site Type                              | l Used    | Fixed | Prohibited | l Available | Util%       |
|----------------------------------------|-----------|-------|------------|-------------|-------------|
| Slice                                  | +<br>I 55 | 0     | 0          | +<br>  8150 | +<br>  0.67 |
| SLICEL                                 | 40        | Ő     | l C        |             |             |
| SLICEM                                 | 1 15      | 0     |            |             |             |
| LUT as Logic                           | 125       | Ő     | 0          | I 32600     | 0.38        |
| using 05 output only                   | 0         |       | -          |             |             |
| using O6 output only                   | 67        |       |            |             |             |
| using 05 and 06                        | 58        |       |            | l           |             |
| LUT as Memory                          | 0         | 0     | 0          | l 9600      | 0.00        |
| LUT as Distributed RAM                 | 0         | 0     |            | l           | l           |
| using O5 output only                   | 0         |       |            | l           | l           |
| using O6 output only                   | 0         |       |            |             | l           |
| using 05 and 06                        | 0         |       |            | l           | l           |
| LUT as Shift Register                  | 0         | 0     |            | l           | l           |
| using O5 output only                   | 0         |       |            |             | l           |
| using O6 output only                   | 0         |       |            |             | l           |
| using 05 and 06                        | 0         |       |            |             | l           |
| Slice Registers                        | l 197     | 0     | 0          | l 65200     | 0.30        |
| Register driven from within the Slice  | 94        |       |            |             | l           |
| Register driven from outside the Slice | 103       |       |            |             |             |
| LUT in front of the register is unused | l 46      |       |            |             |             |
| LUT in front of the register is used   | l 57      |       |            | l           | l           |
| Unique Control Sets                    | 10        |       | 0          | l 8150      | 0.12        |

### For divider2, ...what are good and bad?

- Good:
- •
- Meets timing
- Nominally the same resource usage as before
- Runs in fixed number of cycles!
- Bad:
- ...
- Blocking Implementation (low-throughput)

#### Can We Make it Better?

Phase 12 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary WNS=5.214 | TNS=0.000 | WHS=0.167 | THS=0.000 |

- We have a lot of slack with this current design.
- Currently kinda doing something like this (but 32 cycles rather than 4 cycles):



#### Could We Instead...

• Instead of this:



• Do this:

# Divider2b: Wedge a second iteration into each clock cycle:

- Did not mix-n-match blocking/non-blocking in my always\_ff because I realize that to err is human and this will lead to my downfall, if not today, then tomorrow
- Instead made an always\_comb with some "temp" variables to hold the result of the first iteration

```
DIVIDING: begin
          if (count==1)begin
            state <= RESTING:</pre>
            if ({p_temp[30:0],div_temp[31]}>=divisor[31:0])begin
              remainder_out <= {p_temp[30:0],div_temp[31]} - divisor[31:0];</pre>
              quotient_out <= {div_temp[30:0],1'b1};</pre>
            end else begin
              remainder_out <= {p_temp[30:0],div_temp[31]};</pre>
              quotient_out <= {div_temp[30:0],1'b0};</pre>
            end
            busy_out <= 1'b0; //tell outside world i'm done</pre>
            error_out <= 1'b0;
            data_valid_out <= 1'b1; //good stuff!</pre>
          end else begin
            if ({p_temp[30:0],div_temp[31]}>=divisor[31:0])begin
              p <= {p_temp[30:0],div_temp[31]} - divisor[31:0];</pre>
              dividend <= {div_temp[30:0],1'b1};</pre>
            end else begin
              p <= {p_temp[30:0],div_temp[31]};</pre>
              dividend <= {div temp[30:0],1'b0};</pre>
            end
            count <= count-2:
          end
        end
      endcase
    end
  end
 //extra:
  logic [31:0] p_temp;
  logic [31:0] div_temp;
  always_comb begin
   if ({p[30:0],dividend[31]}>=divisor[31:0])begin
      p temp = {p[30:0],dividend[31]} - divisor[31:0];
      div_temp = {dividend[30:0],1'b1};
    end else begin
     p_temp = {p[30:0],dividend[31]};
      div temp = {dividend[30:0],1'b0};
   end
 end
endmodule
```

#### Build divider2b:

Phase 12 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary WNS=1.561 TNS=0.000 | WHS=0.160 | THS=0.000

#### 2. Slice Logic Distribution

| Site Type                              | l Used    | Fixed    | Prohibited | Available | Util% |
|----------------------------------------|-----------|----------|------------|-----------|-------|
| <br>Slice                              | +<br>  67 | +<br>  0 |            | 8150      | 0.82  |
| SLICEL                                 | 44        | 0        |            |           |       |
| SLICEM                                 | l 23      | 0        |            |           |       |
| LUT as Logic                           | 185       | 0        | 0          | 32600     | 0.57  |
| using OŠ output only                   | 0         |          |            |           |       |
| using 06 output only                   | l 125     |          |            |           |       |
| using 05 and 06                        | l 60      |          |            |           | l     |
| LUT as Memory                          | l 0       | 0        | 0          | 9600      | 0.00  |
| LUT as Distributed RAM                 | l 0       | 0        |            |           |       |
| using 05 output only                   | l 0       |          |            |           | l     |
| using O6 output only                   | l 0       |          |            |           | l     |
| using 05 and 06                        | l 0       |          |            |           | l     |
| LUT as Shift Register                  | l 0       | 0        |            |           |       |
| using O5 output only                   | l 0       |          |            |           |       |
| using O6 output only                   | l 0       |          |            |           |       |
| using 05 and 06                        | l 0       |          |            |           |       |
| Slice Registers                        | l 197     | 0        | 0          | 65200     | 0.30  |
| Register driven from within the Slice  | l 92      |          |            |           |       |
| Register driven from outside the Slice | 105       |          |            |           |       |
| LUT in front of the register is unused | I 35      |          |            |           |       |
| LUT in front of the register is used   | 70        |          |            |           |       |
| Unique Control Sets                    | 9         |          | 0          | 8150      | 0.11  |

## For divider2b, ...what are good and bad?

- Good:
- •
- Meets timing
- Improved, fixed throughput (2X)
- Latency improved (1/2X)
- Bad:
- ...
- Blocking Implementation (low-throughput)
- Resource usage a little bit higher

#### Make it Better?

- Easiest thing to try is to shove *three steps or four steps* of the algorithm into one clock cycle?
- Maybe?
- lunno
- Maybe?

#### Attempt at divider2c...try to do three layers of the algorithm on one clock cycle

- Used some more poorly named variables to act as intermediaries
- But should work "in theory"\*

\*to use the terminology of a student trying to convince me that they achieved what they set out to do on their final project when they did not.



10/8/24

#### Build divider2c:

| Phase 25 Post Router Timing<br>INFO: [Route 35-20] Post Routing | g Timing Summary | WNS=-2.343  | TNS=-66.953  WHS=0.159   THS=0.000        |
|-----------------------------------------------------------------|------------------|-------------|-------------------------------------------|
| 2. Slice Logic Distribution                                     | Timing Failed    | got greedy. | tried to fly too close to the sun, Icarus |

| Site Type                              | l Used l  | Fixed    | Prohibited | Available | Util%       |
|----------------------------------------|-----------|----------|------------|-----------|-------------|
| Slice                                  | +<br>  88 | +<br>ا 0 | 0          | 8150      | +<br>  1.08 |
| SLICEL                                 | I 66 I    | 0        |            |           | I           |
| SLICEM                                 | 22        | 0        |            |           | I           |
| LUT as Logic                           | 234       | 0        | 0          | 32600     | 0.72        |
| using O5 output only                   | 0         | I        |            | l         |             |
| using O6 output only                   | 160       | I        |            |           |             |
| using 05 and 06                        | 74        | I        |            |           |             |
| LUT as Memory                          | 0         | 0        | 0          | 9600      | 0.00        |
| LUT as Distributed RAM                 | 0         | 0        |            |           |             |
| using O5 output only                   | 0         | I        |            |           |             |
| using O6 output only                   | 0         | I        |            |           |             |
| using 05 and 06                        | 0         | I        |            |           |             |
| LUT as Shift Register                  | 0         | 0        |            |           |             |
| using 05 output only                   | 0         | I        |            |           |             |
| using O6 output only                   | 0         | I        |            |           |             |
| using 05 and 06                        | 0         | I        |            |           |             |
| Slice Registers                        | 197       | 0        | 0          | 65200     | 0.30        |
| Register driven from within the Slice  | I 96 I    | I        |            |           | I           |
| Register driven from outside the Slice | 101       | I        |            |           |             |
| LUT in front of the register is unused | 31        | I        |            |           |             |
| LUT in front of the register is used   | I 70 I    | I        |            |           |             |
| Unique Control Sets                    | 9         | I        | 0          | 8150      | 0.12        |

#### Another interesting feature

• Notice this number:

Phase 25 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=-2.343 | TNS=-66.953| WHS=0.159 | THS=0.000 |

Whereas a design that worked earlier is this:

Phase 12 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=1.561 | TNS=0.000 | WHS=0.160 | THS=0.000 |

- Designs which fit timing easily will go through fewer phases of optimization
- Vivado will give up after too many phases and can't achieve

#### Summary so far...

- So we've made some gains by:
  - picking a better algorithm (something suited to base 2)
  - Shoving more iterations of the cycle between the clock edges...



Latency still bad though :/

#### Can We Make it Better?

- We have an algorithm that takes a fixed amount of cycles per divide (32 in our case)
- Because of this we know exactly how many calculations we need to do.
- This allows us to set up a fully-pipelined system (can't easily do in a variable-run-time algorithm)

#### What?

• Currently we're doing something like this:



• What if we instead did this ("unwrap the loop"):



Latency: 3\*T<sub>clk</sub> Throughput: 1/ T<sub>clk</sub> Uses more logic,flops

## divider3

- Fully pipelined 32 step division
- Each step is carried out and results placed in registers which are used by next step in pipeline
- Latency still 32 cycles
- Throughput is now 1/1 cycle
  - Assembly line! Stage 0 can always have something to do
- Simulate it (it works)
- Now build...

```
module divider3 #(parameter WIDTH = 32) (input wire clk in,
                 input wire rst_in,
                 input wire[WIDTH-1:0] dividend_in,
                 input wire[WIDTH-1:0] divisor_in,
                input wire data_valid_in,
                output logic[WIDTH-1:0] quotient_out,
                 output logic[WIDTH-1:0] remainder_out,
                output logic data_valid_out,
                output logic error out,
                 output logic busy_out);
  logic [31:0] p[31:0]; //32 stages
  logic [31:0] dividend [31:0];
 logic [31:0] divisor [31:0];
 logic data_valid [31:0];
 assign data_valid_out = data_valid[31];
 assign guotient_out = dividend[31];
 assign remainder_out = p[31];
 always_ff @(posedge clk_in)begin
    data_valid[0] <= data_valid_in;</pre>
    if (data_valid_in)begin
      divisor[0] <= divisor_in;</pre>
      if ({31'b0,dividend_in[31]}>=divisor_in[31:0])begin
        p[0] <= {31'b0,dividend_in[31]} - divisor_in[31:0];</pre>
        dividend [0] \leq \{ \text{dividend in } [30:0], 1'b1 \};
      end else begin
        p[0] <= {31'b0,dividend_in[31]};</pre>
        dividend[0] <= {dividend_in[30:0],1'b0};</pre>
      end
    end
    for (int i=1; i<32; i=i+1)begin</pre>
      data_valid[i] <= data_valid[i-1];</pre>
      if ({p[i-1][30:0],dividend[i-1][31]}>=divisor[i-1][31:0])begin
        p[i] <= {p[i-1][30:0],dividend[i-1][31]} - divisor[i-1][31:0];</pre>
        dividend[i] <= {dividend[i-1][30:0],1'b1};
      end else begin
        p[i] <= {p[i-1][30:0],dividend[i-1][31]};</pre>
        dividend[i] <= {dividend[i-1][30:0],1'b0};</pre>
      end
      divisor[i] <= divisor[i-1];</pre>
    end
 end
endmodule
```



Compare to the FSMbased approach before where there was one p variable, for example

output logic busy\_out); logic [WIDTH-1:0] quotient, dividend; logic [WIDTH-1:0] divisor; logic [5:0] count; logic [31:0] p; enum {RESTING, DIVIDING} state; always ff @(posedge\_clk\_ip)begip

10/7/24





## Nice...fully unrolled

• Were previously doing this:





• Now instead do this ("unwrap the loop"):



#### Build divider3

Phase 12 Post Router Timing INFO: [Route 35-20] Post Routing Timing Summary | WNS=3.738 | TNS=0.000 | WHS=0.054 | THS=0.000 |

#### 2. Slice Logic Distribution

-----

| + Site Type                                                                              | +<br>  Used         | ⊦+<br>  Fixed     | Prohibited | ⊦<br>  Available | +<br>  Util%    | +<br>      |
|------------------------------------------------------------------------------------------|---------------------|-------------------|------------|------------------|-----------------|------------|
| SLICEL                                                                                   | +<br>  492<br>  336 | ++<br>  0 <br>  0 | 0          | 8150             | +<br>  6.04<br> | +<br> <br> |
| SLICEM<br>  LUT as Logic                                                                 | 156<br>  1504       | 0                 | 0          | 32600            | <br>  4.61      |            |
| I using O5 output only                                                                   | I 0                 |                   | Ŭ          |                  |                 | ,<br> <br> |
| <pre>using 06 output only using 05 and 06</pre>                                          | 999<br>  505        |                   |            |                  |                 | <br>       |
| LUT as Memory<br>  LUT as Distributed RAM                                                | 58<br>  0           | 0 <br>  0         | 0          | 9600 I           | 0.60<br>        | <br>       |
| <pre>using 05 output only using 06 output only</pre>                                     | 0<br>  0            | <br>              |            |                  | <br>            | Resource   |
| l using 05 and 06<br>LUT as Shift Register                                               | 0<br>  58           | <br>  0           |            |                  |                 | usage went |
| <pre>using 05 output only using 06 output only</pre>                                     | 17<br>  41          |                   |            |                  |                 | way up!    |
| l using 05 and 06                                                                        | I 0                 |                   | 0          | 65200            | <u> </u>        | Why?       |
| <pre>Slice Registers Register driven from within the Slice</pre>                         | 1632<br>  1052      |                   | 0          | 65200            | 1 2.50          |            |
| <pre>Register driven from outside the Slice LUT in front of the register is unused</pre> | 580<br>  194        | <br>              |            |                  | <br>            | <br>       |
| <pre>LUT in front of the register is used Unique Control Sets</pre>                      | 386<br>  7          | <br>              | 0          | 8150             | <br>  0.09      | <br>       |
| -                                                                                        | /fpga.mit           | .edu/6205         | /F24       |                  |                 | 41         |

## We should expect increased resource usage!!

- We've traded resources for throughput
- Now can do 100 million divisions per second as opposed to ~3 million per second from before

## For divider3, ...what are good and bad?

- Good:
- •
- Meets timing
- Improved, fixed throughput (32X) compared to v2
- Latency the same (compared to v2)
- Bad:
- •
- Resource usage significantly higher

#### Do it Better?

• Can I get my result faster?

 $L = N^* t_{clk}, T = 1/t_{clk}$ 



Iteration #0 Heration #1 Heration #2 Heration #2 Heration #N-1 Heration #N-1

 $L = 0.5 * N * t_{clk}$ ,  $T = 1/t_{clk}$  And maybe use fewer registers!!!:



#### divider4

- Improved Pipeline
- Shove two stages of our algorithm between each register pair.
- Therefore this should allow the same throughput of division but a halving of latency!
- In theory anyways.
- Simulate it (it works)
- Now build...

```
module divider4 #(parameter WIDTH = 32) (input wire clk in,
                input wire rst_in,
                input wire[WIDTH-1:0] dividend_in,
                input wire[WIDTH-1:0] divisor_in,
                input wire data_valid_in,
                output logic[WIDTH-1:0] quotient_out,
                output logic[WIDTH-1:0] remainder out,
                output logic data_valid_out,
                output logic error_out,
                output logic busy_out);
 logic [31:0] p[31:0]; //32 stages
 logic [31:0] dividend [31:0];
 logic [31:0] divisor [31:0];
 logic data_valid [31:0];
 assign data_valid_out = data_valid[31];
 assign quotient_out = dividend[31];
 assign remainder_out = p[31];
 always @(*) begin
   data_valid[0] = data_valid_in;
   divisor[0] = divisor_in;
   if (data_valid_in)begin
     if ({31'b0,dividend_in[31]}>=divisor_in[31:0])begin
        p[0] = {31'b0,dividend in[31]} - divisor in[31:0];
        dividend[0] = {dividend_in[30:0],1'b1};
     end else begin
       p[0] = {31'b0,dividend_in[31]};
        dividend[0] = {dividend_in[30:0],1'b0};
     end
   end
    for (int i=2; i<32; i=i+2)begin</pre>
     data valid[i] = data valid[i-1];
     if ({p[i-1][30:0],dividend[i-1][31]}>=divisor[i-1][31:0])begin
       p[i] = {p[i-1][30:0],dividend[i-1][31]} - divisor[i-1][31:0];
        dividend[i] = {dividend[i-1][30:0],1'b1};
     end else begin
       p[i] = {p[i-1][30:0],dividend[i-1][31]};
        dividend[i] = {dividend[i-1][30:0],1'b0};
     end
     divisor[i] = divisor[i-1];
   end
 end
 always_ff @(posedge clk_in)begin
   for (int i=1; i<32; i=i+2)begin</pre>
     data_valid[i] <= data_valid[i-1];</pre>
     if ({p[i-1][30:0],dividend[i-1][31]}>=divisor[i-1][31:0])begin
        p[i] <= {p[i-1][30:0],dividend[i-1][31]} - divisor[i-1][31:0];</pre>
       dividend[i] <= {dividend[i-1][30:0],1'b1};</pre>
     end else begin
       p[i] <= {p[i-1][30:0],dividend[i-1][31]};</pre>
       dividend[i] <= {dividend[i-1][30:0].1'b0};</pre>
     end
     divisor[i] <= divisor[i-1];</pre>
   end
 end
endmodule
```

#### Build divider4

#### Pass timing by 143 picoseconds

| Phase 12 Post Router Timing<br>INFO: [Route 35-20] Post Routing Timing Su | ımmary l    | WNS=0.1 | 73   TNS=0.0 | 00   WHS=0.      | 057   T      | HS=0.000     |
|---------------------------------------------------------------------------|-------------|---------|--------------|------------------|--------------|--------------|
| 2. Slice Logic Distribution                                               | Flip        | flop us | sage dropp   | ed by a lo       | t! (prev     | v 2.50%) Why |
| Site Type                                                                 | +           | Fixed   | Prohibited   | +<br>  Available | +<br>  Util% | +<br>        |
| Slice                                                                     | +<br>  460  | 0       | 0            | +<br>  8150      | +<br>  5.64  | +            |
| SLICEL                                                                    | 321         | -       |              | 1                |              |              |
| SLICEM                                                                    | 139         | -       |              |                  |              |              |
| LUT as Logic                                                              | 1504        | 0       | 0            | l 32600          | 4.61         |              |
| using O5 output only<br>using O6 output only                              | 0 <br>  999 |         |              | 1                | 1            |              |
| using 05 and 06                                                           | I 505       |         |              | ,<br>            |              |              |
| LUT as Memory                                                             | 54          | 0       | 0            | I 9600           | 0.56         |              |
| LUT as Distributed RAM                                                    | 0           | 0       |              | I                |              |              |
| using 05 output only                                                      | 0           |         |              | I                |              | l            |
| using O6 output only                                                      | 0           |         |              | l                |              |              |
| using 05 and 06                                                           |             |         |              |                  |              |              |
| LUT as Shift Register                                                     | 54  <br>  0 | 0       |              | 1                | 1            |              |
| using O5 output only<br>using O6 output only                              | 54          |         |              | 1                | 1            | 1            |
| using 05 and 06                                                           | 0           |         |              | I <u> </u>       | I<br>I       | ·<br>        |
| Slice Registers                                                           | 919         | 0       | 0            | i 65200          | 1.41         |              |
| Register driven from within the Slice                                     | 574         |         |              |                  |              | ┢╾┛          |
| Register driven from outside the Slice                                    | l 345       |         |              | l                |              | l            |
| LUT in front of the register is unused                                    |             |         |              |                  |              |              |
| LUT in front of the register is used                                      | 232         |         |              |                  |              |              |
| Unique Control Sets                                                       | 7           |         | 0            | l 8150           | 0.09         |              |

10/7/24

https://fpga.mit.edu/6205/F24

#### Fewer Flip Flops

• Should be expected!







#### Summary of the Journey

| Divider                 | Resource Usage<br>%LUT/%FF | Latency            | Throughput |
|-------------------------|----------------------------|--------------------|------------|
| 32 bit /                | 3.72/0.29                  | FAIL ( -72.004 ns) | FAIL (1/L) |
| divider 1 (lec06/week5) | 0.64/0.29                  | Variable           | Variable   |
| divider 2               | 0.67/0.30                  | 32                 | 1/32       |
| divider 2b              | 0.82/0.30                  | 16                 | 1/16       |
| divider 2c              | 1.08/0.30                  | FAIL (-2.3ns)      | FAIL (1/L) |
| divider 3               | 6.04/2.50                  | 32                 | 1/1        |
| divider 4               | 5.64/1.41                  | 16                 | 1/1        |

which to use?

### Conclusions

- First: Use a good algorithm!
  - Doing things stupidly can only work out so well (not well)
- Second:
  - Figure out what we (you, customer) actually need...
  - Need to divide every clock cycle?
  - Need to divide every million clock cycles?

#### More Conclusions

- Some tasks can be parallelized:
  - (adding an array up...See Lecture 02 with big\_adder)
- Some tasks cannot be parallelized and steps must be done sequentially:
  - 10 violinists cannot play a violin solo ten times as fast
  - Division is an iterative process inherently
- If must be done sequentially:
  - Variable-length or Fixed-length Algorithm?

### Algorithms

- Variable-length algorithm are generally implemented as type of state machine
- Fixed-length algorithms can be more flexible:
  - FSM (blocking)
  - Fully pipeline (assembly-line)
  - Mixture in between

### **Optimize for Need!**

 All those options allow one to vary between amounts of pipelining and iterative behavior

Init:  $P \leftarrow 0$ , load A and B Repeat N times { shift P,A left one bit temp = P-Bif  $(temp \ge 0)$ { P←temp  $A_{ISB} \leftarrow 1$ }else{ A<sub>LSB</sub>←0 } } Done: Q in A, R in P

 $L = N^* t_{clk}$ ,  $T = 1/t_{clk}$  At small  $t_{clk}$  **But use lots of resources:** 



 $L = 0.5 * N * t_{clk}$ ,  $T = 1/t_{clk}$  At larger  $t_{clk}$  But uses slightly fewer of resources:



Honestly minimal benefit to this at least for divider since it now barely passes timing 10/7/24

https://fpga.mit.edu/6205/F24

### Optimize for Need!

 All those options allow one to vary between amounts of pipelining and iterative behavior

```
Init: P←0, load A and B
Repeat N times {
    shift P,A left one bit
    temp = P-B
    if (temp >= 0){
        P←temp
        ALSB←1
    }else{
        ALSB←0
    }
}
Done: Q in A, R in P
```

 $L = N^{*}t_{clk}$ ,  $T = 1/(N^{*}t_{clk})$  At small  $t_{clk}$  But use very few resources:



 Takes N cycles to divide and can't accept new inputs during that time

Takes N cycles to divide but can take a new input every ✓ N/M cycles

 $L = N^* t_{clk}$   $T = 1/(N/M^* t_{clk})$  At small  $t_{clk}$  **But uses more of resources:** 



### A lot of Algorithms are Repetition-Based though

• Let's say we need to compute F(F(F(X))). Do we build our hardware like this?:



Latency: 3\*T<sub>clk</sub> Throughput: 1/ T<sub>clk</sub> Uses more resources

• Or like this:?



Latency: 3\*T<sub>clk</sub> Throughput: 1/ (3\*T<sub>clk</sub>) <u>Likely</u> uses fewer resources

#### This is the Great Tradeoff!



• Base on what you need for the design!

## Most of what you may need to do can be framed in this way

- What about the "other" math operations?
- Square root?
- Trig functions?
- Exponents?
- Anything else?
- There's usually a "smart" way to do it.

### CORDIC

- <u>Coordinate</u> <u>Rotation</u> <u>Digital</u>
   <u>Computer</u>
- Super versatile class of iterative algorithms that are used widely in hardware because they are relatively simple to implement (mostly just shifts and adds and compares)
- Can operate quite efficiently using a minimal amount of resources



https://www.remcycles.net/blog/cordic.html

### Generalized CORDIC

The three equations are iterated



#### Different Modes



- 1/K = 0.607252935009...
- K' = 0.8281593609602...
- 1/K' = 1.207497067763...

#### CORDIC

#### • What can you compute with CORDIC?

#### Directly computable functions [edit | edit source]

| $\sin z$         | $\cos z$                  |
|------------------|---------------------------|
| $	an^{-1} z$     | $\sinh z$                 |
| $\cosh z$        | $	anh^{-1}z$              |
| y/x              | xz                        |
| $\tan^{-1}(y/x)$ | $\sqrt{x^2+y^2}$          |
| $\sqrt{x^2-y^2}$ | $e^z = \sinh z + \cosh z$ |

#### Indirectly computable functions [edit | edit source]

In addition to the above functions, a number of other functions can be produced by combining the results of previous computations:

$$\begin{aligned} \tan z &= \frac{\sin z}{\cos z} & \cos^{-1} w = \tan^{-1} \frac{\sqrt{1 - w^2}}{w} \\ \tanh z &= \frac{\sinh z}{\cosh z} & \sin^{-1} w = \tan^{-1} \frac{w}{\sqrt{1 - w^2}} \\ \ln w &= 2 \tanh^{-1} \frac{w - 1}{w + 1} & \log_b w = \frac{\ln w}{\ln b} \\ w^t &= e^{t \ln w} & \cosh^{-1} &= \ln \left( w + \sqrt{w^2 - 1} \right) \\ \tan^{-1}(y/x) & \sinh^{-1} &= \ln \left( w + \sqrt{w^2 + 1} \right) \\ \sqrt{x^2 - y^2} & \sqrt{w} &= \sqrt{(w + 1/4)^2 - (w - 1/4)^2} \end{aligned}$$

10/7/24

6.S965 Fall 2024

# Often Use more "Primitive" algorithms on an FPGA

- Along with things like <code>srli</code> or <code>add</code>, modern processors will often have:
  - 32-bit integer multiply instructions
  - Floating-point instructions
  - Etc...
- If both Srli and Mult cost the same in terms of instructions, then you might as well use a Mult if it gets you more performance
  - And many algorithms for certain things can be done more quickly using mult than just srli

#### In an FPGA, accelerator, etc...

- If you have the freedom to not use Mult, and it has a benefit (perhaps in terms of resource usage)...
- Then you should consider it as another degree in which to optimize.
- Have finite number of multiplier blocks on FPGA...
- Spending some on an algorithm that doesn't need it could hurt you elsewhere



## So make sure you explore your algorithms

• Don't necessarily do it the software way or even the "C-way" since those are often optimized to a different set of constraints.

#### Data Type Sizes

- In a traditional processor, instructions are optimized for particular data type sizes:
  - 32 or 64 bit integers
  - 32 or 64 bit floats
- Don't need to do that necessarily anymore
- Can be the difference between making timing and not making timing

## The Ongoing 8-bit debate in the ML field

#### FP8 FORMATS FOR DEEP LEARNING

Paulius Micikevicius, Dusan Stosic, Patrick Judd, John Kamalu, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu NVIDIA

{pauliusm, dstosic, pjudd, jkamalu, soberman, mshoeybi, msiu, skyw}@nvidia.com

Neil Burgess, Sangwon Ha, Richard Grisenthwaite Arm n.ha, richard.grisenthwaite}@arm.com

s Cornea, Alexander Heinecke, Pradeep Dubey
Intel
ea, alexander.heinecke, pradeep.dubey}@intel.com

#### ABSTRACT

erating deep learning training inference beyond the 16-bit In this paper we propose an 8-bit floating point (FP8) binary

#### FP8 Quantization: The Power of the Exponent

022

Andrey Kuzmin<sup>\*</sup>, Mart Van Baalen<sup>\*</sup>, Yuwei Ren, Markus Nagel, Jorn Peters, Tijmen Blankevoort Qualcomm AI Research<sup>†</sup> {akuzmin,mart,ren,markusn,jpeters,tijmen}@qti.qualcomm.com

#### Abstract

When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper in-depth investigates this benefit of the floating point format for neural network inference. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent,

TU/ 0/ 24

And of course...remember memory is often a limiting factor!

- Few years ago...team built a fully-pipelined search implementation.
- Could search 1024 elements 100 million times per second.
- But we couldn't give it data fast enough to take advantage of it.



#### Project Video



#### Project Video

#### Mirror Me 6.205 Final Project Fall 2022

#### Janette (Jan) Park and Thienan Nguyen

# Final Project Teams/prefernces by tonight

- Must submit tonight
- Do it
- See Piazza announcement