Checkoff 01: HD Camera
Last week, we had our cameras giving us 1280x720 pixels of data for every frame, but we had to cut it down to a much smaller size to fit it all into our BRAM frame buffer. Pretty unsatisfying, right? Today, we're going to fix that problem! Your design for this week will start from what you had last week, and swap out the frame buffer BRAM for a data pipeline that can store our frames in a much larger form of memory: the off-chip DRAM!
To get started, make a copy of your week05 final camera design: you'll need much of the data pipeline, but especially the center_of_mass module; we'll make some great use of it in the second half!
Then, download a handful of files:
- base.zip: HDL files, which include:
  - top_level.sv: very similar to last week's, but hooks up all the new modules we'll need in order to interact with the DRAM. It skips the data pipelining you added: you don't need to worry about properly pipelining and removing artifacts this week. You will need to add your subsampling and scaling logic back in!
  - stacker.sv and unstacker.sv: deserialize and serialize pixel data to be written/read as 128-bit messages.
  - traffic_generator.sv: a partially finished generator of commands for the MIG, which you will finish off.
  - video_mux.sv: only a slight change from last week, so that the mode with sw[7:6] both high displays camera output instead of an orange screen. We'll use that switch setting for the new zoomed-in mode later.
- top_level.xdc: only a slight change from last week's.
- ip.zip: the IP directories needed for this project. Most notably, the MIG!
We'll need to use everything that was in the week 5 system, since we're augmenting that system directly, but in particular in the second half we'll be relying on a decent center_of_mass result.
Don't forget to bring in the files in your data folder! One of them is extremely important since it programs the camera!
SUMMARY: In your top_level:
Copy in your camera subsampling and 4X scaling logic from your week 5 top_level. You can remove the 1X and 2X scaling modes now: just copy in the scaling logic from each row. Also, if you previously referred to 319-hcount, just use hcount instead; to match how we'll build our DRAM pipeline, we don't want our camera output to be flipped on the X-axis.
Starting this week, we're building some really hefty IP alongside our own designs. When you start Vivado or lab-bc, the first thing that happens is that the IP gets generated, which takes about a minute and a half and produces some 1200 lines of INFOs and WARNINGs, before your code even gets opened. So, if you try to build something with a silly syntax error, IT WILL NOW TAKE YOU A MINUTE AND A HALF TO FIND OUT! So check your syntax locally! For The Love Of God! The build server will start to throttle your submissions if you repeatedly use it just to find out you're missing a semicolon. You need to check things locally.
Replacing the BRAM
Getting things into and out of the off-chip memory is way more complicated than using the nice, friendly BRAM blocks inside the FPGA fabric. If we tried to do it all from scratch, we'd be fighting a seriously losing battle. Instead, we're going to let all the necessary tasks for maintaining data on the DRAM memory chip be handled by some IP (intellectual property) provided by Xilinx. The memory interface we'll use comes out of Xilinx's MIG, or Memory Interface Generator. We've configured the memory interface to work with the specific chip on the Urbana board for you, and the version of the memory interface you need is provided in the IP zip file above.
Of course, by introducing the MIG, we've just replaced the problem of "how do I work with a memory chip?" with the problem of "how do I work with the MIG?" It's a better problem to have, but we'll still need some infrastructure in order to set it up properly. There are a couple major constraints of working with the MIG that we need to design around:
- Clock Speeds: The spec of the DRAM chip demands that we run it somewhere in the 300-400MHz range, and our specific MIG configuration has it running at 325MHz. There are 16 parallel data wires between the FPGA and the DRAM chip, so 16 bits of data are sent at a time. Since the chip is DDR, or Double Data Rate, we can send data on both the rising and falling edges of that clock, so we can send 16-bit messages at an effective rate of 650MHz. That's really fast--probably too fast for the logic we'd write to send messages to the DRAM. So the MIG instead gives us a clock that's 4 times slower than the chip's operating clock: 81.25MHz. We need to give the MIG our data on the rising edge of this "UI Clock"1.
- Data Width: Since we're operating at this clock that's 4 times slower than the DDR chip, the MIG can send multiple messages on the 16-wide bus during each of our clock cycles. Specifically, since we feed it data at 81.25MHz and it can double-pump data at 325MHz, there are (4*2) = 8 chances to send a 16-bit message within each UI clock cycle, so it wants us to give it 128-bit messages on each cycle.
Now these constraints are annoying, considering we have two other clock domains in which we need to produce and consume data, and each of our pixels needs 16 bits of storage, not 128. But we have tools at our disposal to surmount these challenges!
Stacking, Unstacking, and FIFOs
In order to get data into our 81.25MHz UI clock domain, we need something that carries our data from one clock domain to another. For that, we'll use a FIFO structure provided by Xilinx. We'll write data on one side of the FIFO in one clock domain, and we'll read data on the other side of it in another clock domain. This is both how we'll get pixel data into the UI clock domain from our pixel reconstructor, and how we'll get pixel data from our DRAM frame buffer to the video pipeline and HDMI output. We can trust that the clock domain crossing is being handled appropriately inside of the Xilinx structure.
Using a FIFO here also lets us worry less about the exact timing of the DRAM. Sometimes a request to get memory from the DRAM chip could take 30 clock cycles, and sometimes it could take 300, but that manifests itself only as the FIFO being more or less full at individual moments. As long as the average rate at which data arrives keeps up with the rate at which we need to consume it, the FIFO should keep us covered!
Our FIFO, like the MIG, will deal in 128-bit messages. In order to accommodate getting data into and out of these FIFOs, we've also provided you with two modules: a stacker and an unstacker. The stacker takes sequential 16-bit messages and stacks them into one larger message, and adds a message to the FIFO when a full 128-bit message is ready. The unstacker does the inverse--it takes a 128-bit message out of the FIFO and serializes it into 16-bit messages.2
All three of these components, the FIFO, the stacker, and the unstacker, speak to one another and to you using AXI-Stream, which is discussed in depth in Lecture 11.
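To make the packing concrete, here's a minimal sketch of the stacker's core idea (hypothetical signal names; the provided stacker.sv handles the full AXI-Stream handshaking):

```systemverilog
// Minimal sketch of the stacker's packing idea (hypothetical names;
// the provided stacker.sv handles full AXI-Stream handshaking).
logic [127:0] pixel_accumulator; // 8 x 16-bit pixels
logic [2:0]   pixel_count;       // which slot we're filling next

always_ff @(posedge clk_camera) begin
  if (pixel_valid) begin
    // Place the incoming 16-bit pixel into its slot of the 128-bit word.
    pixel_accumulator[16*pixel_count +: 16] <= pixel_data;
    pixel_count <= pixel_count + 1; // wraps 7 -> 0 naturally (3 bits)
  end
end
// Once all 8 slots are full, the 128-bit word is handed to the FIFO.
assign word_ready = pixel_valid && (pixel_count == 3'd7);
```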
AXI-Stream: Bursts of transactions
AXI-Stream is a protocol by which we can carry data between two modules and completely, unambiguously specify when data is actively being transferred (if followed, no data samples should ever get "dropped" in AXI). Many of the modules you've already written have used some version of a ready and a valid signal; one of the most valuable things that AXI-Stream provides us with is a way to formalize that! And since the FIFOs, stackers, and unstackers rely on this protocol to understand our requests, you need to properly set your tready, tvalid, and tlast signals to tell these modules what you want to be doing with your data. These three signals together enable the unambiguous handing off and characterization of our data.
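The core rule is simple: data moves on exactly those cycles where both sides agree. A toy sink module illustrates it (generic, hypothetical names, not from the provided files):

```systemverilog
// Minimal illustration of the ready/valid handshake rule
// (hypothetical generic names, not from the provided files).
module axis_sink #(parameter WIDTH = 16) (
  input  logic             clk,
  input  logic [WIDTH-1:0] s_axis_tdata,
  input  logic             s_axis_tvalid,
  output logic             s_axis_tready,
  output logic [WIDTH-1:0] captured_data
);
  assign s_axis_tready = 1'b1; // this toy sink can always accept data

  always_ff @(posedge clk) begin
    // A transfer happens only on a cycle where BOTH tvalid and tready
    // are high; on that cycle the sink must actually capture the data.
    if (s_axis_tvalid && s_axis_tready) captured_data <= s_axis_tdata;
  end
endmodule
```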
Make sure to review the lecture where the meaning and purpose of READY/tready and VALID/tvalid are discussed. This is important.
When streaming data over AXIS (AXI4-Stream), only the data gets sent. Unlike with more complicated versions of the AXI protocol (like Full AXI or AXI-Lite), there are no address values sent along with the data. As a result, the order of the data is very important and specifies where/when a particular datum "lives". Furthermore, we'll often be sending one set of data and then want to send a different set, so we'll need the ability to specify the "boundary" between those sets so things don't flow together incorrectly. That's the purpose of a LAST/tlast signal. It allows the long sequence of pixels we're sending to our FIFO to have an accompanying alignment marker.
The tlast signal is a single bit that we want to set high when we're transferring the last pixel in our frame buffer: aka, the bottom-right corner of the screen. It accompanies that piece of data as it flows through the system. There are a couple of places where we need to specify and interact with the tlast signal to properly generate an output:
- TLAST into Stacker: We need to generate a tlast signal at the output of our pixel reconstructor/input to the stacker. This should be a pretty simple piece of logic (see the sketch after this list).
- TREADY into Unstacker: We need to properly react to a tlast signal on the output of the unstacker, when we're receiving our frame buffer. Specifically, we want to use it to help interpret whether or not we should have our tready be high. So let's think more completely about how to lay out that tready logic!
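For the first of these, a minimal sketch, assuming the reconstructor emits full-resolution 1280x720 coordinates (hypothetical signal names; adapt to whatever coordinates your pixel reconstructor actually outputs):

```systemverilog
// Minimal sketch of tlast generation into the stacker: flag the final
// pixel of each 1280x720 frame (hypothetical names; adapt to the
// coordinates your pixel reconstructor actually outputs).
assign camera_tlast = (camera_hcount == 1279) && (camera_vcount == 719);
```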
The data pipeline we're using here has a FIFO that we can rely on always having some amount of data in it. So, we can expect that the tvalid signal out of our unstacker will always be high, but we only sometimes actually need to consume a pixel out of this stream of data; other times, we might be busy hsyncing or vsyncing. Every cycle where tvalid and tready are both high is a successful ready/valid handshake, or a cycle where the unstacker will think you've listened to it and sent a pixel on its way to be drawn on the screen. So you should only let tready be true if you actually are sending a pixel to be drawn on the screen. If you "lie" to the unstacker, and don't use the pixel when you said you got it, you're out of compliance.
If you see that the tlast signal coming out of the unstacker is high, but you're not about to draw the pixel in the bottom-right corner of the screen, you know that something is out of sync. What can you do about this? You can wait until things are right again! If your tlast signal is high and you're not about to draw the bottom-right corner of the screen, you can hold your tready low, which makes sure that the pixel destined for the bottom-right corner stays in place in the unstacker until it's actually time for it to be drawn. So, if you see a tlast signal, your tready should only be true if you're drawing the bottom-right corner.
All of that should be expressible in a single line of combinational logic; and you need to set your tready signal combinationally, to ensure that it's not a cycle delayed. If your tready updates a cycle too late, the unstacker might think you consumed a piece of data that you never actually did, leaving it lost forever!

Write out a single line of combinational logic for your frame_buffer_tready signal, based on the values of active_draw_hdmi, frame_buffer_tlast, hcount_hdmi, and vcount_hdmi. If you're not sure how to write out that expression, talk to a friend or a staff member about it.
SUMMARY: In your top_level:
- Fill in the logic indicating how to generate a tlast for the burst of reconstructed pixels.
- Fill in the logic of what your tready signal should be at the output of the unstacker, when you receive your pixel values.
Generating Traffic (You Are the Traffic)
Now that you have the outside logic of your modules laid out, there's just one piece missing: the module that takes your data needs, and turns them into commands for the MIG. This type of module is referred to by the MIG documentation as a traffic generator: it creates the commands that need to be fed into the MIG. Here, our traffic generator is connected to the data output from both of our AXI-Stream FIFOs: it reads the data from the FIFO that we used to feed data in from the camera, and writes data into the FIFO that we use to draw our HDMI output. It's the only module that we have which exists on the UI clock of the MIG.
Most of this traffic generator is already written for you: it has a state machine that switches back and forth between issuing read commands and write commands in bursts, to avoid having too many dead cycles. And it sets the signals for the FIFOs it's connected to, indicating when it actually consumes a message from the input FIFO with a tready signal, or writes a message to the output FIFO with a tvalid signal. The part you need to add in is a way to determine the address that should be associated with these commands.
There are three addresses in memory we need to keep track of:
- write_address: The memory address that the next piece of data consumed from the write AXI-Stream FIFO needs to be written into. This should increment every time a ready/valid handshake happens with the write AXI-Stream FIFO, and should reset every time there's a ready/valid handshake where tlast is high.
- read_request_address: The next memory address that we need to submit a read request for in order to read out the FIFO in order. There's no actual AXI-Stream associated with this signal, since no data needs to accompany a read request other than the address, but we've defined read_request_valid and read_request_ready which behave the same as a normal AXI-Stream, and a cycle where they're both true indicates that a read request has been submitted and your address should increment.
- read_response_address: The memory address associated with the next read response. This should increment every time a ready/valid handshake happens on the AXI-Stream between the traffic generator and the read-output FIFO.3 This address won't get fed into a MIG request, but we need it in order to know when the tlast signal should come! (The state-machine logic we provide you also needs it, to know how many read requests are still outstanding.)
I wonder if we might have a friendly little module sitting around somewhere that does a great job at counting events? Oh, I don't know, maybe an evt_counter? Make sure to use your evt_counter from lab 03, since it must be able to work with the number you determined in the question above. Set up some event counters that track the ready/valid handshakes that happen on each of these AXI-Streams!
Again!!!... Make sure that you are using the evt_counter you created that can count up to an arbitrary number and does not need to rely on natural overflow! This is very important! Also note that you may need to modify the size/capacity of your evt_counter to work as you need for this lab! (16 bits may not be enough.)
SUMMARY: In your traffic_generator, add:
- Three event counter-type modules to specify the addresses we associate with write requests, read requests, and read responses.
- A definition of the tlast signal, based on the address associated with the current read-response transaction.
Debugging this stuff in hardware can really suck. Your builds take 3 minutes now, and it's hard to determine where in this whole pipeline your error came from by looking at a garbled camera output. If you're stuck, write a testbench for your addressing evt_counter, and make sure you're confident about the events that define your addressing! If a staff member gets called over to help you and you don't have a testbench to help us help you, you'll be dropped from the queue and will need to write one.
A completed data stream
At this point, you've fully hooked up the addressing that needs to be done to set up your data pipeline! Switch your sw[0] to 1, and if all your addressing and AXI-Stream signals are configured properly, you'll see your camera output in the true 720p quality it was destined to be! Flip back and forth between your subsampled BRAM buffer and the DRAM buffer--really take it in. A good thing to look at while switching between the two modes is text (like on your phone or a piece of paper). The 2D spatial frequencies of text are high enough that when improperly downsampled (as we did last week and this week) without anti-aliasing filtering, you can get some really bad-looking results. At full resolution, things should look okay.
Show a staff member your full-quality image buffer! Be ready to discuss how your pipeline works.
[OPTIONAL] Zoomed-in Object Tracking
Everything below here is optional. It's cool, but optional. Anything marked "part 2" in the top_level.sv file is relevant to this zoom-tracking stuff, and you won't need it for your checkoff. (If you do it, feel free to show it off in your checkoff!)
Now that we've got this higher-quality4 image to work with, what do we want to do with it? Look at it CLOSER! Last week, we calculated the location of an object in our camera feed based on object tracking, and made popcat show up on top of it. Now, let's build a new way to track our object: by zooming our camera feed in, centered around the object we've detected!
So how do we go about seeing a zoomed-in view of our frame buffer? It'll involve a little bit of a coordinate transformation: for each displayed hcount and vcount value, we need to know which [x,y] coordinate from our original camera image we need to draw, based on the zoom multiplier and the [center_x,center_y] coordinate we want the zoom to focus on. For the sake of friendly bit-shift-able multiplications, we'll choose to implement a 2X zoom.
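Concretely, for a 2X zoom on a 1280x720 display, one plausible form of that transform is the following (hypothetical names; the 640 and 360 offsets are half the display dimensions, and edge clamping is omitted):

```systemverilog
// One plausible 2X-zoom coordinate transform for a 1280x720 display
// (hypothetical names; edge clamping omitted). Each pair of display
// pixels maps onto one camera pixel, re-centered on the zoom center:
//   x = zoom_center_x + (display_x - 640) / 2
//   y = zoom_center_y + (display_y - 360) / 2
logic signed [12:0] x, y; // signed: intermediate values can dip negative
assign x = $signed({2'b0, zoom_center_x}) + ($signed({2'b0, display_x}) - 640) / 2;
assign y = $signed({3'b0, zoom_center_y}) + ($signed({3'b0, display_y}) - 360) / 2;
```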
Having this formula gets us a lot of the way there for displaying the zoomed-in output! But there is one hiccup: we stored all our pixels in 128-bit messages, and when reading out from memory we have to grab the 8 pixels that message represents in a row! That means that if two adjacent display_x coordinates need to read the same pixel, our current sequence of 8 pixels won't be able to handle the new pattern.
To remedy this, we can rely on the fact that every pixel drawn on our output needs to be drawn twice in a row! And outside of this duplication, we're still reading our pixel data in the order it appears in the frame buffer. So, let's instead change the way we define our tready signal at the end of our DRAM access stream: we only want to see the data of a new pixel on every other cycle of clk_pixel now, so that we draw the pixel data we see each time twice in a row. Thus, every 128-bit message, carrying 8 pixels worth of data, will be used to draw 16 pixels on the output display.
SUMMARY: In your top_level:
Update your frame_buff_tready signal so that if zoom_view is high, we only assert tready every other cycle, so that each pixel will be drawn on the output twice in a row.
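A minimal sketch of that every-other-cycle gating (hypothetical names; fold this into your existing tready expression, including the tlast re-sync condition from earlier):

```systemverilog
// Minimal sketch of every-other-cycle gating for zoom mode (hypothetical
// names; combine with your existing tready expression and tlast re-sync).
logic pixel_phase; // toggles once per actively drawn pixel
always_ff @(posedge clk_pixel) begin
  if (active_draw_hdmi) pixel_phase <= ~pixel_phase;
end
// In zoom mode, consume a new pixel only on every other drawn pixel,
// so each camera pixel is displayed twice in a row.
assign frame_buff_tready = active_draw_hdmi && (!zoom_view || pixel_phase);
```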
Now, we need to actually change the memory addresses we access to draw our zoomed-in view! We can do this by replacing the evt_counter we use for our read requests with a new zoomed_address_generator. It should still increment on some event we provide it with, but instead of plainly incrementing addresses, it should output the next address that should be used when drawing the output. We've provided you with a skeleton for this module: complete it with some logic! Some things to think about as you write this (a sketch follows the list):
- You're drawing the output image by progressing in order through your display_x and display_y coordinates, so it may be helpful to store local variables that keep track of what output coordinate you're trying to draw with the current memory access, incrementing each time your valid-transfer event occurs.
- Each memory access is being used to draw 16 pixels on the output, so, considering that, your display_x counter should increment appropriately on each evt_in.
- The change of coordinate systems can be used to determine the [x,y] coordinate you need to draw out.
- Your output address should address a 128-bit-wide memory slot in the DRAM: if you calculate the address of an individual pixel, you may need to divide the address you derive by 8.
- Alongside your address generation, generate a tlast signal that is high when you reach the last memory address in a given frame.
- Using sequential logic here is chill, and since you might have a multiply in your logic, it may well be worth it in order to make sure you meet timing constraints.
As you write this, use a testbench that tests different zoom_center coordinates to ensure you're generating the appropriate memory addresses and tlast signals.
- Implement your zoomed_address_generator, based on the skeleton of the module we've provided you. It should generate the next address you need to draw in a zoomed view each time an evt comes in.
- Instantiate a couple copies of the module in your traffic_generator to manage your read request and read response addresses. Wire up your events in each case to be ready/valid handshakes again. Make sure to uncomment the input signals related to zoom_view_en, zoom_view_x, and zoom_view_y in the traffic_generator.
Finding our Center
Finally, we need to determine the actual x and y coordinates to use here! We want to use our center of mass calculation, but there's one complication: while we're in the zoomed-in view, we've calculated our center of mass based on the pixels we draw on the screen, which means it's based on the zoomed-in version of the view! We need to convert that back to the original coordinate system: you can use the x_com_transform and y_com_transform variables to perform the transformation back to the original coordinate system. Each time your center of mass calculation finishes, we'll store the appropriate version of the coordinates to our long-term storage x_com and y_com.
We also have some pre-configured logic to choose the center of your zoom coordinate based on a weighted average between the current zoom center and the newly calculated center of mass. This helps our zoom field move a little more smoothly, so your frame of reference doesn't constantly jitter with slight changes in your calculated center of mass. Feel free to change how it updates the center of mass as you'd like! You might also want to update this logic to make sure your zoom center doesn't get too close to the edges! What choices of center coordinates would be too close to the edge?
Finally, in order to access this value in the traffic_generator, we need to do some clock domain crossing! We generate this value in the clk_pixel domain, but we need it in the clk_ui domain. So as to avoid worrying about metastability and the like, we'll let a Xilinx macro handle this safely for us, just like we did with the clock-domain-crossing FIFO. Uncomment its definition near the top of the top_level.
SUMMARY: In your top_level:
- Define x_com_transform and y_com_transform to convert back to the original coordinate system.
- [optional] Clamp the values of zoom_center_x and zoom_center_y so they don't go too close to the edge of the screen, so we can successfully draw our zoom field.
- Uncomment the clock-domain crossing module to access these variables in the UI clock domain. Uncomment the connections for zoom_en and the zoom center coordinates in the traffic_generator as well.
At this point, your data handling to draw a zoomed-in view should be up and running! Setting sw[7:6] to 2'b11 will now switch you into the zoomed view; it might be worth it to tune your mask to catch the object you want to track first, so that your zoomed view knows where to go.
Footnotes
1UI meaning user interface. The user of the MIG IP is you! And your modules.
2In the literature, a structure like this is typically called a gearbox FIFO.
3 This is distinct from the read request address because we can send in multiple read requests before we get responses! But we know that the memory responses will come back in order, so the read requests that we put in in order will also come back in order!
4I know, it's not as fancy as your iPhone, but oh my god it's so much better than last year's cameras