Checkoff 02: Camera I

Working Camera and Object Tracking

The questions below are due on Wednesday October 09, 2024; 11:59:00 PM.
 
You are not to remove a camera module from lab. We do not have enough for everyone to have one simultaneously. They need to be shared. If we catch you taking one from lab space, there will be consequences. When you are not working with the camera, please return it to its designated area at the front of the lab.

Ok now we have a couple components that we can piece together:

  • we can draw a sprite from memory,
  • we have the design to capture data from a camera.

Here we'll assemble a system to display our camera data to the HDMI output, and process some of the pixel data so we can find the center of a pink object in the camera field, like the foam from your lab kit, and have Popcat follow the foam. Here's a quick demo of what this'll look like at the end:

The video is from two or three years ago. We now have different boards, a nicer camera, and a slightly different seven segment display readout, among a handful of other minor changes. But the overall behavior is roughly the same so I'm not going to make a new video. They did a good job with this one.

Going from the raw camera pixels to controlling popcat with the pink foam will require:

  • Reading camera data from the OV5640. You just wrote the module to handle this in the last exercise.
  • Subsampling the OV5640 data so it can fit into the BRAM memory of the FPGA, and scaling up the BRAM storage to fill the HDMI output.
  • Splitting these frames into their RGB and YCrCb components.
  • Selecting pixels from the frame based on their values in a particular channel.
  • Running a simple threshold on the selected channel to generate a mask.
  • Finding the centroid of our selected pixels using a center of mass method.
  • Outputting an image sprite at the coordinates of the centroid.
  • Adjusting the overall system pipeline so that latency is consistent across all data paths.

All of these steps stack up to a linear image processing pipeline that's conceptually pretty simple, but has a very nuanced implementation! Only a handful of the operations we'll perform will be possible with purely combinational logic - we'll need a fair bit of sequential logic, which will add delay to our pipeline. Properly compensating for these delays as we implement our pipeline is the focus of this lab.

Top-Level Schematic

Ok time to grow and build our checkoff 1 project. Don't make a new project (or if you want to do that, make a copy of where you ended after popcat) since we need to just add more files into the mix. Save your top_level.sv file from the previous checkoff as some backup name (backup_top_level.sv) and move it out of your hdl folder (perhaps make a backup_hdl folder that isn't used during builds). Replace your active top_level.sv with the one found in this zip.

In addition, this zip has the following files you should also add to your hdl folder:

  • camera_registers.sv: Module for configuring settings on the camera.
  • channel_select.sv: A module to pick from different color channels.
  • video_mux.sv: A module that combines mask and drawing information for different outputs.
  • threshold.sv: A module to mask the selected color channels.
  • rgb_to_ycrcb.sv: A module for converting RGB colorspace to YCrCb colorspace.
  • cw_fast_clk_wiz.v and cw_hdmi_clk_wiz.v: two pre-built clocking wizards for getting the different clocks we need for camera interpretation + HDMI output.
  • xilinx_true_dual_port_read_first_2_clock_ram.v: A two-port BRAM module we'll use for a frame buffer; it's the same as the one from your audio buffer in week 3.
  • divider.sv: A standard divider discussed in class (a multi-cycle, state-machine based approach to division).
  • lab05_ssc.sv: A modified seven-segment display module useful for this lab (it displays information from the mask thresholds and the selected channel).

It also includes the following starter skeleton that you'll complete in this lab:

  • center_of_mass.sv: A module to calculate the center of mass of the thresholded/masked pixels in our image so we can use them for tracking.

Alongside the files you had for the first checkoff, make sure to include your pixel_reconstruct.sv, which you just wrote and tested in the last exercise.

Then, two more files:

  • To go along with the camera_registers.sv module, please also download this rom.mem here and add it to your data folder.
  • Replace your current top_level.xdc file with this new top_level.xdc file. It names all the connections for the camera connector board we're using now.

Finally, download this zip, unzip it, and put it in your project directory (there should be a folder called ip now). We have to use this for a frame buffer because Vivado broke something in 2024.1 and Xilinx doesn't want me to be able to have a weekend.

The image below is the almost-full schematic of the rest of lab as it stands. The yellow items are pieces of code/modules you need to either finish or write and then integrate. The orange pieces are ones you've already written and just need to integrate. Everything else is present, but should be studied.

lab_setup

The approximate block diagram for the rest of lab 05. Yep, it's quite big...tell me something I don't know. It would probably be good to open this in another window. The things in yellow are what you are designing and/or finishing in this portion of the lab. The things in orange, you've already designed (this week or previously). Also note that some things (like the downstream TMDS/HDMI conditioning, as well as the camera controller) are not shown for the sake of simplicity, but also exist in the design.

Information flows in the schematic from left to right, which shows a few of the steps mentioned previously, as well as a LOT of intermediary steps. The code for top_level is pretty heavily commented so you should study how the entire system works together. Everything is connected, but some data paths aren't used sequentially while others are. Read through the top level and its comments! It'll help you parse where your pieces fall in place in the data pipeline!!

Camera Clock Domain: saving camera data

The camera and its output are completely disconnected from how/what is getting drawn on the monitor at any given point in time. Sure, the camera pixels do show up on the screen, but the two halves of the system are separated from one another via a frame buffer, which is a two-port memory structure. The camera writes data in one side at its own pace and the actual video-render stream pulls data out from that memory at its own separate pace, just like with the audio received over UART back in week 3. This allows us to not have to worry about pipelining or syncing the camera to the video stream.

Having this separation is especially important so that our camera capture infrastructure can operate at a different clock speed! We're going to clock our pixel_reconstruct module at 200MHz. The pixel clock from the camera provides us data at 50MHz, and as you saw when writing your logic, we need to sample it frequently enough to know when we're seeing a rising edge of the clock. Sampling at a rate a few times faster than PCLK gives us confidence that we'll catch every data frame the camera provides us.
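
As a refresher on why that oversampling matters: detecting a PCLK rising edge inside the 200MHz domain can be as simple as comparing the current sample against the previous one. A minimal sketch is below; the module and signal names are illustrative, not necessarily the ones in your pixel_reconstruct.

    module pclk_edge_detect (
        input  logic clk_200,         // 200 MHz camera-domain clock
        input  logic pclk_sampled,    // camera PCLK, already registered into this domain
        output logic rising_edge_out  // one-cycle pulse per PCLK rising edge
    );
      logic pclk_prev;
      always_ff @(posedge clk_200) begin
        pclk_prev       <= pclk_sampled;                 // remember the previous sample
        rising_edge_out <= pclk_sampled && !pclk_prev;   // high for one fast-clock cycle per edge
      end
    endmodule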

The one other component existing on this clock domain is the system that sets the settings on our camera. In order to get the camera up and running, we need to put it in the proper mode to give us properly formatted data that matches the operating mode we expected when writing our pixel reconstructor. The camera settings are set over the I2C protocol, and we have that fully implemented for you.1 You don't need to worry about that system--when you click reset, it'll start up the camera in the proper mode.

Our camera doesn’t use the 24-bit {R, G, B} full-color format as we saw in week 04. Instead it uses a 16-bit color scheme called RGB565 (which we've seen in week 1), where 5 bits are given for the red channel, 6 for the green, and 5 for the blue. We'll store the 16 bits we get for each pixel as-is, but when we output our pixels to the display we'll need to zero-pad our values to get 8-bit colors for each channel.
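
For concreteness, that zero-padding step might look like the sketch below. The module and signal names are illustrative; check where (and whether) the provided top_level already does this for you.

    module rgb565_expand (
        input  logic [15:0] pixel_in,   // {R[4:0], G[5:0], B[4:0]} straight from the frame buffer
        output logic [7:0]  red_out,
        output logic [7:0]  green_out,
        output logic [7:0]  blue_out
    );
      always_comb begin
        red_out   = {pixel_in[15:11], 3'b000};  // 5 bits of red, padded to 8
        green_out = {pixel_in[10:5],  2'b00};   // 6 bits of green, padded to 8
        blue_out  = {pixel_in[4:0],   3'b000};  // 5 bits of blue, padded to 8
      end
    endmodule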

Pixel Clock Domain: generating display output

For the actual video stream pipeline (which is the bulk of the design), things of course start with the video_sig_gen module like before. This generates all the appropriate timing signals; of particular importance are the hcount and vcount values, which get passed to a lot of modules. We want to render the camera's images on the screen starting in the upper left corner and filling as much of the screen as possible. In order to accomplish this we first need to scale things up (you'll write this module in a little bit). All this involves is modifying the x,y values we use to look up pixels in our frame buffer from the camera. After we scale our coordinates to match our BRAM buffer, a corresponding pixel address is generated and used to grab a value from the camera's frame buffer. Here, since we're working with our video output pipeline, we'll be using the second port of our BRAM and clocking it on our 74.25MHz pixel clock. The pixel data is then passed along to the rest of our data pipeline.

The first thing to do is to take all of our color components from each color space: R, G, and B, which come from the camera natively, but also Y, Cr, and Cb (review lecture 7, where we went over video, for an explanation of what Y, Cr, and Cb are). These six different color channels are what we'll try to track objects on. Next we choose one of these six channels using the channel_select module.

This selected channel is then handed off to a module that performs a thresholding operation. What this means is that the value of the selected channel we're focusing on (Red, Green, Blue, Chroma Red, Chroma Blue, or Luminance) is compared against a lower bound and an upper bound. If it is between those, it is "valid" and a 1 is output. If it isn't, a 0 is output. The result is that this module generates a threshold mask signal.
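
We've provided threshold.sv to do this for you, but conceptually it's just a bounded comparison, along the lines of the sketch below (the ports here are illustrative and won't necessarily match the provided module).

    module threshold_sketch (
        input  logic       clk_in,
        input  logic [7:0] channel_in,      // value of the selected color channel
        input  logic [7:0] lower_bound_in,  // however the switch bounds get scaled to 8 bits
        input  logic [7:0] upper_bound_in,
        output logic       mask_out         // 1 if the pixel lies within the bounds
    );
      always_ff @(posedge clk_in) begin
        mask_out <= (channel_in >= lower_bound_in) && (channel_in <= upper_bound_in);
      end
    endmodule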

The masked value of every pixel is then handed off to a module that determines the center of mass of all pixels in a given frame falling between the thresholds. The centroid of the object is then estimated by the center_of_mass module, which works by calculating the average position of every pixel in the mask and using that as the mask's centroid.2 You'll implement this module a bit later.

  • The x, y coordinates of the centroid are passed to an image_sprite, which is drawn with its center at the mask's centroid. A crosshair that tracks the centroid is also drawn from the same coordinates.

  • The mask, image sprite, and crosshair are all fed to the video_mux module, which determines what's shown on the display; this allows a huge number of combinations and choices about what is displayed for the purposes of debugging. The full-color camera output is also fed to the module so you can check the image input to the pipeline. We've also routed the Y-channel (greyscale) camera output to the mux so you can see the mask/crosshair/image sprite over a greyed-out frame.

There are a lot of steps in this pipeline! Take some time to get comfy with it. We've provided most of the boilerplate connections in the top_level, but you'll be implementing the scale and center_of_mass modules, and you just implemented the pixel_reconstruct.

Switch Reference

The onboard switches control the image pipeline and have the following meanings.

  • sw[0] and btn[1] control the scale of the image, where
    • btn[1]==1 scales the image by a factor of 1
    • btn[1]==0, sw[0]==0 scales the image by a factor of 2
    • btn[1]==0, sw[0]==1 scales the image by a factor of 4
  • sw[3:1] controls the color channel used to produce our mask, where:
    • 000 selects the red channel
    • 001 selects the green channel
    • 010 selects the blue channel
    • 011 just outputs zeros
    • 100 selects the Y (Luminance) channel
    • 101 selects the Cr (Chroma Red) channel
    • 110 selects the Cb (Chroma Blue) channel
    • 111 just outputs zeros
  • sw[5:4] controls how the color mask is displayed, where:
    • 00 selects the raw camera output
    • 01 selects the color channel being used to produce the mask as a grayscale image. For example, if the blue channel was selected with sw[3:1]=010, then we'd output the 24-bit color {b, b, b} to the screen.
    • 10 displays the mask itself (a black/white only image)
    • 11 turns the mask pink and overlays it on a greyscale of the camera's chroma channel.
  • sw[7:6] controls what's done with the CoM information:
    • 00 nothing
    • 01 crosshair
    • 10 sprite on top
    • 11 magenta background for testing
  • sw[11:8] sets the lower bound on the color channel value that generates the mask.
  • sw[15:12] sets the upper bound on the color channel value that generates the mask.

We have limited cameras given the size of the class. You are not to remove the cameras from the lab space. When you are done in lab for the day, you need to leave the camera in lab at the front.

Before you get started, run to the front of lab and grab a camera board, and plug it into the PMODA and PMOD+ pins. Make sure you don't have the lens cap on.

You'll also use:

  • The pink foam from inside your lab kit, which we'll use as our target.
  • An HDMI monitor at your station, just like in week 04.

Your setup should look like this once you have everything.

lab_setup

Camera Subsampling

The OV5640 camera sensor is giving us 720p camera data, which is pretty great! Unfortunately, we don't have the space to store all that camera data in our BRAM. Soon, we'll introduce higher-capacity memory paths to let us truly use all that image data, but for now we'll need to cut that data to a more reasonable size, so we can stick it all in a BRAM buffer.

We'll cut the camera data down to size by only paying attention to a subset of the hcount and vcount coordinates we're working with; specifically, every fourth pixel in either dimension. By cutting down our data, we'll end up storing a 320x180 image in memory. This process is known as subsampling3.
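
To see why this matters, compare the storage needed at 16 bits per stored pixel:

1280 \times 720 \times 16 \text{ bits} \approx 14.75 \text{ Mbit} \qquad \text{vs.} \qquad 320 \times 180 \times 16 \text{ bits} \approx 0.92 \text{ Mbit}

The subsampled frame is a factor of 16 smaller, which is what lets it fit in a BRAM-based frame buffer.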

Find the instantiation of your pixel_reconstruct module, and following it, use some sequential logic and the camera_hcount, camera_vcount, camera_pixel, and camera_valid signals to set the addra, valid_camera_mem, and camera_mem signals that are feeding the "write" side of our frame buffer. You should only be writing to memory if the valid pixel from your reconstructor has both coordinates divisible by 4.
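
If you want a starting point, that write-side logic might look something like the sketch below, dropped into top_level after the pixel_reconstruct instantiation. The clock name and the row-major address math here are assumptions; check the comments in top_level for the exact conventions it expects.

    // write side of the frame buffer: store every 4th pixel in each dimension
    always_ff @(posedge clk_camera) begin  // clk_camera: the 200 MHz camera-domain clock (name assumed)
      if (camera_valid && camera_hcount[1:0] == 2'b00 && camera_vcount[1:0] == 2'b00) begin
        addra            <= (camera_hcount >> 2) + 320 * (camera_vcount >> 2);  // assumed row-major 320x180 layout
        camera_mem       <= camera_pixel;
        valid_camera_mem <= 1'b1;
      end else begin
        valid_camera_mem <= 1'b0;
      end
    end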

Image Scaling

The frames we write into our framebuffer BRAM are 320x180 for the sake of storage, and are being updated at around 30 frames per second. As you've probably noticed by now, this doesn't look very big on our 1280x720 monitors! We'd like to enlarge it so that it's easier to see what's going on. We'd like to be able to scale our camera output by a factor of 1, 2, or 4. We'll be doing this by making it such that multiple pixels on the display obtain their color values from the same originating pixel - this is otherwise known as upscaling and can be done by addressing the BRAM intelligently.

With that out of the way, let's turn to actually implementing the up-scaling logic. It's essentially the inverse of the subsampling logic, with some variability based on the scale factor.

  • If btn[1]==1: corresponds to 1X scaling, and produces a 320x180 image
  • else if...
    • sw[0] = 0: corresponds to 2X scaling, and produces a 640x360 image
    • sw[0] = 1: corresponds to 4X scaling, and produces a 1280x720 image

camera_scaling

Camera Scaling. The image here is from an old camera; your camera will look way nicer.

This HDL should be relatively simple and have one clock cycle of delay total. It may help you to calculate the scaled hcount and vcount values internally, and use those to calculate the proper address. There is some starter structure provided for you. By default the system only scales by a factor of 1X, so make sure you can get 2X and 4X working as specified above.
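
Here's a hedged sketch of that approach. The port names and the row-major address convention are illustrative, so adapt it to the ports in the provided starter structure (and note it ignores bounds-checking for display pixels outside the scaled image region).

    module scale_sketch (
        input  logic        clk_in,
        input  logic [10:0] hcount_in,
        input  logic [9:0]  vcount_in,
        input  logic        scale_1x_in,  // btn[1]
        input  logic        scale_4x_in,  // sw[0], only consulted when btn[1]==0
        output logic [15:0] addr_out      // address into the 320x180 frame buffer
    );
      logic [10:0] h_scaled;
      logic [9:0]  v_scaled;

      always_comb begin
        if (scale_1x_in) begin            // 1X: each display pixel maps to its own buffer pixel
          h_scaled = hcount_in;
          v_scaled = vcount_in;
        end else if (scale_4x_in) begin   // 4X: 4x4 blocks of display pixels share one buffer pixel
          h_scaled = hcount_in >> 2;
          v_scaled = vcount_in >> 2;
        end else begin                    // 2X: 2x2 blocks of display pixels share one buffer pixel
          h_scaled = hcount_in >> 1;
          v_scaled = vcount_in >> 1;
        end
      end

      always_ff @(posedge clk_in) begin   // one cycle of delay, as specified
        addr_out <= h_scaled + 320 * v_scaled;  // assumed row-major layout
      end
    endmodule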

Masking

Now that we've got a camera feed that's large enough to work with, we can move onto the masking step in our image pipeline. We've provided the Verilog to produce the mask based on the values in the selected color channel - that's all in the threshold module. Your job here is to mess with the switches to see what can and can't be well filtered and tracked with your system, starting with the pink foam from your lab kit.

Tweaking the switches gives you sufficient flexibility to see pretty much anything in the video processing pipeline; feel free to check the switch reference above if you need a refresher on what each one does. Setting sw[5:4] = 2'b11 is particularly useful as it draws the mask in pink, while leaving everything that's unmasked in greyscale. The greyscale here is taken from the luminance of the camera image, which is the Y channel in YCrCb space.

The seven segment LEDs will display your current upper and lower thresholds (in binary) as well as your color channel selection (r,g,b,y,Cr, or Cb). Using this in combination with the video out on the screen should guide your intuition towards the mask generation here. Start by trying to detect the pink foam, and then move on to other kinds of objects. Find another object that's reasonably well detected by thresholding on one particular channel. See what works and what doesn't.

image_types

Cr mask values superimposed on Chroma channel (left). Green Mask only (right)

Checkoff 2A:
Demonstrate your working scale module in hardware. We'll want to see the image size change based on the values of sw[0] and btn[1]. Also be prepared to explain to us how the masking works. Show us how to track the pink foam, and another object of your choice. If you run across anything else interesting, show us that too!

Center of Mass

Now that we've generated our mask, we'll want to find its approximate center, or centroid. We'll compute this once per frame, and it'll work by taking the center of mass (CoM) of the mask.

...wait, physics? In my digital design class? No. I purposely chose EECS so I could avoid math. Too bad.

Yep! Just like we can compute the CoM for any object in the real world, we can compute the CoM of our mask. And just like how in real life the CoM of an object is approximately at its center, the CoM of our mask will be approximately at the center of our mask. And since our object is being selected out of the scene by our mask, the mask's CoM should be pretty close to the center of the object we're tracking.

We can think of our mask as a collection of pixels that all have some 'mass', and are all connected together. Back in 8.01 we had a formula for finding the center of mass (x_{CoM}, y_{CoM}):

m_{total} = \sum_{n} m_n

x_{CoM} = \frac{\sum_{n} m_n * x_n}{m_{total}}

y_{CoM} = \frac{\sum_{n} m_n * y_n}{m_{total}}

Where m_{total} is the total mass of our object, calculated as the sum of the masses of all the smaller objects that comprise it. x_{CoM} and y_{CoM} are then computed by taking the mass-weighted sum of each component's position and dividing by the total mass of the object.

This works for physical objects in the real world, but since our mask can only be one or zero (it either lies within the thresholds we've set, or not), we can treat every pixel as having the same mass, which reduces the formula to:

m_{total}= \sum_{n} 1

x_{CoM} = \frac{\sum_{n} x_n}{m_{total}}

y_{CoM} = \frac{\sum_{n} y_n}{m_{total}}

If you notice, this has actually just reduced to taking the average of every pixel's x and y coordinates! This is super simple, and is sufficient to estimate the center of our object.
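
As a tiny worked example, if the mask contained only the three pixels (2,5), (4,9), and (6,10), then:

x_{CoM} = \frac{2 + 4 + 6}{3} = 4 \qquad y_{CoM} = \frac{5 + 9 + 10}{3} = 8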

Let's turn to implementing this in Verilog. We've given you a skeletonized version of it in hdl/ that you'll fill out with the proper logic. This module has the following inputs:

  • clk_in: system clock, in this case the 74.25 MHz pixel clock
  • rst_in: system reset
  • [10:0] x_in: horizontal position of the current pixel being provided to the module
  • [9:0] y_in: vertical position of the current pixel being provided to the module
  • valid_in: indicates a valid (x,y) or (horizontal, vertical) point to be added.
  • tabulate_in: Used to trigger the final calculation of the average horizontal and vertical pixel position (will be a single-cycle assert). You can use the "new frame" signal for this.

And the following outputs:

  • [10:0] x_out: Calculated average 11 bit horizontal position
  • [9:0] y_out: Calculated average 10 bit vertical position
  • valid_out: Indicates a valid (x,y) or (horizontal, vertical) point has been produced.

The module should work by taking the average of all "valid" pixels on a frame-by-frame basis. It should do this by summing the x and y locations of every valid pixel (indicated by the valid_in signal); along with this it should tally how many pixels are recorded. When tabulate_in is triggered, the system should enter a dividing state where it uses the total sum of x positions along with the total number of pixels tallied to calculate an average x position for all valid pixels; it should do the same for the y dimension as well. We have provided a 32-bit variant of the divider from a few lectures this term that you can use with this module (called divider.sv in the lab starter code). The latency of the divider module is variable, depending on what is being divided, so you will need to build your center of mass FSM to account for this. You cannot use an alternate divider. The point is to get some good practice with major/minor FSM design. Review Lecture 06 for discussion of the major/minor FSM abstraction. When the average has been calculated for BOTH dimensions, your center_of_mass module should make sure to have those two values on the x_out and y_out outputs and indicate their validity via a single-cycle assertion of valid_out. Downstream logic will be looking for this and will update the x_com and y_com variables in top_level appropriately.
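
To make that major/minor FSM structure concrete, here is one possible skeleton. This is a sketch, not a drop-in solution: the divider hookup is only indicated in comments (its real port names live in divider.sv), the widths are loose, and edge cases such as valid_in coinciding with tabulate_in are left for you to handle.

    module center_of_mass_sketch (
        input  logic        clk_in,
        input  logic        rst_in,
        input  logic [10:0] x_in,
        input  logic [9:0]  y_in,
        input  logic        valid_in,
        input  logic        tabulate_in,
        output logic [10:0] x_out,
        output logic [9:0]  y_out,
        output logic        valid_out
    );
      typedef enum logic [1:0] {ACCUMULATE, DIVIDING, REPORT} state_t;
      state_t state;

      logic [31:0] x_sum, y_sum, pixel_count;

      // Divider hookup is sketched only: instantiate two copies of the provided
      // divider.sv (one per dimension) using its actual port names. Here we assume
      // each produces a done pulse (x_done/y_done) and a quotient (x_quot/y_quot).
      logic        div_start, x_done, y_done;
      logic [31:0] x_quot, y_quot;
      logic        x_ready, y_ready;  // latched as each division finishes

      always_ff @(posedge clk_in) begin
        if (rst_in) begin
          state <= ACCUMULATE;
          x_sum <= 0; y_sum <= 0; pixel_count <= 0;
          x_ready <= 0; y_ready <= 0;
          div_start <= 0; valid_out <= 0;
        end else begin
          valid_out <= 0;   // valid_out is a single-cycle pulse
          div_start <= 0;
          case (state)
            ACCUMULATE: begin           // collect sums over the frame
              if (valid_in) begin
                x_sum       <= x_sum + x_in;
                y_sum       <= y_sum + y_in;
                pixel_count <= pixel_count + 1;
              end
              if (tabulate_in) begin
                if (pixel_count == 0) begin
                  x_sum <= 0; y_sum <= 0;   // empty frame: never assert valid_out
                end else begin
                  div_start <= 1;           // kick off x_sum/pixel_count and y_sum/pixel_count
                  x_ready <= 0; y_ready <= 0;
                  state <= DIVIDING;
                end
              end
            end
            DIVIDING: begin             // wait out the variable-latency divisions
              if (x_done) begin x_ready <= 1; x_out <= x_quot[10:0]; end
              if (y_done) begin y_ready <= 1; y_out <= y_quot[9:0]; end
              if ((x_ready || x_done) && (y_ready || y_done)) state <= REPORT;
            end
            REPORT: begin
              valid_out <= 1;           // both averages are now sitting on x_out/y_out
              x_sum <= 0; y_sum <= 0; pixel_count <= 0;
              state <= ACCUMULATE;
            end
          endcase
        end
      end
    endmodule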

Your job is to not only write the center_of_mass module to spec, but also to verify it with a testbench. This is for your own good. This is not a module you want to be debugging on the hardware. Trust us. For a first test case, consider feeding the module 700 valid pixels, each time with x_in and y_in increasing from 0 to 699 on each successive pixel. Once you trigger tabulation you should expect a value of like 348 or 349 for both dimensions. Next try the same thing, but feed in only the value 10 to y_in. It should return 348 for x and 10 for y. Make sure your module provides only ONE valid_out signal per tabulate_in trigger. There is no requirement on the latency between tabulate_in and valid_out.
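
As a starting point, a bare-bones testbench for that first case might look like the sketch below (assuming your module uses the port names listed above). Flesh it out with the other cases described in this section.

    `timescale 1ns / 1ps
    module center_of_mass_tb;
      logic clk, rst;
      logic [10:0] x_in;
      logic [9:0]  y_in;
      logic valid_in, tabulate_in;
      logic [10:0] x_out;
      logic [9:0]  y_out;
      logic valid_out;

      center_of_mass uut (
        .clk_in(clk), .rst_in(rst),
        .x_in(x_in), .y_in(y_in),
        .valid_in(valid_in), .tabulate_in(tabulate_in),
        .x_out(x_out), .y_out(y_out), .valid_out(valid_out)
      );

      always begin clk = 1'b1; #5; clk = 1'b0; #5; end  // exact period doesn't matter in simulation

      initial begin
        rst = 1; valid_in = 0; tabulate_in = 0; x_in = 0; y_in = 0;
        repeat (3) @(posedge clk);
        rst = 0;
        // Feed 700 valid pixels along the diagonal (0,0) ... (699,699).
        for (int i = 0; i < 700; i++) begin
          @(posedge clk);
          x_in <= i; y_in <= i; valid_in <= 1;
        end
        @(posedge clk); valid_in <= 0;
        // Single-cycle tabulate, then wait out the variable-latency division.
        @(posedge clk); tabulate_in <= 1;
        @(posedge clk); tabulate_in <= 0;
        wait (valid_out);
        $display("x_out = %0d, y_out = %0d (expecting 348 or 349 for both)", x_out, y_out);
        $finish;
      end
    endmodule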

We will use this module in conjunction with the new_frame signal, which you'll remember from Lab 04 fires very early into the end-of-frame vertical and horizontal syncing period so we have thousands of clock cycles to play with while waiting for the division to come back.

A few other specifications that you should test for:

  • Your system must work repeatedly, one frame after the other, so make sure this is tested.
  • Make sure your system works properly with as little as one valid pixel in a frame.
  • Make sure your system works properly with as many as 1024\times 768 valid pixels in a frame (this will influence the sizes of your registers; see the bound sketched just after this list).
  • If tabulate_in is asserted and no valid pixels have been recorded, your module must respond appropriately (with no valid_out signal ever being asserted).
  • Make sure your system works robustly when x and y are fed different values, since this is what will be happening 99.999% of the time on the actual device.
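
One loose way to bound the register widths for that last case (assuming x stays below 1280, the active-video width):

\sum_{n} x_n \le 1279 \times (1024 \times 768) \approx 1.01 \times 10^9 < 2^{30}

so 32-bit accumulators for the coordinate sums are more than wide enough, and the pixel count itself fits in 20 bits (since 2^{20} > 1024 \times 768).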

Your Checkoff 2B will REQUIRE that you show the staff your testbench (cases discussed above) and results. So help me, if you "verify" your center_of_mass by giving it (1,1) and then (1,1) and then act like it outputting (1,1) is sufficient, you'll not only not get the checkoff, but lose prior checkoffs you have earned.

Once you've verified your code with testbenching, deploy it on your actual build. Set your switches so that the crosshair is enabled. You should be able to track something red quite easily on the Cr channel with appropriate mask settings.

Sprite-Time

What we'd like to do is allow the option of superimposing a 256x256 image sprite as well. Thankfully you can just pull that code over from the last section, and it should (largely) be an easy thing to integrate once you find where in the code it is expecting the image sprite.

crosshair

The crosshair tracking (pretty well, I might add) a piece of red. Masking channel set to Cr.

Make sure you position the center of your image_sprite over your center of mass point (not the upper left corner!!). The code should take care of that for us already, but just double check.

When this is done, time for the checkoff! Show Popcat tracking something. Be prepared to explain your center_of_mass module.

Checkoff 2B:
Show your center of mass module. Integrate your center of mass calculator into your circuit. Demonstrate both the crosshair and your image sprite existing on screen and following your center of mass! The pink foam from your 6.205 kits should work really well for chroma keying on Cr (or a pink phone background). For blue, a blue phone screen or a window has been found to work well on Cb.


 
Footnotes

1Specifically, we have a ROM set up and we have an I2C controller write the contents of the ROM to the camera. The ROM is initialized in rom.mem to set the white balance, encoding, (non)compression, image quality, etc., and changing its contents could get the camera running in new and exciting ways!

2Imagine if you printed out a frame from the camera, and used scissors to cut out everything but the masked part of the image. Using the center-of-mass method, we take the point where that cutout would balance as the centroid.

3We're implementing a pretty quick and dirty subsampling here; ideally, we would consider all the pixels given to us, and store an average of all pixels that fit into the "bin" of each value we're storing. This is absolutely possible to do, and would help us avoid aliasing artifacts, but for this lab we'll implement the simplest possible option.