Name(s): ________________________________________________

15-494/694 Cognitive Robotics Lab 7:
Convolutional Neural Networks

I. Software Update and PyTorch Setup

At the beginning of every lab you should update your copy of the vex-aim-tools package. Do this:

$ cd ~/vex-aim-tools
$ git pull

In addition, you will need to install some packages. First, activate the same Python virtual environment you use to run simple_cli. Then do the following:

$ pip install torch torchvision ultralytics

Note: if you're using Virtual Andrew and run out of disk space, you may need to create a Python environment in C:\Users\myuserid instead of in the default Desktop location in the andrew.ad.cmu.edu file system.
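To confirm that the packages were installed into the environment you actually activated, a small stdlib-only check like the one below can help. It is a sketch and not part of vex-aim-tools; the helper names are hypothetical. The GPU check only looks for an NVIDIA CUDA device and does not detect Apple's MPS backend.

```python
import importlib.util

REQUIRED = ["torch", "torchvision", "ultralytics"]

def check_packages(names=REQUIRED):
    """Return {package_name: True/False} for whether each one can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

def gpu_available():
    """True if PyTorch is installed and reports a usable CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return bool(torch.cuda.is_available())

if __name__ == "__main__":
    for name, ok in check_packages().items():
        print(f"{name:12s} {'installed' if ok else 'MISSING'}")
    print("CUDA GPU available:", gpu_available())
```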

II. Experiments with the MNIST Dataset and Linear Models

You can do this part in teams of two if you wish. When answering the questions below, you are encouraged to refer back to the lecture slides.

Make a lab7 directory.
Download the file mnist.zip into your lab7 directory.
Unzip the mnist.zip file and look inside your mnist directory.
Skim the mnist1.py source code. This is a linear neural network with one layer of trainable weights.
Have a look at the PyTorch documentation, and specifically the documentation for torch.nn.Linear.
Run the model by typing "python3 -i mnist1.py". The "-i" switch tells python not to exit after running the program. Press Enter to see each output unit's weight matrix, or type control-C and Enter to abort that part.
Try typing the following expressions at the Python prompt:
- model
- params = list(model.parameters())
- params
- [p.size() for p in params]
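The first parameter is the weight matrix connecting the 784 inputs to the 10 output units (PyTorch stores it as 10x784, one row per output unit, so its size prints as torch.Size([10, 784])); the second one is the 10 biases.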
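Each output unit has one weight per pixel, which is why its weights can be displayed as a picture. Assuming mnist1 flattens each 28x28 image row by row (the usual .view(-1, 784)), the pure-Python sketch below undoes that flattening; weights_to_image is a hypothetical helper, not a function in mnist1.py. Because nn.Linear stores its weight as (out_features, in_features), the PyTorch equivalent for output unit k would be params[0][k].detach().view(28, 28).

```python
IMG_SIZE = 28  # MNIST images are 28x28 grayscale

def weights_to_image(row):
    """Reshape a flat list of 784 weights into 28 rows of 28 values.

    Assumes row-major flattening: pixel (r, c) was input number r*28 + c.
    """
    if len(row) != IMG_SIZE * IMG_SIZE:
        raise ValueError(f"expected {IMG_SIZE * IMG_SIZE} weights, got {len(row)}")
    return [row[r * IMG_SIZE:(r + 1) * IMG_SIZE] for r in range(IMG_SIZE)]

# Example with fake weights where weight i is just its own index.
grid = weights_to_image(list(range(784)))
print(grid[1][0])  # first pixel of the second image row: 28
```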
How long did each epoch of training take, on average? ________________
If your laptop has a GPU, modify the model to use the GPU instead of the CPU. (You just have to uncomment one line and comment out another.)
Run the model on the GPU if you can. How long does each epoch take now? ________________
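Are you surprised? GPUs don't help much for small models: with only a few thousand weights, the time spent launching GPU operations and copying each batch of data from CPU memory to the GPU outweighs the time saved on the arithmetic.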
If you run mnist1 a second time, you won't get exactly the same result. Give two reasons for this: ________________________________________________
________________________________________________________________
Skim the code for the mnist2 model. This model has a hidden layer with 20 units. Each hidden unit is fully connected to the input and output layers.
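To see how much the hidden layer adds, the parameter counts can be worked out by hand: a fully connected layer has one weight per input-output pair plus one bias per output. The sketch below assumes mnist2 is two nn.Linear layers with biases (784 to 20 to 10), as the description suggests; check it against the actual code.

```python
def linear_params(n_in, n_out):
    """Trainable values in a fully connected layer: n_in*n_out weights plus n_out biases."""
    return n_in * n_out + n_out

# mnist1: 784 pixels -> 10 output units
mnist1_total = linear_params(784, 10)

# mnist2 (assumed): 784 pixels -> 20 hidden units -> 10 output units
mnist2_total = linear_params(784, 20) + linear_params(20, 10)

print(mnist1_total)  # 7850
print(mnist2_total)  # 15910
```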
Run the mnist2 model on the CPU. How long does each epoch of training take, on average? ________________
You can use the show_hidden_weights() and show_output_weights() functions to display the learned weights.
If you have a GPU available, modify the mnist2 code to run on the GPU. How long does each epoch take now? ________________

III. Experiments with the MNIST Dataset and a Convolutional Model

You can do this part in teams of two if you wish.

Skim the code for the mnist3 model.
Run the model on the CPU. Look at some of the kernels the model learns.
How many parameters does this model have, where each parameter is a tensor? ________________
What are each of the parameters of this model? Describe them in English. ________________________________________________
________________________________________________________________
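Note that two of the parameters belong to the BatchNorm2d layer: a learned per-channel scale and shift (PyTorch calls them weight and bias). Don't count these two. (The running means and variances that batch normalization uses are stored as buffers, not parameters, so they don't appear in model.parameters() at all.) The rest are weights. (Biases are considered to be weights.) Looking at the sizes of the various weight and bias tensors, how many total weights does this model have? Show your calculation. ____________________________________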
A convolutional neural network is a "virtual" network where each kernel is replicated many times, but we don't actually build out all the units and connections as individual data structures, since they share the same weights. When running data through the network, though, we still have to do all the multiply and accumulate operations as if we had built out the network, so the number of "effective" weights is many times the number of weight parameters. How many effective weights are in the mnist3 model? Show your calculation. ________________________________________________
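The difference between stored and effective weights can be made concrete with a small calculator. This sketch assumes a square input, a Conv2d layer with bias, and dilation 1. The layer in the example is hypothetical and is not one of mnist3's layers, so you still need to read mnist3's sizes yourself. Fully connected layers share no weights, so for them the two counts are equal.

```python
def conv_weight_counts(in_ch, out_ch, kernel, in_size, stride=1, padding=0):
    """Return (stored, effective) weight counts for one Conv2d layer with bias.

    stored:    each kernel and bias counted once (what model.parameters() holds)
    effective: a separate copy of every kernel and bias at each output position
    """
    per_kernel = in_ch * kernel * kernel + 1                    # one kernel's weights plus its bias
    stored = out_ch * per_kernel
    out_size = (in_size + 2 * padding - kernel) // stride + 1   # output height (= width)
    effective = stored * out_size * out_size
    return stored, effective

# Hypothetical layer: 1 input channel, 16 kernels of size 5x5, 28x28 input, no padding
print(conv_weight_counts(1, 16, 5, 28))  # (416, 239616)
```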
If you are able to run this model on the GPU, how long does each epoch of training take, on average? ________________

IV. Object Recognition with MobileNet

You can do this part in teams of two if you wish.

Run the MobileNet demo on the robot. Note: to install this demo you must download both MobileNet.fsm and the labels.py file found in the same directory.
Use your cellphone to call up a picture of a cat and show it to the robot.
Type "tm" to tell the program to proceed with recognition. Did it recognize the cat?
Try some dog breeds, and some other object classes such as airplanes or cars.
What is the most obscure dog breed you got it to recognize? ____________________
How does the model behave when shown right-side-up vs. upside-down pictures of sports cars? ________________________________________________________________
How does the model behave when shown right-side-up vs. upside-down pictures of tabby cats? ________________________________________________________________

V. Homework Part A

In this assignment you will train a CNN to recognize dominoes. A domino is described by the number of pips in each half, with the larger number always written first, e.g., 5-2, never 2-5. We're using a "double six" domino set, which means the highest domino is the 6-6 and the lowest is the 0-0. There are a total of 28 unique combinations. Therefore you will be solving a 28-way image classification task. Our dominoes have colored pips, which makes the classification easier.
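The count of 28 comes from 7 doubles (0-0 through 6-6) plus C(7,2) = 21 pairs of different numbers. Listing the labels with the larger number first makes this easy to check. The order below is arbitrary and may not match the class ids used in domino_dataset.pt, so take the real label list from the dataset or training code.

```python
def domino_classes(max_pips=6):
    """All labels of a double-N domino set, larger number first (e.g. '5-2')."""
    return [f"{hi}-{lo}" for hi in range(max_pips, -1, -1) for lo in range(hi, -1, -1)]

classes = domino_classes()
print(len(classes))  # 28
print(classes[:4])   # ['6-6', '6-5', '6-4', '6-3']
```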

We've already done a lot of the work for you: (1) we assembled a dataset of 2000 domino images with class labels, (2) we wrote code to load the dataset and augment it by introducing small shifts, small rotations, and random flips, and (3) we wrote code to train a CNN classifier on this dataset, holding out some examples for a validation set used for early stopping, and some more for a test set to measure the trained model's performance. We used a "stratified split" strategy to split up the 2000 images in a way that ensures that each class is proportionally represented in the training, validation, and test sets.
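A stratified split can be done by splitting each class separately and then pooling the pieces. The function below is a minimal pure-Python sketch of that idea, not the code in train_domino_cnn.py, and the 80/10/10 fractions are a hypothetical choice. For a two-way split, scikit-learn's train_test_split with its stratify argument does the same job.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split example indices into train/val/test lists so that each class
    keeps roughly the same proportion in every subset.

    labels:    one class label per example
    fractions: (train, val, test) fractions that sum to 1
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)                 # random order within this class
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]    # whatever is left goes to test
    return train, val, test

labels = ["6-3"] * 50 + ["2-1"] * 30         # toy labels, not the real dataset
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))       # 64 8 8
```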

The performance of the trained model is awful. This is because we're using a lame CNN with only one convolutional layer. It needs more. Your job will be to design a better CNN.

Download these files into your lab7 directory: domino_dataset.pt, train_domino_cnn.py, utils.py, DominoCNN.py.
Browse the code to get an idea of what it does.
Run the model by doing "python -i train_domino_cnn.py". Study the output.
Type "show_grid(train_dataset)" to see a random sample of training data. You can repeat this several times. You can also use show_grid to examine val_dataset and test_dataset. The display looks like this:
[Figure: example show_grid display of domino images]
Design a better version of DominoCNN that uses more convolutional layers to do a better job. You can ask ChatGPT for advice on how many layers to use and how many kernels each layer should have. It is possible to achieve more than 99% accuracy on the test set if you choose a good architecture.
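When you add convolution and pooling layers, the input size of the first fully connected layer changes, and a mismatch there is a common source of shape errors. The helper below traces the feature-map size through a stack of layers using PyTorch's output-size formula for Conv2d and MaxPool2d (dilation 1, floor rounding). The 64x64 RGB input and the layer list are hypothetical, chosen only to illustrate the arithmetic; check the real image size by inspecting one training sample, and treat this as a sketch rather than a recommended architecture.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension (dilation=1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_features(in_channels, in_size, layers):
    """Trace a square image through layers given as tuples:
         ("conv", out_channels, kernel, stride, padding)
         ("pool", kernel, stride)
    Returns (channels, size, channels*size*size); the last value is the
    in_features the first nn.Linear layer after flattening would need.
    """
    channels, size = in_channels, in_size
    for layer in layers:
        kind = layer[0]
        if kind == "conv":
            _, out_ch, k, s, p = layer
            channels, size = out_ch, conv_out_size(size, k, s, p)
        elif kind == "pool":
            _, k, s = layer
            size = conv_out_size(size, k, s)
        else:
            raise ValueError(f"unknown layer type {kind!r}")
        if size < 1:
            raise ValueError("feature map shrank to nothing; use fewer pools or more padding")
    return channels, size, channels * size * size

# Hypothetical: 64x64 RGB input, three blocks of 3x3 conv (padding 1) + 2x2 max-pool
arch = [("conv", 16, 3, 1, 1), ("pool", 2, 2),
        ("conv", 32, 3, 1, 1), ("pool", 2, 2),
        ("conv", 64, 3, 1, 1), ("pool", 2, 2)]
print(flattened_features(3, 64, arch))  # (64, 8, 4096)
```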

VI. Homework Part B

In this part of the homework you will use a YOLO ("You Only Look Once") domino detector to find a domino in the current camera image. We've already trained the detector for you. Its output looks like this:
[Figure: example output of the YOLO domino detector]

Your job will be to use the bounding box from the domino detector to extract the domino from the image, pad and rescale it so it looks like the CNN training data, and then run it through the CNN to recognize the domino.
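One way to get a square crop with a small border is to grow the detector's box to a square around its center and slide it back inside the image if it runs off an edge. This sketch uses extra image pixels as the border; padding with a solid color instead is an equally valid choice, and which one better matches the training images is worth checking with show_grid. The function name and the margin default are hypothetical.

```python
def square_crop_box(cx, cy, w, h, img_w, img_h, margin=0.1):
    """Return an integer (left, top, right, bottom) square crop around a box.

    cx, cy, w, h: box center and size in pixels (e.g. from the detector)
    margin: extra border on each side, as a fraction of the longer box side
    The square is shifted (not shrunk) to stay inside the image when possible.
    """
    side = max(w, h) * (1 + 2 * margin)
    side = min(side, img_w, img_h)           # cannot be bigger than the image
    left = cx - side / 2
    top = cy - side / 2
    # Slide the square back inside the image if it hangs over an edge.
    left = min(max(left, 0), img_w - side)
    top = min(max(top, 0), img_h - side)
    return (int(round(left)), int(round(top)),
            int(round(left + side)), int(round(top + side)))

# Example: an 80x40 box centered at (100, 50) in a 640x480 image
print(square_crop_box(100, 50, 80, 40, 640, 480))  # (52, 2, 148, 98)
```

If the camera image is a NumPy array (as OpenCV images are), the crop is image[top:bottom, left:right], and cv2.resize can then scale it to the CNN's input size.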

To try the YOLO domino detector, download the file yolo_best_weights.pt and run DominoYOLO.fsm. The result obtained from the domino detector has an oriented bounding box (obb) that can be read in several formats: result.obb.xywhr gives the center x, center y, width, height, and rotation angle (in radians) of each box, and result.obb.xyxy gives the simple axis-aligned rectangle (x1, y1, x2, y2) that encloses it. You can also try out DominoRealTime.fsm for continuous real-time domino detection without having to type "tm".
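If you want to see how the rotation in an xywhr row affects the enclosing axis-aligned rectangle, the geometry is short. This is a sketch with a hypothetical function name, assuming r is in radians as Ultralytics uses. Taking absolute values of cos and sin makes the result independent of the sign convention for r.

```python
import math

def xywhr_to_xyxy(cx, cy, w, h, r):
    """Axis-aligned box (x1, y1, x2, y2) enclosing a w x h rectangle
    centered at (cx, cy) and rotated by r radians."""
    c, s = abs(math.cos(r)), abs(math.sin(r))
    half_w = (w * c + h * s) / 2   # horizontal half-extent of the rotated box
    half_h = (w * s + h * c) / 2   # vertical half-extent of the rotated box
    return cx - half_w, cy - half_h, cx + half_w, cy + half_h

print(xywhr_to_xyxy(100, 50, 40, 20, 0))  # (80.0, 40.0, 120.0, 60.0)
```

Applied to a detection, this would be called as xywhr_to_xyxy(*result.obb.xywhr[0].tolist()).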

At the end, you should have a state machine program called DominoClassifier.fsm that waits for you to type a "tm" and then grabs a camera image, locates the domino, classifies it, and says something like "I see a 6-3 domino!" or "I don't see a domino", and then waits for the next "tm".

To solve this problem you will need to learn how to load the saved model that you trained in Part A, and then how to use that model to classify a single image and find the class id string, e.g., "6-3".
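Once the model produces scores for a cropped domino, turning them into the robot's sentence is an argmax followed by a table lookup. In PyTorch the scoring step would typically be: create a DominoCNN, call load_state_dict with the result of torch.load on your saved weights (assuming you saved a state_dict), call model.eval(), and run the crop through the model inside torch.no_grad(); logits.argmax(dim=1).item() is then the class id. The pure-Python sketch below covers only the final lookup. describe_domino is a hypothetical helper, and class_names must be the same id-to-label list used during training.

```python
def describe_domino(scores, class_names):
    """Turn one image's class scores into the sentence the robot should say.

    scores: one number per class (logits or probabilities), or None if the
            detector found no domino
    class_names: list mapping class id -> label string such as "6-3"
    """
    if scores is None:
        return "I don't see a domino"
    class_id = max(range(len(scores)), key=lambda i: scores[i])  # argmax
    return f"I see a {class_names[class_id]} domino!"

print(describe_domino([0.1, 2.0, -1.0], ["6-6", "6-5", "6-4"]))  # I see a 6-5 domino!
```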

We will have a domino set in the lab that you can use for live testing.

Hand In

Hand in your written responses to the MNIST questions at the end of today's lab. Be sure to put your name on the sheet.

Hand in a zip file in Canvas containing: your modified DominoCNN.py, your saved best weights file, a text file with the output of your training run, and your DominoClassifier.fsm file. Do not include all the training data in your zip file!

Dave Touretzky
