The next neural network that I'm going to try is a variant of Tiny-YOLO.  The You Only Look Once (YOLO) architecture was developed to perform detection and classification in a single step.  The image is divided into a fixed grid of uniform cells, and bounding boxes are predicted and classified within each cell.  This architecture enables faster object detection and has been applied to streaming video.
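To make the grid idea concrete, here is a minimal sketch of the arithmetic. It assumes the 416x416 input and 13x13 output grid that show up in the code later in this post. The figure of 5 anchor boxes per cell is my assumption, chosen because 5 x (4 + 1 + 20) matches the 125 output channels used for the last layer.

```python
# Sketch of the YOLO grid arithmetic (values are assumptions, see the text above).
INPUT_SIZE = 416    # letterboxed input width/height used later in this post
GRID_SIZE = 13      # the last layer's bias is broadcast over 13*13 positions
NUM_CLASSES = 20    # PASCAL VOC classes
NUM_ANCHORS = 5     # assumption: 5 anchor boxes per grid cell

CELL_PIXELS = INPUT_SIZE // GRID_SIZE  # each cell covers 32x32 input pixels


def cell_for_point(x, y):
    """Return (column, row) of the grid cell that contains input pixel (x, y)."""
    col = min(int(x // CELL_PIXELS), GRID_SIZE - 1)
    row = min(int(y // CELL_PIXELS), GRID_SIZE - 1)
    return col, row


def channels_per_cell():
    """Each anchor predicts 4 box coordinates, 1 objectness score and the class scores."""
    return NUM_ANCHORS * (4 + 1 + NUM_CLASSES)


print(cell_for_point(208, 100))
print(channels_per_cell())
```

An object whose center falls in a given cell is predicted by that cell, which is why the whole image can be handled in one pass.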


The network topology is shown below.  The pink layers have been quantized with 1-bit weights and 3-bit activations and are executed in the HW accelerator, while the other layers are executed in Python.
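For readers unfamiliar with this kind of quantization, the sketch below shows one common way to get 3-bit activations and 1-bit weights. It is not the qnn library's implementation: both helper names are hypothetical, and the real `utils.quantize` may round or scale differently.

```python
import numpy as np


def quantize_activations(x, bits=3):
    """Uniformly quantize values in [0, 1] to 2**bits levels (hypothetical helper)."""
    steps = 2 ** bits - 1            # 7 steps between 0 and 1 for 3 bits
    x = np.clip(x, 0.0, 1.0)         # values outside the range saturate
    return np.round(x * steps) / steps


def binarize_weights(w):
    """Map each weight to -1 or +1 according to its sign (hypothetical helper)."""
    return np.where(w >= 0, 1.0, -1.0)


acts = np.array([0.0, 0.1, 0.5, 0.93, 1.2])
print(quantize_activations(acts) * 7)            # activation levels as integers 0..7
print(binarize_weights(np.array([-0.3, 0.0, 0.7])))
```

Multiplying the quantized activations by 7 gives the integer levels, which looks like the same step as the `conv0_output_quant*7` expression in the classification code.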


The image processing is performed within Darknet using its Python bindings.




The neural network has been trained on the PASCAL VOC (Visual Object Classes) dataset and is able to identify 20 classes of objects in 4 categories:

  1. Person: person
  2. Animal: bird, cat, cow, dog, horse, sheep
  3. Vehicle: airplane, bicycle, boat, bus, car, motorbike, train
  4. Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
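The grouping above is handy when post-processing detections, for example to react only to vehicles. Here is a small sketch using the names exactly as listed; the label file that Darknet reads may spell some of them differently, so treat the strings as assumptions.

```python
# Category -> class names, copied from the list above.
VOC_CATEGORIES = {
    "Person": ["person"],
    "Animal": ["bird", "cat", "cow", "dog", "horse", "sheep"],
    "Vehicle": ["airplane", "bicycle", "boat", "bus", "car", "motorbike", "train"],
    "Indoor": ["bottle", "chair", "dining table", "potted plant", "sofa", "tv/monitor"],
}

# Reverse lookup: class name -> category.
CLASS_TO_CATEGORY = {
    name: category
    for category, names in VOC_CATEGORIES.items()
    for name in names
}


def category_of(class_name):
    """Return the category for a class name, or 'unknown' if it is not listed."""
    return CLASS_TO_CATEGORY.get(class_name, "unknown")


print(len(CLASS_TO_CATEGORY))
print(category_of("car"))
```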


The steps for detection and classification are similar to the previous network as this network also uses the Multi-layer offload architecture.


Initialize the network

  1. Import libraries
  2. Instantiate classifier
  3. Perform other initializations in the Darknet framework


Code for initialization:

import sys
import os, platform
import random  # used by random.choice() when picking an image to classify
import json
import numpy as np
import cv2
import ctypes

from PIL import Image
from datetime import datetime

import qnn
from qnn import TinierYolo
from qnn import utils 
from darknet import *

from matplotlib import pyplot as plt
%matplotlib inline

classifier = TinierYolo()
net = classifier.load_network(json_layer="/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-layers.json")

conv0_weights = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv0-W.npy', encoding="latin1")
conv0_weights_correct = np.transpose(conv0_weights, axes=(3, 2, 1, 0))
conv8_weights = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv8-W.npy', encoding="latin1")
conv8_weights_correct = np.transpose(conv8_weights, axes=(3, 2, 1, 0))
conv0_bias = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv0-bias.npy', encoding="latin1")
conv0_bias_broadcast = np.broadcast_to(conv0_bias[:,np.newaxis], (net['conv1']['input'][0],net['conv1']['input'][1]*net['conv1']['input'][1]))
conv8_bias = np.load('/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-conv8-bias.npy', encoding="latin1")
conv8_bias_broadcast = np.broadcast_to(conv8_bias[:,np.newaxis], (125,13*13))
file_name_cfg = c_char_p("/usr/local/lib/python3.6/dist-packages/qnn/params/tinier-yolo-bwn-3bit-relu-nomaxpool.cfg".encode())

net_darknet = lib.parse_network_cfg(file_name_cfg)
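The initialization code relies on two NumPy operations that are easy to misread: reversing the axis order of the weight tensors and broadcasting a per-channel bias over every spatial position. The sketch below shows both on small dummy arrays. The shapes are made up for illustration and are not the real layer dimensions.

```python
import numpy as np

# Dummy weight tensor with four distinct axis sizes so the reordering is visible.
weights = np.zeros((16, 3, 5, 7))
weights_reordered = np.transpose(weights, axes=(3, 2, 1, 0))
print(weights_reordered.shape)   # axis order is fully reversed

# Per-channel bias (4 channels) repeated across a 3x3 output map.
bias = np.arange(4, dtype=np.float32)
bias_broadcast = np.broadcast_to(bias[:, np.newaxis], (4, 3 * 3))
print(bias_broadcast.shape)
print(bias_broadcast[2])         # every position of channel 2 gets the same bias
```

Note that `np.broadcast_to` returns a read-only view, so the broadcast bias costs no extra memory but cannot be modified in place.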



Classify image

  1. Open image to be classified
  2. Execute the first convolutional layer in Python
  3. Compute HW Offload of the quantized layers
  4. Execute the last convolutional layer in Python


Code for classification:

img_folder = './yoloimages/'
img_file = os.path.join(img_folder, random.choice(os.listdir(img_folder)))
file_name = c_char_p(img_file.encode())

img = load_image(file_name,0,0)
img_letterbox = letterbox_image(img,416,416)
img_copy = np.copy(np.ctypeslib.as_array(, (3,416,416)))
img_copy = np.swapaxes(img_copy, 0,2)

im =

# First convolutional layer, executed in software
start = datetime.now()
img_copy = img_copy[np.newaxis, :, :, :]
conv0_output = utils.conv_layer(img_copy,conv0_weights_correct,b=conv0_bias_broadcast,stride=2,padding=1)
# Clip to [0, 4], scale to [0, 1] and quantize to 3 bits for the accelerator
conv0_output_quant = conv0_output.clip(0.0,4.0)
conv0_output_quant = utils.quantize(conv0_output_quant/4,3)
end = datetime.now()
micros = int((end - start).total_seconds() * 1000000)
print("First layer SW implementation took {} microseconds".format(micros))
print(micros, file=open('timestamp.txt', 'w'))

out_dim = net['conv7']['output'][1]
out_ch = net['conv7']['output'][0]

conv_output = classifier.get_accel_buffer(out_ch, out_dim)
conv_input = classifier.prepare_buffer(conv0_output_quant*7);

# Quantized layers, offloaded to the HW accelerator
start = datetime.now()
classifier.inference(conv_input, conv_output)
end = datetime.now()

conv7_out = classifier.postprocess_buffer(conv_output)

micros = int((end - start).total_seconds() * 1000000)
print("HW implementation took {} microseconds".format(micros))
print(micros, file=open('timestamp.txt', 'a'))

# Last convolutional layer, executed in software
start = datetime.now()
conv7_out_reshaped = conv7_out.reshape(out_dim,out_dim,out_ch)
conv7_out_swapped = np.swapaxes(conv7_out_reshaped, 0, 1) # exp 1
conv7_out_swapped = conv7_out_swapped[np.newaxis, :, :, :]

conv8_output = utils.conv_layer(conv7_out_swapped,conv8_weights_correct,b=conv8_bias_broadcast,stride=1)
# Pass the result to Darknet as a C float pointer
conv8_out = conv8_output.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

end = datetime.now()
micros = int((end - start).total_seconds() * 1000000)
print("Last layer SW implementation took {} microseconds".format(micros))
print(micros, file=open('timestamp.txt', 'a'))
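The same start/stop/print pattern is repeated for each of the three stages above. As a possible tidy-up, a small context manager can do the bookkeeping. This is a sketch with a hypothetical helper name (`timed`), not something provided by the qnn package.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the duration in microseconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        micros = int((time.perf_counter() - start) * 1000000)
        results[label] = micros
        print("{} took {} microseconds".format(label, micros))


timings = {}
with timed("dummy workload", timings):
    sum(i * i for i in range(10000))   # stand-in for one stage of the network
```

Each stage would then be wrapped in its own `with timed(...)` block, and the `timings` dictionary could be written to `timestamp.txt` once at the end.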



Draw detection boxes using Darknet

   The image postprocessing (drawing the bounding boxes) is performed in Darknet using Python bindings.


Code for image postprocessing:

tresh = c_float(0.3)
tresh_hier = c_float(0.5)
file_name_out = c_char_p("/home/xilinx/jupyter_notebooks/qnn/detection".encode())
file_name_probs = c_char_p("/home/xilinx/jupyter_notebooks/qnn/probabilities.txt".encode())
file_names_voc = c_char_p("/opt/darknet/data/voc.names".encode())
darknet_path = c_char_p("/opt/darknet/".encode())
lib.draw_detection_python(net_darknet, file_name, tresh, tresh_hier,file_names_voc, darknet_path, file_name_out, file_name_probs);

# Print probabilities, highest first
with open(file_name_probs.value, "r") as f:
    file_content = f.read().splitlines()
detections = []
for line in file_content:
    name, probability = line.split(": ")
    detections.append((probability, name))
# Sort numerically; assumes each probability looks like "84%"
# (a plain string sort would rank "9%" above "84%")
for det in sorted(detections, key=lambda tup: float(tup[0].strip().rstrip('%')), reverse=True):
    print("class: {}\tprobability: {}".format(det[1], det[0]))



Sample image (horses)

The first image that I'm going to use is a provided sample image of horses (773 x 512 pixels).


Execution time:

  • First layer SW implementation took 594523 microseconds
  • HW implementation took 593735 microseconds
  • Last layer SW implementation took 68420 microseconds
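Adding up the three measured stages gives a rough upper bound on the frame rate. The calculation below uses only the numbers reported above and leaves out image loading and box drawing, which were not timed.

```python
# Measured stage times in microseconds, taken from the list above.
stage_micros = {
    "first layer (SW)": 594523,
    "offloaded layers (HW)": 593735,
    "last layer (SW)": 68420,
}

total_micros = sum(stage_micros.values())
frames_per_second = 1000000 / total_micros

print("total: {} microseconds".format(total_micros))
print("frame rate: {:.2f} frames per second".format(frames_per_second))
for name, micros in stage_micros.items():
    print("{}: {:.1f}% of total".format(name, 100.0 * micros / total_micros))
```

The first layer, running in software, takes about as long as all of the offloaded layers combined, so it is the obvious place to look for a speedup.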



class: cow probability: 84%

class: horse probability: 74%

class: horse probability: 68%


Object detection bounding boxes:

The example shows the issues that occur with multiple overlapping objects.



IP camera images

The application that I would like to use neural networks for is object identification in video streams from surveillance cameras.  As an example, I have a PTZ IP camera at the front of my house that is primarily used to alert me to deliveries (mail, Amazon, UPS, etc).  It is normally pointed at the driveway and mailbox, but the pan/tilt capability allows me to look up and down the street and also at my front door (270 degrees of coverage).  Currently, image motion detection and PIR sensing tell me when something is detected, but I need to look at the camera video to determine if it is something of interest.  And needless to say, there are a lot of false detections.  I have 2 video sources that I'd like to analyze: the live feed from the camera and stored video from a network video recorder (NVR).  I have multiple cameras, but I think it would be okay to require that each camera have dedicated processing hardware.


The PYNQ notebook examples that I've found either use the HDMI input or a webcam as a streaming video source.  For my application I need the ability to process an RTSP (Real Time Streaming Protocol) stream over Ethernet.  I had hoped that I could just use the VideoCapture function in OpenCV, but I can't seem to get that to work.  I'm sure that I'll be able to get something to work, but for the purposes of this roadtest I'm just going to use static images from the camera (actually from the NVR).  I currently stream 2 resolutions from this particular camera (1280x720 and 640x480).  I'd like to use the lower resolution stream for processing if it doesn't degrade the accuracy too much.  I'm going to test that with the image captures from the NVR (the lower resolution captures from the NVR are actually only 320x176, to allow for faster searching).  It turns out that, because the detection grid is a fixed ratio of the image (every image is letterboxed to 416x416 before inference), the large and small images have about the same execution time.
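The similar execution times make sense once you look at what letterboxing does to each capture. The sketch below assumes that `letterbox_image` scales the image to fit inside 416x416 while preserving the aspect ratio and pads the rest. That is my understanding of Darknet's behavior, and `letterbox_size` is a hypothetical helper, not part of the bindings.

```python
def letterbox_size(width, height, target=416):
    """Size of the scaled image inside a target x target letterbox (aspect ratio kept)."""
    if width >= height:
        # Wide image: width fills the box, height shrinks proportionally.
        return target, (height * target) // width
    # Tall image: height fills the box, width shrinks proportionally.
    return (width * target) // height, target


for size in [(1280, 720), (320, 176), (773, 512)]:
    print(size, "->", letterbox_size(*size))
```

Both camera resolutions end up as a 416-pixel-wide strip of roughly the same height, so the network sees the same amount of data either way. The small capture is simply upscaled, which may explain the lower confidence on the 320x176 images.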



Night image (1280x720)

class: car probability: 30%

The car to the right is not detected.


Day image (1280x720)

class: car probability: 86%

class: car probability: 34%

Multiple bounding boxes for the same image
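Duplicate boxes like these are usually handled with non-maximum suppression (NMS): boxes are ranked by score, and any box that overlaps a higher-scoring one too much is dropped. I don't know what suppression the Darknet drawing call already applies, so the following is only a sketch of the idea, with made-up box coordinates and an assumed overlap threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Keep the highest-scoring box among heavily overlapping ones.

    detections is a list of (score, box) tuples; the threshold is an assumption.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical boxes: two overlapping detections of one car plus a separate car.
detections = [
    (0.86, (100, 100, 300, 250)),
    (0.34, (110, 105, 310, 260)),
    (0.60, (500, 120, 620, 220)),
]
for score, box in nms(detections):
    print(score, box)
```

This version ignores class labels. Applying it per class would avoid suppressing, say, a person standing in front of a car.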


Truck image (320x176)

class: car probability: 79%    -- no separate class for truck


Truck image (1280x720)

class: car probability: 96%  -- improved classification with larger image size (better resolution?)


Different truck (320x176)

class: car probability: 63%


Multiple cars (320x176)

class: car probability: 60%

class: car probability: 47%

class: car probability: 33%


Did okay with the shadows.


Me (320x176)

class: person probability: 42%


Multiple objects (1280x720)

class: car probability: 79%

class: car probability: 75%

class: person probability: 51%


Seems to have a harder time with people.


Amazon and Mail trucks (1280x720)

class: car probability: 99%

class: car probability: 35%




   So, I've got a few challenges ahead of me after this roadtest.

  1. Figure out how to capture the RTSP stream (BTW, I do this successfully with a Raspberry Pi)
  2. Quantify usable frame rate (currently taking over a second to execute)
  3. Figure out how to train with something that allows me to differentiate vehicles
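For the first item, the usual OpenCV approach is sketched below. Whether it works depends on the OpenCV build having a backend that supports RTSP (such as FFmpeg or GStreamer), which may be what is missing on the board. The URL is a placeholder and `grab_frame` is a hypothetical helper. The `capture_factory` argument exists only so the logic can be exercised without a camera.

```python
def grab_frame(url, capture_factory=None):
    """Open a video stream, read a single frame, and always release the stream."""
    if capture_factory is None:
        import cv2  # imported lazily so the sketch loads even without OpenCV
        capture_factory = cv2.VideoCapture
    cap = capture_factory(url)
    try:
        if not cap.isOpened():
            raise RuntimeError("could not open stream: {}".format(url))
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError("stream opened but no frame was read")
        return frame
    finally:
        cap.release()


# Example call (placeholder address and credentials, adjust for the real camera):
# frame = grab_frame("rtsp://user:password@192.0.2.10:554/stream")
```

Checking `isOpened()` separately from `read()` at least distinguishes "the backend cannot open RTSP at all" from "the stream opens but delivers no frames", which helps narrow down where the capture fails.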