Project Overview

My project is inspired by the work I did during my Masters degree. Back then I got to experiment with implementing several variations of Evolutionary Algorithms to solve some problems in the Telecommunications field. At that point, everything was implemented in C with some additional compiler directives to parallelise the algorithm execution by running parts of it on multiple threads (using OpenMP). Already back then I realised that some operations of the algorithm can be parallelised even further and found some papers, where it was done on FPGAs.

I really wanted to give it a try as well, however, due to the time constraints never actually got to do it. And then I found out about Path 2 Programmable and thought it would be a perfect chance to revisit Evolutionary Algorithms.

Original project plan looked as follows:

1. Implement basic algorithm variation in PS only;
2. Identify components that can be offloaded to the PL accelerator;
3. Set up high speed communication layer between PS and PL;
4. Implement accelerated version of the algorithm;
5. Compare results.

I must admit that I underestimated the scope of work slightly. I therefore decided to split the project blog in 2 parts. The first will cover steps 1 and 2 and will set the scene for the more exciting bit.

Having done some research I thought it would be interesting to have a look at how some parts of the algorithm can be accelerated in Xilinx SDSoC that has not been covered in the training course.

Here is a brief structure of this post. I will first try to give a very brief explanation of what Evolutionary algorithms are. This can be discussed for a very long time so I will leave some links to more detailed insights. Next, I will discuss the application I want to look at. Finally, I will go through some steps on how to set up SDx for Ultra96-V2 and some code profiling to identify algorithm bottlenecks to be accelerated.

Some background: what are Evolutionary Algorithms?

Evolutionary Algorithms (EAs) are a type of heuristic search optimisation algorithms meaning they are trying to find an optimal (or close to optimal) solution to a given problem without exploring the whole search space but trying to predict the regions, where this solution might be. As you can tell from the name, EAs use some concepts inspired by natural process of evolution to optimise the search. Therefore, the main drivers of the search are such operations as Mutation and Competitive Selection (the strongest survives). I guess that going into too much detail is a bit out of scope of this programme but these are two great books that explain it really well:

1. Floreano and C. Mattiussi, Bio-Inspired Artificial Intelligence: Theories, Methods and Technologies, Cambridge: MIT Press, 2008.
2. Brabazon, M. O'Neill and S. McGarraghy, Natural Computing Algorithms, Berlin, Heidelberg: Springer Berlin Heidelberg, 2015.

And here is a link to a very nice article that got me up and running at some point: https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3

I will focus here on Genetic algorithms (GA) that are just a commonly used type of EAs.

The picture below summarises the main algorithm stages.

Source: Floreano and C. Mattiussi, Bio-Inspired Artificial Intelligence: Theories, Methods and Technologies, Cambridge: MIT Press, 2008.

In short:

1. Create a set of randomly generated solutions to your problem (Chromosomes) each consisting of a set of parameters/coefficients (Genes).
2. Evaluate the function to be optimised using that set of parameters for each solution (Fitness Calculation). For example, we can be trying to optimise the hyperparameters for training a Neural Network as suggested here: https://towardsdatascience.com/gas-and-nns-6a41f1e8146d
3. Select the chromosomes that allow the function to evaluate to the highest values (if the function needs to be maximised).
4. Modify values of the randomly selected Genes (Mutation).
5. Mix up the “fittest” chromosomes (Crossover)
6. Evaluate again.

Steps 2-6 are normally iterated several times which represents multiple Generations.

How it will be used?

EAs became quite popular in the field Networks Analysis as they can be applied quite easily to solving Graph theory problems. This is useful for a data set with a lot of members that also has information about how those members are related or connected to each other. One interesting problem that can be solved with a GA is called Maximum Clique. It is easy to understand visually:

In the picture above, members 1, 2, 4, 5, 7 and 11 for a Maximum Clique because every member is connected to the other (each one has 5 out of 5 possible connections). Finding the biggest set like that in a very large graph is a very computationally heavy task. And this is where algorithms like GA come useful. This particular search is used quite a lot when analysing social networks, for instance.

To sum up, the operation of the algorithm I am about to implement is as follows:

1. Load a test graph into memory and store it as a Matrix (Adjacency Matrix);
2. Generate a random set of subgraphs each represented as a binary array (chromosome). 1 means member is included in the subgraph;
3. Extract a clique from each subgraph and try to "grow" it by randomly adding more members.

Practical bit

Step 1: Starting an SDx Project

Xilinx SDx interface is actually almost identical to the one we are used to in the Xilinx SDk, which is convenient and saves you time. There are 3 types of projects you can create:

1. Application project – the actual application you will be running on your hardware;
2. Platform project – this is similar to the BSP for the board you need to create for and SDK project – contains info of the hardware platform used, OS, etc.;
3. Library project – allows you to a library of C-callable hardware IP blocks.

The first thing to do is to download the SDSoC files specifically for Ultra96-V2, this contains a ready Platform project, on which our Application project will be based (http://zedboard.org/support/design/28476/181). At the moment of writing, only Baremetal platform is available.

When creating an application project, you will be asked to select one of the platforms. Ultra96-V2 does not appear on the list by default, you need to add the path to the folder you downloaded which contains the required files by pressing the “Plus” icon. The list will be updated and should look as below:

You will then end up with the following system configuration. As I mentioned earlier, only Standalone can be selected as “Domain”.

The last step is to select the application template. The empty project will do just fine. However, there are plenty of examples to show case various features available through SDSoC from Xilinx. They can be imported straight from the project setup tool or you can find then on GitHub.

And then it is all done, your SDx project is ready! There will be a window with some settings for your project. You can also note that only one A53 CPU core seems to be available for Ultra96.

I used on open-source repository to obtain a copy of one of the so called DIMACS graphs – MANN-a9. It comes in a text format and has a list of graph connections written one connection per row. As we covered in the SW course pf the Path 2 Programmable training, Xilinx provides a library to interface with a FAT filesystem on the SD card. The "ff.h" header needs to be included to be able to use the library. Also, only binary file format is supported so I wrote a simple code to convert the text file storing the graph info to a binary file.

Overall, the process of reading a file from and SD card is very similar to the code you would write for a normal Linux platform: mout the filesystem, open the file and get a file pointer and then read from it line by line. My code to do that is below. I also combined it with filling in the graph Adjacency Matrix.

{
// variable to store the FAT filesystem object
FATFS def_drive;
FRESULT rc;
FIL fp1;
UINT bytesToBeRd = sizeof(edge_t);
edge_t data;

// mount the SD card
if((rc = f_mount(&def_drive,"0:/",0)) != FR_OK)
{
print("ERROR : f_mount failed\n\r");
return -1;
}

// open file from SD card for reading
if((rc = f_open(&fp1, file, FA_OPEN_EXISTING | FA_READ)) != FR_OK)
{
print("ERROR : f_open failed\n\r");
xil_printf("Error code: %d\n", rc);
return -1;
}

// read file line by line
for( int i = 0; i < TOTAL_SIZE; i++)
{
{
return -1;
}

// if data is valid, add it to the Graph Matrix
else
{
/* set matrix element with coordinates v1:v2 to 1
to show they are connected
*/
if ( data.v1 > data.v2)
{
}
else
{
}
}
}

// close the file once done
if((rc = f_close(&fp1)) != FR_OK)
{
print("ERROR : f_close failed\n\r");
return -1;
}

return 0;
}

Step 3 - Running the algorithm

I managed to reuse some of the code I had already. However, it took me some time to perform the necessary modifications. I had to replace some libraries and change the method of allocating memory for large arrays. For the latter, I decided to use the allocator recommended for SDSoC designs: sds_alloc() that is included in the sds_lib.h library. It allocates a physically contiguous area of memory to allow faster access to it from the hardware functions.

SDx generates a bootable image for the SD card by default. However, I started with launching the program in the “Debug on Hardware” mode over the JTAG adapter. And, surprisingly, it analysed the graph correctly from the first attempt and gave the expected Maximum Clique almost very quickly.

There is another useful function included in the SDS library - sds_clock_counter(). It can be used to measure the number of processor ticks to perform some operation. It took around 4.8 million cycles to perform one iteration of the algorithm.

Step 4: Profiling the application

Now it is time to see in which functions the program spends most of the execution time. The recommended way to profile your code is to use a tool included in the SDx – TCF (Target Communication Framework) profiler. To access it, start a debug session as you would normally do (Debug As -> Launch on Hardware) and switch to the debug perspective. Then, click on Window -> Show View -> Other -> TCF Profiler. This will adda new window to the debug view.

I profiled my code as well. To exclude the time taken to a load the file from SD card info memory, I set a couple of breakpoints to only analyse one algorithm iteration. I also had to set the sample refresh interval to 10ms instead of default 4000ms to get some sensible data. Here are the results I obtained:

As expected, the most computationally heavy operations are extracting and then extending the Clique.

In the next post I will describe how these two functions had to be modified to be accelerated in hardware and what speed up I obtained in the end.