Previous posts

  • BIBoP 1 - Introduction, game plan and rationale: Introduction to the project and an overview, creators' bios
  • BIBoP 2 - Writing Arduino code with C++ and C (Makefiles): Using a custom Makefile for development of Arduino code, flashing and debugging code

 

Next posts

  • BIBoP 4 - AWS Lambda deployment and MQTT communication: Deploying the trained model on AWS Lambda and sending secure requests from the Arduino
  • BIBoP 5 - Power efficiency and interrupts: Implementing the SAMD21 deep-sleep mode and external interrupt handling
  • Assembly and debugging: Assembly, 3D design and debugging of the project
  • Galvanometer creation: Process of creating my own galvanometer
  • Sensor processing algorithms: Overview of various medical algorithms for detecting abnormalities in cardiovascular activity

 

Introduction

 

Hello! In case this is your first time with BIBoP: it is a smart armband designed for remote patient monitoring and overall health assessment. It is based on the Arduino Nano 33 IoT and a few external sensors (PPG, galvanometer and IMU).

Be sure to regularly check the official repository of this project!

 

After reading this post you will be more familiar with creating machine learning models and with cleaning and exploring your data. Don't worry if you are not familiar with ML - reading this post should help you grasp the practical basics!

If you are confused about all the medical terms, such as systole or diastole, be sure to check this article!

 

It seems like raw HTML has trouble with longer code segments (especially indented ones), so some variables may appear doubled. I tried fixing it but it behaves oddly! Just ignore any doubled variable names in the code sections below.

 

Why use ML models for inference?

Maybe you remember from the BIBoP 1 - Introduction, game plan and rationale post that one of the key benefits of this product is blood pressure monitoring - an important indicator of a person's cardiovascular fitness and health. It is estimated that over 1.13 billion people worldwide have hypertension, two-thirds of whom live in low- and middle-income countries. It is even called the "silent killer" because it has no warning signs or symptoms - most people who have it are oblivious to it! It is therefore vital to have a device which monitors blood pressure periodically and alerts the user when it is unusual. Moreover, even for people who are already aware of their condition, carrying a cuff device around is cumbersome and often not feasible. This band is not an ideal solution, but it is much easier to carry around and performs non-invasive BP measurements via photoplethysmography.

 

You may say: "All good, but why use ML models and not smart algorithms for estimation?" The answer is quite simple - it turns out that having a model is both computationally cheaper and easier to implement: the implementer does not need to learn all the signal processing techniques and can still create a good-enough solution on their own. Obviously this is NOT a medical-grade device, just an estimator (but it could become one some day). In a commercial solution one would use more complex models or in-house algorithms for processing the PPG, but more on that later.

 

This post will go through the notebook available here and discuss the most important points in the data cleaning, training and validation process. The next post will go more in-depth into deploying the model to AWS Lambda and sending the data from Arduino to it.

 

Obtaining and cleaning the data

The data which forms the basis of the model comes from a publicly available dataset in the UCI Machine Learning Repository containing PPG, ECG and invasive BP measurements. The ground truth for our estimation is the SBP and DBP (systolic and diastolic blood pressure), extracted from the ABP (Arterial Blood Pressure) waveform. The sampling frequency is 125 Hz, which is also the frequency at which BIBoP will collect its data. The signals were of course not ideal and had to be cleaned beforehand (this was done by the creators of the dataset).

 

After downloading the dataset, extracting it and loading it into memory, the cleaning and data reshaping can commence!

 

The download, extraction and loading are shown in this chunk of code:

 

# download and load the data
import os
import zipfile

import mat73
import requests
from tqdm import tqdm

DATA_FOLDER = "data"  # extraction target folder (name assumed here, defined earlier in the notebook)

chunk_size = 4096
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00340/data.zip"
req = requests.get(url, stream=True)
total_size = int(req.headers['content-length'])

# faster downloads in chunks, with a progress bar
with open("data.zip", "wb") as file:
    for data in tqdm(iterable=req.iter_content(chunk_size=chunk_size), total=total_size / chunk_size, unit='KB'):
        file.write(data)

# extract the data
if not os.path.exists(DATA_FOLDER):
    os.mkdir(DATA_FOLDER)

with zipfile.ZipFile("data.zip", "r") as ref:
    ref.extractall(DATA_FOLDER)

# load the first part of the dataset (MATLAB v7.3 files)
test_samples = mat73.loadmat(f"{DATA_FOLDER}/Part_1.mat")['Part_1']


 

We want to divide the data into 125-sample segments (or longer, depending on the chosen segment length - more on that later!):

 

# the sampling frequency - 125 Hz
FS = 125
# segment length in samples: 125 gives 1-second segments, 1250 would give 10-second segments
SAMPLE_SIZE = 125  # or longer!
# number of 1-second periods per segment
NUM_PERIODS = SAMPLE_SIZE // FS
# partition the data into equal-length PPG segments
ppg = []
for i in range(len(test_samples)):
    l = test_samples[i][0].size
    for j in range(l // SAMPLE_SIZE):
        ppg.append(test_samples[i][0][j * SAMPLE_SIZE : (j + 1) * SAMPLE_SIZE])

 

Since machine learning regression* is, in a very rough approximation, "given x, compute y, where y = f(x)", we are preparing our x's (independent variables) here - our y's (dependent variables) will be predicted and tested against the ground-truth blood pressure already present in the dataset.

*regression is, put simply, prediction of a continuous value

 

We now have to extract both systolic and diastolic blood pressure, which we do by taking the minima and maxima of consecutive periods:

sbp = []
dbp = []
bp = []

for i in range(len(test_samples)):
    l = test_samples[i][1].size
    for j in range(l // SAMPLE_SIZE):
        temp_bp = test_samples[i][1][j * SAMPLE_SIZE : (j + 1) * SAMPLE_SIZE]
        tmp_sbp = []
        tmp_dbp = []
        for k in range(NUM_PERIODS):
          tmp_bp_small = temp_bp[k * FS : (k + 1) * FS]
          tmp_sbp.append(np.max(tmp_bp_small))
          tmp_dbp.append(np.min(tmp_bp_small))
        # SBP will be maximum and DBP will be minimum of 1 such sampling period (or averaged if more periods)
        bp.append(temp_bp)
        sbp.append(np.mean(tmp_sbp))
        dbp.append(np.mean(tmp_dbp))

 

A comparison of the ABP and PPG signal is visible below:

 

Inspecting the SBP and DBP values shows they are reasonable:

The PPG signal, however, is certainly not ideal, which is why we will need to perform some further cleaning:

 

For now we can create a baseline model and see how it fares. We have not extracted any features yet, so the prediction is based only on the raw PPG signal:

 

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# since we want to predict both SBP and DBP we pack them together for each sample
target_bp = []
for i in range(len(ppg)):
    target_bp.append((sbp[i], dbp[i]))

x_train, x_test, y_train, y_test = train_test_split(ppg, target_bp, test_size=0.3)

# baseline: plain linear regression on the raw PPG samples
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
mean_absolute_error(y_test, y_pred)

 

The achieved Mean Absolute Error was 13.8, which leaves plenty of room for improvement. We need to extract features from the signal that capture the characteristics important to the models.

 

Temporal and spectral feature extraction

Temporal:

Since classical machine learning usually benefits from hand-picked features, several features are extracted from the signal periods and assessed for validity. They were chosen from among the most popular ones in recent scientific papers.

These include:

  • Cycle duration time
  • Time from cycle start to systolic peak
  • Time from systolic peak to cycle end
  • Time from systolic peak to dicrotic notch
  • Time from dicrotic notch to end
  • Ratio between systolic and diastolic amplitude

These features are explained in more detail in the graph underneath:

PPG temporal features shown

The code for extracting both feature sets is available in the notebook.
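The full implementation is in the notebook; below is only a minimal sketch (my own simplified version, not the notebook code) of how the temporal features could be derived from a single PPG cycle, assuming the cycle starts at the pulse onset and the dicrotic notch appears as the first local minimum after the systolic peak:

import numpy as np
from scipy.signal import argrelextrema

FS = 125  # sampling frequency in Hz

def temporal_features(cycle):
    # hypothetical helper - a simplified re-creation of the notebook's temporal extraction
    cycle = np.asarray(cycle, dtype=float)
    cycle_len = cycle.size / FS                    # cycle duration in seconds
    sys_idx = int(np.argmax(cycle))                # systolic peak taken as the global maximum
    t_start_sys = sys_idx / FS                     # cycle start -> systolic peak
    t_sys_end = (cycle.size - sys_idx) / FS        # systolic peak -> cycle end

    # dicrotic notch approximated as the first local minimum after the systolic peak
    minima = argrelextrema(cycle[sys_idx:], np.less)[0]
    notch_idx = sys_idx + (minima[0] if minima.size else (cycle.size - sys_idx) // 2)
    t_sys_dicr = (notch_idx - sys_idx) / FS        # systolic peak -> dicrotic notch
    t_dicr_end = (cycle.size - notch_idx) / FS     # dicrotic notch -> cycle end

    # ratio between systolic and diastolic amplitude (diastolic peak taken after the notch)
    systolic_amp = cycle[sys_idx] - cycle.min()
    diastolic_amp = cycle[notch_idx:].max() - cycle.min()
    ratio = systolic_amp / diastolic_amp if diastolic_amp else np.nan

    return cycle_len, t_start_sys, t_sys_end, t_sys_dicr, t_dicr_end, ratio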

 

Spectral:

Additionally, we can extract spectral features, which could help the models:

  • Three largest magnitudes (both values and frequencies)
  • Normalized energy
  • Entropy
  • Histogram - Binned distribution from 0 to 60 Hz (10 bins)
  • Skewness
  • Kurtosis

 

These features require some periodicity (remember our old friend Joseph Fourier?), so they have to be taken from ~10 periods of PPG.
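As a rough illustration (again a simplified sketch rather than the exact notebook code, with my own helper names), the spectral features could be computed from such a ~10-second segment along these lines:

import numpy as np
from scipy.stats import entropy, kurtosis, skew

FS = 125  # sampling frequency in Hz

def spectral_features(segment, n_bins=10, f_max=60):
    # hypothetical helper - a simplified version of the notebook's spectral extraction
    segment = np.asarray(segment, dtype=float)
    mags = np.abs(np.fft.rfft(segment))              # magnitude spectrum
    freqs = np.fft.rfftfreq(segment.size, d=1 / FS)  # corresponding frequencies

    # three largest magnitudes and the frequencies they occur at
    top = np.argsort(mags)[-3:][::-1]
    peaks = [(freqs[i], mags[i]) for i in top]

    energy = np.sum(mags ** 2) / segment.size        # spectral energy, normalized by length
    spectral_entropy = entropy(mags / mags.sum())    # entropy of the normalized spectrum

    # binned distribution of spectral magnitude from 0 to f_max Hz
    hist, _ = np.histogram(freqs, bins=n_bins, range=(0, f_max), weights=mags)

    # skewness and kurtosis computed on the time-domain segment here
    return peaks, energy, spectral_entropy, hist, skew(segment), kurtosis(segment)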

 

Exploratory Data Analysis

 

After obtaining the features, we assemble the dataframes and clean them slightly.

rows = []

for i in range(len(ppg)):
    cycle_len, t_start_sys, t_sys_end, t_sys_dicr, t_dicr_end, ratio = extract_features_long_seg(ppg[i], ppg_ii[i])
    freq_1, mag_1, freq_2, mag_2, freq_3, mag_3, energy, entro, bins, skewness, kurt = extract_spectral_features(ppg[i])
    rows.append((cycle_len, t_start_sys, t_sys_end, t_sys_dicr, t_dicr_end, ratio,
                freq_1, mag_1, freq_2, mag_2, freq_3, mag_3, energy, entro, skewness, kurt, *bins))
    
rows = np.array(rows)

# make a df from the data to clean it
bins = [f"bin_{i}" for i in range(10)]
col = ["cycle_len", "t_start_sys", "t_sys_end", "t_sys_dicr", "t_dicr_end", "ratio", 
       "freq_1", "mag_1", "freq_2", "mag_2", "freq_3", "mag_3", "energy", "entropy", "skewness", "kurtosis", *bins]
df = pd.DataFrame(rows, columns=col)

# drop rows with invalid values (negative durations or float overflows)
idxs = df.loc[df['t_sys_end'] < 0.].index
inf_idxs = df.index[(df.values >= np.finfo(np.float64).max).any(axis=1)]
indices = idxs.append(inf_idxs)
df = df.drop(indices)

# fill all remaining NaN's
df.fillna(0, inplace=True)

# drop the same rows from the target variables so they stay aligned
col = ["SBP", "DBP"]
df_target = pd.DataFrame(target_bp, columns=col)
df_target = df_target.drop(indices)

 

Let's take a quick look at df.describe() (shown transposed below for readability), which tells us how the dataframe fares:

 

feature        count     mean          std           min           25%           50%           75%           max
cycle_len      64456     0.442144      0.145370      0.068000      0.312000      0.451556      0.568000      0.942400
t_start_sys    64456     0.127784      0.042792      0.011556      0.096800      0.122400      0.153600      0.497778
t_sys_end      64456     0.314360      0.117196      0.019200      0.210400      0.322400      0.416000      0.754400
t_sys_dicr     64456     0.093809      0.020634      0.017600      0.080000      0.092000      0.105600      0.213600
t_dicr_end     64456     0.220551      0.109738      -0.043200     0.126400      0.225600      0.313600      0.654400
ratio          64456     2.049950      4.465608      1.001015      1.629077      1.927782      2.261274      410.781156
freq_1         64456     14.718056     4.221277      1.000000      12.000000     14.000000     16.000000     46.000000
mag_1          64456     342.624333    121.592932    34.365509     288.347745    360.518828    430.118513    593.146635
freq_2         64456     20.163569     9.518615      1.000000      13.000000     18.000000     28.000000     56.000000
mag_2          64456     186.224900    68.868611     23.168951     153.297873    191.269831    232.090276    425.829332
freq_3         64456     21.539624     11.804871     1.000000      13.000000     21.000000     30.000000     60.000000
mag_3          64456     133.617784    55.556927     9.101054      97.183000     138.414646    173.345801    345.323190
energy         64456     36308.785487  13447.113218  1691.572260   36032.668051  39596.804411  43575.105601  65454.456268
entropy        64456     -5.543050     21.862551     -89.053560    -20.539261    -5.392093     7.441500      85.182220
skewness       64456     35.566565     3.219641      26.378352     33.060948     35.450767     37.954430     45.265006
kurtosis       64456     1259.221380   151.549479    879.103277    1139.808714   1250.986639   1370.335745   1696.106193
bin_0          64456     1249.999969   0.005570      1249.000000   1250.000000   1250.000000   1250.000000   1250.000000
bin_1          64456     0.000016      0.003939      0.000000      0.000000      0.000000      0.000000      1.000000
bin_2          64456     0.000016      0.003939      0.000000      0.000000      0.000000      0.000000      1.000000
bin_3..bin_9   64456     0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000

Briefly exploring the data, it can be seen that there are some outliers which should be eliminated in order to improve the models' predictions. For this we will use IQR-based elimination - removing every value that lies more than 1.5 times the interquartile range below the first quartile (Q1) or above the third quartile (Q3):

# only some columns have striking outliers - remove rows with outliers in these columns only
sus = ["ratio", "mag_1", "mag_2", "mag_3", "energy", "entropy", "skewness", "kurtosis"]
to_remove = set()

for x in sus:
    q25, q75 = np.percentile(df.loc[:, x], [25, 75])
    iqr = q75 - q25

    upper = q75 + iqr * 1.5
    lower = q25 - iqr * 1.5

    idxs_1 = df.loc[df[x] < lower, x].index
    idxs_2 = df.loc[df[x] > upper, x].index
    to_remove = to_remove.union(idxs_1).union(idxs_2)

df.drop(to_remove, inplace=True)
df_target.drop(to_remove, inplace=True)

 

Now we can finally plot our data and see if there are any striking relations we should be aware of:

# correlation plots
plt.rcParams['figure.dpi'] = 100
plt.rcParams["figure.figsize"] = (20,15)
scatter_var = ["cycle_len", "t_start_sys", "t_sys_end", "t_sys_dicr", "t_dicr_end", "ratio", 
               "freq_1", "mag_1", "freq_2", "mag_2", "freq_3", "mag_3", "energy", "entropy", "skewness", "kurtosis"]
correlation_matrix = df[scatter_var].corr()
sns.heatmap(correlation_matrix, annot=True)

 

It can be seen that the time variables are quite correlated with each other, which is expected. The energy is also somewhat correlated with the time variables. There are strong anticorrelations among the frequency-domain features - the frequency/magnitude pairs.

 

Apart from confirming our expectations, this plot did not give as much insight as it can in other cases, so let's move on!

 

Machine Learning and Results

Having the data prepared, we can now proceed with training various models to assess which one is best. The most promising models in the literature are Random Forest and Linear Regression, so we use them here as well. Let's stop for a moment to discuss them in more detail.

 

Linear Regression

Linear Regression is one of the most basic ML models. It finds a relationship between a dependent variable and a set of independent variables, and it can be illustrated as fitting a line to a scatter plot:

Credit: Davi Frossard

 

This page explains it in more detail.
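To make the idea concrete, here is a tiny self-contained example on toy data (not our dataset) of fitting a line with ordinary least squares, which is essentially what scikit-learn's LinearRegression does for us:

import numpy as np

# toy data: y roughly follows 2*x + 1 plus some noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# design matrix with a column of ones for the intercept term
X = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")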

 

Random Forest

A Random Forest consists of decision trees - structures made of nodes and branches. The trees are created by splitting the data in each node according to some branching condition. A single decision tree works much like asking a series of questions about the current input, as in the picture below:

Credit: Daksh Trehan

 

In a random forest the data is fed through all the trees and their predictions are averaged - that average is the prediction of the random forest.

 

Credit: https://www.oreilly.com/library/view/tensorflow-machine-learning/9781789132212/d3d388ea-3e0b-4095-b01e-a0fe8cb3e575.xhtm…

 

This page explains it in more detail.
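A minimal toy example with scikit-learn shows the "average of many trees" idea in action - n_estimators is simply the number of trees in the forest:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data: a non-linear function of three features plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=100)
forest.fit(X, y)

# the forest prediction is the average of the individual trees' predictions
sample = X[:1]
tree_preds = [tree.predict(sample)[0] for tree in forest.estimators_]
print(forest.predict(sample)[0], np.mean(tree_preds))  # (nearly) identical values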

 

The data is further split using K-Fold cross-validation, so that the model is trained and validated on several different splits. It is then finally used to predict on the untouched test set.

from sklearn.model_selection import KFold

folds = KFold(n_splits=10, shuffle=False)
# re-split the data after processing
x_train, x_test, y_train, y_test = train_test_split(df, df_target, test_size=0.3)

 

Finally, the model training can commence:

linear = LinearRegression()
errors_sbp = []
errors_dbp = []

for i, (train_idx, val_idx) in enumerate(folds.split(x_train, y_train)):
    train_data, train_target = x_train.iloc[train_idx], y_train.iloc[train_idx]
    val_data, val_target = x_train.iloc[val_idx], y_train.iloc[val_idx]
    
    linear.fit(train_data, train_target) 
    predictions = linear.predict(val_data)
    error_sbp = mean_absolute_error(predictions[:,0], val_target["SBP"].values)
    error_dbp = mean_absolute_error(predictions[:,1], val_target["DBP"].values)
    
    print(f"Train fold {i} MAE SBP: {error_sbp} MAE DBP: {error_dbp}")
    errors_sbp.append(error_sbp)
    errors_dbp.append(error_dbp)
    
print(f"Average MAE SBP: {np.mean(errors_sbp)} MAE DBP: {np.mean(errors_dbp)}")

 

The RF model is trained in a similar way.
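For reference, here is a sketch of the corresponding Random Forest loop (mirroring the linear regression loop above; the hyperparameters are illustrative, not necessarily the ones used for the final results):

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=100)
errors_sbp, errors_dbp = [], []

for i, (train_idx, val_idx) in enumerate(folds.split(x_train, y_train)):
    train_data, train_target = x_train.iloc[train_idx], y_train.iloc[train_idx]
    val_data, val_target = x_train.iloc[val_idx], y_train.iloc[val_idx]

    forest.fit(train_data, train_target)
    predictions = forest.predict(val_data)
    errors_sbp.append(mean_absolute_error(predictions[:, 0], val_target["SBP"].values))
    errors_dbp.append(mean_absolute_error(predictions[:, 1], val_target["DBP"].values))

print(f"Average MAE SBP: {np.mean(errors_sbp)} MAE DBP: {np.mean(errors_dbp)}")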

 

The results are visible in the picture below:

It is visible that the rightmost model is the most accurate - a Random Forest with 10-second segments and 100 estimators. The results are far from ideal, but they are not much worse than the recent state of the art (we still have much to improve in this area).

 

The most important problem we are now facing is making the PPG data captured with the Arduino fit our model (i.e. be normalized and rescaled). The PPG reading also often wanders and its amplitude varies between readings, so we will have to rescale the data dynamically (over a window of 10 seconds or more).
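One possible approach (just an idea sketch at this point, not code that runs on the device yet) is to remove the wandering baseline and min-max rescale each ~10-second window on the fly before feeding it to the model:

import numpy as np

FS = 125
WINDOW = 10 * FS  # roughly 10 seconds of samples at 125 Hz

def rescale_window(raw):
    # take the latest ~10-second window of raw PPG readings
    window = np.asarray(raw[-WINDOW:], dtype=float)
    # estimate the wandering baseline with a ~1-second moving average and subtract it
    kernel = np.ones(FS) / FS
    detrended = window - np.convolve(window, kernel, mode="same")
    # min-max rescale so that amplitude differences between readings are normalized
    span = detrended.max() - detrended.min()
    return (detrended - detrended.min()) / span if span else detrended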

 

References

In case you are eager to learn more about the sources used as a basis for this ML project, here are the links:

  • Slapničar G, Mlakar N, Luštrek M. _Blood Pressure Estimation from Photoplethysmogram Using a Spectro-Temporal Deep Neural Network._ Sensors (Basel). 2019;19(15):3420. Published 2019 Aug 4. doi:10.3390/s19153420
  • Serj Haddad, Assim Boukhayma, and Antonino Caizzone - _Continuous PPG-Based Blood Pressure Monitoring Using Multi-Linear Regression_
  • M. Kachuee, M. M. Kiani, H. Mohammadzade, M. Shabany, _Cuff-Less High-Accuracy Calibration-Free Blood Pressure Estimation Using Pulse Transit Time_, IEEE International Symposium on Circuits and Systems (ISCAS'15), 2015
  • Khalid SG, Zhang J, Chen F, Zheng D. _Blood Pressure Estimation Using Photoplethysmography Only: Comparison between Different Machine Learning Approaches._ J Healthc Eng. 2018;2018:1548647. Published 2018 Oct 23. doi:10.1155/2018/1548647

 

Summary

We went through an extensive ML data preparation and model training process - we hope you enjoyed it and learned something valuable. Getting this right was cumbersome, as the data is never clean and ready to use (welcome to biosignals!), but the results paid off: we have a reusable model which we can put to use in various BP estimation projects. Be sure to check out the links referenced throughout the post to deepen your knowledge, and give us feedback!

 

Next time: exporting our model and deploying it to AWS Lambda + sending requests from the Arduino (at the moment AWS is very moody and does not want to accept our HTTP POST requests...). Cheers!

 

Jakub & Szymon & Michał

 

Next post
BIBoP 4 - AWS Lambda deployment and MQTT communication