This post is part of my Roadtest Review of Cypress PSoC 62S2 Wi-Fi & BT5.0 Pioneer Dev Kit - Review. My review is split into multiple reviews and tutorials. Some projects are implemented in multiple variants. The main page of the review contains a brief description of every chapter, project and variant, along with some recommendations for reading.

 


 

Project 6 – Profiler

This is the last project of my roadtest review. I will show one unique peripheral of the PSoC 62: the profiler. First, I will create a dual-core application. The CM0 core will start the CM4 and then start sorting an array of random numbers using a very unoptimized bubble sort algorithm. The application will measure how long it takes to sort that array and print the result over UART. After sorting, it reverts the array to its state before sorting and starts again. Because every iteration sorts the same data, every iteration should take exactly the same time. The CM4 core will run the USB CDC Echo demo. It will act as a USB device with a virtual serial console. The logic in that demo just copies every character received from the user back to them. While the project is built so that CM0 and CM4 run fully independently, you will see that CM4 operation affects the performance of CM0 in some way. I will use the profiler peripheral to try to diagnose the cause of that performance drop and provide a fix for it. In this project I will use multiple peripherals described in previous projects, so I will not describe those peripherals and the code related to them in much detail. I will use 2 SCBs (both as independent UARTs), 2 TCPWMs (one for measuring elapsed time, the second as a timer with periodic interrupts), the CRYPTO block for generating the random numbers to sort, IPC for notifying completion of the initialization phase, USB for running some code on CM4 and finally the PROFILER for troubleshooting.

 

Create a new project based on the Dual-CPU Empty PSoC6 App template. Because we are creating a dual-core app and we need huge array buffers in the Cortex-M0 code, we must repartition the memory layout for both flash and RAM. Change the flash partition for CM0 from 0x2000 (8 KiB) to 0x8000 (32 KiB). How to repartition flash was described in the previous project 5 – dual core application. In this project we also need to repartition the RAM layout, which was not done in project 5.

 

Changing RAM memory layout

Open linker script for CM0 image:

 

mtb_shared\TARGET_CY8CKIT-062S2-43012\latest-v2.X\COMPONENT_CM0P\TOOLCHAIN_GCC_ARM\cy8c6xxa_cm0plus.ld

 

Change the stack size defined by the STACK_SIZE variable from 0x1000 (4 KiB) to 0x6000 (24 KiB). Then go to the MEMORY section and change the length of the ram block from 0x2000 (8 KiB) to 0x10000 (64 KiB).

Similarly, we must modify the linker script for CM4. Open the linker script for CM4:

 

mtb_shared\TARGET_CY8CKIT-062S2-43012\latest-v2.X\COMPONENT_CM4\TOOLCHAIN_GCC_ARM\cy8c6xxa_cm4_dual.ld

 

Now we must change the start address of the ram region in the MEMORY section and its length. RAM starts at 0x08000000. Because CM0 uses 0x10000 bytes of RAM, we dedicate to CM4 the block of RAM starting at 0x08000000 + 0x10000 = 0x08010000. This is the new ORIGIN value for CM4. The length is a little bit more complicated to calculate. The size of RAM is 1 MiB, so the first byte past the end of RAM is at 0x08100000. We must also reserve the last 2048 bytes (0x800) for system libraries. Finally, we can calculate the size as

 

{end} - {start of range} - 2048 = 0x08100000 - 0x08010000 - 0x800 = 0xEF800 (958 KiB)

 

This is value for LENGTH parameter of ram region for CM4.

 

So now we have 1 MiB of RAM split so that 64 KiB at 0x08000000 is dedicated to CM0, 958 KiB at 0x08010000 is dedicated to CM4 and 2 KiB (at 0x080FF800) is dedicated to system libraries. Because 64 + 958 + 2 = 1024, we know that we calculated all sizes properly (1024 KiB = 1 MiB).
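The arithmetic above is easy to get wrong by one hex digit, so it can be worth double-checking on a PC. The following is only a minimal host-side sketch, not part of the PSoC project: the function and macro names are my own, and the constants are the ones stated in the text (RAM at 0x08000000, 1 MiB in size, last 2 KiB reserved).

```c
#include <assert.h>
#include <stdint.h>

#define RAM_BASE        0x08000000UL  /* start of SRAM, as stated in the text */
#define RAM_SIZE        0x00100000UL  /* 1 MiB of SRAM */
#define SYSTEM_RESERVED 0x00000800UL  /* last 2 KiB kept for system libraries */

/* ORIGIN of the CM4 ram region: directly after the CM0 block. */
uint32_t cm4_ram_origin(uint32_t cm0_length)
{
    assert(cm0_length < RAM_SIZE - SYSTEM_RESERVED);
    return (uint32_t)(RAM_BASE + cm0_length);
}

/* LENGTH of the CM4 ram region: everything between its origin
 * and the reserved 2 KiB tail at the end of RAM. */
uint32_t cm4_ram_length(uint32_t cm0_length)
{
    uint32_t end = (uint32_t)(RAM_BASE + RAM_SIZE);
    return end - cm4_ram_origin(cm0_length) - (uint32_t)SYSTEM_RESERVED;
}
```

Calling `cm4_ram_length(0x10000)` gives 0xEF800, the LENGTH value used for the CM4 linker script, and `cm4_ram_origin(0x10000)` gives 0x08010000.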

 

Cortex-M0 code

Before we start writing code, we must initialize the peripherals: a UART for transmitting information about elapsed time, and a TCPWM as a timer running at the same speed as the CPU core for measuring the clock cycles spent sorting the array. In the Device Configuration tool, check the checkboxes of SCB5 and TCPWM0 clock 0 on the Peripherals tab. This reserves the hardware, so the HAL running on CM4 will not use those resources. I will use the following functions and structures to initialize the SCB and TCPWM. I will not describe them here because most of them are already described in project 1 – PDL variant.

 

static const cy_stc_scb_uart_config_t scb_5_config =
{
      .uartMode = CY_SCB_UART_STANDARD,
      .enableMutliProcessorMode = false,
      .smartCardRetryOnNack = false,
      .irdaInvertRx = false,
      .irdaEnableLowPowerReceiver = false,
      .oversample = 8,
      .enableMsbFirst = false,
      .dataWidth = 8UL,
      .parity = CY_SCB_UART_PARITY_NONE,
      .stopBits = CY_SCB_UART_STOP_BITS_1,
      .enableInputFilter = false,
      .breakWidth = 11UL,
      .dropOnFrameError = false,
      .dropOnParityError = false,
      .receiverAddress = 0x0UL,
      .receiverAddressMask = 0x0UL,
      .acceptAddrInFifo = false,
      .enableCts = false,
      .ctsPolarity = CY_SCB_UART_ACTIVE_LOW,
      .rtsRxFifoLevel = 0UL,
      .rtsPolarity = CY_SCB_UART_ACTIVE_LOW,
      .rxFifoTriggerLevel = 63UL,
      .rxFifoIntEnableMask = 0UL,
      .txFifoTriggerLevel = 63UL,
      .txFifoIntEnableMask = 0UL,
};

static const cy_stc_tcpwm_counter_config_t tcpwm_0_cnt_0_config =
{
      .period = 4294967295,
      .clockPrescaler = CY_TCPWM_COUNTER_PRESCALER_DIVBY_1,
      .runMode = CY_TCPWM_COUNTER_CONTINUOUS,
      .countDirection = CY_TCPWM_COUNTER_COUNT_UP,
      .compareOrCapture = CY_TCPWM_COUNTER_MODE_CAPTURE,
      .compare0 = 16384,
      .compare1 = 16384,
      .enableCompareSwap = false,
      .interruptSources = CY_TCPWM_INT_NONE,
      .captureInputMode = 0x7U & 0x3U,
      .captureInput = CY_TCPWM_INPUT_0,
      .reloadInputMode = 0x7U & 0x3U,
      .reloadInput = CY_TCPWM_INPUT_0,
      .startInputMode = 0x7U & 0x3U,
      .startInput = CY_TCPWM_INPUT_0,
      .stopInputMode = 0x7U & 0x3U,
      .stopInput = CY_TCPWM_INPUT_0,
      .countInputMode = 0x7U & 0x3U,
      .countInput = CY_TCPWM_INPUT_1,
};

void TIMER_Init() {
      cy_rslt_t status;

      status = Cy_TCPWM_Counter_Init(TCPWM0, 0, &tcpwm_0_cnt_0_config);
      assert(status == CY_RSLT_SUCCESS);

      Cy_TCPWM_Counter_Enable(TCPWM0, 0);
      Cy_TCPWM_TriggerStart(TCPWM0, 1 << 0);
}

cy_stc_scb_uart_context_t uartContext;

void UART_Init() {
      cy_rslt_t status;

      status = Cy_SCB_UART_Init(SCB5, &scb_5_config, &uartContext);
      assert(status == CY_RSLT_SUCCESS);

      Cy_SCB_UART_Enable(SCB5);
}

 

In main of CM0, initialize the semaphore used to get the notification that CM4 has successfully initialized the BSP. This was described in more detail in project 5 – dual core application.

 

cy_rslt_t status;

// init SEMA for receiving notification that BSP is initialized
status = Cy_IPC_Sema_Set(16, false);
assert(status == CY_IPC_SEMA_SUCCESS);

// enable CM4
Cy_SysEnableCM4(CY_CORTEX_M4_APPL_ADDR);

// wait until CM4 unlocks the semaphore
while (Cy_IPC_Sema_Status(16) != CY_IPC_SEMA_STATUS_UNLOCKED) {
}

 

Then initialize the CRYPTO engine for generating true random numbers.

 

// init crypto hw for random number generation
status = Cy_Crypto_Core_Enable(CRYPTO);
assert(status == CY_RSLT_SUCCESS);

 

Next, we declare two global buffers: one for storing the generated random numbers and a second one used as the working copy, so that the original data can be restored before every iteration.

 

#define DATA_LEN 2500

static uint32_t bytesOriginal[DATA_LEN];
static uint32_t bytesWorkingCopy[DATA_LEN];

 

Now go back to main and generate some data.

 

// generate DATA_LEN random numbers
for (size_t i = 0; i < DATA_LEN; i++) {
      Cy_Crypto_Core_Trng(CRYPTO, 0x04c11db7UL, 0x04c11db7UL, 32, bytesOriginal + i);
}

 

Initialize UART and TCPWM using functions written above.

 

TIMER_Init();
UART_Init();

 

And now we can write the infinite loop, which saves the counter value at the start and at the end, calculates how long sorting the array took and prints it over the SCB.

 

while (1) {
      uint32_t start = Cy_TCPWM_Counter_GetCounter(TCPWM0, 0);
      sortData();
      uint32_t end = Cy_TCPWM_Counter_GetCounter(TCPWM0, 0);

      char buffer[256];
      sprintf(buffer, "Elapsed time: %lu clock cycles.\r\n", end - start);

      Cy_SCB_UART_PutString(SCB5, buffer);
}
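One detail of the loop above is worth spelling out: the TCPWM counter is a free-running 32-bit up-counter, so it eventually wraps around. The subtraction `end - start` still gives the right answer as long as the measured interval is shorter than one full wrap, because unsigned arithmetic works modulo 2^32. The sketch below illustrates this on a PC; the function names are hypothetical and the 100 MHz clock is the value mentioned later in this post.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two readings of a free-running 32-bit up-counter.
 * Unsigned subtraction is performed modulo 2^32, so the result stays
 * correct even if the counter wrapped once between the two readings. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Longest interval, in whole seconds, that can be measured
 * before the counter wraps and the result becomes ambiguous. */
uint32_t max_interval_seconds(uint32_t counter_hz)
{
    assert(counter_hz != 0u);
    return (uint32_t)(UINT64_C(0x100000000) / counter_hz);
}
```

With a 100 MHz counter clock, `max_interval_seconds(100000000)` returns 42, so a measurement that takes well under a second is far from the limit.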

 

The last remaining function for CM0 is sortData, which copies the random data to the working array and then sorts it using the bubble sort algorithm.

 

void sortData() {
      for (size_t i = 0; i < DATA_LEN; i++) {
            bytesWorkingCopy[i] = bytesOriginal[i];
      }

      for (size_t i = 0; i < DATA_LEN - 1; i++) {
            for (size_t j = 0; j < DATA_LEN - i - 1; j++) {
                  if (bytesWorkingCopy[j] > bytesWorkingCopy[j + 1]) {
                        uint32_t swap = bytesWorkingCopy[j + 1];
                        bytesWorkingCopy[j + 1] = bytesWorkingCopy[j];
                        bytesWorkingCopy[j] = swap;
                  }
            }
      }
}
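This variant of bubble sort has no early exit, so the number of comparisons depends only on the array length: n(n-1)/2, which is 3,123,750 for 2500 elements. Only the number of swaps depends on the data, and the data is the same in every iteration, which is why the run time is expected to be constant. A small host-side sketch (the function name is my own, not part of the project) that counts the comparisons:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same algorithm as sortData(), but it sorts the given array in place
 * and returns the number of comparisons that were performed. */
uint32_t bubble_sort_count(uint32_t *data, size_t len)
{
    uint32_t comparisons = 0;

    assert(data != NULL || len == 0);

    for (size_t i = 0; i + 1 < len; i++) {
        for (size_t j = 0; j + 1 < len - i; j++) {
            comparisons++;
            if (data[j] > data[j + 1]) {
                uint32_t swap = data[j + 1];
                data[j + 1] = data[j];
                data[j] = swap;
            }
        }
    }

    return comparisons;
}
```

For a 5-element array the function returns 10 (4 + 3 + 2 + 1) regardless of whether the input is already sorted.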

 

Now go to CM4 project.

 

Cortex-M4 code

For CM4 part I will reuse USB CDC Echo example.

 

Reusing USB example

Create a new project from that example and copy the .cyusbdev file from it into our project. After a build, this generates some sources for you. The generated files contain USB descriptors and some other structures, constants and so on.

Open the Makefile and change the COMPONENTS variable to the following line. This enables linking of the USB library and configures the correct include directories.

 

COMPONENTS=usbdev

 

Now we will copy most of the example's code to our project. Add the following global variables.

 

static const cy_stc_sysint_t usb_high_interrupt_cfg = {
      .intrSrc = (IRQn_Type) usb_interrupt_hi_IRQn,
      .intrPriority = 5U,
};

static const cy_stc_sysint_t usb_medium_interrupt_cfg = {
      .intrSrc = (IRQn_Type) usb_interrupt_med_IRQn,
      .intrPriority = 6U,
};

static const cy_stc_sysint_t usb_low_interrupt_cfg = {
      .intrSrc = (IRQn_Type) usb_interrupt_lo_IRQn,
      .intrPriority = 7U,
};

static cy_stc_usbfs_dev_drv_context_t  usb_drvContext;
static cy_stc_usb_dev_context_t          usb_devContext;
static cy_stc_usb_dev_cdc_context_t usb_cdcContext;

 

Add the following interrupt handlers.

 

static void usb_high_isr(void) {
      Cy_USBFS_Dev_Drv_Interrupt(CYBSP_USBDEV_HW, Cy_USBFS_Dev_Drv_GetInterruptCauseHi(CYBSP_USBDEV_HW), &usb_drvContext);
}

static void usb_medium_isr(void) {
      Cy_USBFS_Dev_Drv_Interrupt(CYBSP_USBDEV_HW, Cy_USBFS_Dev_Drv_GetInterruptCauseMed(CYBSP_USBDEV_HW), &usb_drvContext);
}

static void usb_low_isr(void) {
      Cy_USBFS_Dev_Drv_Interrupt(CYBSP_USBDEV_HW, Cy_USBFS_Dev_Drv_GetInterruptCauseLo(CYBSP_USBDEV_HW), &usb_drvContext);
}

 

At the beginning of main, initialize the system and clear the semaphore to allow the CM0 application to run. The variables count and buffer come from the USB example and we will use them later.

 

cy_rslt_t result;
uint32_t count;
uint8_t buffer[USBUART_BUFFER_SIZE];

// init BSP
result = cybsp_init();
CY_ASSERT(result == CY_RSLT_SUCCESS);

// clear semaphore to notify CM0 that BSP is initialized
Cy_IPC_Sema_Clear(16, false);

 

Enable interrupts and initialize the usbdev library to act as a CDC device.

 

__enable_irq();

Cy_USB_Dev_Init(CYBSP_USBDEV_HW, &CYBSP_USBDEV_config, &usb_drvContext, &usb_devices[0], &usb_devConfig, &usb_devContext);
Cy_USB_Dev_CDC_Init(&usb_cdcConfig, &usb_cdcContext, &usb_devContext);

Cy_SysInt_Init(&usb_high_interrupt_cfg,   &usb_high_isr);
Cy_SysInt_Init(&usb_medium_interrupt_cfg, &usb_medium_isr);
Cy_SysInt_Init(&usb_low_interrupt_cfg,   &usb_low_isr);
NVIC_EnableIRQ(usb_high_interrupt_cfg.intrSrc);
NVIC_EnableIRQ(usb_medium_interrupt_cfg.intrSrc);
NVIC_EnableIRQ(usb_low_interrupt_cfg.intrSrc);

Cy_USB_Dev_Connect(true, CY_USB_DEV_WAIT_FOREVER, &usb_devContext);

 

Now copy the main loop of the USB example. This code sends all data received from the virtual COM port back to the sender.

 

while (1) {
      if (Cy_USB_Dev_CDC_IsDataReady(USBUART_COM_PORT, &usb_cdcContext)) {
            count = Cy_USB_Dev_CDC_GetAll(USBUART_COM_PORT, buffer, USBUART_BUFFER_SIZE, &usb_cdcContext);

            if (0u != count) {

                  while (0u == Cy_USB_Dev_CDC_IsReady(USBUART_COM_PORT, &usb_cdcContext)) {
                  }

                  Cy_USB_Dev_CDC_PutData(USBUART_COM_PORT, buffer, count, &usb_cdcContext);

                  if (USBUART_BUFFER_SIZE == count) {
                        while (0u == Cy_USB_Dev_CDC_IsReady(USBUART_COM_PORT, &usb_cdcContext)) {
                        }

                        Cy_USB_Dev_CDC_PutData(USBUART_COM_PORT, NULL, 0u, &usb_cdcContext);
                  }
            }
      }

    /* Go to sleep */
    cyhal_syspm_sleep();
}

 

Now you can run the application. When you open a serial terminal, you will see something like the following. The numbers will be different on every run because the application always generates different random data.

 

Here is the first interesting observation. The first run took a little bit more time (about 0.03%) to sort the (same) array. This is probably because, while the instructions were executed for the first time, some internal state of the core (pipeline or similar) was different than in the following iterations. Now connect a second USB cable between the development board and your computer and open a terminal on the USB virtual serial port. On the first terminal you will see that opening the USB terminal also caused a performance glitch that made the bubble sort run a little bit slower.

 

This is very interesting because the sorting code running on CM0 does not depend on USB, which is fully handled by CM4. CM0 is fully independent of CM4, isn’t it? Let’s do some more experiments. Type some random letters on your keyboard into the USB virtual terminal.

 

 

Here you can see a second very interesting behavior. When you used USB, the application managed by CM4 again decreased the performance of the supposedly independent CM0 core. Try copying some long text (for example the text of this review) and pasting it into your USB virtual terminal.

 

Now you have seen the performance of the CM0 core decrease by about 2 to 3% (which is much more than in the previous cases). In our trivial app we can simply ignore it, but you may face real-life requirements where this causes a significant issue. I mentioned that CM0 and CM4 are fully independent. But in fact, they are not. They share the infrastructure of the same chip. For example, we used USB from CM4, but we could access the same peripheral registers from CM0 at the same time. Internally there are buses named AHB (high performance) and APB (peripherals). APB is connected to AHB. AHB has some masters and some slaves. Masters can drive transactions on the bus. From the TRM:

 

Datawires are DMAs, DAP is Debug Access Port.

 

The problem is that when multiple masters need to access AHB at the same time, the opportunity to do so is limited. Because the AHB in PSoC 62 is multilayer, it is partially possible. For example, if CM0 performs a transaction against an address in the peripheral address range (for example, it is accessing an SCB) and CM4 reads a variable from RAM, they can do that at the same time. But if both cores execute read instructions with the address of some peripheral (it could be a completely different peripheral on each core), one core must wait until the bus becomes free. This is most probably the cause of the issue. But there are multiple candidates for what could cause it. Now we must determine what utilizes our internal AHB (or APB?) bus when the performance loss occurs. The good news is that PSoC 62 has a peripheral named Profiler which can partially help us. The Profiler can track multiple kinds of information, and one of its features is that it can count the number of transactions on the AHB bus made by some peripherals. The list of all measurable metrics is part of the source code reference in the TRM. The TRM contains some description, but it describes the metrics available in PSoC 63, not the metrics available in PSoC 62. The following image shows the list of metrics for PSoC 62.

 

Choosing sources to monitor

Now we must select up to 8 sources to monitor. The first is a virtual source and is not interesting to us. The second group counts the clock cycles during which CM0 or CM4 was running. Because I do not use any interesting sleep modes, I will not use it, but you could use these metrics, for example, to profile the power consumption of the chip. The next two metrics report utilization of the bus by the flash memories. At first glance you may think they are completely useless to us because we do not use flash. But in fact we do use flash memory, and it can cause a similar issue: the instructions that a CPU executes are loaded from flash and, as we know, both cores execute instructions at the same time. So flash is one of the possible causes of the problem and I will monitor it. There are two kinds of flash. I did not know at first which one is the right one, so I monitored both. The next 3 metrics are DMA related. They report how frequently the DMA blocks access the AHB bus. The one after that reports how often the CRYPTO block accesses the bus. Next is a monitor that reports how frequently the USB block utilizes the AHB bus. Because I use USB, this metric is very interesting to me. The next metrics report utilization by the SCB blocks. Because we use SCB5, we will monitor how frequently it utilizes the bus. The last monitors report how long the SMIF slaves were active and how many reads and writes were issued to the SDHC peripheral. SDHC is not just for microSD cards: the Wi-Fi module is connected using this interface, so you can use these SDHC metrics to profile similar performance or power consumption issues in an IoT application.

 

As a result, I will monitor the following metrics:

 

  • MAIN_FLASH
  • WORK_FLASH
  • USB
  • SCB5

 

We must set up the profiler, start it, and then periodically read the measured values and “visualize” them.

 

Now we must decide which core should run the profiler-related code. It is better to affect CM0 as little as possible because its problems are what we are troubleshooting, so I will run the profiler code on CM4. If you were troubleshooting performance problems of CM4, it would be better to run the profiler-related code on CM0. If you profile something that is not related to performance (for example energy consumption using the CM0 and CM4 metrics), it does not matter which core you choose. In that case you would usually select CM4, because on CM4 you can use the HAL and it is easier to initialize the required peripherals with the HAL than with the PDL.

 

Profiler logic

So now I initialize a timer that triggers an interrupt every second and an SCB that acts as a UART, both using the HAL library. This is mostly described in project 1 – HAL variant, so I will not describe the details here.

 

cyhal_timer_t timer_obj;
cyhal_uart_t uart_obj;

void PROFILER_InitTimer() {
      cy_rslt_t status;

      const cyhal_timer_cfg_t timer_cfg =
      {
            .compare_value = 0,                        /* Timer compare value, not used */
            .period = 9999,                            /* Defines the timer period */
            .direction = CYHAL_TIMER_DIR_UP,  /* Timer counts up */
            .is_compare = false,                 /* Don't use compare mode */
            .is_continuous = true,               /* Run the timer indefinitely */
            .value = 0                                 /* Initial value of counter */
      };

      status = cyhal_timer_init(&timer_obj, NC, NULL);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = cyhal_timer_configure(&timer_obj, &timer_cfg);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = cyhal_timer_set_frequency(&timer_obj, 10000);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      /* The next two HAL calls return void, so there is no status to check. */
      cyhal_timer_register_callback(&timer_obj, PROFILER_Tick, NULL);

      cyhal_timer_enable_event(&timer_obj, CYHAL_TIMER_IRQ_TERMINAL_COUNT, 3, true);

      status = cyhal_timer_start(&timer_obj);
      CY_ASSERT(status == CY_RSLT_SUCCESS);
}

void PROFILER_InitUart() {
      cy_rslt_t status;
      uint32_t actualbaud;

      const cyhal_uart_cfg_t uart_config =
      {
            .data_bits = 8,
            .stop_bits = 1,
            .parity = CYHAL_UART_PARITY_NONE,
            .rx_buffer = NULL,
            .rx_buffer_size = 0
      };

      status = cyhal_uart_init(&uart_obj, P10_1, P10_0, NULL, &uart_config);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = cyhal_uart_set_baud(&uart_obj, 115200, &actualbaud);
      CY_ASSERT(status == CY_RSLT_SUCCESS);
}

void PROFILER_Init() {
      PROFILER_InitProfiler();
      PROFILER_InitTimer();
      PROFILER_InitUart();
}
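The one-second interval follows from the timer configuration above: the counter counts from 0 up to `period`, so one cycle takes `period + 1` ticks, and with a 10 kHz counter frequency 10000 ticks take exactly one second. A minimal sketch of that calculation, with a hypothetical helper name that is not part of the HAL:

```c
#include <assert.h>
#include <stdint.h>

/* Interval between terminal-count events of an up-counting timer in
 * milliseconds. The counter runs from 0 to `period` inclusive, so one
 * full cycle consists of period + 1 ticks. */
uint32_t tick_interval_ms(uint32_t period, uint32_t frequency_hz)
{
    assert(frequency_hz != 0u);
    return (uint32_t)((((uint64_t)period + 1u) * 1000u) / frequency_hz);
}
```

`tick_interval_ms(9999, 10000)` returns 1000. If you want the profiler to report more often, lowering the period to 4999 would give a 500 ms interval.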

 

Now only the initialization of the profiler itself remains, and it is easy. You can select up to 8 monitored sources using Cy_Profile_ConfigureCounter, then enable those counters using Cy_Profile_EnableCounter and finally start profiling using Cy_Profile_StartProfiling. From this moment, the profiler increments a counter whenever its configured event is triggered.

 

cy_stc_profile_ctr_ptr_t scbMonitor;
cy_stc_profile_ctr_ptr_t usbMonitor;
cy_stc_profile_ctr_ptr_t flashMainMonitor;
cy_stc_profile_ctr_ptr_t flashWorkMonitor;

void PROFILER_InitProfiler() {
      cy_rslt_t status;

      Cy_Profile_Init();

      scbMonitor = Cy_Profile_ConfigureCounter(SCB5_MONITOR_AHB, CY_PROFILE_EVENT, CY_PROFILE_CLK_HF, 1);
      usbMonitor = Cy_Profile_ConfigureCounter(USB_MONITOR_AHB, CY_PROFILE_EVENT, CY_PROFILE_CLK_HF, 1);
      flashMainMonitor = Cy_Profile_ConfigureCounter(CPUSS_MONITOR_MAIN_FLASH, CY_PROFILE_EVENT, CY_PROFILE_CLK_HF, 1);
      flashWorkMonitor = Cy_Profile_ConfigureCounter(CPUSS_MONITOR_WORK_FLASH, CY_PROFILE_EVENT, CY_PROFILE_CLK_HF, 1);

      status = Cy_Profile_EnableCounter(scbMonitor);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = Cy_Profile_EnableCounter(usbMonitor);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = Cy_Profile_EnableCounter(flashMainMonitor);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = Cy_Profile_EnableCounter(flashWorkMonitor);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      Cy_Profile_StartProfiling();
}

 

Now profiling is started, and we must write the interrupt handler for the timer. In this handler we calculate how much each value has changed since the previous reading. I will declare global variables for storing the previous counts.

 

uint64_t scbOperationsPrevious;
uint64_t usbOperationsPrevious;
uint64_t flashMainOperationsPrevious;
uint64_t flashWorkOperationsPrevious;

 

In the timer interrupt handler, I collect the values using Cy_Profile_GetRawCount for each metric, calculate the differences and print them over the UART.

 

void PROFILER_Tick() {
      char message[128];
      cy_rslt_t status;
      uint64_t scbOperations;
      uint64_t usbOperations;
      uint64_t flashMainOperations;
      uint64_t flashWorkOperations;
      size_t len;

      status = Cy_Profile_GetRawCount(scbMonitor, &scbOperations);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = Cy_Profile_GetRawCount(usbMonitor, &usbOperations);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = Cy_Profile_GetRawCount(flashMainMonitor, &flashMainOperations);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      status = Cy_Profile_GetRawCount(flashWorkMonitor, &flashWorkOperations);
      CY_ASSERT(status == CY_RSLT_SUCCESS);

      scbOperations -= scbOperationsPrevious;
      usbOperations -= usbOperationsPrevious;
      flashMainOperations -= flashMainOperationsPrevious;
      flashWorkOperations -= flashWorkOperationsPrevious;

      sprintf(message, "scb +%-8lu; usb +%-8lu; flashm +%-8lu; flashw +%-8lu\r\n", (uint32_t)scbOperations, (uint32_t)usbOperations, (uint32_t)flashMainOperations, (uint32_t)flashWorkOperations);
      len = strlen(message);

      scbOperationsPrevious += scbOperations;
      usbOperationsPrevious += usbOperations;
      flashMainOperationsPrevious += flashMainOperations;
      flashWorkOperationsPrevious += flashWorkOperations;

      cyhal_uart_write(&uart_obj, message, &len);
}
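The handler above repeats the same "subtract the previous reading, then remember the new one" pattern for each of the four counters. As a sketch only (the helper name is hypothetical and this is not how the original code is structured), the pattern can be isolated into one small function that is easy to test on a PC:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns how much a monotonically increasing counter grew since the
 * previous call and stores the current reading for the next call. */
uint64_t counter_delta(uint64_t current, uint64_t *previous)
{
    uint64_t delta;

    assert(previous != NULL);

    delta = current - *previous;
    *previous = current;
    return delta;
}
```

In the handler, a call such as `counter_delta(usbOperations, &usbOperationsPrevious)` would then yield the number of USB bus operations since the last tick.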

Analysis of problem

Now you can connect a second, external USB-to-UART converter to the expansion header of the board. Connect the RX pin of the USB-to-UART converter to pin P10_1 (TX) on the expansion port of the board. If you open a serial viewer for both UARTs, you will see something like this.

 

Because we introduced some new logic on CM4, CM0 is affected more than before and the values that were stable are stable no longer. But they differ only a little: usually the value changes only in the 3 least significant decimal digits. On the second UART, whose output comes from the profiler, you can see that the application does not use SCB5 much but uses USB and flash a lot at startup. Then it stabilizes so that the USB peripheral is not used at all and flash is used lightly all the time. Now if you repeat the experiments from the previous section, you will see the following:

 

As you can see, we now have only a limited ability to determine the performance loss from the first window, but in the second window we see exactly what utilized the AHB bus. We can see that the keystrokes did not change the usage of SCB5 at all (this makes sense, because SCB5 is used very rarely, only for transmitting a short message about how long the sorting took). But we see that USB was used quite a lot and the usage of flash also grew a little. Let’s try the experiment with copying and pasting a long text to trigger a higher performance loss.

 

Now you can see about 2 to 3% performance loss (note that the second most significant digit now changes) and you can also see that AHB utilization by USB grew significantly during that time. Note that while the profiler prints a line every second, the sorting code prints irregularly, depending on the performance of the core. Because the MCU runs at 100 MHz and the calculation takes about 70 million clock cycles, it prints a line about every 0.7 seconds. The different “print rates” are the reason why the lines in the two windows do not line up visually. In our app it is easy to determine that the performance loss is related to something with USB, but in a more complex application it will probably not be so easy to determine the real cause. This (or a similar) analysis can help you. Also note that the amount of transferred data does not affect the usage of flash. It does not matter if you transfer one letter or paste a whole chapter of this review into the terminal. This is probably caused by the cache on the flash bus.
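The conversions used in the paragraph above (cycles to time, and relative slowdown) are simple enough to sketch in a few lines. The helper names below are my own; the 100 MHz clock and the roughly 70 million cycles per sort are the figures from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of a measured cycle count in milliseconds. */
uint32_t cycles_to_ms(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz != 0u);
    return (uint32_t)(((uint64_t)cycles * 1000u) / clock_hz);
}

/* Slowdown of `measured` relative to `baseline`, expressed in
 * hundredths of a percent (250 means 2.50 %). */
uint32_t slowdown_centipercent(uint32_t baseline, uint32_t measured)
{
    assert(baseline != 0u && measured >= baseline);
    return (uint32_t)(((uint64_t)(measured - baseline) * 10000u) / baseline);
}
```

`cycles_to_ms(70000000, 100000000)` returns 700, which matches the "about every 0.7 seconds" print rate. As a purely illustrative example with round numbers (not a measurement), a run of 71,750,000 cycles against a baseline of 70,000,000 corresponds to a slowdown of 2.50%.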

 

Now we know that our issue is caused by the USB-related part of our application. But we must consider whether it is really caused by the USB peripheral itself. We eliminated the SCB and flash because they do almost nothing over time. For the SCB it is easy to see that it is doing nothing, and for flash we have the monitoring results which tell us so. It is good to refer to the documentation again from time to time, even if you have already looked at it multiple times.

 

The connection of peripherals to AHB looks as follows (AHB is the slim line in the middle of the image):

 

 

We have eliminated flash, DMA, CRYPTO and the peripherals.

 

But look at the center. We completely ignored SRAM. If you think about it a little, you will probably notice that both cores use SRAM extensively. CM0 sorts values stored in SRAM and CM4 also copies a lot of data to buffers stored in SRAM. Also note that the USB peripheral itself cannot block CM0 from accessing the bus because it is not used by CM0 at all and AHB is a multilayer bus. This means that while CM4 communicates with USB over AHB, CM0 can still communicate with, for example, SRAM at the same time. So the USB peripheral is not the cause of the problem in this case at all.

 

Now look at a different aspect of the previous diagram. There is not just one SRAM: there are three independent SRAM blocks, 3 independent controllers for them and 3 independent connections to AHB. These 3 SRAMs are mapped to adjacent ranges of the address space, which enables you to use the whole SRAM as one large block without worrying about which block is used. The datasheet describes how large those blocks are (this is not in the TRM because the values differ per chip/family).

 

And now we can try an experiment. We will separate the apps so that the CM0 app uses SRAM0 and the CM4 app uses SRAM1 and SRAM2. It is easy because we already know how to change the partitioning of memory. At the moment we are using the following layout, which we created at the beginning.

 

User of memory    Assigned size        Used SRAM blocks
CM0               0x10000 (64 KiB)     SRAM0
CM4               0xEF800 (958 KiB)    SRAM0, SRAM1 and SRAM2
System            0x800 (2 KiB)        SRAM2

 

Change the memory layout to the following, using the steps from the beginning of this article.

 

User of memory    Assigned size        Used SRAM blocks
CM0               0x80000 (512 KiB)    SRAM0
CM4               0x7F800 (510 KiB)    SRAM1 and SRAM2
System            0x800 (2 KiB)        SRAM2
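To check which SRAM blocks a given ram region touches, you can compare its address range with the block boundaries. The sketch below is a host-side illustration with a hypothetical function name. It ASSUMES that SRAM0 is 512 KiB and SRAM1 and SRAM2 are 256 KiB each, mapped back to back from 0x08000000. The 512 KiB size of SRAM0 is implied by the layouts above; the sizes of SRAM1 and SRAM2 are my assumption, so verify them in the datasheet of your part.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED layout: SRAM0 = 512 KiB, SRAM1 = 256 KiB, SRAM2 = 256 KiB,
 * mapped back to back starting at 0x08000000. Check your datasheet. */
#define SRAM_BASE 0x08000000UL
static const uint32_t sram_block_size[3] = { 0x80000u, 0x40000u, 0x40000u };

/* Returns a bit mask of the SRAM blocks touched by the address range
 * [origin, origin + length): bit 0 = SRAM0, bit 1 = SRAM1, bit 2 = SRAM2. */
unsigned sram_blocks_used(uint32_t origin, uint32_t length)
{
    unsigned mask = 0u;
    uint32_t block_start = (uint32_t)SRAM_BASE;

    assert(length > 0u);

    for (unsigned i = 0u; i < 3u; i++) {
        uint32_t block_end = block_start + sram_block_size[i];
        /* Two half-open ranges overlap if each starts before the other ends. */
        if (origin < block_end && origin + length > block_start) {
            mask |= 1u << i;
        }
        block_start = block_end;
    }

    return mask;
}
```

Under these assumptions, the original CM4 region (`0x08010000`, length `0xEF800`) yields 0x7, meaning it touches all three blocks, while the new CM4 region (`0x08080000`, length `0x7F800`) yields 0x6, meaning only SRAM1 and SRAM2.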

 

This change resolves the issue. In the following screenshot you can see the same application as before. The only difference is the changed RAM layout. As you can see, CM0 now runs fully independently of the USB application running on CM4 in all situations. Changing the RAM partitioning in the linker scripts was enough to resolve the performance inconsistency and make the application more deterministic. The only different number now remains at startup (the first line in the first terminal), but note that it is not very significant. The difference from the stabilized value is only 34 clock cycles, and it is probably caused by some internal state of the Cortex-M0 core (its pipeline or similar) during the first pass.

 

That is all for this project. You have seen an interesting issue which can occur on modern ARM MCUs (not only PSoCs), one unique peripheral of the PSoC 62, and how a single number in a script that many users have no idea about can change the behavior of an app. The information in this project is mostly valid for any ARM MCU, even one with only a single CPU core, because, for example, a DMA can access the AHB bus in the same way as a CPU core and this can result in a similar issue.