11 Replies Latest reply on Jul 2, 2020 1:08 PM by hemilshah0211

    Invalidating the data cache

    josevi

      I have a problem that I am trying to fix in a program I have written.

      I receive internet packets; the Rx interrupt sets a bool so that, in a loop, I can then request the RxBDs from the hardware. I then copy the packets out into some linked list containers before returning the BDs to the hardware.


      My problem is that when I go to read the data that the BD points to, the cached data is wrong, so I invalidate it. Except that using

      void Xil_DCacheInvalidateRange(unsigned int adr, unsigned len)

      invalidates a cache line but not necessarily where the packet information starts. It looks at the address I pass to it, then it moves from there to the start of the first cache line and only then begins invalidating the cache.


      This leads to varying parts of my packet information being "chopped off". Anywhere from 0 to 28 bytes worth of data.
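
      To make the "0 to 28 bytes" figure concrete, here is a minimal sketch of the address arithmetic involved. It assumes a 32-byte cache line and buffers that are only 4-byte aligned; the helper names are hypothetical and not part of the Xilinx BSP.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u /* assumption: 32-byte data cache line */

/* Address of the first byte of the cache line that contains addr. */
static uint32_t cache_line_start(uint32_t addr)
{
    return addr & ~(CACHE_LINE_BYTES - 1u);
}

/* Distance in bytes from the start of the cache line to addr (0..31). */
static uint32_t cache_line_offset(uint32_t addr)
{
    return addr & (CACHE_LINE_BYTES - 1u);
}

/* Number of cache lines touched by the byte range [addr, addr + len). */
static uint32_t cache_lines_covering(uint32_t addr, uint32_t len)
{
    assert(len > 0u);
    uint32_t first = cache_line_start(addr);
    uint32_t last  = cache_line_start(addr + len - 1u);
    return (last - first) / CACHE_LINE_BYTES + 1u;
}
```

      With 4-byte alignment the offset can be any multiple of 4 from 0 to 28, which matches the range seen here. A buffer that does not start on a line boundary also shares its first (and possibly last) cache line with neighbouring data, which is why line-aligned buffers are usually recommended for DMA.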


      A solution to this, as stated in the Xilinx documentation, seems to be to cache-align the sections of memory that the BDs point to. The problem is, I have no idea how to do this.


      I have tried simply disabling the entire cache at the start of the program, however this doesn't seem to work. I don't know if it is my attempt that doesn't work, or if it simply doesn't fix the problem.

      (Code used to try to disable the entire cache)
      Xil_L2CacheDisable();
      Xil_L1DCacheDisable();
      Xil_L1ICacheDisable();
      dsb();
      isb();

        • You either can disable the
          yaro123

          You can either disable the caches or (preferably) align your buffers. In C you can do:
          typedef u8 ethernetFrameData[XEMACPS_MAX_VLAN_FRAME_SIZE] __attribute__ ((aligned(32)));

          You can disable the cache by Xil_DCacheDisable();

          Have a look at the provided ethernet examples.

          • Any time I try to disable the
            josevi

            Any time I try to disable the cache (in any manner) using any of: Xil_SetTlbAttributes(RX_BD_START_ADDRESS, 0xc02); or Xil_DCacheDisable(); or even simply aligning them using XEmacPs_BdRingCreate(RxRingPtr,(u32)&RxRingPntrBase,(u32)&RxRingPntrBase,32/*alignment*/,32/*number of BDs*/);
            I encounter an error: Direction is 1, ErrorWord is 11.
            What does this error mean? How do I look it up in the technical reference manual?

            • Where do you get this error?
              yaro123

              Where do you get this error? In which function? Try to singlestep with the debugger.

              You cannot simply use Xil_SetTlbAttributes(RX_BD_START_ADDRESS, 0xc02); because RX_BD_START_ADDRESS must be a multiple of 1 MB.

              If you call Xil_DCacheDisable() at the very beginning of your program, you will not have caching problems.
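
              The 1 MB rule can be checked with a few lines of address arithmetic. This is a minimal sketch, assuming the translation table maps memory in 1 MB sections; the helper names are hypothetical and not BSP functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTION_BYTES 0x100000u /* 1 MB section granularity (assumed) */

/* True if addr is a valid first argument under the 1 MB rule. */
static bool is_section_aligned(uint32_t addr)
{
    return (addr & (SECTION_BYTES - 1u)) == 0u;
}

/* Start address of the 1 MB section that contains addr. */
static uint32_t section_base(uint32_t addr)
{
    return addr & ~(SECTION_BYTES - 1u);
}

/* Index of the section that contains addr (one table entry per section). */
static uint32_t section_index(uint32_t addr)
{
    return addr >> 20;
}
```

              Changing the attributes of a section affects everything inside that 1 MB, so nothing that relies on being cached should share the section with the DMA buffers.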

              • In the IEEE 1588 PTP example,
                josevi

                In the IEEE 1588 PTP example, which is what I am building my project on top of, when I go to xemacps_ieee1588.h and change XEMACPS_IEEE1588_BD_ALIGNMENT from 4 to 32, I end up with an error whose output reads "Direction 2, ErrorWord 0". Where do I go to find out what this error means?
                If I try to disable the entire cache (using Xil_DCacheDisable()), my linked lists stop working.

                What does Direction 2, ErrorWord 0 mean? How can I look it up to find out more about it?

                • Buffer alignment
                  yaro123

                  I don't know the PTP example, so I cannot help you with the errors. But you still did not align your buffers (not the buffer descriptors!) in the right manner.

                  Do it as follows:
                  typedef u8 ethernetFrameData[XEMACPS_MAX_VLAN_FRAME_SIZE] __attribute__ ((aligned(32)));

                  and then define your frame buffers like ethernetFrameData myBuffer;

                  You can leave your BD alignment at 4; that is totally fine.

                  If your linked list does not work with disabled caches, then there must be something wrong with your linked list (or something that corrupts its memory).

                  • Sorry, you are right, and to
                    josevi

                    Sorry, you are right, and to top it off I wasn't clear.
                    I have an array of ethernetframedata arrays, and I can try aligning it:
                    u8 RxBuf[32][1540]__attribute__((aligned(32)));
                    That causes Direction 0, ErrorWord 2. It turns out you can parse these by looking at the GEM Register Details in the Technical Reference Manual (TRM). However, I only know that Direction 1 is a Tx error and Direction 2 is an Rx error.

                    The RxBD ring works, and the linked list storage containers work (inasmuch as the RxBD always accurately points to the RxBuffer, and the linked list correctly stores the bits copied to it).

                    I also tried an alternate approach. Instead of aligning, I edit the linker script to create a new section of memory (which I call "MEMORY_NON_CACHEABLE"), disable the cache on that section at startup (Xil_SetTlbAttributes(MEMORY_NON_CACHEABLE, 0xc02)), and tell the compiler to put the RxBuffer into that section (__attribute__ ((section("MEMORY_NON_CACHEABLE")));). It compiles and runs, but when I watch it with print statements, I can see that the buffers are at a completely different address.
                    Core 0 and core 1 address the shared memory differently; does this have something to do with the fact that I am running my program on core 1?

                    Also, thank you for your help so far Yaro123. I appreciate it.

                    • Hum...
                      josevi

                      Hum...
                      using:
                      u8 RxBuf[32][1540]__attribute__((aligned(32)));
                      with
                      Xil_DCacheInvalidateRange(((u32)BufAddr),((unsigned)BufLen));
                      And leaving the RxBD alignment alone (why was I even messing with that? Nothing was wrong with it...) does not cause any fatal errors (Direction # ErrorWord #), but it still doesn't fix the problem where the packet in the RxBuffer appears to be "hidden" by the cache, and the cache invalidate command doesn't always reveal the "front" of the packet. Still, as the program runs, the number of packets that are unidentifiable by ethertype (meaning too much of the packet has been hidden with zeros for it to be identifiable) slowly decreases.

                      Removing the cache invalidation command, and at the beginning of the program (First thing the ZedBoard ever does) disabling the entire cache still doesn't fix the problem. If I debug and break on the first unknown packet's arrival, I can see that the RxBD points to an empty place in memory, where the next packet should have been located.
                      It is here that attempting to do
                      Xil_DCacheInvalidateRange(((u32)BufAddr), ((unsigned)BufLen));
                      causes Direction 0 ErrorWord 2. Which is understandable. How could it invalidate the cache, if it is disabled?

                      While running with the entire cache disabled, the program more rapidly moves towards being able to identify all packets. Where it would take 40 minutes before it reached zero unknown packets per 1 million loops through the program, with cache disabled it gets there in about 4 minutes.

                      I have absolutely no idea why, as it runs, it would be able to identify more and more packets. It always stays in the same 32 sections of memory, where RxBuffer is allocated.

                      I'm so confused by this.

                      • I'm not sure about the
                        yaro123

                        I'm not sure about the alignment of your 2-dimensional array
                        u8 RxBuf[32][1540]__attribute__((aligned(32)));
                        Maybe the whole array will be 32-aligned, not every element.
                        Better try it this way:
                        typedef u8 ethernetFrameData[XEMACPS_MAX_VLAN_FRAME_SIZE] __attribute__ ((aligned(32)));
                        ethernetFrameData RxBuf[32];
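
                        The concern about the 2-dimensional array can be checked directly. The following is a minimal sketch using C11 _Alignas and the 1540-byte row size from this thread; the struct wrapper and all names are hypothetical, offered as one portable way to get per-element alignment rather than as the method used in the Xilinx examples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_BYTES 1540u /* row size used earlier in this thread */
#define LINE_BYTES  32u   /* assumption: 32-byte cache line */
#define NUM_FRAMES  32u

/* Plain 2-D array: only the start of the whole array is aligned. */
static _Alignas(32) uint8_t plain_buf[NUM_FRAMES][FRAME_BYTES];

/* Struct wrapper: the aligned member rounds sizeof(frame_slot) up to a
 * multiple of 32, so every array element starts on a line boundary. */
typedef struct {
    _Alignas(32) uint8_t data[FRAME_BYTES];
} frame_slot;

static frame_slot slot_buf[NUM_FRAMES];

/* Distance of row i of the plain array from the previous line boundary. */
static size_t plain_row_misalignment(size_t i)
{
    return (size_t)((uintptr_t)&plain_buf[i][0] % LINE_BYTES);
}

/* Same measurement for element i of the struct-wrapped array. */
static size_t slot_row_misalignment(size_t i)
{
    return (size_t)((uintptr_t)&slot_buf[i].data[0] % LINE_BYTES);
}
```

                        Because 1540 is not a multiple of 32, row 1 of the plain array starts 4 bytes past a line boundary, while each struct element occupies 1568 bytes and stays aligned. Rounding the row size up to a multiple of 32 by hand would have a similar effect.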

                        I now understand what you mean by direction and error word. You can use this error handler, to check for your errors:
                        void ethernetError_handler(void *Callback, u8 Direction, u32 ErrorWord) {
                            //XEmacPs *EmacPsInstancePtr = (XEmacPs *) Callback;

                            u32* availableBuffer_l = availableBuffer;
                            u32* receiveBdCount_l = receiveBdCount;

                            XEmacPs *emacPsInstancePtr = (XEmacPs *) Callback;

                            switch (Direction) {
                            case XEMACPS_RECV:
                                if (ErrorWord & XEMACPS_RXSR_HRESPNOK_MASK) {
                                    debug_ReportError("Receive DMA error");
                                }
                                if (ErrorWord & XEMACPS_RXSR_RXOVR_MASK) {
                                    debug_ReportError("Receive over run");
                                }
                                if (ErrorWord & XEMACPS_RXSR_BUFFNA_MASK) {
                                    debug_ReportError("Receive buffer not available");
                                }
                                break;
                            case XEMACPS_SEND:
                                if (ErrorWord & XEMACPS_TXSR_HRESPNOK_MASK) {
                                    debug_ReportError("Transmit DMA error");
                                }
                                if (ErrorWord & XEMACPS_TXSR_URUN_MASK) {
                                    debug_ReportError("Transmit under run");
                                }
                                if (ErrorWord & XEMACPS_TXSR_BUFEXH_MASK) {
                                    debug_ReportError("Transmit buffer exhausted");
                                }
                                if (ErrorWord & XEMACPS_TXSR_RXOVR_MASK) {
                                    debug_ReportError("Transmit retry exceeded limits");
                                    //todo: maybe restart transmission
                                }
                                if (ErrorWord & XEMACPS_TXSR_FRAMERX_MASK) {
                                    debug_ReportError("Transmit collision");
                                }
                                if (ErrorWord & XEMACPS_TXSR_USEDREAD_MASK) {
                                    debug_ReportError("Transmit buffer not available");
                                }
                                break;
                            }
                        }

                        register the handler by
                        XEmacPs_SetHandler(EmacPsInstancePtr, XEMACPS_HANDLER_ERROR, (void *) ethernetError_handler, EmacPsInstancePtr);

                        the defines are in xemacps_hw.h
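
                        For reading an ErrorWord by hand, a table-driven decoder can be sketched as follows. The bit values are assumptions about how the XEMACPS_RXSR_* and XEMACPS_TXSR_* masks are laid out and should be verified against the xemacps_hw.h in your BSP; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} status_flag;

/* Assumed layout of the GEM receive status bits (XEMACPS_RXSR_*). */
static const status_flag RX_STATUS_FLAGS[] = {
    { 0x8u, "HRESP not OK (DMA error)" },
    { 0x4u, "receive overrun" },
    { 0x2u, "frame received" },
    { 0x1u, "buffer not available" },
};

/* Assumed layout of the GEM transmit status bits (XEMACPS_TXSR_*). */
static const status_flag TX_STATUS_FLAGS[] = {
    { 0x100u, "HRESP not OK (DMA error)" },
    { 0x040u, "transmit underrun" },
    { 0x020u, "transmit complete" },
    { 0x010u, "buffers exhausted mid-frame" },
    { 0x008u, "transmit go" },
    { 0x004u, "retry limit exceeded" },
    { 0x002u, "collision" },
    { 0x001u, "used bit read" },
};

/* Stores the name of every flag set in word; returns how many were stored. */
static size_t decode_status(uint32_t word, const status_flag *table,
                            size_t table_len, const char *names[],
                            size_t max_names)
{
    size_t n = 0;
    for (size_t i = 0; i < table_len && n < max_names; i++) {
        if (word & table[i].mask) {
            names[n++] = table[i].name;
        }
    }
    return n;
}
```

                        Under these assumptions, an ErrorWord of 0x11 in the send direction would decode to "buffers exhausted mid-frame" plus "used bit read". Whether the ErrorWord values quoted in this thread were printed in decimal or hexadecimal is not stated, so they are not decoded here.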


                        It is very important that you place your BD list in non-cacheable memory.
                        Beware: when using Xil_SetTlbAttributes you must ensure that the first argument is a multiple of 0x100000, because otherwise this function maps your virtual memory to another physical address (and thus you get into trouble).
                        To get into caching (especially important with arm1) I recommend reading http://www.silica.com/fileadmin/02_Products/Productdetails/Xilinx/Zynq_MMU_caches_control_ver1.0.pdf

                        Instead of using the linker I recommend using a custom memory layout. Find an appropriate part of your memory yourself and place your BD lists and frame buffers there (and make them all uncacheable). Then you will not suffer from all those problems.
                        E.g. my layout looks as follows:
                        I use the linker script just to manage my personal layout. Thus I define the symbols:
                        _EthernetFramesStorage_startAddress = 0x0F018000;
                        _EthernetFramesStorage_maxLength = 0x9E8000;

                        _txBDList_startAddress = 0x0F00C000;
                        _txBDList_maxLength = 0xC000;

                        _rxBDList_startAddress = 0x0F000000;
                        _rxBDList_maxLength = 0xC000;
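
                        A layout like this can be sanity-checked on a host machine before it goes into the linker script. The sketch below assumes the TX list is meant to sit directly after the RX list (start 0x0F00C000); the region type and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t start;
    uint32_t length;
} region;

/* One past the last byte of the region. */
static uint32_t region_end(region r)
{
    return r.start + r.length;
}

/* True if the two regions share at least one byte. */
static bool regions_overlap(region a, region b)
{
    return a.start < region_end(b) && b.start < region_end(a);
}

/* Values taken from the symbols above; the TX start is an assumption. */
static const region RX_BD_LIST = { 0x0F000000u, 0x0000C000u };
static const region TX_BD_LIST = { 0x0F00C000u, 0x0000C000u };
static const region FRAMES     = { 0x0F018000u, 0x009E8000u };
```

                        With these numbers the three regions are contiguous, do not overlap, and end at 0x0FA00000, so together they fill exactly ten 1 MB sections starting at 0x0F000000.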

                        Then in a .h I define
                        /*
                        * Buffer descriptors and frames are allocated in uncached memory. The memory is made
                        * uncached by setting the attributes appropriately in the MMU table.
                        */
                        extern u32 _rxBDList_startAddress;
                        extern u32 _txBDList_startAddress;
                        extern u32 _EthernetFramesStorage_startAddress;
                        #define RX_BD_LIST_START_ADDRESS ((size_t)&_rxBDList_startAddress)
                        #define TX_BD_LIST_START_ADDRESS ((size_t)&_txBDList_startAddress)
                        #define ETHERNET_FRAMES_STORAGE_START_ADDRESS ((size_t)&_EthernetFramesStorage_startAddress)

                        extern u32 _rxBDList_maxLength;
                        extern u32 _txBDList_maxLength;
                        extern u32 _EthernetFramesStorage_maxLength;
                        #define RX_BD_LIST_MAX_LENGTH ((u32)&_rxBDList_maxLength)
                        #define TX_BD_LIST_MAX_LENGTH ((u32)&_txBDList_maxLength)
                        #define ETHERNET_FRAMES_STORAGE_MAX_LENGTH ((u32)&_EthernetFramesStorage_maxLength)

                        typedef u8 ethernetFrameData[XEMACPS_MAX_VLAN_FRAME_SIZE] __attribute__ ((aligned(32)));
                        typedef struct {
                            ethernetFrameData data;
                            u32 length;
                        } ethernetFrame;


                        In a .c I define
                        static ethernetFrame* ethernetFrames = (ethernetFrame*) ETHERNET_FRAMES_STORAGE_START_ADDRESS;

                        static int initBDRings(void) {
                            int status;

                            XEmacPs_Bd BdTemplate;

                            /*
                             * Setup RxBD space.
                             *
                             * Setup a BD template for the Rx channel. This template will be
                             * copied to every RxBD. We will not have to explicitly set these
                             * again.
                             */
                            XEmacPs_BdClear(&BdTemplate);

                            /*
                             * Create the RxBD ring with 2*RECEIVE_FIFO_ATTENDENCE BDs, so that there are always enough BDs to be allocated, even if not all have been returned yet
                             */
                            status = XEmacPs_BdRingCreate(&(XEmacPs_GetRxRing(usedEmacPsInstancePtr)), RX_BD_LIST_START_ADDRESS, RX_BD_LIST_START_ADDRESS, XEMACPS_BD_ALIGNMENT,
                                    2 * RECEIVE_FIFO_ATTENDENCE);
                            if (status != XST_SUCCESS) {
                                debug_ReportError("Error setting up RxBD space, BdRingCreate");
                                return XST_FAILURE;
                            }

                            status = XEmacPs_BdRingClone(&(XEmacPs_GetRxRing(usedEmacPsInstancePtr)), &BdTemplate, XEMACPS_RECV);
                            if (status != XST_SUCCESS) {
                                debug_ReportError("Error setting up RxBD space, BdRingClone");
                                return XST_FAILURE;
                            }

                            /*
                             * Setup TxBD space.
                             *
                             * Like RxBD space, we have already defined a properly aligned area
                             * of memory to use.
                             *
                             * Also like the RxBD space, we create a template.
                             * The "last" attribute is set, s.t. every BD contains a full frame.
                             */
                            XEmacPs_BdClear(&BdTemplate);
                            XEmacPs_BdSetLast(&BdTemplate);
                            XEmacPs_BdSetStatus(&BdTemplate, XEMACPS_TXBUF_USED_MASK);

                            /*
                             * Create the TxBD ring
                             */
                            status = XEmacPs_BdRingCreate(&(XEmacPs_GetTxRing(usedEmacPsInstancePtr)), TX_BD_LIST_START_ADDRESS, TX_BD_LIST_START_ADDRESS, XEMACPS_BD_ALIGNMENT,
                                    ETHERNET_FRAME_RESERVE);
                            if (status != XST_SUCCESS) {
                                debug_ReportError("Error setting up TxBD space, BdRingCreate");
                                return XST_FAILURE;
                            }
                            status = XEmacPs_BdRingClone(&(XEmacPs_GetTxRing(usedEmacPsInstancePtr)), &BdTemplate, XEMACPS_SEND);
                            if (status != XST_SUCCESS) {
                                debug_ReportError("Error setting up TxBD space, BdRingClone");
                                return XST_FAILURE;
                            }

                            return XST_SUCCESS;
                        }

                        For caching I use a wrapper to make changing the page table safe (with defines from the link I posted):
                        #define L1_NON_CACHEABLE 0x00  //  C = b0, B = b0
                        #define L1_WRITEBACK_WRITEALLOCATE  0x0004  //  C = b0, B = b1
                        #define L1_WRITETHROUGH_NO_WRITEALLOCATE 0x0008  // C = b1, B = b0
                        #define L1_WRITEBACK_NO_WRITEALLOCATE 0x000C  //  C = b1, B = b1


                        #define L2_NON_CACHEABLE 0x00  // TEX(1:0) = b00, C = b0, B = b0
                        #define L2_WRITEBACK_WRITEALLOCATE  0x1000  // TEX(1:0) = b01, C = b0, B = b1
                        #define L2_WRITETHROUGH_NO_WRITEALLOCATE 0x2000  // TEX(1:0) = b10, C = b1, B = b0
                        #define L2_WRITEBACK_NO_WRITEALLOCATE 0x3000  // TEX(1:0) = b11, C = b1, B = b1


                        //L1 and L2
                        #define NON_CACHEABLE  (L1_NON_CACHEABLE | L2_NON_CACHEABLE)
                        #define WRITEBACK_WRITEALLOCATE   (L1_WRITEBACK_WRITEALLOCATE | L2_WRITEBACK_WRITEALLOCATE)
                        #define WRITETHROUGH_NO_WRITEALLOCATE  (L1_WRITETHROUGH_NO_WRITEALLOCATE | L2_WRITETHROUGH_NO_WRITEALLOCATE)
                        #define WRITEBACK_NO_WRITEALLOCATE  (L1_WRITEBACK_NO_WRITEALLOCATE | L2_WRITEBACK_NO_WRITEALLOCATE)


                        #define NON_GLOBAL 0x20000  // nG = b1

                        #define EXECUTE_NEVER 0x10  // XN = b1

                        #define SHAREABLE 0x10000  // S = b1

                        #define AP_PERMISSIONFAULT 0x00  // AP(2) = b0, AP(1:0) = b00
                        #define AP_PRIVIEGED_ACCESS_ONLY 0x400  // AP(2) = b0, AP(1:0) = b01
                        #define AP_NO_USERMODE_WRITE 0x800  // AP(2) = b0, AP(1:0) = b10
                        #define AP_FULL_ACCESS 0xC00  // AP(2) = b0, AP(1:0) = b11
                        #define AP_PRIVILEGED_READ_ONLY 0x8800  // AP(2) = b1, AP(1:0) = b10

                        void adjustMmuMode_1MBGranularity(u32 address, u32 length, u32 features) {
                            unsigned int mmu_attributes = 0;

                            /* Declare the part of the page table value that gets written to the */
                            /* MMU Table, which is always fixed. */
                            /* NS = b0, Bit 18 = b0, TEX(2) = b1, Bit 9 = b0, Domain = b1111, */
                            /* Bits(1:0) = b10 ... Equivalent hex value = 0x41e2 */
                            const u32 fixed_values = 0x41e2;

                            // Calculate the value that will be written to the MMU Page Table
                            mmu_attributes = fixed_values | features;

                            // Write the value to the TLB
                            u32 onePastLastAddress = address + length;
                            u32 roundedAddress = address & 0xFFF00000;
                            for (; roundedAddress < onePastLastAddress; roundedAddress += 0x100000) {
                                Xil_SetTlbAttributes(roundedAddress, mmu_attributes);
                            }
                        }


                        When running the program, first make sure to change the page table settings:
                        adjustMmuMode_1MBGranularity(RX_BD_LIST_START_ADDRESS, RX_BD_LIST_MAX_LENGTH, NON_CACHEABLE | AP_FULL_ACCESS | SHAREABLE);
                        adjustMmuMode_1MBGranularity(TX_BD_LIST_START_ADDRESS, TX_BD_LIST_MAX_LENGTH, NON_CACHEABLE | AP_FULL_ACCESS | SHAREABLE);
                        adjustMmuMode_1MBGranularity(ETHERNET_FRAMES_STORAGE_START_ADDRESS, ETHERNET_FRAMES_STORAGE_MAX_LENGTH, NON_CACHEABLE | AP_FULL_ACCESS | SHAREABLE);
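
                        To see what these calls end up writing, the arithmetic can be reproduced on a host machine. This is a minimal sketch that copies the constants from the defines above under hypothetical names; it does not touch the MMU and only mirrors the loop bounds of the wrapper.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_FIXED_BITS 0x41e2u   /* fixed part used by the wrapper */
#define ATTR_NON_CACHEABLE 0x00000u  /* L1 and L2 non-cacheable */
#define ATTR_AP_FULL       0x00C00u  /* AP(1:0) = b11 */
#define ATTR_SHAREABLE     0x10000u  /* S = b1 */

/* Attribute word that would be passed to Xil_SetTlbAttributes. */
static uint32_t section_descriptor_attrs(uint32_t features)
{
    return SECTION_FIXED_BITS | features;
}

/* Number of 1 MB sections the wrapper's loop visits for a given range. */
static uint32_t sections_touched(uint32_t address, uint32_t length)
{
    uint32_t end = address + length; /* one past the last byte */
    uint32_t count = 0;
    for (uint32_t a = address & 0xFFF00000u; a < end; a += 0x100000u) {
        count++;
    }
    return count;
}
```

                        With the features used here the attribute word comes out as 0x14DE2, and the frame storage range (0x0F018000, length 0x9E8000) covers ten sections. Everything else that lives in those sections becomes uncached as well, so the range should be reserved for DMA data only.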



                        I do not know how you configure your PHY, but try
                        #define PHY_REG0_RESET    0x8000
                        #define PHY_REG0_10       0x0100
                        #define PHY_REG0_100      0x2100
                        #define PHY_REG0_1000     0x0140
                        #define PHY_REG21_10      0x0030
                        #define PHY_REG21_100     0x2030
                        #define PHY_REG21_1000    0x0070
                        static int initPhy(XEmacPs * EmacPsInstancePtr) {
                            int Status;
                            u32 PhyAddr = 0;
                            u16 PhyReg0 = PHY_REG0_1000; //gigabit (no more)
                            u16 PhyReg21  = PHY_REG21_1000;
                            u16 PhyReg22  = 0;

                            Status = XEmacPs_PhyWrite(EmacPsInstancePtr, PhyAddr, 0, PhyReg0);
                            /*
                             * Make sure new configuration is in effect
                             */
                            Status = XEmacPs_PhyRead(EmacPsInstancePtr, PhyAddr, 0, &PhyReg0);
                            if (Status != XST_SUCCESS) {
                                debug_ReportError("Error setup phy speed");
                                return XST_FAILURE;
                            }

                            /*
                             * Switching to PAGE2
                             */
                            PhyReg22 = 0x2;
                            Status = XEmacPs_PhyWrite(EmacPsInstancePtr, PhyAddr, 22, PhyReg22);

                            /*
                             * Adding Tx and Rx delay. Configuring loopback speed.
                             */
                            Status = XEmacPs_PhyWrite(EmacPsInstancePtr, PhyAddr, 21, PhyReg21);
                            /*
                             * Make sure new configuration is in effect
                             */
                            Status = XEmacPs_PhyRead(EmacPsInstancePtr, PhyAddr, 21, &PhyReg21);
                            if (Status != XST_SUCCESS) {
                                debug_ReportError("Error setting Reg 21 in Page 2");
                                return XST_FAILURE;
                            }
                            /*
                             * Switching to PAGE0
                             */
                            PhyReg22 = 0x0;
                            Status = XEmacPs_PhyWrite(EmacPsInstancePtr, PhyAddr, 22, PhyReg22);

                            /*
                             * Issue a reset to phy
                             */
                            Status = XEmacPs_PhyRead(EmacPsInstancePtr, PhyAddr, 0, &PhyReg0);
                            PhyReg0 |= PHY_REG0_RESET;
                            Status = XEmacPs_PhyWrite(EmacPsInstancePtr, PhyAddr, 0, PhyReg0);

                            Status = XEmacPs_PhyRead(EmacPsInstancePtr, PhyAddr, 0, &PhyReg0);
                            if (Status != XST_SUCCESS) {
                                debug_ReportError("Error reset phy");
                                return XST_FAILURE;
                            }

                            /*
                             * Delay loop
                             */
                            sleep(EMACPS_PHY_DELAY_SEC);

                            return XST_SUCCESS;
                        }


                        I think that it should not be a problem to run the MAC from arm1, but there could be problems if your arm0 writes to memory it shouldn't write to. And much more importantly, there will be huge problems if you use caching with both arms and one of them invalidates or flushes L2 cache lines over unaligned memory. Normally only arm0 should have control over the L2 cache, and arm1 should ask arm0 to safely flush or invalidate the cache.
                        There really are a lot of pitfalls. If you do not know exactly what you are doing, turn your caches off. Otherwise you will lose random bytes (depending on race conditions).
                        A wise guy said there are only two hard things in Computer Science: cache invalidation and naming things.



                        Now to your second post:
                        I think you messed it up with SetTlbAttributes.
                        And then disabling the cache from arm1 can cause problems. On arm1 you cannot use the Xil_DCacheInvalidateRange or the Xil_DCacheDisable function, since arm0 controls the L2 cache, too. You should instead just switch caching off in the page table (MMU table) with Xil_SetTlbAttributes.
                        You should try to run your program from arm0 alone. When it works, you can think about porting it to arm1 and clarifying all that cache and MMU stuff.



                        Man... that's a lot... I admit that I did not read the text twice, because I have been writing it for an hour and have to go to sleep now. So if there are some mistakes, be merciful.

                        • Wow. You're incredible Yaro.
                          josevi

                          Wow. You're incredible, Yaro. Thanks to your help, and the help of a co-worker, I have been able to fix my program. I stopped trying to turn off the cache (bad idea), I stopped trying to invalidate cache lines (not a bad idea, but not the best idea), and instead used a double pointer and a struct in place of my array of arrays to force the RxBuffer into a specific section of memory. Then I was able to use Xil_SetTlbAttributes to disable the cache over that area of memory. And it works! I wanted to run around the office yelling when I first saw it working. I've been working on this for a while (I'm a part-time intern) and to see it finally working is incredible.

                          Relevant code, in case anyone else reads this looking for solutions:
                          #define RX_BUF_ADDRESS 0x1FF00000
                          #define RxBufMem (unsigned char **) RX_BUF_ADDRESS
                          typedef struct packet {
                              u8 data[XEMACPS_RX_BUF_SIZE];
                          } Packet;
                          Packet *packets;

                          int main(void)
                          {
                              Xil_SetTlbAttributes(RX_BUF_ADDRESS, 0xc02);
                              packets = (Packet*)RX_BUF_ADDRESS;
                          ...

                          Only minor changes had to be made to the rest of the Xilinx SDK XEMACPS IEEE1588 PTP example program in order to get everything to work.
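
                          For anyone wondering what an attribute value such as 0xc02 encodes, the fields can be pulled apart as sketched below. The bit positions follow the ARMv7-A short-descriptor section format as used by the defines earlier in this thread; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t type;   /* bits [1:0], 0x2 means "section" */
    uint32_t b;      /* bit 2, bufferable */
    uint32_t c;      /* bit 3, cacheable */
    uint32_t xn;     /* bit 4, execute never */
    uint32_t domain; /* bits [8:5] */
    uint32_t ap;     /* bits [11:10], access permissions */
    uint32_t tex;    /* bits [14:12] */
    uint32_t ap2;    /* bit 15 */
    uint32_t s;      /* bit 16, shareable */
} section_attrs;

/* Splits the low attribute bits of a section entry into named fields. */
static section_attrs decode_section_attrs(uint32_t word)
{
    section_attrs a;
    a.type   = word & 0x3u;
    a.b      = (word >> 2) & 0x1u;
    a.c      = (word >> 3) & 0x1u;
    a.xn     = (word >> 4) & 0x1u;
    a.domain = (word >> 5) & 0xFu;
    a.ap     = (word >> 10) & 0x3u;
    a.tex    = (word >> 12) & 0x7u;
    a.ap2    = (word >> 15) & 0x1u;
    a.s      = (word >> 16) & 0x1u;
    return a;
}
```

                          Decoded this way, 0xc02 is a section entry with full access permissions (AP = b11) and TEX, C and B all zero, so no caching is requested for that 1 MB. How the hardware treats that combination in detail depends on the TEX remap setting, so the ARM documentation is the place to confirm it.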

                          • great!
                            yaro123

                            great!