Saturday, January 19, 2013

Optimize memcpy(), Faster memcpy()

A primitive implementation of memcpy function would look like this:

 void basic_memcp(char *des, char *src, uint32 len)   
 {  
   uint32 i;  
   for(i=0;i<len;i++)  
   {  
    *des++ = *src++;  
   }  
 }  



Question is to improve on this.
Improvement can be  made in terms that, the above function is not doing overlap check of source and destination buffers.

But, we are looking on performance improvement in term of execution time.
Normally, CPU runs at much faster clock than the memory. Which means in system memory access is slowest operation. So, if we can reduce number of memory access during any operation, performance is bound to improve. 

How can we reduce number of memory accesses, its by transferring data in term of word size. Many system has word size as 4. So in above function, if instead of accessing data byte by byte, we access data in term of 4 bytes, execution time should reduce.

But, solution is not straight forward, both src and/or destination pointers can be unaligned. Many system do not support unaligned memory access. Even on the platform it is supported, its more costly than aligned memory access. So we have to make sure that all memory accesses are word aligned.

I have tried to solve one of such case, where destination address is assumed as word aligned and source address is  starting at 1 byte after the word boundry.
 /*Assumptions  
 a) destination is word aligned  
 b) source is at 3 bytes before next word aligned boundry  
 c) focusing more on main concept, avoiding case where len is less than 4  
 d) machine is little endian  
 */  
 void memcpy_1(char *des, char *src, uint32 len)  
 {  
   uint32 srcData;  
   uint32 desData;  
   uint32 tempLen=len/4;  
   uint8 i;  
   uint32 *desPtr32 = (uint32*)des;  
   uint32 *srcPtr32 = (uint32*)(src-1);  
   uint8 *desPtr8 ;  
   uint8 *srcPtr8 ;  
   srcData = *srcPtr32;  
   while(tempLen)  
   {  
    /* chuck out msb 8 bits, DO remember its Little Endian*/  
    desData = srcData >> 8;   
    srcData = *(++srcPtr32);  
    /*Fill the last byte, Again Little Endian*/  
    desData |= (srcData <<24);  
    *desPtr32++ = desData;  
    tempLen--;  
   }  
   desPtr8 = (uint8*)desPtr32;  
   srcPtr8 = (uint8*)srcPtr32;  
   srcPtr8++;  
   /*Cover the case for leftover from above loop*/  
   for(i=0; i<(len%4); i++)  
   {  
    *desPtr8++ = *srcPtr8++;  
   }  
 }  


It should be noted that there are other cases, where we have to align that particular case in particular way(e.g. source address is at 2 bytes from next aligned address and destination address is 1 byte from next aligned address etc,). But, the concept remains the same .

I have also written a test program(tested on visual studio) which will show the performance gain of this approach.  

File can be downloaded from here.


No comments: