It appears the best pair of copy instructions is movq to load and movntq to store. movntq is a weakly ordered, non-temporal store, so an sfence is necessary at the end of the copy loop. Similarly, (p)xor-ing a register with itself and then storing it with movntq is the best way to zero memory. We can assume SSE. It would be nice to know whether there is an advantage in interleaving SSE XMM-register loads and non-temporal stores with the MMX MM-register ones (movntq itself only takes MM registers; movntps is the SSE store for XMM registers).
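The movq/movntq copy loop and the pxor/movntq zeroing loop described above can be sketched with compiler intrinsics. This is only a minimal illustration, not a tuned implementation: the function names nt_copy64 and nt_zero64 are hypothetical, and the sketch assumes an x86 compiler with SSE enabled, 8-byte-aligned buffers, and lengths that are a multiple of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_stream_pi, _mm_sfence; also pulls in the MMX intrinsics */

/* Hypothetical helper: copy n bytes using 64-bit MMX loads (movq) and
   non-temporal stores (movntq). Assumes 8-byte alignment and n % 8 == 0. */
void nt_copy64(void *dst, const void *src, size_t n)
{
    assert((uintptr_t)dst % 8 == 0 && (uintptr_t)src % 8 == 0 && n % 8 == 0);
    __m64 *d = (__m64 *)dst;
    const __m64 *s = (const __m64 *)src;
    for (size_t i = 0; i < n / 8; i++)
        _mm_stream_pi(&d[i], s[i]);  /* load 8 bytes, store them non-temporally */
    _mm_sfence();  /* order the weakly ordered stores before later stores */
    _mm_empty();   /* emms: release the MMX state so x87 code works afterwards */
}

/* Hypothetical helper: zero n bytes with a cleared MM register and movntq.
   Same alignment and length assumptions as nt_copy64. */
void nt_zero64(void *dst, size_t n)
{
    assert((uintptr_t)dst % 8 == 0 && n % 8 == 0);
    __m64 *d = (__m64 *)dst;
    const __m64 zero = _mm_setzero_si64();  /* typically a pxor of the register with itself */
    for (size_t i = 0; i < n / 8; i++)
        _mm_stream_pi(&d[i], zero);
    _mm_sfence();
    _mm_empty();
}
```

A real loop would normally be unrolled and would need to handle unaligned heads and tails; both are left out here to keep the instruction pairing visible.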
There's specific coverage of using non-temporal stores and prefetching in section 9.7 of the Intel optimization manual:
http://www.intel.com/design/processor/manuals/248966.pdf
http://cdrom.amd.com/devconn/events/AMD_block_prefetch_paper.pdf
We are using a 32-bit copy loop with a performance of around 640 MB/s (at 2001 bus speeds, DDR2100), whereas the best copy loop achieves 1976 MB/s, roughly three times as fast, and this is without using 128-bit XMM registers.
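For comparison, a 128-bit XMM variant combined with software prefetching might look like the sketch below. It is not the loop from either paper: the function name nt_copy128, the 64-byte cache line, and the 256-byte prefetch distance are all assumptions made for illustration, and the buffers are assumed to be 16-byte aligned with a length that is a multiple of 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_load_ps, _mm_stream_ps, _mm_prefetch, _mm_sfence */

enum {
    CACHE_LINE = 64,       /* assumed cache line size in bytes */
    PREFETCH_AHEAD = 256   /* hypothetical prefetch distance; tune per machine */
};

/* Hypothetical helper: copy n bytes with 128-bit loads (movaps) and
   non-temporal stores (movntps), prefetching the source ahead of the loads.
   Assumes 16-byte alignment and n % 16 == 0. */
void nt_copy128(void *dst, const void *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 16 == 0);
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < n; i += 16) {
        /* One prefetch per cache line, and only while still inside the buffer. */
        if (i % CACHE_LINE == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch(s + i + PREFETCH_AHEAD, _MM_HINT_NTA);
        __m128 v = _mm_load_ps((const float *)(s + i));  /* bitwise 16-byte load */
        _mm_stream_ps((float *)(d + i), v);              /* non-temporal store */
    }
    _mm_sfence();  /* order the weakly ordered stores before later stores */
}
```

Whether this beats the MMX loop, or whether interleaving the two helps, would have to be measured on the target hardware.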