AMD releases software optimization guide for Bulldozer

with some details of its upcoming CPU architecture

Eight months ago, AMD finally provided some details of its long-awaited AMD Bulldozer CPU architecture. The Zambezi CPU will be the first chip based on the new architecture, and AMD plans to launch it the week of June 20 of this year. From Citavia comes an extensive document of no fewer than 358 pages that gives software developers a guide to optimizing their applications for the new CPUs, and that indirectly reveals some extra details as well.

Among the highlights of the document:

Page 25

1.6.4 Instruction Fetching Improvements
While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors have two 32-byte fetch windows, from which four µops can be selected. These fetch windows, when combined with the 128-bit floating-point execution unit, allow the processor to sustain a fetch/dispatch/retire sequence of four instructions per cycle.

Page 26

1.6.6 Notable Performance Improvements
Several enhancements to the AMD64 architecture have resulted in significant performance improvements in AMD Family 15h processors, including:
• Improved performance of shuffle instructions
• Improved data transfer between floating-point registers and general purpose registers
• Improved floating-point register to floating-point register moves
• Optimization of repeated move instructions
• More efficient PUSH/POP stack operations
• 1-Gbyte paging
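
To put the 1-Gbyte paging item in perspective, here is a minimal sketch that counts how many page translations are needed to map a region of memory with each AMD64 page size (4 KiB, 2 MiB, 1 GiB). The 8 GiB working set and the function names are hypothetical, chosen only for illustration; the guide itself does not include this calculation.

```python
# Page sizes available on AMD64 long mode, in bytes.
PAGE_SIZES = {
    "4 KiB": 4 * 1024,
    "2 MiB": 2 * 1024**2,
    "1 GiB": 1024**3,
}


def pages_needed(region_bytes, page_bytes):
    """Number of pages (and so translations) needed to map a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


def translation_counts(region_bytes):
    """Map each page-size label to the page count for the given region."""
    return {name: pages_needed(region_bytes, size)
            for name, size in PAGE_SIZES.items()}


if __name__ == "__main__":
    region = 8 * 1024**3  # hypothetical 8 GiB working set
    for name, count in translation_counts(region).items():
        print(f"{name} pages: {count}")
```

Fewer, larger pages mean fewer translations competing for TLB entries, which is why large-page support tends to matter for applications with very large working sets.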

Page 30

2.1 Key Microarchitecture Features
AMD Family 15h processors include many features designed to improve software performance. The internal design, or microarchitecture, of these processors provides the following key features:
• Integrated DDR3 memory controller with memory prefetcher
• 64-Kbyte L1 instruction cache and 16-Kbyte L1 data cache
• Shared L2 cache between cores of compute unit
• Shared L3 cache between compute units on chip (for supported platforms)
• 32-byte instruction fetch
• Instruction predecode and branch prediction during cache-line fills
• Decoupled prediction and instruction fetch pipelines
• Four-way AMD64 instruction decoding (This is a theoretical limit. See section 2.3 on page 31.)
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
• Two-way 128-bit wide floating-point execution
• Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).
• Superforwarding
• Prefetch into L2 or L1 data cache
• Deep out-of-order integer and floating-point execution
• HyperTransport™ technology
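
Since the list above mentions XOP, FMA4 and AVX, software that wants to use them needs to check for support at run time. The following is a rough sketch, assuming a Linux system where `/proc/cpuinfo` exposes a `flags` line with the names `avx`, `xop` and `fma4`; the helper names are hypothetical, and other operating systems would need a CPUID-based check instead.

```python
# Flag names as they appear on the "flags" line of Linux /proc/cpuinfo
# (assumption: Linux x86; this is not taken from the AMD guide).
WANTED = ("avx", "xop", "fma4")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supported_extensions(cpuinfo_text, wanted=WANTED):
    """Map each wanted extension name to True/False."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file unreadable: report everything as absent
    print(supported_extensions(text))
```

Parsing text keeps the sketch dependency-free; production code would more likely query CPUID directly or rely on a compiler's built-in feature checks.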

Page 34

The minimum branch misprediction penalty is 20 cycles in the case of conditional and indirect branches and 15 cycles for unconditional direct branches and returns.
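
As a back-of-the-envelope illustration of what these penalties mean, the sketch below turns misprediction counts into a lower bound on lost cycles. Only the 20-cycle and 15-cycle figures come from the quoted text; the function names, the branch fraction and the misprediction rate are hypothetical inputs, and real penalties can be higher than the stated minimum.

```python
# Minimum misprediction penalties quoted from the guide, in cycles.
MIN_PENALTY = {
    "conditional": 20,
    "indirect": 20,
    "unconditional_direct": 15,
    "return": 15,
}


def min_cycles_lost(mispredictions):
    """Lower bound on cycles lost, given mispredict counts per branch kind."""
    return sum(MIN_PENALTY[kind] * count
               for kind, count in mispredictions.items())


def min_cpi_overhead(branch_fraction, mispredict_rate, penalty_cycles=20):
    """Lower bound on extra cycles per instruction caused by mispredictions.

    branch_fraction: share of executed instructions that are branches.
    mispredict_rate: share of those branches that are mispredicted.
    """
    return branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    # Hypothetical profile: 1000 mispredicted conditionals, 100 mispredicted returns.
    print(min_cycles_lost({"conditional": 1000, "return": 100}))
    # Hypothetical workload: 20% branches, 5% of them mispredicted.
    print(min_cpi_overhead(0.20, 0.05))
```

Even with modest misprediction rates the overhead is significant next to a four-instructions-per-cycle peak, which is why the guide spends time on branch behavior.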

Conclusions

An interesting document that confirms previously published articles on the inclusion of macro-op fusion and on the supported instruction sets. There are also significant changes relative to the now-old K10.5 architecture: Bulldozer duplicates many of the hardware areas that were that architecture's weak point, and improves the efficiency of many of its units.

You can read AMD's complete document at this link.

Link: AMD Bulldozer Software Optimization Guide is online (Citavia)