
Compiling Intrinsics with GCC

戚晨
2023-12-01

http://www.linuxjournal.com/content/introduction-gcc-compiler-intrinsics-vector-processing?page=0,1

http://stackoverflow.com/questions/7156908/sse-intrinsic-functions-reference

Table 1. GCC Command-Line Options to Generate SIMD Code

Processor             Options
X86/MMX/SSE1/SSE2     -mfpmath=sse -mmmx -msse -msse2
ARM Neon              -mfpu=neon -mfloat-abi=softfp
Freescale Altivec     -maltivec -mabi=altivec

Here are the include files you need:

  • arm_neon.h - ARM Neon types & intrinsics
  • altivec.h - Freescale Altivec types & intrinsics
  • mmintrin.h - X86 MMX
  • xmmintrin.h - X86 SSE1
  • emmintrin.h - X86 SSE2
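One way to keep a single source file portable across these targets is to select the header from the macros the compiler predefines. The sketch below assumes GCC-style predefined macros (__ARM_NEON or __ARM_NEON__, __ALTIVEC__, __SSE2__, __SSE__, __MMX__), which are normally set when the options from Table 1 are in effect. SIMD_NAME and simd_name() are hypothetical names used only for illustration.

```c
/* Include only the intrinsics header that matches the enabled target flags.
 * The macro tests assume GCC-style predefined macros. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>      /* ARM Neon types & intrinsics */
#define SIMD_NAME "ARM Neon"
#elif defined(__ALTIVEC__)
#include <altivec.h>       /* Altivec types & intrinsics */
#define SIMD_NAME "Altivec"
#elif defined(__SSE2__)
#include <emmintrin.h>     /* X86 SSE2 (also pulls in SSE1 and MMX) */
#define SIMD_NAME "X86 SSE2"
#elif defined(__SSE__)
#include <xmmintrin.h>     /* X86 SSE1 */
#define SIMD_NAME "X86 SSE1"
#elif defined(__MMX__)
#include <mmintrin.h>      /* X86 MMX */
#define SIMD_NAME "X86 MMX"
#else
#define SIMD_NAME "scalar only"
#endif

/* Hypothetical helper: report which header was selected at compile time. */
const char *simd_name(void)
{
    return SIMD_NAME;
}
```

This only reflects what the compiler was told to generate; it says nothing about the processor the binary eventually runs on.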

X86: MMX, SSE, SSE2 Types and Debugging

X86-compatible processors with MMX, SSE1 and SSE2 provide the following vector types:

  • MMX: __m64, 64 bits of integers, broken down as eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
  • SSE1: __m128, 128 bits holding four single-precision floats.
  • SSE2: __m128i, 128 bits of packed integers of any size, and __m128d, 128 bits holding two doubles.
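A common debugging trick is to overlay a vector type with a plain array so individual lanes can be printed or inspected. The following is a minimal sketch of that idea for __m128, not code from the original article: vec4f, add4 and print4 are hypothetical names, and the scalar branch exists only so the file still compiles when SSE is not enabled.

```c
#include <stdio.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* __m128 and _mm_add_ps */
#endif

/* Hypothetical helper type: the same 128 bits seen as a vector or as
 * four floats. GCC permits reading a union member other than the one
 * last written. */
typedef union {
#if defined(__SSE__)
    __m128 v;
#endif
    float f[4];
} vec4f;

/* Add two vectors lane by lane. */
vec4f add4(vec4f a, vec4f b)
{
    vec4f r;
#if defined(__SSE__)
    r.v = _mm_add_ps(a.v, b.v);          /* four additions in one instruction */
#else
    for (int i = 0; i < 4; i++)          /* scalar fallback */
        r.f[i] = a.f[i] + b.f[i];
#endif
    return r;
}

/* Print the four lanes, which is the debugging use of the union. */
void print4(const char *label, vec4f x)
{
    printf("%s: %g %g %g %g\n", label, x.f[0], x.f[1], x.f[2], x.f[3]);
}
```

Calling add4 on {1, 2, 3, 4} and {10, 20, 30, 40} and passing the result to print4 shows each lane of the sum separately.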

Table 2. Subset of vector operators and intrinsics used in the examples.

Operation           Altivec         Neon                    MMX/SSE/SSE2
loading vector      vec_ld          vld1q_f32               _mm_set_epi16
                    vec_splat       vld1q_s16               _mm_set1_epi16
                    vec_splat_s16   vsetq_lane_f32          _mm_set1_pi16
                    vec_splat_s32   vld1_u8                 _mm_set_pi16
                    vec_splat_s8    vdupq_lane_s16          _mm_load_ps
                    vec_splat_u16   vdupq_n_s16             _mm_set1_ps
                    vec_splat_u32   vmovq_n_f32             _mm_loadh_pi
                    vec_splat_u8    vset_lane_u8            _mm_loadl_pi
storing vector      vec_st          vst1_u8                 -
                    -               vst1q_s16               _mm_store_ps
                    -               vst1q_f32               -
                    -               vst1_s16                -
add                 vec_madd        vaddq_s16               _mm_add_epi16
                    vec_mladd       vaddq_f32               _mm_add_pi16
                    vec_adds        vmlaq_n_f32             _mm_add_ps
subtract            vec_sub         vsubq_s16               -
multiply            vec_madd        vmulq_n_s16             _mm_mullo_epi16
                    vec_mladd       vmulq_s16               _mm_mullo_pi16
                    -               vmulq_f32               _mm_mul_ps
                    -               vmlaq_n_f32             -
arithmetic shift    vec_sra         vshrq_n_s16             _mm_srai_epi16
                    vec_srl         -                       _mm_srai_pi16
                    vec_sr          -                       -
byte permutation    vec_perm        vtbl1_u8                _mm_shuffle_pi16
                    vec_sel         vtbx1_u8                _mm_shuffle_ps
                    vec_mergeh      vget_high_s16           -
                    vec_mergel      vget_low_s16            -
                    -               vdupq_lane_s16          -
                    -               vdupq_n_s16             -
                    -               vmovq_n_f32             -
                    -               vbsl_u8                 -
type conversion     vec_cts         vmovl_u8                _mm_packs_pu16
                    vec_unpackh     vreinterpretq_s16_u16   -
                    vec_unpackl     vcvtq_u32_f32           -
                    vec_cts         vqmovn_s32              _mm_cvtps_pi16
                    vec_ctu         vqmovun_s16             _mm_packus_epi16
                    -               vqmovn_u16              -
                    -               vcvtq_f32_s32           -
                    -               vmovl_s16               -
                    -               vmovq_n_f32             -
vector combination  vec_pack        vcombine_u16            -
                    vec_packsu      vcombine_u8             -
                    -               vcombine_s16            -
maximum             -               -                       _mm_max_ps
minimum             -               -                       _mm_min_ps
vector logic        -               -                       _mm_andnot_ps
                    -               -                       _mm_and_ps
                    -               -                       _mm_or_ps
rounding            vec_trunc       -                       -
misc                -               -                       _mm_empty
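To show how a few of the SSE2 entries from Table 2 combine, here is a small fixed-point scaling sketch: eight 16-bit samples are multiplied by a gain and shifted right. It uses _mm_set1_epi16, _mm_mullo_epi16 and _mm_srai_epi16 from the table, plus the unaligned load/store intrinsics _mm_loadu_si128 and _mm_storeu_si128, which are not in the table. The names scale8, scale8_scalar and FRAC_BITS are hypothetical, and the example assumes every product fits in 16 bits.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* __m128i and the SSE2 integer intrinsics */
#endif

enum { FRAC_BITS = 4 };  /* hypothetical fixed-point shift */

/* Scalar reference: out[i] = (in[i] * gain) >> FRAC_BITS.
 * Assumes in[i] * gain fits in 16 bits and that >> on a negative
 * value is an arithmetic shift, as it is with GCC. */
void scale8_scalar(const int16_t *in, int16_t gain, int16_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)((in[i] * gain) >> FRAC_BITS);
}

/* Vector version: processes all eight samples at once when SSE2 is enabled. */
void scale8(const int16_t *in, int16_t gain, int16_t *out)
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)in); /* load 8 x int16 */
    __m128i g = _mm_set1_epi16(gain);                 /* broadcast the gain */
    __m128i p = _mm_mullo_epi16(v, g);                /* low 16 bits of each product */
    p = _mm_srai_epi16(p, FRAC_BITS);                 /* arithmetic shift right */
    _mm_storeu_si128((__m128i *)out, p);              /* store 8 x int16 */
#else
    scale8_scalar(in, gain, out);                     /* no SSE2: scalar path */
#endif
}
```

Keeping the scalar version next to the vector one makes it easy to compare the two outputs element by element when testing.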

 

 

Check Processor at Runtime

Next, your code should check the processor at runtime to see whether it supports the vector instructions you built for. If you have no vector code path for that processor, fall back to your scalar code. If you do have one, and it is faster, use it. On X86, test processor features with the cpuid instruction, which GCC exposes through <cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We couldn't find anything that well established for Altivec and Neon, so the examples there parse /proc/cpuinfo. (Serious code might execute a test SIMD instruction instead: if the processor raises a SIGILL signal when it encounters that instruction, it does not have that feature.)
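The X86 check can be sketched with the __get_cpuid helper and the bit_SSE2 mask from GCC's <cpuid.h>. This is a minimal illustration rather than the code from the article's samples: cpu_has_sse2, pick_code_path and the code_path enum are hypothetical names, and on non-X86 targets the sketch simply reports the scalar path.

```c
#include <stdbool.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>       /* __get_cpuid and the bit_* feature masks */
#endif

/* Hypothetical enum naming the code paths a program might ship. */
enum code_path { PATH_SCALAR, PATH_SSE2 };

/* True when CPUID leaf 1 reports SSE2 support in EDX. */
bool cpu_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                    /* leaf 1 not available */
    return (edx & bit_SSE2) != 0;
#else
    return false;                        /* not an X86 target */
#endif
}

/* Choose the vector path only when the processor supports it. */
enum code_path pick_code_path(void)
{
    return cpu_has_sse2() ? PATH_SSE2 : PATH_SCALAR;
}
```

A program would typically call pick_code_path() once at startup and store a function pointer to the chosen implementation, so the check is not repeated in the inner loop.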

Summary

In summary, GCC offers intrinsics that allow you to get more from your processor without the work of going all the way to assembly. We have covered the basic types and some of the vector math functions. When you use intrinsics, test thoroughly: check both speed and correctness against a scalar version of your code. Because each processor offers different features and runs them with different efficiency, this is a wide-open field. The more effort you put into it, the more you will get out.

References:

The GCC include files that map intrinsics to compiler built-ins (e.g. arm_neon.h) and the GCC info pages that explain those built-ins:

http://gcc.gnu.org/onlinedocs/gcc/Target-Builtins.html


http://ds9a.nl/gcc-simd/
http://softpixel.com/~cwright/programming/simd/index.php

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABCJFDG.html
http://www.arm.com/products/processors/technologies/neon.php
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0205j/BABGHIFH.html

http://www.tommesani.com/Docs.html
http://www.linuxjournal.com/article/7269

http://developer.apple.com/hardwaredrivers/ve/sse.html
http://en.wikipedia.org/wiki/Multiplication_algorithm#Shift_and_add
http://www.ibm.com/developerworks/power/library/pa-unrollav1/
http://en.wikipedia.org/wiki/MMX_(instruction_set)

Integrated Performance Primitives
http://software.intel.com/en-us/articles/intel-ipp/
http://software.intel.com/en-us/articles/non-commercial-software-download/

OpenMAX
http://www.khronos.org/developers/resources/openmax

Freescale AltiVec Libs for Linux
http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCNWALTVCLIB


AltiVec™ Technology Programming Interface Manual
http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html

Ian Ollmann's Altivec Tutorial
http://www-linux.gsi.de/~ikisel/reco/Systems/Altivec.pdf
http://arstechnica.com/civis/viewtopic.php?f=19&t=381165

RealView Compilation Tools Compiler Reference Guide (especially Appendix E)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348c/index.html

RealView Compilation Tools Assembler Guide (esp chapter 5)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/index.html

Intel C++ Intrinsics Reference

http://software.intel.com/sites/default/files/m/9/4/c/8/e/18072-347603.pdf

 
