GCC Compiler Intrinsics

戚晨

2023-12-01

http://www.linuxjournal.com/content/introduction-gcc-compiler-intrinsics-vector-processing?page=0,1

http://stackoverflow.com/questions/7156908/sse-intrinsic-functions-reference

Table 1. GCC Command-Line Options to Generate SIMD Code

Processor	Options
X86/MMX/SSE1/SSE2	-mfpmath=sse -mmmx -msse -msse2
ARM Neon	-mfpu=neon -mfloat-abi=softfp
Freescale Altivec	-maltivec -mabi=altivec

Here are the include files you need:

arm_neon.h - ARM Neon types & intrinsics
altivec.h - Freescale Altivec types & intrinsics
mmintrin.h - X86 MMX
xmmintrin.h - X86 SSE1
emmintrin.h - X86 SSE2
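
As a rough sketch of how the options in Table 1 and the headers above fit together: GCC predefines macros such as __SSE2__, __ARM_NEON__ and __ALTIVEC__ when the matching -m options are in effect, so one source file can pick its header at compile time. The names SIMD_NAME and simd_name() are hypothetical, introduced only for this illustration.

```c
#include <stdio.h>

/* Select the intrinsics header from the macros GCC predefines
 * for the command-line options listed in Table 1. */
#if defined(__SSE2__)
#  include <emmintrin.h>   /* SSE2 */
#  define SIMD_NAME "SSE2"
#elif defined(__SSE__)
#  include <xmmintrin.h>   /* SSE1 */
#  define SIMD_NAME "SSE1"
#elif defined(__MMX__)
#  include <mmintrin.h>    /* MMX */
#  define SIMD_NAME "MMX"
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#  include <arm_neon.h>    /* ARM Neon */
#  define SIMD_NAME "Neon"
#elif defined(__ALTIVEC__)
#  include <altivec.h>     /* Altivec */
#  define SIMD_NAME "Altivec"
#else
#  define SIMD_NAME "scalar"   /* no vector unit enabled at compile time */
#endif

/* Hypothetical helper: reports which code path was compiled in. */
static const char *simd_name(void)
{
    return SIMD_NAME;
}
```

Calling printf("%s\n", simd_name()) from main shows which branch the compiler took for a given set of options. This only reflects what was enabled at compile time, not what the processor running the program supports.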

X86: MMX, SSE, SSE2 Types and Debugging

X86-compatible processors with MMX, SSE1 and SSE2 provide the following types:

MMX: __m64 64 bits of integers broken down as eight 8-bit integers, four 16-bit shorts or two 32-bit integers.
SSE1: __m128 128 bits: four single-precision floats.
SSE2: __m128i 128 bits of packed integers of any size (8, 16, 32 or 64 bits), __m128d 128 bits: two doubles.
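
To make the three 128-bit types concrete, here is a minimal sketch that adds one vector of each kind. The helper names add4f, add8s and add2d are hypothetical; the intrinsics are from emmintrin.h, and a scalar loop stands in when SSE2 is not enabled. Storing a vector back to an ordinary array, as done here, is also a simple way to inspect its lanes with printf or a debugger.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: __m128 holds four single-precision floats. */
static void add4f(const float *a, const float *b, float *out)
{
#if defined(__SSE2__)
    __m128 va = _mm_loadu_ps(a);            /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
#else
    for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];
#endif
}

/* Hypothetical helper: __m128i used here as eight 16-bit integers. */
static void add8s(const short *a, const short *b, short *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    for (int i = 0; i < 8; i++) out[i] = (short)(a[i] + b[i]);
#endif
}

/* Hypothetical helper: __m128d holds two doubles. */
static void add2d(const double *a, const double *b, double *out)
{
#if defined(__SSE2__)
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
#else
    for (int i = 0; i < 2; i++) out[i] = a[i] + b[i];
#endif
}
```

The unaligned load and store variants are used so the sketch works with any array; aligned variants such as _mm_load_ps require 16-byte aligned addresses.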

Table 2. Subset of vector operators and intrinsics used in the examples.

Operation	Altivec	Neon	MMX/SSE/SSE2
loading	vec_ld	vld1q_f32	_mm_set_epi16
vector	vec_splat	vld1q_s16	_mm_set1_epi16
	vec_splat_s16	vsetq_lane_f32	_mm_set1_pi16
	vec_splat_s32	vld1_u8	_mm_set_pi16
	vec_splat_s8	vdupq_lane_s16	_mm_load_ps
	vec_splat_u16	vdupq_n_s16	_mm_set1_ps
	vec_splat_u32	vmovq_n_f32	_mm_loadh_pi
	vec_splat_u8	vset_lane_u8	_mm_loadl_pi
storing	vec_st	vst1_u8
vector		vst1q_s16	_mm_store_ps
		vst1q_f32
		vst1_s16
add	vec_madd	vaddq_s16	_mm_add_epi16
	vec_mladd	vaddq_f32	_mm_add_pi16
	vec_adds	vmlaq_n_f32	_mm_add_ps
subtract	vec_sub	vsubq_s16
multiply	vec_madd	vmulq_n_s16	_mm_mullo_epi16
	vec_mladd	vmulq_s16	_mm_mullo_pi16
		vmulq_f32	_mm_mul_ps
		vmlaq_n_f32
arithmetic	vec_sra	vshrq_n_s16	_mm_srai_epi16
shift	vec_srl		_mm_srai_pi16
	vec_sr
byte	vec_perm	vtbl1_u8	_mm_shuffle_pi16
permutation	vec_sel	vtbx1_u8	_mm_shuffle_ps
	vec_mergeh	vget_high_s16
	vec_mergel	vget_low_s16
		vdupq_lane_s16
		vdupq_n_s16
		vmovq_n_f32
		vbsl_u8
type	vec_cts	vmovl_u8	_mm_packs_pu16
conversion	vec_unpackh	vreinterpretq_s16_u16
	vec_unpackl	vcvtq_u32_f32
	vec_cts	vqmovn_s32	_mm_cvtps_pi16
	vec_ctu	vqmovun_s16	_mm_packus_epi16
		vqmovn_u16
		vcvtq_f32_s32
		vmovl_s16
		vmovq_n_f32
vector	vec_pack	vcombine_u16
combination	vec_packsu	vcombine_u8
		vcombine_s16
maximum			_mm_max_ps
minimum			_mm_min_ps
vector			_mm_andnot_ps
logic			_mm_and_ps
			_mm_or_ps
rounding	vec_trunc
misc			_mm_empty
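
As an illustration of how a few entries from the MMX/SSE/SSE2 column of Table 2 combine, the sketch below scales eight 16-bit values by a fixed-point factor with 4 fractional bits, using _mm_set1_epi16, _mm_mullo_epi16 and _mm_srai_epi16. The function name scale_q4 and the fixed-point format are assumptions made for this example, and _mm_loadu_si128 and _mm_storeu_si128 are used for memory access although they are not listed in the table.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[i] = (a[i] * k) >> 4 for eight 16-bit lanes.
 * k is a fixed-point factor with 4 fractional bits (24 means 1.5).
 * Only the low 16 bits of each product are kept, as _mm_mullo_epi16 does. */
static void scale_q4(const short *a, short k, short *out)
{
#if defined(__SSE2__)
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vk   = _mm_set1_epi16(k);            /* broadcast k to all lanes */
    __m128i prod = _mm_mullo_epi16(va, vk);      /* low 16 bits of a[i] * k  */
    __m128i res  = _mm_srai_epi16(prod, 4);      /* arithmetic shift right   */
    _mm_storeu_si128((__m128i *)out, res);
#else
    /* Scalar fallback; relies on GCC's wrapping conversion to short and
     * arithmetic right shift of negative values. */
    for (int i = 0; i < 8; i++) {
        short prod = (short)(a[i] * k);
        out[i] = (short)(prod >> 4);
    }
#endif
}
```

With k = 24 the input 100 becomes 150 and -2 becomes -3, provided each product fits in 16 bits.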

Check Processor at Runtime

Next, your code should check at runtime whether the processor supports the vector instructions you compiled for. If you have no vector code path for that processor, fall back to your scalar code. If you do have one, and it is faster, use it. On X86, test processor features with the cpuid instruction via <cpuid.h>. (You saw examples of that in samples/simple/x86/*c.) We couldn't find anything as well established for Altivec and Neon, so the examples there parse /proc/cpuinfo. (Serious code might instead execute a test SIMD instruction: if the processor raises SIGILL when it encounters that instruction, the feature is not available.)
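
The two checks described above might look roughly like this. It is a sketch, not the code from the article's samples: cpu_has_sse2, line_has_flag and cpuinfo_has_flag are hypothetical names, and the /proc/cpuinfo scan assumes the feature appears as a whitespace-separated word such as "neon" somewhere in the file. __get_cpuid and bit_SSE2 come from GCC's <cpuid.h>.

```c
#include <stdio.h>
#include <string.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Returns 1 if the CPU reports SSE2 via cpuid leaf 1, otherwise 0.
 * On non-X86 targets this always returns 0. */
static int cpu_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                        /* leaf 1 not supported */
    return (edx & bit_SSE2) != 0;
#else
    return 0;
#endif
}

/* Returns 1 if `flag` occurs in `line` as a whole word. */
static int line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;
    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int start_ok = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[n];
        int end_ok = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (start_ok && end_ok)
            return 1;
        p += 1;                          /* keep searching after this hit */
    }
    return 0;
}

/* Scans a cpuinfo-style file for `flag`; returns 0 if it cannot be opened.
 * Lines longer than the buffer are read in pieces, so a flag that straddles
 * a piece boundary would be missed in this sketch. */
static int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[4096];
    int found = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, f) != NULL)
        found = line_has_flag(line, flag);
    fclose(f);
    return found;
}
```

A program would typically call cpu_has_sse2() on X86, or cpuinfo_has_flag("/proc/cpuinfo", "neon") on ARM Linux, once at startup and store the result, for example in a function pointer that selects the vector or scalar routine.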

Summary

In summary, GCC offers intrinsics that let you get more from your processor without going all the way to assembly. We have covered the basic types and some of the vector math functions. When you use intrinsics, test thoroughly, for both speed and correctness, against a scalar version of your code. Processors differ in which features they offer and how well those features perform, so this is a wide-open field. The more effort you put into it, the more you will get out.

References:

The GCC include files that map intrinsics to compiler built-ins (e.g. arm_neon.h) and the GCC info pages that explain those built-ins:

http://gcc.gnu.org/onlinedocs/gcc/Target-Builtins.html

http://ds9a.nl/gcc-simd/
http://softpixel.com/~cwright/programming/simd/index.php

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/BABCJFDG.html
http://www.arm.com/products/processors/technologies/neon.php
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s04s02.html
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0205j/BABGHIFH.html

http://www.tommesani.com/Docs.html
http://www.linuxjournal.com/article/7269

http://developer.apple.com/hardwaredrivers/ve/sse.html
http://en.wikipedia.org/wiki/Multiplication_algorithm#Shift_and_add
http://www.ibm.com/developerworks/power/library/pa-unrollav1/
http://en.wikipedia.org/wiki/MMX_(instruction_set)

Integrated Performance Primitives
http://software.intel.com/en-us/articles/intel-ipp/
http://software.intel.com/en-us/articles/non-commercial-software-download/

OpenMAX
http://www.khronos.org/developers/resources/openmax

Freescale AltiVec Libs for Linux
http://www.freescale.com/webapp/sps/site/overview.jsp?code=DRPPCNWALTVCLIB

AltiVec Technology Programming Interface Manual
http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

http://developer.apple.com/hardwaredrivers/ve/instruction_crossref.html

Ian Ollmann's Altivec Tutorial
http://www-linux.gsi.de/~ikisel/reco/Systems/Altivec.pdf
http://arstechnica.com/civis/viewtopic.php?f=19&t=381165

RealView Compilation Tools Compiler Reference Guide (especially Appendix E)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348c/index.html

RealView Compilation Tools Assembler Guide (esp chapter 5)
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/index.html

Intel C++ Intrinsics Reference

http://software.intel.com/sites/default/files/m/9/4/c/8/e/18072-347603.pdf
