BCB6运行SSE2指令

BCB6是早期的产物，有着快速优美的开发体验，可比喻为天堂般的享受。随着岁月的流失，BCB6的光芒逐渐暗淡，但光明依然存在，当运行SSE2指令后，有喜也有忧，喜的是可以运行加速指令，提高运算速度，忧的是SSE2后的指令再也运行不动了，这优美的软件，今天到了速度的边界，天堂的光芒在此消失，也许是千古绝唱。

为什么用SSE2指令？是为了加速，比如几秒放一帧图像无法实用。为什么用浮点数？绘图、DCT（离散余弦变换）等均需要浮点运算。为什么用双精度浮点数？曾用单精度浮点画一斜线，结果画的太斜了，封不上口，换用双精度浮点数当即完美。整形数CPU已经可以很好处理了，直到SSE2扩展指令才支持双精度浮点数，所以倾向于双精度浮点数加速。

所谓“指令”就是汇编，一种低级语言，执行速度最快。执行SSE2指令需要asm嵌入汇编，SSE2支持双精度浮点和整形数运算，一次可以执行2个双精度浮点数运算，称为SIMD（单指令多数据）。如果满载存取，需要16字节地址对齐，如果一次存取一个数据，则不需要。

变量定义：变量存在于内存中，因为一次可以执行2个双精度浮点数，所以需要定义一段128位的内存，一次存取16个字节，所以需要16个字节对齐。

__m128d：表示内存中128位的两个双精度浮点数。采用结构方式，定义如下：

#pragma pack(push)//保存对齐状态
#pragma pack(16)//设定为16字节对齐
struct __m128d //内存128双精度变量结构
{
  double m128_f64[2];//2个双精度浮点数
};
#pragma pack(pop)//恢复对齐状态

SSE2有144条指令，分为数据移动，数据运算，逻辑判断，数值比较等。

使用：以asm嵌入，装载数据，数据运算，传出数据过程。

asm
{
   1. 载入数据（装载Load）
   2. 运算数据
   3. 传出数据（存储Store）
}

已做好的函数（汇编）：

double double_Mul(double fD1, double fD2, double fD3)//乘
{
  double fRet;

  asm 
  {
    movsd XMM0, fD1  // XMM0 <- fD1
    mulsd XMM0, fD2  // fD1 = fD1 * fD2
    mulsd XMM0, fD3  // fD1 = fD1 * fD3
    movsd fRet, XMM0 // fRet <- XMM0
  }
  return fRet;
}

double double_Mul(double fD1, double fD2)//乘
{
  double fRet;

  asm 
  {
    movsd XMM0, fD1  // XMM0 <- fD1
    mulsd XMM0, fD2  // XMM0 = XMM0 * fD2
    movsd fRet, XMM0 // fRet <- XMM0
  }
  return fRet;
}

注：以上为我自学心得，仅供参考，部分自定义。在BCB6上应用，需自我量身，自我定做。

附录：SSE2 — OpCode List

(under construction -- this list might be incomplete?!)
Additionally, with AMD64's 64/128 bit register extensions some of the functionality changes...

Arithmetic:
addpd - Adds 2 64bit doubles.
addsd - Adds bottom 64bit doubles.
subpd - Subtracts 2 64bit doubles.
subsd - Subtracts bottom 64bit doubles.
mulpd - Multiplies 2 64bit doubles.
mulsd - Multiplies bottom 64bit doubles.
divpd - Divides 2 64bit doubles.
divsd - Divides bottom 64bit doubles.
maxpd - Gets largest of 2 64bit doubles for 2 sets.
maxsd - Gets largets of 2 64bit doubles to bottom set.
minpd - Gets smallest of 2 64bit doubles for 2 sets.
minsd - Gets smallest of 2 64bit values for bottom set.
paddb - Adds 16 8bit integers.
paddw - Adds 8 16bit integers.
paddd - Adds 4 32bit integers.
paddq - Adds 2 64bit integers.
paddsb - Adds 16 8bit integers with saturation.
paddsw - Adds 8 16bit integers using saturation.
paddusb - Adds 16 8bit unsigned integers using saturation.
paddusw - Adds 8 16bit unsigned integers using saturation.
psubb - Subtracts 16 8bit integers.
psubw - Subtracts 8 16bit integers.
psubd - Subtracts 4 32bit integers.
psubq - Subtracts 2 64bit integers.
psubsb - Subtracts 16 8bit integers using saturation.
psubsw - Subtracts 8 16bit integers using saturation.
psubusb - Subtracts 16 8bit unsigned integers using saturation.
psubusw - Subtracts 8 16bit unsigned integers using saturation.
pmaddwd - Multiplies 16bit integers into 32bit results and adds results.
pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
rcpps - Approximates the reciprocal of 4 32bit singles.
rcpss - Approximates the reciprocal of bottom 32bit single.
sqrtpd - Returns square root of 2 64bit doubles.
sqrtsd - Returns square root of bottom 64bit double.

Logic:
andnpd - Logically NOT ANDs 2 64bit doubles.
andnps - Logically NOT ANDs 4 32bit singles.
andpd - Logically ANDs 2 64bit doubles.
pand - Logically ANDs 2 128bit registers.
pandn - Logically Inverts the first 128bit operand and ANDs with the second.
por - Logically ORs 2 128bit registers.
pslldq - Logically left shifts 1 128bit value.
psllq - Logically left shifts 2 64bit values.
pslld - Logically left shifts 4 32bit values.
psllw - Logically left shifts 8 16bit values.
psrad - Arithmetically right shifts 4 32bit values.
psraw - Arithmetically right shifts 8 16bit values.
psrldq - Logically right shifts 1 128bit values.
psrlq - Logically right shifts 2 64bit values.
psrld - Logically right shifts 4 32bit values.
psrlw - Logically right shifts 8 16bit values.
pxor - Logically XORs 2 128bit registers.
orpd - Logically ORs 2 64bit doubles.
xorpd - Logically XORs 2 64bit doubles.

Compare:
cmppd - Compares 2 pairs of 64bit doubles.
cmpsd - Compares bottom 64bit doubles.
comisd - Compares bottom 64bit doubles and stores result in EFLAGS.
ucomisd - Compares bottom 64bit doubles and stores result in EFLAGS. (QNaNs don't throw exceptions with ucomisd, unlike comisd.
pcmpxxb - Compares 16 8bit integers.
pcmpxxw - Compares 8 16bit integers.
pcmpxxd - Compares 4 32bit integers.
Compare Codes (the xx parts above):
eq - Equal to.
lt - Less than.
le - Less than or equal to.
ne - Not equal.
nlt - Not less than.
nle - Not less than or equal to.
ord - Ordered.
unord - Unordered.

Conversion:
cvtdq2pd - Converts 2 32bit integers into 2 64bit doubles.
cvtdq2ps - Converts 4 32bit integers into 4 32bit singles.
cvtpd2pi - Converts 2 64bit doubles into 2 32bit integers in an MMX register.
cvtpd2dq - Converts 2 64bit doubles into 2 32bit integers in the bottom of an XMM register.
cvtpd2ps - Converts 2 64bit doubles into 2 32bit singles in the bottom of an XMM register.
cvtpi2pd - Converts 2 32bit integers into 2 32bit singles in the bottom of an XMM register.
cvtps2dq - Converts 4 32bit singles into 4 32bit integers.
cvtps2pd - Converts 2 32bit singles into 2 64bit doubles.
cvtsd2si - Converts 1 64bit double to a 32bit integer in a GPR.
cvtsd2ss - Converts bottom 64bit double to a bottom 32bit single. Tops are unchanged.
cvtsi2sd - Converts a 32bit integer to the bottom 64bit double.
cvtsi2ss - Converts a 32bit integer to the bottom 32bit single.
cvtss2sd - Converts bottom 32bit single to bottom 64bit double.
cvtss2si - Converts bottom 32bit single to a 32bit integer in a GPR.
cvttpd2pi - Converts 2 64bit doubles to 2 32bit integers using truncation into an MMX register.
cvttpd2dq - Converts 2 64bit doubles to 2 32bit integers using truncation.
cvttps2dq - Converts 4 32bit singles to 4 32bit integers using truncation.
cvttps2pi - Converts 2 32bit singles to 2 32bit integers using truncation into an MMX register.
cvttsd2si - Converts a 64bit double to a 32bit integer using truncation into a GPR.
cvttss2si - Converts a 32bit single to a 32bit integer using truncation into a GPR.

Load/Store:
(is "minimize cache pollution" the same as "without using cache"??)
movq - Moves a 64bit value, clearing the top 64bits of an XMM register.
movsd - Moves a 64bit double, leaving tops unchanged if move is between two XMMregisters.
movapd - Moves 2 aligned 64bit doubles.
movupd - Moves 2 unaligned 64bit doubles.
movhpd - Moves top 64bit value to or from an XMM register.
movlpd - Moves bottom 64bit value to or from an XMM register.
movdq2q - Moves bottom 64bit value into an MMX register.
movq2dq - Moves an MMX register value to the bottom of an XMM register. Top is cleared to zero.
movntpd - Moves a 128bit value to memory without using the cache. NT is "Non Temporal."
movntdq - Moves a 128bit value to memory without using the cache.
movnti - Moves a 32bit value without using the cache.
maskmovdqu - Moves 16 bytes based on sign bits of another XMM register.
pmovmskb - Generates a 16bit Mask from the sign bits of each byte in an XMM register.

Shuffling:
pshufd - Shuffles 32bit values in a complex way.
pshufhw - Shuffles high 16bit values in a complex way.
pshuflw - Shuffles low 16bit values in a complex way.
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaces top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaces bottom 64bit integers from 2 128bit sources into 1.
packssdw - Packs 32bit integers to 16bit integers using saturation.
packsswb - Packs 16bit integers to 8bit integers using saturation.
packuswb - Packs 16bit integers to 8bit unsigned integers unsing saturation.

Cache Control:
clflush - Flushes a Cache Line from all levels of cache.
lfence - Guarantees that all memory loads issued before the lfence instruction are completed before anyloads after the lfence instruction.
mfence - Guarantees that all memory reads and writes issued before the mfence instruction are completed before any reads or writes after the mfence instruction.
pause - Pauses execution for a set amount of time.

附录来源：http://softpixel.com/~cwright/programming/simd/sse2.php

猜你喜欢