Efficient implementation of leveled FHE & advanced algorithms compatible with leveled FHE

References:

  1. [CS05] Choi Y, Swartzlander E E. Parallel prefix adder design with matrix representation[C]//17th IEEE Symposium on Computer Arithmetic (ARITH’05). IEEE, 2005: 90-98.
  2. [SV11] Smart N P, Vercauteren F. Fully homomorphic SIMD operations[J]. Designs, codes and cryptography, 2014, 71: 57-81.
  3. [GHS12] Gentry C, Halevi S, Smart N P. Fully homomorphic encryption with polylog overhead[C]//Annual International Conference on the Theory and Applications of Cryptographic Techniques. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012: 465-482.
  4. [GHS12] Gentry C, Halevi S, Smart N P. Homomorphic evaluation of the AES circuit[C]//Annual Cryptology Conference. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012: 850-867.
  5. [HS13] Halevi S, Shoup V. Design and implementation of a homomorphic-encryption library[J]. IBM Research (Manuscript), 2013, 6(12-15): 8-36.
  6. [ZQH+20] Zhang N, Qin Q, Hou Z, et al. Efficient comparison and addition for FHE with weighted computational complexity model[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 40(9): 1896-1908.

Carry-Lookahead Adder

Given two $n$-bit integers $A=a_{n-1}\cdots a_1 a_0$ and $B=b_{n-1}\cdots b_1 b_0$, there are a variety of adders to choose from for computing their sum. We use the following logic symbols: AND gate $\cdot$, OR gate $+$, XOR gate $\oplus$, where $\cdot$ has the highest precedence.

A Full Adder (FA) takes inputs $a,b,cin \in \{0,1\}$ and produces outputs $sout,cout \in \{0,1\}$, with the logic expressions:
$$\begin{aligned} sout &= a \oplus b \oplus cin\\ cout &= a \cdot b+a \cdot cin + b \cdot cin \end{aligned}$$

Define the signals $g=a \cdot b$ (generate), $k=a+b$ (labelled kill here; strictly it is the complement of the usual kill signal $\bar a \cdot \bar b$), and $p=a \oplus b$ (propagate). Then
$$\begin{aligned} sout &= p \oplus cin\\ cout &= g+k \cdot cin \end{aligned}$$
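
The equivalence of the two forms can be checked exhaustively. Below is a minimal Python sketch; the function names `full_adder` and `full_adder_gkp` are illustrative and do not come from any library.

```python
from itertools import product

def full_adder(a, b, cin):
    """Textbook full adder: returns (sout, cout)."""
    sout = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return sout, cout

def full_adder_gkp(a, b, cin):
    """Same adder written with the g, k, p signals defined above."""
    g, k, p = a & b, a | b, a ^ b
    return p ^ cin, g | (k & cin)

# Exhaustive check over all 8 input combinations.
assert all(full_adder(*x) == full_adder_gkp(*x)
           for x in product((0, 1), repeat=3))
```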

Ripple-Carry Adder

Ripple-carry adder (RCA): connect $n$ full adders in series, take $A,B,c_0$ as input, and compute iteratively:
$$\begin{aligned} g_i &= a_i \cdot b_i\\ k_i &= a_i + b_i\\ p_i &= a_i \oplus b_i\\ s_i &= p_i \oplus c_i\\ c_{i+1} &= g_i+k_i \cdot c_i \end{aligned}$$

The critical path is the carry chain $c_1,\cdots,c_{n-1},c_n$, so the number of iteration rounds is $O(n)$.
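
The iteration can be sketched in plain Python on cleartext bits. The little-endian bit lists and the helper names `to_bits` and `from_bits` are assumptions made for this illustration.

```python
def ripple_carry_add(a_bits, b_bits, c0=0):
    """Add two little-endian bit lists; returns (sum_bits, carry_out)."""
    c = c0
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        g, k, p = a & b, a | b, a ^ b
        s_bits.append(p ^ c)   # s_i = p_i XOR c_i
        c = g | (k & c)        # c_{i+1} = g_i + k_i * c_i
    return s_bits, c

def to_bits(x, n):
    """Little-endian n-bit expansion of a non-negative integer."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))
```

Each loop iteration depends on the carry from the previous one, which is exactly the $O(n)$ sequential chain described above.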

Carry-Lookahead Adder

Carry-lookahead adder (CLA): repeatedly expanding the carry chain of the RCA gives
$$\begin{aligned} c_{i+1} &= g_i+k_i \cdot (g_{i-1}+k_{i-1} \cdot (g_{i-2} + \cdots))\\ &= g_i + \sum_{j=0}^{i-1} \left(\prod_{l=j+1}^i k_l\right)g_j + c_0\prod_{l=0}^i k_l \end{aligned}$$

So the carry signal $c_{i+1}$ can be computed directly from the $(g_i,k_i)$ signals, without waiting for $c_i$. The critical path length is only $O(\log n)$, but the fan-in and fan-out are extremely high. To balance computational delay against circuit complexity, one can use blocked or hierarchical carry lookahead (see the Zhihu article "[Hardware Arithmetic Design] Hardware Adder Principle | Chapter 2: Carry Lookahead Adder, Blocked and Hierarchical Carry Lookahead Adder").
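
As a sanity check, the expanded formula can be compared with the ripple recurrence on all small inputs. This is a sketch on cleartext bits; `cla_carry` and `ripple_carry` are hypothetical helper names, and the nested loops make no attempt to model the $O(\log n)$ circuit depth.

```python
def cla_carry(g, k, c0, i):
    """c_{i+1} from the expanded formula, using only g, k and c0."""
    def k_prod(lo, hi):
        # AND of k_lo, ..., k_hi (inclusive)
        out = 1
        for l in range(lo, hi + 1):
            out &= k[l]
        return out

    c = g[i]
    for j in range(i):
        c |= k_prod(j + 1, i) & g[j]
    return c | (c0 & k_prod(0, i))

def ripple_carry(g, k, c0, i):
    """c_{i+1} from the recurrence c_{t+1} = g_t + k_t * c_t."""
    c = c0
    for t in range(i + 1):
        c = g[t] | (k[t] & c)
    return c
```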

Parallel-Prefix Adder

Parallel-prefix adder (PPA): to understand CLA better, the process of merging the $(g_i,k_i)$ signals can be described in algebraic language. We define a binary operation $\circ$ on ordered pairs $(g,k)$ as follows:
$$(g_j,\,\, k_j) \circ (g_i,\,\, k_i) := (g_j+k_jg_i,\,\, k_ik_j)$$

It is easy to verify that $\circ$ is associative and idempotent, but not commutative.
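
These three properties can be confirmed by brute force over all bit pairs. A minimal sketch, where `op` is an illustrative name for $\circ$:

```python
from itertools import product

def op(hi, lo):
    """(g_j, k_j) o (g_i, k_i) = (g_j + k_j g_i, k_i k_j) on bits."""
    (gj, kj), (gi, ki) = hi, lo
    return (gj | (kj & gi), ki & kj)

PAIRS = list(product((0, 1), repeat=2))

associative = all(op(op(x, y), z) == op(x, op(y, z))
                  for x in PAIRS for y in PAIRS for z in PAIRS)
idempotent = all(op(x, x) == x for x in PAIRS)
commutative = all(op(x, y) == op(y, x) for x in PAIRS for y in PAIRS)
```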

Then the logic expression of the CLA can be rewritten as:
$$\begin{aligned} (c_{i+1}, \,\, k_0k_1\cdots k_i) &= (g_i+k_i \cdot (g_{i-1}+k_{i-1} \cdot (g_{i-2} + \cdots)),\,\, k_i\cdots k_1 k_0)\\ &= (g_i,k_i) \circ (g_{i-1},k_{i-1}) \circ \cdots \circ (g_0,k_0) \circ (c_0,1) \end{aligned}$$

Every signal $(g_i,k_i)$, as well as $(c_0,1)$, is available immediately. By the associativity of the binary operation, one can design a family of parallel merging networks for the $(g,k)$ signals. Knowles proposed a class of depth-optimal parallel prefix adders.

[Figure: circuit structure of the Knowles parallel prefix adders]

The two extremes of the Knowles family are two well-known PPAs: the Kogge-Stone adder and the Ladner-Fischer adder. The LF structure requires unbounded fan-out (which suits FHE computation), while the KS structure requires a very large number of parallel operators (multiplications). The Brent-Kung adder balances fan-out against the number of computing units by increasing the number of levels (multiplicative depth), and there are also hybrid schemes that mix the KS, LF, and BK structures.

We use parallel-prefix graphs to simplify the representation of a PPA: each black dot represents one $\circ$ operation (two multiplications, one addition), and the computation proceeds from bottom to top.

[Figure: parallel-prefix graphs of Knowles adders with different fan-outs]

The most suitable PPA for FHE computation is the Ladner-Fischer adder: a software implementation does not care about fan-out, only about the multiplicative depth and the number of multiplications.
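
To make the depth and operator counts concrete, here is a sketch of a divide-and-conquer, minimum-depth prefix network with unbounded fan-out, run on cleartext bits. It is assumed here to correspond to the Ladner-Fischer end of the Knowles family; the function name `prefix_add` and the returned counters are illustrative only.

```python
def op(hi, lo):
    """(g1, k1) o (g0, k0) = (g1 + k1 g0, k1 k0) on bits."""
    (g1, k1), (g0, k0) = hi, lo
    return (g1 | (k1 & g0), k1 & k0)

def prefix_add(a_bits, b_bits, c0=0):
    """Little-endian addition; returns (sum_bits, carry_out, depth, ops)."""
    n = len(a_bits)
    # sig[0] is (c0, 1); sig[i + 1] is (g_i, k_i).
    sig = [(c0, 1)] + [(a & b, a | b) for a, b in zip(a_bits, b_bits)]
    span, depth, ops = 1, 0, 0
    while span < len(sig):
        for i in range(len(sig)):
            if i & span:
                # i lies in the upper half of its block of size 2 * span;
                # merge it with the last node of the lower half.
                j = (i // span) * span - 1
                sig[i] = op(sig[i], sig[j])
                ops += 1
        span *= 2
        depth += 1
    carries = [g for g, _ in sig]  # carries[i] = c_i
    s_bits = [(a ^ b) ^ carries[i]
              for i, (a, b) in enumerate(zip(a_bits, b_bits))]
    return s_bits, carries[n], depth, ops
```

The node at the end of each lower half is read by every node in the upper half, which is the high fan-out that hardware dislikes but a software FHE evaluation can ignore.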

HE library

Double-CRT

[HS13] designed HElib, a C++ library implementing RLWE-based BGV on top of NTL, using the SIMD technique of [SV11] and the permuting/routing technique of [GHS12].

For further acceleration, [HS13] represents polynomials in the ring $\mathbb Z_q[x]/\phi_m(x)$ in double-CRT form. BGV uses a chain of moduli $q_l = \prod_{i=0}^l p_i$, where the $p_i$ are primes of roughly equal size satisfying $m \mid p_i-1$, so that $\mathbb Z_{p_i}$ contains a primitive $m$-th root of unity $\zeta_i$. Hence
$$\phi_m(x)=\prod_{j \in \mathbb Z_m^*}(x-\zeta_i^j) \pmod{p_i}$$

First apply the CRT decomposition to the modulus $q_l$. Writing $R = \mathbb Z[x]/\phi_m(x)$ and $R_{p}=R/(pR)$,
$$\mathbb Z_{q_l}[x]/\phi_m(x) \cong R/(q_lR) \cong R/(p_0R) \times R/(p_1R) \times \cdots \times R/(p_lR)$$

Then, for each small ring $R_{p_i} \cong \mathbb Z_{p_i}[x]/\phi_m(x)$, apply the CRT decomposition again:
$$\mathbb Z_{p_i}[x]/\phi_m(x) \cong \mathbb Z_{p_i}[x]/(x-\zeta_i^{j_1}) \times \cdots \times \mathbb Z_{p_i}[x]/(x-\zeta_i^{j_{\phi(m)}}),\,\, j_t \in \mathbb Z_m^*$$

Assume $m=2^k$, so that $\phi_m(x)=x^{m/2}+1$; then $\mathbb Z_m^*=\{1,3,5,\cdots,m-1\}$, and FFT/NTT can be used for fast computation. For general $m$ (mixed radix), the elements of $\mathbb Z_m^*$ are irregularly distributed, and converting between the coefficient representation and double-CRT is more complicated.

To sum up, a ring element $a \in \mathbb Z_{q_l}[x]/\phi_m(x)$ at the level with modulus $q_l$ can be represented as a matrix of shape $(l+1) \times \phi(m)$: row $i$ is the NTT-domain representation of $a(x) \pmod{p_i}$, and the entry in row $i$, column $j$ is $a(\zeta_i^j) \pmod{p_i}$.
$$\begin{aligned} DoubleCRT^l(a) &= \{a(x) \pmod{p_i}\pmod{x-\zeta_i^j}\}_{0\le i\le l,\,\, j\in \mathbb Z_m^*}\\ &= \{a(\zeta_i^j) \pmod{p_i}\}_{0\le i\le l,\,\, j\in \mathbb Z_m^*} \end{aligned}$$

It is easy to verify that addition and multiplication of ring elements become component-wise addition and multiplication of the matrices:
$$\begin{aligned} DoubleCRT^l(a+b) &= DoubleCRT^l(a) + DoubleCRT^l(b)\\ DoubleCRT^l(a \cdot b) &= DoubleCRT^l(a) \cdot DoubleCRT^l(b) \end{aligned}$$
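
A toy example makes the matrix picture concrete. The sketch below assumes $m=8$ (so $\phi_m(x)=x^4+1$) and the two small primes 17 and 41, both congruent to 1 modulo 8. These parameters are chosen only for illustration, and the evaluation is done naively rather than with an NTT. It is unrelated to HElib's actual `DoubleCRT` class.

```python
M = 8                 # m = 2^k, so phi_m(x) = x^4 + 1
N = M // 2            # phi(m) = 4 columns per row
PRIMES = [17, 41]     # toy chain p_0, p_1 with m | p_i - 1
UNITS = [1, 3, 5, 7]  # Z_m^*

def primitive_root(p):
    """Smallest zeta with zeta^(m/2) = -1 mod p, hence of order exactly m."""
    return next(z for z in range(2, p) if pow(z, M // 2, p) == p - 1)

def double_crt(a):
    """Rows indexed by primes, columns by j in Z_m^*: entry a(zeta_i^j) mod p_i."""
    rows = []
    for p in PRIMES:
        zeta = primitive_root(p)
        rows.append([sum(c * pow(zeta, j * e, p) for e, c in enumerate(a)) % p
                     for j in UNITS])
    return rows

def negacyclic_mul(a, b):
    """Product of two coefficient lists modulo x^N + 1 (over the integers)."""
    out = [0] * N
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < N:
                out[i + j] += x * y
            else:
                out[i + j - N] -= x * y  # x^N = -1
    return out

def componentwise(A, B, f):
    """Apply f entry by entry, reducing row i modulo PRIMES[i]."""
    return [[f(x, y) % p for x, y in zip(ra, rb)]
            for ra, rb, p in zip(A, B, PRIMES)]
```

With these helpers, `double_crt(negacyclic_mul(a, b))` should equal the entry-wise product of `double_crt(a)` and `double_crt(b)`, which is the second identity above.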

The automorphisms in the Galois group $\mathrm{Gal}(\mathbb Q(\zeta)/\mathbb Q) \cong \mathbb Z_m^*$ are the maps
$$\kappa_k(a(x)):=a(x^k) \pmod{\phi_m(x)},\,\, \forall k \in \mathbb Z_m^*$$

Because $m \mid p_i-1$, the maps in $\mathrm{Gal}(\mathbb Z_{p_i}(\zeta)/\mathbb Z_{p_i})$ act as permutations of the slots (the Frobenius map $\kappa_1(\zeta)=\zeta^{p_i}=\zeta$ is the identity). Any other map $\kappa_k:\zeta \mapsto \zeta^{k}$ simply permutes the columns of the matrix $DoubleCRT^l(a)$: since $\kappa_k(a)(\zeta_i^j)=a(\zeta_i^{jk})$, column $j$ of $DoubleCRT^l(\kappa_k(a))$ is column $jk \pmod m$ of $DoubleCRT^l(a)$. So routing is easy to implement on the double-CRT representation.
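
The column permutation can be observed on a single toy row. The sketch assumes $m=8$, $p=17$ and $\zeta=2$ (a primitive 8th root of unity modulo 17); the names `evals` and `automorphism` are illustrative.

```python
M, N, P, ZETA = 8, 4, 17, 2   # toy parameters: 2^4 = -1 (mod 17)
UNITS = [1, 3, 5, 7]          # Z_m^*

def evals(a):
    """One double-CRT row as a dict: column j -> a(zeta^j) mod p."""
    return {j: sum(c * pow(ZETA, j * e, P) for e, c in enumerate(a)) % P
            for j in UNITS}

def automorphism(a, k):
    """Coefficients of a(x^k) mod x^N + 1."""
    out = [0] * N
    for e, c in enumerate(a):
        t = (e * k) % M       # x^M = 1 in this ring
        if t < N:
            out[t] += c
        else:
            out[t - N] -= c   # x^N = -1
    return out

def column_check(a, k):
    """True if column j of kappa_k(a) equals column jk mod m of a."""
    before, after = evals(a), evals(automorphism(a, k))
    return all(after[j] == before[(j * k) % M] for j in UNITS)
```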

New Key Switching in CRT

Key switching is essentially a homomorphic multiplication by $W[s_1 \to s_2] = Enc_{s_2}(s_1; e)$:
$$\begin{aligned} c_2 &= c_1W=Enc_{s_2}(c_1s_1; c_1e)\\ c_2s_2 &= c_1Ws_2 = c_1s_1 \end{aligned}$$

However, the norm of the ciphertext $c_1$, used here as a scalar, is too large relative to the modulus $q_l$, so the noise term after the homomorphic operation would destroy the plaintext.

In BGV, a binary decomposition is used to control the noise term of key switching: given a ciphertext $c'$ under $s'$, write $c'=\sum_i 2^i c_i'$ and set
$$c^*=c_0'|c_1'|c_2'|\cdots,\,\, s^*=s'|2s'|4s'|\cdots$$

Then compute the matrix $W=W[s^* \to s]$, which satisfies $s^*=W \cdot s$, and compute the ciphertext $c=c^* \cdot W$, so that
$$\langle c', s' \rangle = \langle c^*, s^* \rangle = \langle c, s \rangle$$
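
The first equality, $\langle c', s' \rangle = \langle c^*, s^* \rangle$, is just bit decomposition paired with powers of two. The sketch below checks it on vectors of integers modulo a small $q$; treating ciphertext entries as scalars instead of ring elements is a simplifying assumption, and the helper names are illustrative.

```python
def bit_decomp(c, q):
    """Blocks c_0, c_1, ... of 0/1 vectors with c = sum_i 2^i c_i."""
    nbits = q.bit_length()
    return [[(x >> i) & 1 for x in c] for i in range(nbits)]

def powers_of_2(s, q):
    """Blocks s, 2s, 4s, ... reduced modulo q."""
    nbits = q.bit_length()
    return [[(x << i) % q for x in s] for i in range(nbits)]

def inner(u, v, q):
    """Inner product of two vectors modulo q."""
    return sum(x * y for x, y in zip(u, v)) % q

def inner_blocks(ublocks, vblocks, q):
    """Inner product of the concatenated block vectors modulo q."""
    return sum(inner(u, v, q) for u, v in zip(ublocks, vblocks)) % q
```

Every entry of `bit_decomp(c, q)` is 0 or 1, which is why multiplying it into the noisy matrix $W$ keeps the added noise small.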

To compute the binary decomposition of $c'$, it must first be converted from double-CRT back to the coefficient representation. Then each of the $\log q_l$ components $c_i'$ is converted to double-CRT again in order to carry out the inner product. This costs $O(l \log q_l)$ FFT/NTT operations of length $n=\phi(m)$.

[GHS12] uses a different noise control technique: temporarily boosting the modulus. With the LSB encoding, suppose $\langle c',s' \rangle=2e'+a \pmod q$; then for any odd $p$,
$$\langle c',ps' \rangle = 2pe'+pa = 2e''+a \pmod{pq}$$

Here $e'' = pe'+a(p-1)/2$ is roughly $p$ times the original noise $e'$. A single modulus switch back to modulus $q$ then brings the noise back to roughly the magnitude of $e'$. Therefore we choose a sufficiently large odd $p \approx q\sqrt m$ and use an encryption of the transformation $W=W[ps' \to s] \pmod{pq}$ as the key-switching hint.
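
The identity $2pe'+pa = 2e''+a$ and the survival of the message bit can be checked with plain integers. The concrete values of $q$, $p$, $e'$ and $a$ used in the usage note are arbitrary illustrative choices, and the function names are hypothetical.

```python
def centered(x, q):
    """Representative of x mod q in the interval [-q/2, q/2)."""
    r = x % q
    return r - q if r > q // 2 else r

def boost(inner_product, p, q):
    """Centered value of <c', p s'> modulo pq, given the integer <c', s'>."""
    return centered(p * inner_product, p * q)

def boosted_noise(e, a, p):
    """e'' = p e' + a (p - 1) / 2 for odd p."""
    return p * e + a * (p - 1) // 2
```

For example, with `q = 10007`, `p = 100003`, noise `e = 12` and message bit `a = 1`, the integer inner product `2 * e + a - 3 * q` is mapped by `boost` to `2 * boosted_noise(e, a, p) + a`, whose parity is still `a`.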

Although $c'$ no longer needs to be binary-decomposed, one still has to compute $c' \pmod p$ in order to form its double-CRT representation in $\mathbb Z_{pq}[x]/\phi_m(x)$. However, since the dimension is not expanded by a factor of $\log q_l$, only $O(l)$ FFT/NTT operations are required.

Note that because the modulus satisfies $pq \ge q^2$, the modulus-to-noise ratio becomes very large, so the lattice dimension $n=\phi(m)$ has to be nearly doubled to maintain the security level. But since the complexity of FFT/NTT is $O(n \log n)$, the computational efficiency still improves by a factor of about $\log q_l/2$, and the storage overhead of the public key is also lower.

One can also combine base decomposition with modulus boosting: decompose the ciphertext $c'$ as $\sum_i D^ic_i'$, where $D$ is a large radix so that the number of components $c_i'$ is small; then apply the temporary modulus boost to each $c_i'$.

Modulus Switching in CRT

We want to keep ciphertexts and secret keys in double-CRT representation, apart from a few unavoidable conversions to the coefficient representation, so that computation stays fast. To switch a ciphertext $c$ under modulus $q_l=\prod_{j=0}^l p_j$ to a ciphertext $c'$ under modulus $q_{l-1}=\prod_{j=0}^{l-1} p_j$, we must preserve decryption correctness, $c' s\equiv cs \pmod 2$, and we also require the "rounding error" $\tau=c'-c/p_l$ to have a small norm.

Because $p_l$ is an odd prime, it suffices to find $c^\dagger = p_l \cdot c'$ such that $p_l \mid c^\dagger$, $c^\dagger \equiv c \pmod 2$ (sufficient but not necessary), and $c^\dagger-c$ has a small norm, and then output $c'=c^\dagger/p_l$. The modulus switching algorithm under double-CRT is:

  1. From the last row of the double-CRT, compute $\bar c=c \pmod{p_l}$ with coefficients in $[-p_l/2,p_l/2)$; note that $\bar c$ is regarded as an element of $R$ (it can be simulated in $R_{q_l}$).
  2. For each odd coefficient of $\bar c$, add or subtract $p_l$ to make it even, so that the coefficients lie in $[-p_l,p_l)$. This gives a small polynomial $\delta$ satisfying $\delta \equiv c \pmod{p_l}$ and $\delta \equiv 0 \pmod{2}$.
  3. Compute the double-CRT representation of $\delta$, then compute $c^\dagger = c-\delta \pmod{q_l}$, which satisfies $p_l \mid c^\dagger$.
  4. For each $c^\dagger \pmod{p_j}$, $j<l$, compute $p_l^{-1} \cdot c^\dagger \pmod{p_j}$ to obtain the double-CRT representation of $c'=c^\dagger/p_l$.

In the above process, step 1 costs one inverse NTT and step 3 costs $l$ NTTs, for a total of $O(l)$ FFT/NTT operations of length $n=\phi(m)$.
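
The four steps can be sketched on a single coefficient stored by its residues modulo each prime. This deliberately skips the NTT dimension, so only the CRT half of double-CRT is modelled. The primes 13, 17, 19 and the function names are illustrative assumptions, and `pow(x, -1, p)` needs Python 3.8 or later.

```python
PRIMES = [13, 17, 19]   # toy chain p_0, p_1, p_2 (odd primes)

def centered(x, q):
    """Representative of x mod q in [-q/2, q/2)."""
    r = x % q
    return r - q if r > q // 2 else r

def mod_switch(residues):
    """Drop the last prime p_l; returns (residues of c' mod p_j for j < l, delta)."""
    *low, last = residues
    p_l = PRIMES[len(residues) - 1]
    # Step 1: c mod p_l as a centered integer.
    cbar = centered(last, p_l)
    # Step 2: make it even by adding or subtracting p_l.
    if cbar % 2:
        delta = cbar - p_l if cbar > 0 else cbar + p_l
    else:
        delta = cbar
    # Steps 3 and 4: c_dagger = c - delta is divisible by p_l;
    # divide by multiplying with p_l^{-1} modulo each remaining prime.
    out = [(r - delta) * pow(p_l, -1, p) % p for r, p in zip(low, PRIMES)]
    return out, delta
```

Since `delta` is even and $p_l$ is odd, the switched value has the same parity as the original one, and it differs from $c/p_l$ by at most 1 in absolute value.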

WCC

In high-level applications of leveled FHE, both the number of multiplications and the multiplicative depth are used to measure the efficiency of an algorithm. However, many works focus on only one of the two rather than considering them together, so the resulting algorithms are not necessarily well suited to computation on leveled FHE.

[ZQH+20] proposed the Weighted Computational Complexity (WCC) model. The complexity is defined as
$$WCC = \sum_{l=0}^L W_l \cdot N_l$$

Here $N_l$ is the number of multiplications at level $l$ and $W_l$ is the cost of one multiplication at level $l$ (taking $W_l=1,\forall l$ counts only the number of multiplications; taking $W_l=0,l<L$ looks only at the multiplicative depth). According to [GHS12], key switching and modulus switching under double-CRT cost $O(l)$ NTT/FFT operations of length $n=\phi(m)$. Therefore, ignoring the finer details, the rough choice $W_l = l+1, l\ge 0$ is reasonable and sufficient to guide the design of algorithms better suited to leveled FHE.
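
As a small worked example, the definition translates directly into code. The function name `wcc` and the sample counts are hypothetical; the default weight is the $W_l = l+1$ choice from the text.

```python
def wcc(mults_per_level, weight=lambda l: l + 1):
    """WCC = sum over levels l of W_l * N_l, with N_l = mults_per_level[l]."""
    return sum(weight(l) * n for l, n in enumerate(mults_per_level))

# Hypothetical circuit with 4, 2 and 1 multiplications at levels 0, 1, 2.
example = [4, 2, 1]
weighted = wcc(example)                    # 1*4 + 2*2 + 3*1
mult_count = wcc(example, lambda l: 1)     # plain multiplication count
```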

WCC Analysis

[ZQH+20] analyzed the WCC complexity of integer comparison and integer addition.

Three comparison algorithms:

  1. Linear Comparison (LinC)
    $$com_i=(a_i\oplus 1) \cdot b_i \oplus (a_i\oplus b_i\oplus 1) \cdot com_{i-1}$$

    The final output is $COM(A,B)=com_{n-1}$, and one can compute $WCC=O(n^2)$.

  2. Iteration Comparison (IterC), a divide-and-conquer algorithm
    $$\begin{aligned} t_{i,1} &= (a_i \oplus 1)\cdot b_i,\,\, z_{i,1}=a_i\oplus b_i \oplus 1\\ t_{i,j} &= t_{i+h,j-h}+z_{i+h,j-h} \cdot t_{i,h},\,\, j>1,\,\, h=\lceil j/2 \rceil \end{aligned}$$

    The final output is $COM(A,B)=t_{0,n}$. The general formula for its WCC is complicated and hard to write in closed form.

  3. Logarithm Comparison (LogC), the fully expanded form
    $$\begin{aligned} d_i &= (a_i \oplus 1) \cdot b_i,\,\, z_i=a_i \oplus b_i \oplus 1\\ c_i &= d_i \cdot \prod_{j=i+1}^{n-1} z_j \end{aligned}$$

    The final output is $COM(A,B) = \sum_{i=0}^{n-1}c_i$. A binary tree is used to compute $Z_{n-1}=\prod_{j=1}^{n-1} z_j$, and every other $Z_{n-i}=\prod_{j=i}^{n-1} z_j$ is a product of some intermediate results of this binary tree (so the total number of multiplications is $O(n \log n)$ rather than $O(n)$). Its complexity is $WCC=O(n \log^2 n)$.
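
Both LinC and LogC compute the predicate $A<B$, which can be confirmed on cleartext bits. In this sketch the bit lists are little-endian, $com_{-1}$ is assumed to be 0, the sum in LogC is interpreted as XOR (at most one $c_i$ is nonzero), and the products are formed naively instead of through the shared binary tree.

```python
def lin_c(a_bits, b_bits):
    """LinC: scan from the least significant bit upward."""
    com = 0  # assumed initial value com_{-1} = 0
    for a, b in zip(a_bits, b_bits):
        com = ((a ^ 1) & b) ^ ((a ^ b ^ 1) & com)
    return com

def log_c(a_bits, b_bits):
    """LogC: fully expanded sum of c_i = d_i * prod_{j > i} z_j."""
    n = len(a_bits)
    d = [(a ^ 1) & b for a, b in zip(a_bits, b_bits)]
    z = [a ^ b ^ 1 for a, b in zip(a_bits, b_bits)]
    out = 0
    for i in range(n):
        c_i = d[i]
        for j in range(i + 1, n):
            c_i &= z[j]
        out ^= c_i
    return out

def to_bits(x, n):
    """Little-endian n-bit expansion."""
    return [(x >> i) & 1 for i in range(n)]
```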


Two integer addition algorithms:

  1. Ripple-Carry Adder: one can compute $WCC=O(n^2)$, with a larger hidden constant than LinC.
  2. Parallel-Prefix Adder: one can compute $WCC=O(n \log^2 n)$, with a larger hidden constant than LogC.

Dot Multiplication

Analogous to the binary operation of the PPA, [ZQH+20] defines DotMC:
$$(P_1,\,\,G_1) \circ (P_0,\,\,G_0) = (P_1+G_1 \cdot P_0,\,\, G_1 \cdot G_0)$$

Write $Add[(P,G)]=P+G$ for the final merging operation. Define the initial signals $p_i=(a_i \oplus 1) \cdot b_i$ for $0 \le i \le n-1$, and $g_i=a_i \oplus b_i \oplus 1$ for $1\le i \le n-1$, with $g_0=0$. Then:
$$\begin{aligned} COM(A,B) &= (a_{n-1}\oplus 1) \cdot b_{n-1} \oplus (a_{n-1}\oplus b_{n-1}\oplus 1) \cdot com_{n-2}\\ &= Add\left[(p_{n-1},g_{n-1}) \circ (p_{n-2},g_{n-2}) \circ \cdots \circ (p_0,g_0)\right] \end{aligned}$$

Because the DotMC operation is associative, a binary tree can be used to merge the signals in parallel.


One can compute $WCC=O(n \log n)$, with $O(n)$ multiplications and multiplicative depth $O(\log n)$. In comparison, LogC uses $O(n \log n)$ multiplications and has $WCC=O(n \log^2 n)$. The reason is analogous to Horner's rule: the fully expanded form contains many redundant multiplications.
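
The multiplication count and depth can be observed with a tree reduction on cleartext bits. In this sketch the $+$ in DotMC is assumed to be XOR, each DotMC is counted as two multiplications, and the function names are illustrative rather than taken from [ZQH+20].

```python
def dot_mc(hi, lo):
    """(P1, G1) o (P0, G0) = (P1 + G1*P0, G1*G0), with + taken as XOR."""
    (p1, g1), (p0, g0) = hi, lo
    return (p1 ^ (g1 & p0), g1 & g0)

def compare_dotmc(a_bits, b_bits):
    """Returns (COM(A, B), multiplications, depth) for little-endian inputs."""
    sig = [((a ^ 1) & b, a ^ b ^ 1) for a, b in zip(a_bits, b_bits)]
    sig[0] = (sig[0][0], 0)                  # g_0 = 0
    mults = depth = 0
    while len(sig) > 1:
        nxt = []
        for i in range(0, len(sig) - 1, 2):
            # The higher-index signal goes on the left of the operator.
            nxt.append(dot_mc(sig[i + 1], sig[i]))
            mults += 2
        if len(sig) % 2:
            nxt.append(sig[-1])              # odd leftover moves up unchanged
        sig = nxt
        depth += 1
    p, g = sig[0]
    return p ^ g, mults, depth               # Add[(P, G)] = P + G
```

For $n$ input bits the tree performs $n-1$ DotMC operations, that is $2(n-1)$ multiplications, in $\lceil \log_2 n \rceil$ levels.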

The CLA itself uses DotMC, but some of its DotMC operations can be deferred and moved to lower levels, where $W_l$ is smaller.

The original CLA circuit:

[Figure: the original CLA circuit]

After moving some DotMC operations (green dots):

[Figure: the CLA circuit with some DotMC operations (green dots) moved]

The paper points out that when $n>16$, the DotMC-based comparison and OptCLA are more efficient, and the advantage grows as $n$ increases.



Origin blog.csdn.net/weixin_44885334/article/details/133277268