references:
- [CS05] Choi Y, Swartzlander E E. Parallel prefix adder design with matrix representation[C]//17th IEEE Symposium on Computer Arithmetic (ARITH’05). IEEE, 2005: 90-98.
- [SV11] Smart N P, Vercauteren F. Fully homomorphic SIMD operations[J]. Designs, codes and cryptography, 2014, 71: 57-81.
- [GHS12] Gentry C, Halevi S, Smart N P. Fully homomorphic encryption with polylog overhead[C]//Annual International Conference on the Theory and Applications of Cryptographic Techniques. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012: 465-482.
- [GHS12] Gentry C, Halevi S, Smart N P. Homomorphic evaluation of the AES circuit[C]//Annual Cryptology Conference. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012: 850-867.
- [HS13] Halevi S, Shoup V. Design and implementation of a homomorphic-encryption library[J]. IBM Research (Manuscript), 2013, 6(12-15): 8-36.
- [ZQH+20] Zhang N, Qin Q, Hou Z, et al. Efficient comparison and addition for FHE with weighted computational complexity model[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 40(9): 1896-1908.
Carry-Lookahead Adder
Given two integers $A=a_{n-1}\cdots a_1 a_0$ and $B=b_{n-1}\cdots b_1 b_0$, there are a variety of adders to choose from for computing their sum. We use the following logic symbols: AND gate $\cdot$, OR gate $+$, XOR gate $\oplus$, where $\cdot$ has the highest precedence.
A full adder (FA) takes inputs $a,b,cin \in \{0,1\}$ and produces outputs $sout,cout \in \{0,1\}$. Its logic expressions are:

$$\begin{aligned} sout &= a \oplus b \oplus cin\\ cout &= a \cdot b+a \cdot cin + b \cdot cin \end{aligned}$$
Define the signals $g=a \cdot b$ (generate), $k=a+b$ (kill), and $p=a \oplus b$ (propagate). Then

$$\begin{aligned} sout &= p \oplus cin\\ cout &= g+k \cdot cin \end{aligned}$$
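As a quick sanity check, the following minimal sketch compares the direct full-adder equations with the $g,k,p$ form over all eight input combinations. The function names are illustrative only and do not come from the cited papers.

```python
from itertools import product

def full_adder(a, b, cin):
    """Direct form: sout = a ^ b ^ cin, cout = majority(a, b, cin)."""
    sout = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return sout, cout

def full_adder_gkp(a, b, cin):
    """Signal form: g = a.b, k = a+b, p = a^b (names as in the text)."""
    g = a & b
    k = a | b
    p = a ^ b
    return p ^ cin, g | (k & cin)

# Exhaustively compare both forms on {0,1}^3.
all_match = all(
    full_adder(*bits) == full_adder_gkp(*bits)
    for bits in product((0, 1), repeat=3)
)
```

Both forms agree because $a \cdot cin + b \cdot cin = (a+b) \cdot cin$.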
Ripple-Carry Adder
Ripple-carry adder (RCA): connect $n$ full adders in series. Given the inputs $A,B,c_0$, compute iteratively:

$$\begin{aligned} g_i &= a_i \cdot b_i\\ k_i &= a_i + b_i\\ p_i &= a_i \oplus b_i\\ s_i &= p_i \oplus c_i\\ c_{i+1} &= g_i+k_i \cdot c_i \end{aligned}$$

The critical path is the carry chain $c_1,\cdots,c_{n-1},c_n$, so the number of iteration rounds is $O(n)$.
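The iteration above can be sketched in a few lines of Python on plaintext bits. This is only an illustration of the recurrence: the little-endian bit order and the helper names `to_bits` and `from_bits` are assumptions made here.

```python
def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))

def ripple_carry_add(a_bits, b_bits, c0=0):
    """Return (sum_bits, carry_out); the carry is threaded through n rounds."""
    c = c0
    s = []
    for a, b in zip(a_bits, b_bits):
        g, k, p = a & b, a | b, a ^ b
        s.append(p ^ c)      # s_i = p_i xor c_i
        c = g | (k & c)      # c_{i+1} = g_i + k_i . c_i
    return s, c

sum_bits, carry = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
```

Each round depends on the carry of the previous one, which is why the sequential depth is $O(n)$.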
Carry-Lookahead Adder
Carry-lookahead adder (CLA): repeatedly expanding the carry chain of the RCA gives

$$\begin{aligned} c_{i+1} &= g_i+k_i \cdot (g_{i-1}+k_{i-1} \cdot (g_{i-2} + \cdots))\\ &= g_i + \sum_{j=0}^{i-1} \left(\prod_{l=j+1}^i k_l\right)g_j + c_0\prod_{l=0}^i k_l \end{aligned}$$

So the carry signal $c_{i+1}$ can be computed directly from the $(g_i,k_i)$ signals, without waiting for $c_i$. The critical path length is only $O(\log n)$, but the fan-in and fan-out are extremely high. To balance computational delay against circuit complexity, one can use blocked/hierarchical carry-lookahead (see the Zhihu article "[Hardware Arithmetic Design] Hardware Adder Principle | Chapter 2: Carry-Lookahead Adder, Blocked and Hierarchical Carry-Lookahead Adders").
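To make the expanded formula concrete, here is a small sketch that evaluates it term by term and compares the result with the ripple recurrence. It is written for clarity rather than speed, and the function names are hypothetical.

```python
def carries_ripple(g, k, c0):
    """Reference carries [c_0, ..., c_n] from c_{i+1} = g_i + k_i . c_i."""
    c = [c0]
    for gi, ki in zip(g, k):
        c.append(gi | (ki & c[-1]))
    return c

def carries_lookahead(g, k, c0):
    """Carries from the fully expanded sum-of-products formula."""
    c = [c0]
    for i in range(len(g)):
        term = g[i]
        for j in range(i):
            prod = 1
            for l in range(j + 1, i + 1):   # prod_{l=j+1}^{i} k_l
                prod &= k[l]
            term |= prod & g[j]
        prod = 1
        for l in range(i + 1):              # prod_{l=0}^{i} k_l
            prod &= k[l]
        term |= prod & c0
        c.append(term)
    return c

def gk_signals(A, B, n):
    """Generate/kill signals of two n-bit integers, little-endian."""
    a = [(A >> i) & 1 for i in range(n)]
    b = [(B >> i) & 1 for i in range(n)]
    return [x & y for x, y in zip(a, b)], [x | y for x, y in zip(a, b)]
```

Every $c_{i+1}$ here is a function of the input signals only, at the price of products with up to $i+1$ factors, which is the fan-in problem mentioned above.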
Parallel-Prefix Adder
Parallel-prefix adder (PPA): to better understand the CLA, the merging of the $(g_i,k_i)$ signals can be described in algebraic language. We define a binary operation $\circ$ on ordered pairs $(g,k)$ as follows:

$$(g_j,\,\, k_j) \circ (g_i,\,\, k_i) := (g_j+k_jg_i,\,\, k_ik_j)$$
It is easy to verify that $\circ$ is associative and idempotent, but not commutative.
Then the logic expression of the CLA can be rewritten as:

$$\begin{aligned} (c_{i+1}, \,\, k_0k_1\cdots k_i) &= (g_i+k_i \cdot (g_{i-1}+k_{i-1} \cdot (g_{i-2} + \cdots)),\,\, k_i\cdots k_1 k_0)\\ &= (g_i,k_i) \circ (g_{i-1},k_{i-1}) \circ \cdots \circ (g_0,k_0) \circ (c_0,1) \end{aligned}$$
Each signal $(g_i,k_i)$, as well as $(c_0,1)$, is available immediately. By the associativity of $\circ$, one can design a family of parallel merging networks for the $(g,k)$ signals. Knowles proposed a class of depth-optimal parallel prefix adders. The circuit structure is:
The Knowles family has two well-known PPAs as its limiting cases: the Kogge-Stone adder and the Ladner-Fischer adder. The LF structure requires unbounded fan-out (which suits FHE operations), while the KS structure requires unbounded parallelism (number of multiplications). The Brent-Kung adder balances fan-out against the number of computing units by increasing the number of levels (multiplicative depth), and there are also hybrid schemes that mix the KS, LF, and BK structures.
We use parallel-prefix graphs to simplify the representation of a PPA. Each black dot represents one $\circ$ operation (two multiplications, one addition), and the computation proceeds from bottom to top. Knowles adders with different fan-outs:
The most suitable PPA for FHE operations is the Ladner-Fischer adder: a software implementation does not care about fan-out, only about the multiplicative depth and the number of multiplications.
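The sketch below illustrates the prefix view on plaintext bits. It implements $\circ$ and merges the signals with a Sklansky-style recursive-doubling network, which is the unbounded-fan-out layout usually associated with the Ladner-Fischer end of the Knowles family. That correspondence, the function names, and the way $(c_0,1)$ is prepended as element 0 are assumptions of this example, not details taken from [CS05].

```python
def op(hi, lo):
    """(g_j, k_j) o (g_i, k_i) = (g_j + k_j g_i, k_i k_j), with + as OR."""
    (gj, kj), (gi, ki) = hi, lo
    return (gj | (kj & gi), ki & kj)

def prefix_carries(a_bits, b_bits, c0=0):
    """Return ([c_0, ..., c_n], depth, op_count) for little-endian inputs."""
    # Element 0 is (c_0, 1); element t is (g_{t-1}, k_{t-1}).
    x = [(c0, 1)] + [(a & b, a | b) for a, b in zip(a_bits, b_bits)]
    depth = ops = 0
    while (1 << depth) < len(x):
        for i in range(len(x)):
            if (i >> depth) & 1:
                # Last index of the block just below i's block at this level.
                j = ((i >> depth) << depth) - 1
                x[i] = op(x[i], x[j])
                ops += 1
        depth += 1
    return [g for g, _ in x], depth, ops

carries, depth, ops = prefix_carries([1, 1, 0, 1], [1, 0, 1, 1])
```

Within one level, every updated position reads a position that is not updated at that level, so all $\circ$ operations of a level are independent and the depth grows logarithmically in the number of signals.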
HE library
Double-CRT
[HS13] designed HElib, a C++ implementation of RLWE-based BGV built on NTL, using the SIMD technique of [SV11] and the permuting/routing technique of [GHS12].
For further acceleration, [HS13] represents polynomials in the ring $\mathbb Z_q[x]/\phi_m(x)$ in double-CRT form. BGV uses a chain of moduli $q_l = \prod_{i=0}^l p_i$, where the $p_i$ are primes of similar size satisfying $m \mid p_i-1$, so that $\mathbb Z_{p_i}$ contains a primitive $m$-th root of unity $\zeta_i$. Thus $\phi_m(x)=\prod_{j \in \mathbb Z_m^*}(x-\zeta_i^j) \pmod{p_i}$.
First apply the CRT decomposition with respect to the modulus $q_l$. Writing $R = \mathbb Z[x]/\phi_m(x)$ and $R_{p}=R/(pR)$, we have

$$\mathbb Z_{q_l}[x]/\phi_m(x) \cong R/(q_lR) \cong R/(p_0R) \times R/(p_1R) \times \cdots \times R/(p_lR)$$

Then, for each small ring $R_{p_i} \cong \mathbb Z_{p_i}[x]/\phi_m(x)$, apply the CRT decomposition again:

$$\mathbb Z_{p_i}[x]/\phi_m(x) \cong \mathbb Z_{p_i}[x]/(x-\zeta_i^{j_1}) \times \cdots \times \mathbb Z_{p_i}[x]/(x-\zeta_i^{j_{\phi(m)}}),\quad j_t \in \mathbb Z_m^*$$
Suppose $m=2^k$, so that $\phi_m(x)=x^{m/2}+1$. Then $\mathbb Z_m^*=\{1,3,5,\cdots,m-1\}$, and FFT/NTT can be used for fast computation. For general $m$ (mixed radix), the elements of $\mathbb Z_m^*$ are distributed irregularly, and the conversion between the coefficient representation and Double-CRT is more complicated.
To sum up, a ring element $a \in \mathbb Z_{q_l}[x]/\phi_m(x)$ at the level with modulus $q_l$ can be represented as a matrix of shape $(l+1) \times \phi(m)$: row $i$ is the NTT-domain representation of $a(x) \pmod{p_i}$, and the entry in row $i$, column $j$ is $a(\zeta_i^j) \pmod{p_i}$.

$$\begin{aligned} DoubleCRT^l(a) &= \{a(x) \pmod{p_i}\pmod{x-\zeta_i^j}\}_{0\le i\le l,\,\, j\in \mathbb Z_m^*}\\ &= \{a(\zeta_i^j) \pmod{p_i}\}_{0\le i\le l,\,\, j\in \mathbb Z_m^*} \end{aligned}$$
It is easy to verify that addition and multiplication of ring elements become component-wise addition and multiplication of the matrices:

$$\begin{aligned} DoubleCRT^l(a+b) &= DoubleCRT^l(a) + DoubleCRT^l(b)\\ DoubleCRT^l(a \cdot b) &= DoubleCRT^l(a) \cdot DoubleCRT^l(b) \end{aligned}$$
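A toy example may help here. The sketch below uses $m=8$ (so $\phi_m(x)=x^4+1$) and the two small primes $17$ and $41$, both congruent to $1$ modulo $8$. These parameters are chosen purely for illustration and are far too small for any real scheme. The matrix is computed by direct polynomial evaluation rather than an NTT, and the function names are hypothetical, not HElib's API.

```python
M = 8                      # toy cyclotomic index, phi_8(x) = x^4 + 1
N = M // 2                 # phi(m) = 4
PRIMES = [17, 41]          # toy p_0, p_1 with m | p_i - 1
Q = PRIMES[0] * PRIMES[1]  # q_1 = p_0 * p_1
UNITS = [j for j in range(1, M, 2)]   # Z_8^* = {1, 3, 5, 7}

def primitive_root(p):
    """Smallest z with z^(m/2) = -1 mod p, i.e. a primitive m-th root (m = 2^k)."""
    return next(z for z in range(2, p) if pow(z, M // 2, p) == p - 1)

ZETAS = [primitive_root(p) for p in PRIMES]

def poly_eval(a, x, p):
    """Horner evaluation of a(x) mod p, coefficients in increasing degree."""
    acc = 0
    for coeff in reversed(a):
        acc = (acc * x + coeff) % p
    return acc

def double_crt(a):
    """Rows indexed by primes, columns by j in Z_m^*: entry a(zeta_i^j) mod p_i."""
    return [[poly_eval(a, pow(z, j, p), p) for j in UNITS]
            for p, z in zip(PRIMES, ZETAS)]

def ring_mul(a, b):
    """Schoolbook product in Z_Q[x]/(x^N + 1), using x^N = -1."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                res[i + j] = (res[i + j] + ai * bj) % Q
            else:
                res[i + j - N] = (res[i + j - N] - ai * bj) % Q
    return res

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
lhs = double_crt(ring_mul(a, b))
rhs = [[(x * y) % p for x, y in zip(ra, rb)]
       for p, ra, rb in zip(PRIMES, double_crt(a), double_crt(b))]
```

Here `lhs` and `rhs` agree, which is the component-wise multiplication property stated above.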
The automorphisms in the Galois group $\mathcal{Gal}(\mathbb Q(\zeta)/\mathbb Q) \cong \mathbb Z_m^*$ are

$$\kappa_k(a(x)):=a(x^k) \pmod{\phi_m(x)},\,\, \forall k \in \mathbb Z_m^*$$

Because $m \mid p_i-1$, the maps in $\mathcal{Gal}(\mathbb Z_{p_i}(\zeta)/\mathbb Z_{p_i})$ act as permutations between plaintext slots (the Frobenius map $\kappa_1(\zeta)=\zeta^{p_i}=\zeta$ is the identity map). Any other map $\kappa_k:\zeta \mapsto \zeta^{k}$ amounts to permuting the columns of the matrix: column $j$ of $DoubleCRT^l(\kappa_k(a))$ equals column $jk \pmod m$ of $DoubleCRT^l(a)$. So routing is easy to implement on the Double-CRT representation.
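The column permutation can be checked on a toy instance. The sketch below again assumes $m=8$ with the illustrative primes $17$ and $41$ and hypothetical helper names. It applies $\kappa_k$ in coefficient form and compares the resulting matrix with a column-permuted copy of the original.

```python
M = 8
N = M // 2
PRIMES = [17, 41]                      # toy primes with m | p - 1
Q = PRIMES[0] * PRIMES[1]
UNITS = [j for j in range(1, M, 2)]    # column labels j in Z_8^*

def primitive_root(p):
    """Smallest z with z^(m/2) = -1 mod p (valid because m is a power of two)."""
    return next(z for z in range(2, p) if pow(z, M // 2, p) == p - 1)

ZETAS = [primitive_root(p) for p in PRIMES]

def double_crt(a):
    """Entry (i, j) is a(zeta_i^j) mod p_i, by direct evaluation."""
    return [[sum(c * pow(z, j * e, p) for e, c in enumerate(a)) % p
             for j in UNITS] for p, z in zip(PRIMES, ZETAS)]

def automorphism(a, k):
    """kappa_k(a) = a(x^k) mod (x^N + 1), using x^N = -1 and x^M = 1."""
    res = [0] * N
    for i, coeff in enumerate(a):
        e = (i * k) % M
        if e < N:
            res[e] = (res[e] + coeff) % Q
        else:
            res[e - N] = (res[e - N] - coeff) % Q
    return res

def permute_columns(mat, k):
    """New column j takes the old column labelled j*k mod m."""
    return [[row[UNITS.index((j * k) % M)] for j in UNITS] for row in mat]

a = [1, 2, 3, 4]
checks = [double_crt(automorphism(a, k)) == permute_columns(double_crt(a), k)
          for k in UNITS]
```

No NTT is needed to apply $\kappa_k$ in the evaluation domain, only a reindexing of the columns.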
New Key Switching in CRT
Key switching is essentially a homomorphic multiplication by $W[s_1 \to s_2] = Enc_{s_2}(s_1; e)$:

$$\begin{aligned} &c_2=c_1W=Enc_{s_2}(c_1s_1;\, c_1e)\\ &c_2s_2 = c_1Ws_2 = c_1s_1 \end{aligned}$$

However, the norm of the ciphertext $c_1$, viewed as a scalar, is too large relative to the modulus $q_l$, so the noise term after the homomorphic operation would destroy the plaintext.
In BGV, to control the noise term of key switching, a binary decomposition is used: given a ciphertext $c'$ under $s'$, write $c'=\sum_i 2^i c_i'$ and let

$$c^*=c_0'|c_1'|c_2'|\cdots,\,\, s^*=s'|2s'|4s'|\cdots$$

Then compute the matrix $W=W[s^* \to s]$, which satisfies $s^*=W \cdot s$, and compute the ciphertext $c=c^* \cdot W$, so that

$$\langle c', s' \rangle = \langle c^*, s^* \rangle = \langle c, s \rangle$$

To compute the binary decomposition of $c'$, Double-CRT must first be converted back to the coefficient representation. Then each of the $\log q_l$ components $c_i'$ is converted to Double-CRT again to perform the inner product. This costs $O(l \log q_l)$ FFT/NTT operations of length $n=\phi(m)$.
[GHS12] used a different noise-control technique: temporarily raising the modulus. Using the LSB encoding, suppose $\langle c',s' \rangle=2e'+a \pmod q$. Then for any odd number $p$,

$$\langle c',ps' \rangle = 2pe'+pa = 2e''+a \pmod{pq}$$

where $e'' = pe'+a(p-1)/2$ is approximately $p$ times the original noise $e'$. A single modulus switch back to modulus $q$ brings the noise back to roughly the magnitude of $e'$. Therefore, we choose a sufficiently large odd number $p \approx q\sqrt m$, and use the ciphertext of the transformation matrix $W=W[ps' \to s] \pmod{pq}$ as the auxiliary information for key switching.
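The identity $2pe'+pa = 2e''+a$ follows from writing $pa = (p-1)a + a$ with $p-1$ even. The following small sketch checks it over plain integers. The function name is hypothetical and no ring structure is modelled.

```python
def raised_noise(e_prime, a, p):
    """Return e'' such that p * (2 e' + a) == 2 e'' + a, for odd p."""
    if p % 2 != 1:
        raise ValueError("p must be odd so that a * (p - 1) is even")
    return p * e_prime + a * (p - 1) // 2

# Example: e' = 3, a = 1, p = 101 gives 101 * 7 = 707 = 2 * 353 + 1.
example = raised_noise(3, 1, 101)
```

The plaintext bit $a$ stays in the least significant position, while the noise grows by a factor of about $p$, which the subsequent modulus switch removes.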
Although $c'$ no longer needs a binary decomposition, one still has to compute $c' \pmod p$ in order to form the Double-CRT representation over $\mathbb Z_{pq}[x]/\phi_m(x)$. However, because the dimension is not expanded by a factor of $\log q_l$, only $O(l)$ FFT/NTT operations are required.
Note that because the modulus $pq \ge q^2$ makes the modulus/noise ratio very large, the lattice dimension $n=\phi(m)$ must be roughly doubled to maintain the security level. But the complexity of FFT/NTT is $O(n \log n)$, so the computational efficiency still improves by a factor of about $\log q_l/2$, and the storage overhead of the public key is also lower.
One can also combine base decomposition with modulus raising: decompose the ciphertext $c'$ as $\sum_i D^ic_i'$, where $D$ is a large radix so that the number of $c_i'$ is small, and then apply the temporary modulus raising to each $c_i'$.
Modulus Switching in CRT
We want to keep ciphertexts and secret keys in the Double-CRT representation, except where the coefficient representation is unavoidable, to facilitate fast computation. We switch a ciphertext $c$ under the modulus $q_l=\prod_{j=0}^l p_j$ to a ciphertext $c'$ under the modulus $q_{l-1}=\prod_{j=0}^{l-1} p_j$, preserving decryption correctness $c' s\equiv cs \pmod 2$ and also requiring the norm of the "rounding error" $\tau=c'-c/p_l$ to be small.
Because $p_l$ is an odd prime, it suffices to find $c^\dagger$ (which plays the role of $p_l \cdot c'$) such that $p_l \mid c^\dagger$, $c^\dagger \equiv c \pmod 2$ (sufficient but not necessary), and the norm of $c^\dagger-c$ is small. Then output $c'=c^\dagger/p_l$. The modulus-switching algorithm under Double-CRT is:
- From the last row of the Double-CRT, compute $\bar c=c \pmod{p_l}$ with coefficients in $[-p_l/2,p_l/2)$. Note that $\bar c$ is regarded as an element of $R$ (this can be simulated in $R_{q_l}$).
- For each odd coefficient of $\bar c$, add or subtract $p_l$ to make it even, so that the coefficients lie in $[-p_l,p_l)$. This gives a small polynomial $\delta$ satisfying $\delta \equiv c \pmod{p_l}$ and $\delta \equiv 0 \pmod{2}$.
- Compute the Double-CRT representation of $\delta$, then compute $c^\dagger = c-\delta \pmod{q_l}$, which satisfies $p_l \mid c^\dagger$.
- For each $c^\dagger \pmod{p_j}$, $j<l$, compute $p_l^{-1} \cdot c^\dagger \pmod{p_j}$ to obtain the Double-CRT representation of $c'=c^\dagger/p_l$.
In the above process, step 1 costs $1$ INTT and step 3 costs $l$ NTTs, for a total of $O(l)$ FFT/NTT operations of length $n=\phi(m)$.
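The four steps can be illustrated on a single integer coefficient. The sketch below is a scalar simplification: it works on one coefficient in plain integer form instead of on polynomials in Double-CRT, and the prime chain `[17, 41, 73]` is a made-up toy example. It requires Python 3.8 or later for `pow(x, -1, p)`.

```python
PRIMES = [17, 41, 73]   # hypothetical chain p_0, p_1, p_2 (l = 2)
P_L = PRIMES[-1]        # the prime p_l to be dropped

def centered_mod(x, p):
    """Representative of x mod p in the centered range around zero."""
    r = x % p
    return r - p if r > p // 2 else r

def mod_switch_coeff(c):
    """Return (delta, c_dagger, rows) for one integer coefficient c."""
    # Step 1: small representative of c mod p_l.
    c_bar = centered_mod(c, P_L)
    # Step 2: shift odd values by +-p_l so that delta is even.
    if c_bar % 2 == 1:
        delta = c_bar + P_L if c_bar < 0 else c_bar - P_L
    else:
        delta = c_bar
    # Step 3: c_dagger is divisible by p_l and has the parity of c.
    c_dagger = c - delta
    # Step 4: residues of c' = c_dagger / p_l modulo the remaining primes.
    rows = [(pow(P_L, -1, p) * c_dagger) % p for p in PRIMES[:-1]]
    return delta, c_dagger, rows

delta, c_dagger, rows = mod_switch_coeff(1000)
```

For $c=1000$ this gives $\bar c = -22$, $\delta=-22$, $c^\dagger = 1022 = 73 \cdot 14$, so $c'=14$ with rounding error $|c' - c/p_l| = 22/73 < 1$.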
WCC
In high-level applications of leveled FHE, both the number of multiplications and the multiplicative depth can be used to measure the efficiency of an algorithm. However, many works focus on only one of the two rather than considering them together, so the resulting high-level algorithms are not necessarily well suited to computation on leveled FHE.
[ZQH+20] proposed the Weighted Computational Complexity (WCC) model. The complexity is defined as

$$WCC = \sum_{l=0}^L W_l \cdot N_l$$

where $N_l$ is the number of multiplications at level $l$ and $W_l$ is the cost of one multiplication at level $l$ (taking $W_l=1,\forall l$ measures only the number of multiplications, while taking $W_l=0,l<L$ measures only the multiplicative depth). According to [GHS12], key switching and modulus switching under Double-CRT cost $O(l)$ NTT/FFT operations of length $n=\phi(m)$. Therefore, ignoring the finer details, the weights are roughly chosen as $W_l = l+1, l\ge 0$, which is adequate for guiding the design of high-level algorithms better suited to leveled FHE.
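As a minimal sketch, WCC can be computed from a list of per-level multiplication counts. The function name and the convention that list index $l$ corresponds to level $l$ are assumptions made for this example.

```python
def wcc(mults_per_level, weights=None):
    """Weighted computational complexity: sum over l of W_l * N_l.

    mults_per_level[l] is N_l; weights defaults to W_l = l + 1.
    """
    if weights is None:
        weights = [level + 1 for level in range(len(mults_per_level))]
    if len(weights) != len(mults_per_level):
        raise ValueError("need exactly one weight per level")
    return sum(w * n for w, n in zip(weights, mults_per_level))

counts = [4, 2, 1]                          # N_0, N_1, N_2
weighted = wcc(counts)                      # 1*4 + 2*2 + 3*1
mult_count_only = wcc(counts, [1, 1, 1])    # plain multiplication count
```

With the default weights, a multiplication at a higher level counts for more, so two circuits with the same number of multiplications can have different WCC.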
WCC Analysis
[ZQH+20] analyzed the WCC complexity of integer comparison and integer addition.
Three comparison algorithms:
- Linear Comparison (LinC):

  $$com_i=(a_i\oplus 1) \cdot b_i \oplus (a_i\oplus b_i\oplus 1) \cdot com_{i-1}$$

  The final output is $COM(A,B)=com_{n-1}$, and one can compute $WCC=O(n^2)$.
- Iteration Comparison (IterC), a divide-and-conquer algorithm:

  $$\begin{aligned} &t_{i,1} = (a_i \oplus 1)\cdot b_i,\,\, z_{i,1}=a_i\oplus b_i \oplus 1\\ &t_{ij}=t_{i+h,j-h}+z_{i+h,j-h} \cdot t_{i,h},\,\, j>1,\,\, h=\lceil j/2 \rceil \end{aligned}$$

  The final output is $COM(A,B)=t_{0,n}$. The general formula for its WCC is complicated and hard to write down.
- Logarithm Comparison (LogC), the fully expanded form:

  $$\begin{aligned} &d_i=(a_i \oplus 1) \cdot b_i,\,\, z_i=a_i \oplus b_i \oplus 1\\ &c_i = d_i \cdot \prod_{j=i+1}^{n-1} z_j \end{aligned}$$

  The final output is $COM(A,B) = \sum_{i=0}^{n-1}c_i$. A binary tree is used to compute $Z_{n-1}=\prod_{j=1}^{n-1} z_j$, and every other $Z_{n-i}=\prod_{j=i}^{n-1} z_j$ is a product of some intermediate results of this tree (so the total number of multiplications is $O(n \log n)$ rather than $O(n)$). Its complexity is $WCC=O(n \log^2 n)$.
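The recurrences above can be checked on plaintext bits. The sketch below implements LinC and a naive version of LogC that recomputes each product instead of sharing a binary tree. It assumes little-endian bit lists and $com_{-1}=0$, and the function names are illustrative. Under these assumptions $COM(A,B)$ evaluates to $1$ exactly when $A<B$.

```python
def lin_c(a_bits, b_bits):
    """LinC: com_i = (a_i^1).b_i ^ (a_i^b_i^1).com_{i-1}, with com_{-1} = 0."""
    com = 0
    for a, b in zip(a_bits, b_bits):
        com = ((a ^ 1) & b) ^ ((a ^ b ^ 1) & com)
    return com

def log_c(a_bits, b_bits):
    """LogC: XOR over i of d_i * prod_{j > i} z_j (products recomputed naively)."""
    n = len(a_bits)
    d = [(a ^ 1) & b for a, b in zip(a_bits, b_bits)]
    z = [a ^ b ^ 1 for a, b in zip(a_bits, b_bits)]
    result = 0
    for i in range(n):
        prod = 1
        for j in range(i + 1, n):
            prod &= z[j]
        result ^= d[i] & prod
    return result

def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]

less_than = lin_c(to_bits(5, 4), to_bits(9, 4))
```

LinC has a chain of $n$ dependent multiplications, while LogC trades that depth for more multiplications, which is the trade-off that WCC is meant to weigh.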
Addition of two integers:
- Ripple-Carry Adder: one can compute $WCC=O(n^2)$, with a larger hidden constant than LinC.
- Parallel-Prefix Adder: one can compute $WCC=O(n \log^2 n)$, with a larger hidden constant than LogC.
Dot Multiplication
Similar to the binary operation of the PPA, [ZQH+20] defines DotMC:

$$(P_1,\,\,G_1) \circ (P_0,\,\,G_0) = (P_1+G_1 \cdot P_0,\,\, G_1 \cdot G_0)$$

Write $Add[(P,G)]=P+G$ for the merging operation on the final signal. Define the initial signals $p_i=(a_i \oplus 1) \cdot b_i, 0 \le i \le n-1$, and $g_i=a_i \oplus b_i \oplus 1, 1\le i \le n-1$, with $g_0=0$. Then:

$$\begin{aligned} COM(A,B) &= (a_{n-1}\oplus 1) \cdot b_{n-1} \oplus (a_{n-1}\oplus b_{n-1}\oplus 1) \cdot com_{n-2}\\ &= Add\left[(p_{n-1},g_{n-1}) \circ (p_{n-2},g_{n-2}) \circ \cdots \circ (p_0,g_0)\right] \end{aligned}$$
Because the DotMC operation is associative, the signals can be merged in parallel using a binary tree.
It can be calculated that $WCC=O(n \log n)$, with $O(n)$ multiplications and multiplicative depth $O(\log n)$. In comparison, LogC uses $O(n \log n)$ multiplications and has complexity $WCC=O(n \log^2 n)$. The reason is analogous to Horner's rule: many of the multiplications in LogC are redundant.
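The binary-tree merge can be sketched as follows on plaintext bits. The little-endian layout, the function names, and the use of XOR for $+$ (as in arithmetic modulo 2) are assumptions of this example. The function also returns the tree depth and the number of AND operations so that the $O(n)$ and $O(\log n)$ claims can be observed on small inputs.

```python
def dot_mc(hi, lo):
    """(P1, G1) o (P0, G0) = (P1 + G1.P0, G1.G0), with + as XOR."""
    (p1, g1), (p0, g0) = hi, lo
    return (p1 ^ (g1 & p0), g1 & g0)

def compare_dotmc(a_bits, b_bits):
    """Return (COM(A, B), depth, and_count) using a binary merge tree."""
    signals = [((a ^ 1) & b, (a ^ b ^ 1) if i > 0 else 0)
               for i, (a, b) in enumerate(zip(a_bits, b_bits))]
    and_count = len(signals)          # one AND per initial p_i
    depth = 0
    while len(signals) > 1:
        # Merge neighbours; the higher-index signal is the left operand.
        merged = [dot_mc(signals[i + 1], signals[i])
                  for i in range(0, len(signals) - 1, 2)]
        and_count += 2 * len(merged)  # two ANDs per DotMC
        if len(signals) % 2:
            merged.append(signals[-1])
        signals = merged
        depth += 1
    p, g = signals[0]
    return p ^ g, depth, and_count    # Add[(P, G)] = P + G

def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]

result, depth, and_count = compare_dotmc(to_bits(37, 8), to_bits(200, 8))
```

For $n=8$ the tree uses $7$ DotMC operations in $3$ merge levels, and the output matches the linear recurrence for LinC.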
The CLA itself also uses DotMC, but some of its DotMC operations can be deferred so that they are computed at lower levels, which makes $W_l$ smaller.
The original CLA circuit is:
After moving some DotMC operations (green dots):
The paper points out that when $n>16$, comparison with DotMC and OptCLA is more efficient, and the advantage grows as $n$ increases.