时间:2024-07-28
Sikandar Zulqarnain Khan,Sajjad Shaukat Jamal,Asher Sajid and Muhammad Rashid
1Department of Aeronautical Engineering,Estonian Aviation Academy,Tartu,61707,Estonia
2Department of Mathematics,College of Science,King Khalid University,Abha,61413,Saudi Arabia
3Department of Electrical Engineering,Bahria University,Islamabad,44000,Pakistan
4Department of Computer Engineering,Umm Al-Qura University,Makkah,21955,Saudi Arabia
Abstract:This paper presents an efficient crypto processor architecture for key agreement using ECDH (Elliptic-curve Diffie Hellman) protocol overThe composition of our key-agreement architecture is expressed in consisting of the following:(i)Elliptic-curve Point Multiplication architecture for public key generation(DESIGN-I)and(ii)integration of DESIGN-I with two additional routing multiplexers and a controller for shared key generation(DESIGN-II).The arithmetic operators used in DESIGN-I and DESIGNII contain an adder,squarer,a multiplier and inversion.A simple shift and add multiplication method is employed to retain lower hardware resources.Moreover,an essential inversion operation is operated using the Itoh-Tsujii algorithm with similar hardware resources of used squarer and multiplier units.The proposed architecture is implemented in a Verilog HDL.The implementation results are given on a Xilinx Virtex-7 FPGA(field-programmable gate array) device.For DESIGN-I and DESIGN-II over GF(2163),(i) the utilized Slices are 3983 and 4037,(ii)the time to compute one public key and a shared secret is 553.7 μs and 1170.7 μs and(iii)the consumed power is 29 μW and 57 μW.Consequently,the achieved area optimized and power reduced results show that the proposed ECDH architecture is a suitable alternative(to generate a shared secret) for the applications that require low hardware resources and power consumption.
Keywords:Elliptic curve cryptography;point multiplication;key-agreement;diffie hellman;area optimized;architecture;FPGA
In recent years,demand for the secure transmission of reliable information over any network is increasing dramatically.Various applications such as healthcare[1],automotive[2,3],aviation[4,5]smart cards[6,7],Internet of Things (IoT)[8-12],and several others require the secure exchange of sensitive information on insecure public channels.The cryptographic algorithms(both symmetric and asymmetric) are available in the literature to provide several security services like a public key exchange,signature generation/verification,authentication and encryption/decryption.Elliptic-curve cryptography (ECC) is a form of asymmetric cryptography,which provides an equivalent security level with a smaller key size in respect to other public-key cryptosystems such as RSA (Rivest-Shamir-Adleman)[13,14].Moreover,many organizations such as IEEE (Institute of Electrical and Electronics Engineering),NIST(National Institute of Standards and Technology),ANSI(American National Standards Institute) and SEC (Standards for Efficient Cryptography) accepted ECC as standardization.
The security strength of ECC relies on solving the discrete logarithms problem[15].It comprises four layers of a model[13].The highest layer is the protocol that defines a set of rules to perform encryption/decryption,signature generation/verification and public key exchange.The Elliptic-curve point multiplication(ECPM)is the most critical operation.Its computation depends on the execution of point addition(PA)and double(PD)operations.Arithmetic operations(such as addition,multiplication,square and inversion)are needed to perform point addition and double computations.
There are different settings in the literature to implement ECC protocols.These settings indicate choices for (i) two different basis (polynomial and Gaussian normal) for the representation of the initial &final point and (ii) availability for two different coordinate systems (affine and projective).The polynomial basis is considered in this paper as it is more beneficial to acquire faster modular multiplications while a Gaussian normal basis is valuable where frequent squares are needed to compute[13,16].Moreover,a projective coordinate system is selected because it requires minimum inversion operations when compared to an affine coordinate system[15].
Several hardware architectures are published in the state-of-the-art to accelerate the performance of ECC operations[16-24].Mostly the hardware implementations are considered for the optimizations of point multiplication(PM)operation in terms of various parameters such as frequency,area,power and throughput[16-21].
In[16],low-cost and high-speed PM architectures are described for Binary Edward Curves(BEC).Low cost means lower hardware resource utilization.The use of a pipelined digit serial multiplier results in low area utilization.The high-speed is attained with the employment of three pipelined multipliers.OverGF(2163)andGF(2233),the synthesis results are given on Virtex-5 FPGA device.In[17],an efficient PM architecture using Lopez Dahab algorithm is presented overGF(2m)withm=163.Towards clock cycles reduction,a hybrid Karatsuba multiplier is utilized.The reduced hardware resources are achieved with the utilization of multiplier and squarer units for the computation of modular inversion using the Itoh-Tsujii algorithm.The implementation results are reported on Xilinx Virtex-7 device,where the architecture takes 25.3 μs for one PM computation and utilizes only the 3657 slices.Another area optimized architecture overGF(2163)for PM computation is described in[18]where bit/digit serial multipliers are utilized.On Virtex-5 FPGA,their architecture requires 0.11msfor one PM operation and utilizes 473 slices.
In[19],a high-speed and low-latency ECC processor implementation overGF(2m)on Virtex-4,Virtex-5 and Virtex-7 FPGA devices is presented.They reported implementation results using single and three modular multipliers.Their singular multiplier-based ECC processor provides the highest speed on Virtex4,Virtex5 and Virtex7 FPGA devices.Finally,on same FPGA devices,the use of three multipliers results in a decrease in the number of clock cycles.A dual binary field PM implementation of ECC overGF(2283)andGF(2571)on Virtex-5 and Virtex-7 FPGA devices is described in[20].A Karatsuba multiplication method is employed to reduce the utilization of required clock cycles.
In[21],a high-performance hardware architecture of ECPM overGF(2m)field is presented.Their design is consists of a Montgomery ladder method and employs a polynomial basis for finite field arithmetic operations.A single Karatsuba multiplier is used to improve the performance of the PM architecture.Other operators(adder and squarer)with the consideration of a small number of hardware resources are utilized in parallel with the multiplier.These architectural employments ultimately reduce the clock cycles and lead to improving performance.ForGF(2163)on Virtex-4 FPGA,their design utilizes 10417 slices and achieves an operational clock frequency of 121MHz.
There are very few hardware designs where key-agreement using Elliptic-curve Diffie Hellman is considered for the optimizations on FPGA[22-24].A hardware/software implementation of an ECDH protocol is described in[22].The objective was to reduce the hardware resources and processing power for area-constrained applications.The implementation of their architecture was realized on a Virtex-5 FPGA device with the integration of a MicroBlaze softcore processor.A high-level synthesis(HLS)method is used in[23]to offload the ECDH operations on FPGA for the area and power reductions.Recently in[24],for wireless sensor nodes,a low-cost ECC hardware accelerator architecture is presented on FPGA.Their accelerator architecture was validated using a hardware/software codesign of the ECDH scheme(integrated into the MicroZed FPGA board).The ECDH protocol is operated in 4.1msfor one shared key generation.
The related implementations,reported in[16-24],shows that the several modern applications such as healthcare[1],automobile[2,3],smart cards[6,7]and IoT[8-10]demand the key agreement architectures.Although,some limited designs are available[22-24],we believe further improvements in area and power can be achieved with use of different settings and strategies.Consequently,the aim of this work is to provide the FPGA implementation of ECDH protocol with different settings and optimization strategies to achieve low area and power values.
Our contributions and employed strategies to reduce area and power values are as follows:
• Elliptic-curve PM design(DESIGN-I):With the attention of flexible input/output interfaces,we have presented a low-area crypto processor architecture for Elliptic-curve PM computation overGF(2163).The flexible input/output interfaces determine that for generating coordinates of a public key,users can takexandycoordinates of an initial point on ECC either by using the constant(inside the crypto core)or user-dependent values(from the outside to the crypto core).
• Key-agreement architecture (DESIGN-II):Generation of a shared key (or) key agreement between two different users/parties using the integration of DESIGN-I with two additional routing multiplexers and an efficient controller.
• Dedicated controllers:For various control activities,we have used two dedicated finite state machine(FSM)based controllers.
• Employed strategies to achieve low area and power:To achieve the low area and power values,a schoolbook multiplication method is used in the proposed crypto processor architectures(DESIGN-I and DESIGN-II).Moreover,similar hardware resources of squarer and multiplier units are used to compute the essential inversion operation.To further reduce the hardware resources,we set a design goal from balanced to area reduction during the logic synthesis in the Vivado IDE tool.
The RTL (register transfer level) implementations of proposed crypto processor designs(DESIGN-I and DESIGN-II) are implemented in a Verilog (HDL).The results are reported on Xilinx Virtex-7(xc7vx690t-3ffg1930)FPGA device.For DESIGN-I,the utilized Slices,LUTs(lookup-tables)and FFs(flip flops)overGF(2163)are 3983,10693 and 2389.Similarly,the consumed Slices,LUTs and FFs for DESIGN-II are 4037,11713 and 2427.The required clock cycles for DESIGN-I and DESIGN-II are 163902(to generate one public key)and 327804(to generate one shared secret),respectively.The maximum clock frequency for DESIGN-I and DESIGN-II is 296 and 280MHz,respectively.The time to operate one public key using DESIGN-I is 553.7μs.Similarly,the time to generate one shared secret using the ECDH scheme is 1170.7μs.The DESIGN-I takes 29μWto generate one public key (or to operate one PM computation) while DESIGN-II consumes 57μWfor one shared secret generation.The achieved area optimized and power reduced results reveal that the DESIGN-II is an appropriate choice (to generate a shared secret) for the applications that demand lower hardware resources without considering the timing characteristics(such as clock cycles,throughput and latency).
The structure of this paper is organized as:Section 2 presents the essential information for the computation of the ECDH scheme and the PM operation,respectively.The proposed crypto processor architecture for generating a public key(DESIGN-I)and shared secret(DESIGN-II)is provided in Section 3.The implementation results are presented in Section 4.Finally,Section 5 concludes the paper.
The mathematical structure of the ECDH protocol is described in Section 2.1.Moreover,the sequence of instructions for the computation of Elliptic-curve PM and the corresponding point addition and double formulas are provided in Section 2.2.
A complete layout of the ECDH protocol for key agreement(or shared key generation)between two distinct users is illustrated in Fig.1.
As shown in Fig.1,we have two distinct users,i.e.,UA,andUB.TheUBwill generate his/her public keyQBby taking a base-pointGon the defined curve and a private keydB.AdBis anm-bit random number(or)a scalar multiplier.Based on these parameters(G&dB),the public key will be generated(i.e.,QB=dB×G).Similarly,UAwill repeat the exact same process to generate his/her public keyQAusing his/her private keydAand a base pointG.The resulting public key will beQA=dA×G.Prior to generate the shared key,both users(UA&UB)will exchange their public keys to each other.Then,theUAandUBwill generate their shared keys usingSKA=dA×QBandSKB=dB×QA,respectively.
The PM operation is the computation ofd-1 times the sum of a base (or) initial pointG,as presented in Eq.(1):
In Eq.(1),Gis the initial point on the curve,Qis the resultant point anddis a scalar multiplier(or a private key of the corresponding user).There are several algorithms in the literature such as Double and Add,Montgomery ladder,Lopez Dahab,and many more to operate the ECPM operation in ECC.A comprehensive comparison over various ECPM algorithms as hardware accelerators on FPGA and ASIC(application-specific integrated circuits)platforms is shown in[25].Consequently,we have used the following Montgomery (Algorithm 1) algorithm because (inherently) it counters simple power analysis(SPA)attacks.It is important to note that we are implementing only the SPA attacks at the algorithmic level without realizing on a design level.
Algorithm 1:Montgomery ECPM Algorithm[13]images/BZ_1850_1360_1781_1386_1807.pngimages/BZ_1850_267_1904_335_1943.pngimages/BZ_1850_499_2150_524_2176.pngimages/BZ_1850_504_2401_529_2427.pngimages/BZ_1850_988_2398_1013_2424.pngimages/BZ_1850_1470_2150_1495_2176.pngimages/BZ_1850_1485_2401_1510_2427.pngimages/BZ_1850_1924_2398_1949_2424.pngimages/BZ_1850_585_2641_610_2667.pngimages/BZ_1850_761_2641_786_2667.pngimages/BZ_1850_1020_2641_1045_2667.pngimages/BZ_1850_1220_2641_1245_2667.pngimages/BZ_1850_1330_2641_1355_2667.pngimages/BZ_1850_1983_2641_2009_2667.png
The inputs to Algorithm 1 are initial pointGand a scalar multiplierd.A sequencedn-1,...,d1,d0determines the bits of the scalar multiplier in terms of 0′sand 1′s.The output from Algorithm 1 is thexandycoordinates of resultant pointQ.The PA and PD operations in Algorithm 1 shows the number of instructions for point addition and double computations,respectively.
The proposed crypto processor architecture for ECDH is shown in Fig.2.It consists of three blocks,i.e.,(i) an ECDH controller,(ii) two routing multiplexers (M1andM2) and an ECPM(DESIGN-I)unit.Description of these units are as follows:
The ECDH controller is responsible for generating control signals for the routing multiplexersM1andM2,as shown with red color dotted lines in Fig.2.Moreover,it is responsible for communicating with the ECPM controller for the public and shared key generations.It takesd(anm-bit secret key or a scalar multiplier given in Algorithm 1),QSSxp(x-coordinate of an initial pointG(or)public key of another user for the shared secret generation),QSSyp(y-coordinate of an initial pointG(or)public key of another user for the shared secret generation)andST(start signal)as an input.TheRxpandRypdetermines thexandycoordinates of either(i)our generated public key forUAorUBor(ii)shared secret forUAorUBas an output.TheDPMandDSSare the one-bit done signals.These signals become true when the corresponding public key or shared secret generation is completed by the ECPM design.Theclkandrstsignals are input to the proposed ECDH crypto processor architecture and these are not shown in Fig.2.
The ECDH controller is implemented by adopting a finite state machine(FSM)model.It contains three states,i.e.,IDLE,PKG and SKG.As the name implies,IDLE is the initial state of the crypto processor that determines none of the operations is under computation.The PKG shows that the crypto processor is processing a public key.Finally,the SKG means that the processor architecture is computing a shared key.A two-bit input signal “select” from outside to ECDH controller,not shown in Fig.2,identifies either the generation of a public key (or) shared key.Whenever the twobitselectsignal is 00,it means NOP (meaning with no operation and processor will reside in IDLE state).For 01 and 10 cases,the ECDH controller passes theSTsignal to the ECPM controller for the generation of corresponding public and shared keys.Once the corresponding public or shared key is generated,the ECPM controller acknowledges the ECDH controller with its associated done signal(DPMwhen coordinates of a public key is generated andDSSwhen coordinates of a shared key is generated).Based on the two-bitselectsignal,the ECDH controller generates the control signals for the routing multiplexersM1andM2to select the appropriate inputs(either coordinates ofGor a public key of another user for computation of shared key)for DESIGN-I.
The purpose ofM1andM2multiplexers is to select thexandycoordinates of either base pointG(denoted withxpandypin Fig.2)or public key(shown withQSSxpandQSSypin Fig.2)of another user.As shown in Fig.1,the ECDH scheme requires two PM computations.One for public key generation and another for a shared secret generation.Our architecture selects thexandycoordinates ofGto generate the public key when the control signals forM1andM2are 0.Similarly,it selects thexandycoordinates of public key of another user to generate the shared key when the control signals toM1andM2are 1.
The ECPM unit consists of (i) an array of registers,(ii) three routing multiplexers,(iii) adder,squarer,multiplier,polynomial reduction &inversion units and (iv) an ECPM controller.The description of these units are given in the next subsequent sections:
3.3.1 Register Array
The register array in the proposed crypto processor architecture is an 8×msize of register file.A variablemdenotes the width of each register while a numerical value 8 shows the total number of register addresses.As a fact,the used register array is liable for preserving the intermediate and the final results during and after the implementation of the Algorithm 1.The inside to register array unit,shown in Fig.2,comprises two 8×1 multiplexers that are responsible to fetch/reading data from the used register array.Moreover,it includes a 1×8 demultiplexer to modify the contents of each register address.The corresponding control signals for reading and write operations are generated from the ECPM controller(not shown in Fig.2).
3.3.2 Three Routing Multiplexers(M3,M4and M5)
The three routing multiplexers in the proposed architecture includeM3(3×1),M4(2×1)andM5(3×1).TheM3multiplexer is used to choose an ECC parameter(i.e.,xcoord,ycoordand a curve constantb).These parameters are needed during the implementation of Algorithm 1.For our implementations,there are two choices to takexcoordandycoordof input parameters.The first choice is from the outside to our design usingM1andM2multiplexers.Another choice is the use of parameters as constants inside the design.For these two choices,we are using always a constant value for parameterb(shown in Fig.2).For instance,this is the limitation of our proposed architecture.Similarly,theM4multiplexer drives the output ofM3and an operand from the register array to arithmetic operators.Finally,M5multiplexer is responsible for selecting an output from arithmetic operators for the register file to perform written back.Noted that,in this paper,thexcoord,ycoordand constantbparameters are selected from NIST(National Institute of Standards and Technology)standardized document[26].
3.3.3 Adder,Squarer,Multiplier,Reduction and Inversion Units
The arithmetic operators,i.e.,addition (ADDER),squaring (SQUARE) and multiplication(MULT),perform the PM operation of the ECPM unit.The implementation of a polynomial addition inGF(2m)is rather simple than the prime field.Therefore,the ADDER unit comprises an array of bitwise Exclusive (OR) gates.The polynomialsDT1andDT2determine the inputs while anm-bit polynomialAoutis output.The implementation of the SQUARE unit is achieved by placing a 0-bit after each successive data bit.It takes anm-bit operand either from the register array or routing multiplexers to perform polynomial squaring.A 2×m-1 bit polynomial(Sout)is generated as an output.
The efficiency of a polynomial multiplier specifies the performance of an ECC-based crypto processor architecture.Based on the available literature,there are four possibilities to implement a multiplier circuit.These possibilities are (i) bit-serial,(ii) digit-serial,(iii) bit-parallel and (iv) digitparallel.Each aforementioned architectural choice has certain differences.When we consider only the computational cost in terms of clock cycles,the bit-serial multipliers requiremcycles.On the other hand,one clock cycle is needed to implement the bit/digit parallel multiplication approaches.The digit-serial multipliers demandcycles.A variableashows the operand size andbpresents the length of an incorporated digit.The aforesaid discussion reveals that each multiplication style has its pros and cons as the computational costs are different.Despite the computational costs,the bitserial multiplication approaches are practical in scenarios where demand for lower hardware resources and power consumption are required.The specific examples include the RFID (radio frequency identification networks) and WSNs (wireless sensor networks) related applications.The bit/digitparallel multipliers provide high-performance(s).The digit-serial multipliers are useful in applications that require lower hardware area and lower computation time.Consequently,a bit-serial shift and add multiplication method is utilized in our MULT unit to perform twombit polynomial multiplications.It takes twoDT1andDT2polynomials as input and yields in 2 ×m- 1 bit polynomial (Mout) as output.In the employed shift and add multiplication approach,the multiplicand(DT1in our work)is addedDT2times to itself,whereDT2is the multiplier.It takesmcycles to produce one polynomial multiplication.
The size of resultant polynomials generated after MULT and SQUARE units are 2×m-1 bit(not shown in Fig.2).Therefore,the reduction from 2×m-1 tombit is essential.The polynomial reduction,in this work,is achieved by using a NIST-specified algorithm(we refer interested readers to Algorithm 2.41 of[15]).
As shown in Algorithm 1,the reconversion from projective to affine requires two polynomial inversion computations.There are several inversion methods in the literature to perform the multiplicative inverse of the polynomial(s).Therefore,the Itoh Tsujii algorithm is extensively employed in the state-of-the-art because it needs multiplications and square operations for computation[13,17,18].In this paper,we have (also) utilized the implementation resources of our employed MULT and SQUARE units to perform multiplicative inverse over polynomials using the Itoh Tsujii algorithm.OverGF(2163),the cost for execution of nine polynomial multiplications andm-1 squares determine the computational cost of an Itoh Tsujii algorithm[27].As expressed before,each polynomial multiplication using shift and add method needsmclock cycles when the inputs to our MULT arem-bits.Therefore,the execution of nine multiplications needs 9×mcycles.Them-1 cycles are needed for the squaring as one square takes only the one clock cycle.Thus,the computational cost to perform one inversion is(9×m)+(m-1).
3.3.4 ECPM Controller
The FSM-based ECPM controller is responsible to perform the PM operation of Eq.(1).It generates the control signals to implement the Algorithm 1.The generated control signals for the routing multiplexers and register array are shown with blue color dotted lines in Fig.2.The details of the ECPM controller is as follows:
• The proposed crypto processor will remain in state 0(IDLE state).
• It starts the computation once it receives theSTsignal as 1 from the ECDH controller.After theSTsignal becomes 1,the control signals for projective to affine conversions are generated which takes six clock cycles(for six states).For PM computation in projective coordinates,the point addition and double instructions of Algorithm 1 are operated in 14 states as these functions contain seven instructions(from 1 to 7)for each point addition and double operation.Thus,a total of fourteen instructions are involved.During the execution of each state,the value for an inspected key bit,i.e.,di,is checked.When the inspected value fordiis 1,the corresponding instructions ofifstatement will be executed.If not,theelsestatement is operated.In short,these fourteen states will continue to repeat as far as the condition forloopinstruction of Algorithm 1 eventually be true.
• Under different conditions,when theloopinstruction of Algorithm 1 becomes true,the processor switches control from PM to projective to affine conversion states.
For the proposed DESIGN-I,Eq.(2)is used to calculate the total clock cycles.Similarly,the total cycles for DESIGN-II are calculated using Eq.(3).
In Eq.(2),6 determines the clock cycles for the affine to projective conversion.Them-1(6×m+8)shows the clock cycles for the PM computation in projective coordinates.More precisely,6 ×mare the clock cycles for modular multiplication instructions in point addition and double operations,wheremis the clock cycles for one modular shift and add multiplication method.A numerical value 8 shows the clock cycles for the modular addition and square operations in point addition and double instructions(total 8).Hence,the addition of modular multiplications(total six instructions),modular addition(total three instructions)and modular square(total five instructions)becomes the total of fourteen instructions in point addition and double operations.ForGF(2163),the proposed DESIGN-I takes 6 clock cycles for affine to projective conversion,159732 cycles for the PM computation in projective coordinates and 4164 cycles for projective to affine conversions.In projective to affine conversion,eachinversionrequires 1629 clock cycles and additional 906 cycles are needed to compute the remaining operations of projective to affine conversions.Consequently,the proposed DESIGN-I takes 163902 clock cycles for a public key generation(or for one PM computation).
As shown in Fig.1,the ECDH scheme consists of dual PM computation (one for a public key generation and another for shared key).Therefore,Eq.(3) confirms that our DESIGN-II needs 163902+163902 (327804) clock cycles for one shared key generation between two distinct users (as we mentionedUAandUBin Section 2.1).
The details for the implementation results and comparison to state-of-the-art is given in Section 4.1 and Section 4.2,respectively.
Our DESIGN-I and DESIGN-II overGF(2163)is implemented in Verilog HDL using the Vivado IDE tool.The implementation results are reported after running the post-place-and-route design process.A Xilinx Virtex-7 (xc7vx690t-3ffg1930) FPGA device[28]is used for logic synthesis and routing purposes.Before presenting our implementation results,we have reported area utilization of building blocks of DESIGN-I and DESIGN-II in Section 4.1.1.The details for implementation results of our DESIGN-I and DESIGN-II are described in Section 4.1.2.The area comparison of DESIGNI and DESIGN-II with the sum of Slices and LUTs (look-up-tables) of building blocks are given in Section 4.1.3.Finally,the comparison between DESIGN-I and DESIGN-II is presented in Section 4.1.4.
4.1.1 Area Utilization of Building Blocks of DESIGN-I and DESIGN-II
The area breakdown for building blocks in DESIGN-I and DESIGN-II are shown in Tab.1.Column one provides the name of the building-blocks of our proposed design.From columns two to four,the area utilization is presented in terms of FPGA LUTs,Slices and FFs(flip-flops).
Table 1:Area utilizations of building blocks of our proposed DESIGN-I and DESIGN-II over GF(2163)
As shown in Tab.1,the critical building blocks of DESIGN-I and DESIGN-II over Virtex-7 FPGA utilizes 77,38,4429 and 652 FPGA Slices for ADDER,SQUARE(including NIST Reduction),MULT (including NIST Reduction) and register array.Similarly,for identical building blocks,the LUTs utilization is 82,39,10531 and 1956.The sum of resources for building blocks is 5196(Slices),12608 (LUTs) and 1304 (FFs).Tab.1 reveals that the MULT unit (even with simple shift and add architecture) utilizes more hardware resources as compared to other building blocks.Moreover,the implementation results show that the modular square operation being much simpler in theGF(2m)field than with the array of bitwise Exclusive(OR)gates(in our implemented ADDER unit).
Concerning the sum of resources of building blocks of our DESIGN-I and DESIGN-II,apart from the routing multiplexers,the reported Slices (5196) and LUTs (12608) values in Tab.1 are still enough in large.Many area-constrained applications such as smart cards,IoT,automotive and several others demand lower hardware resources.In this regard,we have set design goals from balanced to area reduction in the Vivado IDE tool to minimize the hardware resources of our DESIGN-I and DESIGNII implementations.Therefore,a description of the results is provided in the following section.
4.1.2 Implementation Results for DESIGN-I and DESIGN-II
The detailed implementation results of our proposed DESIGN-I and DESIGN-II are shown in Tab.2.Column one provides the reference design.The area information in Slices,LUTS and FFs is given in columns two,three and four.Columns five and six provide the total clock cycles and operational clock frequency (inMHz),respectively.Latency is the time for computation of one required cryptographic operation and therefore,the latency (inμs) for one public key and a shared secret generation is given in columns seven to eight,respectively.Finally,the last column presents the total power consumption of our proposed DESIGN-I and DESIGN-II.Note that the reported values for area and clock frequency in Tab.2 are obtained from the Vivado IDE tool.The total clock cycles are calculated from Eqs.(2) and (3).For both DESIGN-I and DESIGN-II,the latency for one PM computation is calculated using Eq.(4).Similarly,we have used the same Eq.(4)or the calculation of ECDH latency of our DESIGN-II.We clarify that our DESIGN-I does not support ECDH operation,hence its latency is denoted by‘-’in Tab.2.
Table 2:Implementation results for DESIGN-I and DESIGN-II after post-place-and-route over GF(2163)on Xilinx Virtex-7(xc7vx690t-3ffg1930)FPGA device
As shown in Tab.2,the utilized hardware resources of DESIGN-I are 3983(Slices),10639(LUTs)and 2389 (FFs).Similarly,the utilized resources of DESIGN-II are 4037 (Slices),11713 (LUTs)and 2427 (FFs).The achieved clock frequency for DESIGN-I and DESIGN-II are 296MHzand 280MHz,respectively.The required clock cycle(for DESIGN-I)to generate one public key is 163902.Similarly,the clock cycle(for DESIGN-II)to generate one shared secret is 327804.Both DESIGN-I and DESIGN-II take the 553.7μsto generate one public key,while the DESIGN-II requires 1170.7μsto generate one shared key.As shown in the last column of Tab.2,the power consumption of our DESIGN-I and DESIGN-II is 29μWand 57μW,respectively.
4.1.3 Area Comparison of DESIGN-I and DESIGN-II with the Sum of Slices and LUTs of Building Blocks
The total sum of Slices and LUTs of building blocks is 1.30 (ratio of 5196 with 3983) and 1.18(ratio of 12608 with 10639)times higher than the Slices and LUTs of our DESIGN-I.Similarly,when the area resources of DESIGN-II are considered for comparison,the total sum of Slices and LUTs of building blocks is 1.28(ratio of 5196 with 4037)and 1.07(ratio of 12608 with 11713)times higher.This indicates that the use of different optimization strategies for synthesis of DESIGN-I and DESIGN-II results in lower resources.The reason is that we set the area optimization as synthesis goal in a Vivado IDE tool.
4.1.4 Comparison Between DESIGN-I and DESIGN-II
As far as the utilized area is concerned,DESIGN-I takes only the 3983,10639 and 2389 Slices,LUTs and FFs.The reported FPGA area values in terms of Slices,LUTs and FFs for DESIGN-II are 4037,11713 and 2427 that are comparatively 1.01(ratio of 4037 with 3983),1.10(ratio of 11713 with 10639) and 1.01 (ratio of 2427 with 2389) times higher than our DESIGN-I.The reason for higher hardware resources in DESIGN-II is the use of two additional routing multiplexers (M1andM2)and an ECDH controller to generate the shared key.On the other hand,the DESIGN-I is only responsible for the computation of a public key(or scalar multiplication).The clock cycles requirement of our DESIGN-I and DESIGN-II is 163902 and 327804 for public key and a shared secret generation.Our DESIGN-I can operate on a maximum clock frequency of 296MHzwhile DESIGN-II achieves 280MHz.For DESIGN-I and DESIGN-II,the time(or latency)required to compute one public key is 553.7μs.Time to perform a shared secret with our DESIGN-II is twice the time for a public key as in the ECDH scheme,the PM(or scalar multiplication)is essential to compute two times(one for public key and another for shared secret).On Virtex-7 FPGA,DESIGN-I and DESIGN-II consumes 29μWand 57μWfor generating the public key and a shared secret,respectively.
The comparison with the most compatible state-of-the-art architectures,published in[13,16-19,22,24],is given in Tab.3.The column one in Tab.3 provides the reference design.The implemented PM algorithms and key-agreement protocols are presented in column two.The implementation device for logic synthesis is provided in column three.Column four presents the utilized FPGA slices.Finally,the last two columns (five and six) provide the clock frequency (inMHz) and latency (inμs)parameters,respectively.In Tab.3,a representation‘-’means that the relevant information is not given in the reference design.
Table 3:Comparison with recent state-of-the-art PM and ECDH architectures over GF(2163)
4.2.1 Comparison of our DESIGN-I with[16]and[19]
As compared to the dedicated PM architecture of[16]on Virtex-5 FPGA,our DESIGN-I utilizes 1.09(3126 over 3122)times lower FPGA Slices.Moreover,in terms of clock frequency,our DESIGNI is 1.04 (ratio of 301 over 288) times faster than[16].On the other hand,the time required for one PM computation or public key generation in[16]is 22.2(ratio of 544.5 over 24.5)times lower than our DESIGN-I.There is always a tradeoff between area and performance(latency)parameters.On Virtex-7 FPGA,a high-speed and latency optimized architecture of[19]for PM computation utilizes 4150 FPGA Slices which is comparatively 1.04(ratio of 4150 over 3983)times higher than our DESIGN-I.Moreover,their architecture results in higher clock frequency(352MHz)when compared to our design(where we achieved 296MHz).Similar to[16],the PM architecture of[19]requires lower computational time as compared to our DESIGN-I.It is important to note that the main objective of this work was to succeed the minimum hardware resources(FPGA Slices,LUTs and FFs)without concentrating on other design parameters such as clock frequency,power,and latency.
4.2.2 Design Tradeoffs Between[13,17,18]and our DESIGN-I
As shown in Tab.3,the dedicated PM architectures (reported in[13,17,18]) over Virtex-5 and Virtex-7 FPGA devices show that there is always a tradeoff between several design parameters when implementing the cryptographic algorithms as a hardware accelerator.In the text that follows,a tradeoff by providing the comparison of dedicated PM designs (published in[13,17,18]) with our notion of a flexible crypto processor (DESIGN-I) is described.As reported earlier in this paper,flexibility determines the selection of ECC parameters according to the user’s needs(readers can turn to multiplexerM3in Fig.2).
On Virtex-7 FPGA,a 2-stage pipelined PM architecture of[13]utilizes 1.80 (ratio of 3983 over 2207) times lower Slices as compared to our DESIGN-I.The reason for the utilization of lower hardware resources is the use of only one multiplier architecture for both square and polynomial multiplication computations.The square operations are computed by providing similar inputs to the multiplier unit.Additionally,the use of a singular NIST reduction unit also leads to a decrease in hardware resources.Comparatively,we are using a separate MULT and SQUARE unit to operate polynomial multiplications and squaring operations,respectively.In addition,we have two NIST reduction units connected one after each MULT and SQUARE unit in our designs.These architectural differences result in an increase in hardware resources in our work.Due to pipelining in[13],the achieved clock frequency is a bit higher than our crypto processor design.Furthermore,the time to operate one PM computation is lower than our DESIGN-I.In[17],3657 FPGA Slices is used on Virtex-7 FPGA for the PM computation overGF(2163).This is comparatively 1.08(ratio of 3983 over 3657)times lower than our DESIGN-I.The clock frequency of their architecture is not presented.Thus,this comparison is not possible to provide.The calculated latency value for[17]is lower than our proposed DESIGN-I as this is a dedicated PM architecture,where the coordinates of the initial point on ECC was applied as a 163-bit constants(inside the core).In our DESIGN-I(shown in Fig.2),a routing multiplexerM3takes coordinates of an initial point on ECC as input from outside to DESIGNI.Therefore,for each public key generation,users can choose different ECC basepoints according to their needs when initiating a setup for public key generation.
As compared to the PM architecture of[18]on Virtex-5 FPGA,our DESIGN-I consumes 6.6(ratio of 3126 to 473)times higher FPGA Slices.The reason is the employment of two NIST reduction units,one after in each SQUARE and MULT unit in our architecture.Moreover,a segmented digit serial multiplier architecture is used in[18]while in our design,a shift and add multiplication method is used for polynomial multiplication.Similar to the area(FPGA slices),the dedicated PM architecture of[18]achieves higher clock frequency and requires lower computational time as compared to our DESIGN-I.One of the reasons for this is to consider flexibility when selecting different curve parameters as inputs to our DESIGN-I using routing multiplexersM1,M2andM3.An additional reason is the use of a higher number of input/output interfaces when taking coordinates of a public key as input from outside and producing either public key or shared secret as an output.On the other hand,the dedicated architecture of[18]uses the ECC parameters as constants inside their design.
Remarks:We are not asserting that our DESIGN-I outperforms (in terms of hardware area) in all state-of-the-art PM implementations.There is always a tradeoff when considering several design parameters concurrently for optimizations.Therefore,based on the comparisons presented above,we have noted that the hardware accelerators of ECC for only optimized for PM computation can result in little unfair measurements when scaling the same PM architecture to any ECC protocol (such as ECDH in this work).Therefore,a hardware architecture with the considerations of all four layers could be more beneficial.
4.2.3 Comparison of our DESIGN-II(ECDH)with[22]and[24]
The accelerator architectures of[22,24]for ECDH scheme are presented overGF(2233)while our ECDH design is provided overGF(2163).Although their architectures consider 1.42(ratio of 233 over 163)times higher key length when compared to our DESIGN-II.On the other hand,our DESIGN-II is 2.312(ratio of 289 over 125)and 4.48(ratio of 280 over 62.5)times faster as compared to[22]and[24]over Virtex-5 and Virtex-7 FPGA devices,respectively.When comparing the hardware resources,our DESIGN-II utilizes 2.99 (ratio of 35102 over 11713) times lower FPGA LUTs as compared to[22].The FPGA Slices used in[24]is 2.23(ratio of 4037 over 1809)times lower than our DESIGN-II.Apart from hardware resources (FPGA Slices and LUTs) and clock frequency,the time to generate one public key and a shared secret are much lower than the ECDH accelerator architectures of[22]and[24],respectively.In detail,a coprocessor architecture of[22]requires 8.88msto generate one shared secret while our DESIGN-II takes only the 1.13ms,which is comparatively 7.85 (ratio of 8.88 over 1.13) times lower.Similarly,the architecture of[24]on Virtex-7 FPGA requires 4.13msto generate one public key while our DESIGN-II takes only the 0.58ms,which is comparatively 7.12(ratio of 4.13 over 0.58)times lower.
This paper has presented a key-agreement architecture for the ECDH scheme overGF(2163)with the consideration of low hardware resources.Towards hardware resource reduction:(i)a simple shift and add multiplier is used and (ii) a polynomial inversion is computed using the Itoh-Tsujii algorithm with similar hardware resources of squarer and multiplier units.The proposed architecture is implemented in a Verilog HDL using Vivado IDE.The implementation results after the post-placeand-route are given on a Xilinx Virtex-7 FPGA.The utilized Slices overGF(2163)for one shared key generation is 4037.The clock cycles to compute one public key and a shared secret are 163902 and 327804.The time to operate one public key and a shared secret is 553.7msand 1170.7ms,respectively.The consumed power is 29μW(for one public key calculation)and 57μW(for singular shared secret computation).We have concluded that the hardware accelerators of ECC for only optimized for PM computation can result in little unfair measurements when scaling same architectures to any ECC protocol(such as ECDH in this work).Subsequently,a hardware architecture with the considerations of all four layers could be more beneficial to achieve better performance(s).
Acknowledgement:We extend our gratitude to the Estonian Aviation Academy,Tartu,Estonia for supporting and funding this research work.
Funding Statement:We acknowledge the support of Deanship of Scientific Research at King Khalid University for funding this work under grant number R.G.P.1/399/42.
Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
我们致力于保护作者版权,注重分享,被刊用文章因无法核实真实出处,未能及时与作者取得联系,或有版权异议的,请联系管理员,我们会立即处理! 部分文章是来自各大过期杂志,内容仅供学习参考,不准确地方联系删除处理!