lists.openwall.net  lists / announce owlusers owldev johnusers johndev passwdqcusers yescrypt popa3dusers / osssecurity kernelhardening musl sabotage tlsify passwords / cryptdev xvendor / Bugtraq FullDisclosure linuxkernel linuxnetdev linuxext4 linuxhardening PHC  
Open Source and information security mailing list archives
 

Date: Mon, 10 Mar 2014 12:02:29 +0400 From: Solar Designer <solar@...nwall.com> To: discussions@...swordhashing.net Subject: multiply latency reduction via table lookups Bill  It is known that e.g. a 32x32>64 multiply can be computed via four 16x16>32 multiplies. Moreover, these four can run in parallel (they are then followed by some ADDs, but this is subcycle latency in ASIC). e.g. here's code I wrote for JtR (for portability to ancient platforms) back in 1990s: void mul32by32(int64 *dst, unsigned int m1, unsigned int m2) { dst>lo = (m1 & 0xFFFF) * (m2 & 0xFFFF); dst>hi = 0; add32to64m(dst, (m1 >> 16) * (m2 & 0xFFFF)); add32to64m(dst, (m2 >> 16) * (m1 & 0xFFFF)); dst>hi += (m1 >> 16) * (m2 >> 16); } (The same algorithm can be found in plenty of places. I just happened to reinvent it at the time.) Now, a 16x16>32 lookup table is still pretty large. A naive implementation of it would take 16 GiB. A smarter implementation would probably halve that, due to multiplication being commutative. (Can we do better yet?) Currently, the lookup latency for a table this large would probably be way in excess of 3 cycles (assuming a decent clock rate) that we've been assuming an optimal multiplication circuit needs. However, what if we apply the same approach recursively? With just one more application of it, we need multiplies as small as 8x8>16, which means a 128 KiB lookup table (one with many read ports, or multiple such tables), or perhaps twice smaller. Everything else is just ADDs, and fewer levels of them than a tableless multiplication circuit would contain. Thus, a 32x32>64 multiply appears to be equivalent to a bunch of parallel lookups from a 128 KiB table, followed by a few levels of ADDs (not too many?) Can this be used to perform 32x32>64 multiplication in fewer than 3 cycles at a decent clock rate (same as with a tableless multiplier)? Do e.g. CPUs use table lookups like this for multiplication already? Is this possibly how they (and some ASICs) achieve the 3 cycles already, meaning that we shouldn't worry about this allowing for < 3 cycles? If so, the die area consumed by fast multipliers may be more than we had estimated. Alexander
Powered by blists  more mailing lists