tl;dr: before implementing a solution for one of my projects, I did some pen-and-paper optimization and got a $\sim 2\,301\,620\,000\%$ computational speed-up.
Edit (2024.02.07): I published the polished code of `IndexTree`. You can find it here.
A significant gap between theoretical and practical cryptology lies in the challenge of understanding “what” is realistically computable or, in simpler terms, “how much time does it take to compute X”.
In the realm of theory, computational complexity is meticulously calculated, proven, and expressed in asymptotic notation1, offering insights into the additional computations required as inputs grow larger. Intuitively, more input leads to more computations, translating into more time spent computing. The computational complexity provides a mathematical description of these incremental computational demands.
Numerous well-established algorithms/primitives find their way into practical implementation across various languages, each employing distinct solving strategies. They all engage in a competitive empirical performance analysis where all costs, including time and memory, are meticulously measured and compared.
It’s an awesome process…
…until one realizes that only a (truly) minute fraction of “unknown” algorithms are effectively implemented, and an even smaller fraction prioritizes efficiency2. Often, the realm of what can be realistically computed remains a mystery. Unfortunately, ignorance tends to blind everyone, frequently leading to incorrect and oversimplified assumptions about computability.
Allow me to illustrate this point with an example from my ongoing work on this project.
Trees (and graphs) stand as colossal entities in the world of computer science, forming the go-to data structure for numerous computational challenges in cryptology, computer science, and beyond. Specifically, I’m interested in problems related to colossal trees, with the grandeur of giant sequoias. The challenge is to search for all the leaves that satisfy specific conditions.
Let me bring up a concrete example that relates to the side-channel problem of retrieving an AES-128 secret key from partial information leaked by the permutation boxes, for example, the Hamming weight of the output3. The 128-bit secret key $k$ is split into 16 8-bit subkeys $k_i$ and, together with the appropriately split known input $x$ into $x_i$, gets permuted into $\phi(x_i \oplus k_i)$, whose Hamming weight, i.e. the number of ones in the result (a number between $0$ and $8$), is leaked.
Since the input $x$ and how the permutation $\phi$ is computed are known, the possession of the Hamming weight allows anyone to retrieve a list of subkey candidates $(\hat{k}_i)_{i=1}^{16}$ for each subkey $k_i$.
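As a minimal illustration (a sketch in Rust, not this project’s code), recovering the candidate list for a single subkey is just an exhaustive filter over the $256$ possible bytes; the permutation `phi`, the input byte and the leaked weight below are made-up placeholders.

```rust
// Sketch: build the candidate list for one 8-bit subkey from a leaked
// Hamming weight, assuming a known byte permutation `phi` and input byte `x_i`.
fn subkey_candidates(x_i: u8, leaked_weight: u32, phi: impl Fn(u8) -> u8) -> Vec<u8> {
    (0u8..=255)
        .filter(|&k| phi(x_i ^ k).count_ones() == leaked_weight)
        .collect()
}

fn main() {
    // Toy permutation, only to make the sketch runnable
    // (the real attack uses the cipher's permutation box).
    let phi = |b: u8| b.rotate_left(3) ^ 0x5A;
    let candidates = subkey_candidates(0x42, 4, phi);
    println!("{} candidates for this subkey", candidates.len());
}
```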
We put everything together by constructing a tree where, at each level $i$, we branch according to all the possible subkey candidates $\hat{k}_i$. All the secret key candidates are obtained by considering all the possible combinations of all the subkey candidates $(\hat{k}_i)_{i=1}^{16}$, meaning that the total number of secret key candidates is the product of the cardinalities of all the candidate lists $\prod_{i=1}^{16} \lvert\hat{k}_i\rvert$.
How big can this amount be?
Well, the worst weight possible is $4$, for which there are $\binom{8}{4} = 70$ subkey candidates. If we get the worst weight possible for every subkey, we get $70^{16}$ possible secret keys, which is basically this number:
\[70^{16} = 332\,329\,305\,696\,010\,000\,000\,000\,000\,000 \simeq 3.32 \cdot 10^{29} \simeq 2^{98}\]
For context and at the time of writing, Bitcoin’s network currently has4 a Hash Rate of $0.5 \cdot 10^{21} \sim 2^{69}$ hashes per second. Since I love back-of-envelope calculations, if we manage to convince all the Blockchain’s miners to find AES keys for the worst possible case of my problem, and one candidate check takes the same computation time as computing a hash5, then the whole Bitcoin mining community will take $2^{29}$ seconds to solve the problem, better known as $\sim17$ years.
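Spelled out, the back-of-envelope step behind the $\sim17$ years is just a division:
\[\frac{2^{98}\ \text{candidates}}{2^{69}\ \text{checks/s}} = 2^{29}\ \text{s} \approx 5.4 \cdot 10^{8}\ \text{s} \approx 17\ \text{years}.\]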
So… is there anything else we can do?
Well, for the AES key search, yes.
Following the natural steps of the encryption algorithm, the curious reader might point out: “you already got the leaks for the permutation’s output in the first round. What about the second round of leaks? Third? … ? Last round?” These are indeed conditions that allow us to filter out the candidates: only the good candidates produce the same leaks as the correct secret key! There are $16$ new conditions for each round considered!
Note the crucial order: first you get the candidate list and only later you filter it. For the worst-case scenario, one must take all $2^{98}$ candidates one by one, simulate the encryption and check whether the simulated leaks are the same as the original ones.
This is the ending point for many researchers/human beings out there. If you think the same, I believe this might be the time to move on to the next problem. Thank you for reading! We clearly discussed a problem with an unreasonable amount of computation to do…
…but if you are skeptical, feel free to read the next notes!
The first step is asking ourselves “How would a mathematician tackle the problem?”. Well, by drawing an abstract representation of the problem! Many (me included!) often forget the power of visual problem solving, better known as the ability to solve a problem by looking at it, that is, by creating a mental image without overthinking it.
```python
for cand1 in CandidatesList1:
    for cand2 in CandidatesList2:
        ...:
            for cand16 in CandidatesList16:
                candidate = (cand1, cand2, ..., cand16)
                if all((check1(candidate), check2(candidate), ..., check16(candidate))):
                    # Good candidate, store it somewhere
                    good.append(candidate)
                # else Bad Candidate
```
Once we have a nice picture, the second step is to find equivalent solutions. Equivalence is an extraordinary mathematical property that often hides tips on how to improve the implementation’s computational complexity. Look at it this way: computing all $16$ checks and later verifying that all of them pass is no different from nesting the checks one after the other and, if any one of them fails, stopping there.
```python
for cand1 in CandidatesList1:
    for cand2 in CandidatesList2:
        ...:
            for cand16 in CandidatesList16:
                candidate = (cand1, cand2, ..., cand16)
                if check1(candidate):
                    if check2(candidate):
                        ...:
                            if check16(candidate):
                                # Good candidate, store it somewhere
                                good.append(candidate)
                # else Bad Candidate
```
All of this reduces the number of checks but not the number of candidates. If we see the problem as a tree, it would be nice if we could squeeze the checks into the intermediate levels so that, if a check fails early, we can completely skip the rest of the subtree and move to the next branch!
```python
for cand1 in CandidatesList1:
    for cand2 in CandidatesList2:
        if check1(cand1, cand2):
            for cand3 in CandidatesList3:
                if check2(cand1, cand2, cand3):
                    ...:
                        for cand16 in CandidatesList16:
                            candidate = (cand1, cand2, ..., cand16)
                            if check16(candidate):
                                # Good candidate, store it somewhere
                                good.append(candidate)
            # else Bad Candidate
```
However, to achieve this, the checks we move up cannot depend on the subkey candidates of the subtree below them. Luckily for us, the considered problem has exactly this property!
Each one of the $16$ checks depends on a subset of the subkeys, thus potentially allowing several checks to be raised into the tree structure. The only mandatory object to consider is a permutation of the indices, which permutes the tree levels and allows the checks to be moved to the correct place inside the tree. If we don’t permute, a check might depend only on the first subkey $k_1$ and the last subkey $k_{16}$, making it impossible to raise it in the tree despite it being independent of the majority of the subkeys! By permuting, the tree will consider the candidates for $k_1$ and $k_{16}$ on the first two levels, allowing the check to be placed right there. Easy, right? Now we are able to navigate all the candidates faster because, if a premature check fails, we might skip big branches of the tree!
```python
for cand1 in CandidatesList1:
    for cand5 in CandidatesList5:
        if check3(cand1, cand5):
            for cand10 in CandidatesList10:
                if check1(cand1, cand5, cand10):
                    ...:
                        for cand2 in CandidatesList2:
                            candidate = (cand1, cand2, ..., cand16)
                            if check7(candidate):
                                # Good candidate, store it somewhere
                                good.append(candidate)
            # else Bad Candidate
```
One question: are we happy with writing $16$ nested loops with some checks here and there? Are we really happy about it?
I love the idea of an iterator: you must cycle over an iterable structure and you let the iterator handle how to move through the structure. You only have to ask “Hey iterator, give me the next element”. Trees representing Cartesian products are so elegant to describe by only using the indices of the single components. Skipping subtrees is a matter of increasing the correct index and, when a level is done, you reset the indices and increase a previous one, similarly to how the carry works in addition.
Our tree can be represented as a list of indices, and we increase the current index position according to the check that fails, allowing the tree’s navigation to be optimal6!
Here we have it, a classic iterator on a Cartesian product of indices with a weird way of obtaining the next element, since it might receive as input a request to skip to a future index. The information needed to create such an iterator (which I like to call `IndexTree`) is the list of index dimensions $[ n_j ]_{j=1}^{16}$ and a list of indices $[ s_j ]_{j=1}^m$ where the skips must occur. For example, the third skip $s_3$ tells the iterator that the third skip is done at the $s_3$-th position.
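To fix ideas, here is a minimal Rust sketch of the mechanism (a hypothetical illustration, not the published `IndexTree` code): a mixed-radix counter over the levels that can be asked to skip everything below a given level, with the carry behaviour described above.

```rust
/// Hypothetical sketch of the `IndexTree` idea, not the published implementation:
/// a mixed-radix counter over the index lists, able to skip whole subtrees.
struct IndexTree {
    dims: Vec<usize>,    // n_j: number of candidates at each level
    current: Vec<usize>, // current index at each level
    done: bool,
}

impl IndexTree {
    fn new(dims: Vec<usize>) -> Self {
        let done = dims.iter().any(|&n| n == 0);
        let current = vec![0; dims.len()];
        Self { dims, current, done }
    }

    /// A check failed at `level`: reset every deeper index and advance the
    /// counter at `level`, carrying towards the root exactly like in addition.
    fn skip_at(&mut self, level: usize) {
        for idx in self.current[level + 1..].iter_mut() {
            *idx = 0;
        }
        let mut l = level;
        loop {
            self.current[l] += 1;
            if self.current[l] < self.dims[l] {
                return;
            }
            self.current[l] = 0;
            if l == 0 {
                self.done = true; // carried past the root: iteration is over
                return;
            }
            l -= 1;
        }
    }

    /// A plain "next leaf" is just a skip at the deepest level.
    fn advance(&mut self) {
        let last = self.dims.len() - 1;
        self.skip_at(last);
    }
}

fn main() {
    // Tiny 3-level tree with 3 x 2 x 2 leaves.
    let mut it = IndexTree::new(vec![3, 2, 2]);
    while !it.done {
        println!("{:?}", it.current);
        if it.current[0] == 1 {
            // Pretend a check on the first level failed for index 1:
            // the whole subtree below it is skipped in one call.
            it.skip_at(0);
        } else {
            it.advance();
        }
    }
}
```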
All is nice, the drawing, the trees and the permutations…
…but are we really making it fast or is this syntactic sugar for useless complexity?
Measuring code efficiency empirically can be extremely misleading and hard to do in a sound way. Compilers/interpreters often do weird magic, hardware handles memory according to even weirder magic, temperature influences performance, and so many other factors come into play. Sorry for the personal rant but I cannot find an article describing in high detail how the comparison between code executions should be done rigorously…
…that is why I will do it my way (definitely not rigorous) and mainly to provide a rule-of-thumb comparison.
First of all, I will focus on the average number of checks the code executes per second while tackling (somewhat) smaller problems. The worst possible problem of navigating the whole $70^{16}$-leaves sequoia is, IMHO, too big and somehow out of scope for a comparison between different solutions. Instead, I will go for several “worst in a somehow good” scenarios: $6$ leaks have a unique subkey candidate but the other $10$ are bad-bad with $70$ candidates each, meaning there are this many candidates:
\[1^6 \cdot 70^{10} = 2\,824\,752\,490\,000\,000\,000 \simeq 2.8 \cdot 10^{18} \simeq 2^{61.3}\]
which is a big number, if you ask me! If we follow Snowden’s “one trillion guesses per second”7 ($\sim2^{40}$) suggestion, such a government would find all the key candidates in $2^{21.3}$ seconds, which is almost a whole month.
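For the skeptical reader, the “almost a whole month” comes from:
\[\frac{2^{61.3}\ \text{candidates}}{2^{40}\ \text{checks/s}} = 2^{21.3}\ \text{s} \approx 2.6 \cdot 10^{6}\ \text{s} \approx 30\ \text{days}.\]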
Second, the implementation language.
Many years in academia, mainly working in theory8, made Python my first choice.
Python is so good for prototyping in research: extremely good for writing quite lengthy scripts that compute some (quite basic and known) combinatorial number, forgetting about them and later rewriting them for the millionth time to get the same results. Or, another classic, you have a million different elements with different types (or meanings) that shouldn’t interact despite being represented in the same way…
…but basic Python doesn’t care and lets you mix field elements, integers and group elements as if we all lived in a free, anarchic Universe where formalism is a long list of suggestions. If this is not love, then what?
When I first encountered this problem years ago, I wrote the infamous naive solution of creating one candidate at a time and doing all the checks before continuing to the next candidate. Almost immediately, I rewrote the algorithm to do the nested checks and made the whole loop work in parallel. Luckily, I lost the code, but the memory of the average speed of this prototype is quite vivid: $2^{22}$ checks per second, implying an estimated time of $\sim 21\,462$ years to finish.
I said “luckily I lost the code” because it forced me to rewrite everything in Rust!
Rust was a curious language that I started reading about and learning a couple of years ago, but never with the idea of applying it to real problems or research. That changed a couple of months ago, mainly motivated by the need to get new skills, show my coding abilities and try out the language to see how much it could do for me9. This was a perfect small prototype project to work on and push the envelope!
That is why I’m happy to announce that the serial version of the naive solution with the sequential checks, working on a single MacBook Air M1 core, clocks at $2^{24.7}$ checks/sec, an astounding $\sim6.5$ times faster than the parallel Python version and implying a “more reasonable” computing time of $\sim3\,303$ years!
WOW, my glued code, almost all in the `main` function, where half the variables are a weird type-concussion and there are $16$ nested `for` cycles, is indeed faster on a single core!
Time to make the cycle parallel by first putting together all the loops as an iterator on the Cartesian product, as the macro `iproduct` easily allows, aaaaaannnd…
…error?
Oh yeah, there is a limit of $12$ lists you can put together with that macro.
Ok…
…so we skip the parallel stuff (for now) and focus on how fast a single core can go.
That is why I refactored the code and made it more modular, isolating the weird iterator and the AES functions used, while still going for the nested checks.
The refactored code got $\sim2^{24.4}$ checks/sec, an astonishing $\sim20\%$ less efficient, showing once and for all that we should never structure code nicely but instead put everything into `main`!
Jokes aside, the loss is mainly related to the iterator, which went from a long chain of nested loops to a single custom iterator that must check and fix the index when it overflows.
However, with the `IndexTree` iterator at hand, we are finally able to permute the tree levels, move the checks and do all the theoretical optimizations.
I ran the code and went to cook lunch, knowing that when I got back the code would still be running and I could get the statistical data to add to this blogpost. An hour later I came back and got an estimated execution time of $54$ seconds and $2^{55.5}$ checks/sec. A “small” increase of $\sim2\,301\,620\,000\%$.
I thought “come on… …something must be wrong!”10 but no, the results are correct. So, we confirmed paper optimization beats code optimization, right?
That was quite a journey for a simple revision of a solving algorithm, and some good discussion points can be brought to your next date (or maybe not).
If time allows, do the paperwork. Taking notes and solving the problem on “paper” before coding is a tool and a skill. Many solve tasks by directly writing code and slowly shaping it, by rewriting, into its final form. This works when there are time constraints (paper deadlines, hackathons, bad superiors, etc.). Being good at solving problems on a different medium (notes, paper or even by voice) is an amazing start for cooperating with other people, recording your thoughts for later or starting to visualize the problem.
Rust is “easier” than expected for crypto-oriented research problems. I was surprised by how easy it was for me to set up the whole solution framework while running into only minor problems with the compiler. I won’t argue that Python is way faster for prototyping, but Rust allows us to really push the threshold on the computational side.
`IndexTree`?
Well, I’m planning to clean up the code of the iterator and release it as a proof of concept. Clearly, there are better Rustaceans who know exactly the set of interfaces to use, so my code will just be a guideline for them!
My two cents: I doubt that one can implement `Iterator` for `IndexTree` as is because, whenever a skip is made, the whole `IndexTree` is completely changed, raising eyebrows regarding lifetimes. However, it should be possible to write `IndexTree` as a normal iterator and use the nightly `advance_by` to achieve the same skipping, at the cost of efficiency (the function is eager) and of having to decide how to handle the index representation: single number or list of indices? Both have pros and cons.
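Just to illustrate the trade-off between the two representations (purely a sketch with made-up names, nothing from the published code): a single flat number is compact and plays nicely with `advance_by`-style APIs, while the list of per-level indices is what the skipping logic naturally manipulates; converting between the two is a mixed-radix decomposition.

```rust
// Sketch with illustrative names: converting between a flat index and the
// list of per-level indices for level sizes `dims`.
fn flat_to_indices(mut flat: usize, dims: &[usize]) -> Vec<usize> {
    let mut out = vec![0; dims.len()];
    // Mixed-radix decomposition, deepest level varies fastest.
    for (slot, &n) in out.iter_mut().zip(dims).rev() {
        *slot = flat % n;
        flat /= n;
    }
    out
}

fn indices_to_flat(indices: &[usize], dims: &[usize]) -> usize {
    indices.iter().zip(dims).fold(0, |acc, (&i, &n)| acc * n + i)
}

fn main() {
    let dims: [usize; 3] = [3, 2, 2];
    let idx = flat_to_indices(7, &dims);
    println!("7 <-> {:?}", idx); // 7 <-> [1, 1, 1]
    assert_eq!(indices_to_flat(&idx, &dims), 7);
}
```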
If you know about the existence of this iterator, can you please let me know?
Edit (2024.02.07): I published the polished code of `IndexTree`. You can find it here.
Weird iterators.
Connected to the previous point, it would be nice to know if some “advanced iterators” already exist for Rust. `IndexTree` is one of these iterators: an iteration over the elements of a Cartesian product with the ability to skip entire branches.
Another one that comes to mind (and I implemented it in Python back in the day) is an iterator called `UpdList`, over a list where new elements can be appended while iterating. It works like this: one iterates over a list `[1,2,3]` and, after consuming and computing on `1`, realises that we must iterate on `4` too, so the code can append it to what is left of the iterator, obtaining `[2,3,4]`.
This might look like a weird feature for an iterator but it is extremely useful (in research at least) when checking properties that must achieve some sense of closure: you start with a few elements and later the algorithm creates new instances and appends them to the list. Specific checks and properties will make sure the list collapses and reduces until it disappears and the loop terminates.
It is true that the iterator might not end if the conditions/properties/code do not allow it, but this is a pen-and-paper problem, not a code one11.
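For what it’s worth, the same behaviour is easy to emulate in Rust with a plain work-list loop (a sketch; `UpdList` itself is not an existing crate):

```rust
use std::collections::VecDeque;

// Work-list emulation of the `UpdList` idea: iterate over a list while being
// allowed to append new elements that will also be consumed before the loop ends.
fn main() {
    let mut pending: VecDeque<u32> = VecDeque::from([1, 2, 3]);
    let mut processed = Vec::new();

    while let Some(x) = pending.pop_front() {
        processed.push(x);
        // Toy closure rule: consuming 1 reveals that 4 must be handled too,
        // so it is appended to what is left of the "iterator".
        if x == 1 {
            pending.push_back(4);
        }
    }
    println!("processed: {:?}", processed); // processed: [1, 2, 3, 4]
}
```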
Of course, these are my thoughts, you might have different ones that I’m happy to hear and discuss!
the big O notation and related families of notation. ↩
IMHO, $\sim 80\%$ of cryptology papers propose new primitives and only $\sim10\%$ of them provide an implementation, usually only for comparison against other similar implemented protocols. ↩
leaking the Hamming weight of the permutation’s output is a classic in side-channel attacks. ↩
this is extremely not the case! My two cents: hashing is mainly over-optimized in over-optimized hardware which often cannot be repurposed to work for a different circuit (see checking AES key candidates). If we turn a blind eye to this, checking a candidate is a lot less computation than a full hash computation! ↩
I’m deliberately hiding how to compute an optimal permutation. DYOR but the idea is to minimise the number of introduced dependencies from one level to the other while maximising the number of checks. ↩
here is a source. After a quick search, I found some machines that claim 500 trillion hashes per second, which is $\sim2^{49}$. True, these monsters exist, but I highly doubt I will buy one for my research 😅 ↩
I mean that I was doing theoretical research, not that I was theoretically working. I’m so funny sometimes! ↩
Beware, I’m definitely not an amazing programmer with several centuries of experience who knows everything about a quintillion different acronyms, standards and other software-engineering optimization methods/cults. Given my background, I grew up tackling problems on paper until they were correctly solved and optimized. After that, I know how to write the code needed to translate my paperwork into a different (programming) language! ↩
added only for dramatic effect. With the power of pen-and-paper maths, I already knew the code would be much faster than the previous one. ↩
recursion has the same property, since it is a matter of some pen-and-paper thoughts to find out whether the recursive call will reach a terminating case. ↩