tl;dr: before implementing a solution for one of my projects, I did some pen-and-paper optimization and got a $\sim 2\,301\,620\,000\%$ computational speed-up.
Edit (2024.02.07): I published the polished code of `IndexTree`. You can find it here.
A significant gap between theoretical and practical cryptology lies in the challenge of understanding “what” is realistically computable or, in simpler terms, “how much time does it take to compute X”.
In the realm of theory, computational complexity is meticulously calculated, proven, and expressed in asymptotic notation1, offering insights into the additional computations required as inputs grow larger. Intuitively, more input leads to more computations, translating into more time spent computing. The computational complexity provides a mathematical description of these incremental computational demands.
Numerous well-established algorithms/primitives find their way into practical implementation across various languages, each employing distinct solving strategies. They all engage in a competitive empirical performance analysis where all costs, including time and memory, are meticulously measured and compared.
It’s an awesome process…
…until one realizes that only a (truly) minute fraction of “unknown” algorithms are effectively implemented, and an even smaller fraction prioritizes efficiency2. Often, the realm of what can be realistically computed remains a mystery. Unfortunately, ignorance tends to blind everyone, frequently leading to incorrect and oversimplified assumptions about computability.
Allow me to illustrate this point with an example from my ongoing work on this project.
Trees (and graphs) stand as colossal entities in the world of computer science, forming the go-to data structure for numerous computational challenges in cryptology, computer science, and beyond. Specifically, I’m interested in problems related to colossal trees, with the grandeur of giant sequoias. The challenge is to search for all the leaves that satisfy specific conditions.
Let me bring up a concrete example that relates to the side-channel problem of retrieving an AES-128 secret key from partial information leaked by the permutation boxes, for example, the Hamming weight of the output3. The 128-bit secret key $k$ is split into 16 8-bit subkeys $k_i$ and, together with the appropriately split known input $x$ into $x_i$, gets permuted into $\phi(x_i \oplus k_i)$, whose Hamming weight, i.e. the number of ones in the result (a number between $0$ and $8$), is leaked.
Since the input $x$ and how the permutation $\phi$ is computed are known, the possession of the Hamming weight allows anyone to retrieve a list of subkey candidates $(\hat{k}_i)_{i=1}^{16}$ for each subkey $k_i$.
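As a minimal illustration (a sketch in Rust, not this project’s code), recovering the candidate list for a single subkey is just an exhaustive filter over the $256$ possible bytes; the permutation `phi`, the input byte and the leaked weight below are made-up placeholders.

```rust
// Sketch: build the candidate list for one 8-bit subkey from a leaked
// Hamming weight, assuming a known byte permutation `phi` and input byte `x_i`.
fn subkey_candidates(x_i: u8, leaked_weight: u32, phi: impl Fn(u8) -> u8) -> Vec<u8> {
    (0u8..=255)
        .filter(|&k| phi(x_i ^ k).count_ones() == leaked_weight)
        .collect()
}

fn main() {
    // Toy permutation, only to make the sketch runnable
    // (the real attack uses the cipher's permutation box).
    let phi = |b: u8| b.rotate_left(3) ^ 0x5A;
    let candidates = subkey_candidates(0x42, 4, phi);
    println!("{} candidates for this subkey", candidates.len());
}
```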
We put everything together by constructing a tree where, at each level $i$, we branch according to all the possible subkey candidates $\hat{k}_i$. All the secret key candidates are obtained by considering all the possible combinations of all the subkey candidates $(\hat{k}_i)_{i=1}^{16}$, meaning that the total number of secret key candidates is the product of the cardinalities of all the candidate lists $\prod_{i=1}^{16} \lvert\hat{k}_i\rvert$.
How big can this amount be?
Well, the worst weight possible is $4$, for which there are $\binom{8}{4} = 70$ subkey candidates. If we get the worst weight possible for every subkey, we get $70^{16}$ possible secret keys, which is basically this number:
\[70^{16} = 332\,329\,305\,696\,010\,000\,000\,000\,000\,000 \simeq 3.32 \cdot 10^{29} \simeq 2^{98}\]
For context and at the time of writing, Bitcoin’s network currently has4 a Hash Rate of $0.5 \cdot 10^{21} \sim 2^{69}$ hashes per second. Since I love back-of-envelope calculations, if we manage to convince all the Blockchain’s miners to find AES keys for the worst possible case of my problem, and one candidate check takes the same computation time as computing a hash5, then the whole Bitcoin mining community will take $2^{29}$ seconds to solve the problem, better known as $\sim17$ years.
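Spelled out, the back-of-envelope step behind the $\sim17$ years is just a division:
\[\frac{2^{98}\ \text{candidates}}{2^{69}\ \text{checks/s}} = 2^{29}\ \text{s} \approx 5.4 \cdot 10^{8}\ \text{s} \approx 17\ \text{years}.\]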
So… is there anything else we can do?
Well, for the AES key search, yes.
Following the natural steps of the encryption algorithm, the curious reader might point out: “you already got the leaks for the permutation’s output in the first round. What about the second round of leaks? Third? … ? Last round?” These are indeed conditions that allow us to filter out the candidates: only the good candidates produce the same leaks as the correct secret key! There are $16$ new conditions for each round considered!
Note the crucial order: first you get the candidate list and only later you filter it. For the worst-case scenario, one must take all $2^{98}$ candidates one by one, simulate the encryption and check whether the simulated leaks are the same as the original ones.
This is the ending point for many researchers/human beings out there. If you think the same, I believe this might be the time to move on to the next problem. Thank you for reading! We clearly discussed a problem with an unreasonable amount of computation to do…
…but if you are skeptical, feel free to read the next notes!
The first step is asking ourselves “How would a mathematician tackle the problem?”. Well, by drawing an abstract representation of the problem! Many (me included!) often forget the power of visual problem solving, better known as the ability to solve a problem by looking at it, that is, by creating a mental image without overthinking it.
```python
for cand1 in CandidatesList1:
    for cand2 in CandidatesList2:
        ...:
            for cand16 in CandidatesList16:
                candidate = (cand1, cand2, ..., cand16)
                if all((check1(candidate), check2(candidate), ..., check16(candidate))):
                    # Good candidate, store it somewhere
                    good.append(candidate)
                # else Bad Candidate
```
Once we have a nice picture, the second step is to find equivalent solutions. Equivalence is an extraordinary mathematical property that often hides tips on how to improve the implementation’s computational complexity. Look at it this way: computing all $16$ checks and later verifying that all of them pass is no different from nesting the checks one after the other and, if any one of them fails, stopping there.
```python
for cand1 in CandidatesList1:
    for cand2 in CandidatesList2:
        ...:
            for cand16 in CandidatesList16:
                candidate = (cand1, cand2, ..., cand16)
                if check1(candidate):
                    if check2(candidate):
                        ...:
                            if check16(candidate):
                                # Good candidate, store it somewhere
                                good.append(candidate)
                # else Bad Candidate
```
All of this reduces the number of checks but not the number of candidates. If we see the problem as a tree, it would be nice if we could squeeze the checks into the intermediate levels so that, if a check fails early, we can completely skip the rest of the subtree and move to the next branch!
```python
for cand1 in CandidatesList1:
    for cand2 in CandidatesList2:
        if check1(cand1, cand2):
            for cand3 in CandidatesList3:
                if check2(cand1, cand2, cand3):
                    ...:
                        for cand16 in CandidatesList16:
                            candidate = (cand1, cand2, ..., cand16)
                            if check16(candidate):
                                # Good candidate, store it somewhere
                                good.append(candidate)
            # else Bad Candidate
```
However, to achieve this, the checks we move up cannot depend on the subkey candidates of the subtree below them. Luckily for us, the considered problem has exactly this property!
Each one of the $16$ checks depends on a subset of the subkeys, thus potentially allowing several checks to be raised into the tree structure. The only mandatory object to consider is a permutation of the indices, which permutes the tree levels and allows the checks to be moved to the correct place inside the tree. If we don’t permute, a check might depend only on the first subkey $k_1$ and the last subkey $k_{16}$, making it impossible to raise it in the tree despite it being independent of the majority of the subkeys! By permuting, the tree will consider the candidates for $k_1$ and $k_{16}$ on the first two levels, allowing the check to be placed right there. Easy, right? Now we are able to navigate all the candidates faster because, if a premature check fails, we might skip big branches of the tree!
```python
for cand1 in CandidatesList1:
    for cand5 in CandidatesList5:
        if check3(cand1, cand5):
            for cand10 in CandidatesList10:
                if check1(cand1, cand5, cand10):
                    ...:
                        for cand2 in CandidatesList2:
                            candidate = (cand1, cand2, ..., cand16)
                            if check7(candidate):
                                # Good candidate, store it somewhere
                                good.append(candidate)
            # else Bad Candidate
```
One question: are we happy with writing $16$ nested loops with some checks here and there? Are we really happy about it?
I love the idea of an iterator: you must cycle over an iterable structure and you let the iterator handle how to move through the structure. You only have to ask “Hey iterator, give me the next element”. Trees representing Cartesian products are so elegant to describe by only using the indices of the single components. Skipping subtrees is a matter of increasing the correct index and, when a level is done, you reset the indices and increase a previous one, similarly to how the carry works in addition.
Our tree can be represented as a list of indices, and we increase the current index position according to the check that fails, allowing the tree’s navigation to be optimal6!
Here we have it, a classic iterator on a Cartesian product of indices with a weird way of obtaining the next element, since it might receive as input a request to skip to a future index. The information needed to create such an iterator (which I like to call `IndexTree`) is the list of index dimensions $[ n_j ]_{j=1}^{16}$ and a list of indices $[ s_j ]_{j=1}^m$ where the skips must occur. For example, the third skip $s_3$ tells the iterator that the third skip is done at the $s_3$-th position.
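To fix ideas, here is a minimal Rust sketch of the mechanism (a hypothetical illustration, not the published `IndexTree` code): a mixed-radix counter over the levels that can be asked to skip everything below a given level, with the carry behaviour described above.

```rust
/// Hypothetical sketch of the `IndexTree` idea, not the published implementation:
/// a mixed-radix counter over the index lists, able to skip whole subtrees.
struct IndexTree {
    dims: Vec<usize>,    // n_j: number of candidates at each level
    current: Vec<usize>, // current index at each level
    done: bool,
}

impl IndexTree {
    fn new(dims: Vec<usize>) -> Self {
        let done = dims.iter().any(|&n| n == 0);
        let current = vec![0; dims.len()];
        Self { dims, current, done }
    }

    /// A check failed at `level`: reset every deeper index and advance the
    /// counter at `level`, carrying towards the root exactly like in addition.
    fn skip_at(&mut self, level: usize) {
        for idx in self.current[level + 1..].iter_mut() {
            *idx = 0;
        }
        let mut l = level;
        loop {
            self.current[l] += 1;
            if self.current[l] < self.dims[l] {
                return;
            }
            self.current[l] = 0;
            if l == 0 {
                self.done = true; // carried past the root: iteration is over
                return;
            }
            l -= 1;
        }
    }

    /// A plain "next leaf" is just a skip at the deepest level.
    fn advance(&mut self) {
        let last = self.dims.len() - 1;
        self.skip_at(last);
    }
}

fn main() {
    // Tiny 3-level tree with 3 x 2 x 2 leaves.
    let mut it = IndexTree::new(vec![3, 2, 2]);
    while !it.done {
        println!("{:?}", it.current);
        if it.current[0] == 1 {
            // Pretend a check on the first level failed for index 1:
            // the whole subtree below it is skipped in one call.
            it.skip_at(0);
        } else {
            it.advance();
        }
    }
}
```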
All is nice, the drawing, the trees and the permutations…
…but are we really making it fast or is this syntactic sugar for useless complexity?
Measuring code efficiency empirically can be extremely misleading and hard to do in a sound way. Compilers/interpreters often do weird magic, hardware handles memory according to even weirder magic, temperature influences performance, and so many other factors come into play. Sorry for the personal rant but I cannot find an article describing in high detail how the comparison between code executions should be done rigorously…
…that is why I will do it my way (definitely not rigorous) and mainly to provide a rule-of-thumb comparison.
First of all, I will focus on the average number of checks the code executes per second while tackling (somewhat) smaller problems. The worst possible problem of navigating the whole $70^{16}$-leaves sequoia is, IMHO, too big and somehow out of scope for a comparison between different solutions. Instead, I will go for several “worst in a somehow good” scenarios: $6$ leaks have a unique subkey candidate but the other $10$ are bad-bad with $70$ candidates each, meaning there are this many candidates:
\[1^6 \cdot 70^{10} = 2\,824\,752\,490\,000\,000\,000 \simeq 2.8 \cdot 10^{18} \simeq 2^{61.3}\]
which is a big number, if you ask me! If we follow Snowden’s “one trillion guesses per second”7 ($\sim2^{40}$) suggestion, such a government would find all the key candidates in $2^{21.3}$ seconds, which is almost a whole month.
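For the skeptical reader, the “almost a whole month” comes from:
\[\frac{2^{61.3}\ \text{candidates}}{2^{40}\ \text{checks/s}} = 2^{21.3}\ \text{s} \approx 2.6 \cdot 10^{6}\ \text{s} \approx 30\ \text{days}.\]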
Second, the implementation language.
Many years in academia, mainly working in theory8, made Python my first choice.
Python is so good for prototyping in research: extremely good for writing quite lengthy scripts that compute some (quite basic and known) combinatorial number, forgetting about them and later rewriting them for the millionth time to get the same results. Or, another classic, you have a million different elements with different types (or meanings) that shouldn’t interact despite being represented in the same way…
…but basic Python doesn’t care and lets you mix field elements, integers and group elements as if we all lived in a free, anarchic Universe where formalism is a long list of suggestions. If this is not love, then what?
When I first encountered this problem years ago, I wrote the infamous naive solution of creating one candidate at a time and doing all the checks before continuing to the next candidate. Almost immediately, I rewrote the algorithm to do the nested checks and made the whole loop work in parallel. Luckily, I lost the code, but the memory of the average speed of this prototype is quite vivid: $2^{22}$ checks per second, implying an estimated time of $\sim 21\,462$ years to finish.
I said “luckily I lost the code” because it forced me to rewrite everything in Rust!
Rust was a curious language that I started reading about and learning a couple of years ago, but never with the idea of applying it to real problems or research. That changed a couple of months ago, mainly motivated by the need to get new skills, show my coding abilities and try out the language to see how much it could do for me9. This was a perfect small prototype project to work on and push the envelope!
That is why I’m happy to announce that the serial version of the naive solution with the sequential checks, working on a single MacBook Air M1 core, clocks at $2^{24.7}$ checks/sec, an astounding $\sim6.5$ times faster than the parallel Python version and implying a “more reasonable” computing time of $\sim3\,303$ years!
WOW, my glued code, almost all in the `main` function, where half the variables are a weird type-concussion and there are $16$ nested `for` cycles, is indeed faster on a single core!
Time to make the cycle parallel by first putting together all the loops as an iterator on the Cartesian product, as the macro `iproduct` easily allows, aaaaaannnd…
…error?
Oh yeah, there is a limit of $12$ lists you can put together with that macro.
Ok…
…so we skip the parallel stuff (for now) and focus on how fast a single core can go.
That is why I refactored the code and made it more modular, isolating the weird iterator and the AES functions used, while still going for the nested checks.
The refactored code got $\sim2^{24.4}$ checks/sec, an astonishing $\sim20\%$ less efficient, showing once and for all that we should never structure code nicely but instead put everything into `main`!
Jokes aside, the loss is mainly related to the iterator, which went from a long chain of nested loops to a single custom iterator that must check and fix the index when it overflows.
However, with the `IndexTree` iterator at hand, we are finally able to permute the tree levels, move the checks and do all the theoretical optimizations.
I ran the code and went to cook lunch, knowing that when I got back the code would still be running and I could get the statistical data to add to this blogpost. An hour later I came back and got an estimated execution time of $54$ seconds and $2^{55.5}$ checks/sec. A “small” increase of $\sim2\,301\,620\,000\%$.
I thought “come on… …something must be wrong!”10 but no, the results are correct. So, we confirmed paper optimization beats code optimization, right?
That was quite a journey for a simple revision of a solving algorithm, and some good discussion points can be brought to your next date (or maybe not).
If time allows, do the paperwork. Taking notes and solving the problem on “paper” before coding is a tool and a skill. Many solve tasks by directly writing code and slowly shaping it, by rewriting, into its final form. This works when there are time constraints (paper deadlines, hackathons, bad superiors, etc.). Being good at solving problems on a different medium (notes, paper or even by voice) is an amazing start for cooperating with other people, recording your thoughts for later or starting to visualize the problem.
Rust is “easier” than expected for crypto-oriented research problems. I was surprised by how easy it was for me to set up the whole solution framework while running into only minor problems with the compiler. I won’t argue that Python is way faster for prototyping, but Rust allows us to really push the threshold on the computational side.
`IndexTree`?
Well, I’m planning to clean up the code of the iterator and release it as a proof of concept. Clearly, there are better Rustaceans who know exactly the set of interfaces to use, so my code will just be a guideline for them!
My two cents: I doubt that one can implement `Iterator` for `IndexTree` as is because, whenever a skip is made, the whole `IndexTree` is completely changed, raising eyebrows regarding lifetimes. However, it should be possible to write `IndexTree` as a normal iterator and use the nightly `advance_by` to achieve the same skipping, at the cost of efficiency (the function is eager) and of having to decide how to handle the index representation: single number or list of indices? Both have pros and cons.
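Just to illustrate the trade-off between the two representations (purely a sketch with made-up names, nothing from the published code): a single flat number is compact and plays nicely with `advance_by`-style APIs, while the list of per-level indices is what the skipping logic naturally manipulates; converting between the two is a mixed-radix decomposition.

```rust
// Sketch with illustrative names: converting between a flat index and the
// list of per-level indices for level sizes `dims`.
fn flat_to_indices(mut flat: usize, dims: &[usize]) -> Vec<usize> {
    let mut out = vec![0; dims.len()];
    // Mixed-radix decomposition, deepest level varies fastest.
    for (slot, &n) in out.iter_mut().zip(dims).rev() {
        *slot = flat % n;
        flat /= n;
    }
    out
}

fn indices_to_flat(indices: &[usize], dims: &[usize]) -> usize {
    indices.iter().zip(dims).fold(0, |acc, (&i, &n)| acc * n + i)
}

fn main() {
    let dims: [usize; 3] = [3, 2, 2];
    let idx = flat_to_indices(7, &dims);
    println!("7 <-> {:?}", idx); // 7 <-> [1, 1, 1]
    assert_eq!(indices_to_flat(&idx, &dims), 7);
}
```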
If you know about the existence of this iterator, can you please let me know?
Edit (2024.02.07): I published the polished code of `IndexTree`. You can find it here.
Weird iterators.
Connected to the previous point, it would be nice to know if some “advanced iterators” already exist for Rust. `IndexTree` is one of these iterators: an iteration over the elements of a Cartesian product with the ability to skip entire branches.
Another one that comes to mind (and I implemented it in Python back in the day) is an iterator called `UpdList`, over a list where new elements can be appended while iterating. It works like this: one iterates over a list `[1,2,3]` and, after consuming and computing on `1`, realises that we must iterate on `4` too, so the code can append it to what is left of the iterator, obtaining `[2,3,4]`.
This might look like a weird feature for an iterator but it is extremely useful (in research at least) when checking properties that must achieve some sense of closure: you start with a few elements and later the algorithm creates new instances and appends them to the list. Specific checks and properties will make sure the list collapses and reduces until it disappears and the loop terminates.
It is true that the iterator might not end if the conditions/properties/code do not allow it, but this is a pen-and-paper problem, not a code one11.
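For what it’s worth, the same behaviour is easy to emulate in Rust with a plain work-list loop (a sketch; `UpdList` itself is not an existing crate):

```rust
use std::collections::VecDeque;

// Work-list emulation of the `UpdList` idea: iterate over a list while being
// allowed to append new elements that will also be consumed before the loop ends.
fn main() {
    let mut pending: VecDeque<u32> = VecDeque::from([1, 2, 3]);
    let mut processed = Vec::new();

    while let Some(x) = pending.pop_front() {
        processed.push(x);
        // Toy closure rule: consuming 1 reveals that 4 must be handled too,
        // so it is appended to what is left of the "iterator".
        if x == 1 {
            pending.push_back(4);
        }
    }
    println!("processed: {:?}", processed); // processed: [1, 2, 3, 4]
}
```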
Of course, these are my thoughts, you might have different ones that I’m happy to hear and discuss!
the big O notation and related families of notation. ↩
IMHO, $\sim 80\%$ of cryptology papers propose new primitives and only $\sim10\%$ of them provide an implementation, usually only for comparison against other similar implemented protocols. ↩
leaking the Hamming weight of the permutation’s output is a classic in side-channel attacks. ↩
this is extremely not the case! My two cents: hashing is mainly over-optimized in over-optimized hardware which often cannot be repurposed to work for a different circuit (see checking AES key candidates). If we turn a blind eye to this, checking a candidate is a lot less computation than a full hash computation! ↩
I’m deliberately hiding how to compute an optimal permutation. DYOR but the idea is to minimise the number of introduced dependencies from one level to the other while maximising the number of checks. ↩
here is a source. After a quick search, I found some machines that claim 500 trillion hashes per second, which is $\sim2^{49}$. True, these monsters exist, but I highly doubt I will buy one for my research 😅 ↩
I mean that I was doing theoretical research, not that I was theoretically working. I’m so funny sometimes! ↩
Beware, I’m definitely not an amazing programmer with several centuries of experience who knows everything about a quintillion different acronyms, standards and other software-engineering optimization methods/cults. Given my background, I grew up tackling problems on paper until they were correctly solved and optimized. After that, I know how to write the code needed to translate my paperwork into a different (programming) language! ↩
added only for dramatic effect. With the power of pen-and-paper maths, I already knew the code would be much faster than the previous one. ↩
recursion has the same property, since it is a matter of some pen-and-paper thoughts to find out whether the recursive call will reach a terminating case. ↩