Latest Posts (11 found)
Higashi 1 week ago

Go generate meets vibes: vibe code Go one interface at a time using govibeimpl

Vibe-code Golang one interface at a time. During the holidays, I was working on a personal project in Go, and I wanted to use AI to help me do a few things (e.g. implement a downloader that downloads a file from Google Drive). However, I'm not a huge fan of having AI IDEs create new directory structures or introduce abstractions that I need to read through and understand. Instead, I thought it would be cool to:

- define an interface that I will need to use
- expect AI to write an impl to that interface
- expect that I will receive an instance of that interface at run time
- perhaps read about the API contract to see what concrete data types I need to pass in and read out
- profit, I guess

And then I thought it would be great to combine this with go generate, where for every interface I define, I can just attach a tag so AI can fill in the rest at compile time. Therefore, I built govibeimpl (https://github.com/yuedongze/govibeimpl), a CLI tool that works with go generate and lets me tag an interface for AI to implement, seamlessly integrating that into my development flow.

2 views
Higashi 1 month ago

AI should only run as fast as we can catch up

Recently I spoke with two friends who have both been having fun playing with AI.

Last month, I met with Eric, a fearless PM at a medium-sized startup who recently got into vibe coding with Gemini. After getting familiar with Gemini, Eric was genuinely amazed by how quickly AI turns a prompt into a playable web application. It serves a great purpose as a first prototype to communicate ideas to designers and engineers. But Eric really wanted to skip those steps and ship it directly to prod. What he couldn't quite see is that Gemini had actually built a single-page HTML file that merely looks like a working app. Sadly, one cannot build a reliable enterprise product out of this. And there is really no effective way for Eric to catch up on these technical details and outpace the engineering team himself.

Last week, I had coffee with Daniel, a senior staff engineer who recently grew fond of AI coding and found it to be a true force multiplier. Daniel was skeptical of AI at first, but lately he hasn't written a single line of code in months. What he does is precisely prompt the AI to create new components in an existing framework (involving Kafka, Postgres, AuthN/Z, and k8s infra) while adhering to preexisting paradigms. He spot-checks the correctness of the AI's work and quickly spins up local deployments to verify it is indeed working. Then he pushes the changes through the code review process and lands those features - all without writing a single line of code, and production ready just as if he had written it himself. To Daniel, building and shipping things fast and at scale is simpler than ever.

After speaking with Eric and Daniel, I feel there is an overarching theme around the use of AI that we can extract from these stories. After pondering for a weekend, I think I can attempt to describe it now: it's the problem of reliable engineering - how can we make AI work reliably. With the AI superpower, one can task it with all kinds of crazy things on the internet just by typing a few lines of prompt. AI thinks and learns faster than us; this is undeniable now. However, to make AI's work actually useful (not only working, but reliable and trustworthy), we also need to catch up with what the AI does as quickly as possible. It's almost like we need to send the AI off to learn and think as fast as possible, but we also need to catch up as soon as possible to make it all relevant. The speed at which we catch up is critical to whether AI can help us do these tasks effectively. In Daniel's case, he can spot-check and basically skim through the AI's work and know for sure it's doing the right thing, with a few simple test steps to verify - hence his results are more reliable. Whereas Eric would need to learn software development from the bottom up to comprehend what the AI has done, and that really doesn't give him the edge to outpace engineering teams and ship features reliably by himself.

To generalize the problem, I think all the tasks we do can be broken down into two parts: learning/creation and verification - doing the task, and checking that the task is done right. Interestingly, this gives us a good perspective on our relationship with AI when performing such tasks. Effort-wise, if verification ≪ learning/creation, one can very effectively check AI's work and be confident about its reliability. If verification ≈ learning/creation, one spends an equal amount of time checking AI's work.
It's not a big win; maybe AI becomes a good automation script that cuts down some boilerplate. If verification ≫ learning/creation, one cannot easily be sure about AI's work, and we are in vibe-land.

A very good example of the first category is image (and video) generation. Drawing/rendering a realistic-looking image is a crazily hard task. Have you tried to make a slide look nicer? It takes me literally hours to center the text boxes to make it look "good". However, you really just need to take one look at the output of Nano Banana and you can tell whether it's a good render or a bad one based on how you feel. The verification is literally instantaneous and effortless because it's all encoded as feeling, or vibes, in your brain. "Does this look right?" can probably be answered in the span of milliseconds by your visual cortex. There is also no special knowledge required - human beings have been evaluating visual images since birth; it's hardwired into our instincts. This significant cost asymmetry goes a long way toward explaining why AI image generation exploded. If we look for similar scenarios, we can probably identify other "killer" use cases of AI as well.

However, if we go down to the bottom of the spectrum, where verification becomes more intense - requiring domain knowledge, technical expertise, and industry know-how to tell whether the AI is producing slop - we enter a dark age of piling up verification debt. More things are being created, but we lag behind in checking whether any of it actually works to our satisfaction. If an organization keeps vibe-coding without catching up on verification, those tasks quickly pile up as "debt" that still needs to be verified. When verification becomes the bottleneck, dangerous things can happen if we still want to move fast - we risk running unverified code and having unexpected side effects that are yet to be validated. This applies to other fields too - imagine asking AI to craft a new vaccine and not wanting to wait for the FDA before using it. I've come across a few blog posts that talk about verification debt already. I think it's genuinely a good problem for technical leaders to keep in mind in this era.

AI can only reliably run as fast as we can check its work. It's almost like a complexity theory claim. But I believe it needs to be the case to ensure we can harvest the exponential warp speed of AI while remaining robust and competent, as these technologies ultimately serve human beings, and we humans need technology to be reliable and accountable - we are already flaky enough ;)

This brings up the topic of Verification Engineering. I believe this can be the big thing after Context Engineering (which is the big thing after Prompt Engineering). By cleverly rearranging tasks and using nice abstractions and frameworks, we can make verification of AI-performed tasks easier and use AI to ship more solid products to the world. No more slop. I can think of a few ideas to kick off verification engineering:

- How to craft more technically precise prompts to guide AI to surgically do things, rather than vibing it.
- How to train more capable technical stakeholders who can effectively verify and approve what AI has done.
- How to find more tasks that are relatively easy to verify but rather hard to create.
- How to push the theoretical boundaries of what things we can succinctly verify (complexity theory strikes again).

I believe whoever figures out how to effectively verify more complex tasks using human brains will gain the most from the AI boom. Maybe we need to discard traditional programming languages and start programming in abstract graph-like dataflow representations, where one can easily tell whether a thing is done right or wrong despite its language or implementation details. Maybe our future is like the one depicted in Severance - we look at computer screens with wiggly numbers, and whatever "feels right" is the right thing to do. We can harvest these effortless, low-latency "feelings" that nature gives us to make AI do more powerful work.

0 views
Higashi 1 year ago

I built a Coffee Beans Tracker that does OCR inventory management, with GPT and no code

I brew coffee and I buy a lot of beans. I'm also a lazy person who doesn't want to type all the details into one of the available apps and track things to the gram. Luckily, I found out that GPT-4o (free tier) is good enough to take a few photos I upload of a bag and parse its roaster, varietal, origin, tasting notes, roast date, etc. I tried it with a dozen different bags and it works pretty well - even when it makes mistakes, I can correct them with natural language conversation, which I enjoyed. Then I had to figure out storage. I didn't want to rely on the chat history / context as a way to store info, as that can easily get lost. I discovered GPT Actions. My immediate thought was to give GPT access to an online scratchpad that it can read and write, and call it a day. That's how I ended up integrating it with GitHub Gist. I created an empty gist file and generated an app access token that reads/writes gist files. I hardcoded this token into GPT Actions (and struggled quite a bit teaching it how to use the gist API), and voila, it was able to write stuff to my gist file! I simply populated some example content in that file - a CSV-encoded format of coffee beans, with the first row describing the columns I'm interested in. GPT is then smart enough to figure out how to read this CSV, parse how much coffee I have to answer queries, append a new bag to the end of the CSV, and even sort its entries! It also made some terrible mistakes that I had to manually correct, but these are CSV syntax mistakes and they don't seem to bother GPT at all. Here is a history of changes GPT (and I) made to that gist file if you are interested. Then I talked about it with friends, and they all of a sudden got interested and wanted to try this too. I can't share my GPT because it hardcodes my API token - they would be writing to my beans list instead of theirs. Moreover, GPT is unreliable; even if I managed to separate users by a specific column, I'm sure my friends could find a way to trick GPT into just wiping the entire gist file and ruining the fun. Last weekend I was thinking about how to make this multi-tenant. I think GPT Actions attaches some user identifier to each request, but I'm not so sure whether it's reliable. Then I found that you can also do OAuth in GPT Actions instead of a plain API token - ideally this should allow users to log in to their own account and claim their own piece of storage somewhere. I found a blog post from OpenAI about how to authenticate GPT Actions with an AWS Cognito user pool. I decided this was it - I fired up an AWS account and did the full serverless stack: Cognito for OAuth/user pool, API Gateway for gating access, Lambda for processing requests and giving GPT read/write access to its scratchpad (literally a string that I save) keyed under the Cognito user ID, and a DynamoDB table that stores the data. After tinkering with it for 1-2 hours, I got something working. I can log in / register an account with Cognito, and then GPT gets its persistent scratchpad to do its thing. It worked for me, and it worked on my partner's phone as well. I'm happy that I didn't do any serious development work throughout the process and GPT was able to approximately do all the things I intended. I would like to claim that I wrote NO CODE for this, but I had to write a simple get/put adapter for the Lambda function - though most of that was copy-pasted from a tutorial anyway ;) There you have it. I really like how you can prototype an idea this easily by giving a GPT a scratchpad.
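For the curious, the adapter is roughly the shape below - a minimal sketch, assuming a Lambda proxy integration behind an API Gateway Cognito authorizer; the table and field names here are hypothetical, not necessarily the ones I used:

```python
import json
import boto3

# Hypothetical table: partition key "user_id", one scratchpad string per user.
table = boto3.resource("dynamodb").Table("scratchpads")

def handler(event, context):
    # The Cognito authorizer passes the authenticated user's claims along.
    user_id = event["requestContext"]["authorizer"]["claims"]["sub"]

    if event["httpMethod"] == "GET":
        item = table.get_item(Key={"user_id": user_id}).get("Item", {})
        return {"statusCode": 200,
                "body": json.dumps({"content": item.get("content", "")})}

    if event["httpMethod"] == "PUT":
        body = json.loads(event["body"] or "{}")
        table.put_item(Item={"user_id": user_id,
                             "content": body.get("content", "")})
        return {"statusCode": 200, "body": json.dumps({"ok": True})}

    return {"statusCode": 405, "body": "method not allowed"}
```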
This can be repurposed for more than just coffee management, of course. Maybe I can think of other, better ideas down the road. Here is a setup link for whoever is interested in trying out this new multi-tenant version of my bean tracker. Sorry for being a bit sketchy on everything, but the idea is to make sure I don't do any serious dev work and still achieve the functionality.

0 views
Higashi 2 years ago

A Gentle Introduction of NTT - Part III: The Kyber Trick

In the previous post, we saw that NTT is a special way of evaluating a polynomial (i.e. converting a polynomial from coefficient form to evaluation form) such that the evaluation points are all the $d$-th roots of unity. We then examined a few efficient ways (recursive and iterative) to implement the NTT and iNTT routines. Lastly, we learned that by carefully picking the right twiddle factors (generators) as roots of the polynomial modulus, we can perform modular polynomial multiplication in NTT form at the same degree as the multiplicands (without the need to extend the degree and perform an additional reduction). Putting together everything we've learned so far, we have an efficient algorithm for multiplying polynomials in a ring in $O(n \log{n})$ time. The recipe is as simple as:

1. Perform NTT on polynomials $P, Q$ with the roots of the polynomial modulus $\phi$ as twiddle factors and obtain $\hat{P}, \hat{Q}$. Note that this requires the modulus to have a special structure such that there exists a generator that generates all of its roots.
2. Perform coordinate-wise multiplication on $\hat{P}, \hat{Q}$ and obtain $\hat{R}$.
3. Perform inverse NTT on $\hat{R}$ and recover the final multiplication result $R = P \times Q$.

From the previous post, we also saw that there are two commonly used cyclotomic polynomial moduli in NTT - $x^d - 1$, which corresponds to circular convolution (CC), and $x^d + 1$, which corresponds to negative-wrapped convolution (NWC). In this post, we will continue our discussion of the negative-wrapped convolution. One of the key speed-ups for polynomial multiplication brought by NTT is that, if the evaluation points happen to be the roots of the polynomial modulus (in our case, $x^d \pm 1$), then the multiplication result is automatically reduced to the target degree, and hence we don't have to double the length/degree of the polynomials we are multiplying to preserve the higher-degree coefficients for the reduction. In the case of $x^d - 1$, we found that the distinct powers $\omega_d^i$ of a $d$-th primitive root of unity (or alternatively, the set of field elements generated by the generator $\omega_d$) also happen to be all the roots of the polynomial modulus:

$x^d - 1 = \prod_{i=0}^{d-1} (x - \omega_d^i)$

We obtained this result by recursively splitting the $x^d - 1$ term and applying the fact that $\omega_d^\frac{d}{2} \equiv -1$ to convert the addition term after each split back into a subtraction term, so we can split further. For example, $x^\frac{d}{2} + 1 \rightarrow x^\frac{d}{2} - \omega_d^\frac{d}{2}$. We can split a total of $\log{d}$ times and end up with $d$ unsplittable degree-1 terms, from which we can easily extract the roots of the original polynomial. As seen in the last post, we can represent the splitting operation as the binary tree shown below (ignoring the $\mathbb{Z}_q[x]/$ part of the expression, with $n = d$). Now moving to the case of NWC: we perform the same kind of binary split and arrive at the following tree ($\omega_d = \psi_{2d}^2$). One problem with this approach, which we pointed out last time, is that starting from the second layer from the top of the tree, the degree of the $\omega$ term on the right is always one half of the degree of the $x$ term on the left. This means that we will run out of $\omega$ terms to split with before we have fully split the $x$ terms down to degree 1.
As seen on the second layer from the bottom of the tree, $x^2 - \omega_d$ cannot be broken down further, until we introduce a finer generator $\psi_{2d}$ which can help us split $\omega_d$ one more time. With $\psi_{2d}$ as the new $2d$-th root of unity, we find the roots of the original polynomial, which are the distinct odd powers of the new generator. The roots for the NWC case indeed work for our polynomial multiplication task. However, we are putting more constraints on the properties of the underlying field $\mathbb{Z}_q$: to make sure everything is well-defined, we need the $2d$-th primitive root of unity to actually exist in the field. So far, we've taken it for granted that $\omega_d$ exists in $\mathbb{Z}_q$; that doesn't necessarily mean we can also find the $2d$-th one. For example, the field modulus $q$ used in Kyber is 3329, in which 17 is a qualifying 256-th primitive root of unity (one can easily check that $17^{256} \equiv 1 \text{ mod } 3329$ and $17^{128} \equiv -1 \text{ mod } 3329$). Sadly, the 512-th one doesn't exist. If we happen to work in the NWC polynomial modulus, we can only perform degree-128 polynomial multiplications! If we really want degree-256 polynomial multiplications, we will have to make the underlying field much larger. The smallest field that gives us a 512-th primitive root of unity is 12289 (with 3 as an example generator, such that $3^{512} \equiv 1 \text{ mod } 12289$ and $3^{256} \equiv -1 \text{ mod } 12289$), which is the field used in another key encapsulation scheme called NewHope. The bump from 3329 to 12289 to accommodate degree-256 polynomial multiplication has potential performance impacts. Since $3329 \approx 2^{11.7}$ and $12289 \approx 2^{13.6}$, we are looking at a 2-bit increase in the representation of the underlying field, and roughly a 4x time/memory degradation if we have to precompute things over the field and store them offline. This naturally brings us to the next big question: can we do better?

The short answer is: yes, but not without paying some price. To avoid paying the extra cost in the field, we will have to figure out a way to work without assuming the existence of $\psi_{2d}$. Now we are back in the binary split tree, stuck at the layer where we are left with a term that looks like $x^2 - \omega_d$. To get us unblocked, here's some life advice: if you cannot have what you want, you should want what you can have. If the best thing we can have is the existence of a primitive $d$-th root of unity $\omega_d$, then, using the same NTT-with-NWC approach, let's see what we can do with it. Here is a simple idea: let's constrain everything we work with to be degree-$\frac{d}{2}$ polynomials, and define the NTT operation on them. Because we only have $\frac{d}{2}$ coefficients, we only need half the evaluations. Since we know $\omega_d$ already, we can easily set $\omega_\frac{d}{2} = \omega_d^2$ to be the new $\frac{d}{2}$-th primitive root of unity, and conversely $\psi_d = \omega_d$. We call this operation the half-NTT, as it operates on input vectors half the size of our intended targets. Now, applying everything we've learned previously, we can do degree-$\frac{d}{2}$ polynomial multiplication coordinate-wise, without reductions, with $x^\frac{d}{2} + 1$ as the polynomial modulus. Effectively, we halve $d$ to make sure the needed root of unity (the $d$-th one, which is "$2d$-th" relative to the halved size) exists in the field.
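A quick Python sanity check of the field facts quoted above:

```python
# Kyber's field: q = 3329, and 17 is a primitive 256-th root of unity.
q = 3329
assert pow(17, 256, q) == 1 and pow(17, 128, q) == q - 1
# But no 512-th primitive root exists: the multiplicative group has order
# q - 1 = 3328 = 2^8 * 13, and 512 does not divide it.
assert (q - 1) % 512 != 0
```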
How can we make use of the newly defined half-NTT procedure? Here is another intuition: can we break down a polynomial multiplication in degree $d$ into several polynomial multiplications in degree $\frac{d}{2}$? Take a degree-4 polynomial $P$ as an example:

$P(x) = p_0 + p_1 x + p_2 x^2 + p_3 x^3 = (p_0 + p_2 x^2) + x (p_1 + p_3 x^2) = P^\text{even}(x^2) + x \cdot P^\text{odd}(x^2)$

We see that we can reorder the terms, factor out the common parts, and essentially interpret a polynomial as a combination of a polynomial with only its even coefficients and a polynomial with only its odd coefficients. This is very similar to the FFT-like recurrence relation trick we used in the previous post to speed up NTT. The important part is that, since each of the smaller polynomials $P^\text{even}, P^\text{odd}$ has only half of the coefficients of $P$, it has degree $\frac{d}{2}$. Therefore, our half-NTT procedure works on them! We will set up the half-NTT procedure programmatically, working directly in the Kyber field $q = 3329$ to demonstrate how it works IRL; the split procedure and the half-NTT trick are implemented together at the end of this derivation.

Now, let's try to multiply two degree-$d$ polynomials using this decomposition, and just see how far we can get. We use $R(x)$ to represent the final result polynomial after the reduction, and decompose it into even and odd subparts for comparison as well:

$R(x) = P(x) \cdot Q(x) \text{ mod } (x^d + 1) = R^\text{even}(x^2) + x \cdot R^\text{odd}(x^2)$

By simply staring at the expression, we can come up with a solution for $R^\text{odd}$:

$R^\text{odd} = P^\text{even} \cdot Q^\text{odd} + P^\text{odd} \cdot Q^\text{even} \text{ mod } (x^\frac{d}{2} + 1)$

We were able to deduce this thanks to the parity of the powers of $x$: only the cross terms produce odd powers of $x$, as the other terms can only produce even powers. If the top expression holds, the bottom expression will also hold. It also appears that we can efficiently compute everything in this expression, as $P^\text{even/odd} \cdot Q^\text{odd/even}$ can be multiplied coordinate-wise after half-NTT, and the addition is trivial. We now define $R^\text{odd}$ in the half-NTT domain, using $\tilde{R^\text{odd}}_j$ to represent the half-NTT of its vectorized coefficients, $\mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{r}^\text{odd})_j$:

$\tilde{R^\text{odd}}_j = \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{odd}}_j + \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{even}}_j$

The corresponding code is part of the implementation below. Now, we subtract this equality from the original expression, and we get a similar solution for $R^\text{even}$ as well:

$R^\text{even} = P^\text{even} \cdot Q^\text{even} + x \cdot (P^\text{odd} \cdot Q^\text{odd}) \text{ mod } (x^\frac{d}{2} + 1)$

The first term $P^\text{even} \cdot Q^\text{even}$ can be handled via half-NTT just like above. However, the second term is a little more interesting: it is the product of $P^\text{odd} \cdot Q^\text{odd}$ and a predetermined fixed-degree polynomial $K(x) = x$. Therefore, we can simply multiply them coordinate-wise in half-NTT form. Let's take the polynomial $F = P^\text{odd} \cdot Q^\text{odd} \text{ mod } (x^\frac{d}{2} + 1)$ and $\mathbf{f}$ as the vectorized coefficient representation of $F$. We inspect it in its half-NTT form:

$\tilde{F}_j = \mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{f})_j = F(\psi_d^{2j+1})$

where $\psi_d$ is a $d$-th primitive root of unity, and $\omega_\frac{d}{2} = \psi_d^2$. Now, let's write down the half-NTT form of $K$:

$\tilde{K}_j = K(\psi_d^{2j+1}) = \psi_d^{2j+1}$

We see that the half-NTT terms of $K$ are as simple as a series of powers of $\psi_d$! Now, let's write down the coordinate-wise multiplication of $F$ and $K$:

$\widetilde{(K \cdot F)}_j = \psi_d^{2j+1} \cdot \tilde{F}_j$

Therefore, we can write down the overall expression for $R^\text{even}$, again using the $\tilde{R^\text{even}}$ abbreviation for the half-NTT form:

$\tilde{R^\text{even}}_j = \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{even}}_j + \psi_d^{2j+1} \cdot \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{odd}}_j$

We notice that the multiplier $\psi_d^{2j+1}$ is different for each index $j$ of the half-NTT vector. Let's now implement this whole procedure.
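Below is a minimal sketch of the whole trick in Python (illustrative names; for clarity it uses a naive $O(d^2)$ evaluation loop where the previous post's iterative NTT would drop in):

```python
q = 3329     # the Kyber field
d = 256      # we want products mod x^256 + 1
m = d // 2
psi = 17     # primitive d-th root of unity: 17^128 = -1 mod 3329

def half_ntt(a):
    # NWC half-NTT of a length-m vector: evaluate at the odd powers psi^(2j+1).
    return [sum(a[i] * pow(psi, (2 * j + 1) * i, q) for i in range(m)) % q
            for j in range(m)]

def half_intt(A):
    # Undo the omega_{d/2} = psi^2 DFT, then the psi^i twist, scaled by m^-1.
    w = pow(psi, 2, q)
    m_inv = pow(m, -1, q)
    b = [sum(A[j] * pow(w, -i * j, q) for j in range(m)) % q for i in range(m)]
    return [b[i] * m_inv % q * pow(psi, -i, q) % q for i in range(m)]

def mul_via_half_ntt(pc, qc):
    # Split into even/odd halves and move to the half-NTT domain.
    Pe, Po = half_ntt(pc[0::2]), half_ntt(pc[1::2])
    Qe, Qo = half_ntt(qc[0::2]), half_ntt(qc[1::2])
    # Coordinate-wise cross products; the even part picks up the extra
    # psi^(2j+1) factor coming from K(x) = x.
    Re = [(Pe[j] * Qe[j] + pow(psi, 2 * j + 1, q) * Po[j] * Qo[j]) % q
          for j in range(m)]
    Ro = [(Pe[j] * Qo[j] + Po[j] * Qe[j]) % q for j in range(m)]
    # Back to coefficients, interleaving the even and odd halves.
    re, ro = half_intt(Re), half_intt(Ro)
    out = [0] * d
    out[0::2], out[1::2] = re, ro
    return out

# Spot check: x^255 * x = x^256 = -1 mod (x^256 + 1).
a = [0] * d; a[255] = 1
b = [0] * d; b[1] = 1
assert mul_via_half_ntt(a, b) == [q - 1] + [0] * (d - 1)
```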
Congratulations! We have just demonstrated computing the even part and the odd part of the product of two degree-$d$ polynomials $P, Q$, using only half-NTT subroutines that work with degree-$\frac{d}{2}$ polynomials in the NWC ring of $(x^\frac{d}{2} + 1)$. Now, let's formally define the steps needed to multiply the two degree-$d$ polynomials $P, Q$:

1. Fix a $d$-th primitive root of unity $\omega_d$ in the field $\mathbb{Z}_q$. Instantiate the half-NTT procedure using $\omega_d^2$ as the twiddle factor (hence handling polynomials half the size).
2. Split $P, Q$ into even and odd polynomials, apply the half-NTT to convert into the NTT domain, and obtain $\tilde{P^\text{even}}, \tilde{P^\text{odd}}, \tilde{Q^\text{even}}, \tilde{Q^\text{odd}}$, each half the size of the original polynomial.
3. For each coordinate $j$, compute $\tilde{R^\text{even}}_j = \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{even}}_j + \psi_d^{2j+1} \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{odd}}_j$.
4. For each coordinate $j$, compute $\tilde{R^\text{odd}}_j = \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{odd}}_j + \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{even}}_j$.
5. Convert $\tilde{R^\text{even}}, \tilde{R^\text{odd}}$ back to coefficient form $R^\text{even}, R^\text{odd}$ via the half-iNTT.
6. Recover the final polynomial coefficients by interleaving the coefficients of $R^\text{even}, R^\text{odd}$.

Let's now do a performance analysis of our new polynomial multiplication algorithm. Looking at the code, the first and last parts split and join the polynomials. This is at most linear $O(d)$ cost, or even free if we purely reindex the original polynomials without actually splitting. The second and fourth sections perform two half-NTT (or inverse) operations, which have a runtime of $O(d \log{d})$ given the recurrence relation we showed in the last post. This is more or less similar to what we had before. The third section is the part that's slightly different. Compared to the original coordinate-wise single multiplication over $d$ elements, which is linear $O(d)$ cost, we are doing a total of 5 field multiplications and 2 additions over $\frac{d}{2}$ elements, roughly a $2.5d$ cost. Asymptotically, though, we can still say it's linear $O(d)$ cost. Overall, the new approach performs pretty much the same asymptotically! Practically, however, we are spending 2.5x more time in the coordinate-wise multiplication, due to the fact that we have to work with smaller-degree polynomial multiplications. This is the additional price we pay for choosing a more compact field. Practically, there's one more thing we can optimize: the two half-NTT calls on the even and odd polynomials. Since they are the same half-NTT operation sharing the same twiddle factors, we can combine the two transformations into one and halve the iterations needed for the step. Once we put the two half-NTTs side by side, we realize this looks identical to the full degree-$d$ iterative NTT, but without the first layer, so that the even and odd coefficients are never coupled. We can also fold all the scaling factors into the iterative process, so that we don't need any pre/post modification of the input/output. Eventually, with all the bells and whistles applied, we reach the (i)NTT implementation used in Kyber, from its specification. Kyber's implementation looks slightly different from our code above.
This is due to the following reasons:

- Kyber works in bit-reversed order in its NTT form and normal order in its coefficient form. Our implementation is the opposite, due to the way the iterative NTT circuit is built. This does not affect the correctness or performance of the implementation, as long as it's consistent throughout the software.
- Like the optimizations we mentioned above, Kyber merges the multiplication by $\psi_d^{\pm i}$ into the iterative NTT procedure, so no pre/post processing is needed.
- Kyber also performs all the operations in place, so it never actually "splits" the polynomial into two.

Before we wrap up, I have to admit that I intentionally omitted a large concept when writing this post. We've managed to come this far only by working with polynomials in coefficient form and evaluation form. However, if we want to generalize the trick used in Kyber and shrink the field further, maybe to the point where we can only handle degree-$\frac{d}{4}$ multiplications, then we will run into a lot of difficulties interpreting everything the way we used to. The missing concept here is called the Chinese Remainder Theorem (CRT) form. I will maybe write another post that dives deep into the topic, but all we need to know right now is that there exists an isomorphism between the CRT form and the evaluation form, which means anything we've been able to achieve by evaluating polynomials has a corresponding meaning in the CRT realm. This isomorphism allows us to generalize the Kyber trick further.

If you were able to read to this point, huge congrats! This is all you need to know about NTT in a smaller field as used in Kyber. Let's do a quick recap of what we have achieved in this post:

- First, we started off with the results from the last post: NTT defined in the CC modulus and the NWC modulus. To enable coordinate-wise multiplication between polynomials, the NTT procedure needs to be slightly different depending on the binary split tree produced by the polynomial modulus.
- Second, we looked into the problem that, in order to multiply degree-$d$ polynomials in the NWC modulus, there has to exist some $2d$-th primitive root of unity $\psi_{2d}$. Without it, we cannot split the second-to-last layer in the binary split tree, and cannot obtain the roots to evaluate on. However, having $\psi_{2d}$ is a stronger assumption than having $\omega_d$, and it makes the underlying field much larger than in the CC case (3329 → 12289).
- Third, we took a step back and started thinking about whether we can still do multiplications in the NWC modulus without the existence of $\psi_{2d}$. The solution was to define a half-NTT procedure that supports polynomial multiplication at degree $\frac{d}{2}$. With the half-NTT procedure, we decompose the degree-$d$ polynomial into its even and odd subparts, each with dimension $\frac{d}{2}$. Multiplication is then defined as several cross-multiplications and additions between the smaller subparts.
- Lastly, we implemented the half-NTT and cross-multiplication trick, and reasoned that its runtime complexity is asymptotically the same as, but practically a bit more expensive than, vanilla NWC NTT. We also saw how Kyber implements this trick in the end.

At this point, we have covered most of the interesting items about NTT. What we've learned builds a solid foundation for efficient implementations of polynomial-based cryptosystems, such as Kyber and Falcon.
If you are still curious about NTT, I would definitely recommend checking out this amazing survey paper, which inspired this series. Up next, we will take a stab at a more efficient implementation of NTT. Here is a hint of what we will look at: notice that in our current implementation, there exist many % operators performing modular reduction in the finite field $\mathbb{Z}_q$. The reductions themselves are not that easy to do, and are also hard to do in a secure way. We will be checking out a few strategies to make the reductions behave nicer. See you next time!

0 views
Higashi 2 years ago

Announcing my new blog

If you have known me for a while, you might have come across my cryptography-related blog posts either on Medium or on my personal blog. Unfortunately, I forgot to update my credit card info and hence didn't renew the domain :/ And guess what? Someone else bought the domain, and now I can never have it back. I tried negotiating a good price for it, but that looks very unlikely. Therefore, I have acquired a new domain and ported over most of the recent blog posts. I will gradually be moving over some of the Chinese blog posts as well, under the 中文博客 tab. Have fun exploring! Do let me know if you notice any weirdness with LaTeX rendering, as I'm switching from KaTeX to MathJax.

0 views
Higashi 2 years ago

A Gentle Introduction of NTT - Part II: The Number Theoretic Transform

Previously, we took a look at the problem of polynomial multiplication. Specifically, we saw that we can view polynomial multiplication as a form of convolution. To make things easier to compute, we often define a polynomial modulus, turning the structure into a ring. The modulus defining the ring in our context is often special, making the convolution either circular or negative-wrapped. At the end of the last post, we saw a naive approach to speeding up the computation of convolutions - converting the polynomials into evaluation form, so that convolution is as simple as coordinate-wise multiplication. We ended up with the problem that the conversion between coefficient form and evaluation form is expensive and becomes the new bottleneck. In this post, we will look at an efficient solution: the Number Theoretic Transform, or NTT for short. Images used in this post are sourced from this great survey on NTT - Number Theoretic Transform and Its Applications in Lattice-based Cryptosystems: A Survey.

So, what is NTT? If we look at Wikipedia, NTT is a particular flavor of the Discrete Fourier Transform (DFT) over a finite field. I could write another 2 to 3 pages worth of content diving into the Fourier Transform, its continuous and discrete variants, and the Fast Fourier Transform (FFT) algorithm. But I think it might not be the best use of our time in this particular post about NTT. Therefore, if you want to know more about how DFT works, I'd highly recommend checking out this amazing tutorial by 3Blue1Brown, and this follow-up video by Reducible. Instead, we will take a Computer Science, or programming, perspective on what NTT is.

Given a length-$d$ vector $\mathbf{a} = \{a_0, a_1, \dots, a_{d-1}\}$, a carefully selected prime-modulus finite field $\mathbb{Z}_q$, and a primitive $d$-th root of unity $\omega_d$ (meaning that $\omega_d^d \equiv 1 \ (\text{mod } q)$, and that the smallest positive $i$ such that $\omega_d^i \equiv 1$ is $d$), NTT defines the following transformation:

$\hat{a}_j = \sum_{i=0}^{d-1} a_i \omega_d^{ij} \ (\text{mod } q), \quad j = 0, 1, \dots, d-1$

Essentially, for a length-$d$ vector, NTT transforms that vector into another length-$d$ vector, where each entry is some kind of summation of all terms in the original vector, mixed with powers of the root of unity $\omega_d$. Although this NTT expression looks intimidating at first glance, we can find very interesting similarities between it and something we have seen before. Let's compute the first element of $NTT(\mathbf{a})$:

$\hat{a}_0 = \sum_{i=0}^{d-1} a_i \omega_d^{0 \cdot i} = a_0 + a_1 + \dots + a_{d-1}$

Similarly, let's compute the second element:

$\hat{a}_1 = \sum_{i=0}^{d-1} a_i \omega_d^{i} = a_0 + a_1 \omega_d + a_2 \omega_d^2 + \dots + a_{d-1} \omega_d^{d-1}$

If we read through the previous blog post in this series, we immediately notice that these expressions are no different from evaluating the polynomial $a(x)$ at the points $\omega_d^0$ and $\omega_d^1$. We can easily see that the other entries of $\hat{\mathbf{a}}$ correspond to evaluations of $a(x)$ at the remaining powers $\omega_d^j$. Since $\omega_d$ is a primitive $d$-th root of unity, we know that all its powers within $[0, d-1]$ are distinct. The proof is simple: assume the powers of $\omega_d$ are not unique; then we can find two different powers $\omega_d^i = \omega_d^j$ where $i \ne j, 0 \le i < j \le d-1$. If this is true, we can derive $\omega_d^{j-i} = \frac{\omega_d^j}{\omega_d^i} = 1$. However, this clearly contradicts the primitive root requirement, where the smallest positive power of $\omega_d$ that equals 1 is $d$; hence such a collision cannot exist. Therefore, when we interpret NTT as polynomial evaluation at the powers of $\omega_d$, we know that all the evaluation points are distinct!
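In code, the transformation (and its inverse, which we define next) is a direct double loop - a minimal illustrative sketch:

```python
# Naive O(d^2) NTT: evaluate a(x) at all powers of omega mod q.
def ntt_naive(a, omega, q):
    d = len(a)
    return [sum(a[i] * pow(omega, i * j, q) for i in range(d)) % q
            for j in range(d)]

# Naive inverse: the same transform with omega^{-1}, scaled by d^{-1} mod q.
def intt_naive(a_hat, omega, q):
    d = len(a_hat)
    scale = pow(d, -1, q)
    return [scale * sum(a_hat[j] * pow(omega, -i * j, q) for j in range(d)) % q
            for i in range(d)]
```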
We know that NTT is no different from evaluating a degree-$(d-1)$ polynomial at $d$ distinct points. In the previous post, we saw the technique of polynomial interpolation, which recovers the original polynomial from $d$ distinct evaluations. Similarly, there is an inverse transformation, iNTT, that takes us back from the NTT (evaluation) form to the regular coefficient form. The Inverse Number Theoretic Transform is defined quite beautifully:

$a_i = d^{-1} \sum_{j=0}^{d-1} \hat{a}_j \omega_d^{-ij} \ (\text{mod } q)$

Basically, we take the inverse of the original $d$-th root of unity, $\omega_d^{-1}$, and apply the exact same NTT algorithm to the NTT form, plus a minor tweak where we apply a scaling factor $d^{-1}$ to properly scale the coefficients down. For now, we are not going to be super concerned with what the inverse NTT does internally. We just assume this subroutine magically works and converts the NTT form back to the coefficient form.

Now that we have seen the naive implementations of NTT and inverse NTT, something catches our attention: the code for NTT is again a double iterative summation over all coefficients, which immediately screams $O(d^2)$ complexity. Our sole intention of going to the NTT domain was to avoid the $O(d^2)$ cost of convolution; therefore, we have to figure out ways to speed up this routine, or we will get no benefit at all. Fortunately, this is where the magic of the Fast Fourier Transform (FFT) starts to shine. We are not going to go into the details of what FFT is, but let's pay attention to the following expansion and grouping of the NTT algorithm. To make things more readable, we use $NTT_{\omega_k}$ to represent the NTT algorithm for a dimension-$k$ vector using the primitive $k$-th root of unity $\omega_k$:

$NTT_{\omega_d}(\mathbf{a})_j = NTT_{\omega_{d/2}}(\mathbf{a}^\text{even})_j + \omega_d^j \cdot NTT_{\omega_{d/2}}(\mathbf{a}^\text{odd})_j$

Essentially, the expansion above separates the even-indexed coefficients of $\mathbf{a}$ from the odd ones, and the expression becomes two shorter NTT subroutines that use the primitive $(d/2)$-th root of unity $\omega_{d/2} = \omega_d^2$. We have thus found ourselves a recurrence relation! To make things nice, we will always work with some degree $d$ that's a power of 2, so we are sure that $d$ will be splittable all the way down to 1. And at the bottom of the recursion, the NTT of a length-1 vector is just itself - think of evaluating a degree-0 polynomial, which is just the constant coefficient. Looking at the recurrence relation, at each layer we are doing one addition plus one multiplication by a constant $\omega_d^j$ per output, so it's $O(d)$ work to perform the coordinate-wise additions and scalings. We break the task into 2 smaller tasks, each half the size, therefore going down $O(\log{d})$ levels. This gets us $O(d \log{d})$ total runtime complexity, which is a great improvement over $O(d^2)$.

However, the expression above only applies for $0 \le j < \frac{d}{2}$; once $j$ exceeds the boundary, the behavior becomes undefined (you cannot get the $(d/2+1)$-th index of an NTT that's only $d/2$ long). But what happens for points beyond $\frac{d}{2}$? Let's give it a try and look closely at the original expression evaluated at the point $j' = \frac{d}{2} + j$:

$NTT_{\omega_d}(\mathbf{a})_{j'} = NTT_{\omega_{d/2}}(\mathbf{a}^\text{even})_j + \omega_d^{\frac{d}{2}} \omega_d^{j} \cdot NTT_{\omega_{d/2}}(\mathbf{a}^\text{odd})_j$

One crucial observation here is that since $\omega_d^d \equiv 1$, we know that $\omega_d^\frac{d}{2} \equiv -1 \ (\text{mod } q)$. This is because the two square roots of 1 are 1 and -1, and since $\omega_d^\frac{d}{2} \ne \omega_d^d$ by the definition of a primitive root of unity, it has to be the other square root, which is -1. Then we can replace every occurrence of $\omega_d^\frac{d}{2}$ with simply -1, and the NTT at point $\frac{d}{2} + j$ magically equals the NTT at point $j$ with the sign before the odd term flipped. With this new finding in mind, here is the plan for implementing the recursive NTT:

1. Define the base case (the NTT of a single element is itself).
2. Break the input vector into even and odd parts, and recursively call NTT on them with the $\omega$ term squared.
3. For $0 \le j < \frac{d}{2}$, fill the output array with $NTT_{\omega_{d/2}}(\mathbf{a}^\text{even})_j + \omega_d^j NTT_{\omega_{d/2}}(\mathbf{a}^\text{odd})_j$.
4. For $\frac{d}{2} \le j < d$, fill the output array with $NTT_{\omega_{d/2}}(\mathbf{a}^\text{even})_{j - \frac{d}{2}} - \omega_d^{j - \frac{d}{2}} NTT_{\omega_{d/2}}(\mathbf{a}^\text{odd})_{j - \frac{d}{2}}$.

Note that steps 3 and 4 can be combined into a single loop to avoid recomputing the two terms. Now, let's implement the improved subroutine to supercharge NTT.
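A minimal sketch following this plan (illustrative, unoptimized):

```python
# Recursive radix-2 NTT; d = len(a) must be a power of two.
def ntt_recursive(a, omega, q):
    d = len(a)
    if d == 1:
        return a[:]                      # base case
    even = ntt_recursive(a[0::2], omega * omega % q, q)
    odd = ntt_recursive(a[1::2], omega * omega % q, q)
    out = [0] * d
    for j in range(d // 2):
        t = pow(omega, j, q) * odd[j] % q
        out[j] = (even[j] + t) % q                # steps 3 and 4 combined:
        out[j + d // 2] = (even[j] - t) % q       # omega^(d/2 + j) = -omega^j
    return out
```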
This recursive algorithm is also called a radix-2 Cooley-Tukey NTT algorithm, named after Cooley and Tukey, who (re)discovered the FFT algorithm in 1965. After completing the recursive version of the NTT algorithm, we ask ourselves again: can we do better? The answer is always: yes! In fact, although recursion is a nice way to express the recurrence relation, it isn't the most efficient way, considering that the nested function calls cost us in the call stack and in the recomputation of certain values. So, can we come up with an iterative version of NTT? In order to do so, we are going to follow the instructions given in Chapter 30.3 of Introduction to Algorithms.

First, let's expand the recursive call hierarchy for an example input of length 8. From the expansion, we can see that at the bottom, pairs of two elements are grouped together: $(a_0, a_4), (a_2, a_6), (a_1, a_5), (a_3, a_7)$. Interestingly, the difference between the indices of the two elements in each pair is always 4, half the total length. Walking up the tree, every parent node combines the elements of its children in two different ways - $\text{even} \pm \omega^j \cdot \text{odd}$ - thus generating a result vector twice the size of its children. This kind of structure can be depicted with a diagram that shows how the two parts of the input (even and odd) are combined to form two outputs. Given the shape of the data flow, this operation is commonly referred to as the butterfly operation. In particular, what we have here is the Cooley-Tukey butterfly.

Now we have two ideas: recursion is really an iterative grouping of elements in a particular order, and the "grouping" operation is the Cooley-Tukey butterfly operation. Combining both ideas, we can sketch out a "circuit" design for the NTT procedure. The iterative version of the NTT algorithm can then be described by interpreting the circuit diagram from left to right:

1. Reorganize the input in this special interleaved order.
2. Apply the first level of CT butterfly operations with $\omega_2^j$ (the 2nd root of unity, as the recursive base deals with arrays of length 2).
3. Apply the second level of CT butterfly operations to the previous result with the 4-th root of unity $\omega_4^j$. Note that this is equivalent to the previous step, except we double the stride of the butterfly operation.
4. Iteratively apply the CT butterfly operation until the end. In the last layer, the stride length should be half the total vector length (4 in our case).

Now, this procedure requires a weird shuffling of the input to match the base level of the recursion. But interestingly, this shuffle order is just the bit-reversal of the index. Why? Because the last bit of a number decides whether it's even or odd, and in the recursive step we separate the elements of the array based on their parity - the last bit. At the next step, we look at their second-to-last bit and group them accordingly. Another interesting way to understand this is to think of it as sorting the indices of the array by looking at the least significant bit first, then progressing all the way to the most significant bit - hence the sorted order is just the bit-reversal of the original order. With the bit-reversal helper in place, we can move on to implementing the iterative NTT.
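Here is a sketch of the bit-reversal helper and the iterative NTT (illustrative names, matching the circuit described above):

```python
# Permute a list by bit-reversing each index (len(a) must be a power of two).
def bit_reverse(a):
    d = len(a)
    bits = d.bit_length() - 1
    return [a[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(d)]

# Iterative Cooley-Tukey NTT: shuffle, then butterfly layer by layer.
def ntt_iterative(a, omega, q):
    d = len(a)
    out = bit_reverse(a)
    length = 2                                # current butterfly block size
    while length <= d:
        w_len = pow(omega, d // length, q)    # a primitive length-th root
        for start in range(0, d, length):
            w = 1
            for j in range(length // 2):
                u = out[start + j]
                t = w * out[start + j + length // 2] % q
                out[start + j] = (u + t) % q              # CT butterfly
                out[start + j + length // 2] = (u - t) % q
                w = w * w_len % q
        length *= 2
    return out
```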
Now that we have mastered the Cooley-Tukey butterfly network, what about the inverse? The inverse transformation uses a circuit gadget that reverses (or unmixes) the result of a CT butterfly. This gadget is called the Gentleman-Sande butterfly. The inverse transform exactly reverses the entire CT butterfly network, thus forming the GS butterfly network. Essentially, for the iNTT circuit, we perform the Gentleman-Sande butterfly operation to undo the Cooley-Tukey butterfly operation, layer by layer, from right to left in the perspective of the NTT circuit. We code up the iNTT algorithm below.
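A sketch of the inverse, mirroring the forward circuit (it reuses bit_reverse from above; the layers are undone from the last one back):

```python
# Iterative Gentleman-Sande iNTT: undo each CT layer in reverse order,
# then bit-reverse and scale by d^{-1} (the per-layer halving is deferred).
def intt_iterative(a_hat, omega, q):
    d = len(a_hat)
    out = list(a_hat)
    omega_inv = pow(omega, -1, q)
    length = d
    while length >= 2:
        w_len = pow(omega_inv, d // length, q)
        for start in range(0, d, length):
            w = 1
            for j in range(length // 2):
                u = out[start + j]
                v = out[start + j + length // 2]
                out[start + j] = (u + v) % q               # GS butterfly
                out[start + j + length // 2] = (u - v) * w % q
                w = w * w_len % q
        length //= 2
    scale = pow(d, -1, q)
    return [x * scale % q for x in bit_reverse(out)]
```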
Notice the very subtle differences compared to the forward algorithm. In fact, because they are so similar, when implemented in hardware they can often share most of the circuitry, plus or minus some tweaks to adjust the omega values and the strides. And finally… we are here, with an efficient NTT/iNTT pair that takes a polynomial into the NTT domain and round-trips it back. If you've been able to follow this post up to this point, big congrats! As this is already a big achievement ;)

Now that we have defined the efficient subroutines that perform NTT and iNTT, the next question naturally arises: how do we use this latest and greatest piece of work to perform the calculations we really care about - the circular convolution and negative-wrapped convolution of two polynomials $P, Q$? Fortunately, we can quickly use the formula we derived at the end of the last post. Here we use $\mathbf{p}, \mathbf{q}$ to represent the vectors of coefficients of the polynomials $P, Q$, each of size $d$:

1. Extend the vectors $\mathbf{p}, \mathbf{q}$ to length $2d$ by filling the extended part with zeros: $\mathbf{p}' = [p_0, \dots, p_{d-1}, 0, \dots, 0], \mathbf{q}' = [q_0, \dots, q_{d-1}, 0, \dots, 0]$.
2. Perform $NTT$ on the extended vectors $\mathbf{p}', \mathbf{q}'$ and obtain $\hat{\mathbf{p}'}, \hat{\mathbf{q}'}$.
3. Multiply $\hat{\mathbf{p}'}, \hat{\mathbf{q}'}$ component-wise.
4. Apply $iNTT$ to get the result: $\mathbf{r} = iNTT(\hat{\mathbf{p}'} \otimes \hat{\mathbf{q}'})$, where $\otimes$ represents coordinate-wise multiplication.
5. Interpret the result $\mathbf{r}$ as a degree-$(2d-2)$ polynomial, and reduce it by the polynomial modulus $x^d \pm 1$. The coefficients of the reduced polynomial $\mathbf{r}_\text{red}$ are the final result.

And… it works! However, although it works great, the additional reduction operation always feels a bit cumbersome - a little out of place and inelegant. What if we could somehow do the reduction as part of the NTT? Let's first expand on how we actually perform the reduction operation. Consider the evaluation form of a polynomial $P$ before and after reduction:

$P_\text{red}(x) = P(x) - k(x) \cdot \phi(x)$

Here, $k(x)$ is some polynomial which makes sure the degree of $P_\text{red}$ is smaller than the degree of $\phi$. We can see that reducing a polynomial by another polynomial is really subtracting copies of the modulus from the original polynomial. Here is an interesting idea: if the point $x$ where we evaluate happens to be a root of $\phi$, i.e., $\phi(x) = 0$, then in the evaluation form the reduction has no effect! This means that, even without an explicit reduction step, the resulting polynomial, when converted back to coefficient form, is "automagically" reduced. This is nice, but how do we apply this idea to the entire NTT procedure? In two steps:

1. We find $d$ distinct roots of the reduction polynomial $x^d \pm 1$.
2. We evaluate the degree-$(d-1)$ polynomial at those $d$ distinct points. Profit - really, we no longer need the reduction step, and the result will be auto-reduced.

This is good news. And there is even better news: since the coefficients will be automatically reduced, and thus have degree less than $d$, we don't need to evaluate at $2d$ points anymore. Instead, we can always work with $d$ points, as they uniquely define any degree-$(d-1)$ polynomial. Next, we can plug in the reduction polynomial $x^d - 1$ for the case of circular convolution, and see whether we can make this auto-reduction magic happen.

Right now, all we know is that $d$ is a power of two, so it can be divided in half many times before reaching 1. Before we introduce any tricks, let's try to expand the terms and look for the roots ourselves. At some point we will have fully expanded the leftmost term down to $x-1$ and cannot reduce it any further; the remaining terms are higher-degree addition terms, and it looks like we cannot make any progress. However, since we know that $\omega_d^\frac{d}{2} \equiv -1$, we can rewrite the term $x^\frac{d}{2} + 1 \rightarrow x^\frac{d}{2} - \omega_d^\frac{d}{2}$, which allows us to split further. By recursively doing this reduction and replacing positive terms with $-\omega_d^i$ powers, we are able to split the polynomial and derive a tree structure. On the leaf layer of the tree, we have essentially factored the polynomial modulus $x^d - 1$ (written as $x^n - 1$ in the diagram) into $d$ degree-1 terms. Each term thus represents a root of the reduction polynomial, which is a point we want NTT to evaluate at in order to avoid the additional reduction step. Note that the diagram says something about a "CRT map". Don't worry about what it is for now, as we will discuss the similarities between CRT mapping, polynomial reduction, and polynomial evaluation in more detail in the next post. But now we have the comprehensive list of roots that we want to evaluate at. Since we know that $\omega_d^\frac{d}{2} = -1$, and obviously $\omega_d^0 = 1$, this list is exactly the entire set generated by the $d$-th primitive root of unity - which is what our existing NTT routine has been evaluating at all along. TL;DR: no change is needed! By applying the plain NTT operation, we already obtain the evaluations of the polynomial at the roots of the polynomial modulus, so we can just do the component-wise multiplication in the regular length-$d$ NTT form and convert it back. Here is the implementation:
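A sketch, reusing ntt_iterative / intt_iterative from above:

```python
# Multiplication mod x^d - 1 (circular convolution), with no zero padding:
# omega must be a primitive d-th root of unity mod q.
def multiply_cc(p, s, omega, q):
    p_hat = ntt_iterative(p, omega, q)
    s_hat = ntt_iterative(s, omega, q)
    r_hat = [x * y % q for x, y in zip(p_hat, s_hat)]
    return intt_iterative(r_hat, omega, q)
```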
Now let's try the same trick on the polynomial modulus $x^d + 1$ for the NWC case, applying a slightly different replacement trick. (We omit the full reduction and only show the first few splits.) However, one can quickly notice a discrepancy between this reduction tree and the previous one: when expanding $x^d - 1$, the degrees of $x$ and $\omega_d$ are always the same - always $x^\frac{d}{2^k} \pm \omega_d^\frac{d}{2^k}$. However, when expanding $x^d + 1$, the degree of $\omega_d$ is always half that of $x$. Since we know that $d$ can only be divided $\log_2(d)$ times, this means that at the second layer from the bottom, the $\omega_d$ term already has an odd degree - leaving terms like $x^2 \pm \omega_d$. Even after substituting it with its negated form, the degree is still odd, and it doesn't allow us to split further. If we cannot split the polynomial into a product of degree-1 terms, then we cannot directly infer the roots from the expression itself. It looks like we've hit a dead end. But let's not give up just yet. There are actually two options to move forward:

1. Work with this reduction and see whether there are some properties we can use to help with NWC.
2. Introduce some more structure and assumptions to the problem to make the last split feasible.

We are going to look at Approach 2 this time, and will try to cover Approach 1 in the next post. How can we further split the term $x^2 - \omega_d$? Obviously, if we can find the square root of $\omega_d$ in the underlying field $\mathbb{Z}_q$, then we solve the problem. Going along that thought, let's assume there exists a $2d$-th primitive root of unity $\psi_{2d}$ such that $\psi_{2d}^2 = \omega_d$; then we can do the last split as $x^2 - \omega_d \rightarrow (x + \psi_{2d})(x - \psi_{2d})$. Using $\psi_{2d}$ as the new order-$2d$ generator, we can rewrite the entire reduction tree. On the bottom layer, we see that the roots are essentially the powers of $\psi_{2d}$ that split the odd powers of $\omega_d$. Therefore, the $d$ odd powers $\{\psi_{2d}^{2i+1}\}_{i \in [d]}$ uniquely define the evaluation points we need to make NWC happen without an additional reduction. (The even powers of $\psi_{2d}$ are also powers of $\omega_d$, so they aren't present in the last layer.) Now let's formulate the NTT algorithm that uses the odd powers of $\psi_{2d}$:

$\mathbf{NTT}^\text{NWC}(\mathbf{a})_j = \sum_{i=0}^{d-1} a_i \psi_{2d}^{(2j+1)i} = \sum_{i=0}^{d-1} (a_i \psi_{2d}^i) \, \omega_d^{ij}$

We can see that evaluating the polynomial at the odd powers of $\psi_{2d}$ is equivalent to multiplying the coefficients of the polynomial by powers of $\psi_{2d}$ beforehand, and then feeding them into the regular length-$d$ NTT algorithm that uses $\omega_d$. Conversely, the inverse transformation first inverts via the regular iNTT, and then applies the inverse $\psi_{2d}^{-i}$ to each coefficient correspondingly. At last, let's implement this new version of the NTT:
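A sketch of the negacyclic (NWC) variant, again reusing the iterative routines from above:

```python
# NWC NTT: twist by powers of psi (psi^2 = omega, a 2d-th primitive root),
# then run the regular length-d NTT; the inverse untwists at the end.
def ntt_nwc(a, psi, q):
    d = len(a)
    twisted = [a[i] * pow(psi, i, q) % q for i in range(d)]
    return ntt_iterative(twisted, pow(psi, 2, q), q)

def intt_nwc(a_hat, psi, q):
    d = len(a_hat)
    a = intt_iterative(a_hat, pow(psi, 2, q), q)
    return [a[i] * pow(psi, -i, q) % q for i in range(d)]

# Multiplication mod x^d + 1, with no extra reduction step.
def multiply_nwc(p, s, psi, q):
    r_hat = [x * y % q
             for x, y in zip(ntt_nwc(p, psi, q), ntt_nwc(s, psi, q))]
    return intt_nwc(r_hat, psi, q)
```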
In order for this approach to work, a crucial requirement is that such a $\psi_{2d}$ exists in the underlying finite field $\mathbb{Z}_q$. This immediately means that, even though we are not doing length-$2d$ NTTs, to multiply degree-$d$ polynomials without reduction, the underlying field has to be large enough that we could do length-$2d$ NTTs. We can safely ignore this quirk for now, but it will come back to haunt us when we look into real-life lattice-based cryptographic schemes, such as Kyber. No worries for now - we will get you ready for that in the next post ;)

First of all, congratulations to you if you have made it to this part of the blog post! I'm glad that you have finally understood the inner workings (and beauty) of NTT! To conclude, we have covered the following in this post:

- What NTT really is - its summation form, as well as understanding it as evaluating and interpolating polynomials.
- How to efficiently run NTT - using the Cooley-Tukey and Gentleman-Sande butterfly networks, as well as running them iteratively via a bit-reversal shuffle.
- Lastly, how to use NTT to perform convolution in the cyclic and negative-wrapped scenarios. More importantly, how to skip the extra reduction step by picking the right generators (in many places, $\omega_d$ and $\psi_{2d}$ are also referred to as "twiddle factors").

In our next post, we will dive into real-life use cases of NTT, such as the CRYSTALS-Kyber cryptosystem that has been standardized by NIST as the next-generation post-quantum key-encapsulation mechanism. We will also take a look at tips and tricks to make NTT run faster and with a smaller footprint. See you all next time!

0 views
Higashi 2 years ago

A Gentle Introduction of NTT - Part I: Polynomial Multiplications as Convolutions

Polynomials are everywhere. In fact, in Computer Science, they are really just fancy lists of numbers that have a particular way of doing arithmetic. For example, adding two polynomials is as simple as summing up their coefficients. Multiplying a polynomial by a constant is just multiplying every coefficient by that constant. However, when we want to multiply polynomials, things get slightly more complicated. Suppose we have two degree-$(d-1)$ polynomials, $P(x), Q(x)$: \(P(x) = \sum_{i=0}^{d-1} p_i x^i\) \(Q(x) = \sum_{i=0}^{d-1} q_i x^i\) The classical way of performing polynomial multiplication is to expand out the summation, and multiply everything in $P(x)$ with everything in $Q(x)$ to form a degree-$(2d-2)$ polynomial: \(P(x) \cdot Q(x) = \sum_{k=0}^{2d-2} c_k x^k\) where $c_k$ is the sum of products of coefficients of $P(x), Q(x)$ such that $c_k = \sum_{i+j=k} p_i q_j$. This summation is commonly known as the convolution of the coefficients of $P, Q$.

Now we can code up a simple implementation for the naive polynomial multiplication (sketched at the end of this passage). Here we represent polynomials as lists of their coefficients, with the first element corresponding to the lowest-degree coefficient (the constant), and the last element to the highest degree.

However, we are still far from finished. Next, we will introduce a few interesting tweaks/constraints to the polynomials to make things work a little more conveniently. In the field of cryptography, we typically work with finite fields, and the most common one is simply $\mathbb{Z}_q$, i.e., integers modulo a prime number $q$. We define the polynomials $P, Q$ described above with coefficients $p_i, q_i \in \mathbb{Z}_q$ as belonging to the set $\mathbb{Z}_q [x]$. The interesting thing about working in a finite field is that the polynomial coefficients "wrap around" when being multiplied. So regardless of how much polynomial arithmetic we perform, the coefficients of the polynomial stay bounded in a fixed range, which is very attractive from an implementation perspective.

However, one can quickly notice an item that is "not so positive" in the current setting - the degree of the polynomial will only grow larger and larger as we perform more multiplications. That means we need longer arrays to store coefficients and more complex convolutions every time a multiplication is performed. It would be great if the degree of the polynomials could "wrap around" just like the coefficients. As it turns out, there's another algebraic structure exactly for that - polynomial quotient rings. Just like in the base field $\mathbb{Z}_q$, where we take modulo $q$ after every arithmetic operation, we can also take modulo some polynomial $\phi(x)$ after every polynomial operation. The resulting polynomial's degree will never be larger than or equal to the degree of $\phi(x)$. We call such a structure $\mathbb{Z}_q [x] / (\phi(x))$. In the context of lattice-based cryptography, $\phi(x)$ is chosen to be a polynomial looking like either $x^d - 1$ or $x^d + 1$. This kind of polynomial is usually referred to as a cyclotomic polynomial (strictly speaking, $x^d + 1$ is the $2d$-th cyclotomic polynomial when $d$ is a power of two, while $x^d - 1$ is a product of cyclotomic polynomials). Next we will examine how the two choices of the cyclotomic polynomial influence the "wrap around" behavior of the resulting ring. Before we look into the cyclotomic polynomial $\phi(x)$, let's see how to manually perform a polynomial multiplication over a polynomial ring.
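First, here is what the naive implementation mentioned above might look like (a minimal sketch of my own; coefficients are reduced modulo a toy prime `q`):

```python
# A minimal sketch of schoolbook polynomial multiplication over Z_q.
# Polynomials are coefficient lists, lowest degree first.
def poly_mul_naive(p, qq, q=17):
    r = [0] * (len(p) + len(qq) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(qq):
            # c_k accumulates all products p_i * q_j with i + j = k
            r[i + j] = (r[i + j] + pi * qj) % q
    return r

# (1 + 2x)(3 + x) = 3 + 7x + 2x^2
assert poly_mul_naive([1, 2], [3, 1]) == [3, 7, 2]
```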
The recipe is actually very simple: we first perform the normal polynomial multiplication, resulting in a polynomial of high degree; then, we perform a reduction operation where we take the high-degree polynomial modulo the polynomial modulus, and obtain a low-degree result. Let's look at a particular example: in $\mathbb{Z}_q[x] / (x^4 + 1)$, we have $(x^3 + 1)(x + 1) = x^4 + x^3 + x + 1 \equiv x^3 + x$, since $x^4 \equiv -1$. As we see from the example, the multiplication part is straightforward cross-coefficient products. The reduction part iteratively removes copies of $x^4 + 1$ from the high-degree expression until the resulting degree is smaller than that of the modulus.

Now, let's take a look at the case when $\phi(x) = x^d - 1$. Taking any polynomial modulo $x^d - 1$ is equivalent to removing multiples of $x^d - 1$ from the polynomial until the resulting polynomial has degree lower than $d$. For example, a term like $c_d x^d$ becomes $c_d$ after the reduction, as we remove $c_d$ copies of $x^d - 1$, and the negative-one term causes $c_d$ to shift from the degree-$d$ coefficient to the degree-0 coefficient. Similarly, the term $c_{d+1} x^{d+1}$ is reduced to $c_{d+1} x$, due to the negative-one term in the modulus. We can gradually see that any term exceeding degree $d-1$ will essentially "wrap around" and become a smaller-degree term.

With such behavior, the ring formed by $\mathbb{Z}_q [x] / (x^d - 1)$ has a very interesting property called cyclic convolution. To illustrate this property, let's compute the convolution of two degree-$(d-1)$ polynomials $P, Q$ again. Upon obtaining the resulting degree-$(2d-2)$ polynomial $\sum_{i=0}^{2d-2} c_i x^i$, we take it modulo $x^d - 1$, which essentially means summing the degree-$i$ coefficients for $i > d-1$ onto the degree-$(i-d)$ coefficients: \(r_k = c_k + c_{k+d}, \quad 0 \le k \le d-1\) (treating $c_j = 0$ for $j > 2d-2$). As we can see, it is almost as if the terms higher than the modulus degree $d$ show up at the other end of the table and stack on top of the lower coefficients, hence the name cyclic.

Conversely, the other modulus $\phi(x) = x^d + 1$ results in what is called negative wrapped convolution, or sometimes negacyclic convolution. Because the sign of the one term is flipped, now when removing copies of the modulus from a polynomial, what wraps around is the negation of the high-degree coefficient that got reduced. Therefore, the overall expression in the NWC case is: \(r_k = c_k - c_{k+d}, \quad 0 \le k \le d-1\) From the test results, we can see that both implementations (CC, NWC) lead to a way of multiplying polynomials while preserving the underlying field size and order (a sketch of both follows below).
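A minimal sketch of both wrap-around behaviors, reusing `poly_mul_naive` from the sketch above:

```python
# Cyclic (mod x^d - 1) and negacyclic (mod x^d + 1) convolution as
# naive multiplication followed by the wrap-around reduction.
def convolve(p, qq, sign, q=17):
    d = len(p)
    c = poly_mul_naive(p, qq, q) + [0]   # degree 2d-2; pad to length 2d
    # fold coefficient k+d back onto coefficient k; sign is +1 for
    # x^d - 1 (cyclic) and -1 for x^d + 1 (negacyclic)
    return [(c[k] + sign * c[k + d]) % q for k in range(d)]

cyclic = lambda p, qq, q=17: convolve(p, qq, +1, q)
negacyclic = lambda p, qq, q=17: convolve(p, qq, -1, q)
```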
Throughout the post, we have been working with the naive approach of multiplying polynomials, i.e., convolution of two dimension-$d$ lists. Although it works, we can immediately notice a performance issue with convolution-based approaches - they require iterating over all pairs of coefficients, which results in a runtime complexity of $O(d^2)$. In the example of Kyber, where the polynomials have length 256 (degree up to 255), doing a naive multiplication between elements would require as many as $256^2 = 65536$ coefficient multiplications! It is obvious that a quadratic cost like this will not scale.

Fortunately enough, polynomials can be expressed not only in their normal (coefficient) form like the summation above, but also in evaluation form. Consider the same polynomials $P, Q$ as above. Instead of describing the coefficients $p_i, q_i$, we can evaluate the polynomials at distinct points $x_i$: \(P(x_0), P(x_1), \dots, P(x_d)\) \(Q(x_0), Q(x_1), \dots, Q(x_d)\) As long as we have $d+1$ distinct evaluations of a polynomial of degree at most $d$, we can efficiently recover the polynomial's coefficient form through a process called Lagrange Interpolation. A common choice is to pick the $x_i$ to just be the numbers from 0 to $d$, so that we have $P(0), P(1), \dots, P(d)$.

Now, coming back to the topic of polynomial multiplication, we see that, in the evaluation form, multiplication is strictly component-wise: \((P \cdot Q)(x_i) = P(x_i) \cdot Q(x_i)\) This means the quadratic cost of polynomial multiplication can thus be reduced to $O(d)$ linear cost (one multiplication per evaluation point)! Next, we implement a fast polynomial multiplication in the evaluation form.
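The post's actual snippet isn't reproduced here, but a sketch in the same spirit - NumPy for evaluation, SciPy's `lagrange` as the interpolation black box, and plain integers rather than $\mathbb{Z}_q$ - might look like:

```python
# A sketch of evaluation-form multiplication. Works over the
# rationals for simplicity; numerically unstable for large degrees.
import numpy as np
from scipy.interpolate import lagrange

def poly_mul_eval(p, qq):
    d = len(p)                        # both inputs have degree d-1
    xs = np.arange(2 * d - 1)         # 2d-1 points recover degree 2d-2
    pe = np.polyval(p[::-1], xs)      # np.polyval wants highest degree first
    qe = np.polyval(qq[::-1], xs)
    return lagrange(xs, pe * qe)      # component-wise product, interpolated

print(poly_mul_eval([1, 2], [3, 1]))  # 2 x^2 + 7 x + 3
```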
🎉🎉🎉 Tada! We can now perform multiplication in the evaluation form, which indeed only requires linear $O(d)$ cost. Once we have the evaluation-form implementation for the naive convolution, the next natural question is how to deal with cyclic/negative wrapped convolutions. There is a super simple approach we can follow: we perform the naive multiplication in evaluation form and receive a degree-$(2d-2)$ polynomial; then we reduce this polynomial by our quotient polynomial $x^d \pm 1$ and get the right result. In fact, this approach is already good enough, because the reduction at the end is $O(d)$ - just a linear scan over the resulting list, performing some "wrap around" field additions or subtractions.

The approaches we have described above have a few obvious catches:

- 2x more evaluation points: Since the final polynomial has degree $2d-2$ when we multiply two degree-$(d-1)$ polynomials, we need twice as many ($2d-1$, to be exact) evaluation points in order to fully recover it back to its coefficient form. This means we have to evaluate $P, Q$ at $2d-1$ points instead of $d$ to begin with.
- Cost of converting to/from coefficient form: The cost we saved in multiplication must show up somewhere else. Indeed, it goes into how we efficiently evaluate polynomials (coefficient form -> evaluation form) and interpolate polynomials (evaluation form -> coefficient form).

The first catch is relatively fine - it is still linear in the degree of the polynomials, which is much better than the quadratic cost of the naive approach. The second catch is where things get slightly more problematic, as evaluating a polynomial at a single point already takes at least $O(d)$ time even with commonly used optimizations. (One such optimization is Horner's method, which evaluates a polynomial in $O(d)$ time.) Besides, in this post the entire interpolation is done via the black-box method provided by SciPy, and we have no idea what complexity it has! It turns out that this is the make-or-break moment - if we can find a way to efficiently transition between the two forms, we can solve the problem of efficient polynomial multiplication.

One easy hack to relieve the cost is to keep polynomials in the evaluation form throughout the computation - in fact, maybe we should never go back to coefficient form unless we have to. This ensures a minimal number of conversions between the two forms. But this is not a topic we want to go into yet… The systematic and efficient way to perform this conversion is a procedure called the Number Theoretic Transform, or NTT for short.

Up until this point, we have paved the foundation for understanding why we need NTT and what NTT can help us with. In our next post, we will dive right into NTT and see how it helps us with the aforementioned problem.

0 views
Higashi 3 years ago

PKPDSS: Permuted-Kernel-Problem-based Digital Signature Scheme

Based on: PKP-IDS, a permutation-based identification protocol. Idea: PKP is a problem where one is given a random matrix $A$ and a vector $v$, and is asked to find a permutation $\pi$ such that $A v_\pi = 0$, i.e., $v_\pi$ lies in the right kernel of $A$. Finding such a permutation $\pi$ is an NP-hard problem. However, we can use a 5-round public-coin interactive protocol to prove knowledge of the secret permutation $\pi$. We will explore how PKP-IDS works, and then follow Beullens et al. 2018 to compress PKP-IDS into PKP-DSS.

The keygen of PKP-IDS is the following. First, we sample a random matrix $A$ and a secret permutation $\pi$. Then we find a right-kernel vector $w \in \ker(A)$ such that $Aw = 0$. We apply the inverse permutation to $w$ and get $v = w_{\pi^{-1}}$. Then our public key is simply $(A, v)$ and our secret key is the permutation $\pi$. A few interesting observations that Beullens et al. 2018 pointed out:

- With high probability, we can convert a random matrix $A$ of dimension $m \times n$ into the form $[I_m \vert A']$ where $A' \in \mathbb{F}_p^{m \times (n-m)}$. Then we no longer need to store the left-hand identity matrix $I_m$, so we can reduce the communication cost.
- Instead of applying Gaussian elimination to $A$ in order to find its right kernel $w$, we can go the other way around. First, we randomly sample the vector $v$ and every column but the last of matrix $A$, leaving the last column blank. We then sample a random permutation $\pi$ and set $w = v_\pi$. Essentially, we get something like: \(\begin{pmatrix} a_{0,0} & \cdots & a_{0,n-2} & a_{0,n-1}^* \\ \vdots & & \vdots & \vdots \\ a_{m-1,0} & \cdots & a_{m-1,n-2} & a_{m-1,n-1}^* \end{pmatrix} \cdot \begin{pmatrix} w_0 \\ \vdots \\ w_{n-1} \end{pmatrix} = 0\) where $a_{i,j}$ is the entry in the $i$-th row and $j$-th column of matrix $A$, and $a_{i,n-1}^*$ is the blank column. We can see that, for each row $i$, we need $\sum_{j=0}^{n-1} a_{i,j} w_j = 0$. Since $a_{i,n-1}^*$ is left uncomputed, we can simply solve for it: \(a_{i,n-1}^* = -w_{n-1}^{-1} \sum_{j=0}^{n-2} a_{i,j} w_j\) Note that $w_{n-1}^{-1}$ can be computed once and stored in memory. Then it is easy to see that $Aw = 0$ holds by construction.

Therefore, we can do the dimension-reduction trick together with the sampling trick and obtain the key material required for the IDS protocol. After KeyGen, we are given a public key $pk = (A, v)$ and a secret key $sk = \pi$. The next step is to run the actual PKP-IDS scheme to produce a zero-knowledge proof that Alice has knowledge of the secret $\pi$. The entire protocol has 5 rounds:

1. Alice samples a random permutation $\sigma \in S_n$ and a random vector $r \in \mathbb{F}_p^n$, and sends commitments to the following two items: $C_0 \leftarrow Com(\sigma, Ar)$, the commitment of $Ar$ (matrix $A$ blinded by the random vector $r$) using $\sigma$ as randomness; and $C_1 \leftarrow Com(\pi\sigma, r_\sigma)$, the commitment of $r_\sigma$ (the random vector permuted by $\sigma$) using $\pi\sigma$ as randomness.
2. Upon receiving both commitments, Bob generates a random challenge $c \in \mathbb{F}_p$ and sends back $c$.
3. Alice computes $z \leftarrow r_\sigma + cv_{\pi\sigma}$ and sends $z$.
4. Bob generates a random coin flip $b \leftarrow \{0,1\}$ and sends $b$.
5. Alice sends $\sigma$ if $b=0$, or $\pi\sigma$ if $b=1$.

To verify: if $b=0$, Bob can compute the first commitment himself via $C^* \leftarrow Com(\sigma, Az_{\sigma^{-1}})$; if $b=1$, he can compute the second commitment himself via $C^* \leftarrow Com(\pi\sigma, z - cv_{\pi\sigma})$. Bob checks whether the commitment he computes matches the original commitment sent by Alice and outputs Yes/No. We use a hash function to model the commitment scheme here: given a hash function $H$, we can instantiate $Com(\rho, x) = H(\rho \Vert x)$.

PKP-IDS is a public-coin protocol. This means the only thing Bob provides during the protocol is randomly sampled bits, which implies that we can use the Fiat-Shamir heuristic to compress the interactive protocol into a non-interactive one by deriving Bob's challenges from a hash of Alice's transcript. However, due to the $\frac{p+1}{2p}$ soundness error of the PKP-IDS scheme, we need to repeat this protocol $N$ times in order to achieve a reasonable soundness error.

1. First, Alice generates the $(A, v), \pi$ instance. Then she samples $\sigma$ and generates the two commitments $C_0, C_1$.
2. Alice is then supposed to send $C_0, C_1$ to Bob and receive a challenge $c$. Instead of doing that, we use Fiat-Shamir and apply a hash to Alice's transcript to sample the challenge, i.e., $c \leftarrow H(C_0 \Vert C_1)$. A potential strategy for mapping the hash output into $\mathbb{F}_p$ is to mod the randomness bytes down into the finite field and take the remainder.
3. Upon deriving the challenge $c$, Alice computes the response $z$ using the derived $c$.
4. Following the same idea as step 2, Alice uses the transcript so far to derive the next challenge $b$, i.e., $b \leftarrow H(C_0 \Vert C_1 \Vert z)$.
5. Lastly, Alice answers the second challenge with either $\sigma$ or $\pi\sigma$, and sends the entire transcript to Bob.

Upon receiving the non-interactive proof, Bob can follow the same steps as Alice to derive the challenges $c$ and $b$ respectively, and invoke the same verify function from the interactive protocol to validate the proof.

Right now, the non-interactive proof above simply proves knowledge of the secret permutation $\pi$. However, we want to not only prove that, but also somehow bind this proof to a specific message $m$, such that only someone who knows $\pi$ could have generated this proof on $m$. This is what we call a digital signature. How this is typically done is to encode the message $m$ as part of the randomness of the commitment scheme, and later reveal $m$ as part of opening the commitment. In our implementation, it's as simple as putting the message into the transcript before hashing and generating the challenges. In other words:
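A minimal sketch of those challenge derivations with the message mixed in (my own illustration; `H` is SHA-256 here, and all protocol values are assumed to be serialized to bytes already):

```python
# Fiat-Shamir challenge derivation with message binding: the message
# m enters the transcript before each hash, so the challenges (and
# therefore the whole proof) are bound to m.
import hashlib

def H(*parts: bytes) -> bytes:
    h = hashlib.sha256()
    for part in parts:
        h.update(part)
    return h.digest()

def challenge_c(m: bytes, C0: bytes, C1: bytes, p: int) -> int:
    # first challenge: mod the hash bytes down into F_p
    return int.from_bytes(H(m, C0, C1), "big") % p

def challenge_b(m: bytes, C0: bytes, C1: bytes, z: bytes) -> int:
    # second challenge: a single coin flip from the running transcript
    return H(m, C0, C1, z)[0] & 1
```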

0 views
Higashi 5 years ago

Fully Homomorphic Encryption Part Three: Three Strawmans for the FHE Scheme

In my previous post, I went over how Lattice-based crypto works, as well as what the Learning With Error (LWE) Problem is. In the end, we looked at how Regev Encryption works by putting the LWE problem together with an encryption scheme. Hopefully, everyone should have a pretty solid understanding of these fundamental building blocks. Now we are finally ready to battle the archnemesis - building the actual FHE scheme. Before we begin, let's do a quick recap of what we have learned before.

In the last post, we went over what the LWE problem is and why it matters. Basically, LWE defines some large matrix $ A $ and a hidden vector $ s $. We are only given $ A $ and the product with error ($ As + e $), and the task is to recover the hidden vector $ s $ from the two inputs. Later, we are also given the Decisional LWE (DLWE) version of the problem, which asks us to distinguish between the LWE problem instance ($ A, As + e $) and just a uniformly sampled random matrix. Both SLWE and DLWE are conjectured to be hard. To better understand why DLWE is hard, here is a useful diagram that we can look at: By looking at the top of the diagram, we can see how we construct the product with error term $ As + e $. Notice that the dimensions match nicely. Now, we put $ A $ and $ As + e $ together (as shown on the bottom). Since SLWE is conjectured to be hard, we cannot really recover $ s $ from $ As + e $, and it will just look like some random values to us. Therefore, our view of the problem $ A, As + e $ is computationally equivalent to an $ m \times (n + 1) $ matrix with uniformly sampled random values in $ \mathbb{Z}_q $.

Last time, we also took a look at how Regev Encryption works. Regev Encryption basically maps the value of a bit (0 or 1) onto two sides of the ring $ \mathbb{Z}_q $. Then it applies the LWE product with error $ As + e $ onto the value. Because of the hardness of DLWE, the product with error term $ As + e $ is indistinguishable from a randomly sampled vector $ v $. Therefore, applying the LWE product with error onto the original value is equivalent to applying a One-Time-Pad onto the original value, thus guaranteeing its security. Upon decryption, we use the knowledge of the secret vector $ s $ to remove the One-Time-Pad. However, since the original product with error also carries an error term, our result is also offset by this error term. The diagram above visualizes how the error term affects the decrypted result. We see that the original value is either going to be 0 or $ q/2 $. Once we remove the LWE OTP from the ciphertext, we are left with some error that shifts the value up or down the ring. However, as long as the absolute value of the error does not exceed $ q/4 $, we are safe, because the resulting value will always stay on the same side of the ring. Therefore, keeping the error bound under $ q/4 $ guarantees that decryption succeeds. This is also the reason why we enforce $ q/4 > mB $ in our encryption parameter setup.

Finally, let's review the definition of FHE once again! We can basically treat the FHE problem as if a user has some secret input $ x $ and would like the cloud to obliviously compute the function $ f(\cdot) $ for him. First, the user uses an FHE scheme to encrypt his secret input and sends $ Enc(x) $ to the cloud. Then, the cloud uses the FHE algorithms to apply the function $ f $ to the original ciphertext and obtain $ Enc(f(x)) $. Finally, the user decrypts the encrypted result and obtains $ f(x) $.
The formal definition of a valid FHE scheme contains 4 fundamental algorithms:

- $ KeyGen(1^\lambda) \rightarrow sk $: The key generation algorithm that generates the secret key for this instance.
- $ Enc(sk, \mu \in \{0,1\}) \rightarrow ct $: The encryption algorithm that encrypts a single bit $ \mu $.
- $ Dec(sk, ct) \rightarrow \mu $: The decryption algorithm that recovers $ \mu $ from the ciphertext.
- $ Eval(F, ct_1, \dots, ct_l) \rightarrow \tilde{ct} $: The evaluation algorithm that evaluates an arbitrary functionality $ F $ over a set of $ l $ ciphertexts.

Alongside the four algorithms, a valid FHE scheme also needs to satisfy 3 properties:

- Correctness: The scheme needs to be correct, so that after evaluation, the ciphertext can be successfully converted back to the intended plaintext.
- Semantic Security: The ciphertext produced by this encryption scheme needs to be indistinguishable. This defends the scheme against eavesdroppers.
- Compactness: The output length of the $ Eval $ algorithm needs to be small and independent of the size of the evaluated functionality $ F $.

We have also learned about the 4 different levels of FHE:

- Partially Homomorphic Encryption: schemes that are either additively or multiplicatively homomorphic. For example, RSA and ElGamal.
- Somewhat Homomorphic Encryption: schemes that are PHE first and show very weak traits of the other missing homomorphic property. For example, ElGamal in pairing-friendly groups.
- Leveled Fully Homomorphic Encryption: schemes that are already fully homomorphic, but impose a complexity upper bound $ L $ such that the circuit evaluated for the functionality $ F $ must stay within the bound. Usually the bound $ L $ is pretty low.
- Fully Homomorphic Encryption: schemes that are truly fully homomorphic. There are no complexity upper bounds, so they can evaluate arbitrary functionalities.

Last time we also talked about how, with the notion of Bootstrapping introduced by Gentry in 2009, we can turn suitable Leveled FHE schemes into real FHE schemes. The GSW scheme is indeed built up this way. Since I promised you that we will go over how GSW works in detail, today we will be putting all these building blocks together and building up our first Leveled FHE Scheme. Once this is done, we'll talk about applying Bootstrapping (presumably in the next post). Most of these materials are covered in detail in the previous two posts, so here is just a brief recap and emphasis of the key concepts. If you are still uncomfortable with these concepts, feel free to return to the previous posts and check them out.

That was a quick recap session for the stuff that we have learned so far. Now let's construct a Leveled FHE scheme! In order to achieve LFHE, we will build it up in three steps. Namely, we will talk about 3 strawman protocols that gradually build up to our final LFHE system. Long story short. Let's begin.

According to the definition, if we have a scheme that can successfully evaluate addition and multiplication using the $ Eval $ algorithm, that scheme is fully homomorphic. I hereby propose our first construction of an FHE scheme - using very simple linear algebra. The $ KeyGen $ algorithm is very simple: we randomly select a vector $ \vec{s} $, and that will be our secret key.

$ Enc(\vec{s}, \mu) \rightarrow C $: For encryption of a message $ \mu $ (a multi-bit number), we simply output a matrix $ C $ such that: \(C \cdot \vec{s} = \mu \cdot \vec{s}\) If you know linear algebra, you will instantly recognize this expression! Namely, $ \vec{s} $ is an eigenvector of the matrix $ C $, and $ \mu $ is the corresponding eigenvalue. Basically, the encryption algorithm becomes: find a matrix $ C $ that has eigenvector $ \vec{s} $ with eigenvalue $ \mu $.

$ Dec(\vec{s}, C) $: Decryption is very easy. We multiply $ C $ with $ \vec{s} $ and obtain a multiple of the original vector $ \vec{s} $. All we need to do is find the correct multiplier $ \mu $ such that $ C \cdot \vec{s} = \mu \cdot \vec{s} $.

Now let's see how the $ Eval $ algorithm works. Namely, to get FHE, we just need to make the addition and multiplication functionalities work:

$ Eval(+, C_1, C_2) $: In order to evaluate addition of two ciphertexts, we can simply output $ \tilde{C} \leftarrow C_1 + C_2 $. The correctness is easy to verify: \((C_1 + C_2) \cdot \vec{s} = C_1 \cdot \vec{s} + C_2 \cdot \vec{s} = \mu_1 \cdot \vec{s} + \mu_2 \cdot \vec{s} = (\mu_1 + \mu_2) \cdot \vec{s}\)

$ Eval(\times, C_1, C_2) $: To evaluate multiplication of two ciphertexts, we can just as easily output $\tilde{C} \leftarrow C_1 \cdot C_2$. Again, we can see the correctness of multiplication as well: \((C_1 \cdot C_2) \cdot \vec{s} = C_1 \cdot (C_2 \cdot \vec{s}) = C_1 \cdot \mu_2 \cdot \vec{s} = \mu_2 \cdot C_1 \cdot \vec{s} = \mu_2 \cdot \mu_1 \cdot \vec{s} = (\mu_1 \cdot \mu_2) \cdot \vec{s}\)

Surprisingly, this scheme is capable of evaluating both additions and multiplications! If this first try worked, we would be so happy, and I could gracefully end my post here.
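To see this strawman in action, here is a toy NumPy demo (my own construction for illustration, not from the post): we manufacture $ C = \mu I + r t^T $ with $ t \cdot s \equiv 0 \pmod q $, so $ \vec{s} $ is an exact eigenvector of $ C $:

```python
# A toy demo of the eigenvector strawman over Z_q.
import numpy as np

q, n = 97, 4
rng = np.random.default_rng(0)
s = rng.integers(1, q, n)          # secret key (all entries nonzero)

def enc(mu):
    t = rng.integers(0, q, n)
    # force t . s = 0 (mod q) by solving for the last entry of t
    t[-1] = (-t[:-1] @ s[:-1]) * pow(int(s[-1]), -1, q) % q
    r = rng.integers(0, q, n)
    # C s = mu s + r (t . s) = mu s, so s is an eigenvector
    return (mu * np.eye(n, dtype=np.int64) + np.outer(r, t)) % q

def dec(C):
    v = C @ s % q
    # v = mu * s, so recover mu from the first coordinate
    return int(v[0]) * pow(int(s[0]), -1, q) % q

C1, C2 = enc(3), enc(5)
assert dec((C1 + C2) % q) == 8     # additive homomorphism
assert dec(C1 @ C2 % q) == 15      # multiplicative homomorphism
```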
Although this scheme is perfect at both additive and multiplicative homomorphism, due to some very unfortunate reasons, this perfectly homomorphic construction is invalid. The reason why this construction is not a valid FHE scheme is a glaring flaw: given the matrix $ C $, it is really easy to find its eigenvector $ \vec{s} $ and eigenvalue $ \mu $! Although this scheme is perfectly homomorphic, it has no hardness at all - so we can barely call it an "encryption" scheme. In fact, cracking this "encryption" scheme is as simple as using Gaussian Elimination to recover $ \vec{s} $ and $ \mu $.

So how do we keep this great scheme yet make it harder? We do exactly the same thing as we did to systems of linear equations: add noise. Now, just like how we converted systems of linear equations to LWE, let's add some noise to our original construction. Generally, we will get a relation like this: \(C \cdot \vec{s} = \mu \cdot \vec{s} + \vec{e}\) Notice that this relation is almost the same as the previous construction, except that we introduced an additional error term $ \vec{e} $ to the equation. Now, let's formally define our second, noisy construction.

$ KeyGen(1^\lambda) $: We will formally define the key generation algorithm here. First, we sample a random vector $ \tilde{s} \in \mathbb{Z}_q^{n-1} $. Then, we vertically append a $-1$ to $ \tilde{s} $ to obtain $ \vec{s} \leftarrow \begin{pmatrix}\tilde{s}\\-1\end{pmatrix} $. We output both $ \tilde{s}, \vec{s} $ as the secret key.

$ Enc(\tilde{s}, \mu \in \{0, 1\}) $: We will use single-bit encryption again this time. To encrypt a bit $ \mu $, we first need to construct an LWE instance. First, we randomly sample a matrix $ A \stackrel{R}{\leftarrow} \mathbb{Z}_q^{n \times (n-1)} $. We also sample the error vector $ \vec{e} \stackrel{R}{\leftarrow} x_B^n $. Then, we output the ciphertext $ C $: \(C = \underbrace{(A, A \cdot \tilde{s} + \vec{e})}_\text{LWE Instance} + \overbrace{\mu \cdot I_n}^\text{message} \in \mathbb{Z}_q^{n \times n}\) Here $ I_n $ is the identity matrix of dimension $ n \times n $. Notice that on the left-hand side of the expression, we are concatenating the matrix $ A $ and the vector $ A\tilde{s} + \vec{e} $ together into a square matrix of dimension $ n \times n $. Due to the DLWE assumption mentioned in the beginning, we can treat this as a random One-Time-Pad over the actual encrypted value $ \mu \cdot I_n $ on the right side.

$ Dec(\vec{s}, C) $: Now, when we try to decrypt the ciphertext $ C $, we simply compute $ C \cdot \vec{s} $ and then output: \(\text{output: } \begin{cases} 0, & \text{if } \mid\mid C \cdot \vec{s} \mid\mid \text{ is small}\\ 1, & \text{otherwise} \end{cases}\)

The correctness proof for the decryption is also pretty straightforward. Let's expand the multiplication $ C \cdot \vec{s} $: \(\begin{align*} C \cdot \vec{s} &= ((A, A \cdot \tilde{s} + \vec{e}) + \mu \cdot I_n) \cdot \vec{s}\\ &= (A, A \cdot \tilde{s} + \vec{e}) \cdot \vec{s} + \mu \cdot I_n \cdot \vec{s}\\ &= A \cdot \tilde{s} - (A \cdot \tilde{s} + \vec{e}) + \mu \cdot \vec{s}\\ &= \mu \cdot \vec{s} - \vec{e} \end{align*}\) Since the vector $ \vec{s} $ is simply the original vector $ \tilde{s} $ stacked on top of a $ -1 $, multiplying $ (A, A \cdot \tilde{s} + \vec{e}) $ by $ \vec{s} $ is equivalent to multiplying the matrix $ A $ by the vector $ \tilde{s} $ and then subtracting $ A \cdot \tilde{s} + \vec{e} $ from it. In the end, we get something similar to the relation that we started with. Since the error is sampled randomly with an upper bound on its absolute value, we can safely flip the minus sign and the relation still holds. Since this relation $ C \cdot \vec{s} = \mu \cdot \vec{s} \pm \vec{e} $ is very similar to the original eigenvector/eigenvalue relation, we usually call $ \vec{s} $ an approximate eigenvector of the matrix $ C $.
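A toy NumPy demo of this noisy construction (again my own, for illustration; the secret's entries are deliberately sampled away from 0 mod $ q $ so the toy decryption is deterministic, whereas a real instantiation samples uniformly):

```python
# Strawman 2 as a toy demo: C = [A | A s~ + e] + mu * I_n, and
# decryption checks whether C s mod q is small.
import numpy as np

q, n, B = 2**16 + 1, 8, 4
rng = np.random.default_rng(1)
# entries kept away from 0 (mod q) so that mu = 1 always decrypts
s_tilde = rng.integers(q // 3, 2 * q // 3, n - 1)
s = np.concatenate([s_tilde, [-1]])        # secret key (s~, -1)

def center(x):
    return (x + q // 2) % q - q // 2       # representative near (-q/2, q/2)

def enc(mu):
    A = rng.integers(0, q, (n, n - 1))
    e = rng.integers(-B, B + 1, n)
    b = (A @ s_tilde + e) % q
    return (np.column_stack([A, b]) + mu * np.eye(n, dtype=np.int64)) % q

def dec(C):
    # C s = mu * s - e (mod q); small iff mu = 0
    return 0 if np.max(np.abs(center(C @ s % q))) < q // 4 else 1

assert dec(enc(0)) == 0 and dec(enc(1)) == 1
assert dec((enc(0) + enc(1)) % q) == 1     # homomorphic bit addition
```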
So far so good. Now let's see how to evaluate addition and multiplication in this new system.

$ Eval(+, C_1, C_2) \rightarrow C_1 + C_2 $: Evaluating addition is the same as before - we just add the two ciphertexts together. If we try to decrypt the addition result, it looks like the following: \(\begin{align*} (C_1 + C_2) \cdot \vec{s} &= C_1 \cdot \vec{s} + C_2 \cdot \vec{s}\\ &= \mu_1 \cdot \vec{s} + \vec{e}_1 + \mu_2 \cdot \vec{s} + \vec{e}_2\\ &= (\mu_1 + \mu_2) \cdot \vec{s} + \underbrace{(\vec{e}_1 + \vec{e}_2)}_\text{small error term} \end{align*}\) We see that when we add $ C_1 $ and $ C_2 $ together, we efficiently obtain an encryption of $ \mu_1 + \mu_2 $, with the small caveat that the error term of this encryption is now $ \vec{e}_1 + \vec{e}_2 $. However, since we know that the error vectors are small and sampled from a bounded distribution $ x_B $, we can safely treat this as a manageable small term, and there's no problem at all.

$ Eval(\times, C_1, C_2) \rightarrow C_1 \cdot C_2 $: We follow the original construction and multiply the two ciphertexts together to evaluate multiplication. Expanding the decryption of a multiplication result gives us: \(\begin{align*} (C_1 \cdot C_2) \cdot \vec{s} &= C_1 \cdot (C_2 \cdot \vec{s})\\ &= C_1 \cdot (\mu_2 \cdot \vec{s} + \vec{e}_2)\\ &= \mu_2 \cdot C_1 \cdot \vec{s} + C_1 \cdot \vec{e}_2\\ &= \underbrace{\mu_1 \cdot \mu_2 \cdot \vec{s}}_\text{Result} + \underbrace{\mu_2 \cdot \vec{e}_1}_\text{small error term} + \underbrace{C_1 \cdot \vec{e}_2}_\text{large error term!} \end{align*}\) From the expansion, we see that we do obtain the multiplication of the original plaintexts, plus a small error term $ \mu_2 \cdot \vec{e}_1 $. This term is small because our plaintext $ \mu $ is binary, and the error samples in the vector $ \vec{e} $ are all bounded by a small $ B $. However, the final term $ C_1 \cdot \vec{e}_2 $ is really bad. As the values in $ C_1 $ are not bounded by any distribution but are sampled directly from the finite field $ \mathbb{Z}_q $, their product with the error vector is also unbounded! This means that the last term can easily get huge and blow past the noise tolerance of our scheme.

Adding noise to our original construction does help with semantic security, because now it's computationally hard to recover the approximate eigenvector and eigenvalue given the matrix $ C $. However, this additional noise also destroys our perfect homomorphism. While the noise term when evaluating addition is OK, multiplication is the big problem here. We currently have this unbounded term $ C_1 \cdot \vec{e}_2 $ that might blow up the noise in the ciphertext, causing decryption to fail. Since the error vector $ \vec{e}_2 $ is bounded by an upper bound $ B $, the real problem is that the values in the encryption matrix $ C $ are unbounded in $ \mathbb{Z}_q $; multiplying these unbounded values with a bounded noise vector gives us unbounded noise! If we can somehow bound the values in $ C $ to be small and still make it perfectly random, then our problem will be easily solved!
Namely, we want to make the infinity norm of the matrix $ C $ small: $ \mid\mid C \mid\mid_\infty $ should be small, meaning every value in the matrix $ C $ is small. However, we also want to maintain the randomness of the matrix $ C $, so it doesn't become too deterministic and destroy the security. So what's the secret sauce that's missing here? Binary Decomposition to the rescue!

If we have a number $ x \in \mathbb{Z}_q $, the way to reduce its infinity norm while maintaining its original value is to convert it into its binary representation. We define: \(\hat{x} = (x_0, x_1, \dots, x_{log(q)-1}) \in \{0, 1\}^{log(q)}\) $ \hat{x} $ is the binary decomposition of the original value $ x $. Namely, a binary decomposition converts a number in $ \mathbb{Z}_q $ into $ log(q) $ bits, and each bit can only take the values $ \{0, 1\} $. Now, if we look at the infinity norm of $ \hat{x} $, we can see that it's at most 1! Therefore it satisfies our small-norm requirement. The other reason why binary decomposition is the best candidate is that reconstructing $ x $ from the decomposition $ \hat{x} $ can be expressed as a linear transformation. Given $ x_0, \dots, x_{log(q)-1} $, we can efficiently reconstruct: \(x = \sum_{i=0}^{log(q)-1} x_i \cdot 2^i\) Thus, this reconstruction can also be expressed as a matrix.

Next, we extend the notion of binary decomposition to vectors and matrices. For a given vector $ \vec{x} \in \mathbb{Z}_q^n $, the binary decomposition $ \hat{x} $ is simply the decomposition of every single element of $ \vec{x} $: \(\hat{x} = (\underbrace{x_{0,0}, \dots, x_{0, log(q)-1}}_\text{log(q) elements}, x_{1, 0}, \dots, x_{1, log(q)-1}, \dots, x_{n-1, 0}, \dots, x_{n-1, log(q)-1})\) In total, there will be $ n \times log(q) $ elements in the decomposed vector $ \hat{x} $. Similarly, for a matrix $ C $, we can simply apply the vector binary decomposition to every row of $ C $ to obtain its binary decomposition $ \hat{C} $: \(C = \begin{pmatrix}C_1\\\vdots\\C_m\end{pmatrix} \rightarrow\hat{C} = \begin{pmatrix}\hat{C_1}\\\vdots\\\hat{C_m}\end{pmatrix}\) When we want to reconstruct $ C $ from $ \hat{C} $, we can write the reconstruction linear transformation of every element in $ C $ as a matrix $ G $, such that: \(C = \hat{C} \cdot G\) where $ G $ is the reconstruction matrix. In some papers, the binary decomposition is also written as the inverse of the reconstruction matrix, $ G^{-1}(\cdot) $, so $ \hat{C} = G^{-1}(C) $.

Now with all the binary decomposition machinery at our disposal, let's see if we can add it to our FHE construction. From the previous section, we know that the term $ C_1 \cdot \vec{e}_2 $ is unbounded and might introduce too much noise. We also learned that by applying the binary decomposition $ G^{-1}(\cdot) $ to a matrix, we can efficiently obtain a decomposed matrix $ \hat{C} $ with a very small infinity norm. Now, let's apply this idea and see if we can improve the last construction.
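Before diving into the third construction, here is a concrete toy sketch (mine, for illustration) of what $ G^{-1}(\cdot) $ and $ G $ look like:

```python
# Bit decomposition G^{-1}(.) and the gadget reconstruction matrix G.
import numpy as np

q, k = 64, 6                          # k = log2(q) bits per entry

def G(n):
    # gadget reconstruction matrix: (n*k) x n, block-diagonal copies
    # of the column (1, 2, 4, ..., 2^(k-1))
    g = (1 << np.arange(k)).reshape(-1, 1)
    return np.kron(np.eye(n, dtype=np.int64), g)

def g_inverse(C):
    # binary-decompose every entry, expanding each length-n row into
    # a length n*k row of bits; satisfies g_inverse(C) @ G(n) == C
    m, n = C.shape
    bits = (C[:, :, None] >> np.arange(k)) & 1   # shape (m, n, k)
    return bits.reshape(m, n * k)

C = np.random.default_rng(2).integers(0, q, (3, 4))
C_hat = g_inverse(C)
assert C_hat.max() <= 1                  # tiny infinity norm
assert np.array_equal(C_hat @ G(4), C)   # exact reconstruction
```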
$ KeyGen(1^\lambda) $: The key generation process is still the same. We randomly sample $ \tilde{s} $ and obtain $ \vec{s} \leftarrow \begin{pmatrix}\tilde{s}\\-1\end{pmatrix} $. Both $ \tilde{s}, \vec{s} $ are used in the encryption scheme.

$ Enc(\tilde{s}, \mu \in \{0, 1\}) $: The encryption algorithm is pretty much the same as before. We first construct an LWE instance by selecting $ m = n \cdot log(q) $. Then we sample the matrix $ A \stackrel{R}{\leftarrow} \mathbb{Z}_q^{m \times (n-1)} $ as well as the error vector $ \vec{e} \stackrel{R}{\leftarrow} x_B^m $. Then, we compute the ciphertext $ C $: \(C = (A, A \cdot \tilde{s} + \vec{e}) + \mu \cdot G \in \mathbb{Z}_q^{m \times n}\) Finally, instead of outputting the ciphertext $ C $ directly, we apply our binary decomposition to the ciphertext and output $ \hat{C} \leftarrow G^{-1}(C) $ instead.

$ Dec(\vec{s}, \hat{C}) $: In order to decrypt the decomposed ciphertext $ \hat{C} $, first we apply the reconstruction matrix $ G $, and then we multiply by the private key $ \vec{s} $: \(\begin{align*} \hat{C} \cdot G \cdot \vec{s} &= C \cdot \vec{s}\\ &= (A, A \cdot \tilde{s} + \vec{e}) \begin{pmatrix}\tilde{s}\\-1\end{pmatrix} + \mu \cdot G \cdot \vec{s}\\ &= A \cdot \tilde{s} - A \cdot \tilde{s} - \vec{e} + \mu \cdot G \cdot \vec{s}\\ &= \mu \cdot G \cdot \vec{s} - \vec{e} \end{align*}\) Once we obtain the final term, which is a vector of $ m $ elements, we can just look at the first element of this result vector. If the magnitude of the first element is small (say, below $ q/4 $), we output $ \mu = 0 $; else we output $ \mu = 1 $. If we expand how we get the first element of the result vector, we get something like this: \((\mu \cdot G \cdot \vec{s})_1 = \mu \cdot (\vec{G})_1 \cdot \vec{s} = \mu \cdot (1, 0, \dots, 0) \cdot \begin{pmatrix}s_1\\\vdots\\s_{n-1}\\-1 \end{pmatrix} = \mu \cdot s_1\) Notice that the first row of the matrix $ G $ is basically a one-hot vector with 1 as the first entry and 0 elsewhere. Because $ s_1 $ is simply a random number sampled in $ \mathbb{Z}_q $, it has a very high probability of being a large number. Therefore, if $ \mu \cdot s_1 - e_1 $ is small, it means that $ \mu $ is 0 with very high probability.

Finally… we are at the point of examining whether our third attempt can correctly evaluate addition and multiplication. Addition is actually pretty easy in this case, because this scheme didn't change much of the construction. We can look at the decryption of an addition evaluation: \(\begin{align*} (\hat{C_1} + \hat{C_2}) \cdot G \cdot \vec{s} &= \hat{C_1} \cdot G \cdot \vec{s} + \hat{C_2} \cdot G \cdot \vec{s}\\ &= \mu_1 \cdot G \cdot \vec{s} - \vec{e}_1 + \mu_2 \cdot G \cdot \vec{s} - \vec{e}_2\\ &= \underbrace{(\mu_1 + \mu_2) \cdot G \cdot \vec{s}}_\text{addition of plaintext} - \underbrace{(\vec{e}_1 + \vec{e}_2)}_\text{small error term} \end{align*}\) Notice that we get a pretty nice encryption of $ \mu_1 + \mu_2 $ with a small error term $ \vec{e}_1 + \vec{e}_2 $. This is exactly the same as before.

Multiplication is the real deal here. Now let's see how the decryption of a multiplication evaluation looks: \(\begin{align*} (\hat{C_1} \cdot \hat{C_2}) \cdot G \cdot \vec{s} &= \hat{C_1} \cdot (\hat{C_2} \cdot G) \cdot \vec{s}\\ &= \hat{C_1} \cdot C_2 \cdot \vec{s}\\ &= \hat{C_1} \cdot (\mu_2 \cdot G \cdot \vec{s} + \vec{e}_2)\\ &= \mu_2 \cdot \hat{C_1} \cdot G \cdot \vec{s} + \hat{C_1} \cdot \vec{e}_2\\ &= \mu_2 \cdot (\mu_1 \cdot G \cdot \vec{s} + \vec{e}_1) + \hat{C_1} \cdot \vec{e}_2\\ &= \underbrace{(\mu_1 \cdot \mu_2) \cdot G \cdot \vec{s}}_\text{multiplication of pt} + \underbrace{\mu_2 \cdot \vec{e}_1}_\text{small error term} + \underbrace{\hat{C_1} \cdot \vec{e}_2}_\text{relatively small} \end{align*}\) We see that the result is basically the same as in the previous construction, except that the last term is now $ \hat{C_1} \cdot \vec{e}_2 $.
Since $ \hat{C_1} $ is a binary decomposition, its infinity norm is bounded by 1: $ \mid\mid \hat{C_1} \mid\mid_\infty \le 1 $. Therefore, the maximum amount of error introduced by this term is $ n \cdot log(q) \cdot B $. This error bound is not ideal, but at least it's good enough for us to have some error tolerance. If we choose the right LWE problem instance, we can get $ L $ steps of multiplication evaluation before the ciphertext reaches its noise threshold. The diagram above depicts how the noise distribution grows as the circuit goes deeper. Once we reach $ L+1 $ steps of multiplication evaluation, the noise boundaries of 0 and 1 overlap, and at that point, decryption doesn't work anymore. Therefore, this third construction gives us a Leveled Fully Homomorphic Encryption Scheme! This scheme allows us to compute circuits that have at most $ L $ multiplication gates in series.

Compared to the two previous posts, I must say that this one is much more hardcore and contains lots of mathematical equations and details. Though I tried my best to make this as simple as possible, it might still confuse some of you. Feel free to read through the three strawman constructions again to consolidate your knowledge of how an FHE scheme is constructed from this "approximate eigenvector" method. If you successfully made it through and understood most of it, congratulations! You've just mastered how FHE works and is built from the ground up. This is really big - considering FHE is still a super young field that didn't exist ~10 years ago.

But… what next? We have together created a Leveled FHE Scheme from the ground up. And this is actually how the GSW scheme works. (Although the GSW scheme uses a public-key variant of the encryption scheme, the rest is exactly the same.) The next step is definitely trying to make this LFHE a true FHE! Right now we can only evaluate up to $ L $ levels of multiplication. But what if we want to do some encrypted machine learning that might take an unbounded number of multiplication levels? The key idea that converts an LFHE scheme like GSW into true FHE is Bootstrapping. Bootstrapping is an idea proposed by Gentry that efficiently "refreshes" the noise of a ciphertext. Basically, if we have evaluated multiplication too many times and a ciphertext is approaching its noise boundary, we can apply Bootstrapping to obtain a fresh ciphertext that contains the same plaintext but significantly less noise. In the next post, let's take a deep dive into how Bootstrapping works and turn our leveled GSW scheme into a true FHE.

Most content of this post was summarized from my notes for CS355. You can locate all the awesome lecture notes from CS355 here. Thanks again to Dima, Florian, and Saba for teaching this amazing course!

0 views
Higashi 5 years ago

Fully Homomorphic Encryption Part Two: Lattice-based Crypto and the LWE Problem

Last time, we went through the overview of what FHE is, the different stages towards FHE, and a brief history of it. I think at this point, we should be pretty comfortable with understanding what FHE is and its potential applications. We concluded the last post by alluding to the GSW FHE Scheme. It's the 3rd-gen FHE scheme based on the LWE problem assumption, which stems from Lattice-based Cryptography. Although these topics might sound pretty archaic and distant, I claim that they're actually not that hard to understand. In fact, with just simple knowledge of linear algebra, we can fully grasp the idea of the GSW FHE Scheme. In this post, let's review some fundamental concepts of Lattice-based Crypto and the LWE problem together, so we can build up our knowledge towards the actual GSW Scheme later.

So what the heck is Lattice-based Crypto? According to Wikipedia:

Lattice-based cryptography is the generic term for constructions of cryptographic primitives that involve lattices, either in the construction itself or in the security proof. Lattice-based constructions are currently important candidates for post-quantum cryptography.

Word. Lattice-based Crypto is basically the crypto that uses lattices. Sounds like a very convenient explanation. However, for those of us who don't really understand what lattices are, this concept sounds very confusing. To be honest, I still haven't fully grasped the definition of lattices. If we look at its definition on Wikipedia again, it involves some geometry and group theory concepts which are also hard to comprehend. I think this is probably why this topic scared away lots of pondering readers. Therefore, I decided to take a more practical approach and introduce this concept using my own understanding of it. Disclaimer: to more hardcore readers, my interpretation of this concept might not be entirely correct. Here I'm just trying to provide a friendlier perspective on this problem.

At this point, I would assume that we all have some linear algebra background. (If not, I highly recommend watching the YouTube series on Linear Algebra by 3Blue1Brown.) We all know what a basis vector is - if we say a linear space has basis vectors $ b_0, b_1 $, it means that every single element in that space can be represented as a linear combination of those two vectors. On the other hand, the span of the basis vectors must also cover the entire linear space. If we have some basis vectors $ b_0, b_1 $ that span a linear space, any vector $ v $ in this space can then be expressed as: \(v = c_0 \cdot b_0 + c_1 \cdot b_1\) We know that since the coefficients $ c_0, c_1 $ can be any number, all the possible choices of these numbers will eventually create an area which represents the span of the basis vectors. However, what if we were to add one more limitation: all the coefficients $ c_0, c_1 $ must be integers? Then all the possible choices of these coefficients no longer create an area. Instead, they create a discrete, sparse grid of dots, where every dot represents a unique integer linear combination of the basis vectors. To imagine this, check out the image below: Basically, instead of having an area of infinite vector choices, now we have a discrete grid of vector choices that are linear combinations of the basis vectors of this linear space where the coefficients are all integers. We refer to this grid-like system as an Integer Lattice. Although we are describing a two-dimensional lattice structure, it's perfectly fine to extend it into multiple dimensions.
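A tiny illustration of that grid of dots (my own snippet; the basis here is arbitrary):

```python
# Integer lattice points generated by a 2D basis: every point is an
# integer combination c0*b0 + c1*b1.
import numpy as np

b0, b1 = np.array([2, 1]), np.array([1, 3])
points = [c0 * b0 + c1 * b1 for c0 in range(-2, 3) for c1 in range(-2, 3)]
# e.g. c0 = 1, c1 = -1 gives the lattice point (1, -2)
```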
In fact, in order to make this into a hard problem that's suitable for cryptography, we need to have plenty of dimensions. So what's special about Integer Lattices? It turns out it's the "Integer" part that makes this interesting. Previously, when we talked about the span of the basis vectors, since the coefficients that multiply the basis vectors can be any number, we could express every single vector in the linear space with a set of unique coefficients. Now, since we limit the coefficients to be integers only, we can no longer represent every single element of the original linear space using these basis vectors. Suppose we have basis vectors: \(b_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, b_1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) and we would like to represent the vector $ v = \begin{bmatrix} 2.6 \\ 3.1 \end{bmatrix} $. There isn't a valid pair of integer coefficients that allows us to do this. However, here comes the interesting part: since we cannot perfectly represent the vector with a pair of integer coefficients, what about a very close approximation? The problem becomes whether we can find a pair of integer coefficients $ c_0, c_1 $ such that the resulting lattice point has the shortest distance to the target vector. In the example above, the closest we can get is by setting $ c_0 = 3, c_1 = 3 $; the approximation vector is $ \begin{bmatrix} 3 \\ 3 \end{bmatrix} $. This problem is usually known as the Closest Vector Problem (CVP), and it's proven to be a very hard (NP-hard!) problem to solve once we have lots of dimensions and a handful of basis vectors. This problem serves as a prelude to the Learning With Errors problem.

Now we should have a pretty good understanding of integer lattices! Let's switch gears for a bit and talk about the main topic of this post: Learning With Errors. Before we get into the specifics, I want to quickly do a warmup detour - a retrospection of high school algebra. In high school algebra class, we might have encountered problems where we need to solve a system of equations. Namely, we are given a set of equations such as: \(3x_1 + 4x_2 + x_3 = 0\\ 4x_1 + 2x_2 + 6x_3 = 1\\ x_1 + x_2 + x_3 = 1\) In order to solve this kind of problem, we are taught the Row Reduction Method. Basically, the idea is to use one equation to subtract another, ending up with new equations. In the example above, if we subtract the third row from the first row, we obtain $ 2x_1 + 3x_2 = -1 $. Eventually, after we do enough of these row reductions, if the problem is set up correctly, we end up with some pretty nice expressions like $ x_1 = 1, x_2 = -1 $, etc., and we can plug those numbers back into the existing equations to solve for the other unknown variables in the system.

When we systematically learn about Linear Algebra in college, we tend to express these kinds of operations in terms of matrices. For example, the system of equations presented above can be converted into: \(Ax = b\\ \begin{bmatrix} 3 & 4 & 1\\ 4 & 2 & 6\\ 1 & 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = \begin{bmatrix} 0\\ 1\\ 1 \end{bmatrix}\) The row reduction operation, therefore, can be applied to the matrices $ A $ and $ b $ to obtain the solution for the vector $ x $. This method is commonly called Gaussian Elimination. We can formally define the Gaussian Elimination problem as: given a matrix $ A $ and its product $ b = Ax $, can we recover the unknown vector $ x $?
Since solving systems of equations is a problem taught in high school, we all know that it's not really a hard one to solve. If the problem is well-defined, then after a few row reductions, we can easily obtain the results. Any computer can solve a well-defined Gaussian Elimination problem pretty efficiently. However, what if the original system of equations is somehow noisy? Namely, what if we also add a noise vector to the $ Ax $ product, so instead we get something like: \(Ax + e = \tilde{b}\) where $ e $ is some noise vector that has the same dimension as the output $ \tilde{b} $. If we are given the matrix $ A $ and its product with errors $ \tilde{b} = Ax + e $, can we still efficiently recover the unknown vector $ x $? This problem, where we are trying to "learn" the unknown vector $ x $ from its product with the matrix $ A $ offset by some error $ e $, is commonly called the Learning With Errors (LWE) problem.

If you are a careful reader, you might find very interesting similarities between the LWE problem and the CVP problem we discussed in the previous sections. In both, we are trying to find a unique "coefficient" vector $ x $ that best approximates the target vector $ \tilde{b} $; the set of basis vectors in LWE is described by the matrix $ A $. Therefore, if CVP is conjectured to be NP-hard, then LWE can also be considered NP-hard under reasonable parameters, since it is a valid instance of CVP.

Now, let's formally define LWE. First, we need to introduce a set of notations for the LWE problem:

- We view $ \mathbb{Z}_q $ as integers in the range $ (-q/2, q/2) $, forming a finite field.
- We define $ \mid\mid e \mid\mid_\infty $ for $ e \in \mathbb{Z}^m $ as the infinity norm of the vector $ e $ with $ m $ elements. The operation returns the element of $ e $ with the greatest absolute value: $ \mid\mid e \mid\mid_\infty = max_i^m \mid e_i \mid $.
- Finally, we define $ x_B $ as a random variable whose distribution is bounded by a number $ B $. Namely, every sample of this random variable will definitely be at most $ B $ in absolute value: $ \forall x \leftarrow x_B^m: \mid\mid x \mid\mid_\infty \le B $.

Now, here is the formal definition of LWE: \(LWE(n, m, q, x_B): \text{ Search Version}\\ \text{Let } A \stackrel{R}{\leftarrow} \mathbb{Z}_q^{m \times n}, s \stackrel{R}{\leftarrow} \mathbb{Z}_q^n, e \stackrel{R}{\leftarrow} x_B^m.\\ \text{Given } (A, As + e) \text{, find } s' \in \mathbb{Z}_q^n \text{ s.t. } \mid\mid As' - (As + e) \mid\mid_\infty \le B\)

Allow me to paraphrase in my own words again. Basically, in an LWE problem (Search Version), we need to define the dimensions of the matrix $ A $ as $ m \times n $, where $ m $ represents the number of equations, and $ n $ represents the number of unknowns in the system. We also need to define the size of the finite field as $ q $, where $ q $ should be a large enough integer, larger than any of the computations we'll need to do. Finally, we need to define the error upper bound $ B $, so the random error we sample is bounded in a reasonable range. The LWE problem is basically asking us to recover the unknown vector $ s $ using the knowledge of the matrix $ A $ and the product with error $ As + e $.

Notice that we have a lot of parameters here - many more than in the other common problems we encounter in cryptography, such as the Discrete Log Problem in the Diffie-Hellman key exchange protocol. Here is what these parameters are really used for:

- $ n $ is commonly referred to as the security parameter ($ \lambda $). The more unknown variables we have in the system, the harder the LWE problem becomes.
- $ m $ is usually a polynomial of $ n $. Namely, $ m = poly(n) $ and $ m \gg n $. The larger $ m $ grows, the easier the LWE problem gets.
- $ q $ is usually $ poly(n) $ as well. For example, we can set $ q $ to be $ O(n^2) $.
- $ B \ll q $: we want the error bound to be negligible relative to the size of $ q $, so we can properly recover the unknown vector.

Usually, when we use LWE in practice, we predefine $ m, B, q $ as functions of $ n $, so we don't need to manually set all these tedious parameters. Correct initialization of these parameters also guarantees that our LWE problem has a unique solution with high probability. Notice that we keep calling the LWE problem construct above the Search Version. This is because in the problem we are given the matrix $ A $ and the product with error $ As + e $ and asked to recover $ s $. However, to prove security in cryptography, we usually tend to use the Decisional Version of LWE.
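Before moving on, here is what sampling a (search) LWE instance looks like concretely, under toy parameters of my own choosing:

```python
# A toy search-LWE instance (illustrative parameters only; far too
# small to be secure).
import numpy as np

n, m, q, B = 8, 32, 3329, 2
rng = np.random.default_rng(3)

A = rng.integers(0, q, (m, n))
s = rng.integers(0, q, n)              # the hidden vector
e = rng.integers(-B, B + 1, m)         # bounded error, ||e||_inf <= B
b = (A @ s + e) % q

# the search LWE problem: given only (A, b), recover s
```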
The Decisional LWE (DLWE) problem has basically the same setup as the Search Version. However, instead of trying to recover $ s $, our task becomes much simpler: to distinguish between a valid LWE problem instance and some randomly sampled values. \(LWE(n, m, q, x_B): \text{ Decisional Version}\\ \text{Let } A \stackrel{R}{\leftarrow} \mathbb{Z}_q^{m \times n}, s \stackrel{R}{\leftarrow} \mathbb{Z}_q^n, e \stackrel{R}{\leftarrow} x_B, v \stackrel{R}{\leftarrow} \mathbb{Z}_q^m.\\ \text{Distinguish } (A, As+e) \text{ from } (A, v).\) The DLWE Assumption basically states that, given an LWE problem instance $ (A, As + e) $, it is infeasible to distinguish it from a randomly sampled pair $ (A, v \stackrel{R}{\leftarrow} \mathbb{Z}_q^m) $. The reasoning behind the assumption is fairly straightforward: since it's hard to recover the unknown vector $ s $ from the LWE problem instance, the vector $ As + e $ just looks like a randomly sampled vector to us. Therefore, we can't tell the difference even if this $ As + e $ term is actually replaced by a truly random vector $ v $.

Actually, this is not the first time we have seen both the search version and the decisional version of the same problem in cryptography. Similar concepts came up when we evaluated the security of the Diffie-Hellman Key Exchange Protocol. Since the Diffie-Hellman protocol is based on exponentiation of group elements, the general assumption is that performing a discrete log on a group element is hard. Namely, if we are given $ (g, g^a, g^b) $, we are asked to compute $ g^{ab} $. Doing this unavoidably involves extracting either $ a $ or $ b $ from the exponentiation, which is equivalent to doing a discrete log. We call this kind of problem the Computational Diffie-Hellman Problem (CDH). When proving semantic security of the DH protocol, we usually prefer the decisional version - the Decisional Diffie-Hellman Problem (DDH). Namely, in this problem, we are presented with $ (g, g^a, g^b) $ together with either $ g^{ab} $ or a random group element $ g^r $, and our task is to distinguish whether we were given $ g^{ab} $ or $ g^r $. If we cannot efficiently distinguish the two candidates when the problem is instantiated with reasonable security parameters, the DDH problem is considered hard.

However, if we compare the hardness of DDH and CDH, it turns out that solving CDH is normally harder than solving DDH. This is because of a powerful operation called the Pairing operation, which takes two group elements and maps the product of their exponents into another group. (We covered pairings in a previous post.) With the help of pairings, we can easily crack DDH: \(e(g^a, g^b) = e(g^{ab}, g) = g_T^{ab}\) We can simply apply the pairing operation $ e(\cdot, g) $ to the last unknown group element, and compare the result with $ e(g^a, g^b) $. If the last element is indeed $ g^{ab} $, the equality will hold. With the help of pairings, solving DDH is trivial. Therefore, we need to make sure that the Diffie-Hellman protocol is conducted in a safe group where no pairing operations can be done. This special property is kind of annoying, because it gives us a bad lower bound on the problem's hardness. We cannot efficiently relate the hardness of CDH and DDH, and we need to be extra careful when instantiating a key exchange protocol. Therefore, we normally characterize the hardness of DDH by its average-case performance.

Something worth noticing here is that the DLWE problem is just as hard as the SLWE problem.
Something worth noticing here is that the DLWE problem is just as hard as the SLWE problem. This is because both problems rely on the same assumption, that we cannot recover the unknown vector $ s $, and there is no known attack like the pairing operation here. This is an amazing property for cryptographic constructions. Now we can characterize the hardness of DLWE by its worst-case performance!

If you made it to this line, we now fully understand how LWE works. Yay! For the next step, let's put together all the knowledge we have gathered so far and use lattice-based problems to do some cool lattice-based encryption.

One of the most prominent examples of lattice-based encryption schemes is Regev Encryption. It was introduced by Regev in 2005, and it relies on the standard DLWE assumption. Regev Encryption is an "ElGamal"-style public key encryption scheme. Let's see its formal definition:

- $ KeyGen(1^\lambda) $: The KeyGen algorithm first creates an LWE problem instance. Namely, it samples $ A \stackrel{R}{\leftarrow} \mathbb{Z}_q^{m \times n}, s \stackrel{R}{\leftarrow} \mathbb{Z}_q^n, e \stackrel{R}{\leftarrow} x_B^m $ and sets the product with error to $ b = As + e $. We also need to enforce that $ q/4 > mB $ for the correctness property. Finally, it sets the private key $ sk = s $ and the public key $ pk = (A, b) $.
- $ Encrypt(pk, x \in \{0, 1\}) $: The encryption algorithm encrypts a single bit at a time. Just like ElGamal, it first samples a random binary nonce vector $ r \stackrel{R}{\leftarrow} \{0,1\}^m $, then sets the first part of the ciphertext to $ c_0 = r^TA $ and the second part to $ c_1 = r^Tb + \lfloor q/2 \rfloor x $. Finally, it outputs $ (c_0, c_1) $ as the ciphertext.
- $ Decrypt(sk, ct=(c_0, c_1)) $: To decrypt, we first compute $ \tilde{x} = c_1 - c_0 \cdot s $. Then we check whether $ \mid \tilde{x} \mid < q/4 $: if so, we output $ x = 0 $; otherwise we output $ x = 1 $.

The Correctness property of the scheme is pretty straightforward. We can expand the computation of $ \tilde{x} $:

\(\tilde{x} = c_1 - c_0 \cdot s\\ = r^Tb + q/2 \cdot x - r^TAs\\ = r^T(As + e) - r^TAs + q/2 \cdot x\\ = r^Te + q/2 \cdot x\)

Basically, the value of $ \tilde{x} $ is the original bit $ x $ scaled by $ q/2 $ and then overlaid with an error term $ r^Te $. Since every element of the error vector is bounded by $ B $, and $ r $ is a random binary vector with $ m $ entries, the worst-case error is simply $ mB $. Since we enforced $ q/4 > mB $, this error is confined to a very small region relative to the size of $ q $: $ \tilde{x} $ lands within $ q/4 $ of $ 0 $ when $ x = 0 $, and within $ q/4 $ of $ q/2 $ when $ x = 1 $, so the threshold check always recovers the right bit. Therefore, we can guarantee that decryption always succeeds; the error term is bounded by an amount small enough that it never interferes with decryption.

Next, we take a look at the Security property. To prove that this scheme is secure, we need a very powerful tool in cryptographic proofs: the Hybrid Argument. This is really nothing fancy; hybrid arguments let us reach a conclusion via incremental steps. Let's first construct the first hybrid, which is simply the view of an eavesdropping adversary who sees the entire encryption transcript:

\(H_0 : pk = (A, b = As + e), c_0 = r^TA, c_1 = r^Tb + q/2 \cdot x\)

Now, we replace part of the public key and arrive at the second hybrid:

\(H_1 : pk = (A, v \stackrel{R}{\leftarrow} \mathbb{Z}_q^m), c_0 = r^TA, c_1 = r^Tv + q/2 \cdot x\)

One cannot efficiently distinguish $ H_0 $ from $ H_1 $, because of the DLWE assumption. Now, let's transform $ H_1 $ further and arrive at the final hybrid $ H_2 $:

\(H_2 : pk = (A, v \stackrel{R}{\leftarrow} \mathbb{Z}_q^m), c_0 \stackrel{R}{\leftarrow} \mathbb{Z}_q^n, c_1 \stackrel{R}{\leftarrow} \mathbb{Z}_q\)

Here we have replaced all the ciphertext terms with random samples. By the Leftover Hash Lemma, $ H_2 $ is statistically indistinguishable from $ H_1 $. Finally, we chain the hybrids together and conclude that one cannot efficiently distinguish $ H_0 $, the real Regev Encryption transcript, from $ H_2 $, which is just random samples that can be simulated. This proves the semantic security of the scheme.
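To connect the definition with the correctness argument, here is a compact toy implementation in numpy. The parameter values are my own illustrative picks, chosen only so that $ q/4 > mB $ holds ($ 65537/4 > 200 \cdot 4 $); this is a sketch of the scheme, not a secure implementation:

```python
# A toy Regev encryption in numpy, following the KeyGen/Encrypt/Decrypt
# definitions above. Parameters are illustrative and offer no real security.
import numpy as np

def keygen(n=16, m=200, q=65537, B=4, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.integers(0, q, size=(m, n))     # LWE matrix
    s = rng.integers(0, q, size=n)          # secret key
    e = rng.integers(-B, B + 1, size=m)     # bounded error, ||e||_inf <= B
    b = (A @ s + e) % q                     # b = As + e
    return (A, b, q), (s, q)                # pk, sk

def encrypt(pk, x):
    A, b, q = pk
    r = np.random.default_rng().integers(0, 2, size=A.shape[0])  # binary nonce
    c0 = (r @ A) % q                        # c0 = r^T A
    c1 = (r @ b + (q // 2) * x) % q         # c1 = r^T b + floor(q/2) * x
    return c0, c1

def decrypt(sk, ct):
    s, q = sk
    c0, c1 = ct
    x_tilde = (c1 - c0 @ s) % q             # = r^T e + floor(q/2) * x (mod q)
    if x_tilde > q // 2:                    # recenter into (-q/2, q/2)
        x_tilde -= q
    return 0 if abs(x_tilde) < q // 4 else 1

pk, sk = keygen()
for bit in (0, 1):
    assert decrypt(sk, encrypt(pk, bit)) == bit
```

Note how decryption recenters $ \tilde{x} $ into $ (-q/2, q/2) $ before the threshold test, matching our earlier convention for $ \mathbb{Z}_q $.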
To review: we first described what integer lattices are and talked about the hard CVP problem. Then we switched gears to solving systems of equations, which generalizes to Gaussian Elimination. Adding noise to that problem gave us the Learning With Errors problem. We gave a formal definition of the LWE problem along with its parameters. Next, we talked about how to transition from SLWE to DLWE, and why its hardness guarantees are nicer than those of DDH/CDH. Finally, we concluded the post with an introduction to, and proof of, a lattice-based encryption scheme - Regev Encryption.

And now we are here. If you made it through, we have just covered all the fundamental building blocks that will get us to FHE! Since this post is probably already too long, I shall conclude here. In the next post, we will take a look at how to construct a Leveled FHE scheme with noise.

Most of the content in this post was summarized from my notes for CS355. You can locate all the awesome lecture notes from CS355 here. Thanks again to Dima, Florian, and Saba for teaching this amazing course!
