Speaker 2
And so let's dig into how these instructions are transformed into circuits and so on. And the underlying proof system that you're using is from the Plonky3 toolkit. How does that work?
Speaker 1
Yeah. So basically the particular protocol we're using is known as the FRI protocol, and in general we're using the STARKs line of work. STARKs are basically a way of proving these kinds of statements using hash-based cryptography. There's also SNARKs, which people might have heard of, and STARKs are technically a type of SNARK, but usually, popular SNARK protocols include things like Groth16 and PLONKish KZG. Historically, a lot of proving systems have used elliptic curve cryptography. Halo 2 also does this, or the Halo 2 fork that a lot of people in the Ethereum ecosystem use. So there are kind of two major families that are commonly used: elliptic-curve-based cryptography and hash-based cryptography. We do all our stuff using hash-based cryptography, so we use STARKs and FRI.

Underneath the hood, what basically happens is, for each instruction we have a main CPU circuit, where the circuit looks like a bunch of rows in a table. Each instruction is a single row. You can think of a row as approximately the current state of the world when you want to execute that instruction, and the next row is the next state of the world after executing that instruction. For every single instruction, we have a bunch of constraints that say the transition from this current row to the next row needs to follow certain properties, and then we write a circuit to constrain those properties. So for example, for addition, we constrain that the next row's destination register value is equal to the addition of the previous row's input register operand values, and we write some circuit to constrain that addition opcode. So we'll say something like, okay, when the instruction is an addition opcode, then you apply these constraints, and we do that for every single opcode.

Now, this is a really, really oversimplified version of what's going on. In reality, behind the scenes, we have our CPU table, or circuit. If you had to put all of the information for all of the opcodes in that one table, your table would be super wide. It would just have a lot of information, and that's very slow to prove. And what's interesting is you're only ever dealing with one opcode at a time. So if you have all the logic in every single row to deal with every single possible opcode, but you're only using a subset of it, then it's not super efficient. So in practice, what we do is we actually have different tables for every opcode. For example, we have a different table, or circuit, for addition. That table is actually pretty small, it's not that wide: it just constrains that if I have two operands and I add them, it results in a third operand, and it has all the logic and all the related information you need to actually do that constraint. And then we use this lookup argument called the log-derivative lookup argument, also informally called the LogUp argument. What that does is, in our CPU circuit, we just look up that our two operands and the third operand actually form a valid addition. So we look that up in our add table.
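To make the addition example concrete, here is a minimal sketch of what a transition constraint between rows could look like. This is not SP1's actual constraint code; the row layout and field names are illustrative, and a real AIR would express this as an algebraic identity over a small field rather than a boolean check.

```rust
// Illustrative sketch only, not SP1's actual constraint code: a toy CPU table in
// which each row is one executed instruction, and an ADD row forces the next
// row's destination value to equal the sum of the current row's operands.

#[derive(Clone, Copy)]
struct CpuRow {
    is_add: bool,   // selector: this row executes the ADD opcode
    operand_a: u32, // first input register value
    operand_b: u32, // second input register value
    dest: u32,      // destination register value visible in this row
}

/// Transition constraint between consecutive rows. In a real AIR this is an
/// algebraic identity, roughly selector * (a + b - dest') = 0 over the field,
/// rather than a boolean check.
fn add_transition_holds(cur: &CpuRow, next: &CpuRow) -> bool {
    if cur.is_add {
        next.dest == cur.operand_a.wrapping_add(cur.operand_b)
    } else {
        true // vacuously true when the ADD selector is off
    }
}

fn main() {
    let trace = [
        CpuRow { is_add: true, operand_a: 7, operand_b: 35, dest: 0 },
        CpuRow { is_add: false, operand_a: 0, operand_b: 0, dest: 42 },
    ];
    assert!(trace.windows(2).all(|w| add_transition_holds(&w[0], &w[1])));
    println!("ADD transition constraints satisfied");
}
```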
And then the real benefit of this is that we get to save a lot of area. So our CPU table can be pretty thin, because it's just doing these lookups into the particular table that is used for the particular opcode at hand. And I think that's much closer to the reality of what's going on. But yeah, so then underneath the hood, what's happening is you take all these tables and you do this variant of FRI, called multi-table FRI, where you can basically generate one FRI proof for a bunch of these different circuits at once, and then at the end you get one STARK proof for all these different tables that proves your statement.
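For intuition on the log-derivative ("LogUp") lookup mentioned above, here is a small numeric sketch of the identity it rests on: the sum of 1/(alpha - f) over the values the CPU looks up must equal the multiplicity-weighted sum of 1/(alpha - t) over the table entries, at a random challenge alpha. The code is my own toy, not SP1's implementation; a real system would first combine the (a, b, c) tuple of each lookup into one field element with extra randomness.

```rust
// Toy check of the log-derivative (LogUp) lookup identity over a prime field.
// The Baby Bear modulus is used here only as a concrete choice of prime.

const P: u64 = 2_013_265_921; // 2^31 - 2^27 + 1

fn pow_mod(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = ((acc as u128 * base as u128) % P as u128) as u64;
        }
        base = ((base as u128 * base as u128) % P as u128) as u64;
        exp >>= 1;
    }
    acc
}

fn inv_mod(x: u64) -> u64 {
    pow_mod(x, P - 2) // Fermat inverse; x must be nonzero mod P
}

fn main() {
    // Values the CPU table "looks up" (each would really encode an (a, b, c) row).
    let lookups = [5u64, 9, 5, 17];
    // Add-table entries with multiplicities matching how often they are looked up.
    let table = [(5u64, 2u64), (9, 1), (17, 1)];

    let alpha = 123_456_789u64; // stands in for a random verifier challenge

    let lhs = lookups
        .iter()
        .fold(0u64, |acc, &f| (acc + inv_mod((alpha + P - f) % P)) % P);
    let rhs = table.iter().fold(0u64, |acc, &(t, m)| {
        let term = (m as u128 * inv_mod((alpha + P - t) % P) as u128) % P as u128;
        (acc + term as u64) % P
    });

    assert_eq!(lhs, rhs, "log-derivative sums must match");
    println!("LogUp identity holds: {lhs} == {rhs}");
}
```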
Speaker 2
That makes sense. And then you guys also do this sharding thing, right? For programs that are particularly large.
Speaker 1
Yeah, exactly. So if you want to run a program that is really, really long, there's kind of a limit to how long your table can be. You couldn't have, like, a billion rows in your table, because in the protocol, what happens is there's an FFT step that's N log N. So you'll run into very practical limits very quickly, where you'll just run out of memory on your CPU and stuff like that. But sometimes we'll want to prove statements that have a billion rows, like if there's a billion instructions being executed. That's actually pretty common. For example, if you want to prove the execution of an Ethereum block, it can generally be a couple hundred million instructions, to even a couple billion instructions in some cases. And we want to be able to prove programs of arbitrary length.

So what we do is we break up each program into shards. We first execute the program, we'll get a billion instructions, and we'll chunk them into shards of roughly four million instructions each. Then we'll generate proofs of each shard's execution, which is pretty simple. It's just like, okay, the program counter at the beginning of this shard was this, we just run all the instructions, and at the end, the program counter was this. And we'll do that for every single shard. Then we'll combine all the shard proofs together. So each shard gets its own proof, and then we take two shard proofs and use this recursion system where we verify those two proofs in a STARK itself. We verify that the last program counter from your first shard is equal to the start program counter of your second shard. And so if those two proofs are both true, then you have a proof that covers the first program counter of the first shard through the last program counter of the next shard. So it covers the whole range, and you get one proof. Basically, you take two STARK proofs, you verify them both in a STARK, and at the end you get one STARK proof. And we do this for many proofs. Sometimes in certain programs we'll have, like, 100 shards or 1,000 shards. You can do this in a tree-like structure: you just go two at a time, you get the next layer of your tree, you do it again, and at the end you end up with one STARK proof for the full execution of the entire program.
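As a toy model of the sharding and aggregation just described: chunk the execution, treat each shard proof as a claim about a start and end program counter, and fold proofs two at a time while checking the counters line up. The types and the `aggregate` function are illustrative stand-ins, not SP1's API, and verifying a STARK inside a STARK is obviously far more involved than this.

```rust
// Toy model of sharded proving: pair up shard "proofs" in a binary tree,
// checking that each shard's end program counter matches the next shard's start.

#[derive(Clone, Copy, Debug)]
struct ShardProof {
    start_pc: u64,
    end_pc: u64,
}

/// Stand-in for verifying two STARK proofs inside a STARK and emitting a new
/// proof that covers both program-counter ranges.
fn aggregate(left: ShardProof, right: ShardProof) -> ShardProof {
    assert_eq!(left.end_pc, right.start_pc, "shards must be contiguous");
    ShardProof { start_pc: left.start_pc, end_pc: right.end_pc }
}

fn main() {
    // Pretend the program executed 16 instructions, chunked into shards of 4.
    let shard_size = 4u64;
    let mut layer: Vec<ShardProof> = (0..4)
        .map(|i| ShardProof { start_pc: i * shard_size, end_pc: (i + 1) * shard_size })
        .collect();

    // Reduce the shard proofs two at a time, like the layers of a binary tree.
    while layer.len() > 1 {
        layer = layer
            .chunks(2)
            .map(|pair| if pair.len() == 2 { aggregate(pair[0], pair[1]) } else { pair[0] })
            .collect();
    }

    println!("final proof covers pc range {:?}", layer[0]);
}
```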
Speaker 2
Cool. It's wild that this stuff is real.
Speaker 1
Yeah, it was a lot of pain to implement. It's very difficult. But actually, for the recursion thing, what's interesting is, normally you could think, okay, how would you implement the recursion thing? Maybe the most naive way is you just write a Rust program to verify two STARKs, and then you prove that program, and at the end you get a new proof. You're basically using SP1 on itself recursively, kind of. Now, that actually does not work, because when you try to do that, there's way too much overhead. In the RISC-V instruction set, it's going to be way too many cycles to actually do that.

So in practice, what we do is we made a new VM. We made our own instruction set that's specialized for recursion, that is, specialized for verifying STARK proofs. We actually have a recursion VM, basically another VM that verifies two of these STARK proofs, and we wrote a program in our own VM. So then we had to build a baby compiler: we write a STARK verifier in our own DSL, compile it down to our own recursion ISA, and then we had to write a bunch of constraints and logic to constrain our own recursion VM. So not only did we build a RISC-V VM and all the circuits for that, we also had to come up with our own recursion ISA, build a DSL to write programs in it, build a compiler to compile to the recursion ISA, and then a whole bunch of circuits to constrain the recursion VM and recursion ISA to then do recursion. So yeah, that's actually very, very difficult as well. But yeah, that's the whole system.
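Purely for intuition about the "own ISA plus DSL plus baby compiler" idea, here is a toy in the same spirit. The opcodes, builder, and interpreter below are invented for this sketch and bear no relation to SP1's real recursion ISA; the point is only that the verifier program becomes data in a small, purpose-built instruction set rather than RISC-V machine code.

```rust
// Toy recursion ISA and "DSL" builder (invented names, not SP1's).

#[derive(Debug, Clone, Copy)]
enum RecOp {
    Const(usize, u64),        // regs[dst] = imm
    Add(usize, usize, usize), // regs[dst] = regs[a] + regs[b]
    AssertEq(usize, usize),   // constrain regs[a] == regs[b]
}

/// A tiny "DSL": a builder that emits recursion-ISA instructions.
#[derive(Default)]
struct Builder {
    program: Vec<RecOp>,
}

impl Builder {
    fn constant(&mut self, dst: usize, imm: u64) {
        self.program.push(RecOp::Const(dst, imm));
    }
    fn add(&mut self, dst: usize, a: usize, b: usize) {
        self.program.push(RecOp::Add(dst, a, b));
    }
    fn assert_eq(&mut self, a: usize, b: usize) {
        self.program.push(RecOp::AssertEq(a, b));
    }
}

/// Stand-in for the recursion VM: execute the program and check its assertions.
fn run(program: &[RecOp]) {
    let mut regs = [0u64; 8];
    for op in program {
        match *op {
            RecOp::Const(dst, imm) => regs[dst] = imm,
            RecOp::Add(dst, a, b) => regs[dst] = regs[a].wrapping_add(regs[b]),
            RecOp::AssertEq(a, b) => assert_eq!(regs[a], regs[b]),
        }
    }
}

fn main() {
    // A toy "verifier" program written via the builder: compute a value and
    // constrain it to equal an expected constant.
    let mut b = Builder::default();
    b.constant(0, 11);
    b.constant(1, 31);
    b.add(2, 0, 1); // regs[2] = 42
    b.constant(3, 42); // expected value
    b.assert_eq(2, 3);
    run(&b.program);
    println!("toy recursion program passed");
}
```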
Speaker 2
That's wild. So you're also using Plonky3, though. So where is the interface between what you build yourselves and what you're pulling in from an external library?
Speaker 1
Yeah, so Plonky3 is this really great library built by the Polygon team, in particular Daniel Lubarov and the Polygon Zero team at Polygon. Basically, it's an implementation of FRI, in particular the multi-table FRI. So we use Plonky3 as our implementation of multi-table FRI. We did all the constraining, we did all the recursion stuff, and then when we actually want to run the FRI protocol, we call Plonky3's methods for that. Yeah.
Speaker 2
And so one thing about this general field is that it's evolving very quickly, right? The performance characteristics are continuously improving. So what's involved with you upgrading your proof system over time, if you want to do that?
Speaker 1
Yeah, so the field does, ZK does move really quickly, which, if you're building a system, is hard: it's hard to balance taking your existing system and squeezing all the performance you can out of it versus trying to make a bold bet of, okay, we're going to upgrade to the latest stuff. I will say, within the field, I feel like there has been kind of a consolidation around small-field, hash-based schemes. Historically there have been a lot of teams working on these elliptic-curve-based schemes, and the challenging part with elliptic curve stuff is you generally need to work over a very large field. Basically, your circuit is over this 252-bit or 254-bit field, which is pretty large. And the reason a large field is bad is that in your normal CPU, like your laptop or whatever, arithmetic generally happens over 32 bits or 64 bits. When you're trying to run operations over a very large field that's, like, 252 bits, it's very slow to do on a normal CPU or even a GPU.

So there has been a trend for a while where people are just making the field smaller and smaller, and that results in really great performance. For example, going from 252 bits to 31 bits, you just divide those numbers: it's kind of like an 8x. It's not exactly correct, but basically you can think of it as, oh, I made my field 8x smaller, okay, my proving performance goes up by 8x. And so our proof system is over the Baby Bear field, which is a 31-bit field. Now, one thing is you can't do elliptic curves over a 31-bit field. Thankfully, the StarkWare people have been pushing this line of work with hash-based schemes, and you can do hash-based cryptography over the small field. So that's the technique that us and a lot of other people in the industry are converging on. For example, zkSync, Polygon zkEVM, RISC Zero, Valida, the Plonky2 library: a lot of people have been using small fields with hash-based cryptography. So... or yeah, I guess I didn't answer your question.
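A quick sketch of the "small field" point, under my own framing: a 31-bit Baby Bear multiplication fits in one native widening multiply, while an element of a ~254-bit field (such as the scalar field used by many elliptic-curve SNARKs) has to be spread across several 64-bit limbs, so every field operation becomes multi-word bignum arithmetic.

```rust
// Why a 31-bit field maps nicely onto ordinary CPU arithmetic.

const BABY_BEAR: u64 = 2_013_265_921; // 2^31 - 2^27 + 1

/// One Baby Bear multiplication: a single widening multiply plus one reduction.
fn mul_baby_bear(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % BABY_BEAR as u128) as u64
}

/// A ~254-bit field element needs multiple machine words; schoolbook
/// multiplication alone is 16 64x64-bit multiplies before any reduction.
type Fp254 = [u64; 4];

fn main() {
    println!("{}", mul_baby_bear(1_000_000_000, 1_999_999_999));

    let x: Fp254 = [1, 2, 3, 4];
    println!(
        "a ~254-bit element occupies {} bytes vs 4 bytes for a Baby Bear element",
        std::mem::size_of_val(&x)
    );
}
```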
Speaker 2
So the general architecture will remain the same. Well, there's this tension of squeezing as much as you can out of what you're building around the proof system. But even if you want to upgrade it, it seems like the field is generally converging on one architecture, which maybe makes it easier to swap this out in the future.
Speaker 1
Yeah, I think another thing is, for example, we use Plonky3, which is an open source library, and there's also a bunch of consolidation around that. And SP1 itself is also very open source: the proof system is fully open source, and I think we've tried to make it a pretty nice developer experience. We've actually had a bunch of external teams fork it. So for example, the Lurk team, they have a new name now, I think it's called Argument: they forked it, they added their own precompiles, which maybe we'll talk about in a little bit, and they've contributed back to SP1. So you kind of get these open source Schelling points around Plonky3 and SP1, and that also, I think, helps with upgrading. For example, there's this new hash-based proof system called Circle STARK that uses a different field than the current one we use, and Plonky3 is implementing it. When Plonky3 implements it, you could imagine that we're just able to pull it into SP1 and upgrade to it. That's, I think, something I find really magical about crypto: everything's open, and you can have these really positive open source collaborations that you don't really see happening in Web2 or other industries. And so I'm also hopeful that that's something that can help us be at the bleeding edge of all this stuff and collaborate across ecosystems to have all of us be at the bleeding edge.
Speaker 2
Right. Like if a lot of teams are using SP1 in their applications, then a new proof system that wants distribution would go and do the work themselves of getting integrated into SP1.
Speaker 1
Yeah, yeah, exactly.
Speaker 2
I was talking with my friend Zach Obront earlier, who I think works with you on a project. And before we get into the ZK rollup stuff, which I do want to talk about, one of the things he said was that it's been really impressive to see how you've improved the performance by orders of magnitude. How have you been able to do that? Maybe this ties into some of the precompile stuff, but what's an intuition around where you're looking for those improvements and how you're iterating on the system?
Speaker 1
Yeah, so one of SP1's biggest innovations was the system of precompiles. What a precompile is, you can kind of think of it philosophically as a way to have "ZK assembly," and again, that's in air quotes. You have a very specialized circuit for a certain very expensive operation. So for example, the Keccak hash function, or the SHA hash function, or verifying an ECDSA signature: you have a specialized circuit for doing that operation, instead of paying a bunch of instructions in your CPU. For example, for verifying an ECDSA signature, if I were to just write it in normal Rust, I think it would take a million cycles. With our system, you have this specialized circuit for verifying an ECDSA signature that does a lot of the heavy lifting. It inlines a lot of memory access, it does a lot of stuff that's really specialized, and then what happens is you take those million cycles and turn them into 40,000 cycles. So that's like a 20x.

Intuitively, you can think of it as very similar to assembly. That's kind of why I made that analogy: normally, even when you and I write normal programs, we just write normal code, but underneath the hood, oftentimes even for cryptographic libraries, if you look at RustCrypto or any implementation of any cryptographic library, they'll have a lot of pretty ugly assembly code that's specialized to your CPU architecture and makes use of vector instructions for that particular computation. Since it's so sensitive and gets called so many times, you really hand-write it to squeeze every last bit of performance out. You can think of our precompile system as doing basically exactly that: we hand-wrote a circuit very specialized to these operations and got it working. That's our precompiles. And in practice, the reason why SP1 is so fast, often by an order of magnitude, is that these precompiles are super, super effective. For example, a program that normally would take two billion cycles will take like 300 million in SP1 when you use precompiles. So that's, you know, almost 10x right there. And generally, when you have 10x fewer cycles, it's also 10x less proving time.

I think one of our key insights, and I often think some of the best insights sound so obvious in retrospect, but when we came up with this idea, it wasn't even clear to us that having these precompiles would be super worth it. We were kind of just like, well, let's implement the precompiles, because for most of the use cases we're interested in, like bridging and rollups, most of our CPU cycles are spent in these hash functions and signature verification. And the rest of your business logic, like, oh, is your nonce equal to your previous nonce plus one and stuff like that, is relatively less expensive in terms of instructions. So we implemented these precompiles. And then I remember there was a day where we were testing our precompiles, and there was this one program where we literally saw the cycle count go from 800 million to, like, 30 million. And we're like, oh wow, this is super effective. But when we decided to build our architecture in that way, it was not super obvious to us exactly how effective it would be. We thought it might be, like, a 2 to 3x.
And then in many cases, it was like a 10 to 20x. So I think that even exceeded our own wildest expectations. In retrospect, you're kind of like, oh, this is super obvious: even on a normal CPU, when you run Ethereum blocks, a lot of normal CPU time is actually spent doing the Keccak function. So it also makes sense in a ZK context that you'd want to write this ZK-specific assembly and hook it up to your main RISC-V part. But I don't think that was very apparent to many people, because no one had really built a system like that before.

And also, getting it working from a cryptography perspective is pretty difficult. There were certain novel algorithms and techniques we had to come up with. We have this global shared randomness system that basically no one else has, which is pretty innovative. Basically, to make the precompiles actually work, we have this two-phase prover, where in phase one we go through the whole computation, generate it, and then hash all of it to get a shared global challenge that we can then use across our entire computation in phase two. And that's needed to make everything cryptographically work. So we have this two-phase prover system to make all this precompile stuff work. It's not so easy to get working in an efficient way, and a lot of the engineering time we spent was to get this precompile stuff to work. But yeah, I think now you'll see a lot of people talk about precompiles, and a lot of people are adding precompiles to their systems. At this point it's kind of become super obvious that this is the way ZKVM performance is going to match the performance of circuit-based systems and actually be practical, and a lot of people are doing it now. But even at the time, even when we started, it was not obvious to us. And I think it's been super magical to see it play out and work super well.
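To illustrate the precompile pattern from a guest program's point of view: instead of executing a hash function as hundreds of thousands of ordinary RISC-V instructions, the guest hands the whole operation to one specialized circuit through a single syscall-like hook. The function and syscall names below are invented for illustration and are not SP1's actual API; they only sketch the shape of the pattern described above.

```rust
// Hypothetical sketch of the precompile call pattern (invented names, not SP1's API).

/// Pure-software path: every instruction of the hash becomes rows in the CPU table.
#[allow(dead_code)]
fn keccak256_in_software(_input: &[u8]) -> [u8; 32] {
    // ... a full Keccak permutation written in ordinary Rust would go here ...
    unimplemented!("stands in for the expensive, instruction-by-instruction path")
}

/// Precompile path (hypothetical interface): one call that the zkVM would route
/// to a dedicated Keccak table, costing a small fixed number of CPU-table rows.
fn keccak256_via_precompile(input: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    hypothetical_keccak_syscall(input.as_ptr(), input.len(), out.as_mut_ptr());
    out
}

/// Stub standing in for the zkVM-provided hook; a real guest would call the
/// zkVM's own syscall or use a patched crate instead.
fn hypothetical_keccak_syscall(_ptr: *const u8, _len: usize, _out: *mut u8) {
    // no-op in this sketch
}

fn main() {
    let digest = keccak256_via_precompile(b"hello");
    println!("digest bytes: {:?}", digest);
}
```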