CROSS — Leveraging AI ASICs for Homomorphic Encryption
Microsoft Research
0:03 Welcome everybody.
0:04 It is my pleasure to introduce today Yang Ming Tong.
0:08 He is a PhD student at Georgia Tech,
0:11 a computer architect that focuses on uh the intersection of AI and cryptography.
0:16 And today he's going to present the talk
0:18 titled cross leveraging AI A6 for homorphic encryption.
0:23 Yangming the floor is yours.
0:26 Thank you so much Marsher.
0:27 And then today I'm super thrilled to in front
0:30 of the cryptographic prestigious experts in the field and I got
0:34 to introducing two line of my work cross and a gift
0:38 morph which is recently accepted at the automation conference this year.
0:42 So before we get started let me briefly introduce who I
0:45 am and what I work on and specifically my research overview.
0:48 I'm a computer architect and I'm focusing
0:50 on enabling AI systems for cryptography and I'm
0:54 currently a fifth year PhD at computer science
0:55 of Georgia Tech and largely advised by Dr.
0:58 Tucha Krishna and I'm mainly known for developing two projects and one is cross
1:03 and what I got to talk about today the other one is a reconfigurable AI systems
1:08 that make AI to reconfigure itself to be
1:11 a dedicated ASIC for different workloads and I
1:13 got to give tutorial for both of them in the upcoming plus 2026 at peace.
1:19 So if anybody come by feel free to say hi to me
1:21 and we will invite you guys to there so that we will have
1:24 a great audience in elevating the overall level and this work give me
1:29 the machine learning and mel common ring star and a couple other source.
1:33 So overall my result could be captured by this single
1:36 slice and design systems for AI and cryptography.
1:40 Today AI systems take data and model in the raw format.
1:45 So we have this particular work number one workload
1:49 and this problem cannot support private data like genome
1:53 or close source private model like charge because
1:56 this na raw format of the data will leakage everything
2:00 and to protect the privacy private data and model
2:03 could be encrypted and then go through encrypted computation
2:07 and this is enabled by homorph encryption and to support
2:11 both raw computation as well as encrypted computation I help build
2:16 a systemuler called sushi and a compilation framework called
2:19 cross and both colling these two line of workloads
2:24 into unified size of linear algebra and the linear
2:28 algebra will including mid multiplication
2:31 vector arithmetic and memory reorganization
2:34 and this unify of linear algebra could directly run
2:37 on today's AI system such as Google's TPU Amazon Freom
2:42 and Avidia's GPU as well as Microsoft AMIA And to study
2:47 the performance of this AI system running on this workload,
2:50 we built a simulator called skill sim.
2:52 And within the skill sim,
2:54 we could model the architecture of different today's AI system power.
2:58 And then it reveals a key insight.
3:01 That is when the workload has various ship.
3:04 For example, maybe mobileation call with different ships
3:07 and they might underutilize the available capation in the given
3:12 AI hovers and therefore to make it more
3:15 performant for wide range of workloads of different ships.
3:19 I designed two lines of reconfigurable AI systems
3:23 and these reconfigurable AI systems could reconfigure their data
3:27 paths as well as memory layout for each individual
3:31 workload and this could enable them to achieve full
3:34 compute utilization and the workload of all shapes
3:38 and further to enable this automation compilation I built
3:41 a layout loop framework and layout loop could search
3:45 the optimal data flow and layout for individual workload.
3:48 Among this work, Sushi has won
3:52 the Qualcomm innovation fellowship and cross has deployed
3:55 in Google and won the second place
3:57 in university demo at design automation conference last year.
4:01 It also wants a GT next award and startup launch.
4:05 Further layout loop has been integrated into Nvidia MVL
4:08 labs and the fact is a collaboration work with IBM.
4:13 So let's first with this as over mind
4:16 and go through this overview today and today I got to cover how do we
4:22 using existing AI systems to accelerate cryptographic applications.
4:27 So we start with AI model because AI model
4:30 is a most common work in today's data center.
4:33 So we using their competition demand as unit and using their latency as unit.
4:39 So we will have three lines of cryptographic application.
4:42 First privacy presuming AI which use homomorphic encryption to encrypt both
4:48 data as well as the model to provide holistic privacy guarantee.
4:53 Second cryptocurrency.
4:55 It relies on zero knowledge proof
4:57 primitives to offer both encryption and verification.
5:01 And third applying zero knowledge proof encrypted AI
5:04 computing will give us a verifiable encrypted competition.
5:09 All these applications will increase
5:11 the computational demand and latency exponentially.
5:15 And in this talk I will through very demonstrate you two line of work
5:21 cross and morph which could lower
5:23 this exponentially increased latency to roughly super linear.
5:28 Let's start with cross first and cross
5:31 is a joint work with Georgia Tech, Google,
5:34 MIT and Canal and we want to give
5:37 a special credits and appreciation to the professor
5:40 Arvin who have been able to see this work published before uh he passed away.
5:45 So but we are super appreciate and uh appreciate for his support and insights.
5:52 So this cross is a number one work which enables AI as for homorphic encryption.
5:59 They give you immediate serving of the homop
6:02 encryption in real time under Google's TPU.
6:05 So what do we do?
6:06 Let's starting with the background.
6:09 The reason why we starting with this new generation of H
6:12 accelerated on today's A system come from AI's new industry revolution.
6:18 Specifically, artificial intelligence is driving a new industry revolution.
6:22 It could transform chatbot, robotics, autonomous vehicle, AI coder,
6:27 biology, protein discovery, and AI factory all into digital tokens.
6:33 In other world, the entire world is
6:36 being digitalized and tokenized for AI assistance.
6:40 However, such revolution has been blocked by the privacy concern.
6:45 For example, Samsung has bunded chatbt among
6:48 employees after they see the code leakage.
6:51 Grock accidentally released the users conversation through a public
6:56 search tool and this is three years back we
7:00 delayed the AI features just because we want
7:02 to keep the user data on device for review later.
7:06 So these ones protect the privacy into a power bound requirement before
7:13 we release the AI feature and that gave us the following question.
7:17 How to protect the privacy of both model and user data.
7:21 To answer this question we need
7:23 to understand where this privacy concern come from.
7:26 First from model side today's most advanced model
7:30 such as charge uh orace cloud and also perplexity
7:35 these model are proprietary and their training data as well
7:39 as their data analytic tools are all clone source.
7:42 These closed source services cannot share all of their own boundary.
7:47 This creating the right hand side model privacy boundary shown in red.
7:52 On the data side, we see the data from healthcare,
7:54 finance or government are also private data and these data never able to leave
7:59 their agency as well and this creating
8:01 the left hand side orange data privacy boundary.
8:04 Such two sides boundary creates the significant privacy gap to use the right
8:08 hand side private model to serve the left hand side private data.
8:12 And as we all know here home of encryption could
8:15 protect the privacy by enable directly competition on the encrypted data.
8:19 So we could encrypt your model,
8:21 set it to the untrusted infrastructure and encrypt your private input locally
8:25 within the data privacy boundary and only send out the encrypted cipher text.
8:29 So that within this untrusted infrastructure we could using this encrypted model
8:33 to serve this encrypted input to generating encrypted results and these results
8:38 will come back to the user and only the user who
8:40 has a secret key could decrypt it and get a meaningful value.
8:44 This is great but because it secure everything but it is very slow
8:48 and we all know the sec the slowness come from both compute and memory overhead.
8:54 If we take the rest 18x as example
8:57 encryption will boost it by 8,000 time and compute
8:59 by 800 times and this combined will lead
9:02 to a 10,000x slower latency and that make it impractical.
9:09 As past few years entire community has focusing
9:13 on designing new hardware and compilation tool chains to reduce
9:17 this growing complexity and with few examples show
9:20 in the middle but building this new chips and new compilation
9:24 flow will take years of time and take
9:26 a lot of money and that's why we didn't see
9:29 available product for people to use today and therefore
9:33 at this project we want to enable immediate H acceleration
9:39 So we're using existing AI A6 such as Google's
9:43 tensor processing unit to accelerate home encryption instead of building
9:47 new hardware and we build a transparent means we
9:52 converting the encrypted data into kernels that run very
9:56 efficiently on today's AIC which we will cover
9:59 in this talk and it could lowering the years of time
10:03 into immediate and then saves you million of dollars
10:06 because we don't need to fabricate any new hardware.
10:10 Now it comes to a fundamental question.
10:12 Why we use AI6 such as Google's TPU instead of other commodity hours like GPU?
10:19 To answer that question,
10:20 let's deep dive into this TPU to see why it has in the high level.
10:26 It give you better performance per watt.
10:31 We categorize various chip shown on the right figure where the horizontal
10:36 axis indicating the power where the vertical axis indicated the throughput.
10:41 So in this right hand side throughput power plot the diagonal
10:46 line means throughput divided by power or so called power
10:50 per watt or energy efficiency and upper left means better
10:56 and further we categorize each chip based on their technology nodes.
11:01 For example, the light blue indicating the 7
11:04 nanometer and the light yellow means 456
11:08 nanometer and green means the advanced technology nodes
11:11 such as 3 nometer or even below it.
11:14 As you can see, AI A6 always give better performance
11:18 per watt than other commodity hours like GPUs and FPGAs
11:23 and this enables more chip to be fitting within the data
11:27 center and they give you higher throughput as with this strong motivation.
11:34 Let's now deep dive into what a AI6 such as Google TPU look like.
11:41 In the high level tensor processing unit from Google TPU give you three
11:46 features significant memory significant model compute
11:50 as well as wide range of the reordering.
11:54 Let's start with the memory.
11:57 TPU has massive memory.
11:59 First different generation of TPU has hundreds of gigabytes
12:03 of HPM and hundred of megabytes last level cache.
12:09 Of course go ahead.
12:10 So um are like you can't buy these TPUs right
12:14 these are totally uh closed by Google isn't that right
12:20 now uh before two years ago that is the truth and now people
12:24 has officially enabled this leader AI company to buy it for example anthropic
12:28 has signed a deal of 1 billion track contract with Google to using
12:32 their TPU and then set up the TPU cloud in their own infrastructure
12:37 oh wow uh okay so that's That's how you know like what the architecture is.
12:42 Oh, I see.
12:43 So the this architecture information is public.
12:47 Okay.
12:47 Huh.
12:48 Interesting.
12:49 Do do you know why they make it public?
12:51 Like why would it be?
12:54 Uh I think there are a couple of the papers.
12:56 So this is a summarize of the knowledge that got captured from all the public
13:00 uh paper from Google and you know I was interning at Google as well.
13:03 So I do know some internal information and then it turns out
13:06 to showing that this public information has
13:08 the same information as internal as well.
13:10 So it
13:11 that's why we're creating this slide to uh
13:13 serve as encapsulate of all generation of TPU.
13:16 Okay, cool.
13:17 Thanks.
13:18 Of course, there's a new features from TPU
13:19 as well which I didn't cover but you know these are uh maybe out of the scope
13:23 for today's talk because we didn't utilize these new features.
13:27 Cool.
13:27 Yeah.
13:30 Cool.
13:30 So this hundreds of megabytes of last level cache
13:33 will be shared by one or two tensor cores.
13:37 And therefore we call this last level cache as common memory or CMA.
13:42 And within each tensor core we have tenth of megabytes of local scratch pad.
13:48 And this t of megabytes of local scratchpad
13:51 is organized into a vector of 128 lengths.
13:55 And so we call them a vectorized memory or we man.
14:00 And within each lane there are eight saplings and each
14:04 sapling has a local 32 instance of 32bit register.
14:10 And this give us a,024 local register files for each tensor core
14:16 and this give us in total hund of the kilobytes k local registers.
14:22 So putting all the memory in summary,
14:25 TPU of various different generations give you
14:28 around 800 GB of offchip HBM as well
14:32 as a hundreds of G megabytes of onchip memories
14:36 and further TPU also offers massive compute and each
14:42 subling has two local ALUS and in total each tensor core will have 284 AUS
14:50 and these ALUS are designed to process vectorized
14:54 operations or whacked up as shown on the right.
14:58 And so we call this a bunch of alou together as WPU.
15:03 And all these 2024 48 alus in the WPU work in the lock step.
15:09 That means whenever you start launching some workload into the chip,
15:13 you have to utilize at least one ALOU from all lens and all subs
15:19 together or in other words every single instruction have to trigger a cmd of 128
15:27 by 8 alus and this creating the granularity that we have to leverage
15:33 to uh from the TPU to run on the realistic workload and this WPU itself
15:38 offer plus one to 10 32bit integer terovs and further each tensor core also
15:47 contains four dedicated compute engines for matrix
15:50 mlication and we call them as MXQu
15:54 each MXU is a 128x 128 sysol array which is designed to process large
16:01 missions mlication and this give us a 100
16:05 to a,000 8bit terops for matrix multiplication.
16:11 So if you looking at the top props
16:13 difference from the matrix multiplication engine and vectormatic engine,
16:17 you could observe one thing the throughput for low
16:20 precision matrix mlication engine is about 100x higher than
16:25 the high precision vector arithmetics and therefore we do want
16:29 to use MXU as much as possible for higher performance.
16:35 And lastly, TPU also have very abundant bunwise for doing flexible reordering.
16:43 Specifically, it has a dedicated unit called cross lane unit
16:47 showing on the right and which is also noted as SLU.
16:51 And this unit could take the data from a vector
16:54 of the VM and then reorder them and send it back.
16:59 In other words, this could do data reorganization
17:03 at the granularity of VR rag level and a reg
17:08 here means all single register file from all
17:12 the lanes together with all sublanes have to be issued
17:15 in the same manner as well and that's why
17:17 we group one register file from all sublane together
17:21 into a VR that means every single data reorganization have
17:26 to perform at the granularity of 128 by8 32-bit registers.
17:33 So putting all together TPU has massive memory seal
17:38 of compute as well as abundant B wise and these features
17:43 enable super fast primitives such as 32-bit vectorzed addition 32-bit
17:48 vectorzed multiplication 8bit mid mlication as well as 32-bit transpose.
17:55 But the question is how do we use them for H?
17:58 In order to understand this question we
18:00 have to go through understanding the background
18:02 of that and then given the audience here are super expert on H.
18:05 So we will make it very short.
18:08 First H will require a vector of floating point
18:12 value if we using for CKS scheme and this 32,000
18:17 long vector data will be first encoded into a massive
18:21 polinomial with tens of thousand degree which we
18:24 call as plantex and the degree of this is
18:27 twice as the input one length of the vector
18:31 and further this plantex is then encrypted into a pair
18:35 polomial of the same degree which call as suffer tax.
18:39 And now let's look inside a sephertex.
18:42 Every coefficients in this suffer tax is a integer in the finite field.
18:47 That means there is a limited range of the valid integer choice for you to pick.
18:52 For example, if we have a modulus Q equal to 14,
18:56 then we could only take integers from 0 to three uh 0 to 13.
19:01 If a calculation results in a value that overflow
19:04 this limit for example 12+ 4 which exceeding the modus value
19:08 of 14 we need to call on a model reduction
19:11 to break it back within the range and further this restriction does
19:16 not stop at a coefficient level and each polinomial themselves
19:21 is living in a ring and just like coefficients has
19:24 a limit the polinomial has also a limit number of the degrees
19:29 and that degree value is usually a power of two.
19:32 So like here we show you a n equal to 16 as an example.
19:38 If polinomial is a ring that means when the polinomial
19:42 exceeding their hyper limit of the degree value such as 16
19:45 here we need to call on polinomial reduction to reduce
19:49 it by the polinomial modus to bring it within the range.
19:53 In short, H will bring both integer and polomial
19:57 modular reduction to both vectorized operations as well as matrix
20:02 multiplications and further even with RNS H also lifts
20:07 the coefficients to 28 to 59 bit finite few integer.
20:12 If we put the workload and compute together
20:15 then there is clearly compute and memory gaps.
20:19 First mod reduction.
20:21 H require model reduction but the AI A6 do not have it.
20:26 Second insufficient precision.
20:29 Even with RNS H will still require 28 to 59 bit integer
20:34 but AI6 only support lower precision such
20:38 as 32-bit vectorzed arithmetic and 8bit mlication.
20:44 And on the memory part, H will require a data transpose or a data shuffling.
20:50 Imagine that we got to do transpose or shuffling
20:52 of a tens of thousand degree long vector.
20:56 Putting the memory recognization at such a big scale is
20:59 very expensive and therefore we want to get rid of them.
21:03 So in cross we propose two compilation techniques.
21:07 business align transformation to bridge the compute gap
21:11 and memory align transformation to bridge the memory gap.
21:15 The insight here is homop encryption has pre-nown parameter such as the twitter
21:20 factor entity and fixed computation graph and this allow us to partially
21:27 offload the h workloads into the compile time and then we can
21:32 transform the remaining computation into the kernel that a are very good at.
21:40 So let's first talk about how basis line transformation bridge
21:43 the compute gap and we will short the basis align transformation
21:47 as bat given this 32bit vectorzed model multiplication a* b
21:54 model reduced by q into z where we show each individual
21:58 value here as 32 bit for example a b q
22:01 are all 32 bit as an example and in realistic cross
22:05 could tackle arbitary precisions and the basis alignment could lower
22:11 this 32-bit vectorzed modular reduction based
22:15 mlication into two sides of kernels.
22:18 First, low precision mission mlication and second 32bit vector addition.
22:26 There is no model reduction needed at all.
22:29 And within these two kernels, the first mission multiplication performs a* b
22:34 and the second vector addition performs model reduction.
22:39 Let's first dive into how a* b is
22:41 converting into the first a bit machine multiplication.
22:46 We can take a pair from this 32-bit vector model multiplication.
22:51 For example, we take the first value a z have b 0 into c 0 and each of them
22:57 uh a a z and b 0 are 32 bit and the result c will be 64 bit.
23:03 So our goal is to converting this high
23:05 precision scal mlication into low precision metric multiplication.
23:10 Each high precision value want to be computed in the low precision matter.
23:16 So we could just nally do a chunk decomposition.
23:19 For example, a 32bit input a z could be break into four a bit
23:24 chunks as shown here where we're using
23:27 the superscript to indicating the chunk indices.
23:31 For example, a z upper three means the most significant bite.
23:36 Well, the a z upper zero means the lowest or least significant bite.
23:41 And we could apply the same decomposition for b 0 and c 0 as well.
23:46 If we do this chunk decomposition,
23:48 the original 32bit high precision scalar multiplication will be
23:52 converting into this all multiplication of the low precision chunk.
23:57 And then if we want to compute it in temporal then
24:00 it takes 16 mlications to generate all these 16 partial sums
24:05 and followed by that we need to accumulate this part back
24:08 together and then this will go give us the results of C.
24:12 If you really stare on this diagram there's interesting thing for example
24:16 this A Z B 0 is contributing to the lowest two bite of the C.
24:22 For example, a 0 time B 0 is only contributing to the C 0 as well as C 01.
24:28 And further, if you stare on the second a little bit left,
24:32 a 01* B 0 and A 0* B 01,
24:37 these two are only contributing to the second least significant bite,
24:42 which is C 01 and C 02.
24:45 In other words, from the rightmost to the left most,
24:49 the basis is keeping increasing.
24:53 So with this insight, the state-of-the-art GPU library could schedule
24:58 this chunk multiplication into a sparse matrix
25:02 multiplication where each row of this results indicating a part with one basis.
25:10 For example, this A00 Z will contribute into the least significant basis chunk.
25:15 So they will be scheduled at the first
25:16 row and then the second rightmost rows are
25:21 contributing the second part which is have a higher
25:25 basis a bit higher than the first one.
25:27 So based on this insight people could re
25:30 repeatedly converting this left hand side clockwise mlication
25:34 into a sparse mission mlication and then all these parts
25:39 will be shifted and accumulate together back to C0.
25:44 But if you notice this structure the left side matrix is most likely zero.
25:50 It is actually a tropolis matrix of A and half of the values are zero.
25:56 That means we are performing redundant compute on half of the data.
26:02 So in to redundant to reduce this redundancy
26:05 we proposed in basis align transformation for a bat
26:09 that could eliminate this original sparse mission mlication
26:13 into a smaller and dense mission mlication showing on the right.
26:18 Now let's see the beauty of the math behind it.
26:23 The idea behind come from a simple observation block two in the matrix
26:28 mlication only contributing to the temporal part which I will show you here.
26:34 First these generated parts will be left
26:37 shifted and accumulated into a 64-bit results
26:41 and this 64-bit results will be later on modused
26:45 by Q and become a 32-bit final results.
26:50 This means the upper 32-bit part is only
26:53 temporary value or in other words the block two
26:58 in the life matrix contributes only to this temporary part
27:02 which will be eventually fed back to the block one.
27:06 So if they only contribute the temporary part
27:09 could we directly offload them into compile time?
27:13 That's what we did specifically.
27:16 Let's see how do we do by picking one pair of the data in the block two.
27:22 The overall result Z is equal to a summation of left
27:27 shift of all the parts and then model reduced by Q.
27:31 And we keep our notice attention on the P sum 6 here.
27:35 Specifically P sum 6 is equal to A3 times B3.
27:40 And we could further group A3 left shifted by 48 together.
27:44 And then we can also insert a model Q there.
27:48 Before we do this, A3 is 8 bit and 8 bit left 48 bit will become 56 bits.
27:56 And that is definitely exceeding the 32-bit target.
27:59 And after we bring this model reduction Q into it,
28:02 it can lowering this 56 bit higher precision scaler
28:06 down to 32-bit which we know denote as R.
28:10 And then we can chunk decompose this 32-bit R into different chunks.
28:16 This allow us to get rid of this A3 from block two and then
28:20 convert it back into block one by splitting different chunk of R back.
28:26 And this is effectively align the basis of the value in the block
28:31 two below 32bit and therefore we call it as basis align transformation.
28:39 We could further repeatedly apply bath to all elements
28:42 in block two to fill them back to block one offline.
28:46 And this will generate a smaller dense
28:49 mutant multiplication and giving 2x theoretical speed up.
28:54 And now the original ones pair of 32-bit scalar multiplication
29:00 could be lower into this 8 bit k by k
29:04 by time is k by one matrix vector multiplication
29:08 and this will generating us a result of k by one.
29:15 So now we unveil the analying mass beauty of how do
29:19 we converting a* b into the first a bit mission multiplication.
29:24 Let's now dab into the second pieces.
29:26 How do we further lowering or reducing
29:28 the mode reduction into pure vectorz addition.
29:34 Now we know we want to do a* b reduced by 2 into z.
29:38 And then the first step already performs a* b into c.
29:42 and we apply back to one operand offline.
29:46 So this converting this high precision scala
29:50 multiplication into a low precision matrix vector multiplication.
29:55 Our second step is to accumulating these results back to C and then
29:59 perform a model reduction to convert or lower it back to Z.
30:04 So let's starting with our eyes on the left hand side matrix.
30:08 When this is left hand side matrix,
30:10 every single element in both left hand side k by k matrix
30:14 a and k by vector b are a bit and the multiplication
30:19 of two a bit value will give us 16 bit value
30:23 and further this mission multiplication will have a reduction of four elements.
30:27 So an actual reduction of 416 value will give us a 18 bit
30:33 partial sums and these parts need to be accumulated by shifting and add
30:40 and to analyze this entire data as using 18 bits as underlying granularity
30:46 isn't that obvious isn't that direct
30:49 and therefore let's changing our data representation.
30:52 we could changing them into using eight bit
30:55 of one bite as under analying data representation.
30:58 So these two are exactly the same and we know all these temporary partner sums
31:03 need to be accumulated together and then model
31:06 reduced by Q finally into a 32-bit result.
31:10 As you can imagine here the upper 49 subac
31:14 by 32 which is 17 bits are also temporary.
31:18 That means we only need to reduce this upper
31:22 17 bit elements or further on to the input.
31:26 That means we only need to reduce
31:28 the upper two highest significant bite C4 and C5.
31:35 If we know we want to only reduce the highest two bite which
31:39 is C4 and C5 that means our lowest four bytes could be stay
31:45 intact and this give us the goal of modulus reduce every single piece
31:50 of data back to 32-bit and given this lower four by already within 32-bit.
31:56 So we just take them out and accumulate them by shift and addition.
32:02 And here I changing the shift of the eight
32:05 into the corresponding multiplication with the corresponding bases.
32:08 For example, can I ask you a question?
32:10 I mean isn't go ahead.
32:12 Wouldn't it be I mean when you when you leave the lower part intact
32:16 and then you add these together like
32:18 it could still overflow the your prime, right?
32:22 So so you you don't necessarily get like full reduction, right?
32:26 Exactly.
32:26 It is a lazy reduction and then
32:28 this lazy reduction will stay intact with this 32-bit.
32:30 So it will keeping uh always within the range
32:33 of 32-bit without exceeding that and then we could enable
32:37 this lead reduction throughout the entire program and only utilizing
32:40 the full reduction in the end for the final result.
32:43 Yeah, I I guess like a lot
32:45 of the FG algorithms actually use lazy reduction a lot
32:50 because it benefits it benefits throughout.
32:54 Yeah.
32:54 Can can't you get one additional bit when you do the addition
32:57 and go up to 33 and it's a small thing but
33:01 yeah you're right so we are restricting on the prime
33:04 value of the que to 28 bit and this will avoid
33:07 yeah having that one extra bit to overflow 32 bit
33:10 and then in fact we also try 29 bit as well
33:13 and then based on our statistic trying there is a randomness
33:17 within uh encryption so the actual value will not actually
33:20 exceeding three more bits usually it will just need three bits
33:23 so you also supporting 20 n prime as well in reality.
33:28 Got it.
33:28 Thank you.
33:29 Yeah.
33:29 Cool.
33:30 So now we know that we want to reduce the high precision value
33:34 down to 32-bit and then now we just leaving the underlying 32-bit intact.
33:38 We didn't change it and then we multiply them with a cross basis
33:41 and back there and this will give us the original lower 32-bit value.
33:45 So our attention will be how to focusing reducing this highest to bite.
33:50 First let's take C4 as example.
33:53 uh we first converting this 8 bit C4 into the binary view and then
33:58 from the top to bottom means uh we have the increase of the basis order.
34:02 So one 0 0 1 0 1 1 0 this will
34:06 be multiply with a crossing basis and then accumulate back.
34:10 So this one zero binary view of that will become this right
34:13 line and then the right hand side multiplication with the corresponding
34:16 basis and the accumulation this are values known ahead of time
34:22 and therefore we could processing this higher than 32bit value
34:26 offline and once we compute them offline in the runtime
34:31 or in the online what we need to compute it just
34:34 a log two bit value addition they become just a vectorzed
34:39 addition And we can apply the same trick to C5 as well.
34:44 We first express this 8 bit C5 value into a binary view shown here.
34:49 And then we preprocessing this basis offline to converting
34:53 every single basis offset down to log to Q bit.
34:59 And then in the runtime we only need
35:01 to accumulate this log two cubit value together.
35:05 And then we can merging all these tempers sum together to become the final Z.
35:10 And it is a lazy reduction.
35:12 In our current testing, if you have a queue less or equal than 28 bits,
35:18 it will always 100% guarantee you will never exceeding 20 32bit.
35:23 And in reality, you could also try on 29 bit Q as well because uh there
35:29 is a statistic reason that most of this value
35:32 will not exceeding for three bits more.
35:33 So we testing among all various of the machine model
35:37 it can showing that 29 base also works and this can to clarify.
35:43 So, so I mean this is not just using additions and multiplications like it's
35:47 really essential that you have these bit shifts on the on the TPU, right?
35:53 Oh, so on the left hand side this binary lower 32 bit
35:56 we have this bit shift on TPU and then it rely on vector
35:59 vector processing unit to do the uh the shifting and then do
36:02 the addition and then for the C4 and C5 we only utilizing this addition
36:10 the shift now by multiplying.
36:14 Yeah, I see.
36:14 I see.
36:15 Okay.
36:16 Yep.
36:17 Sounds good.
36:18 Cool.
36:19 Yeah.
36:19 So now the high precision modularized mlication
36:24 could be lower into only low precision mission
36:27 mlication as well as vector operation which
36:30 including the shift and as well as addition.
36:33 The key insight behind this transformation is we
36:36 can move the temporary results and runtime model reduction
36:40 to compile time because we just know this value
36:43 of prime as well as a parameter ahead of time.
36:47 So we don't need to bothering this competition to be online with this.
36:53 Let's see how could we using back to lower a realistic workload.
36:57 For example, high precision matrix mlication as shown here.
37:02 Now we have a Gren high precision matrix
37:05 multiplication showing on the top which left hand
37:08 side matrix has h by V size and this left hand side need to be multiply
37:12 with the right hand side matrix but with the shape of V by W and then
37:16 this will give us the results of H
37:17 by W matrix mlication and then this final result
37:19 will be model reduced by Q and within
37:22 this matrix multiplication every single element highlighting
37:26 in the light blue here is log cubit which is ranging from 28 to 59 And with bat
37:34 we can just do a imp placement
37:36 in place transformation to converting each high precision scaler
37:41 into a k by k low precision matrix and a k by one low precision vector.
37:48 And this will directly lowering the original high
37:51 precision matrix mlication into low precision matrix mlication.
37:55 And this will enable us to using this high throughput of matrix
37:59 mlication engine within the TPU
38:01 to accelerate this high precision matrix mlication.
38:06 And further now we completing all the compute challenges.
38:10 We no longer have comput challenges anymore but we still have memory challenges.
38:14 How do we got to deal with this data transpose
38:16 and data shuffling of tens of thousand degree long vector?
38:20 That will bring us to the second
38:22 contribution memory align transformation or short for MAT.
38:27 First let's tackle transpose.
38:29 Let's starting with this transpose of matrix multiplication where
38:32 we have a parameter P need to be multiplied
38:34 with a vector input to generate this temporal results
38:37 and this temporal results will be transposed into the final results.
38:41 Of course, here is explicit transpose overhead and our memory align
38:46 transformation could embed this data
38:49 transpose into the computational procedure itself.
38:53 So here is how it works.
38:55 Leveraging the linear algebra property,
38:58 we could transpose this parameter P and input and then we can offline transpose
39:04 the parameter P such that this computation
39:07 would directly generate the results as you expected.
39:11 This could eliminate the explicit memory transpose into compute only.
39:17 And further for the data shuffling, we could apply the same trick as well.
39:21 Let's starting with the example of shuffled multiplication
39:25 where we have a same gran parameter P need to be multiplied with the input
39:31 and this matrix vector multiplication will generating a vector
39:34 results and this vector results will be trans
39:38 shuffled into the final results and of course
39:40 introducing shuffling overhead and our memory align
39:44 transformation will also embed this shuffling into compute.
39:49 We could stay our eyes on the first row of white elements in the parameter P.
39:54 We know this row of white will be multiplied with the input generating the red
39:59 element and that red element is finally
40:02 shuffled to the third location of the result.
40:06 If we know this red element red elements is ting at the third
40:10 location in the final result then we can just shuffle the row
40:13 of the pure white into the third row offline such that this mlication
40:19 directly generates the results without paying any
40:22 shuffling and with this memory align transformation.
40:26 Let's also take an example to see how do we make the numbers sp transform
40:31 a layout invariant operator and all the audience
40:35 here are known about entity we will
40:36 skip this part it could transforming the polinomial
40:39 mlication from quadratic cost into linear cost
40:43 and for parallel computing device the most
40:46 popular algorithm of being used is called fourstep
40:50 entity within four step entity this input
40:53 vector will first being transformed into a two
40:55 dimensional matrix here I illustrating using 16
40:59 degree as an example which will be lowering
41:02 into a two dimensional matrix of 4x4 which using R and C2 indicating the row
41:06 and columns and this input matrix will
41:09 go through four steps as it name suggested
41:12 step one we take this matrix and perform entity of size R into every single
41:18 column second step we do a data transpose from the row major into column major
41:24 third step we do otherwise multiplication
41:27 and this will generating the results in the row
41:29 major in the column major sorry and then
41:32 for this column major matrix we're again
41:34 doing the C input entity to every single column now finishing this four step
41:39 we just do a big reverse of the results which give us the final output
41:44 and within there this entire ford you can see a explicit data transpose as well
41:49 as a explicit bit reverse so the way
41:53 that Matt optimizes is showing on the following.
41:56 We take the step one intact.
41:58 We didn't change step one.
42:00 And then we remove transpose by transpose
42:03 all the parameters in the following procedure.
42:06 For example, we got rid of the transpose such that the parameter of the atomized
42:11 multiplication will be transposed from the C by R into R by C
42:15 and then the second the third step
42:18 rowise multiplication will be swapped it and then
42:23 transpose on the Twitter factor and we
42:26 know the FT factor is the symmetric matrix.
42:28 So the transpose means do not do anything and then we just
42:33 have a layout environment without doing transpose from these first three steps.
42:39 And further to get rid of the bit reverse
42:41 we could fabricate or factor the bait reverse into row
42:48 wise permutation as well as columnwise permutation into the parameter
42:52 of both step one and step three shown here.
42:55 After we do this row wise permutation
42:58 to the matrix in step one and columnwise permutation
43:01 in the matrix of step three we can
43:03 directly generate the results in the targeting layout form.
43:07 And by doing these two transformations we get rid of all
43:11 the explicit memory transpose and make the entire entity layout invariant.
43:17 And in the end we need to compute
43:19 for entity just mission mlication multiplication as well as multiplication
43:26 putting all together let's see what performance of cross
43:30 that we could give you first we built a Python TPU library for CKS and open
43:36 source here deployed in Google and we using kernel
43:40 level workloads as benchmark showing on the horizontal access
43:43 and we compare that against the best FPGA Okay.
43:47 As well as the best GPU where every single
43:50 vertical means the normalized throughput where the best GPU's
43:54 performance is highlighting in this 1.0 line and every
43:58 single bar is indicating the throughput value of cross.
44:02 As you can see on the left hand side chart,
44:04 cross constantly achieving 9% to 47% higher throughput compared to the best GPU.
44:10 And on the right hand side compared
44:12 to the best FPGA we are achieving a two order
44:15 of magnitudes higher throughput that means cross turns
44:20 TPU into the state of art throughput H machine.
44:24 So we could do a quick summarization of the cross.
44:29 Cross turns TPU into civar H engine for H kernels
44:34 as well as H operators and they give you higher
44:38 throughput and higher performance per watt than other commodity hardware
44:42 and more importantly cross redefineses the capabilities of the AI6.
44:49 Previously people think about AIC could be only used
44:52 for AI training as well as inference but we proved
44:55 that this AIC has super strong mlication engine and vectoratics
45:01 and this capability could enable them to accelerate a more
45:04 wider range of application such as modularismatics and this project
45:10 is fully open source and deploying Google and it
45:13 also wins the second place Euro demo GT next
45:16 award startup creat X general lunch and we also going
45:19 to give a tutorial as plus with that we are
45:23 perfectly time to more to move on to the second
45:25 paper which is how do we enable AIC to accelerate
45:29 a more broader of the cryptography morph morph is
45:34 also a joint work with Georgia Tech Google and MIT
45:37 and a special thanks to the professors 3DS which
45:39 is supporting our work so this is targeting at application
45:44 of cryptocurrency as well as verifiable able in crypted computing.
45:48 First of all, why we need to care about
45:50 these two applications it because we have a super strong
45:54 or largest market in that the crypto market in the last
45:58 year is G is overall giving you $3.5 trillion.
46:03 It is a largest market in the world and even you
46:07 consider a small portion of layer 2 itself got 16 billion.
46:12 So more importantly this market is growing because of two trend.
46:17 First uh an administration policy such as Trump's admin
46:21 has officially established stable coin as official US dollars.
46:26 So second agent systems the emerging agent for doing crypto
46:32 transaction and AI could be helpful giving more throughput demand
46:36 from this entire crypto market and both these two growths
46:40 making the demand on the compute showing on the right
46:43 curve on the right hand side figure and but to supporting
46:47 this demand on the right curve what we
46:50 have today from Harvard is just showing on the dash
46:52 line and you can see a clearly 10x triple gap
46:57 That's why we build morph because we simply need more
47:00 computing power and morph focusing on utilizing this new category
47:05 of the powerful hardware beast AISC to accelerate zero knowledge
47:09 proof and what do we mean by zero knowledge proof
47:13 here and for the expert here we could just
47:15 make a very succinct introduction in zero knowledge proof short
47:20 for ZP a prover need to convince a verifier
47:22 that a statement is true but without revealing any other underlying information.
47:28 Considering a scenario where a prover want to transfer
47:32 all this verifier a trillion dollars and of course
47:35 the prover doesn't want to leak how much account
47:37 balance does they have after they make this transaction.
47:40 So what it could do is it can
47:41 generate a zero proof which cryptographically protect the statement
47:46 and the verifier could only check whether it's
47:48 true or false but without knowing anything beyond that.
47:51 This zero knowledge proof is powerful but and checking this is really fast.
47:55 It can happen in millisecond even with CPU
47:58 but the proof generation could be very slow.
48:00 It take over as many minutes as even days.
48:03 So this proof generation limits the overall
48:06 system circuit that become our acceleration target.
48:10 And the reason why it is slow is coming from a high precision requirement.
48:16 Let's compare that against H.
48:19 We know in HD we have 28 to 59 bit but that will become
48:23 256 to 700 bits in zero knowledge proof which introducing a 10x higher demand
48:30 and further this higher precision is working on the prime field that means
48:36 the modus here throwing on the queue is a prime value which cannot be factored.
48:42 So it invalidate the usage of residual number systems.
48:47 With all this residual number system RS the high precision multiplication
48:52 could be have to be lower into 16 bit chunk in order
48:57 to uh tune with the hardware machine word and this high
49:01 precision multiplication will become a all
49:03 multiplication of these low precision chunks.
49:06 If you have L different chunks then in total you will pay
49:10 L square number of mlication and give you a quadratic computition overhead.
49:15 In other words a 10x increase in the precision
49:18 will introduce a 100x more computation demand.
49:22 That's why make it very slower and therefore we make
49:26 a solution to adding a extended non-prime field showing in red.
49:32 Let's first encapsulate the original prime field
49:35 showing on the f subq highlighted blue here.
49:39 And what we propose is a nonprime field f subm
49:43 where the modus m here is nonprime value which could
49:46 be factored and this enable enable us to rep the entire
49:51 computation happens in the original f subq into f subm.
49:56 For example, shown here the value of a* b will generating a result
50:00 of a and all of these will be within the range of m.
50:05 Specifically, we choose value of M to be larger than Q^ square such
50:10 that any multiplication will not exceeding value
50:13 of M and that enable us to perform
50:16 this computation in the residual number system of F subm where we will have two
50:22 L number of the chunks and this will
50:24 lowering the quadratic competition cost into linear.
50:29 So further we can converting the results of AB back to F
50:34 subq and then perform the model reduction there and then convert it
50:37 back to F subm for the following computation and this entire comp
50:42 transformation could be scheduled as a low precision mislication and that could
50:48 be le be accelerated by the machine modification available in the AI6
50:53 and put this together we also con constructed an open source
50:57 library which is for TPU to enable CKP primitive such as multiscalar
51:02 mlication as well as a DT and compare with cross which
51:06 we achieve up to 100x speed up and the high precision
51:11 crypto primitive such as MSM and NDT and both makes TPU also
51:17 the state-of-the-art super machine for the CKP as well and putting
51:22 these together in summary I'm very thrilled today to introducing two line
51:27 of the compilation work cross and morph and both could turn TPU
51:31 as state-of-the-art throughput machine for both
51:34 homorphic encryption as well as zero knowledge proofs and this positions AI
51:39 ethic as a state-of-the-art throughput machine
51:41 for both low precision crypto primitives
51:44 as well as high precision crypto primitives.
51:47 So with that I'm happy to answer any questions and as a next
51:51 step I also want to prefetch we want to enable
51:54 the cross kernel API into a anttoan application as well
51:59 as build entire morph excited to support a fullto protocol and we
52:04 are uh open for collabision on how do we design a harle
52:07 protocol or how do we align uh the morph and cross
52:10 to support realistic application from industry such as the prestigious Microsoft
52:15 research yeah with that I'm happy to also had a question.
52:20 Thank you young men for the nice presentation.
52:24 Thank you.