CROSS — Leveraging AI ASICs for Homomorphic Encryption

Microsoft Research

0:03 Welcome everybody.

0:04 It is my pleasure to introduce today Yang Ming Tong.

0:08 He is a PhD student at Georgia Tech,

0:11 a computer architect that focuses on uh the intersection of AI and cryptography.

0:16 And today he's going to present the talk

0:18 titled cross leveraging AI A6 for homorphic encryption.

0:23 Yangming the floor is yours.

0:26 Thank you so much Marsher.

0:27 And then today I'm super thrilled to in front

0:30 of the cryptographic prestigious experts in the field and I got

0:34 to introducing two line of my work cross and a gift

0:38 morph which is recently accepted at the automation conference this year.

0:42 So before we get started let me briefly introduce who I

0:45 am and what I work on and specifically my research overview.

0:48 I'm a computer architect and I'm focusing

0:50 on enabling AI systems for cryptography and I'm

0:54 currently a fifth year PhD at computer science

0:55 of Georgia Tech and largely advised by Dr.

0:58 Tucha Krishna and I'm mainly known for developing two projects and one is cross

1:03 and what I got to talk about today the other one is a reconfigurable AI systems

1:08 that make AI to reconfigure itself to be

1:11 a dedicated ASIC for different workloads and I

1:13 got to give tutorial for both of them in the upcoming plus 2026 at peace.

1:19 So if anybody come by feel free to say hi to me

1:21 and we will invite you guys to there so that we will have

1:24 a great audience in elevating the overall level and this work give me

1:29 the machine learning and mel common ring star and a couple other source.

1:33 So overall my result could be captured by this single

1:36 slice and design systems for AI and cryptography.

1:40 Today AI systems take data and model in the raw format.

1:45 So we have this particular work number one workload

1:49 and this problem cannot support private data like genome

1:53 or close source private model like charge because

1:56 this na raw format of the data will leakage everything

2:00 and to protect the privacy private data and model

2:03 could be encrypted and then go through encrypted computation

2:07 and this is enabled by homorph encryption and to support

2:11 both raw computation as well as encrypted computation I help build

2:16 a systemuler called sushi and a compilation framework called

2:19 cross and both colling these two line of workloads

2:24 into unified size of linear algebra and the linear

2:28 algebra will including mid multiplication

2:31 vector arithmetic and memory reorganization

2:34 and this unify of linear algebra could directly run

2:37 on today's AI system such as Google's TPU Amazon Freom

2:42 and Avidia's GPU as well as Microsoft AMIA And to study

2:47 the performance of this AI system running on this workload,

2:50 we built a simulator called skill sim.

2:52 And within the skill sim,

2:54 we could model the architecture of different today's AI system power.

2:58 And then it reveals a key insight.

3:01 That is when the workload has various ship.

3:04 For example, maybe mobileation call with different ships

3:07 and they might underutilize the available capation in the given

3:12 AI hovers and therefore to make it more

3:15 performant for wide range of workloads of different ships.

3:19 I designed two lines of reconfigurable AI systems

3:23 and these reconfigurable AI systems could reconfigure their data

3:27 paths as well as memory layout for each individual

3:31 workload and this could enable them to achieve full

3:34 compute utilization and the workload of all shapes

3:38 and further to enable this automation compilation I built

3:41 a layout loop framework and layout loop could search

3:45 the optimal data flow and layout for individual workload.

3:48 Among this work, Sushi has won

3:52 the Qualcomm innovation fellowship and cross has deployed

3:55 in Google and won the second place

3:57 in university demo at design automation conference last year.

4:01 It also wants a GT next award and startup launch.

4:05 Further layout loop has been integrated into Nvidia MVL

4:08 labs and the fact is a collaboration work with IBM.

4:13 So let's first with this as over mind

4:16 and go through this overview today and today I got to cover how do we

4:22 using existing AI systems to accelerate cryptographic applications.

4:27 So we start with AI model because AI model

4:30 is a most common work in today's data center.

4:33 So we using their competition demand as unit and using their latency as unit.

4:39 So we will have three lines of cryptographic application.

4:42 First privacy presuming AI which use homomorphic encryption to encrypt both

4:48 data as well as the model to provide holistic privacy guarantee.

4:53 Second cryptocurrency.

4:55 It relies on zero knowledge proof

4:57 primitives to offer both encryption and verification.

5:01 And third applying zero knowledge proof encrypted AI

5:04 computing will give us a verifiable encrypted competition.

5:09 All these applications will increase

5:11 the computational demand and latency exponentially.

5:15 And in this talk I will through very demonstrate you two line of work

5:21 cross and morph which could lower

5:23 this exponentially increased latency to roughly super linear.

5:28 Let's start with cross first and cross

5:31 is a joint work with Georgia Tech, Google,

5:34 MIT and Canal and we want to give

5:37 a special credits and appreciation to the professor

5:40 Arvin who have been able to see this work published before uh he passed away.

5:45 So but we are super appreciate and uh appreciate for his support and insights.

5:52 So this cross is a number one work which enables AI as for homorphic encryption.

5:59 They give you immediate serving of the homop

6:02 encryption in real time under Google's TPU.

6:05 So what do we do?

6:06 Let's starting with the background.

6:09 The reason why we starting with this new generation of H

6:12 accelerated on today's A system come from AI's new industry revolution.

6:18 Specifically, artificial intelligence is driving a new industry revolution.

6:22 It could transform chatbot, robotics, autonomous vehicle, AI coder,

6:27 biology, protein discovery, and AI factory all into digital tokens.

6:33 In other world, the entire world is

6:36 being digitalized and tokenized for AI assistance.

6:40 However, such revolution has been blocked by the privacy concern.

6:45 For example, Samsung has bunded chatbt among

6:48 employees after they see the code leakage.

6:51 Grock accidentally released the users conversation through a public

6:56 search tool and this is three years back we

7:00 delayed the AI features just because we want

7:02 to keep the user data on device for review later.

7:06 So these ones protect the privacy into a power bound requirement before

7:13 we release the AI feature and that gave us the following question.

7:17 How to protect the privacy of both model and user data.

7:21 To answer this question we need

7:23 to understand where this privacy concern come from.

7:26 First from model side today's most advanced model

7:30 such as charge uh orace cloud and also perplexity

7:35 these model are proprietary and their training data as well

7:39 as their data analytic tools are all clone source.

7:42 These closed source services cannot share all of their own boundary.

7:47 This creating the right hand side model privacy boundary shown in red.

7:52 On the data side, we see the data from healthcare,

7:54 finance or government are also private data and these data never able to leave

7:59 their agency as well and this creating

8:01 the left hand side orange data privacy boundary.

8:04 Such two sides boundary creates the significant privacy gap to use the right

8:08 hand side private model to serve the left hand side private data.

8:12 And as we all know here home of encryption could

8:15 protect the privacy by enable directly competition on the encrypted data.

8:19 So we could encrypt your model,

8:21 set it to the untrusted infrastructure and encrypt your private input locally

8:25 within the data privacy boundary and only send out the encrypted cipher text.

8:29 So that within this untrusted infrastructure we could using this encrypted model

8:33 to serve this encrypted input to generating encrypted results and these results

8:38 will come back to the user and only the user who

8:40 has a secret key could decrypt it and get a meaningful value.

8:44 This is great but because it secure everything but it is very slow

8:48 and we all know the sec the slowness come from both compute and memory overhead.

8:54 If we take the rest 18x as example

8:57 encryption will boost it by 8,000 time and compute

8:59 by 800 times and this combined will lead

9:02 to a 10,000x slower latency and that make it impractical.

9:09 As past few years entire community has focusing

9:13 on designing new hardware and compilation tool chains to reduce

9:17 this growing complexity and with few examples show

9:20 in the middle but building this new chips and new compilation

9:24 flow will take years of time and take

9:26 a lot of money and that's why we didn't see

9:29 available product for people to use today and therefore

9:33 at this project we want to enable immediate H acceleration

9:39 So we're using existing AI A6 such as Google's

9:43 tensor processing unit to accelerate home encryption instead of building

9:47 new hardware and we build a transparent means we

9:52 converting the encrypted data into kernels that run very

9:56 efficiently on today's AIC which we will cover

9:59 in this talk and it could lowering the years of time

10:03 into immediate and then saves you million of dollars

10:06 because we don't need to fabricate any new hardware.

10:10 Now it comes to a fundamental question.

10:12 Why we use AI6 such as Google's TPU instead of other commodity hours like GPU?

10:19 To answer that question,

10:20 let's deep dive into this TPU to see why it has in the high level.

10:26 It give you better performance per watt.

10:31 We categorize various chip shown on the right figure where the horizontal

10:36 axis indicating the power where the vertical axis indicated the throughput.

10:41 So in this right hand side throughput power plot the diagonal

10:46 line means throughput divided by power or so called power

10:50 per watt or energy efficiency and upper left means better

10:56 and further we categorize each chip based on their technology nodes.

11:01 For example, the light blue indicating the 7

11:04 nanometer and the light yellow means 456

11:08 nanometer and green means the advanced technology nodes

11:11 such as 3 nometer or even below it.

11:14 As you can see, AI A6 always give better performance

11:18 per watt than other commodity hours like GPUs and FPGAs

11:23 and this enables more chip to be fitting within the data

11:27 center and they give you higher throughput as with this strong motivation.

11:34 Let's now deep dive into what a AI6 such as Google TPU look like.

11:41 In the high level tensor processing unit from Google TPU give you three

11:46 features significant memory significant model compute

11:50 as well as wide range of the reordering.

11:54 Let's start with the memory.

11:57 TPU has massive memory.

11:59 First different generation of TPU has hundreds of gigabytes

12:03 of HPM and hundred of megabytes last level cache.

12:09 Of course go ahead.

12:10 So um are like you can't buy these TPUs right

12:14 these are totally uh closed by Google isn't that right

12:20 now uh before two years ago that is the truth and now people

12:24 has officially enabled this leader AI company to buy it for example anthropic

12:28 has signed a deal of 1 billion track contract with Google to using

12:32 their TPU and then set up the TPU cloud in their own infrastructure

12:37 oh wow uh okay so that's That's how you know like what the architecture is.

12:42 Oh, I see.

12:43 So the this architecture information is public.

12:47 Okay.

12:47 Huh.

12:48 Interesting.

12:49 Do do you know why they make it public?

12:51 Like why would it be?

12:54 Uh I think there are a couple of the papers.

12:56 So this is a summarize of the knowledge that got captured from all the public

13:00 uh paper from Google and you know I was interning at Google as well.

13:03 So I do know some internal information and then it turns out

13:06 to showing that this public information has

13:08 the same information as internal as well.

13:10 So it

13:11 that's why we're creating this slide to uh

13:13 serve as encapsulate of all generation of TPU.

13:16 Okay, cool.

13:17 Thanks.

13:18 Of course, there's a new features from TPU

13:19 as well which I didn't cover but you know these are uh maybe out of the scope

13:23 for today's talk because we didn't utilize these new features.

13:27 Cool.

13:27 Yeah.

13:30 Cool.

13:30 So this hundreds of megabytes of last level cache

13:33 will be shared by one or two tensor cores.

13:37 And therefore we call this last level cache as common memory or CMA.

13:42 And within each tensor core we have tenth of megabytes of local scratch pad.

13:48 And this t of megabytes of local scratchpad

13:51 is organized into a vector of 128 lengths.

13:55 And so we call them a vectorized memory or we man.

14:00 And within each lane there are eight saplings and each

14:04 sapling has a local 32 instance of 32bit register.

14:10 And this give us a,024 local register files for each tensor core

14:16 and this give us in total hund of the kilobytes k local registers.

14:22 So putting all the memory in summary,

14:25 TPU of various different generations give you

14:28 around 800 GB of offchip HBM as well

14:32 as a hundreds of G megabytes of onchip memories

14:36 and further TPU also offers massive compute and each

14:42 subling has two local ALUS and in total each tensor core will have 284 AUS

14:50 and these ALUS are designed to process vectorized

14:54 operations or whacked up as shown on the right.

14:58 And so we call this a bunch of alou together as WPU.

15:03 And all these 2024 48 alus in the WPU work in the lock step.

15:09 That means whenever you start launching some workload into the chip,

15:13 you have to utilize at least one ALOU from all lens and all subs

15:19 together or in other words every single instruction have to trigger a cmd of 128

15:27 by 8 alus and this creating the granularity that we have to leverage

15:33 to uh from the TPU to run on the realistic workload and this WPU itself

15:38 offer plus one to 10 32bit integer terovs and further each tensor core also

15:47 contains four dedicated compute engines for matrix

15:50 mlication and we call them as MXQu

15:54 each MXU is a 128x 128 sysol array which is designed to process large

16:01 missions mlication and this give us a 100

16:05 to a,000 8bit terops for matrix multiplication.

16:11 So if you looking at the top props

16:13 difference from the matrix multiplication engine and vectormatic engine,

16:17 you could observe one thing the throughput for low

16:20 precision matrix mlication engine is about 100x higher than

16:25 the high precision vector arithmetics and therefore we do want

16:29 to use MXU as much as possible for higher performance.

16:35 And lastly, TPU also have very abundant bunwise for doing flexible reordering.

16:43 Specifically, it has a dedicated unit called cross lane unit

16:47 showing on the right and which is also noted as SLU.

16:51 And this unit could take the data from a vector

16:54 of the VM and then reorder them and send it back.

16:59 In other words, this could do data reorganization

17:03 at the granularity of VR rag level and a reg

17:08 here means all single register file from all

17:12 the lanes together with all sublanes have to be issued

17:15 in the same manner as well and that's why

17:17 we group one register file from all sublane together

17:21 into a VR that means every single data reorganization have

17:26 to perform at the granularity of 128 by8 32-bit registers.

17:33 So putting all together TPU has massive memory seal

17:38 of compute as well as abundant B wise and these features

17:43 enable super fast primitives such as 32-bit vectorzed addition 32-bit

17:48 vectorzed multiplication 8bit mid mlication as well as 32-bit transpose.

17:55 But the question is how do we use them for H?

17:58 In order to understand this question we

18:00 have to go through understanding the background

18:02 of that and then given the audience here are super expert on H.

18:05 So we will make it very short.

18:08 First H will require a vector of floating point

18:12 value if we using for CKS scheme and this 32,000

18:17 long vector data will be first encoded into a massive

18:21 polinomial with tens of thousand degree which we

18:24 call as plantex and the degree of this is

18:27 twice as the input one length of the vector

18:31 and further this plantex is then encrypted into a pair

18:35 polomial of the same degree which call as suffer tax.

18:39 And now let's look inside a sephertex.

18:42 Every coefficients in this suffer tax is a integer in the finite field.

18:47 That means there is a limited range of the valid integer choice for you to pick.

18:52 For example, if we have a modulus Q equal to 14,

18:56 then we could only take integers from 0 to three uh 0 to 13.

19:01 If a calculation results in a value that overflow

19:04 this limit for example 12+ 4 which exceeding the modus value

19:08 of 14 we need to call on a model reduction

19:11 to break it back within the range and further this restriction does

19:16 not stop at a coefficient level and each polinomial themselves

19:21 is living in a ring and just like coefficients has

19:24 a limit the polinomial has also a limit number of the degrees

19:29 and that degree value is usually a power of two.

19:32 So like here we show you a n equal to 16 as an example.

19:38 If polinomial is a ring that means when the polinomial

19:42 exceeding their hyper limit of the degree value such as 16

19:45 here we need to call on polinomial reduction to reduce

19:49 it by the polinomial modus to bring it within the range.

19:53 In short, H will bring both integer and polomial

19:57 modular reduction to both vectorized operations as well as matrix

20:02 multiplications and further even with RNS H also lifts

20:07 the coefficients to 28 to 59 bit finite few integer.

20:12 If we put the workload and compute together

20:15 then there is clearly compute and memory gaps.

20:19 First mod reduction.

20:21 H require model reduction but the AI A6 do not have it.

20:26 Second insufficient precision.

20:29 Even with RNS H will still require 28 to 59 bit integer

20:34 but AI6 only support lower precision such

20:38 as 32-bit vectorzed arithmetic and 8bit mlication.

20:44 And on the memory part, H will require a data transpose or a data shuffling.

20:50 Imagine that we got to do transpose or shuffling

20:52 of a tens of thousand degree long vector.

20:56 Putting the memory recognization at such a big scale is

20:59 very expensive and therefore we want to get rid of them.

21:03 So in cross we propose two compilation techniques.

21:07 business align transformation to bridge the compute gap

21:11 and memory align transformation to bridge the memory gap.

21:15 The insight here is homop encryption has pre-nown parameter such as the twitter

21:20 factor entity and fixed computation graph and this allow us to partially

21:27 offload the h workloads into the compile time and then we can

21:32 transform the remaining computation into the kernel that a are very good at.

21:40 So let's first talk about how basis line transformation bridge

21:43 the compute gap and we will short the basis align transformation

21:47 as bat given this 32bit vectorzed model multiplication a* b

21:54 model reduced by q into z where we show each individual

21:58 value here as 32 bit for example a b q

22:01 are all 32 bit as an example and in realistic cross

22:05 could tackle arbitary precisions and the basis alignment could lower

22:11 this 32-bit vectorzed modular reduction based

22:15 mlication into two sides of kernels.

22:18 First, low precision mission mlication and second 32bit vector addition.

22:26 There is no model reduction needed at all.

22:29 And within these two kernels, the first mission multiplication performs a* b

22:34 and the second vector addition performs model reduction.

22:39 Let's first dive into how a* b is

22:41 converting into the first a bit machine multiplication.

22:46 We can take a pair from this 32-bit vector model multiplication.

22:51 For example, we take the first value a z have b 0 into c 0 and each of them

22:57 uh a a z and b 0 are 32 bit and the result c will be 64 bit.

23:03 So our goal is to converting this high

23:05 precision scal mlication into low precision metric multiplication.

23:10 Each high precision value want to be computed in the low precision matter.

23:16 So we could just nally do a chunk decomposition.

23:19 For example, a 32bit input a z could be break into four a bit

23:24 chunks as shown here where we're using

23:27 the superscript to indicating the chunk indices.

23:31 For example, a z upper three means the most significant bite.

23:36 Well, the a z upper zero means the lowest or least significant bite.

23:41 And we could apply the same decomposition for b 0 and c 0 as well.

23:46 If we do this chunk decomposition,

23:48 the original 32bit high precision scalar multiplication will be

23:52 converting into this all multiplication of the low precision chunk.

23:57 And then if we want to compute it in temporal then

24:00 it takes 16 mlications to generate all these 16 partial sums

24:05 and followed by that we need to accumulate this part back

24:08 together and then this will go give us the results of C.

24:12 If you really stare on this diagram there's interesting thing for example

24:16 this A Z B 0 is contributing to the lowest two bite of the C.

24:22 For example, a 0 time B 0 is only contributing to the C 0 as well as C 01.

24:28 And further, if you stare on the second a little bit left,

24:32 a 01* B 0 and A 0* B 01,

24:37 these two are only contributing to the second least significant bite,

24:42 which is C 01 and C 02.

24:45 In other words, from the rightmost to the left most,

24:49 the basis is keeping increasing.

24:53 So with this insight, the state-of-the-art GPU library could schedule

24:58 this chunk multiplication into a sparse matrix

25:02 multiplication where each row of this results indicating a part with one basis.

25:10 For example, this A00 Z will contribute into the least significant basis chunk.

25:15 So they will be scheduled at the first

25:16 row and then the second rightmost rows are

25:21 contributing the second part which is have a higher

25:25 basis a bit higher than the first one.

25:27 So based on this insight people could re

25:30 repeatedly converting this left hand side clockwise mlication

25:34 into a sparse mission mlication and then all these parts

25:39 will be shifted and accumulate together back to C0.

25:44 But if you notice this structure the left side matrix is most likely zero.

25:50 It is actually a tropolis matrix of A and half of the values are zero.

25:56 That means we are performing redundant compute on half of the data.

26:02 So in to redundant to reduce this redundancy

26:05 we proposed in basis align transformation for a bat

26:09 that could eliminate this original sparse mission mlication

26:13 into a smaller and dense mission mlication showing on the right.

26:18 Now let's see the beauty of the math behind it.

26:23 The idea behind come from a simple observation block two in the matrix

26:28 mlication only contributing to the temporal part which I will show you here.

26:34 First these generated parts will be left

26:37 shifted and accumulated into a 64-bit results

26:41 and this 64-bit results will be later on modused

26:45 by Q and become a 32-bit final results.

26:50 This means the upper 32-bit part is only

26:53 temporary value or in other words the block two

26:58 in the life matrix contributes only to this temporary part

27:02 which will be eventually fed back to the block one.

27:06 So if they only contribute the temporary part

27:09 could we directly offload them into compile time?

27:13 That's what we did specifically.

27:16 Let's see how do we do by picking one pair of the data in the block two.

27:22 The overall result Z is equal to a summation of left

27:27 shift of all the parts and then model reduced by Q.

27:31 And we keep our notice attention on the P sum 6 here.

27:35 Specifically P sum 6 is equal to A3 times B3.

27:40 And we could further group A3 left shifted by 48 together.

27:44 And then we can also insert a model Q there.

27:48 Before we do this, A3 is 8 bit and 8 bit left 48 bit will become 56 bits.

27:56 And that is definitely exceeding the 32-bit target.

27:59 And after we bring this model reduction Q into it,

28:02 it can lowering this 56 bit higher precision scaler

28:06 down to 32-bit which we know denote as R.

28:10 And then we can chunk decompose this 32-bit R into different chunks.

28:16 This allow us to get rid of this A3 from block two and then

28:20 convert it back into block one by splitting different chunk of R back.

28:26 And this is effectively align the basis of the value in the block

28:31 two below 32bit and therefore we call it as basis align transformation.

28:39 We could further repeatedly apply bath to all elements

28:42 in block two to fill them back to block one offline.

28:46 And this will generate a smaller dense

28:49 mutant multiplication and giving 2x theoretical speed up.

28:54 And now the original ones pair of 32-bit scalar multiplication

29:00 could be lower into this 8 bit k by k

29:04 by time is k by one matrix vector multiplication

29:08 and this will generating us a result of k by one.

29:15 So now we unveil the analying mass beauty of how do

29:19 we converting a* b into the first a bit mission multiplication.

29:24 Let's now dab into the second pieces.

29:26 How do we further lowering or reducing

29:28 the mode reduction into pure vectorz addition.

29:34 Now we know we want to do a* b reduced by 2 into z.

29:38 And then the first step already performs a* b into c.

29:42 and we apply back to one operand offline.

29:46 So this converting this high precision scala

29:50 multiplication into a low precision matrix vector multiplication.

29:55 Our second step is to accumulating these results back to C and then

29:59 perform a model reduction to convert or lower it back to Z.

30:04 So let's starting with our eyes on the left hand side matrix.

30:08 When this is left hand side matrix,

30:10 every single element in both left hand side k by k matrix

30:14 a and k by vector b are a bit and the multiplication

30:19 of two a bit value will give us 16 bit value

30:23 and further this mission multiplication will have a reduction of four elements.

30:27 So an actual reduction of 416 value will give us a 18 bit

30:33 partial sums and these parts need to be accumulated by shifting and add

30:40 and to analyze this entire data as using 18 bits as underlying granularity

30:46 isn't that obvious isn't that direct

30:49 and therefore let's changing our data representation.

30:52 we could changing them into using eight bit

30:55 of one bite as under analying data representation.

30:58 So these two are exactly the same and we know all these temporary partner sums

31:03 need to be accumulated together and then model

31:06 reduced by Q finally into a 32-bit result.

31:10 As you can imagine here the upper 49 subac

31:14 by 32 which is 17 bits are also temporary.

31:18 That means we only need to reduce this upper

31:22 17 bit elements or further on to the input.

31:26 That means we only need to reduce

31:28 the upper two highest significant bite C4 and C5.

31:35 If we know we want to only reduce the highest two bite which

31:39 is C4 and C5 that means our lowest four bytes could be stay

31:45 intact and this give us the goal of modulus reduce every single piece

31:50 of data back to 32-bit and given this lower four by already within 32-bit.

31:56 So we just take them out and accumulate them by shift and addition.

32:02 And here I changing the shift of the eight

32:05 into the corresponding multiplication with the corresponding bases.

32:08 For example, can I ask you a question?

32:10 I mean isn't go ahead.

32:12 Wouldn't it be I mean when you when you leave the lower part intact

32:16 and then you add these together like

32:18 it could still overflow the your prime, right?

32:22 So so you you don't necessarily get like full reduction, right?

32:26 Exactly.

32:26 It is a lazy reduction and then

32:28 this lazy reduction will stay intact with this 32-bit.

32:30 So it will keeping uh always within the range

32:33 of 32-bit without exceeding that and then we could enable

32:37 this lead reduction throughout the entire program and only utilizing

32:40 the full reduction in the end for the final result.

32:43 Yeah, I I guess like a lot

32:45 of the FG algorithms actually use lazy reduction a lot

32:50 because it benefits it benefits throughout.

32:54 Yeah.

32:54 Can can't you get one additional bit when you do the addition

32:57 and go up to 33 and it's a small thing but

33:01 yeah you're right so we are restricting on the prime

33:04 value of the que to 28 bit and this will avoid

33:07 yeah having that one extra bit to overflow 32 bit

33:10 and then in fact we also try 29 bit as well

33:13 and then based on our statistic trying there is a randomness

33:17 within uh encryption so the actual value will not actually

33:20 exceeding three more bits usually it will just need three bits

33:23 so you also supporting 20 n prime as well in reality.

33:28 Got it.

33:28 Thank you.

33:29 Yeah.

33:29 Cool.

33:30 So now we know that we want to reduce the high precision value

33:34 down to 32-bit and then now we just leaving the underlying 32-bit intact.

33:38 We didn't change it and then we multiply them with a cross basis

33:41 and back there and this will give us the original lower 32-bit value.

33:45 So our attention will be how to focusing reducing this highest to bite.

33:50 First let's take C4 as example.

33:53 uh we first converting this 8 bit C4 into the binary view and then

33:58 from the top to bottom means uh we have the increase of the basis order.

34:02 So one 0 0 1 0 1 1 0 this will

34:06 be multiply with a crossing basis and then accumulate back.

34:10 So this one zero binary view of that will become this right

34:13 line and then the right hand side multiplication with the corresponding

34:16 basis and the accumulation this are values known ahead of time

34:22 and therefore we could processing this higher than 32bit value

34:26 offline and once we compute them offline in the runtime

34:31 or in the online what we need to compute it just

34:34 a log two bit value addition they become just a vectorzed

34:39 addition And we can apply the same trick to C5 as well.

34:44 We first express this 8 bit C5 value into a binary view shown here.

34:49 And then we preprocessing this basis offline to converting

34:53 every single basis offset down to log to Q bit.

34:59 And then in the runtime we only need

35:01 to accumulate this log two cubit value together.

35:05 And then we can merging all these tempers sum together to become the final Z.

35:10 And it is a lazy reduction.

35:12 In our current testing, if you have a queue less or equal than 28 bits,

35:18 it will always 100% guarantee you will never exceeding 20 32bit.

35:23 And in reality, you could also try on 29 bit Q as well because uh there

35:29 is a statistic reason that most of this value

35:32 will not exceeding for three bits more.

35:33 So we testing among all various of the machine model

35:37 it can showing that 29 base also works and this can to clarify.

35:43 So, so I mean this is not just using additions and multiplications like it's

35:47 really essential that you have these bit shifts on the on the TPU, right?

35:53 Oh, so on the left hand side this binary lower 32 bit

35:56 we have this bit shift on TPU and then it rely on vector

35:59 vector processing unit to do the uh the shifting and then do

36:02 the addition and then for the C4 and C5 we only utilizing this addition

36:10 the shift now by multiplying.

36:14 Yeah, I see.

36:14 I see.

36:15 Okay.

36:16 Yep.

36:17 Sounds good.

36:18 Cool.

36:19 Yeah.

36:19 So now the high precision modularized mlication

36:24 could be lower into only low precision mission

36:27 mlication as well as vector operation which

36:30 including the shift and as well as addition.

36:33 The key insight behind this transformation is we

36:36 can move the temporary results and runtime model reduction

36:40 to compile time because we just know this value

36:43 of prime as well as a parameter ahead of time.

36:47 So we don't need to bothering this competition to be online with this.

36:53 Let's see how could we using back to lower a realistic workload.

36:57 For example, high precision matrix mlication as shown here.

37:02 Now we have a Gren high precision matrix

37:05 multiplication showing on the top which left hand

37:08 side matrix has h by V size and this left hand side need to be multiply

37:12 with the right hand side matrix but with the shape of V by W and then

37:16 this will give us the results of H

37:17 by W matrix mlication and then this final result

37:19 will be model reduced by Q and within

37:22 this matrix multiplication every single element highlighting

37:26 in the light blue here is log cubit which is ranging from 28 to 59 And with bat

37:34 we can just do a imp placement

37:36 in place transformation to converting each high precision scaler

37:41 into a k by k low precision matrix and a k by one low precision vector.

37:48 And this will directly lowering the original high

37:51 precision matrix mlication into low precision matrix mlication.

37:55 And this will enable us to using this high throughput of matrix

37:59 mlication engine within the TPU

38:01 to accelerate this high precision matrix mlication.

38:06 And further now we completing all the compute challenges.

38:10 We no longer have comput challenges anymore but we still have memory challenges.

38:14 How do we got to deal with this data transpose

38:16 and data shuffling of tens of thousand degree long vector?

38:20 That will bring us to the second

38:22 contribution memory align transformation or short for MAT.

38:27 First let's tackle transpose.

38:29 Let's starting with this transpose of matrix multiplication where

38:32 we have a parameter P need to be multiplied

38:34 with a vector input to generate this temporal results

38:37 and this temporal results will be transposed into the final results.

38:41 Of course, here is explicit transpose overhead and our memory align

38:46 transformation could embed this data

38:49 transpose into the computational procedure itself.

38:53 So here is how it works.

38:55 Leveraging the linear algebra property,

38:58 we could transpose this parameter P and input and then we can offline transpose

39:04 the parameter P such that this computation

39:07 would directly generate the results as you expected.

39:11 This could eliminate the explicit memory transpose into compute only.

39:17 And further for the data shuffling, we could apply the same trick as well.

39:21 Let's starting with the example of shuffled multiplication

39:25 where we have a same gran parameter P need to be multiplied with the input

39:31 and this matrix vector multiplication will generating a vector

39:34 results and this vector results will be trans

39:38 shuffled into the final results and of course

39:40 introducing shuffling overhead and our memory align

39:44 transformation will also embed this shuffling into compute.

39:49 We could stay our eyes on the first row of white elements in the parameter P.

39:54 We know this row of white will be multiplied with the input generating the red

39:59 element and that red element is finally

40:02 shuffled to the third location of the result.

40:06 If we know this red element red elements is ting at the third

40:10 location in the final result then we can just shuffle the row

40:13 of the pure white into the third row offline such that this mlication

40:19 directly generates the results without paying any

40:22 shuffling and with this memory align transformation.

40:26 Let's also take an example to see how do we make the numbers sp transform

40:31 a layout invariant operator and all the audience

40:35 here are known about entity we will

40:36 skip this part it could transforming the polinomial

40:39 mlication from quadratic cost into linear cost

40:43 and for parallel computing device the most

40:46 popular algorithm of being used is called fourstep

40:50 entity within four step entity this input

40:53 vector will first being transformed into a two

40:55 dimensional matrix here I illustrating using 16

40:59 degree as an example which will be lowering

41:02 into a two dimensional matrix of 4x4 which using R and C2 indicating the row

41:06 and columns and this input matrix will

41:09 go through four steps as it name suggested

41:12 step one we take this matrix and perform entity of size R into every single

41:18 column second step we do a data transpose from the row major into column major

41:24 third step we do otherwise multiplication

41:27 and this will generating the results in the row

41:29 major in the column major sorry and then

41:32 for this column major matrix we're again

41:34 doing the C input entity to every single column now finishing this four step

41:39 we just do a big reverse of the results which give us the final output

41:44 and within there this entire ford you can see a explicit data transpose as well

41:49 as a explicit bit reverse so the way

41:53 that Matt optimizes is showing on the following.

41:56 We take the step one intact.

41:58 We didn't change step one.

42:00 And then we remove transpose by transpose

42:03 all the parameters in the following procedure.

42:06 For example, we got rid of the transpose such that the parameter of the atomized

42:11 multiplication will be transposed from the C by R into R by C

42:15 and then the second the third step

42:18 rowise multiplication will be swapped it and then

42:23 transpose on the Twitter factor and we

42:26 know the FT factor is the symmetric matrix.

42:28 So the transpose means do not do anything and then we just

42:33 have a layout environment without doing transpose from these first three steps.

42:39 And further to get rid of the bit reverse

42:41 we could fabricate or factor the bait reverse into row

42:48 wise permutation as well as columnwise permutation into the parameter

42:52 of both step one and step three shown here.

42:55 After we do this row wise permutation

42:58 to the matrix in step one and columnwise permutation

43:01 in the matrix of step three we can

43:03 directly generate the results in the targeting layout form.

43:07 And by doing these two transformations we get rid of all

43:11 the explicit memory transpose and make the entire entity layout invariant.

43:17 And in the end we need to compute

43:19 for entity just mission mlication multiplication as well as multiplication

43:26 putting all together let's see what performance of cross

43:30 that we could give you first we built a Python TPU library for CKS and open

43:36 source here deployed in Google and we using kernel

43:40 level workloads as benchmark showing on the horizontal access

43:43 and we compare that against the best FPGA Okay.

43:47 As well as the best GPU where every single

43:50 vertical means the normalized throughput where the best GPU's

43:54 performance is highlighting in this 1.0 line and every

43:58 single bar is indicating the throughput value of cross.

44:02 As you can see on the left hand side chart,

44:04 cross constantly achieving 9% to 47% higher throughput compared to the best GPU.

44:10 And on the right hand side compared

44:12 to the best FPGA we are achieving a two order

44:15 of magnitudes higher throughput that means cross turns

44:20 TPU into the state of art throughput H machine.

44:24 So we could do a quick summarization of the cross.

44:29 Cross turns TPU into civar H engine for H kernels

44:34 as well as H operators and they give you higher

44:38 throughput and higher performance per watt than other commodity hardware

44:42 and more importantly cross redefineses the capabilities of the AI6.

44:49 Previously people think about AIC could be only used

44:52 for AI training as well as inference but we proved

44:55 that this AIC has super strong mlication engine and vectoratics

45:01 and this capability could enable them to accelerate a more

45:04 wider range of application such as modularismatics and this project

45:10 is fully open source and deploying Google and it

45:13 also wins the second place Euro demo GT next

45:16 award startup creat X general lunch and we also going

45:19 to give a tutorial as plus with that we are

45:23 perfectly time to more to move on to the second

45:25 paper which is how do we enable AIC to accelerate

45:29 a more broader of the cryptography morph morph is

45:34 also a joint work with Georgia Tech Google and MIT

45:37 and a special thanks to the professors 3DS which

45:39 is supporting our work so this is targeting at application

45:44 of cryptocurrency as well as verifiable able in crypted computing.

45:48 First of all, why we need to care about

45:50 these two applications it because we have a super strong

45:54 or largest market in that the crypto market in the last

45:58 year is G is overall giving you $3.5 trillion.

46:03 It is a largest market in the world and even you

46:07 consider a small portion of layer 2 itself got 16 billion.

46:12 So more importantly this market is growing because of two trend.

46:17 First uh an administration policy such as Trump's admin

46:21 has officially established stable coin as official US dollars.

46:26 So second agent systems the emerging agent for doing crypto

46:32 transaction and AI could be helpful giving more throughput demand

46:36 from this entire crypto market and both these two growths

46:40 making the demand on the compute showing on the right

46:43 curve on the right hand side figure and but to supporting

46:47 this demand on the right curve what we

46:50 have today from Harvard is just showing on the dash

46:52 line and you can see a clearly 10x triple gap

46:57 That's why we build morph because we simply need more

47:00 computing power and morph focusing on utilizing this new category

47:05 of the powerful hardware beast AISC to accelerate zero knowledge

47:09 proof and what do we mean by zero knowledge proof

47:13 here and for the expert here we could just

47:15 make a very succinct introduction in zero knowledge proof short

47:20 for ZP a prover need to convince a verifier

47:22 that a statement is true but without revealing any other underlying information.

47:28 Considering a scenario where a prover want to transfer

47:32 all this verifier a trillion dollars and of course

47:35 the prover doesn't want to leak how much account

47:37 balance does they have after they make this transaction.

47:40 So what it could do is it can

47:41 generate a zero proof which cryptographically protect the statement

47:46 and the verifier could only check whether it's

47:48 true or false but without knowing anything beyond that.

47:51 This zero knowledge proof is powerful but and checking this is really fast.

47:55 It can happen in millisecond even with CPU

47:58 but the proof generation could be very slow.

48:00 It take over as many minutes as even days.

48:03 So this proof generation limits the overall

48:06 system circuit that become our acceleration target.

48:10 And the reason why it is slow is coming from a high precision requirement.

48:16 Let's compare that against H.

48:19 We know in HD we have 28 to 59 bit but that will become

48:23 256 to 700 bits in zero knowledge proof which introducing a 10x higher demand

48:30 and further this higher precision is working on the prime field that means

48:36 the modus here throwing on the queue is a prime value which cannot be factored.

48:42 So it invalidate the usage of residual number systems.

48:47 With all this residual number system RS the high precision multiplication

48:52 could be have to be lower into 16 bit chunk in order

48:57 to uh tune with the hardware machine word and this high

49:01 precision multiplication will become a all

49:03 multiplication of these low precision chunks.

49:06 If you have L different chunks then in total you will pay

49:10 L square number of mlication and give you a quadratic computition overhead.

49:15 In other words a 10x increase in the precision

49:18 will introduce a 100x more computation demand.

49:22 That's why make it very slower and therefore we make

49:26 a solution to adding a extended non-prime field showing in red.

49:32 Let's first encapsulate the original prime field

49:35 showing on the f subq highlighted blue here.

49:39 And what we propose is a nonprime field f subm

49:43 where the modus m here is nonprime value which could

49:46 be factored and this enable enable us to rep the entire

49:51 computation happens in the original f subq into f subm.

49:56 For example, shown here the value of a* b will generating a result

50:00 of a and all of these will be within the range of m.

50:05 Specifically, we choose value of M to be larger than Q^ square such

50:10 that any multiplication will not exceeding value

50:13 of M and that enable us to perform

50:16 this computation in the residual number system of F subm where we will have two

50:22 L number of the chunks and this will

50:24 lowering the quadratic competition cost into linear.

50:29 So further we can converting the results of AB back to F

50:34 subq and then perform the model reduction there and then convert it

50:37 back to F subm for the following computation and this entire comp

50:42 transformation could be scheduled as a low precision mislication and that could

50:48 be le be accelerated by the machine modification available in the AI6

50:53 and put this together we also con constructed an open source

50:57 library which is for TPU to enable CKP primitive such as multiscalar

51:02 mlication as well as a DT and compare with cross which

51:06 we achieve up to 100x speed up and the high precision

51:11 crypto primitive such as MSM and NDT and both makes TPU also

51:17 the state-of-the-art super machine for the CKP as well and putting

51:22 these together in summary I'm very thrilled today to introducing two line

51:27 of the compilation work cross and morph and both could turn TPU

51:31 as state-of-the-art throughput machine for both

51:34 homorphic encryption as well as zero knowledge proofs and this positions AI

51:39 ethic as a state-of-the-art throughput machine

51:41 for both low precision crypto primitives

51:44 as well as high precision crypto primitives.

51:47 So with that I'm happy to answer any questions and as a next

51:51 step I also want to prefetch we want to enable

51:54 the cross kernel API into a anttoan application as well

51:59 as build entire morph excited to support a fullto protocol and we

52:04 are uh open for collabision on how do we design a harle

52:07 protocol or how do we align uh the morph and cross

52:10 to support realistic application from industry such as the prestigious Microsoft

52:15 research yeah with that I'm happy to also had a question.

52:20 Thank you young men for the nice presentation.

52:24 Thank you.

Study with Looplines Download Captions Watch on YouTube

CROSS — Leveraging AI ASICs for Homomorphic Encryption

Send feedback