The most complex model we actually understand

Welch Labs

0:00 No one understands modern AI.

0:03 Each new little piece of text known as a token produced

0:06 by ChatGPT is the result of hundreds of billions of separate calculations.

0:11 The parameters used in these calculations are learned from data by training

0:16 ChatGPT to predict a single token at a time.

0:20 But somehow from just learning to predict the next little

0:23 piece of text again and again across trillions of examples,

0:27 what feels like real intelligence emerges.

0:31 What pathways through the network's billions of computations

0:34 are responsible for specific knowledge or abilities?

0:38 Why do certain skills only emerge from models

0:40 of a certain size or after training for a certain duration?

0:44 Are these giant models just memorizing or are they actually learning?

0:50 Today we have many compelling clues

0:52 but no definitive answers to these questions.

0:56 One interesting question we can ask is how much complexity do we

0:59 have to strip away before we can really truly understand a model?

1:04 We know how the individual artificial neurons that make up these models work.

1:09 Although this did take some time to sort out back in the 1960s.

1:13 As we connect more and more of these neurons together,

1:16 when exactly does our understanding really start to break down?

1:20 In this video, I'm going to claim that one specific example,

1:24 grokking modular arithmetic with a single-layer transformer,

1:28 is the most complex AI model that we fully understand.

1:32 This is obviously highly subjective.

1:35 If you have a different example that you think fits,

1:37 please share it in the comments.

1:38 Your answers could make for a fun follow-up video.

1:42 Like many scientific discoveries,

1:44 we stumbled onto grokking completely by accident.

1:48 The initial discovery led to some remarkable follow-up work that allows

1:51 us to rigorously understand what the model's parameters are actually learning,

1:56 why certain behaviors emerge later in training.

1:59 And incredibly, we can even watch

2:00 the model progress from just memorizing training

2:03 examples to learning a robust Fourier

2:06 space solution to the modular arithmetic problem.

2:10 This example is a few years old at this point,

2:12 but it's an amazing and still very relevant

2:14 way to look under the hood of modern transformers.

2:17 At the end of this video,

2:19 we'll also look at some more recent fascinating results from a team

2:22 at Anthropic, where the team found a six-dimensional manifold in the activations

2:27 of Claude Haiku that appears to be responsible for handling the arithmetic

2:32 required for the model to figure out when to create new lines

2:35 as Claude writes. In 2021, a research team at OpenAI was

2:42 training small models to perform modular arithmetic.

2:46 If we take a mathematical operation like X + Y,

2:50 we can turn this operation into a data set by creating a table

2:53 with various X values as our columns and various Y values as our rows.

2:59 From here, we can fill in each cell with the sum of X and Y.

3:03 0 + 0 is 0.

3:04 0 + 1 is 1, and so on.

3:07 The team was studying modular arithmetic,

3:10 meaning we need to pick a largest number or modulus.

3:14 When our number reaches or exceeds the modulus,

3:16 we divide by the modulus and take the remainder.

3:20 If we choose a modulus of 5, when we reach 1 + 4 on our table,

3:25 the answer is actually 5 modulo 5, which equals 0.

3:30 4 + 2 equals 6 modulo 5, giving a final answer of 1, and so on.

3:35 The modulo operation gives our model some interesting structure to learn

3:39 and nicely bounds the number of individual tokens our model needs.

3:44 We know that in this case our answer will always be 0, 1, 2, 3, or 4.

3:49 From here we set aside a portion of our data

3:51 for testing and train on the remaining examples.
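
As a rough sketch of the table-and-split setup just described, here is a small NumPy version. It is an illustration only, not the OpenAI team's code; the 30% test fraction and the random seed are assumptions picked for the example.

```python
import numpy as np

p = 5  # modulus used in the toy example

# Addition table: columns are x values, rows are y values, entries are (x + y) mod p.
x = np.arange(p)
y = np.arange(p)
table = (x[None, :] + y[:, None]) % p

# Flatten the table into (x, y, answer) examples.
examples = np.array([(a, b, (a + b) % p) for a in range(p) for b in range(p)])

# Hold out a portion for testing (the 30% fraction is an assumption).
rng = np.random.default_rng(0)
order = rng.permutation(len(examples))
n_test = int(0.3 * len(examples))
test_set = examples[order[:n_test]]
train_set = examples[order[n_test:]]

print(table)
print(len(train_set), "training examples,", len(test_set), "test examples")
```

Every entry of the table stays between 0 and 4, which is why the toy model only needs five number tokens.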

3:55 It's worth taking a moment to consider what

3:57 this data set really looks like from our model's perspective.

4:00 Our model has one input and one output for each token in its vocabulary.

4:05 We need five tokens to represent our numbers 0 through 4,

4:09 and we'll add one more token to represent our equal sign.

4:13 We could also add a token for the plus sign,

4:16 but since we'll only be training our model on addition, it's not needed.

4:20 Having a token for the equal sign is helpful, however, as we'll see.

4:24 This effectively gives our model a placeholder for its final answer.

4:28 So our model has six total inputs, one for each token.

4:32 For comparison, GPT-5 has 200,000 inputs.

4:36 Again, one for each token in its vocabulary.

4:39 To input a math problem into our model, for example, 1 + 2,

4:44 we pass in the first token in our math problem, one, into the model

4:48 by switching on the one position and switching off all the other positions.

4:53 This is known as one-hot encoding and is how the model sees our first token.

4:58 Our second token, two, is passed into our model

5:01 by switching on the second input and switching off the rest.

5:05 Finally, our equal sign tells us to switch on only the final input to our model.

5:11 So the math problem 1 + 2, from the perspective of our model, looks like

5:15 its first input switched on, then its second input, and then its sixth input.

5:21 Transformers like these are generally configured to return

5:24 outputs of the same dimension that they're given.

5:26 So our model's final output will also be 6x3.

5:30 In this case, we're only going to look

5:32 at the final column of the model's output.

5:35 This is where we want the right answer to show up.

5:37 And in this case, we want the three output

5:39 to be switched on, since 1 + 2 is 3.

5:43 So what our model is really learning is to map this pattern of 18 values,

5:47 mostly zeros, to this new pattern of six values.
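
To make the "18 values in, six values out" picture concrete, here is a minimal sketch of the encoding in Python. The token order in `VOCAB` and the helper names `one_hot_problem` and `target_pattern` are hypothetical, chosen for this example (number tokens first, equal sign last).

```python
import numpy as np

VOCAB = ["0", "1", "2", "3", "4", "="]  # five number tokens plus the equal sign

def one_hot_problem(x, y):
    """Encode the problem 'x + y =' as a 6x3 matrix, one column per token."""
    tokens = [str(x), str(y), "="]
    encoded = np.zeros((len(VOCAB), len(tokens)))
    for column, token in enumerate(tokens):
        encoded[VOCAB.index(token), column] = 1.0
    return encoded

def target_pattern(x, y, p=5):
    """The six-value pattern we want in the final output column."""
    target = np.zeros(len(VOCAB))
    target[(x + y) % p] = 1.0
    return target

inputs = one_hot_problem(1, 2)  # 18 values, three of them switched on
target = target_pattern(1, 2)   # six values, only the "3" position switched on

print(inputs)
print(target)
```

Nothing in these arrays tells the model that the rows stand for numbers; the labels live only in `VOCAB`.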

5:52 Now imagine someone just handed you a bunch

5:54 of different target input and output patterns.

5:57 Here are the input and output patterns for 1 + 3 = 4.

6:02 Here's 2 + 3 = 0, and so on.

6:05 After you saw enough of these examples,

6:08 do you think you could figure out the underlying structure of the problem?

6:13 This is precisely how large language models work.

6:16 When we pass in the text "the capital of France is" into Llama,

6:19 for example, the token for "the" tells us to switch on input 791.

6:24 The token for "capital" tells us to switch on input 6864, and so on.

6:30 Moving to Llama's output, the final column is maximized at an index of 12366,

6:36 which corresponds to the token for "Paris".

6:39 It's easy to forget that the symbols we assign to our model's

6:43 inputs and outputs have this extra meaning that we attach to them.

6:47 But to the model, they're just patterns of inputs and outputs.

6:52 Now, when the OpenAI team trained their model on modular arithmetic,

6:56 their initial results were pretty underwhelming.

6:59 The model was able to quickly learn to match the patterns in the training data,

7:03 giving the correct output on all training examples.

7:07 However, the model performed very poorly on the test set.

7:10 It appeared that the model had simply memorized

7:12 the training data without actually learning modular addition.

7:17 But then something interesting happened.

7:20 One of the researchers went on vacation but accidentally left a model training.

7:25 Returning from vacation, the researcher was shocked to discover that after

7:29 a very large number of training steps, the model had suddenly generalized,

7:34 performing perfectly on both training and test sets.

7:39 What mechanism could possibly be causing the model to perfectly

7:42 fit the training examples after just a couple hundred steps,

7:45 appear to lie dormant for a couple thousand steps,

7:49 and then suddenly actually learn?

7:51 And could similar dynamics happen in full-size models?

7:56 In Robert A.

7:57 Heinlein's 1961 novel, Stranger in a Strange Land, he coins the term grokking.

8:03 The book's main character, a human who was raised on Mars and returns to Earth,

8:08 uses the Martian word grok throughout the book.

8:11 Grok has no direct translation from the far more complex Martian language.

8:16 But one meaning is to understand something so thoroughly

8:19 that you merge with it and it merges with you.

8:23 The OpenAI team was able to replicate

8:25 the sudden generalization phenomenon across a range of arithmetic

8:29 operations and model configurations and in January 2022

8:34 published this paper where they called the phenomenon grokking.

8:38 Grokking is a provocative name, but the phenomenon itself is shocking.

8:44 What could be causing the model to suddenly perform perfectly on the test set?

8:49 A year after the publication of the OpenAI grokking paper,

8:53 a team led by researcher Neel Nanda

8:55 published an incredibly detailed analysis of the phenomenon.

8:59 Their paper digs deep into the model's parameters

9:01 and activations to produce a very satisfying and elegant explanation.

9:06 Nanda and his collaborators studied a single-layer transformer.

9:11 This is the same architecture used in most

9:13 large language models just with fewer layers.

9:17 A transformer layer is composed

9:18 of an attention block and a multi-layer perceptron compute block.

9:22 As we saw with our toy example earlier,

9:25 our data is fed into our model using one-hot vectors.

9:29 Nanda used a modulus of 113.

9:32 So the model's input vectors are of length 114, with 113 positions

9:37 for the numbers 0 through 112 and a final position for the equal sign.

9:42 So to ask our model what 1 + 2 is,

9:45 we pass in this 114x3 matrix made up of all zeros

9:50 except for a one in the one spot of our first column,

9:54 a one in the two spot of our second column,

9:56 and a one in the equal spot of our final column.

9:59 From here, our 114x3 matrix is multiplied by a matrix

10:03 of learned weights known as an embedding matrix,

10:06 producing three new vectors of length 128 each.

10:11 These resulting embedding vectors are no longer sparse

10:14 and as we'll see contain some interesting structure.

10:17 From here, our embedding vectors are passed into our attention

10:20 block and then our multi-layer perceptron compute block.

10:24 The output of our multi-layer perceptron is of length 128.

10:28 We multiply this output by an unembedding matrix

10:31 to compute a final vector of length 114.

10:35 The model's answer is given by the largest value in this final vector.

10:39 So if our model is working well,

10:41 its maximum output value should occur in the three

10:44 position, corresponding to the correct answer, 1 + 2 equals 3.
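
The shapes in this walkthrough can be traced with a short NumPy sketch. This is a shape check only, under loud assumptions: the weights are random stand-ins rather than trained parameters, the attention and multi-layer perceptron blocks are skipped entirely, and placing the equal sign at the last index is a choice made for the example.

```python
import numpy as np

p, d_model = 113, 128
vocab = p + 1   # 113 numbers plus the equal sign
EQUALS = p      # index of the "=" token (assumed to be the last position)

rng = np.random.default_rng(0)
W_E = rng.normal(size=(d_model, vocab))  # embedding matrix (random stand-in)
W_U = rng.normal(size=(vocab, d_model))  # unembedding matrix (random stand-in)

def one_hot(tokens):
    """Build a vocab x len(tokens) matrix with a single 1 per column."""
    encoded = np.zeros((vocab, len(tokens)))
    encoded[tokens, np.arange(len(tokens))] = 1.0
    return encoded

inputs = one_hot([1, 2, EQUALS])  # 114 x 3, almost all zeros
embedded = W_E @ inputs           # 128 x 3, no longer sparse

# Attention and MLP blocks are omitted here; a real model transforms
# `embedded` before the unembedding step.
final_hidden = embedded[:, -1]    # length 128, taken from the "=" column
logits = W_U @ final_hidden       # length 114
prediction = int(np.argmax(logits))

print(inputs.shape, embedded.shape, logits.shape)
```

Because each input column is one-hot, multiplying by the embedding matrix simply selects one of its columns, which is why the same matrix treats all three tokens in the same way.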

10:50 Training this model on modular addition,

10:52 we see the same grokking behavior observed by the OpenAI team with the model first

10:57 memorizing the training data after around 140

11:00 steps and then generalizing after 7,000 training steps.

11:05 Let's explore the model's intermediate outputs, better known as activations.

11:10 Specifically, let's have a close look at the outputs of some

11:13 of the neurons in the second layer of our multi-layer perceptron block.

11:18 This layer has 512 total neurons.

11:21 If we pass in the problem 0 plus 0 into our network,

11:25 the first neuron of this layer returns an output value of 1.17.

11:30 Our second neuron returns an output of 0.6 and so on.

11:34 Now let's visualize how these values change as we change the input math problem.

11:40 Let's fix the value of x to 0 and explore

11:43 a range of y values starting with 0 + 0,

11:47 then 0 + 1, then 0 + 2, and so on.

11:51 Sweeping through all 113 possible values for y,

11:54 we see some interesting structure with the outputs

11:58 of some of our neurons looking like sine waves.

12:02 Digging deeper, let's explore the correlation between

12:05 all the different pairs of these neurons.

12:08 Let's color our points using the input y value to our model.

12:11 So our neuron outputs given the input 0 + 0 are colored

12:14 purple and outputs given the input 0 + 112 are colored yellow.

12:20 From here we'll create a 7x7 grid of scatter plots for each pair of neurons.

12:25 So on our second scatter plot on our first row for example we'll plot the output

12:30 of our first neuron as the y value

12:32 and the output of our second neuron as the x value.

12:35 Bringing our two waves together like this results in a nice loop shape.

12:39 Creating the same plots for each pair of neuron outputs,

12:42 we see more interesting structures.

12:45 So our model has clearly learned some type of structure.

12:48 But could this structure be related to grokking?

12:51 If we move backwards in our training

12:53 process and visualize these structures as we go,

12:57 we see that by the time we reach our model that just memorizes our training set,

13:01 these structures completely disappear.

13:04 So while this early model performs perfectly on the training set,

13:08 we don't see any evidence of the waves and loops that we see after grokking.

13:12 So perhaps these structures are related to why the model groks. This video is sponsored by me.

13:20 The Welch Labs team and I have written a whole new book on AI.

13:24 It's beautifully illustrated and is a great way to dig

13:27 deeper into the topics we cover in these videos.

13:30 Each chapter includes thought-provoking exercises and supporting code.

13:35 Our first print run is totally sold out,

13:37 but we have another batch coming quickly in January.

13:40 And if you order now,

13:41 I'll send you a discount code for a free download of the ebook version.

13:46 Books and education are really near and dear to my heart,

13:49 and we've poured a ton of effort into this book.

13:51 I really think you're going to like it.

13:54 Now, back to grokking modular arithmetic.

13:58 The wave shapes and loops we see inside

14:00 our model as it groks suggest that the model

14:03 is potentially computing and making use of the sines

14:06 and cosines of our inputs x and y.

14:09 If we take a discrete Fourier transform of our activation pattern,

14:13 we can compute the frequencies of the waves learned by our model.

14:17 This first wave yields a largest frequency component of 8 pi over 113.

14:22 And our third wave shows a largest frequency component of 6 pi over 113.

14:27 If we plot these waves on top of our model's outputs, we see nice alignment.
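
Here is a minimal sketch of how a dominant frequency can be read off with a discrete Fourier transform. The `activation` curve is a synthetic stand-in (a cosine with an arbitrary amplitude, phase, and offset), not a real neuron output; with a trained model you would substitute the recorded activations.

```python
import numpy as np

p = 113
y = np.arange(p)

# Synthetic stand-in for one neuron's output as y sweeps 0..112 (assumed shape).
k_true = 4
activation = 0.8 * np.cos(2 * np.pi * k_true * y / p + 0.3) + 0.5

# Remove the mean so the constant offset does not dominate the spectrum.
spectrum = np.abs(np.fft.rfft(activation - activation.mean()))
k_peak = int(np.argmax(spectrum))

# Bin k of a length-113 transform corresponds to angular frequency 2*pi*k/113.
omega = 2 * np.pi * k_peak / p
print(k_peak, omega, 8 * np.pi / p)
```

In this convention the 8 pi over 113 frequency is bin 4 of the transform, and 6 pi over 113 is bin 3.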

14:33 Let's look for these frequencies in other places in our model.

14:37 Let's visualize a single value in our first embedding vector.

14:42 Just as we did with the neurons in our multi-layer perceptron,

14:45 let's plot this value as we sweep through a range of input values.

14:50 Note that our first embedding vector only depends on our first input x.

14:54 So here we'll sweep from x= 0 to x= 112 while keeping y fixed at zero.

15:01 We don't see quite the same smooth plots that we saw earlier.

15:04 But if we compare our curve to a cosine wave with a frequency of 8 pi over 113,

15:09 we do see reasonably good alignment.

15:13 Part of the challenge here is that this early signal

15:16 in our network also appears to contain higher frequency information,

15:20 which makes sense given that we found

15:22 evidence of multiple frequencies later in our model.

15:25 We could analyze the frequency content of our full

15:27 embedding vectors at this stage of the model.

15:30 But for now, let's build what's known as a sparse linear probe.

15:35 If we sample the values at a few more positions of our embedding vector,

15:38 we see similar semi-periodic curves.

15:42 Now it turns out that if we take a weighted sum of these eight curves,

15:46 we end up with a curve that looks very close

15:48 to a cosine curve with a frequency of 8 pi over 113.

15:54 The weighted sum is very relevant here because taking weighted sums like

15:58 this is a big part of what our attention and multi-layer perceptron blocks do.

16:04 Meaning that these compute blocks have access to a very clean cosine wave.

16:09 The signal is just spread across a few different locations in our model.

16:12 At this stage, we can compute a similar sparse linear

16:16 probe for the sine of x times 8 pi over 113.
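
A sparse linear probe of this kind can be sketched as a least-squares fit over a handful of embedding positions. Everything about the data below is an assumption made for illustration: the eight `curves` are synthetic mixtures of sine and cosine waves at two frequencies plus a little noise, standing in for real embedding values, and `fit_probe` is a hypothetical helper name.

```python
import numpy as np

p = 113
x = np.arange(p)
w3, w4 = 6 * np.pi / p, 8 * np.pi / p

# Synthetic stand-in for 8 positions of the embedding vector as x sweeps 0..112.
rng = np.random.default_rng(0)
basis = np.stack(
    [np.cos(w4 * x), np.sin(w4 * x), np.cos(w3 * x), np.sin(w3 * x)], axis=1
)
curves = basis @ rng.normal(size=(4, 8)) + 0.01 * rng.normal(size=(p, 8))

def fit_probe(curves, target):
    """Least-squares weights so that curves @ weights approximates target."""
    weights, *_ = np.linalg.lstsq(curves, target, rcond=None)
    return weights

cos_weights = fit_probe(curves, np.cos(w4 * x))
sin_weights = fit_probe(curves, np.sin(w4 * x))

cos_probe = curves @ cos_weights  # weighted sum of the eight curves
sin_probe = curves @ sin_weights

print(np.corrcoef(cos_probe, np.cos(w4 * x))[0, 1])
```

The probe is "sparse" in the sense that it reads only 8 of the 128 embedding positions, and "linear" because its output is just a weighted sum, the same kind of operation the downstream blocks can perform.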

16:21 Now, our first embedding vector only depends on our first input x

16:25 and our second embedding vector only depends on our second input y.

16:29 These inputs are combined in our attention block.

16:32 Since the same embedding matrix is

16:34 used to process our three inputs independently,

16:37 we can use the same sparse linear probe on our second embedding vector.

16:41 And we'll see the same nice cosine and sine curves, but now as a function of y.

16:47 So very early in the model,

16:49 our model learns to compute the sines and cosines of our inputs.

16:53 But why?

16:54 What do these functions from trigonometry

16:56 have to do with learning modular addition?

17:00 The modular addition problem may seem a bit foreign or contrived,

17:03 but we actually do it all the time.

17:06 A 2-hour meeting that starts at 11 a.m.

17:09 will end at 11 + 2 modulo 12, which equals 1 p.m.

17:15 Analog clocks are implementing modular addition physically.

17:19 Each hour that ticks by adds one to the hour hand.

17:22 And the circular motion of the hands

17:24 perfectly matches the modular arithmetic problem,

17:28 starting over when reaching 12.

17:31 Now, as we saw when probing the neurons in our multi-layer perceptron,

17:35 our network learns to form circular patterns in its activations.

17:40 Could these circular structures be solving the modular arithmetic

17:43 problem in the same way that an analog clock does?

17:48 The sines and cosines we see computed by our model

17:50 in its first layer could be part of this puzzle.

17:54 If we put the output of our sparse cosine probe on the x-axis

17:58 and the output of our sparse sine probe on the y-axis of a scatter plot,

18:02 we get a nice circle when we sweep through our input values.

18:08 However, it's not enough to learn

18:09 a circular structure for x and y independently.

18:13 Our network has to figure out how to actually add x and y together.

18:17 Adding x and y may seem trivial for our model to learn.

18:21 After all, neural networks are literally built

18:23 from a bunch of adds and multiplies.

18:26 But remember that we aren't actually passing in, for example,

18:30 the number two or a direct representation of it.

18:33 Instead, we're switching on the input to our model that we have labeled two.

18:39 The network cannot just use one of the additions

18:41 in one of its neurons to add X and Y together.

18:45 What happens instead turns out to be way more interesting.

18:50 It is straightforward for our attention layer to add together

18:53 the various sines and cosines computed by our first layer.

18:57 Our attention layer could easily compute cosine x plus cosine of y.

19:02 However, that's still not what we need to solve the problem.

19:06 We need to add together x and y themselves. In our clock analogy,

19:10 we need to add the angles of the clock hands,

19:13 not the sines and cosines of these angles.

19:17 Let's return to the second layer

19:19 of neurons in our multi-layer perceptron compute block.

19:23 Earlier, we explored how these neuron outputs

19:25 changed as we varied a single input.

19:29 Let's now explore how these outputs change as we vary both X and Y

19:33 to see if we can figure out

19:34 how our network is bringing these variables together.

19:38 Again, visualizing the output of a single neuron.

19:42 If we keep y fixed at zero and sweep through all possible x values,

19:46 we get a familiar wave shape.

19:49 Now let's add another axis to our visualization

19:52 and plot our neuron's output now as we vary y.

19:57 Let's explore all combinations of values for x and y.

20:01 With this many points,

20:02 it's easier to visualize our neuron's outputs as the height of a surface

20:06 where the color of the surface corresponds to our neuron's output value.

20:11 Like many of the outputs we've seen so far,

20:13 our surface is approximately wavelike.

20:17 What combinations of sines and cosines best capture

20:20 this wave structure that our network has learned?

20:23 As we did earlier, we can take a Fourier transform,

20:27 but this time with respect to both X and Y.

20:30 Extracting our top frequencies,

20:32 we can decompose our surface into a few key components.

20:37 This component is the cosine of x and this component is the cosine of y.

20:43 This top component is the strongest and the most interesting.

20:47 It's equal to the cosine of x times the cosine of y.

20:51 So the strongest frequency component of our surface

20:54 is equal to the product of the cosine

20:56 of x and cosine of y functions that we saw computed earlier in our network.
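
The decomposition described here can be sketched with a two-dimensional Fourier transform. The `surface` below is a synthetic stand-in built from the three components named in the video, with made-up amplitudes; it is not a real neuron's output.

```python
import numpy as np

p = 113
w = 8 * np.pi / p
x, y = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")

# Synthetic surface: a strong product term plus weaker single-variable terms.
surface = (
    1.0 * np.cos(w * x) * np.cos(w * y)
    + 0.4 * np.cos(w * x)
    + 0.3 * np.cos(w * y)
)

# 2D transform with respect to both x (axis 0) and y (axis 1).
spectrum = np.abs(np.fft.fft2(surface)) / p**2

# Take the 8 largest bins and fold negative frequencies onto positive ones.
top = np.argsort(spectrum, axis=None)[::-1][:8]
kx, ky = np.unravel_index(top, spectrum.shape)
kx, ky = np.minimum(kx, p - kx), np.minimum(ky, p - ky)
components = sorted(set(zip(kx.tolist(), ky.tolist())))

print(components)
```

A bin at (4, 0) is a wave in x alone, (0, 4) is a wave in y alone, and (4, 4) is where the product of the two cosines shows up.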

21:01 Now, it turns out that it's more natural for our network

21:04 to take a sum of sines and cosines than a product.

21:07 I'll put a note about this in the description.

21:10 So, why are we finding a strong product like this in the middle of our network?

21:15 And does this get us any closer to actually computing the sum of X and Y?

21:20 Remarkably, it does.

21:22 Let me show you one more thing.

21:24 Let's go one layer of neurons deeper into our multi-layer perceptron and plot

21:29 the outputs of a neuron in this layer as a function of X and Y.

21:34 We see similar wavelike shapes here,

21:36 but the wave is less regular and it moves diagonally across our surface.

21:42 This orientation of the wave is really important.

21:46 Consider these top two crests where the output of our neuron is maximized.

21:52 Let's move to an overhead view and look at the combinations

21:55 of our input values that fall on these wave crests.

21:59 The first crest starts at x = 0 and y = 65.

22:03 Moving along our crest, we find intermediate values at x = 20 and y = 45,

22:10 x = 40 and y = 25, x = 60 and y = 5, and finally x = 65 and y = 0.

22:20 All of these pairs of inputs add to the same value of 65.

22:25 So this neuron fires maximally when x + y equals 65.

22:30 In its own specialized way,

22:32 this neuron has learned to add or more precisely this neuron

22:36 fires for any pair of inputs that add to 65.

22:41 Our second wave crest starts at x = 66, y = 112.

22:47 From there it moves through values like x = 91

22:50 and y = 87 and ends on x = 112 and y = 66.

22:56 Adding these pairs together we get 178 in each case.

23:01 Recall that our model is trained on modular addition with a modulus of 113.

23:07 Our result of 178 modulo 113 is 65.

23:12 So this second crest also finds pairs of inputs that add to 65.
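
The two crests can be reproduced with an idealized version of this neuron. The assumption here is that the neuron's output is exactly a cosine of x + y shifted to peak at 65, at the 8 pi over 113 frequency; the real neuron is less regular, so this is only a sketch of the geometry.

```python
import numpy as np

p = 113
omega = 8 * np.pi / p  # frequency assumed for this idealized neuron
peak = 65

x, y = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
surface = np.cos(omega * (x + y - peak))  # idealized neuron output

# Input pairs sitting exactly on a wave crest.
on_crest = np.isclose(surface, 1.0)
crest_sums = np.unique((x + y)[on_crest])

print(crest_sums)                         # raw sums along the crests
print(np.unique((x + y)[on_crest] % p))   # the same sums modulo 113
```

The raw sums differ by exactly one modulus, so both diagonal crests collapse to the same answer once we reduce modulo 113.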

23:18 But how, in just one layer of neurons, do we go from products like the cosine

23:23 of x times the cosine of y to actually adding together x and y themselves?

23:30 Here's the output of another neuron

23:31 in the second layer of our multi-layer perceptron.

23:35 The strongest frequency component here is sine of x times sine of y.

23:40 Now each neuron in our following layer takes a weighted

23:43 sum of the outputs of the neurons in our current layer.

23:48 Let's consider how this weighted sum causes our surfaces to interact.

23:52 We saw earlier that our first neuron's output has a strongest

23:55 frequency component of cosine of x times the cosine of y,

24:00 and our new second layer neuron has a strongest frequency

24:02 component of the sine of x times the sine of y.

24:06 Let's assume for a moment that the weight

24:08 assigned to our cosine x times cosine y neuron

24:11 is 1 and the weight assigned to our sine x times sine y neuron is negative 1.

24:17 Visually, this negative weight flips our second surface vertically.

24:22 Now, when we add these weighted surfaces together,

24:25 the sines and cosines remarkably interfere in just the right

24:29 way to create the diagonal symmetry that we see

24:32 in our neuron in the following layer that allowed our neuron

24:34 to fire on combinations of inputs that add to 65.

24:40 As you may remember from trigonometry class,

24:42 the cosine of x times the cosine of y minus the sine

24:46 of x times the sine of y is actually a trigonometric identity,

24:50 specifically a sum of angles identity that exactly equals the cosine of x + y.

24:57 This identity allows us to convert the sum of products

25:00 of sines and cosines into a sum of x and y,

25:04 which is exactly what our network needs to compute.

25:07 And remarkably, the network appears to have learned to effectively

25:11 use this trigonometric identity to solve the modular addition problem.
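
The weighted sum with weights 1 and negative 1 can be checked numerically. This sketch uses exact sine and cosine products in place of real neuron outputs, which is an idealization, and assumes the 8 pi over 113 frequency.

```python
import numpy as np

p = 113
omega = 8 * np.pi / p
x, y = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")

cos_cos = np.cos(omega * x) * np.cos(omega * y)  # idealized first neuron
sin_sin = np.sin(omega * x) * np.sin(omega * y)  # idealized second neuron

# Weighted sum taken by a neuron in the following layer.
combined = 1.0 * cos_cos + (-1.0) * sin_sin

# Sum of angles identity: cos(a)cos(b) - sin(a)sin(b) = cos(a + b).
diagonal = np.cos(omega * (x + y))
max_gap = np.abs(combined - diagonal).max()

print(max_gap)
print(combined[0, 65], combined[20, 45], combined[65, 0])
```

Since `combined` depends on x and y only through x + y, every pair of inputs with the same sum lands on the same value, which is the diagonal symmetry seen in the following layer.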

25:16 And remember that our training data is just

25:18 these sparse patterns that have nothing to do with sines,

25:21 cosines, or trigonometric identities.

25:26 The final unembed portion of our model takes one more weighted sum,

25:30 this time of the outputs of the final

25:32 layer neurons in our multi-layer perceptron.

25:35 Visualizing the outputs of a few more of these neurons,

25:38 we see the same types of diagonal symmetries with various shifts and scales.

25:44 Our unembedding layer takes different combinations of these outputs

25:47 for each possible token that the network could return.

25:51 Here's the resulting surface for the seven output.

25:55 As we saw with our multi-layer perceptron neuron

25:57 that detected all combinations of numbers that added to 65,

26:02 this surface reaches a maximum for all the combinations

26:05 of X and Y that add to 7.

26:08 Here's 7 plus 0.

26:09 Here's 0 plus 7.

26:10 And here's 3 + 4.

26:14 So remarkably, to solve this modular arithmetic

26:17 problem, our network learns to numerically estimate

26:20 the sines and cosines of our inputs, computes

26:24 the products of these functions, uses

26:26 a clever trig identity to create the diagonal

26:29 symmetry needed to solve the modular addition

26:31 problem, and then brings multiple versions of these resulting

26:35 patterns together to compute a final answer.
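
The whole pipeline can be written out by hand as a short function. This is a hand-coded idealization of the steps just summarized, not the trained network: the two frequencies in `KS` are the ones named in the video, and using the sine version of the sum of angles identity alongside the cosine version is an assumption of this sketch, added so that the final readout is a single cosine of x + y minus each candidate answer.

```python
import numpy as np

p = 113
KS = [3, 4]  # Fourier bins for the 6*pi/113 and 8*pi/113 frequencies

def modular_add_via_trig(x, y):
    """Compute (x + y) mod p using only sines, cosines, products, and sums."""
    c = np.arange(p)  # every candidate answer
    logits = np.zeros(p)
    for k in KS:
        w = 2 * np.pi * k / p
        cos_x, sin_x = np.cos(w * x), np.sin(w * x)
        cos_y, sin_y = np.cos(w * y), np.sin(w * y)
        # Sum of angles identities built from products.
        cos_sum = cos_x * cos_y - sin_x * sin_y  # cos(w * (x + y))
        sin_sum = sin_x * cos_y + cos_x * sin_y  # sin(w * (x + y))
        # Readout: equals cos(w * (x + y - c)), largest where c matches x + y mod p.
        logits += cos_sum * np.cos(w * c) + sin_sum * np.sin(w * c)
    return int(np.argmax(logits))

print(modular_add_via_trig(1, 2))
print(modular_add_via_trig(66, 112), (66 + 112) % p)
```

Combining more than one frequency sharpens the peak: each cosine alone is nearly flat around its maximum, but the different frequencies only all line up at the correct answer.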

26:39 Now, can this detailed understanding of how the model

26:41 solves modular addition help us understand why it groks?

26:46 Let's watch the training process again,

26:48 but this time while visualizing the evolution

26:50 of the various structures learned by our model.

26:54 After a few hundred steps, our model perfectly fits the training data.

26:59 But we don't yet see any hints of sines or cosines.

27:02 As our model continues to learn, its performance stays flat,

27:06 giving the appearance that nothing is happening.

27:10 However, as we can now clearly see under the hood,

27:13 the model is starting to piece together the relevant

27:16 structures needed to solve the modular arithmetic problem.

27:20 This is such a wild phenomenon.

27:23 It's very common to visualize training and test performance as a model learns.

27:28 And when both metrics are flat for this long,

27:31 the typical assumption is that the model is

27:33 done learning and has settled into a stable solution.

27:38 Neel Nanda and his co-authors propose a clever

27:40 new metric in their paper called excluded loss.

27:43 Note that thus far we've been plotting the model's accuracy as it learns.

27:47 And here we'll switch to plotting the model's cross entropy loss.

27:51 So lower values are better.

27:53 See my gradient descent video or chapter 2

27:55 of my new AI book for more on cross entropy loss.

27:59 Now that we know that our model is operating

28:01 in the frequency domain at a few key frequencies,

28:05 what happens when we remove the information at these frequencies

28:08 from the model's final output before measuring performance?

28:13 Removing the 8 pi over 113 frequency that we

28:16 found and plotting this excluded loss as the model learns,

28:20 we see our new metric dip down quickly with training loss,

28:24 but then slowly climb as our model builds the sine and cosine representations.

28:29 This excluded loss increases because we've taken away

28:33 the model's ability to use this key frequency.

28:36 And importantly, during this long period

28:38 of flat training and testing performance, our excluded loss slowly climbs,

28:44 showing that our model is making more

28:45 and more use of patterns at this frequency.
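
To give a feel for the idea, here is a simplified sketch of an excluded-loss style measurement. It is not the exact procedure from the paper: the two logit tables are hand-built stand-ins (one trig-based "generalizing" model and one lookup-table "memorizing" model, both hypothetical), the loss is computed over all input pairs rather than the training set, and `exclude_frequency` is a hypothetical helper that projects out the cosine and sine of x + y at one frequency from every output's logits.

```python
import numpy as np

p = 113
x, y = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
c = np.arange(p)
answers = (x + y) % p

def cross_entropy(logits, targets):
    """Mean cross-entropy of logits[x, y, :] against the correct token."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.take_along_axis(log_probs, targets[..., None], axis=-1).mean())

def exclude_frequency(logits, k):
    """Remove the cos/sin(w * (x + y)) components from each output's logit surface."""
    w = 2 * np.pi * k / p
    cleaned = logits.copy()
    for basis in (np.cos(w * (x + y)), np.sin(w * (x + y))):
        basis = basis / np.linalg.norm(basis)
        # One projection coefficient per output token.
        coeffs = np.tensordot(basis, logits, axes=([0, 1], [0, 1]))
        cleaned -= basis[..., None] * coeffs
    return cleaned

def trig_logits(ks, scale=50.0):
    """Stand-in for a model that relies on a few key frequencies."""
    diff = (x + y)[..., None] - c
    return scale * sum(np.cos(2 * np.pi * k / p * diff) for k in ks)

generalizing = trig_logits([3, 4])
memorizing = 30.0 * (answers[..., None] == c)  # stand-in lookup table

for name, logits in [("generalizing", generalizing), ("memorizing", memorizing)]:
    full = cross_entropy(logits, answers)
    excluded = cross_entropy(exclude_frequency(logits, k=4), answers)
    print(f"{name}: loss {full:.3f}, excluded loss {excluded:.3f}")
```

In this toy setting, removing one key frequency hurts the trig-based stand-in far more than the lookup-table stand-in, which is the qualitative signal the excluded loss curve is tracking during training.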

28:49 Interestingly, Nanda and his collaborators show that grokking occurs

28:52 not necessarily when the sine and cosine structures are completed,

28:56 but just after during a phase they call the cleanup phase,

29:01 where the model actually removes the memorized

29:03 examples that it relied on early in training.

29:07 These dynamics are fascinating and explain very

29:10 nicely why this model groks on this problem.

29:14 It's so satisfying to me that we can take apart this model,

29:18 understand the actual mechanisms that it learns,

29:20 and then use these mechanisms to design

29:22 a new metric that clearly shows the model's

29:25 slow progression from memorization to learning

29:28 and that nicely explains the surprising grokking behavior.

29:33 This level of clarity is a beautiful and rare exception in modern AI,

29:38 a transparent box in a world of black boxes.

29:42 The approach Nanda and his collaborators use to perform

29:45 this analysis is generally known as mechanistic interpretability.

29:49 Since Nanda's paper came out in early 2023,

29:53 we've seen some really interesting progress in this field,

29:56 but are still very far away from anywhere near

29:58 this level of understanding of full large language models.

30:02 There's some recent work from a research

30:04 team at Anthropic that gives a nice feel

30:06 for the current edge of our understanding

30:08 using this type of bottom-up mechanistic interpretability approach.

30:13 The team studies how a full-sized model Claude 3.5

30:16 Haiku figures out when to create line breaks when writing.

30:21 The team finds that the Haiku model represents the number of characters

30:24 that it's written on a given line on a manifold in six-dimensional space.

30:30 This structure is somewhat analogous to the loops

30:32 that we saw in the multi-layer perceptron of our model.

30:36 To figure out when to insert a line break,

30:38 Haiku needs to know both how many characters

30:40 it's written on the current line and how

30:43 many characters long the lines of the text it's currently writing need to be.

30:47 Using linear probes similar to the ones we used here

30:50 to find the sines and cosines early in our model,

30:54 the Anthropic team mapped character count and line

30:57 length to this six-dimensional manifold and found

31:00 that Haiku represents these concepts in this space in a very similar way.

31:06 This 70 character count probe lines up right next

31:08 to this line length of 70 probe and so on.

31:12 Now, this gets really wild when

31:14 these representations are passed into Haiku's attention blocks.

31:19 We see what the team calls a QK twist,

31:23 where these helix-like geometries are rotated relative

31:26 to each other in this six-dimensional space.

31:29 After rotation, the probe for a character count of 70

31:32 is now closest to a line width of 75.

31:36 And we see a similar offset of four

31:38 to five characters across the length of our curve.

31:42 The proximity of these points in the model's attention heads leads to a high dot

31:46 product when the model is about five characters away from the end of a line.

31:52 The team goes on to show that there are multiple attention heads that specialize

31:55 in detecting various distances from the end of the current line of text.

32:00 And this mechanism allows Haiku to precisely estimate how much

32:04 more room it has before the end of the line.

32:07 Now, compared to Claude Haiku's full range of capabilities,

32:11 deciding when to create a new line is very simple.

32:14 However, it is exciting to see that the Anthropic team found

32:17 such a clean mechanism that controls this behavior in a full-size model.

32:24 The story of grokking is such a nice arc of scientific discovery and progress.

32:30 We accidentally discovered a new phenomenon

32:33 and the search for an explanation genuinely helped

32:36 push forward our understanding of model training

32:38 dynamics and the inner workings of transformers.

32:43 The names we give our discoveries matter, and I like the name grokking.

32:48 It feels alien and originates

32:50 from the complex Martian language in Heinlein's novel.

32:54 The AI researcher Andrej Karpathy recently commented that training large language

32:59 models is less like building animal intelligence and more like summoning ghosts.

33:05 You can think of a ghost as a fundamentally

33:07 different kind of point in the space of possible intelligences.

33:11 The literal meaning of grok, to understand something profoundly and deeply,

33:16 is a nice fit for what the model appears to be doing.

33:20 But what I really appreciate here is the connotation of this thing being alien.

33:25 I think it's a really nice counterpoint to overly personifying models.

33:31 We communicate with these models in human language.

33:34 But as we've seen, this is a thin veneer.

33:36 If we go one layer deeper into what these models actually process and return,

33:41 we find these absurdly complex patterns.

33:45 As we build more intelligent models and learn more about how they work,

33:49 it will be fascinating to see if

33:51 these artificial intelligences feel more alien, ghost, or human.

34:04 I am tired.

34:05 So, this has been my first full year working fully on Welch Labs.

34:10 Um, we made some progress.

34:12 So, we did nine videos this year and we did one book.

34:15 Um, and man, getting that done filled like

34:17 every available second of time that I had.

34:21 Um, for now on the business, I'm trying to keep things simple.

34:25 Um, so really just focusing on making sure that the business

34:29 and the channel work well enough to support my family and I.

34:32 Um, I left my full-time job last year.

34:35 Um, my goal is to earn as much from Welch Labs as I did from my engineering job.

34:40 I was hoping to replace my whole income this year.

34:42 It's probably going to be like 75%.

34:45 Um, the book helped a lot, but there's always challenges.

34:47 The business side is hard.

34:49 Um, I've tried to do this full-time before, once in 2018.

34:53 Um, I just didn't have enough runway and enough focus on the business.

34:55 So, I think we're doing it right this time, but gosh,

34:58 it takes time and man, it takes a lot of work.

35:00 So, I hope you enjoyed what we've done this year.

35:02 Um, a lot more of it next year.

35:04 Uh, kind of working on the focus and direction for next year right now.

35:07 Um, but I'm really happy with the book.

35:10 I hope you're able to get a copy.

35:11 I know we're not shipping internationally yet.

35:13 That will be a focus early next year.

35:15 I promise.

35:16 Um but yeah, what a year, man.

35:18 Thank you so much for your support.

35:19 If you are able to support on Patreon, that helps a ton.

35:21 Or just liking and sharing the videos.

35:23 Um thanks for a great year.

35:25 I'll see you next year.
