The most complex model we actually understand
Welch Labs
0:00 No one understands modern AI.
0:03 Each new little piece of text, known as a token, produced
0:06 by ChatGPT is the result of hundreds of billions of separate calculations.
0:11 The parameters used in these calculations are learned from data by training
0:16 ChatGPT to predict a single token at a time.
0:20 But somehow, from just learning to predict the next little
0:23 piece of text again and again across trillions of examples,
0:27 what feels like real intelligence emerges.
0:31 What pathways through the network's billions of computations
0:34 are responsible for specific knowledge or abilities?
0:38 Why do certain skills only emerge from models
0:40 of a certain size or after training for a certain duration?
0:44 Are these giant models just memorizing or are they actually learning?
0:50 Today we have many compelling clues
0:52 but no definitive answers to these questions.
0:56 One interesting question we can ask is how much complexity do we
0:59 have to strip away before we can really truly understand a model?
1:04 We know how the individual artificial neurons that make up these models work.
1:09 Although this did take some time to sort out back in the 1960s.
1:13 As we connect more and more of these neurons together,
1:16 when exactly does our understanding really start to break down?
1:20 In this video, I'm going to claim that one specific example,
1:24 grokking modular arithmetic with a single-layer transformer,
1:28 is the most complex AI model that we fully understand.
1:32 This is obviously highly subjective.
1:35 If you have a different example that you think fits,
1:37 please share it in the comments.
1:38 Your answers could make for a fun follow-up video.
1:42 Like many scientific discoveries,
1:44 we stumbled onto grokking completely by accident.
1:48 The initial discovery led to some remarkable follow-up work that allows
1:51 us to rigorously understand what the model's parameters are actually learning
1:56 and why certain behaviors emerge later in training.
1:59 And incredibly, we can even watch
2:00 the model progress from just memorizing training
2:03 examples to learning a robust Fourier
2:06 space solution to the modular arithmetic problem.
2:10 This example is a few years old at this point,
2:12 but it's an amazing and still very relevant
2:14 way to look under the hood of modern transformers.
2:17 At the end of this video,
2:19 we'll also look at some more recent fascinating results from a team
2:22 at Anthropic, where the team found a six-dimensional manifold in the activations
2:27 of Claude Haiku that appears to be responsible for handling the arithmetic
2:32 required for the model to figure out when to create new lines
2:35 as Claude writes. In 2021, a research team at OpenAI was
2:42 training small models to perform modular arithmetic.
2:46 If we take a mathematical operation like x + y,
2:50 we can turn this operation into a data set by creating a table
2:53 with various x values as our columns and various y values as our rows.
2:59 From here, we can fill in each cell with the sum of x and y.
3:03 0 + 0 is 0.
3:04 0 + 1 is 1, and so on.
3:07 The team was studying modular arithmetic,
3:10 meaning we need to pick a largest number or modulus.
3:14 When our number reaches or exceeds the modulus,
3:16 we divide by the modulus and take the remainder.
3:20 If we choose a modulus of 5, when we reach 1 + 4 on our table,
3:25 the answer is actually 5 modulo 5, which equals 0.
3:30 4 + 2 equals 6 modulo 5, giving a final answer of 1, and so on.
3:35 The modulo operation gives our model some interesting structure to learn
3:39 and nicely bounds the number of individual tokens our model needs.
3:44 We know that in this case our answer will always be 0, 1, 2, 3, or 4.
3:49 From here we set aside a portion of our data
3:51 for testing and train on the remaining examples.
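As a rough sketch of the data set just described, here is one way to build the modulus-5 addition table and hold out part of it for testing. The function names, the 60% training fraction, and the random seed are illustrative assumptions and are not taken from the OpenAI paper.

```python
import random

def modular_addition_table(modulus):
    """Return every (x, y, (x + y) % modulus) triple for the given modulus."""
    return [(x, y, (x + y) % modulus)
            for x in range(modulus) for y in range(modulus)]

def train_test_split(examples, train_fraction, seed=0):
    """Shuffle a copy of the examples and hold out the remainder for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

table = modular_addition_table(5)
# The training fraction here is an assumption chosen for illustration.
train, test = train_test_split(table, train_fraction=0.6)
```

Every cell of the 5x5 table ends up in exactly one of the two lists, so the test set only contains sums the model has never been shown.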
3:55 It's worth taking a moment to consider what
3:57 this data set really looks like from our model's perspective.
4:00 Our model has one input and one output for each token in its vocabulary.
4:05 We need five tokens to represent our numbers 0 through 4,
4:09 and we'll add one more token to represent our equal sign.
4:13 We could also add a token for the plus sign,
4:16 but since we'll only be training our model on addition, it's not needed.
4:20 Having a token for the equal sign is helpful, however, as we'll see.
4:24 This effectively gives our model a placeholder for its final answer.
4:28 So our model has six total inputs, one for each token.
4:32 For comparison, GPT-5 has 200,000 inputs.
4:36 Again, one for each token in its vocabulary.
4:39 To input a math problem into our model, for example 1 + 2,
4:44 we pass the first token in our math problem, 1, into the model
4:48 by switching on the one position and switching off all the other positions.
4:53 This is known as one-hot encoding and is how the model sees our first token.
4:58 Our second token, 2, is passed into our model
5:01 by switching on the second input and switching off the rest.
5:05 Finally, our equal sign tells us to switch on only the final input to our model.
5:11 So the math problem 1 + 2, from the perspective of our model, looks like
5:15 its first input switched on, then its second input, and then its sixth input.
5:21 Transformers like these are generally configured to return
5:24 outputs of the same dimension that they're given.
5:26 So our model's final output will also be 6x3.
5:30 In this case, we're only going to look
5:32 at the final column of the model's output.
5:35 This is where we want the right answer to show up.
5:37 And in this case, we want the three output
5:39 to be switched on, since 1 + 2 is 3.
5:43 So what our model is really learning is to map this pattern of 18 values,
5:47 mostly zeros, to this new pattern of six values.
5:52 Now imagine someone just handed you a bunch
5:54 of different target input and output patterns.
5:57 Here are the input and output patterns for 1 + 3 = 4.
6:02 Here's 2 + 3 = 0, and so on.
6:05 After you saw enough of these examples,
6:08 do you think you could figure out the underlying structure of the problem?
6:13 This is precisely how large language models work.
6:16 When we pass the text "The capital of France is" into Llama,
6:19 for example, the token for "The" tells us to switch on input 791.
6:24 The token for "capital" tells us to switch on input 6864, and so on.
6:30 Moving to Llama's output, the final column is maximized at an index of 12366,
6:36 which corresponds to the token for "Paris".
6:39 It's easy to forget that the symbols we assign to our model's
6:43 inputs and outputs have this extra meaning that we attach to them.
6:47 But to the model, they're just patterns of inputs and outputs.
6:52 Now, when the OpenAI team trained their model on modular arithmetic,
6:56 their initial results were pretty underwhelming.
6:59 The model was able to quickly learn to match the patterns in the training data,
7:03 giving the correct output on all training examples.
7:07 However, the model performed very poorly on the test set.
7:10 It appeared that the model had simply memorized
7:12 the training data without actually learning modular addition.
7:17 But then something interesting happened.
7:20 One of the researchers went on vacation but accidentally left a model training.
7:25 Returning from vacation, the researcher was shocked to discover that after
7:29 a very large number of training steps, the model had suddenly generalized,
7:34 performing perfectly on both training and test sets.
7:39 What mechanism could possibly be causing the model to perfectly
7:42 fit the training examples after just a couple hundred steps,
7:45 appear to lay dormant for a couple thousand steps,
7:49 and then suddenly actually learn?
7:51 And could similar dynamics happen in full-size models?
7:56 In his 1961 novel
7:57 Stranger in a Strange Land, Robert A. Heinlein coins the term grokking.
8:03 The book's main character, a human who was raised on Mars and returns to Earth,
8:08 uses the Martian word grok throughout the book.
8:11 Grok has no direct translation from the far more complex Martian language.
8:16 But one meaning is to understand something so thoroughly
8:19 that you merge with it and it merges with you.
8:23 The OpenAI team was able to replicate
8:25 the sudden generalization phenomenon across a range of arithmetic
8:29 operations and model configurations, and in January 2022
8:34 published this paper, where they called the phenomenon grokking.
8:38 Grokking is a provocative name, but the phenomenon itself is shocking.
8:44 What could be causing the model to suddenly perform perfectly on the test set?
8:49 A year after the publication of the OpenAI grokking paper,
8:53 a team led by researcher Neel Nanda
8:55 published an incredibly detailed analysis of the phenomenon.
8:59 Their paper digs deep into the model's parameters
9:01 and activations to produce a very satisfying and elegant explanation.
9:06 Nanda and his collaborators studied a single-layer transformer.
9:11 This is the same architecture used in most
9:13 large language models just with fewer layers.
9:17 A transformer layer is composed
9:18 of an attention block and a multi-layer perceptron compute block.
9:22 As we saw with our toy example earlier,
9:25 our data is fed into our model using one-hot vectors.
9:29 Nanda used a modulus of 113.
9:32 So the model's input vectors are of length 114, with 113 positions
9:37 for the numbers 0 through 112 and a final position for the equal sign.
9:42 So to ask our model what 1 + 2 is,
9:45 we pass in this 114x3 matrix made up of all zeros
9:50 except for a one in the one spot of our first column,
9:54 a one in the two spot of our second column,
9:56 and a one in the equal spot of our final column.
9:59 From here, our 114x3 matrix is multiplied by a matrix
10:03 of learned weights known as an embedding matrix,
10:06 producing three new vectors of length 128 each.
10:11 These resulting embedding vectors are no longer sparse
10:14 and as we'll see contain some interesting structure.
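One detail worth seeing in code is that multiplying a one-hot vector by the embedding matrix simply picks out one row of that matrix. The sketch below uses random numbers as a stand-in for the learned weights, so only the shapes (114 tokens, vectors of length 128) come from the video; the names and the random seed are assumptions.

```python
import random

VOCAB_SIZE, D_MODEL = 114, 128   # 113 numbers plus "=", and the embedding width
EQUALS = 113                     # assumed index of the equal-sign token

rng = random.Random(0)
# Stand-in for the learned embedding matrix: random values, not trained weights.
embedding = [[rng.gauss(0, 1) for _ in range(D_MODEL)]
             for _ in range(VOCAB_SIZE)]

def one_hot(index, size=VOCAB_SIZE):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(one_hot_vector):
    """Multiply a length-114 one-hot vector by the 114x128 embedding matrix."""
    return [sum(one_hot_vector[i] * embedding[i][j] for i in range(VOCAB_SIZE))
            for j in range(D_MODEL)]

# The problem 1 + 2 becomes three dense vectors of length 128.
vectors = [embed(one_hot(token)) for token in (1, 2, EQUALS)]
```

Because each input column has a single 1, the product for token 1 is exactly row 1 of the matrix, which is why the embedding can be read as a lookup table with one learned vector per token.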
10:17 From here, our embedding vectors are passed into our attention
10:20 block and then our multi-layer perceptron compute block.
10:24 The output of our multi-layer perceptron is of length 128.
10:28 We multiply this output by an unembedding matrix
10:31 to compute a final vector of length 114.
10:35 The model's answer is given by the largest value in this final vector.
10:39 So if our model is working well,
10:41 its maximum output value should occur in the three
10:44 position, corresponding to the correct answer 1 + 2 equals 3.
10:50 Training this model on modular addition,
10:52 we see the same grokking behavior observed by the OpenAI team, with the model first
10:57 memorizing the training data after around 140
11:00 steps and then generalizing after 7,000 training steps.
11:05 Let's explore the model's intermediate outputs, better known as activations.
11:10 Specifically, let's have a close look at the outputs of some
11:13 of the neurons in the second layer of our multi-layer perceptron block.
11:18 This layer has 512 total neurons.
11:21 If we pass in the problem 0 plus 0 into our network,
11:25 the first neuron of this layer returns an output value of 1.17.
11:30 Our second neuron returns an output of 0.6 and so on.
11:34 Now let's visualize how these values change as we change the input math problem.
11:40 Let's fix the value of x to 0 and explore
11:43 a range of y values, starting with 0 + 0,
11:47 then 0 + 1, then 0 + 2, and so on.
11:51 Sweeping through all 113 possible values for y,
11:54 we see some interesting structure with the outputs
11:58 of some of our neurons looking like sine waves.
12:02 Digging deeper, let's explore the correlation between
12:05 all the different pairs of these neurons.
12:08 Let's color our points using the input y value to our model.
12:11 So our neuron outputs given the input 0 + 0 are colored
12:14 purple, and outputs given the input 0 + 112 are colored yellow.
12:20 From here we'll create a 7x7 grid of scatter plots for each pair of neurons.
12:25 So on our second scatter plot on our first row for example we'll plot the output
12:30 of our first neuron as the y value
12:32 and the output of our second neuron as the x value.
12:35 Bringing our two waves together like this results in a nice loop shape.
12:39 Creating the same plots for each pair of neuron outputs,
12:42 we see more interesting structures.
12:45 So our model has clearly learned some type of structure.
12:48 But could this structure be related to grokking?
12:51 If we move backwards in our training
12:53 process and visualize these structures as we go,
12:57 we see that by the time we reach our model that just memorizes our training set,
13:01 these structures completely disappear.
13:04 So while this early model performs perfectly on the training set,
13:08 we don't see any evidence of the waves and loops that we see after grokking.
13:12 So perhaps these structures are related to why the model groks. This video is sponsored by me.
13:20 The Welch Labs team and I have written a whole new book on AI.
13:24 It's beautifully illustrated and is a great way to dig
13:27 deeper into the topics we cover in these videos.
13:30 Each chapter includes thought-provoking exercises and supporting code.
13:35 Our first print run is totally sold out,
13:37 but we have another batch coming quickly in January.
13:40 And if you order now,
13:41 I'll send you a discount code for a free download of the ebook version.
13:46 Books and education are really near and dear to my heart,
13:49 and we've poured a ton of effort into this book.
13:51 I really think you're going to like it.
13:54 Now, back to grokking modular arithmetic.
13:58 The wave shapes and loops we see inside
14:00 our model as it groks suggest that the model
14:03 is potentially computing and making use of the sines
14:06 and cosines of our inputs x and y.
14:09 If we take a discrete Fourier transform of our activation pattern,
14:13 we can compute the frequencies of the waves learned by our model.
14:17 This first wave yields a largest frequency component of 8 pi over 113.
14:22 And our third wave shows a largest frequency component of 6 pi over 113.
14:27 If we plot these waves on top of our model's outputs, we see nice alignment.
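Here is a minimal sketch of that frequency analysis. Since the trained model's activations aren't available here, it uses a synthetic curve (an offset cosine at 8 pi over 113) as a stand-in for a neuron's output; the function name and the curve's amplitude are assumptions.

```python
import cmath
import math

P = 113  # number of y values swept through

def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier component up to the Nyquist limit."""
    n = len(signal)
    return [abs(sum(value * cmath.exp(-2j * math.pi * k * t / n)
                    for t, value in enumerate(signal)))
            for k in range(n // 2 + 1)]

# Synthetic stand-in for one neuron's output as y sweeps from 0 to 112.
activations = [1.0 + 0.8 * math.cos(8 * math.pi * y / P) for y in range(P)]

mags = dft_magnitudes(activations)
# Skip k = 0, which only measures the average value of the curve.
k_peak = max(range(1, len(mags)), key=mags.__getitem__)
angular_frequency = 2 * math.pi * k_peak / P
```

For this stand-in curve the peak lands on the component that completes four cycles over the 113 inputs, which is the same thing as a frequency of 8 pi over 113.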
14:33 Let's look for these frequencies in other places in our model.
14:37 Let's visualize a single value in our first embedding vector.
14:42 Just as we did with the neurons in our multi-layer perceptron,
14:45 let's plot this value as we sweep through a range of input values.
14:50 Note that our first embedding vector only depends on our first input x.
14:54 So here we'll sweep from x= 0 to x= 112 while keeping y fixed at zero.
15:01 We don't see quite the same smooth plots that we saw earlier.
15:04 But if we compare our curve to a cosine wave with a frequency of 8 pi over 113,
15:09 we do see reasonably good alignment.
15:13 Part of the challenge here is that this early signal
15:16 in our network also appears to contain higher frequency information,
15:20 which makes sense given that we found
15:22 evidence of multiple frequencies later in our model.
15:25 We could analyze the frequency content of our full
15:27 embedding vectors at this stage of the model.
15:30 But for now, let's build what's known as a sparse linear probe.
15:35 If we sample the values at a few more positions of our embedding vector,
15:38 we see similar semi-periodic curves.
15:42 Now it turns out that if we take a weighted sum of these eight curves,
15:46 we end up with a curve that looks very close
15:48 to a cosine curve with a frequency of 8 pi over 113.
15:54 The weighted sum is very relevant here because taking weighted sums like
15:58 this is a big part of what our attention and multi-layer perceptron blocks do.
16:04 Meaning that these compute blocks have access to a very clean cosine wave.
16:09 The signal is just spread across a few different locations in our model.
16:12 At this stage, we can compute a similar sparse linear
16:16 probe for the sine of x times 8 pi over 113.
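A rough sketch of the idea behind a sparse linear probe: find the weights for a handful of curves so that their weighted sum best matches a clean cosine. Everything below is a hypothetical stand-in. It uses three made-up curves instead of the eight from the video, an invented mixing matrix, and a plain least-squares fit, so it illustrates the technique rather than reproducing the actual probe.

```python
import math

P = 113
# k = 4 gives 8*pi/113; the other two frequencies are made up for illustration.
FREQS = [2 * math.pi * k / P for k in (4, 3, 14)]

# Hypothetical embedding positions: each one mixes the three frequencies.
MIX = [[1.0, 0.5, 0.2],
       [0.3, 1.0, -0.4],
       [-0.6, 0.2, 1.0]]
curves = [[sum(m * math.cos(w * x) for m, w in zip(row, FREQS))
           for x in range(P)] for row in MIX]
target = [math.cos(FREQS[0] * x) for x in range(P)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Least-squares probe weights from the normal equations.
gram = [[dot(ci, cj) for cj in curves] for ci in curves]
weights = solve(gram, [dot(ci, target) for ci in curves])
probe = [dot(weights, [curve[x] for curve in curves]) for x in range(P)]
max_error = max(abs(p - t) for p, t in zip(probe, target))
```

None of the three input curves is a clean cosine on its own, but the weighted sum is, which mirrors the point that the signal is spread across several positions of the embedding vector.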
16:21 Now, our first embedding vector only depends on our first input x
16:25 and our second embedding vector only depends on our second input y.
16:29 These inputs are combined in our attention block.
16:32 Since the same embedding matrix is
16:34 used to process our three inputs independently,
16:37 we can use the same sparse linear probe on our second embedding vector.
16:41 And we'll see the same nice cosine and sine curves, but now as a function of y.
16:47 So very early in the model,
16:49 our model learns to compute the sines and cosines of our inputs.
16:53 But why?
16:54 What do these functions from trigonometry
16:56 have to do with learning modular addition?
17:00 The modular addition problem may seem a bit foreign or contrived,
17:03 but we actually do it all the time.
17:06 A 2-hour meeting that starts at 11 a.m.
17:09 will end at 11 + 2 modulo 12, which equals 1 p.m.
17:15 Analog clocks are implementing modular addition physically.
17:19 Each hour that ticks by adds one with the hour hand.
17:22 And the circular motion of the hands
17:24 perfectly matches the modular arithmetic problem,
17:28 starting over when reaching 12.
17:31 Now, as we saw when probing the neurons in our multi-layer perceptron,
17:35 our network learns to form circular patterns in its activations.
17:40 Could these circular structures be solving the modular arithmetic
17:43 problem in the same way that an analog clock does?
17:48 The sines and cosines we see computed by our model
17:50 in its first layer could be part of this puzzle.
17:54 If we put the output of our sparse cosine probe on the x-axis
17:58 and the output of our sparse sine probe on the y-axis of a scatter plot,
18:02 we get a nice circle when we sweep through our input values.
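The circle can be checked with a few lines, assuming idealized probe outputs that are exactly the cosine and sine of x times 8 pi over 113 (the real probes are only approximately this clean).

```python
import math

P = 113
W = 8 * math.pi / P   # the frequency found earlier

# Idealized probe outputs: one (cosine, sine) point per input value x.
points = [(math.cos(W * x), math.sin(W * x)) for x in range(P)]

# Distance of each point from the origin; a circle means these all match.
radii = [math.hypot(cos_value, sin_value) for cos_value, sin_value in points]

# Rounded copies, used only to count how many distinct positions appear.
distinct = {(round(c, 9), round(s, 9)) for c, s in points}
```

Every point sits at radius 1, and because 113 is prime each of the 113 inputs gets its own position on the circle, much like hour marks on a clock face.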
18:08 However, it's not enough to learn
18:09 a circular structure for x and y independently.
18:13 Our network has to figure out how to actually add x and y together.
18:17 Adding x and y may seem trivial for our model to learn.
18:21 After all, neural networks are literally built
18:23 from a bunch of adds and multiplies.
18:26 But remember that we aren't actually passing in, for example,
18:30 the number two or a direct representation of it.
18:33 Instead, we're switching on the input to our model that we have labeled two.
18:39 The network cannot just use one of the additions
18:41 in one of its neurons to add X and Y together.
18:45 What happens instead turns out to be way more interesting.
18:50 It is straightforward for our attention layer to add together
18:53 the various sines and cosines computed by our first layer.
18:57 Our attention layer could easily compute cosine of x plus cosine of y.
19:02 However, that's still not what we need to solve the problem.
19:06 We need to add together x and y themselves. In our clock analogy,
19:10 we need to add the angles of the clock hands,
19:13 not the sines and cosines of these angles.
19:17 Let's return to the second layer
19:19 of neurons in our multi-layer perceptron compute block.
19:23 Earlier, we explored how these neuron outputs
19:25 changed as we varied a single input.
19:29 Let's now explore how these outputs change as we vary both X and Y
19:33 to see if we can figure out
19:34 how our network is bringing these variables together.
19:38 Again, visualizing the output of a single neuron.
19:42 If we keep y fixed at zero and sweep through all possible x values,
19:46 we get a familiar wave shape.
19:49 Now let's add another axis to our visualization
19:52 and plot our neuron's output now as we vary y.
19:57 Let's explore all combinations of values for x and y.
20:01 With this many points,
20:02 it's easier to visualize our neuron's outputs as the height of a surface
20:06 where the color of the surface corresponds to our neuron's output value.
20:11 Like many of the outputs we've seen so far,
20:13 our surface is approximately wavelike.
20:17 What combinations of sines and cosines best capture
20:20 this wave structure that our network has learned?
20:23 As we did earlier, we can take a Fourier transform,
20:27 but this time with respect to both x and y.
20:30 Extracting our top frequencies,
20:32 we can decompose our surface into a few key components.
20:37 This component is the cosine of x and this component is the cosine of y.
20:43 This top component is the strongest and the most interesting.
20:47 It's equal to the cosine of x times the cosine of y.
20:51 So the strongest frequency component of our surface
20:54 is equal to the product of the cosine
20:56 of x and cosine of y functions that we saw computed earlier in our network.
21:01 Now, it turns out that it's more natural for our network
21:04 to take a sum of sines and cosines than a product.
21:07 I'll put a note about this in the description.
21:10 So, why are we finding a strong product like this in the middle of our network?
21:15 And does this get us any closer to actually computing the sum of X and Y?
21:20 Remarkably, it does.
21:22 Let me show you one more thing.
21:24 Let's go one layer of neurons deeper into our multi-layer perceptron and plot
21:29 the outputs of a neuron in this layer as a function of X and Y.
21:34 We see similar wavelike shapes here,
21:36 but the wave is less regular and it moves diagonally across our surface.
21:42 This orientation of the wave is really important.
21:46 Consider these top two crests where the output of our neuron is maximized.
21:52 Let's move to an overhead view and look at the combinations
21:55 of our input values that fall on these wave crests.
21:59 The first crest starts at x= 0 and y= 65.
22:03 Moving along our crest, we find intermediate values at x= 20 and y= 45,
22:10 x= 40 and y= 25, x= 60 and y= 5, and finally x= 65 and y= 0.
22:20 All of these pairs of inputs add to the same value of 65.
22:25 So this neuron fires maximally when x+ y equals 65.
22:30 In its own specialized way,
22:32 this neuron has learned to add or more precisely this neuron
22:36 fires for any pair of inputs that add to 65.
22:41 Our second wave crest starts at x= 66, y= 112.
22:47 From there it moves through values like x= 91
22:50 and y= 87 and ends on x= 112 and y= 66.
22:56 Adding these pairs together we get 178 in each case.
23:01 Recall that our model is trained on modular addition with a modulus of 113.
23:07 Our result of 178 modulo 113 is 65.
23:12 So this second crest also finds pairs of inputs that add to 65.
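The arithmetic behind the two crests can be checked directly. In the sketch below, `idealized_neuron` is a hypothetical stand-in for the neuron (a cosine of the distance from 65), not the trained neuron itself.

```python
import math

P = 113  # modulus used in the video

def pairs_summing_to(target, modulus=P):
    """All (x, y) input pairs with (x + y) % modulus == target."""
    return [(x, (target - x) % modulus) for x in range(modulus)]

def idealized_neuron(x, y, target=65, k=4):
    """Hypothetical neuron: peaks whenever x + y equals target modulo P."""
    return math.cos(2 * math.pi * k * (x + y - target) / P)

crest = pairs_summing_to(65)
# Before taking the modulus, the pairs add to just two values: 65 and 65 + 113.
raw_sums = sorted({x + y for x, y in crest})
```

The pairs read off the overhead view, such as (20, 45) on the first crest and (91, 87) on the second, all appear in `crest`, and the idealized neuron outputs its maximum value of 1 on every one of them.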
23:18 But how, in just one layer of neurons, do we go from products like the cosine
23:23 of x times the cosine of y to actually adding together x and y themselves?
23:30 Here's the output of another neuron
23:31 in the second layer of our multi-layer perceptron.
23:35 The strongest frequency component here is sine of x times sine of y.
23:40 Now each neuron in our following layer takes a weighted
23:43 sum of the outputs of the neurons in our current layer.
23:48 Let's consider how this weighted sum causes our surfaces to interact.
23:52 We saw earlier that our first neuron's output has a strongest
23:55 frequency component of cosine of x times cosine of y,
24:00 and our new second-layer neuron has a strongest frequency
24:02 component of sine of x times sine of y.
24:06 Let's assume for a moment that the weight
24:08 assigned to our cosine x times cosine y neuron
24:11 is 1 and the weight assigned to our sine x times sine y neuron is negative 1.
24:17 Visually, this negative weight flips our second surface vertically.
24:22 Now, when we add these weighted surfaces together,
24:25 the sines and cosines remarkably interfere in just the right
24:29 way to create the diagonal symmetry that we see
24:32 in our neuron in the following layer that allowed our neuron
24:34 to fire on combinations of inputs that add to 65.
24:40 As you may remember from trigonometry class,
24:42 the cosine of x times the cosine of y minus the sine
24:46 of x times the sine of y appears in a trigonometric identity,
24:50 specifically the sum of angles identity, and exactly equals the cosine of x + y.
24:57 This identity allows us to convert the sum of products
25:00 of sines and cosines into a sum of x and y,
25:04 which is exactly what our network needs to compute.
25:07 And remarkably, the network appears to have learned to effectively
25:11 use this trigonometric identity to solve the modular addition problem.
25:16 And remember that our training data is just
25:18 these sparse patterns that have nothing to do with sines,
25:21 cosines, or trigonometric identities.
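The identity is easy to verify numerically across the full 113x113 grid of inputs. This sketch assumes idealized surfaces (exact products at the 8 pi over 113 frequency) and the weights of 1 and negative 1 from the example above; the real neurons are noisier.

```python
import math

P = 113
W = 8 * math.pi / P

def cos_cos(x, y):
    """Idealized surface of the cosine-times-cosine neuron."""
    return math.cos(W * x) * math.cos(W * y)

def sin_sin(x, y):
    """Idealized surface of the sine-times-sine neuron."""
    return math.sin(W * x) * math.sin(W * y)

def next_layer(x, y):
    """Weighted sum with weights +1 and -1, as in the example above."""
    return 1.0 * cos_cos(x, y) + (-1.0) * sin_sin(x, y)

# Compare against a cosine that only sees the modular sum of x and y.
max_gap = max(abs(next_layer(x, y) - math.cos(W * ((x + y) % P)))
              for x in range(P) for y in range(P))
```

The gap is at the level of floating-point rounding, so the weighted sum depends on x and y only through (x + y) modulo 113, which is exactly the diagonal symmetry we saw on the surface.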
25:26 The final unembed portion of our model takes one more weighted sum,
25:30 this time of the outputs of the final
25:32 layer neurons in our multi-layer perceptron.
25:35 Visualizing the outputs of a few more of these neurons,
25:38 we see the same types of diagonal symmetries with various shifts and scales.
25:44 Our unembedding layer takes different combinations of these outputs
25:47 for each possible token that the network could return.
25:51 Here's the resulting surface for the seven output.
25:55 As we saw with our multi-layer perceptron neuron
25:57 that detected all combinations of numbers that added to 65,
26:02 this surface reaches a maximum for all the combinations
26:05 of X and Y that add to 7.
26:08 Here's 7 plus 0.
26:09 Here's 0 plus 7.
26:10 And here's 3+ 4.
26:14 So, remarkably, to solve this modular arithmetic
26:17 problem, our network learns to numerically estimate
26:20 the sines and cosines of our inputs, computes
26:24 the products of these functions, and then uses
26:26 a clever trig identity to create the diagonal
26:29 symmetry needed to solve the modular addition
26:31 problem, and then brings multiple versions of these resulting
26:35 patterns together to compute a final answer.
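Putting those steps together, here is a hand-written sketch of the algorithm as described: sines and cosines of the inputs, products, the sum-of-angles identity, and a final weighted sum per output token. It is not the trained network. The two frequencies are the ones mentioned earlier (6 pi and 8 pi over 113), and using exactly these two with equal weight is an assumption.

```python
import math

P = 113
FREQS = [2 * math.pi * k / P for k in (3, 4)]   # 6*pi/113 and 8*pi/113

def logits(x, y):
    """Score every candidate answer c using only trig functions of x and y."""
    scores = [0.0] * P
    for w in FREQS:
        cx, sx = math.cos(w * x), math.sin(w * x)
        cy, sy = math.cos(w * y), math.sin(w * y)
        cos_sum = cx * cy - sx * sy   # equals cos(w * (x + y))
        sin_sum = sx * cy + cx * sy   # equals sin(w * (x + y))
        for c in range(P):
            # Equals cos(w * (x + y - c)), which peaks when c matches the sum.
            scores[c] += cos_sum * math.cos(w * c) + sin_sum * math.sin(w * c)
    return scores

def predict(x, y):
    """Return the answer whose score is largest."""
    scores = logits(x, y)
    return max(range(P), key=scores.__getitem__)
```

For example, `predict(3, 4)`, `predict(7, 0)`, and `predict(0, 7)` all return 7, and `predict(66, 112)` returns 65, without the code ever adding x and y as numbers.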
26:39 Now, can this detailed understanding of how the model
26:41 solves modular addition help us understand why it groks?
26:46 Let's watch the training process again,
26:48 but this time while visualizing the evolution
26:50 of the various structures learned by our model.
26:54 After a few hundred steps, our model perfectly fits the training data.
26:59 But we don't yet see any hints of sines or cosines.
27:02 As our model continues to learn, its performance stays flat,
27:06 giving the appearance that nothing is happening.
27:10 However, as we can now clearly see under the hood,
27:13 the model is starting to piece together the relevant
27:16 structures needed to solve the modular arithmetic problem.
27:20 This is such a wild phenomenon.
27:23 It's very common to visualize training and test performance as a model learns.
27:28 And when both metrics are flat for this long,
27:31 the typical assumption is that the model is
27:33 done learning and has settled into a stable solution.
27:38 Neel Nanda and his co-authors propose a clever
27:40 new metric in their paper called excluded loss.
27:43 Note that thus far we've been plotting the model's accuracy as it learns,
27:47 and here we'll switch to plotting the model's cross-entropy loss,
27:51 so lower values are better.
27:53 See my gradient descent video or chapter 2
27:55 of my new AI book for more on cross-entropy loss.
27:59 Now that we know that our model is operating
28:01 in the frequency domain at a few key frequencies,
28:05 what happens when we remove the information at these frequencies
28:08 from the model's final output before measuring performance?
28:13 Removing the 8 pi over 113 frequency that we
28:16 found and plotting this excluded loss as the model learns,
28:20 we see our new metric dip down quickly with training loss,
28:24 but then slowly climb as our model builds the sine and cosine representations.
28:29 This excluded loss increases because we've taken away
28:33 the model's ability to use this key frequency.
28:36 And importantly, during this long period
28:38 of flat training and testing performance, our excluded loss slowly climbs,
28:44 showing that our model is making more
28:45 and more use of patterns at this frequency.
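A simplified sketch of the excluded-loss idea, using idealized logits built from two frequencies rather than a real model's outputs. The logit scale, the choice of frequencies, and the shortcut of indexing inputs by their modular sum are all assumptions, and the paper's exact procedure may differ in its details.

```python
import math

P = 113
KEYS = (3, 4)    # frequencies 6*pi/113 and 8*pi/113
SCALE = 5.0      # hypothetical logit scale

def omega(k):
    return 2 * math.pi * k / P

# logits[s][c]: score for answer c on inputs whose sum is s modulo P.
logits = [[SCALE * sum(math.cos(omega(k) * (s - c)) for k in KEYS)
           for c in range(P)] for s in range(P)]

def exclude_frequency(table, k):
    """Project the cosine and sine patterns at frequency k out of each column."""
    directions = [[math.cos(omega(k) * s) for s in range(P)],
                  [math.sin(omega(k) * s) for s in range(P)]]
    cleaned = [row[:] for row in table]
    for c in range(P):
        column = [table[s][c] for s in range(P)]
        for direction in directions:
            norm = sum(d * d for d in direction)
            coeff = sum(v * d for v, d in zip(column, direction)) / norm
            for s in range(P):
                cleaned[s][c] -= coeff * direction[s]
    return cleaned

def cross_entropy(table):
    """Average cross-entropy loss, treating c == s as the correct answer."""
    total = 0.0
    for s, row in enumerate(table):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[s]
    return total / P

full_loss = cross_entropy(logits)
excluded_loss = cross_entropy(exclude_frequency(logits, 4))
```

In this toy setup the excluded loss is higher than the full loss because the logits genuinely rely on the removed frequency. A model that had only memorized its training data would lose little from the same removal, which is what makes the metric a useful signal during the flat part of training.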
28:49 Interestingly, Nanda and his collaborators show that grokking occurs
28:52 not necessarily when the sine and cosine structures are completed,
28:56 but just after, during a phase they call the cleanup phase,
29:01 where the model actually removes the memorized
29:03 examples that it relied on early in training.
29:07 These dynamics are fascinating and explain very
29:10 nicely why this model groks on this problem.
29:14 It's so satisfying to me that we can take apart this model,
29:18 understand the actual mechanisms that it learns,
29:20 and then use these mechanisms to design
29:22 a new metric that clearly shows the model's
29:25 slow progression from memorization to learning
29:28 and that nicely explains the surprising grokking behavior.
29:33 This level of clarity is a beautiful and rare exception in modern AI,
29:38 a transparent box in a world of black boxes.
29:42 The approach Nanda and his collaborators use to perform
29:45 this analysis is generally known as mechanistic interpretability.
29:49 Since Nanda's paper came out in early 2023,
29:53 we've seen some really interesting progress in this field,
29:56 but we are still very far away from anywhere near
29:58 this level of understanding of full large language models.
30:02 There's some recent work from a research
30:04 team at Anthropic that gives a nice feel
30:06 for the current edge of our understanding
30:08 using this type of bottom-up mechanistic interpretability approach.
30:13 The team studies how a full-sized model, Claude 3.5
30:16 Haiku, figures out when to create line breaks when writing.
30:21 The team finds that the Haiku model represents the number of characters
30:24 that it's written on a given line on a manifold in six-dimensional space.
30:30 This structure is somewhat analogous to the loops
30:32 that we saw in the multi-layer perceptron of our model.
30:36 To figure out when to insert a line break,
30:38 Haiku needs to know both how many characters
30:40 it's written on the current line and how
30:43 many characters long the lines of the text it's currently writing need to be.
30:47 Using linear probes similar to the ones we used here
30:50 to find the sines and cosines early in our model,
30:54 the Anthropic team mapped character count and line
30:57 length to this six-dimensional manifold and found
31:00 that Haiku represents these concepts in this space in a very similar way.
31:06 This 70 character count probe lines up right next
31:08 to this line length of 70 probe and so on.
31:12 Now, this gets really wild when
31:14 these representations are passed into Haiku's attention blocks.
31:19 We see what the team calls a QK twist,
31:23 where these helix-like geometries are rotated relative
31:26 to each other in this six-dimensional space.
31:29 After rotation, the probe for a character count of 70
31:32 is now closest to a line width of 75.
31:36 And we see a similar offset of four
31:38 to five characters across the length of our curve.
31:42 The proximity of these points in the model's attention heads leads to a high dot
31:46 product when the model is about five characters away from the end of a line.
31:52 The team goes on to show that there are multiple attention heads that specialize
31:55 in detecting various distances from the end of the current line of text.
32:00 And this mechanism allows Haiku to precisely estimate how much
32:04 more room it has before the end of the line.
32:07 Now, compared to Claude Haiku's full range of capabilities,
32:11 deciding when to create a new line is very simple.
32:14 However, it is exciting to see that the Anthropic team found
32:17 such a clean mechanism that controls this behavior in a full-size model.
32:24 The story of grokking is such a nice arc of scientific discovery and progress.
32:30 We accidentally discovered a new phenomenon
32:33 and the search for an explanation genuinely helped
32:36 push forward our understanding of model training
32:38 dynamics and the inner workings of transformers.
32:43 The names we give our discoveries matter, and I like the name grokking.
32:48 It feels alien and originates
32:50 from the complex Martian language in Heinlein's novel.
32:54 The AI researcher Andrej Karpathy recently commented that training large language
32:59 models is less like building animal intelligence and more like summoning ghosts.
33:05 You can think of a ghost as a fundamentally
33:07 different kind of point in the space of possible intelligences.
33:11 The literal meaning of grok, to understand something profoundly and deeply,
33:16 is a nice fit for what the model appears to be doing.
33:20 But what I really appreciate here is the connotation of this thing being alien.
33:25 I think it's a really nice counterpoint to overly personifying models.
33:31 We communicate with these models in human language.
33:34 But as we've seen, this is a thin veneer.
33:36 If we go one layer deeper into what these models actually process and return,
33:41 we find these absurdly complex patterns.
33:45 As we build more intelligent models and learn more about how they work,
33:49 it will be fascinating to see if
33:51 these artificial intelligences feel more alien, ghost, or human.
34:04 I am tired.
34:05 So, this has been my first full year working fully on Welch Labs.
34:10 Um, we made some progress.
34:12 So, we did nine videos this year and we did one book.
34:15 Um, and man, getting that done filled like
34:17 every available second of time that I had.
34:21 Um, for now on the business, I'm trying to keep things simple.
34:25 Um, so really just focusing on making sure that the business
34:29 and the channel work well enough to support my family and I.
34:32 Um, I left my full-time job last year.
34:35 Um, my goal is to earn as much from Welch Labs as I did from my engineering job.
34:40 I was hoping to replace my whole income this year.
34:42 It's probably going to be like 75%.
34:45 Um, the book helped a lot, but there's always challenges.
34:47 The business side is hard.
34:49 Um, I've tried to do this full-time before, once in 2018.
34:53 Um, I just didn't have enough runway and enough focus on the business.
34:55 So, I think we're doing it right this time, but gosh,
34:58 it takes time and man, it takes a lot of work.
35:00 So, I hope you enjoyed what we've done this year.
35:02 Um, a lot more of it next year.
35:04 Uh, kind of working on the focus and direction for next year right now.
35:07 Um, but I'm really happy with the book.
35:10 I hope you're able to get a copy.
35:11 I know we're not shipping internationally yet.
35:13 That will be a focus early next year.
35:15 I promise.
35:16 Um but yeah, what a year, man.
35:18 Thank you so much for your support.
35:19 If you are able to support on Patreon, that helps a ton.
35:21 Or just liking and sharing the videos.
35:23 Um thanks for a great year.
35:25 I'll see you next year.