Machine learning using neural networks has led to a remarkable leap forward in artificial intelligence, and the technological and social ramifications have been discussed at great length. To understand the origin and nature of this progress, it is useful to dig at least a little bit into the mathematical and algorithmic structures underlying these techniques. Anil Ananthaswamy takes up this challenge in his book Why Machines Learn: The Elegant Math Behind Modern AI. In this conversation we give a brief overview of some of the basic ideas, including the curse of dimensionality, backpropagation, transformer architectures, and more.
Support Mindscape on Patreon.
Anil Ananthaswamy received a master's degree in electrical engineering from the University of Washington, Seattle. He is currently a freelance science writer and feature editor for PNAS Front Matter. He was formerly the deputy news editor for New Scientist, a Knight Science Journalism Fellow at MIT, and journalist-in-residence at the Simons Institute for the Theory of Computing, University of California, Berkeley. He organizes an annual science journalism workshop at the National Centre for Biological Sciences in Bengaluru, India.
0:00:00.2 Sean Carroll: Hello everyone. Welcome to the Mindscape Podcast. I'm your host, Sean Carroll. We all know that artificial intelligence in various forms has been exploding over the last couple years. After decades of effort and various summers and winters in AI research, we clearly have crossed some threshold where AI is being put into use in all sorts of different places. Now we can debate the words artificial intelligence, right? Is it really intelligence? That's the large language models, which are a particular approach to AI, which have really gotten all the attention lately. They're based on a broader idea called neural networks or deep learning, which has been all over the place for a long time. Something like Google Maps uses this kind of technology. But now with the more human-like behavior of AI in the form of large language models, they've become much more ubiquitous. And there's been a wild range of reactions to what is going on. Some people saying that maybe they'll become super intelligent and take over the world, and that's a danger. Other people just complaining that they can't download new software or an app without it being infused with AI that they don't really want.
0:01:15.6 SC: So I'm not myself very sure what the long term impact of AI is going to be, at least in this sort of large language model incarnation. A couple of years ago when it first became a big thing, I said that it's probably somewhere between the impact of cell phones and electricity. And still that's a lot of impact one way or the other, right? Cell phones have had a lot of impact on our lives, but not really completely changing the way we live. That's a minimal expectation for the impact that AI will have for better or for worse. Whereas the larger thing of the impact level of electricity is maybe the upper level of where it could possibly reach. None of us knows. That's my personal guess. Various people have very strong opinions one way or the other. Many of them are more educated than mine. But we all should be a little bit educated about what this technology is, that is affecting us so much. So that's what we're here to do today. Today's guest is Anil Ananthaswamy, who is a science writer, actually former editor at New Scientist, and now mostly a freelance science writer.
0:02:23.3 SC: But he got his start in engineering and computer engineering in particular. And when large language models came along, he became sort of re-fascinated by this aspect of technology and dived into it. So in addition to being a science writer, he's actually thinking at quite an advanced level about AI and what it is, and the book that has resulted from this thinking is called Why Machines Learn: The Elegant Math Behind Modern AI. It actually came out last year. This is my fault for waiting so long to talk about it. Anil is a friend of mine that I've known for a long time. But it's a book chock full of big ideas and mathematics. It's exactly in the spirit of my own Biggest Ideas books, et cetera. It really doesn't just say, well, AI is gonna take over the world. It says, here is why you need to understand how to diagonalize a matrix to understand what AI is really telling us. Now, there's a lot more in the book than we could possibly cover in a one hour podcast, so we hit some highlights. But I think that for me, this conversation was extraordinarily helpful in clarifying which advance in AI technology came first, what the importance of it was, what it led to later.
0:03:38.8 SC: The actual math that is used is, on the one hand, fascinating. On the other hand, as we say in the podcast, it's mostly classical in a sense. It's not like you're developing new math in order to be able to do this. It's not like you're using the most advanced reaches of modern category theory or topology to figure things out. You're applying math in very large dimensional spaces that only computers can really handle. And that's enough to do something that is very different than anything that's been done before. So trying to understand it the best we can is a worthwhile endeavor. Let's go. Anil Ananthaswamy, welcome to the Mindscape podcast.
0:04:32.8 Anil Ananthaswamy: Thank you, Sean. It's my pleasure.
0:04:35.0 SC: So you've written a book about AI that doesn't single you out. There's lots of people who've written books about AI. You've decided to write a book about the mathematics behind AI, which is an interesting choice. What is it that led you to that?
0:04:49.0 AA: Oh, it began well before the AI craziness came about. Sometime in 2016, 2017, I started noticing a whole bunch of stories that I was beginning to do as a journalist that had a machine learning component.
0:05:08.1 SC: Right.
0:05:08.5 AA: And when I became a journalist, this was when I transitioned from being an engineer to being a journalist. I was writing mostly about physics and neuroscience, and when I would write about those subjects, like particle physics, I was happy just doing as much research as I could and understanding it to the best of my ability and then writing about it. I never had any illusions about being able to do particle physics. Or do neuroscience. But when I started encountering stories in machine learning, when I would talk to the researchers explaining their algorithms, their machine learning models, I suddenly felt, hang on, this is something I could do.
0:05:52.1 SC: I could do this.
0:05:54.0 AA: Well, not in the way they were doing it, but I could certainly get my hands dirty because of the software background, because of my engineering background. And so what happened was I got a fellowship at MIT, the Knight Science Journalism Fellowship. And as part of that fellowship, we had to do projects. And the project I took on was essentially teaching myself deep learning. So the project question was, could a deep learning or a deep neural network do what Kepler did? So it was trying to build a neural network that would try and predict the future positions of planets, given the kind of data Kepler had access to.
0:06:34.3 SC: Oh, okay.
0:06:35.4 AA: And the short answer, very quickly, I found out, absolutely not. There's no way a neural network would do what Kepler did. Kepler had access to literally a few tens of positions of the orbits of Mars and Jupiter. Very little data that Tycho Brahe had collected. And so then I ended up writing a simulation to generate loads and loads of data and learned how to train very simple deep neural networks to make predictions about planetary positions given years and years of data. And that was still just empirical stuff. I had gone back to CS101, taught myself coding in Python. It had been maybe almost 20 years or so since I'd done any coding, so I had to sit in a class with teenagers teaching myself coding. That was fun. And at some point, what happened towards the end of that fellowship was Covid happened, and we all got kind of locked up in our apartments, and my interest shifted to wanting to understand more about machine learning. It wasn't enough to just sit and do some coding. I felt I needed to get under the skin of this thing.
0:07:55.5 AA: So I just started watching lectures from Cornell, from MIT. There was one professor at Cornell, Kilian Weinberger, whom I discovered. It was a 2018 class that he gave, which is online even today. And it's just him giving his talks to his students. It is not produced for YouTube. It's amazing. There's nothing slick about it. It's just a professor and the students, fantastic stuff. I got sucked into that and kept learning more and more of the math. And at some point the journalist and the storyteller in me woke up again saying, hang on, this math is actually quite lovely, and there are stories here to be told. Because by then I was so steeped in the math. Again, I don't wanna make it sound as if I did a lot of math. I was steeped in the math relative to the kind of math I was steeped in before; it was not very much. For actual machine learning practitioners, this is pretty simple math. But for me it was a fair bit. And it was that desire to communicate the beauty of the math that I was encountering and to tell the stories about it.
0:09:11.0 AA: So some combination of getting really stuck into the math. So I remember when I made my book proposal to my editor, and you and I share an editor, full disclosure. We share. And I was pretty sure that he was gonna say no, given how much math I was proposing to put into the book. But I made it very clear that I was going to put the math in there.
0:09:40.1 SC: I softened him up for you.
0:10:08.5 AA: Yes, exactly. I realized that later. Absolutely. So, yeah, that's how it came about. It wasn't a book that was written with a desire to ride the AI wave, because this was proposed in 2020, well before all of the craziness came about. And I just wanted to share in the beauty of the math that I was encountering.
0:10:08.5 SC: And I need to dig into that Kepler story a little bit because it's secretly profound. Would you still think that it's true that a machine learning or some sort of deep learning algorithm, given the data that Kepler had, would not be able to come up with Kepler's laws somehow? It seems like that must depend on the space of possible theories that the LLM or whatever it is, has access to. But I'm not quite sure what's going on there.
0:10:41.6 AA: Yeah. So. Well, first of all, it's unlikely that we would use an LLM, a large language model, to solve this problem, because the data in this case, the data that I was using, is just orbital positions of planets.
0:10:55.2 SC: Yeah.
0:10:56.2 AA: And I was teaching a neural network to learn about the patterns that exist in this data. And then it was like a time series. Then it would just try to predict into the future where those planets were. And, if you have enough data, you can train deep neural networks to learn the time series and make predictions going into the future. But what they don't do is they will not give you a symbolic form of the equation. The equation might be there in the system, but it has no way of spitting out some symbolic form of Kepler's laws. Right? But the network is embodying in its weights something that is very similar to what Kepler would have figured out. The question is, how do we extract that out of that network? And so that's one problem. The second problem is what you were alluding to is that the amount of data that Kepler had access to, there's no way.
0:11:52.5 AA: Today's neural networks are extremely sample inefficient, so they require too much data to do what they need to do. And so we're certainly missing something in our AI models in terms of being able to learn the way humans do. It's also true that Kepler came with a whole bunch of prior knowledge. He was a smart fellow, so he was obviously coming with a whole bunch of inbuilt knowledge about geometry, about calculus and all of these things. And so we have to take that into account also. So maybe large language models which have been trained on a whole bunch of human text have that kind of prior knowledge built in. And it will be interesting to see how one would solve this problem using a large language model. I wasn't using large language models. This was well before LLMs came on the scene. And I was just simply training these things called LSTMs, which are recurrent neural networks which are good for time series.
0:12:51.6 SC: I went to a seminar recently that was supposed to be an intro to the power of AI for physicists and cosmologists in particular. And the speaker started the seminar, he had a data set from LIGO from a particular black hole inspiral, and he had cooked up ahead of time a one page long prompt. And he fed the data set and the prompt to, it was a large language model, certainly a deep learning thing. And basically the LLM wrote the paper. So it wrote a bunch of Python scripts to analyze the data, it made the figures, it wrote the paper, it embedded the figures, it found the references and everything, and it was finished by the time the seminar had finished an hour later. And I'm open, of course you'd wanna check that it didn't make mistakes, right? You would definitely wanna check it very carefully. But I'm open to the possibility that that kind of science is doable by deep learning methods, whereas Kepler's kind of science, where you're literally coming up with a new conceptualization of what's going on, seems to be much harder.
0:14:11.0 AA: Yes, and I would very much agree with that. There was someone I was listening to just recently about this exact issue, and he was pointing out that if we had an LLM that had data about physics only up until 1915, could it then come up with Einstein's theory of relativity without having anything in the data about relativity? Very unlikely. I wouldn't even say unlikely. I would say impossible at this point.
0:14:45.4 SC: Right. It's just a different kind of thing.
0:14:47.5 AA: Yeah.
0:14:48.2 SC: Well, okay, good. So we're here to learn about what it is actually able to do and why it's able to do it. Should we just start with going back to the beginning and the Perceptron and the first neural networks?
0:15:00.5 AA: Yeah, absolutely. I mean, the Perceptron, the first neural network, started in the late 1950s. So the perceptron was designed by Frank Rosenblatt. He was a Cornell University psychologist. And the Perceptron is essentially a single layer artificial neural network. Right? An artificial neural network is simply a network of artificial neurons. An artificial neuron is very simply a computational unit. It takes in a bunch of inputs and does some sort of weighted sum of those inputs, adds a bias term, and then if that weighted sum plus bias exceeds some threshold, it will produce a one, otherwise it produces a minus one. That was, in essence, Rosenblatt's artificial neuron. And he showed how you could use a series of such artificial neurons sort of lined up vertically, so one layer of them, to do some sort of linear classification. So if you had, for instance, images of one kind of digit, let's say the digit 9, and images of another kind of digit, let's say something that looks very different, the digit 4. And if these were, let's say, 20 pixels by 20 pixels, black and white images, then each image can effectively be turned into a list of 400 pixel values.
0:16:31.5 AA: And if you were to map each pixel along one axis, then in 400 dimensional space, each of these images becomes a point. And so all images of the digit 9 will be in one location of this 400 dimensional space, and the digit fours will be somewhere else. And as long as those things are pretty distinct and you can draw some sort of hyperplane separating out those two regions, the Perceptron algorithm was guaranteed to find one such plane. As long as the data was linearly separable in any kind of dimensional space, the Perceptron would find it.
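A minimal sketch of a perceptron-style learning rule, run on a hypothetical two-dimensional toy dataset instead of 400-dimensional pixel vectors; the data, labels, and learning loop here are illustrative assumptions, not code from the book:

```python
import numpy as np

# Toy 2D dataset: two linearly separable clusters labeled +1 and -1
# (a stand-in for 400-dimensional pixel vectors of digits).
X = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0],         # class +1
              [-1.0, -2.0], [-2.0, -1.5], [-1.5, -3.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(2)   # weights
b = 0.0           # bias term

# Perceptron rule: whenever a point lands on the wrong side of the
# hyperplane, nudge the hyperplane toward that point.
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified
            w += yi * xi
            b += yi
            mistakes += 1
    if mistakes == 0:   # convergence: every training point is on the right side
        break

print("weights:", w, "bias:", b)
# A new point is classified purely by which side of the hyperplane it falls on.
print("prediction for [2.2, 3.1]:", int(np.sign(np.dot(w, [2.2, 3.1]) + b)))
```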
0:17:14.6 SC: Okay.
0:17:15.1 AA: And this was a big deal. This was a very big deal. In fact, when we were talking earlier of the math that inspired me to write the book, it was actually the perceptron convergence proof, which came a few years after he came up with the algorithm. So he first comes up with the algorithm empirically, figuring out how to do this. And then people get into the act of trying to figure out mathematically the properties of this algorithm. And there was something called the perceptron convergence proof, which basically said that if the data is linearly separable, then the algorithm will find it in finite time. And this was a huge statement to make in computer science terms back in the 1950s, that an algorithm is guaranteed to work, right? It put a bound on it, saying this will definitely work. And it's a very simple proof that uses just basic linear algebra and nothing more. And if you look at the proof, it's so lovely that I did put it in the coda of one of my chapters, and at the beginning of that coda, I tell the reader that you really don't have to read this to read the rest of the book.
0:18:28.8 AA: But I should also tell you that if it weren't for this proof, I would not have written the book. It's a way of teasing them into reading that book.
0:18:37.9 SC: Good.
0:18:40.6 AA: Sorry, I just wanted to say that this trick is actually due to a British novelist called Somerset Maugham. He had a novel called The Razor's Edge, and it's a very interesting book. And there's a chapter somewhere in the middle where he addresses the reader directly, saying, dear reader, you don't have to read this chapter because it won't change the rest of the book. But I should tell you that if it weren't for this chapter, I wouldn't have written the book.
0:19:11.8 SC: You steal from the best. That's what we all do, we as writers. That's okay. I wanna dwell on this. Not the proof, but the linear separability idea, 'cause it is kind of deep but also hard to visualize. You're saying that I have a 20 by 20 grid of pixels and I can think of that as a single point in a 400 dimensional space. Right? Which, you know, once you've done a certain amount of math in your life, is obvious; before you've done that amount of math, it's almost impossible to quite wrap your head around. But then the point is that all of the nines kind of cluster in a group, hopefully, in that 400 dimensional space, and all of the fours cluster somewhere else. And so if a new data point comes in, that might not be any of the existing data points. But you can say it's closer to one cluster than the other one. Right?
0:20:09.1 AA: Right. So in the case of the Perceptron, it will find a hyperplane, and so this will be a 399 dimensional plane that will separate out the two classes of data. And it doesn't guarantee that it will find an optimal hyperplane, it'll just find something. Because if there is a gap between the two clusters of data, in principle there's an infinity of hyperplanes that'll pass through that. So it'll find one, the first one that it finds, and it stops. And it might not be an optimal one. But then what happens is when you give it a new digit saying, okay, tell me whether this is a 9 or a 4, it actually doesn't matter whether it's a 9 or a 4. All it does is it's gonna say, is it to this side of the hyperplane or is it to that side of the hyperplane? And it'll classify it as such based on which side of the hyperplane the new digit falls on. So you already identified a problem with the algorithm, that you could have data that is... It's much easier to think of this now with a different example; the same thing applies to images of cats and dogs.
0:21:16.0 AA: You could have a 1000 by 1000 image of a cat and of a dog. So these would now be dots in million dimensional space. Right? And the hyperplane will separate the two sets of images. The cats on one side, dogs on the other. Now you bring in an image of a horse. The classifier has no idea what that is. All it does is it says, which side of the hyperplane does this image fall on? And it's gonna call the horse either a cat or a dog. Right? But again, going back to the late 1950s, early 1960s, this was a very big deal, because you are basically saying we can recognize images, we can classify images, and classification is the first step towards recognition.
0:22:01.7 SC: And I presume that it's not that hard to let the computer know that we're gonna introduce a new category called horses, and then we're gonna separate that space into three subsets.
0:22:13.5 AA: Yeah. So then you can have classifiers that are classifying into multiple categories. Yes, very much so. Yeah. But the basics began with that first linear classifier. And at the same time there was another researcher who doesn't actually get talked about as much, but whose work was just as seminal. And this was Bernard Widrow, who was at Stanford. And he was somebody who had been working on designing digital filters, adaptive digital filters. So filters that learn about the characteristics of the signal that they're processing and learn how to separate noise from signal adaptively, so they're learning on the fly. And he was very much steeped in that and then realized that the techniques that he was using to build his adaptive digital filters were actually exactly the same techniques he needed to build an artificial neuron that learned a linear classifier exactly like what Rosenblatt was doing. But Widrow's approach was very different. And in a very fundamental sense, the algorithm that Widrow came up with to build this linear classifier is actually the true precursor to today's back propagation algorithm, which is used to train artificial neural networks. And actually, talking about stories, there's an amazing story about what's called the Widrow-Hoff least mean squares algorithm.
0:23:45.8 AA: And Widrow was an assistant professor at Stanford in the late 1950s. And there's a knock on his door. A young student comes in wanting to see if he can do a PhD with Widrow. And so Bernie Widrow starts scribbling some stuff on the blackboard, trying to tell him about adaptive filters. And in the course of two hours of discussing what a PhD project would be like, they end up designing what's today called the least mean squares algorithm. They realize that they have designed an algorithm to train a simple artificial neuron. And then the duo walk across the room. There's an analog computer out there that Lockheed has donated to Stanford. And Hoff, who's a kid looking for a PhD thesis project, goes and programs the analog computer to simulate the algorithm, shows that it works. And now it's Friday evening by then at Stanford and the supply rooms are closed, and they want to build this thing in hardware. So they walk across to Zach's Electronics, buy all the stuff that they want, go over to Ted Hoff's apartment, and over the course of the weekend build the world's first hardware artificial neuron.
0:25:00.0 AA: Monday morning, they have it working, right? And they have a very crude version. So most people who are following AI by now will know of this algorithm called back propagation, and this idea of doing gradient descent and stochastic gradient descent; these are algorithms that are used for optimizing the parameters of your model. And what Widrow and Hoff had done was an extremely noisy version of stochastic gradient descent. They had come up with an algebraic formulation; instead of using any calculus or anything, they just came up with a very straightforward sort of algebraic formulation that they could then implement in hardware. So that Monday morning, they had an artificial neuron on the desk working.
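A minimal sketch of a Widrow-Hoff-style least mean squares update on an illustrative toy signal; the target weights, noise level, and step size are made-up choices. The point is that each incoming sample nudges the weights in proportion to the error, with no explicit calculus, which is the noisy form of stochastic gradient descent described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical adaptive-filter style problem: recover the weights of a
# noisy linear signal, one sample at a time.
true_w = np.array([0.5, -1.2, 2.0])
X = rng.normal(size=(500, 3))                   # input samples
d = X @ true_w + 0.05 * rng.normal(size=500)    # desired outputs, with noise

w = np.zeros(3)   # adaptive weights, start at zero
mu = 0.05         # step size (illustrative choice)

# Widrow-Hoff LMS: for each sample, measure the error and adjust the
# weights in proportion to it. Effectively a very noisy stochastic
# gradient descent on the mean squared error.
for x, target in zip(X, d):
    y_hat = np.dot(w, x)        # the neuron's current output
    error = target - y_hat      # how wrong it was on this sample
    w += mu * error * x         # nudge the weights to shrink that error

print("learned weights:", np.round(w, 2))   # should land close to true_w
```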
0:25:45.5 SC: I wanna make sure that all prospective graduate students out there know this is not usually what happens. This is a rare story.
0:25:55.6 AA: And I think Ted Hoff's story is pretty amazing, because once he finished his PhD, he gets an offer from a startup in the Bay Area, and he comes to Bernie Widrow to ask, should he join the startup? And Widrow tells him, yes, you should. And that startup turns out to be Intel.
0:26:13.9 SC: Good, smart.
0:26:15.1 AA: And Ted Hoff goes on to become one of the designers of the world's first microprocessor.
0:26:20.9 SC: Wow. Okay. Very good. I wanna get into the actual neuron just a little bit. This idea of a threshold is apparently very important here. So your little artificial neuron is taking in some signals, and rather than just sort of adding them together, it says, I'm gonna hold off until it crosses some threshold, and then I'm gonna fire. Is that the basic idea?
0:26:45.5 AA: Yes. And it's inspired from our understanding of biological neurons. Right? That's what biological neurons are doing. In a very simple way of looking at what biological neurons do, you have all these signals coming in through the dendrites of the neuron into the cell body, and the cell body is kind of accumulating the signals that are coming in. And when it crosses some threshold, both in terms of the strength of the signal and the timing of the signals, it then fires a signal on its own axon, which then travels to the dendrites of other neurons. So. And this model was already known. And so the simple kind of artificial neuron is a very basic approximation of this biological neural mechanism.
0:27:32.3 SC: And then the real fun, the special thing is...
0:27:35.0 AA: Yes.
0:27:35.6 SC: The real fun comes in when we start adding them together in layers.
0:27:40.7 AA: Yes. And this was actually not possible in the 1950s and 1960s, because what they had was just a single layer of neurons, which means that the inputs are coming in from one side into the neurons. The neurons do this weighted sum and then add a bias term, and then if that weighted sum plus bias exceeds some threshold, they fire. They're always producing an output. If the sum exceeds a certain threshold, then the output becomes 1. Otherwise it's minus 1, right? Or 0 or 1. Choose your output, but that's it. So on the output side, you have either 1 or -1. And on the input side you have these inputs coming in, and the neuron does this computation. The moment you add another layer of neurons, so that the outputs of the first layer of neurons go in as inputs to the second layer, then the training algorithm that Rosenblatt had and that Bernard Widrow had, the Widrow-Hoff LMS algorithm, the least mean squares algorithm, they didn't work. And there's a very interesting story as to why this was a very big deal in the 1960s, because we had people like Marvin Minsky and Seymour Papert, who wrote this extraordinary book that was published in 1969.
0:29:04.2 AA: It was called Perceptrons in honor of Rosenblatt. And it was a mathematical analysis of these kinds of machines that learn. And in it, it had the perceptron convergence proof and a whole bunch of other things. But they also had a very important proof that showed that a single layer neural network of the kind that Rosenblatt had and of the kind that Bernard Widrow had, could not solve something called the XOR problem. So if you imagine four points on the XY plane, so at the origin, at the point 0,0 you have a circle. At the point 1,0 on the X axis you have a triangle. Then at the coordinate 1,1 you have another circle. And then on the y axis, coordinate 0,1 you have a triangle. So you have two triangles and two circles, but they're on the diagonals of the square. Right? You cannot draw a straight line to separate the circles from the triangles.
0:30:10.5 SC: It's true.
0:30:12.1 AA: And so this was the XOR problem. And Minsky and Papert had a very elegant proof saying that single layer neural networks will never solve this problem. And this was a huge knock, because this was such a simple problem that anyone looking at it can solve it. But this incredible thing that people had been going on and on about couldn't solve it. And then what they did, which was kind of underhanded, was they also insinuated, without giving any mathematical proof, that even multi-layer neural networks will not solve this problem.
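For concreteness, here is a tiny hand-wired two-layer network of threshold neurons that does compute XOR; the weights are chosen by hand for illustration, not learned by any algorithm:

```python
import numpy as np

def step(z):
    """Hard threshold neuron output: 1 if the weighted sum exceeds 0, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: one neuron computes OR, the other computes AND.
    h_or = step(np.dot([1, 1], x) - 0.5)    # fires if at least one input is 1
    h_and = step(np.dot([1, 1], x) - 1.5)   # fires only if both inputs are 1
    # Output neuron: fires for "at least one, but not both".
    return step(1 * h_or - 2 * h_and - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
# Prints 0, 1, 1, 0: the XOR pattern that no single-layer perceptron can
# produce, but two layers of the very same threshold neurons handle easily.
```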
0:30:46.0 SC: Oh, yeah.
0:30:46.8 AA: And this effectively killed... Well, the legend is that this effectively killed research into neural networks and led to the first AI winter. So sometime in the 1970s, research into neural networks just died, because people didn't think that these things were good for anything if they couldn't solve something as simple as an XOR problem, except that they had no proof that multi-layer neural networks couldn't do it. And no one had yet come up with an algorithm to train multi-layer neural networks. So the only people who kept the faith were people like Geoff Hinton, who persisted. Hinton, I remember talking to him, and he was completely convinced that multi-layer neural networks would solve the problem. He thought that Minsky and Papert had just kind of pulled a fast one over everyone's eyes. So basically the argument there was that Minsky and Papert were very interested in another form of AI called symbolic AI.
0:31:47.9 SC: Sure.
0:31:49.6 AA: Right? And they wanted research funding to go into that. And this approach of connectionism and neural networks was against what they were looking at. I don't know how true that is, whether there was any underhandedness here, but certainly the first AI winter was in part influenced by Minsky and Papert's group. So the math played a big role.
0:32:12.2 SC: Yeah, no, the human beings, they're endlessly fascinating. But at some point, one way or the other, the multi-layer neural networks did start gaining traction.
0:32:23.4 AA: Yes. So for that, we would have to wait until the 1980s. So, the first big change that happened in the early 1980s was John Hopfield coming up with Hopfield networks, which were a different kind of neural network. They were essentially neurons that were fully interconnected, which meant that the output of a neuron would go as input to every other neuron in the network except itself. So it couldn't influence itself, but its output could influence every other neuron. And so these were fully connected networks. And Hopfield networks were essentially networks that were used to store memories. And they were very deeply inspired by condensed matter physics, so the Ising model of ferromagnetic materials and spin glasses. And what Hopfield was after was basically... he was formerly a condensed matter physicist who was looking to do something in computational neuroscience or biology. And he was looking for a problem to solve. And he figured out that he could solve the problem of how the brain stores and retrieves what he calls associative memories by designing this kind of neural network, where given this fully connected neural network, he designed an equation which allowed you to calculate the, quote unquote, energy of the network.
0:33:54.6 AA: This was modeled after the Hamiltonian of a material. And here the energy of the system would be at a minimum when you stored a memory. And anytime you corrupted the network, it would enter into a higher energy state. And because the neurons were all connected to each other, the moment you corrupted one of them, in terms of changing the output of a neuron, you would end up setting up a dynamic that made the network just traverse the energy landscape and come all the way down to a minimum, and when it came and settled into a minimum, then you would just read off the outputs of the neurons and they would be exactly the memory that you had stored previously. So the way you set the coefficients of your network, which in this case are the strengths of the connections between the neurons, depended on the memory you wanted to store. So say it's a 10 by 10 black and white image that you want to store. Again, 10 by 10 is 100 pixels, and you have 100 neurons in your network so that each neuron is responsible for one of those pixel values, right?
0:35:06.1 AA: So you take those 100 pixel values that you want to store, and Hopfield had an equation that told you, given that I want to store these hundred pixels, what should be the strength of the connections between the neurons? And you would set the strengths of the connections such that when the memory is stored, the network is at an energy minimum, so it is stable. It won't do anything. But the moment you perturb it, the moment you add some noise... Let's say you want to corrupt your image. And corrupting an image simply means that some of the outputs are now changed. Let's say they were zeros and ones in certain neurons. You just flip them. And now the dynamics of the network takes over. The network finds itself at a higher position in the energy landscape. And because the neurons are influencing each other, they will just start flipping, just like magnetic moments in a ferromagnetic material, right? And then the whole system will traverse its way down the energy landscape and then settle back into its minimum, which by definition is a stable state, the way the network is designed. And when it settles back into a stable state, you just read off the outputs of the neurons and you've got your image back.
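A minimal sketch of a Hopfield-style network storing and recalling a single pattern; the pattern size, the amount of corruption, and the update schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical tiny "memory": 25 values of +1 / -1, standing in for
# the pixels of the small black and white image in the example above.
memory = rng.choice([-1, 1], size=25)

# Hebbian / outer-product rule: connection strengths are set directly
# from the pattern to be stored; no neuron connects to itself.
W = np.outer(memory, memory)
np.fill_diagonal(W, 0)

def energy(state):
    # The quote-unquote energy of the network, modeled on a spin system.
    return -0.5 * state @ W @ state

# Corrupt the stored memory by flipping a handful of "pixels".
state = memory.copy()
state[:6] *= -1
print("energy after corruption:", energy(state))

# Asynchronous updates: each neuron repeatedly aligns itself with its
# weighted input, so the network slides down the energy landscape.
for _ in range(5):
    for i in rng.permutation(25):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("energy after settling:  ", energy(state))
print("recovered the stored memory:", bool(np.array_equal(state, memory)))
```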
0:36:15.8 SC: And obviously they're stealing ideas from physics, but we made up for it by giving them the Nobel Prize. So I think that that's fair.
0:36:24.3 AA: Yeah. So this was 1982, 1984. That's when Hopfield came up with this Hopfield network. But we still didn't know how to train multilayer neural networks. What was interesting was that the ideas that would eventually influence people like Hinton were already in the air. In fact, going back to the 1960s, people in rocketry had ways of sort of optimizing their models. Like when you think about what happens when you launch a rocket which is supposed to reach some destination in deep space, your rocket is actually going in space. You have your model of the rocket which is giving you the control system. And at every time step you have to actually look at where the rocket is and then modify the parameters of your model, because your model has to adapt to the fact that the rocket may not be doing exactly what you think it should be doing. So your model parameters are being updated on the fly. That, in a sense, had the beginnings of the back propagation algorithm already built in, except it was not formulated as such. And there were others. At MIT, there was a grad student who had done some work in economics who was also playing around with similar ideas, and many others who had done bits and pieces.
0:37:47.6 AA: But it was Hinton, Rumelhart and Williams who in 1986 put out a paper in Nature, just an amazing three and a half page paper, that is essentially the back propagation algorithm, which shows you how you can train a multi-layer neural network using what turns out to be just the chain rule of calculus. It's an extraordinarily simple idea in retrospect. Maybe one thing I should mention here: remember we talked about the fact that the early neurons had a thresholding function. That thresholding function is not differentiable; it just transitions steeply at the point when the weighted sum exceeds a certain threshold, and then outputs 1, otherwise it's 0. So you can't differentiate that. And this turned out to be very key to not being able to train the network when you added more layers. Because the chain rule requires that the entire computational graph from the output all the way back to the input has to be differentiable. Every computation that you do has to be differentiable so that you can use the chain rule to kind of back propagate your error from the output side all the way to the input side.
0:39:05.6 AA: But these discontinuities that were there in the way the artificial neurons were designed, with a step threshold function, kind of ruined that. So one of the things that Hinton and others did was change that step function into a sigmoid, which is differentiable. And so they essentially ensured that the computation from the input all the way to the output, regardless of how many steps there were, was all differentiable.
0:39:37.2 SC: So the sigmoid, remind us what that is. That's like a smoother version of the kind of step function that turns the neuron on or off.
0:39:45.6 AA: Yes. So the initial neurons had a very sharp transition from, let's say 0 to 1. So the slope of the function at the point of the transition is infinite. Right? And the sigmoid is essentially a smoothing of that function. So this step is gone. So it's a very smooth transition from 0 to 1.
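A few lines contrasting the hard threshold with the sigmoid and its derivative; the sample points are arbitrary choices for illustration:

```python
import numpy as np

def step(z):
    # Rosenblatt-style threshold: the output jumps from 0 to 1 at z = 0,
    # so the slope at the transition is infinite and not differentiable there.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # Smooth replacement for the step: ranges from 0 to 1, differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # The derivative has a simple closed form, which is what backprop chains together.
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [-2.0, -0.1, 0.0, 0.1, 2.0]:
    print(f"z={z:+.1f}  step={step(z):.0f}  sigmoid={sigmoid(z):.3f}  d(sigmoid)/dz={sigmoid_grad(z):.3f}")
```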
0:40:06.6 SC: And let me make sure I have the right mental picture here. When Hopfield has these networks where every neuron talks to every other neuron, in what sense is that multi-layer? What happened to the layers?
0:40:20.1 AA: So, yeah, good question. So that was not a multi-layer network in the way we think of it today. That was simply a fully connected neural network. I brought that up only to say that the research in neural networks had a resurgence with Hopfield networks. But Hopfield networks are not the kind of networks we use today.
0:40:37.6 SC: Good.
0:40:38.8 AA: They're what are called recurrent neural networks, because the outputs of a neuron can feed back into the rest of the network and there's a lot of feedback going on, which is not the way today's neural networks work. Today's neural networks are what are just called feed forward networks. Computation proceeds from one end, the input side, and then it goes layer by layer to the output side. And the outputs don't feed back to the input side.
0:41:09.6 SC: Okay, I guess it's interesting. There is this idea of back propagation. Is that part of a feedforward network or is that excluded?
0:41:20.2 AA: So back propagation is the way you update the weights of the network when you want to train it. Right? So the computation, when the neural network is doing a computation, when you give it an input and it has to produce an output, that is the feedforward process. So the input comes in from one side and each layer does some computation, feeds the result of its computation to the next layer, and then next layer does its computation. And then finally this exits on the output side as the output that you want. And one way to think about it is a vector of information comes in on the input side and the vector just propagates, changes size as it goes through the network because the layers can have different sizes. And then on the output side you'll get back another vector. And one vector comes in, gets transformed by each layer into a different vector. And then finally on the output side you have another vector. And that output side could be a scalar, it could just be a number 0 or 1 saying this is a cat or a dog, or it could be a vector which has more information than just a scalar.
0:42:29.3 AA: But that's the computation that's happening in the feedforward pass. But when you're training the network, let's say you're training a network to recognize images of cats from images of dogs, right? And let's go back to our example of 10 by 10 images. So that's 100 pixels. So you turn each image into a hundred dimensional vector, right? 100 pixel values, and they are fed into the network on the input side. And then on the output side, after it has gone through a bunch of transformations, you get back either a 0 or a 1, 0 for a cat and 1 for a dog. Now, in the beginning, when you have initialized the network randomly, all the weights of the network, which are the strengths of the interconnections between the neurons, they are initialized randomly. And so when you feed in some image on the input side, you're just gonna get the wrong answer on the output side.
0:43:31.3 SC: Sure.
0:43:32.2 AA: But you know what the right answer should be because you supplied the training data, you have human annotated data saying these are cats and these are dogs. So on the output side, you know that it should be outputting either a 1 or a 0 and it maybe does the wrong thing.
0:43:45.4 AA: So you calculate the error now, and the amount of error that it makes is a function of all the parameters of the model, all the weights of the network, or all the strengths of the connections between the neurons. So you now have something called a loss function, where the loss is formulated in terms of the parameters of the model. And you can imagine this as some sort of very high dimensional surface. And in the beginning, when the network makes a loss, you will end up at a location in that loss landscape which is pretty high up; you've made a large loss. So you now use something like gradient descent to try and work your way down to the point in the landscape where the loss is at a minimum. And that's the part where you have to figure out how much to update each weight so that your network is slightly better at the same task than it was the first go around. So let's say you took one image and it made a certain amount of loss. You took that loss, figured out how much you need to tweak all of the parameters so that the next time you feed the same image back, the loss will be a little less.
0:45:07.0 AA: And if you keep doing that, eventually you'll come down to a point where the loss is at a minimum. But the trick here is that you have to do this for all images, because if you just did that for one image, then you'll be off for all the other images. So you have to do it in parallel for everything simultaneously. So for your entire data set, you want to reach the bottom of the loss landscape, right? And that part where you are trying to update the weights of the network, that's the back propagation part. It's called back propagation because the loss is calculated on the output side. And now you have to propagate that loss all the way back layer by layer so that you can update the weights of each layer as you go back towards the input side.
0:45:46.4 SC: That's very helpful. Yeah, because...
0:45:48.8 AA: And that's where the chain rule comes in. Because you calculate the gradient on the output side and then you have to chain all the differentiable computations together so that you can calculate the gradient over the entire network.
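A minimal sketch of one feedforward pass plus the backpropagation and gradient descent step for a tiny two-layer sigmoid network trained on a single made-up example; the layer sizes, learning rate, and squared-error loss are illustrative assumptions, not the setup of any real system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Tiny network: 4 inputs -> 3 hidden sigmoid neurons -> 1 sigmoid output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

x = np.array([0.2, -0.4, 0.9, 0.1])   # one illustrative input vector
target = np.array([1.0])              # its label (say, "dog")
lr = 0.5                              # learning rate (illustrative)

for _ in range(200):
    # Feedforward pass: each layer transforms the vector it receives.
    z1 = W1 @ x + b1; h = sigmoid(z1)
    z2 = W2 @ h + b2; out = sigmoid(z2)
    loss = 0.5 * np.sum((out - target) ** 2)

    # Backpropagation: chain rule, from the output layer back toward the input.
    d_out = (out - target) * out * (1 - out)   # gradient of the loss at the output layer
    dW2 = np.outer(d_out, h); db2 = d_out
    d_h = (W2.T @ d_out) * h * (1 - h)         # gradient chained back through W2
    dW1 = np.outer(d_h, x);   db1 = d_h

    # Gradient descent: nudge every weight a little way downhill on the loss.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("final loss on this one example:", float(loss))
```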
0:46:03.7 SC: It makes sense because there's a difference between training the network where obviously you're gonna have to go backwards and fix all the parameters early in the chain of layers versus just doing the calculation once you've trained it, which is a purely feed forward mechanism.
0:46:20.3 AA: Yes. And so for people who are, for instance, using something like ChatGPT today, when we use it, we're just using the feedforward part of it. But when you're training it, you have to keep doing this back and forth.
0:46:35.4 SC: And say more about the idea of gradient descent. It seems to me maybe I'm not sophisticated enough here. It seems to be like it's a very high dimensional version of something that Isaac Newton would have told us about many years ago.
0:46:49.5 AA: Well, that is the amazing part about this field, is that a lot of these ideas go back all the way to the invention of calculus and linear algebra. These are all 16th, 17th, 18th century math, very simple stuff in some fundamental sense. And yet of course now when we do gradient descent, we are doing it in extraordinarily high dimensional spaces, right? When you think about modern deep neural networks, like the large language models that we have, GPT-4 or OpenAI's o1 or o3 and Claude, all these are quite large. We don't know exactly how many parameters they have, but they're close to a trillion. Right? So your loss, the error that the network is making, has to be formulated as a function of these trillion parameters. Right? So your loss landscape is in a trillion dimensions. Right? And the other thing that I should have mentioned earlier, when we were talking of these artificial neurons, is that the early artificial neurons were linear neurons. The subsequent sort of artificial neurons have a non-linearity built into them. The addition of a sigmoid essentially makes the neuron non-linear.
0:48:14.4 AA: So now your loss function is not only a function of a trillion parameters, but there's a huge amount of non-linearity baked into the whole system. So it is not a convex function. Even in a trillion dimensions, you can imagine a convex function, some trillion dimensional bowl shaped function that you can just slide down all the way, and you're guaranteed to find the global minimum. That's not the case with these modern neural networks. These loss landscapes are extraordinarily complex. We don't even know for sure that they have a global minimum, but they may have lots and lots of very good local minima. And so the trick is to somehow use gradient descent to find one of these local minima that is satisfactory, and that is not a trivial task. But the math is the same regardless of whether it's a trillion parameters or four parameters. That's the amazing part. The back propagation algorithm, what they did in that 1986 paper, Rumelhart, Hinton and Williams, the same stuff holds true today. And I forget how many parameters they had, but it could have been tens, 10, 20, 30 or something like that.
0:49:34.9 AA: And today we're talking about a trillion, but the algorithm is the same. And of course there are all sorts of innovations about how you do the gradient descent. You can do stochastic gradient descent, where you take a small batch of the data that you're training on at any given time, and so the loss landscape that you calculate is always some approximation of the true loss landscape. And so when you do the descent, you are sort of stochastic in terms of whether you're actually going in the right direction or not. But it turns out stochastic gradient descent works brilliantly. And so there are these kinds of innovations that have happened since the 1986 paper. There are also other tricks about how fast you do the gradient descent. Do you use momentum? Do you keep track of the gradient at previous steps? When do you slow down, when do you speed up, those kinds of things? And they're really important in terms of engineering, but conceptually it hasn't changed.
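A minimal sketch of mini-batch stochastic gradient descent with momentum for a simple linear model; the data, batch size, learning rate, and momentum value are all illustrative assumptions, but the loop has the same shape as the one used at much larger scale:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative data: 10,000 samples of a noisy linear relationship.
true_w = rng.normal(size=20)
X = rng.normal(size=(10_000, 20))
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(20)          # model parameters, initialized at zero
velocity = np.zeros(20)   # running memory of past gradients (momentum)
lr, momentum, batch_size = 0.01, 0.9, 32

for step in range(2_000):
    # Each step looks at only a small random batch, so the gradient is a
    # noisy estimate of the true gradient over the whole dataset.
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)   # gradient of mean squared error

    velocity = momentum * velocity - lr * grad   # momentum smooths out the noise
    w += velocity                                # move downhill in the loss landscape

print("distance from the true weights:", round(float(np.linalg.norm(w - true_w)), 4))
```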
0:50:40.1 SC: And so, just so that I'm getting the right visualization here, because I know there's a trillion dimensional thing, but I'm visualizing a two dimensional landscape, because that's all I can do in my head. And the impact of the non-linearities, which you're emphasizing, would be that if you just had linear neurons, a tiny change in input always gives you a tiny change in output, and therefore your fitness landscape that you're trying to find the minimum of would be pretty smooth. It would be gently rolling hills. But now that you have these neurons that can sort of click on and off in a non-linear way, now you have a jagged loss landscape and it becomes much harder to sort of know, from where you are and what your local conditions are, where your actual minimum is lying.
0:51:26.0 AA: Yeah. And because of the size of these networks and the complexity of the computations that happen layer by layer, it's not even clear that there's a global minimum. If you think of the y equals x squared function, which is convex, you have a well defined global minimum that you can descend down to, and that will represent the lowest loss. We are not guaranteed that these functions are convex. So not only are they jagged, they may not have a global minimum. So a lot of the mathematical work that is going on is in trying to figure out the actual nature of these loss landscapes.
0:52:14.4 SC: But okay, despite the fact that a lot of the math is sort of classical math, good old stuff from way back, there are some more recent developments. Right? Or maybe I shouldn't say that, because I don't know what dates different developments correspond to. But you do talk about the curse of dimensionality. The number of dimensions is so large that one of the things you try to do is sort of find a good subspace where interesting things are happening, using ideas like Principal Component Analysis. Maybe you could make an effort to explain that to everybody.
0:52:51.8 AA: So the curse of dimensionality actually has been well known for a long time. And it predates or is almost orthogonal to neural networks. In terms of a problem, one of the best ways to understand this is a very classic machine learning algorithm that was developed in the 1960s called the Nearest Neighbor Search algorithm. Right? So here again we can go back to our images of cats and dogs, let's say images that are 10 pixels by 10 pixels. So each image is essentially 100 pixels. So if we map it into 100 dimensional space, cats end up in one location, dogs end up in another location, and now we're given a new image. So the Perceptron would have started by saying, oh, it's going to find a hyperplane that separates the two clusters of data, and then checks to see which side of the hyperplane the new image falls. And then it classifies it as a dog or a cat. The K nearest neighbor algorithm does something actually intuitively very simple. It basically says, okay, where does this new image map to in that 100 dimensional space? Is it closest to a dog or is it closest to a cat?
0:54:05.5 SC: Okay.
0:54:06.8 AA: If it is closer to a dog, it's a dog. If it's closer to a cat, it's a cat. Right? And if you're just using one neighbor to make that discernment, then you can end up in all sorts of trouble. You're essentially overfitting the data, because you can have noise in your original data, and if your new image is closer to a noisy data point or a mislabeled data point, then you will get an error. So you can mitigate that by saying, okay, I'm gonna look at three neighbors, or I'm going to look at seven neighbors or whatever, some odd number of neighbors, and then you take a majority vote. Right? But that process depends on being able to calculate some sort of distance between these data points in high dimensional spaces. So 100 dimensions is not high dimensional for machine learning, right? And you're basically going to use some sort of distance. There are various metrics you could use, but let's say Euclidean distance is fine, let's say you use that. And then what happens is, as you increase the number of dimensions of your data, there comes a point where the notion of similarity or dissimilarity that this algorithm depends on, that similar data points are closer together than dissimilar ones.
0:55:29.0 AA: That starts falling apart, because in high dimensions everything is as far away as everything else. So the notion of similarity and dissimilarity just doesn't work anymore. And that, in a very simple way, is the curse of dimensionality. And this is a serious problem for machine learning, because the number of dimensions is also the number of features that you have in your data. So in this case, each pixel is a feature of your image. But you could have other kinds of data where, let's say, you're thinking about penguins and you're trying to analyze penguins by looking at the length of their beak, the depth of their beak, their flipper length, et cetera. A penguin can be characterized by, I don't know, 10 such features, and you have a pretty good idea of what kind of penguin it is. But there will be situations where 10 is not enough, where you probably need 10,000 features, or even more, to figure out how to classify an object. And the more and more features you add, you end up with this curse of dimensionality. You're just not going to be able to do what you want.
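A quick numerical illustration of that effect, using uniformly random points; the point counts and dimensions are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# For one query point among 1,000 random points, compare the distance to
# its nearest and farthest neighbor as the number of dimensions grows.
for dim in [2, 10, 100, 1_000, 10_000]:
    points = rng.uniform(size=(1_000, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"{dim:>6} dims: nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
# In low dimensions the nearest neighbor is much closer than the farthest one;
# as the dimension grows the ratio creeps toward 1, so everything is roughly
# as far away as everything else and "nearest" stops meaning much.
```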
0:56:39.6 AA: So one obvious thing is to do something like PCA, Principal Component Analysis, where you essentially project the data back into lower dimensions and hopefully capture most of the variance in your data along the few lower dimensional axes that you've chosen, right? So if the data varied equally in all of the higher dimensional axes, then you're in trouble.
0:57:09.2 SC: You'd be stuck.
0:57:11.2 AA: But if there is something about the structure of your data such that if you do bring it down into lower dimensions and it still captures most of the variance of your data along those fewer dimensions, then you can take that lower dimensional data and do your classification on that. You could train your machine learning model on the lower dimensional data. So that's one way of tackling it. But oddly, higher dimensions are also really useful. So for instance, if you have data that, let's say, is 100 dimensional, and you cannot build a linear classifier because the two clusters of data are not linearly separable in 100 dimensions... or let's not even go into 100 dimensions, let's just take two dimensional data, right? A smattering of dots on an XY plane. One set is colored red and the other is colored green.
0:58:03.9 AA: But the red and green are kind of mixing up such that you cannot draw a straight line to separate the two. Right? Now, one easy trick would be to actually project this data into higher dimensions, let's say into three dimensions or four dimensions. For instance, if your red colored dots are centered around the origin and your green colored dots are an annular ring around the red colored dots, there's no straight line that's gonna separate the two clusters in two dimensions. But you can imagine adding a third dimension where you're just multiplying the X coordinate and the Y coordinate to create a Z coordinate. Right? Or rather, will multiplication be enough? Maybe not. You'll have to square the X coordinate and add it to the square of the Y coordinate. And then when you plot this data in three dimensions, the green dots are going to rise above the red dots. And then you can draw a hyperplane between the two. You can use a linear classifier to separate out the green dots and the red dots in three dimensions. And then once you've figured out what that separation is, when you project it back into two dimensions, you will get a sort of non-linear curve, which is going to be some sort of oval shape that separates out the annular ring from the dots in the center.
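The annulus example in code: hypothetical 2D points that no straight line can separate become linearly separable once a third coordinate z = x^2 + y^2 is added; the data generation details are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Red dots: clustered around the origin. Green dots: an annular ring around them.
red = rng.normal(scale=0.3, size=(200, 2))
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = rng.uniform(2.0, 3.0, size=200)
green = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

# No straight line separates the two classes in the 2D plane.
# But add a third coordinate z = x^2 + y^2 ...
def lift(points):
    return np.column_stack([points, (points ** 2).sum(axis=1)])

red3, green3 = lift(red), lift(green)

print("largest z among the red points:  ", round(float(red3[:, 2].max()), 2))
print("smallest z among the green points:", round(float(green3[:, 2].min()), 2))  # 4.0 by construction
# The green ring now sits above the red cluster in the z direction, so a flat
# plane (say z = 2.5) separates them, and projected back down to 2D that plane
# becomes the circle x^2 + y^2 = 2.5: a non-linear boundary found with a
# purely linear separator in three dimensions.
```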
0:59:27.6 AA: Right? So this is something we can visualize in two dimensions and three dimensions. But this exact thing is also done using this extraordinary technique called kernel machines or kernel methods, where the idea is that you want to project your data into high dimensions. And if you want to find a linear classifier in high dimensions, the algorithm requires you to take dot products of vectors of the data points. And as you go up to higher and higher dimensions, the dot products are computationally expensive.
1:00:00.7 SC: Okay.
1:00:02.1 AA: Right. Because let's say you started off with 100 dimensional data, which is technically low dimensional, and you can't find a linearly separating hyperplane, and you project it into a million dimensions. Well, you can find that separation, right? But now you've basically moved your computation of dot products from 100 dimensional space all the way into million dimensional space, and you're essentially creating a computationally intractable problem. Kernel methods are this amazing technique where you find a function that takes in two low dimensional vectors and spits out a number that is equal to the dot product of the corresponding two vectors in the higher dimensions. Right?
1:00:46.0 AA: So let's say there is a vector X in the low dimension which maps to a higher dimensional vector phi of X. And there's another vector Y in lower dimensions which maps to another higher dimensional vector, which is phi of Y. Right? So in the higher dimensions, the dot product that you would need is phi of X dot phi of Y. And that might be a million dimensions dotted with a million dimensions. Very expensive. So you have another function, called a kernel, call it K. You feed in the two lower dimensional vectors and it spits out a number that is actually equal to phi of X dot phi of Y, except you've never gone into the million dimensions.
1:01:28.7 SC: Right.
1:01:29.5 AA: You're operating in the 100 dimensional space. Your function just takes in two 100 dimensional vectors and gives you a number, a scalar, which is equal to the dot product in the higher dimensional space. And once you have this, you can now run your linear classifier in million dimensional space without ever stepping into the million dimensional space. And the amazing thing about these kernel methods is that you can project to infinite dimensions, where you're guaranteed to find a linearly separating hyperplane.
1:02:03.6 AA: And technically you can never even write down what an infinite dimensional vector is going to be computationally, but your kernel method, your kernel function, will take two lower dimensional vectors and give you a scalar value that is the dot product of two infinite dimensional vectors in infinite dimensional space. So your linear classifier now is operating in infinite dimensional space, where it finds a hyperplane, and then when it's projected back into your lower dimensional space, you have found some very intricate nonlinear boundary.
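As a sketch of that infinite-dimensional case, assuming scikit-learn and NumPy: the RBF kernel used by a support vector machine corresponds to an infinite-dimensional feature space, yet the classifier below separates the ring-versus-center data while only ever handling two-dimensional points. The data set is the same invented one as in the earlier sketch.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
inner = rng.normal(scale=0.5, size=(200, 2))                      # cluster at the origin
angles = rng.uniform(0, 2 * np.pi, 200)
outer = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)])
outer += rng.normal(scale=0.2, size=outer.shape)                  # annular ring

X = np.vstack([inner, outer])
y = np.array([0] * 200 + [1] * 200)

# The RBF kernel implicitly works in an infinite-dimensional feature space,
# but every computation below stays with 2D vectors.
clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))   # the boundary back in 2D is a closed curve
```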
1:02:34.5 SC: It's interesting to me how much of the effort needs to go into these sorts of speeding-up processes. Like you might find an algorithm that would do amazing things, but if it takes 10,000 years to run, it's not going to be very helpful to you in the real world.
1:02:48.8 AA: No, that's what's amazing about this whole enterprise: there's so much engineering chops that is needed, and math-informed engineering.
1:03:00.1 SC: And I know that one of the big papers that really made a revolution more recently was the transformer architecture. And I made a little bit of effort to understand what that means, but I've kind of failed. Is it possible for you to explain why that was important?
1:03:15.3 AA: Yeah. So you're talking of the "Attention Is All You Need" paper from 2017. Yes. So I'll do my best to see how to explain that. It is an amazing paper, when you think about one paper which changed the course of AI. Right? This is not to say that the paper came out of the blue. I mean, there was a lot of work that was happening that led up to that paper, but it was a very transformational paper. I think maybe it's almost simpler to just talk of what the transformer is rather than the paper, right? So, the way a large language model works is... let's talk about the training process. You take a sentence, let's say this is a sentence I keep using in my talks: the dog ate my homework, right? And you blank out the last word, homework. You have the first four words, "the dog ate my," and a blank. And then you feed it to your model and you're asking it to predict what is to follow. Right? Now traditionally, before all this happened with LLMs, when we did next word prediction we were kind of looking at maybe the previous one word or two words in order to predict what the next word might be.
1:04:44.9 AA: If you just looked at the last word in that sentence that you had, "the dog ate my." If you looked at "my" and you tried to predict the next word, you would be completely wrong, because you'd be saying something like my poem, my dog, I don't know, it could be just about anything. So the model will have no idea how to predict the next word. If you took two words, if you looked at "ate" and "my" and then said what should be the word to follow "ate my," you'll probably say lunch or dinner or something. Again, completely wrong in the context of this sentence. It's only when you look at the word "dog" that you realize, oh, that should be homework. We know that this is a very popular sentence for children as an excuse to their teacher about why they didn't do their homework. So the dog ate my homework. So what happens when you feed these words to a large language model is that the first thing the AI does is turn these words into vectors, or what are called embeddings.
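A minimal sketch of that embedding step, assuming NumPy; the tiny vocabulary, the eight-dimensional embedding size, and the random lookup table are all invented for illustration (in a real model the table is learned and has on the order of a thousand dimensions per word).

```python
import numpy as np

vocab = {"the": 0, "dog": 1, "ate": 2, "my": 3, "homework": 4}
d_model = 8                                               # tiny stand-in for ~1000 dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned during real training

prompt = ["the", "dog", "ate", "my"]
token_ids = [vocab[word] for word in prompt]
vectors = embedding_table[token_ids]                      # one vector per word
print(vectors.shape)                                      # (4, 8): four words, four vectors
```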
1:05:49.7 AA: Each of these four words is turned into a vector in some high dimensional space, let's say a thousand dimensional space. And these vectors then just flow through the deep neural network, that is the black box that is being called the transformer. And what it has to do is, if it was just looking at the final vector, which represented the word "my," and using that to predict the next word, it'll probably get it wrong. So as those four vectors flow through the deep neural network, which is the transformer, it has to contextualize them, it has to keep massaging those vectors such that the words start paying attention to each other. So the four vectors are just simply moving through the layers of the network. And at each layer the four vectors have changed such that they capture something about each other, right? And so, hence the term attention, they're paying attention to each other. So after the first transformation, maybe the four vectors have changed enough that you are predicting something that is close to homework, but not quite. And then you keep going through the transformer layers, and in the final analysis, at the very end, you still have four vectors.
1:07:15.1 AA: But now the fourth vector has so much contextualized information, it knows that it has paid attention to all the other words. And the vector has changed such that now the LLM can say, oh, I can look at that last vector and I know that the next word should be homework, right? And the attention mechanism is essentially the process that allows the transformer to contextualize these vectors. And it's a whole bunch of matrix manipulations, it's just very neat matrix math going on. And then you just spit out these four vectors at the end. You just look at the final vector, which is the vector for the word "my." But now it has knowledge about the fact that it has paid attention to "ate" and "dog" and all of that. And it can allow you to make the prediction that the next word should be homework. During training, of course, it will make an error, because all of the weights of the network are randomly initialized. The matrix stuff that the transformer is doing has to be learned. It needs to learn what it has to pay attention to, given a certain sentence. So in the beginning, when it predicts a word, it might predict something completely wrong.
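A minimal sketch of the attention arithmetic being described, assuming NumPy. This is the standard scaled dot-product form with a single head and invented sizes; a real transformer stacks many such layers and adds much more (multiple heads, feed-forward blocks, positional information).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X has one row per word; returns contextualized rows of the same shape."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])   # how strongly each word attends to each other word
    weights = softmax(scores, axis=-1)       # each row is a probability distribution
    return weights @ V                       # each output row is a weighted mix of the values

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                               # embeddings for "the dog ate my"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # learned during real training
print(self_attention(X, Wq, Wk, Wv).shape)                # still four vectors, now contextualized
```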
1:08:28.1 AA: In fact, it will predict something completely wrong. But you know what the right word should be. Right? So what the language model is predicting at the very end is a probability distribution over its vocabulary. It's basically saying, oh, if my vocabulary is a thousand words, then here's the probability distribution over my vocabulary as to what is the most likely next word. And it's gonna get it wrong in the beginning. But you know what the correct probability distribution over your vocabulary should be: it should be one for the word homework and zero for everything else. And so you then calculate an error. And that error is a function of all of the 500 billion or a trillion parameters in your large language model. And you do back propagation all the way through your network to fiddle with the weights of the network, so that the next time you give it the same sentence, it predicts a word that is a tiny bit closer in that probability distribution space to the word that you want. And so as you keep tweaking with every back propagation step, your network will get better and better at predicting the fact that the next word should be homework.
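A minimal sketch of that training signal, assuming NumPy. Only a single output layer is updated here, the vocabulary has five words, and the numbers are invented; the point is just that the cross-entropy error between the predicted distribution and the one-hot target for "homework" nudges the weights a little closer each step.

```python
import numpy as np

vocab = ["the", "dog", "ate", "my", "homework"]
target_id = vocab.index("homework")

rng = np.random.default_rng(0)
h = rng.normal(size=8)                        # final contextualized vector for "my"
W = rng.normal(size=(8, len(vocab))) * 0.1    # output layer, randomly initialized

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(3):
    probs = softmax(h @ W)                    # predicted distribution over the vocabulary
    loss = -np.log(probs[target_id])          # cross-entropy against the one-hot target
    grad_logits = probs.copy()
    grad_logits[target_id] -= 1.0             # gradient of the loss with respect to the logits
    W -= 0.5 * np.outer(h, grad_logits)       # one small step of gradient descent
    print(f"step {step}: P(homework) = {probs[target_id]:.3f}, loss = {loss:.3f}")
```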
1:09:41.1 AA: But that's just for one sentence. Now imagine doing this for every sentence that you scraped off the Internet. And that's why training these language models takes months.
1:09:49.8 SC: Yeah, and that was great that you did it. I do understand, for the first time in my life, so thank you very much for that. So we're near the end of the podcast. The final question is gonna be a completely unfair one, so answer it to whatever level you wanna answer it. Given that you've studied some of the math and some of the history of how these have gone, what is your feeling about the future of progress in these kinds of AI landscapes? Is it more just gonna be scaling, where we have more computing power and more data, or is there some conceptual leap out there remaining to be made that's gonna make everything very different?
1:10:33.3 AA: My sense is that scaling... well, scaling what, right? So currently, when we talk of scaling things up, we're talking of scaling up large language models. And the reason why scaling them up alone will not get us to any kind of generalized intelligence is potentially because, A, we have no mathematical guarantee that a language model is 100% accurate. You cannot guarantee accuracy. Right? Because the output is a probability distribution over its vocabulary with every forward pass. That's what it produces. It produces a probability distribution over its vocabulary, and then you sample from that distribution. So there is an inbuilt stochasticity there. And there's no mathematical guarantee that the probability distribution it produces, even if you sample the most likely next word or token out of that distribution, is going to be the word that you want. So scaling up alone is not going to get us to a place where we are 100% sure of the accuracy of the model. The other problem with large language models is that they are extremely sample inefficient. They require enormous amounts of data to get to where they are. Right? And the reason why scaling has worked so far is because this entire process of training a large language model is more or less hands off in terms of human inputs.
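A tiny illustration of the inbuilt stochasticity just mentioned, assuming NumPy; the four-word vocabulary and the probabilities are invented. Even with the model's output distribution held fixed, sampling the next token can give different answers on different runs.

```python
import numpy as np

vocab = ["homework", "lunch", "dinner", "poem"]
probs = np.array([0.7, 0.15, 0.1, 0.05])       # a made-up output distribution

rng = np.random.default_rng()                  # unseeded on purpose: results vary run to run
samples = [rng.choice(vocab, p=probs) for _ in range(5)]
print(samples)                                 # usually "homework", but not always
```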
1:12:04.6 AA: You just scrape some amount of text information from the Internet, you mask the last word and ask the network to learn how to predict the next word. Right?
1:12:15.3 SC: Yeah.
1:12:15.9 AA: With the masked word. And that's a process that can be completely automated, and hence amenable to scaling up. And they've managed to do that now for a long time, and the results are pretty amazing. But given its sample inefficiency, given that it has no guarantee of correctness, even though these models are getting much better at being correct, there's no guarantee; it's an asymptotic thing. So we're not going to guarantee 100% accuracy. Given those two things and other concerns, most people in the field are expecting something similar to what happened with the "Attention Is All You Need" paper. That paper changed everything. We're probably one or two steps like that away from an AI that is capable of generalizing to questions that it hasn't seen, answering questions about patterns that don't exist in the training data. So effectively, going back to our earlier argument, doing what Kepler did.
1:13:18.0 SC: Yeah.
1:13:18.8 AA: Right? And LLMs are not those kinds of systems.
1:13:22.9 AA: Very unlikely. But you never say no with these things. My hunch is that we're two or three breakthroughs away from something quite transformative.
1:13:35.5 SC: That gives the youngsters in the audience something to think about and something to try to do. So, Anil Ananthaswamy, thanks so much for being on the Mindscape podcast. This was great.
1:13:44.9 AA: Thank you, Sean. It's my pleasure.