The first few “artificial intelligence” models we know of came in the late 50s to early 60s. They were essentially rule-based engines: symbolic computation that allowed machines to mimic intelligence. And Lisp was the hot language of choice for this work. These systems were non-deterministic to a degree, but in all, we were stuffing rules such as “Brad likes apples” into machines and hoping for the best. It was a sham.
Then neural networks came along, making the whole business of stuffing rules into engines redundant. And while that made much of the “symbolic computation” crowd evaporate slowly, I thought it would be a fun exercise to try to build neural networks with Lisp. Something about the dead coming back to haunt you.
But really, this is about functional programming. I must say, to the extent that I have worked with it, I have found Lisp powerful. Even so, I haven’t grasped its power fully, and this is a bit half-baked. But I am not here to kid around about functional programming. Almost all of the “neural network from scratch” tutorials - or really, anything-from-scratch tutorials - I have seen end at the last level with a file full of class declarations. And my only question remains the same one I have had forever: is this really needed?
The goal was simple - make a neural network from scratch with Lisp. See if there are any advantages (or disadvantages) to building such a network. What does it do differently than a standard neural network? Can we learn something about the properties of such networks by designing them in a different manner? I don’t know how many of these questions I have answered, but this is my attempt.
Let’s start with a simpler question. What is a function? Let’s say you have a set of numbers on one side, (1 2 3), and another on the other side, (2 3 4). What could you say about their relationship? Right - each number on the right is one larger than its corresponding number on the left. That is a function: y = x + 1. At its most basic level, you can have two sets of numbers (real, irrational, all that jazz) and they are adequate to “describe” a function.
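That relationship fits in one line of Lisp - `mapcar` applies y = x + 1 to each element of the left-hand set:

```lisp
;; A function pairs each input with exactly one output.
;; Here, y = x + 1 maps (1 2 3) to (2 3 4).
(mapcar #'1+ '(1 2 3))
;; => (2 3 4)
```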
What is a neural network? Anyone? On asking this question, I have gotten various answers - “a machine simulation of your brain” was one. That is partially true, I guess. There are biological roots to this field that are always present, one of the more beautiful things about learning it. But really, a neural network is a function approximator.
So while y = x + 1 can give us the y for any x, what about complicated functions that we cannot represent cleanly? Well, in that case, assuming we have a machine that can be trained, we give it a bunch of x’s and a bunch of corresponding y’s. And once that machine gives us pretty accurate results for the x and y pairs it has seen, we can, fingers crossed, use it in the future to produce the y for any new x. That’s what happens in everything from handwriting recognition to large language models. It is a function approximator - a fancy phrase for “ehhh … this produces results that are very close to what a well-defined mathematical function would.”
The point is - a “neural network” is one way to do this. The base block of a neural network is a perceptron, or neuron: a fancy word for something that is supposed to be similar to the neuron in that brain of yours. It gets inputs, applies some function to them, and gives an output. That function - let’s call it the one we called it in 5th grade:
$$y = mx + c$$
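Wrapped in Lisp, that 5th-grade function becomes a tiny neuron. This is just an illustrative sketch (the names `make-neuron` and `*neuron*` are mine, not from any library):

```lisp
;; A single "neuron" is just y = mx + c wrapped in a function.
;; make-neuron returns a closure that remembers its own m and c.
(defun make-neuron (m c)
  (lambda (x) (+ (* m x) c)))

(defvar *neuron* (make-neuron 2 1))   ; y = 2x + 1
(funcall *neuron* 3)                  ; => 7
```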
Now, usually we know m and c, which is why it is easy for us to find y. But we don’t today, so how do we do it?
Well, we can kind of reverse it. If we have two pairs of x and y points, we can create a system of linear equations and solve it. Then we get m and c. But what if the function is more complicated? Multidimensional - God only knows how dimensional? This approach doesn’t really cover those cases. So we have to approximate these values. How do we do that?
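The exact two-point case is worth seeing once, since it is what the approximation later replaces. A minimal sketch (the name `fit-line` is mine):

```lisp
;; With two (x, y) pairs we can solve y = mx + c exactly:
;; m = (y2 - y1) / (x2 - x1), then c = y1 - m * x1.
(defun fit-line (x1 y1 x2 y2)
  (let* ((m (/ (- y2 y1) (- x2 x1)))
         (c (- y1 (* m x1))))
    (values m c)))

(fit-line 1 3 2 5)   ; => 2, 1  (the line y = 2x + 1)
```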
You all remember our good old friend linear regression? Anybody fresh out of that hell that is a college statistics class? In linear regression, we have a bunch of points, and when we want to find the “best fit line,” we take the line which is closest to all of them. So we take all the points and find the line for which the total distance of the points from the line is minimum. That is the best fit line. And that function - the distance one - is called the loss function. The greater that distance, the greater the chances that you aren’t on the best fit line.
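A loss function is easy to write down. Here is a mean-squared-error sketch - one common choice, not the only one (the name `loss` is mine):

```lisp
;; Loss of a candidate line (m, c) over points: the mean squared
;; vertical distance between each y and the line's prediction.
(defun loss (m c xs ys)
  (/ (reduce #'+ (mapcar (lambda (x y)
                           (expt (- y (+ (* m x) c)) 2))
                         xs ys))
     (length xs)))

(loss 1 1 '(1 2 3) '(2 3 4))   ; => 0, a perfect fit
(loss 2 0 '(1 2 3) '(2 3 4))   ; => 5/3, a worse line
```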

Remember the point I made about a bunch of points and finding something that describes them? Well, that’s what a best fit line is. So all of this is to say that a neural network is simply an overcomplicated linear regression model. Yes - well, kind of. There are more things involved, but that is basically what it is.
But how do we find that line, or that equation? That is the question. Well, one way is to try a bunch of m and c values until the distance drops below a threshold and call it a day. This is a nice approach, and it will save you your sleep at night. But then you will wake up with squiggly lines taunting you, calling you not good enough. So what can you do? Well, one key way to look at it is through that loss function. Suppose, for every one of the possible lines we can think of, we calculate the distance of each point from that line. Our line is identified by $y = mx + c$, and m and c are what we want to find, so we put those two on the two horizontal axes. Then we put the output of the loss function for each (m, c) pair on the vertical axis. Well, we have ourselves a little hill here, don’t we?
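The “try a bunch of m and c values” idea can be sketched as a brute-force grid search. Names (`grid-search`, `line-loss`) and grid bounds are mine, purely illustrative:

```lisp
;; Mean squared error of the line (m, c) over the points.
(defun line-loss (m c xs ys)
  (/ (reduce #'+ (mapcar (lambda (x y)
                           (expt (- y (+ (* m x) c)) 2))
                         xs ys))
     (length xs)))

;; Brute force: scan a grid of (m, c) candidates and keep
;; whichever pair gives the smallest loss.
(defun grid-search (xs ys &key (lo -5) (hi 5) (step 1/2))
  (let ((best-m 0) (best-c 0) (best most-positive-fixnum))
    (loop for m from lo to hi by step do
      (loop for c from lo to hi by step
            for l = (line-loss m c xs ys)
            when (< l best)
              do (setf best l best-m m best-c c)))
    (values best-m best-c best)))

(grid-search '(1 2 3) '(2 3 4))   ; => 1, 1, 0
```

Fine for two parameters on a small grid; hopeless once the function gets “God only knows how dimensional,” which is exactly why we want something smarter.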
Let's go down it, dare I say? But really, what do we want now? We want the point at which this loss function has its minimum value. And when we visualise it, simply rolling a ball down this surface from a bunch of starting points will get us there. That is what gradient descent is: a way to find the minimum of this hill, so that we can identify an appropriate m and c. You find the derivative at a point on the slope, see whether it’s going up or down, and keep going down. Do that with multiple starts, until you get to a point you’re pretty comfortable is the actual minimum and not just a random valley you got stuck in.
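Rolling the ball downhill looks like this in code - a minimal sketch of gradient descent on the mean-squared-error surface, with my own function name and hand-picked learning rate and step count:

```lisp
;; Gradient descent: nudge m and c downhill along the loss
;; surface using the partial derivatives of the squared error.
;; For L = (1/n) SUM (y - (mx + c))^2:
;;   dL/dm = (-2/n) SUM x * (y - (mx + c))
;;   dL/dc = (-2/n) SUM     (y - (mx + c))
(defun train-line (xs ys &key (rate 0.01) (steps 5000))
  (let ((m 0.0) (c 0.0) (n (length xs)))
    (dotimes (i steps (values m c))
      (let ((dm 0.0) (dc 0.0))
        (mapc (lambda (x y)
                (let ((err (- y (+ (* m x) c))))
                  (incf dm (* -2 x err))
                  (incf dc (* -2 err))))
              xs ys)
        (decf m (* rate (/ dm n)))
        (decf c (* rate (/ dc n)))))))

(train-line '(1 2 3) '(2 3 4))   ; => approximately 1.0 and 1.0
```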

There is a bunch of other stuff involved. There are neurons, with their weights and biases (our friends m and c, let’s call them that). We have activation functions - essentially another function applied to a neuron’s output after computing mx + c, which is what lets the network represent more than straight lines. You have hidden layers, essentially multiple neurons interacting with each other. You have input layers (the x’s in this case). And you have the output layer, your ideal y.
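Those pieces can be sketched too. Sigmoid is one classic activation; the `neuron-output` name and the example weights are mine:

```lisp
;; An activation function squashes a neuron's raw output,
;; giving the network its non-linearity. Sigmoid is a classic:
(defun sigmoid (z)
  (/ 1 (+ 1 (exp (- z)))))

;; A neuron with several inputs: weighted sum, plus bias,
;; pushed through the activation. A layer is a list of these.
(defun neuron-output (weights bias inputs)
  (sigmoid (+ bias (reduce #'+ (mapcar #'* weights inputs)))))

(neuron-output '(0.5 -0.5) 0.0 '(1.0 1.0))   ; => 0.5
```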
So you train the neural network - by which I mean, for a very large number of x’s and y’s, you find the best m and c values. And then, after you have verified this, you unleash it into the world and watch it bleed. Now, the functional part. What is the advantage of doing this functionally? Well, there is a lot of differentiation involved in gradient descent, and higher-order functions are your buddies there. But my curiosity was really piqued when I read this in On Lisp:
“A combination of a function and a set of variable bindings (at the time it was created) is called a closure.”
“Closures have three useful properties: they are active, they have local state, and we can make multiple instances of them. Where could we use multiple copies of active objects with local state? In applications involving networks, among others.”
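All three properties show up in a neuron. A sketch under my own naming (`make-stateful-neuron` and its message protocol are invented for illustration):

```lisp
;; A closure bundles a function with the bindings alive when it
;; was made. Each neuron privately owns its weight and bias, we
;; can stamp out many independent copies, and each copy's local
;; state can be updated - say, during training.
(defun make-stateful-neuron (weight bias)
  (lambda (msg &optional x)
    (ecase msg
      (:fire  (+ (* weight x) bias))
      (:nudge (incf weight x))
      (:peek  (list weight bias)))))

(defvar *n1* (make-stateful-neuron 2 1))
(defvar *n2* (make-stateful-neuron 2 1))
(funcall *n1* :nudge 0.5)   ; only *n1*'s local state changes
(funcall *n1* :peek)        ; => (2.5 1)
(funcall *n2* :peek)        ; => (2 1)
```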
Aha - something. Right, networks. Methinks I remember looking at some interesting networks.
In the next part.