P.S. This article assumes some familiarity with functional programming; knowing Lisp syntax helps. But I have tried to explain things as best I can regardless. I hope that if you're from a general programming background, you will still be able to follow.
Now, the functional part. First, a major part of this code is a direct translation of the work done in the napkin math article. The premise is simple: if we want to build a "function approximator", let's try it for a very basic case, the average function.
As an example: you have four squares, each with a colour between white and black. We can represent these as numbers as follows. If a square's colour is white, it is 0; if it is black, it is 1; any other shade of grey in between is a real number between those two. To estimate the colour of the overall grid, you take the average of the four numbers.
Again, the napkin math article explains it far better; I highly encourage you to check it out. All I have done is a Lispy rewrite of it, so I will focus on that. However, I did run into one little hiccup. Will get to that.
First, in order to approximate the average function, we can try one thing. Given the four square numbers, we multiply each by one of a set of four numbers and then add the products up. For example, if the "weights" we pick are 1 1 1 1, then each number passes through unchanged. But if, from any random starting point, our little function approximator could get close to 0.25, 0.25, 0.25, 0.25, then it would be able to act like the average function.
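That weighted sum is a one-liner in Lisp. Here is a quick sketch (the real version, built as a closure, comes below; weighted-sum is just an illustrative name I made up):

```lisp
(defun weighted-sum (inputs weights)
  "Multiply each input by its weight and add the products up."
  (reduce #'+ (mapcar #'* inputs weights)))

;; With the magic weights, this is exactly the average
;; (values chosen as exact binary fractions):
(weighted-sum '(0.0 1.0 0.5 0.5) '(0.25 0.25 0.25 0.25)) ; => 0.5
```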
So, initialization of global variables first. Typically, I'd define global variables in ALL CAPS; I like how scary that makes them look. But Lisp uses this *earmuff* style for them, so I will stick with it. When in Rome.
(defparameter *input-layer* '(0.2 0.5 0.4 0.7))
(defparameter *hidden-weights* '(0.98 -0.08 0.27 0.01))
; Ignore these for now; they will become relevant later
(defparameter *threshold* 0.00005)
(defparameter *dx* 1e-4)
(defparameter *learning-rate* 0.1)
(defparameter *tolerance* 1e-6)
(defparameter *max-steps* 1000)
Now, we get to the main part: closures. We can define our neuron as follows.
(defun make-neuron (weights &optional (bias 0.0))
  "Creates a neural network node as a closure with weights and an optional bias"
  (let ((w (copy-list weights))
        (b bias))
    (lambda (inputs)
      (+ b (reduce #'+ (mapcar #'* inputs w))))))
The copy-list part makes a fresh copy of the weights passed to the generator function (for a flat list of numbers, that is all the copying we need). And what's that in the last line? It takes the input numbers and does exactly what was described above: a sum of the products of each input with its weight. Again, closures have some object-like properties, but they are functions. So what do we get with this?
(defparameter *average-weights* '(0.25 0.25 0.25 0.25)) ; For testing only
(defparameter *average-neuron* (make-neuron *average-weights*))
; To call the same
(funcall *average-neuron* *input-layer*)
That gets you the average. Pretty cool, right? No need for any finagling with object methods - for now. But this is for testing only; we want our hidden weights to converge to those values.
To do that, i.e. train the model, I need to generate a bunch of x and y pairs. So I will do that.
(defparameter *rectangles* '())
(defparameter *rectangle-averages* '())

(dotimes (i 100) ; generate 100 sample rectangles
  (let ((rectangle (list (random 1.0)
                         (random 1.0)
                         (random 1.0)
                         (random 1.0))))
    (setq *rectangles* (append *rectangles* (list rectangle)))
    (setq *rectangle-averages*
          (append *rectangle-averages*
                  (list (/ (apply #'+ rectangle) (length rectangle)))))))
In this case, each x is a tuple of four numbers between 0 and 1, as mentioned before, and each y is of course the corresponding average.
Now, when my model generates a y, I need to check how close it is to the actual average of the four given numbers. I will hand-wave over why this particular function works, because the napkin math article does a better job of explaining it. For our purposes, this is our holy grail: the loss function.
(defun mean-squared-error (actual expected)
"Calculates the Mean Squared Error between two lists of numbers"
(let* ((squared-errors (mapcar (lambda (a b) (expt (- a b) 2)) actual expected))
(sum-of-errors (apply #'+ squared-errors)))
(/ sum-of-errors (length actual))))
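As a quick sanity check (values invented, chosen as exact binary fractions so the floats come out clean):

```lisp
(mean-squared-error '(0.5 0.75) '(0.25 0.75))
;; errors are 0.25 and 0.0; squared: 0.0625 and 0.0; mean => 0.03125
```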
Now, if I had to make a VERY basic model using my flawed hidden weights for now, I would simply multiply each rectangle's values by the hidden weights, and compare the results using the loss function.
(defun model (rectangle hidden-weights)
  "A model function that computes the dot product of a rectangle's inputs and a hidden layer."
  (reduce #'+ (mapcar #'* rectangle hidden-weights)))
(defun train (rectangles hidden-weights)
"Applies the model function to each rectangle in a list, returning a list of outputs."
(mapcar (lambda (rectangle) (model rectangle hidden-weights)) rectangles))
Then we do the “training”
(setq simple-train-outputs (train *rectangles* *hidden-weights*))
(setq simple-train-error (mean-squared-error simple-train-outputs *rectangle-averages*))
So that does a pretty bad job, naturally, since the weights aren't actually a four-number tuple of 0.25s. It gives some output between 0 and 1, but it's not the average.
I was in a DSA class once, and the instructor asked us for a solution. So I raised my Zoom class hand and answered. While I was met with some encouragement, I also got this memorable response: "Saachi is a bit of a violent person. Her approach is to hit the problem with a hammer." So, on to the hammer approach. One way to get closer to (0.25, 0.25, 0.25, 0.25) is to simply generate a bunch of random weight sets and compare each against our target. Once the loss goes below a threshold, call it a day. So, I tried that as well.
(defun randomised_train (rectangles hidden-weights rectangle-averages &optional (threshold *threshold*))
  "Regenerates random weights until the mean squared error drops below a threshold."
  (let ((outputs (train rectangles hidden-weights)))
    (loop while (> (mean-squared-error outputs rectangle-averages) threshold)
          do (setq hidden-weights (random_layer))
             (setq outputs (train rectangles hidden-weights)))
    hidden-weights))
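One gap here: random_layer isn't shown in the article. A minimal version might look like this (my sketch; the exact range of the random weights is an assumption):

```lisp
(defun random_layer (&optional (n 4))
  "Generates a list of n random weights in [0, 1). Sketch; range is an assumption."
  (loop repeat n collect (random 1.0)))
```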
This works pretty well too, since the function in this case is fairly simple. But still, we can't generalise this to any kind of function we want to approximate. This won't work for recognising your local doctor's handwriting, let's say that. So, we go to our friend from last time: gradient descent.
Now this is where I reasonably diverge from the napkin-math article. The article uses PyTorch's automatic differentiation to find the slope of the loss curve. While that is nice, I wanted to make it even easier for me to implement this, so I chose finite differences instead. The thing you reluctantly learnt in your eighth grade class. This part was quite tough for me to implement, but then I came across a wonderful book by Paul Orland called Math for Programmers. Thank you to Manning. That book does a Python version of this; I then had to translate it to Lisp.
(defun secant-slope (f xmin xmax)
"Calculate slope using secant line approximation"
(/ (- (funcall f xmax) (funcall f xmin))
(- xmax xmin)))
This is (y2 - y1) / (x2 - x1). Again, eighth grade: the slope of a line, for any given function. Just beautiful. It blew my mind when I saw it in the book; it just clicked immediately. If you ever doubt whether there is a higher order, just remember that, at the very least, these higher orders exist.
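The article doesn't show the tuple plumbing in full, so here is my reconstruction of how secant-slope can be lifted to a per-coordinate gradient, along with the vector-length helper used later. The helper names and the symmetric nudge by *dx* are my assumptions:

```lisp
(defun partial-derivative (f v i &optional (dx *dx*))
  "Approximate the partial derivative of f at point v along coordinate i
   by taking a secant slope across a small nudge of dx on either side."
  (secant-slope (lambda (x)
                  (let ((v2 (copy-list v)))
                    (setf (nth i v2) x) ; vary only coordinate i
                    (funcall f v2)))
                (- (nth i v) dx)
                (+ (nth i v) dx)))

(defun approx-gradient (f v)
  "Approximate the gradient of f at v: one secant slope per coordinate."
  (loop for i below (length v)
        collect (partial-derivative f v i)))

(defun vector-length (v)
  "Euclidean length of a vector; near zero means the slope is nearly flat."
  (sqrt (reduce #'+ (mapcar (lambda (x) (* x x)) v))))
```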
Once this is done, some plumbing is needed to actually use it for our tuples. This is done within the approx-gradient function. But then, we can do the following.
(defun update-parameters (v grad &optional (learning-rate *learning-rate*))
"Update parameters using gradient: v - learning_rate * grad"
(mapcar (lambda (vi dvi) (- vi (* learning-rate dvi))) v grad))
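As a quick illustration of one update step (values invented):

```lisp
;; With the default learning rate of 0.1, each parameter moves by
;; (* learning-rate gradient) in the direction that lowers the loss.
(update-parameters '(1.0 0.5) '(0.5 -0.25))
;; the first weight decreases by 0.05, the second increases by 0.025
```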
(defun gradient-descent (f vstart &key (tolerance *tolerance*) (max-steps *max-steps*))
  "Perform gradient descent to minimize function f"
  (labels ((gd-step (v steps)
             (let ((grad (approx-gradient f v)))
               (if (or (<= (vector-length grad) tolerance)
                       (>= steps max-steps))
                   v
                   (gd-step (update-parameters v grad) (1+ steps)))))) ; recursive call
    (gd-step vstart 0))) ; start from vstart at step 0
Again, that’s it. That’s gradient descent. You might be curious about that vector-length function: it’s just a way to determine whether your slope is close to flat (i.e. you’re at the minimum loss point). So, if you’ve either crossed max-steps or reached a flat slope, congrats, you’re in the game.
Nice. Some more plumbing is done in a gradient-train function, pretty similar to the other ones, and then we can simply do:
(gradient-train *rectangles* *hidden-weights* *rectangle-averages*)
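Since gradient-train's plumbing isn't shown, here is a sketch of what it probably looks like (my reconstruction, not the article's exact code): it just wraps the training data and the loss into a single function of the weights and hands that to gradient-descent.

```lisp
(defun gradient-train (rectangles hidden-weights rectangle-averages)
  "Runs gradient descent on the MSE loss over the whole training set."
  (gradient-descent
   (lambda (weights)
     (mean-squared-error (train rectangles weights) rectangle-averages))
   hidden-weights))
```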
When this worked, with (0.25000513 0.2499971 0.24999955 0.24999803) as the most recent output, I was on a plane. I did a little whoop of happiness. Made my damn day. You can find the entire code here.
So, right. That’s it for the technical stuff. But there is a part 2 to this story. I recently presented this to a very nice group of people at a functional programming meetup. But tragically, I don’t think I did a very good job. I was very nervous, since I hadn’t made a presentation outside of work in a while. I had been practicing, but I still forgot to pause before speaking, asked a bunch of open-ended questions to a very small audience (a mistake I have made before), and did not even do a quick check of how many people were actually familiar with the machine learning terms I would be using. I gave too abstract a demo to a technical audience. I will attach the slides, and a demo I made later.
But still, I am glad I failed. The people there, despite my mistakes, were thoughtful enough to ask for clarifications and to tell me what they would have liked. That has allowed me to write this article. I am genuinely grateful for their kindness.