
Put some theory on it

By: skaup On: Sat 27 December 2025
In: The-Middle
Tags: #neural-networks #backpropagation #science

OR - How I learned to stop worrying and love all the Origins of Backpropagation.

If you read even a bit about machine learning history, you will see a very interesting fight come up. To cut to it, in the 70s and 80s there was a clash (and to some extent, that clash will ALWAYS be there) between the efficacy of rule-based systems (your standard Prolog program, or the ELIZA chatbot) and more non-deterministic systems that claimed to probabilistically "learn" from "experience". Now, these non-deterministic systems have many origins - among them the work of Marvin Minsky, rooted in biology. At least, that is what your standard undergrad-level ML textbook goes into.

But the PARTICULAR fight in the 70s and 80s was amongst the people who were literally making it possible to have more probabilistic systems. The famous 1986 paper Learning representations by back-propagating errors popularised the approach of "backpropagation" - in essence, something that combines gradient descent with some fancy (or basic, depending on how you look at it) chain rule derivations. It suddenly became very feasible to build more stochastic systems. I won't go into it, but it basically boils down to being able to compute gradients in a much easier manner - thus making it easier to do the gradient part of gradient descent and, you know, all the C-3PO magic that follows.
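To make the chain-rule-plus-gradient-descent idea concrete, here is a minimal sketch of my own (nothing from the 1986 paper - the one-neuron model, learning rate, and data are made up purely for illustration):

```python
# A one-neuron "network": prediction y_hat = w*x + b, loss = (y_hat - y)^2.
# The "backward pass" is just the chain rule applied by hand.

def train(x, y, w=0.0, b=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        y_hat = w * x + b               # forward pass
        # backward pass: chain rule gives dloss/dw and dloss/db
        dloss_dyhat = 2 * (y_hat - y)   # d(loss)/d(y_hat)
        dw = dloss_dyhat * x            # d(y_hat)/dw = x
        db = dloss_dyhat * 1            # d(y_hat)/db = 1
        w -= lr * dw                    # gradient descent step
        b -= lr * db
    return w, b

w, b = train(x=2.0, y=5.0)
print(w * 2.0 + b)  # prediction after training, close to the target 5.0
```

In a real multi-layer network the same chain rule is applied layer by layer, which is the whole trick: the gradient for every weight falls out of one backward sweep instead of a separate calculation per weight.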

However, it is astonishing to me how much even the people involved with popularising said technique (word used deliberately) were not fans of it. Geoffrey Hinton, one of the co-authors of the paper, who spent years working on the technique, was himself like .. hmm, this is cool, but it seems to overfit on larger samples - or they would turn back around and say, oh, this is not as elegant as such-and-such approach. He went back to working on something called Boltzmann machines - and then had to FAIL, repeatedly fail actually, to come back to this curious backpropagation business. A very good article goes into the details of this here.

So they ditch this backpropagation business for a while in the 70s and come back to it in the 80s. It becomes a good approach, and I suppose (I am only guessing) Yann LeCun and co were accused of plagiarism. There was a trippy guy in the 70s (Paul Werbos) who followed a very similar approach - he came up with it by thinking about Freudian theories. Bonkers. Yann LeCun himself published a paper a year after the breakthrough one saying --- ahh, this is where you might have seen it before. In it he talks about all the Operations Research folks who did similar work, and about Lagrangian systems.

And so it becomes clear that this isn't ONE thing. It's an old approach, applied to a new context. In fact, while discussing it with my mentor Asokan Pichai, he told me the approach goes as far back as NEWTON. I thought we were talking at least 20th century here - how far back does this go?

Turns out, there is something called the Calculus of Variations, which I have been doing some cursory reading about - it is basically the process of finding an optimal path, kind of. The original problem was what path a rolling ball should take from one point to another so that it reaches it as quickly as possible. It was a famous problem in mathematics: Johann Bernoulli sent out a challenge to mathematicians to solve it. When Newton anonymously submitted a solution, he was found out - Bernoulli said that he had "recognised the lion by its claws". Incredible.

You might think it's a straight line, but with other forces at play (plus it's a ball, it's gotta roll), the actual path is something called a cycloid - the curve traced by a point on a circle as it rolls along a straight line. But the calculations applied in this case work in any setting where you associate a function (in this case, the path of the ball from A to B) with a particular value (the time taken) - such a mapping is called a functional. Finding the optimal path (shortest or longest, as the case may be) is called finding the stationary path of the functional.
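In symbols (this is the standard textbook formulation, not anything original to this post): the travel time is a functional of the path y(x), and its stationary path is found with the Euler-Lagrange equation.

```latex
% Travel time as a functional of the path y(x), with y measured
% downward from the start and g the gravitational acceleration:
T[y] = \int_{0}^{a} \sqrt{\frac{1 + y'(x)^2}{2\,g\,y(x)}}\; dx

% A stationary path of a functional \int F(y, y', x)\,dx
% satisfies the Euler-Lagrange equation:
\frac{\partial F}{\partial y} - \frac{d}{dx}\,\frac{\partial F}{\partial y'} = 0

% For the brachistochrone problem, solving it gives a cycloid:
x = r(\theta - \sin\theta), \qquad y = r(1 - \cos\theta)
```

The parallel with machine learning is the shape of the problem: in both cases you are scoring an entire function (a path, or a predictor) with a single number and asking which function makes that number stationary.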

It can be applied to finding the path a ray of light takes through a medium from point A to B, to the surface area of a soap film, to - if you can guess - the function associated with the minimum prediction error for a given set of values. So PHYSICS is now connected to this as well. But really, what I figured out while reading about this stuff is that you can take whatever theory and retrofit it into anything. The origin doesn't really matter - it matters for scientific investigation and credit, but in real life, what matters is that the approach WORKED in practice. What is interesting here, however, is that we have somehow covered subjects from Biology and Physics to Psychotherapy. Looking at the past is interesting because we can see the many different ways we used to arrive at the same thing - and what other ways could there be? Perhaps a connection is in there, waiting to be re-discovered. More details in the next part.


References

1. ELIZA Chatbot - one of the first chatbots
2. "Learning representations by back-propagating errors" (1986 paper)
3. The Backstory of Backpropagation
4. "A theoretical framework for back-propagation"
5. Paul Werbos's fun website
6. Brachistochrone curve problem history