A perceptron is the simplest neural network we can create. Deep neural networks can have thousands, or even millions, of perceptrons tied together. For now, let's just focus on a single perceptron.
Interestingly, the perceptron has an uncanny resemblance to a biological neuron - it receives some input, operates on it, and produces an output:
Realizing the relationship between a perceptron and a biological neuron doesn't do us much good from a mathematical standpoint. Let's begin by updating our perceptron with some critical variables, $w$ and $b$.
Every input to the perceptron has an associated weight, denoted by $w$. The perceptron itself is associated with a bias, $b$. Let's apply these variables to something you learned in math class all those years ago:

$$y = wx + b$$
That's the equation for a line - look at you go! However, we're still missing something important. If you've read about neural networks before, you've probably heard of something called the activation function. Let's denote this as $\sigma$:
We can take our previous line equation and put it through the activation function, yielding:

$$\hat{y} = \sigma(wx + b)$$
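For instance, with some made-up numbers - say $w = 0.5$, $b = -1$, and an input $x = 3$ - the perceptron would compute $\hat{y} = \sigma(0.5 \cdot 3 - 1) = \sigma(0.5)$: whatever value the activation function happens to map $0.5$ to.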
But... what's the point? Let's say we're training a model to recognize whether a fruit is an apple or a watermelon based on its weight. If the raw output $wx + b$ comes out to some arbitrary number, does the perceptron think it's an apple or a watermelon?
What if we were able to squish $wx + b$ into a value between $0$ and $1$ and treat it as a probability? We could then choose apple if the value is closer to $0$ and watermelon if the value is closer to $1$. This makes the decision process much easier - let's further this example by introducing a popular activation function used in classification problems: the sigmoid function.
The purpose of the sigmoid function is to take any input, $z$, and squash it to a decimal value between $0$ and $1$:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here's its graph:
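If you'd like to play with it yourself, here's a minimal sketch of the sigmoid in Python (NumPy isn't required by anything above - it's just a convenient choice for the examples in this post):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number z into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Very negative inputs land near 0, very positive inputs land near 1.
print(sigmoid(-10))  # ~0.00005
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.99995
```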
Back to our previous apple vs. watermelon example: after applying the perceptron's weight and bias to the fruit's mass, $x$, we end up with $\hat{y} = \sigma(wx + b)$.
Let's say the result comes out to be some decimal between $0$ and $1$. We still don't really know what this means in terms of classifying something as an apple or a watermelon.
If we establish what's known as a decision boundary, we will be able to make a definite decision. Let's let our decision boundary be $0.5$. Then, if $\hat{y} < 0.5$, we classify the fruit as an apple. Otherwise, if $\hat{y} \geq 0.5$, we classify it as a watermelon. In other words, if the result falls below the boundary, the perceptron believes it to be an apple; on the other hand, if it falls at or above the boundary, the perceptron believes it to be a watermelon. It's important to note that, at this point, all of these outputs are just guesses.
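As a quick sketch in code - keeping in mind that mapping apple to $0$ and watermelon to $1$ is just the convention chosen for this example:

```python
def classify(y_hat, boundary=0.5):
    """Turn the perceptron's output into a label using the decision boundary."""
    return "apple" if y_hat < boundary else "watermelon"

print(classify(0.29))  # apple
print(classify(0.83))  # watermelon
```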
Of course, in the beginning, we won't get great results. The perceptron could start by deciding a fruit weighing several kilograms is an apple and one weighing a few hundred grams is a watermelon.
The perceptron "learns" by tuning the weight, and bias, , through a process called gradient descent. Gradient descent aims to minimize the error, or loss function. In simpler terms, we're minimizing how "wrong" the perceptron's guesses are.
Instead of presenting math, I'll introduce a simple analogy. Imagine your guitar is out of tune and you are attempting to tune the low E-string. Each image below is a step in the process of tuning this string:
All this time, we've been operating under the assumption that there is only one feature in our input, $x$. This is fine for our apple vs. watermelon example; however, most problems have more than one feature.
For example, let's say we're trying to recommend apartments to renters based on square footage and number of bedrooms. This means we now have two features - you guessed it - square footage and number of bedrooms.
Let's call square footage $x_1$ and number of bedrooms $x_2$. Instead of just a scalar number, our input is now a matrix:

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Remember, each input is associated with its own weight. Let's update our perceptron with $w_1$ and $w_2$:
And now for our formulas, with the weights collected into a matrix of their own:

$$w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}, \qquad \hat{y} = \sigma(w^T x + b)$$
Now, we take the dot product of the transposed weight matrix and the input matrix, since we're dealing with matrices instead of scalars. After simplifying, we get:

$$w^T x + b = w_1 x_1 + w_2 x_2 + b$$
After the dot product is expanded, we can see that each input is multiplied by its corresponding weight.
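As a quick sanity check, here's the same computation in NumPy (the apartment's numbers and the parameter values are made up):

```python
import numpy as np

# Hypothetical apartment: 750 square feet, 2 bedrooms.
x = np.array([750.0, 2.0])

# One weight per feature plus a bias - the values here are arbitrary.
w = np.array([0.002, 0.5])
b = -1.0

z = np.dot(w, x) + b   # w1*x1 + w2*x2 + b
print(z)               # 0.002*750 + 0.5*2 - 1.0 = 1.5
```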
In this post, I've given a brief introduction to deep neural networks by starting with their simplest element: the perceptron. I've introduced weights, biases, and the activation function. We know the perceptron is capable of learning by tuning its parameters through a process called gradient descent. The foundations of machine learning rely heavily on basic linear algebra and calculus - maybe not as complicated as you initially thought!
I'm currently writing a follow-up post that builds on the material presented here - please check back, as I'll link to it from this post once it's published. Please feel free to reach out to me with questions, comments, edits, etc. Thank you for reading!