A perceptron is the simplest neural network we can create. Deep neural networks can have thousands, or even millions, of perceptrons tied together. For now, let's just focus on a single perceptron.

Interestingly, the perceptron has an uncanny resemblance to a biological neuron - it receives some input, operates on it, and produces an output:

Realizing the relationship between a perceptron and a biological neuron doesn't do us much good from a mathematical standpoint. Let's begin by updating our perceptron with some critical variables, $w$ and $b$.

Every input to the perceptron has an associated weight, denoted by $w$. The perceptron itself is associated with a bias, $b$. Let's apply these variables to something you learned in math class all those years ago:

That's the equation for a line - look at you go! However, we're still missing something important. If you've read about neural networks before, you've probably heard of something called the *activation function*. Let's denote this as $f$:

We can take our previous line equation and put it through the activation function, yielding:

But... what's the point? Let's say we're training a model to recognize whether a fruit is an apple or watermelon based on its weight. If $y=wx+b=36.27$, does the perceptron think it's an apple or watermelon?
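To make this concrete, here's a minimal sketch of the linear step in plain Python. The values of $w$ and $b$ are made up for illustration, chosen so that a $73\text{g}$ input lands at $36.27$:

```python
def linear(x, w, b):
    """Compute the perceptron's pre-activation output, y = w*x + b."""
    return w * x + b

# Placeholder weight and bias -- not learned, just illustrative.
print(linear(73.0, w=0.5, b=-0.23))  # 0.5 * 73 - 0.23 = 36.27
```

On its own, this raw output tells us nothing about apples or watermelons, which is exactly the problem the activation function solves.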

What if we were able to squish $36.27$ into a value between $0$ and $1$ and treat it as a probability? We could then choose *apple* if the value is closer to $0$ and *watermelon* if the value is closer to $1$. This makes the decision process much easier - let's further this example by introducing a popular activation function used in classification problems.

The purpose of the sigmoid function is to take an input, $z$, and squash it to a decimal value between $0$ and $1$. Here's its graph:
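The sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, is short enough to write out directly:

```python
import math

def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))      # exactly 0.5
print(sigmoid(36.27))  # very close to 1
print(sigmoid(-5))     # very close to 0
```

Large positive inputs approach $1$, large negative inputs approach $0$, and an input of $0$ lands exactly at $0.5$.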

Back to our previous apple vs. watermelon example: After applying the perceptron's weight and bias to the fruit's mass of $73\text{g}$, we end up with:

Let's say the result comes out to be $0.13$. We still don't really know what this means in terms of classifying something as an apple or watermelon.

If we establish what's known as a *decision boundary*, we can make a definite decision. Let our decision boundary be $0.5$: if $y \le 0.5$, we classify the fruit as an apple; if $y > 0.5$, we classify it as a watermelon. So if $y = 0.13$, the perceptron believes it to be an apple, while if $y = 0.62$, the perceptron believes it to be a watermelon. It's important to note that, at this point, all of these outputs are just guesses.
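In code, the decision boundary is just a threshold check (the boundary value of $0.5$ is a choice, not a law):

```python
def classify(y, boundary=0.5):
    """Map a squashed output in (0, 1) to a class label."""
    return "apple" if y <= boundary else "watermelon"

print(classify(0.13))  # apple
print(classify(0.62))  # watermelon
```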

Of course, in the beginning, we won't get great results. The perceptron could start by deciding a fruit weighing $9000\text{g}$ is an apple and one weighing $76\text{g}$ is a watermelon.

The perceptron "learns" by tuning the weight, $w$, and the bias, $b$, through a process called *gradient descent*. Gradient descent aims to minimize the error, also known as the loss function. In simpler terms, we're minimizing how "wrong" the perceptron's guesses are.

Instead of presenting math, I'll introduce a simple analogy. Imagine your guitar is out of tune and you are attempting to tune the low E-string. Each image below is a step in the process of tuning this string:

- Depicted by the image at left, the tuner initially tells us the string is too flat.
- To fix this, we try tightening the string to make the pitch higher. We're getting closer, but the tuner says it's still too flat.
- We try to take a bigger step and accidentally tighten the string too much. Indicated by the yellow region, the note is now too sharp.
- Let's try loosening the string to make the pitch lower. Indicated by the blue region, the note is just right!

This is an oversimplified example of how gradient descent works. In our case, we're tuning the weight and bias to improve the correctness of the guesses generated by the perceptron.
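The tuning process can be sketched as a short training loop. This is a hand-rolled illustration, not a production recipe: the fruit masses (scaled to kilograms to keep the numbers tame), labels, learning rate, and epoch count are all made up for the example, and the update rule is the standard gradient of the cross-entropy loss for a sigmoid output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: (mass in kg, label), where 0 = apple and 1 = watermelon.
data = [(0.10, 0), (0.15, 0), (4.0, 1), (6.5, 1)]

w, b = 0.0, 0.0  # start with no knowledge at all
lr = 0.5         # learning rate: how big a "tuning step" to take

for epoch in range(1000):
    for x, target in data:
        y = sigmoid(w * x + b)
        # For cross-entropy loss, the gradient w.r.t. w is (y - target) * x
        # and w.r.t. b is (y - target); step in the opposite direction.
        error = y - target
        w -= lr * error * x
        b -= lr * error

print(sigmoid(w * 0.12 + b))  # a light fruit: should land near 0 (apple)
print(sigmoid(w * 5.0 + b))   # a heavy fruit: should land near 1 (watermelon)
```

Just like the guitar tuner, each update nudges the parameters toward "in tune": too-confident wrong guesses produce large errors and big corrections, while nearly correct guesses barely move $w$ and $b$ at all.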

All this time, we've been operating under the assumption there is only one feature in $x$. This is fine for our apple vs. watermelon example; however, most problems have more than one feature.

For example, let's say we're trying to recommend apartments to renters based on square footage and number of bedrooms. This means we now have *two* features - you guessed it - square footage and number of bedrooms.

Let's call square footage $x_0$ and number of bedrooms $x_1$. Instead of just a scalar number, our input $x$ is now a column vector:

Remember, each input is associated with its own weight. Let's update our perceptron with $w_0$ and $w_1$:

And now for our formulas:

Now, since we're dealing with vectors instead of scalars, we take the dot product of the transposed weight vector and the input vector. After simplifying, we get:

After the dot product is expanded, we can see that each input is multiplied by its corresponding weight.
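In plain Python, the expansion $w^T x = w_0 x_0 + w_1 x_1$ is just a sum of element-wise products. The apartment values and weights below are placeholders:

```python
def dot(w, x):
    """Dot product: w0*x0 + w1*x1 + ... -- each input meets its own weight."""
    return sum(wi * xi for wi, xi in zip(w, x))

x = [850.0, 2.0]   # [square footage, bedrooms] -- placeholder apartment
w = [0.004, 0.75]  # one weight per feature -- placeholder values
b = -1.0

z = dot(w, x) + b  # 0.004*850 + 0.75*2 - 1.0
print(z)
```

The same line equation from earlier, $y = wx + b$, generalizes cleanly: with one feature the dot product collapses back to a single multiplication.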

In this post, I've given a brief introduction to deep neural networks by starting with their simplest element: the perceptron. I've introduced weights, biases, and the activation function. We know the perceptron is capable of learning by tuning its parameters in a process called gradient descent. The foundations of machine learning rely heavily on basic linear algebra and calculus - maybe not as complicated as you initially thought!

I'm currently in the process of writing a follow-up post to further the information presented in this post - please check back again as I'll link to the follow-up post in this one. Please feel free to reach out to me with questions, comments, edits, etc. Thank you for reading!