When we say we want the "best" line, what do we actually mean? Think about it this way: for each student in our data, our line makes a prediction, and we can measure how far off that prediction is from the real score.
For example, if a student's real score is 85 and our line predicts 80, the error is $85 - 80 = 5$ points; if the line predicts 90 instead, the error is $85 - 90 = -5$ points.
We care about all errors, whether we predicted too high or too low. That's why we square these differences: both $5$ and $-5$ become $25$. In math terms, for each student $i$:

$$e_i = (y_i - \hat{y}_i)^2$$

Where:

- $y_i$ is student $i$'s actual score
- $\hat{y}_i$ is the score our line predicts for that student
To find the best line, we want to minimize the average of all these squared errors, known as the Mean Squared Error (MSE). We write this as:

$$\text{MSE}(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + b) \bigr)^2$$

Here $n$ is the number of students, $x_i$ is the input we predict from (say, hours studied), and the line's prediction is $\hat{y}_i = m x_i + b$, with slope $m$ and intercept $b$.

Don't let this formula scare you! It just means:

1. For each student, take the prediction error $y_i - \hat{y}_i$.
2. Square it, so too-high and too-low predictions count the same.
3. Add all the squared errors together.
4. Divide by the number of students to get the average.
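To make this concrete, here is a minimal Python sketch of the MSE computation. The data and names (`x`, `y`, `m`, `b`, and the hours-studied framing) are illustrative assumptions, not values from the original example:

```python
def mse(x, y, m, b):
    """Mean squared error of the line y_hat = m * x + b over all points."""
    n = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        y_hat = m * xi + b          # the line's prediction for this student
        total += (yi - y_hat) ** 2  # squared error, so the sign doesn't matter
    return total / n                # average over all n students

# Made-up example: scores y predicted from hours studied x
x = [1.0, 2.0, 3.0, 4.0]
y = [60.0, 70.0, 75.0, 85.0]
print(mse(x, y, m=8.0, b=50.0))  # -> 7.5
```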
To find the best values for $m$ and $b$, we need to find where this error is smallest.
We usually use Gradient Descent to do that: starting from an initial guess, it repeatedly nudges $m$ and $b$ in the direction that lowers the MSE.
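Here is a minimal sketch of gradient descent applied to this problem. The gradient formulas are the partial derivatives of the MSE above; the learning rate, step count, and sample data are arbitrary illustrative choices, not values from the original text:

```python
def gradient_descent(x, y, lr=0.01, steps=5000):
    """Fit y_hat = m * x + b by gradient descent on the MSE."""
    m, b = 0.0, 0.0  # start from an arbitrary initial guess
    n = len(x)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to m and b
        grad_m = sum(-2 * xi * (yi - (m * xi + b)) for xi, yi in zip(x, y)) / n
        grad_b = sum(-2 * (yi - (m * xi + b)) for xi, yi in zip(x, y)) / n
        # Step downhill: move against the gradient
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# Same made-up data as before: hours studied vs. scores
x = [1.0, 2.0, 3.0, 4.0]
y = [60.0, 70.0, 75.0, 85.0]
m, b = gradient_descent(x, y)
print(f"best line: y = {m:.2f}x + {b:.2f}")
```

Each step moves $m$ and $b$ a small amount (controlled by the learning rate `lr`) opposite to the gradient, since the gradient points in the direction where the error grows fastest.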