Gradient descent is an optimization algorithm we use to find the best values for β0 and β1 in our linear regression model.
The Process
1. Start with initial guesses for β0 and β1.
2. Calculate the current error (cost function).
3. Calculate the gradient (direction of steepest increase) of the error.
4. Take a step in the opposite direction of the gradient.
5. Repeat steps 2-4 until the error stops decreasing significantly (a minimal code sketch of this loop follows the list).
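As a rough sketch of that loop in code, assuming the caller supplies hypothetical `cost` and `grad` callables (the hyperparameter names and defaults are illustrative, not prescribed here):

```python
def gradient_descent(cost, grad, params, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Repeat the cost / gradient / step cycle until the cost stops
    decreasing by more than `tol` (or max_iters is reached)."""
    prev_cost = cost(params)
    for _ in range(max_iters):
        params = params - alpha * grad(params)   # step opposite the gradient
        new_cost = cost(params)
        if prev_cost - new_cost < tol:           # error no longer decreasing significantly
            break
        prev_cost = new_cost
    return params
```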
The Math Behind It
Our cost function (the mean squared error, with an extra factor of 1/2 that simplifies the derivatives) is:
$$ R(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2 $$
Where:
- $n$ is the number of data points
- $y_i$ is the actual test score
- $x_i$ is the scaled study hours
- $\beta_0 + \beta_1 x_i$ is our predicted test score
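A minimal NumPy sketch of this cost function, assuming `x` (scaled study hours) and `y` (test scores) are 1-D arrays of equal length:

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """R(beta0, beta1): half the mean squared error of the predictions."""
    residuals = y - (beta0 + beta1 * x)           # y_i - (beta0 + beta1 * x_i)
    return np.sum(residuals ** 2) / (2 * len(x))  # (1 / 2n) * sum of squared residuals
```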
The gradients (partial derivatives) of this cost function are:
$$ \frac{\partial R}{\partial \beta_0} = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right) $$
$$ \frac{\partial R}{\partial \beta_1} = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\beta_0 + \beta_1 x_i) \right) x_i $$
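These two partial derivatives translate directly to code; continuing the sketch above (same assumptions about `x` and `y`, with NumPy already imported):

```python
def gradients(beta0, beta1, x, y):
    """Partial derivatives of R with respect to beta0 and beta1."""
    residuals = y - (beta0 + beta1 * x)
    d_beta0 = -np.mean(residuals)        # -(1/n) * sum(y_i - (beta0 + beta1 * x_i))
    d_beta1 = -np.mean(residuals * x)    # -(1/n) * sum((y_i - (beta0 + beta1 * x_i)) * x_i)
    return d_beta0, d_beta1
```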
Gradient Descent Algorithm
We use gradient descent to minimize the cost function R and find the optimal values for β0 and β1. The algorithm works as follows:
Repeat until convergence:
Simultaneously update all βj (for j=0 and j=1):
$$ \beta_0 := \beta_0 - \alpha \frac{\partial}{\partial \beta_0} R(\beta_0, \beta_1) $$
$$ \beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1) $$
Where:
- $\alpha$ is the learning rate
- $R(\beta_0, \beta_1)$ is our cost function
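"Simultaneously" means both gradients are evaluated at the current (β0, β1) before either parameter changes. A sketch of the pattern, reusing the hypothetical `gradients` helper from above (with `beta0`, `beta1`, `alpha`, `x`, `y` assumed to be defined):

```python
# Correct: compute both gradients from the current parameters, then assign.
d_beta0, d_beta1 = gradients(beta0, beta1, x, y)
beta0 = beta0 - alpha * d_beta0
beta1 = beta1 - alpha * d_beta1

# Incorrect would be: update beta0 first, then compute the gradient for beta1
# using the *new* beta0 -- that is no longer a simultaneous update.
```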
In the linear regression case:
$$ \frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) \times (-1) $$
$$ \frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right) \times (-x_i) $$
Therefore, our update rules become:
$$ \beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_i - y_i \right) $$
$$ \beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( \beta_0 + \beta_1 x_i - y_i \right) x_i $$
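Putting the two update rules together, one vectorized gradient descent step might look like this (the function name is my own; `x` and `y` are NumPy arrays as before):

```python
import numpy as np

def gd_step(beta0, beta1, x, y, alpha):
    """One gradient descent step using the update rules above."""
    errors = (beta0 + beta1 * x) - y              # (beta0 + beta1 * x_i) - y_i
    new_beta0 = beta0 - alpha * np.mean(errors)
    new_beta1 = beta1 - alpha * np.mean(errors * x)
    return new_beta0, new_beta1
```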
Practical Considerations
- Scaling: Bring your features to a similar scale, for example by standardizing (see the sketch after this list):
$ x_i := \frac{x_i - \mu}{\sigma} $
where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.
- Learning rate: If α is too small, convergence is slow; if it is too large, R may oscillate or even diverge.
- R should decrease after each iteration.
- Declare convergence when R decreases by less than a small threshold ϵ between iterations.
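For the single study-hours feature, standardization is a one-liner (assuming `hours` is a NumPy array of raw study hours):

```python
# Standardize the feature: zero mean, unit standard deviation.
x = (hours - hours.mean()) / hours.std()
```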
Implementing Gradient Descent
To implement gradient descent for our study hours vs. test scores example (a full code sketch follows this list):
- Initialize β0 and β1 to small random values or zeros.
- Choose a learning rate α (e.g., 0.01).
- For each iteration:
a. Calculate predictions: $\hat{y}_i = \beta_0 + \beta_1 x_i$ for each data point.
b. Calculate the gradients using the formulas above.
c. Update β0 and β1 simultaneously.
d. Calculate the new cost R.
- Stop when the change in R is smaller than a predefined threshold or after a set number of iterations.
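A self-contained sketch that follows these steps (function and variable names are my own, and the hyperparameter defaults are illustrative, not prescribed by the text):

```python
import numpy as np

def fit_linear_regression(x, y, alpha=0.01, tol=1e-8, max_iters=100_000):
    """Fit y ≈ beta0 + beta1 * x by gradient descent on the half-MSE cost R."""
    beta0, beta1 = 0.0, 0.0                        # 1. initialize (zeros here)
    prev_cost = np.inf
    for _ in range(max_iters):
        y_hat = beta0 + beta1 * x                  # a. predictions
        d_beta0 = np.mean(y_hat - y)               # b. gradients
        d_beta1 = np.mean((y_hat - y) * x)
        beta0 -= alpha * d_beta0                   # c. simultaneous update
        beta1 -= alpha * d_beta1
        cost = np.sum((y - (beta0 + beta1 * x)) ** 2) / (2 * len(x))  # d. new cost
        if prev_cost - cost < tol:                 # stop when R barely changes
            break
        prev_cost = cost
    return beta0, beta1
```

With the standardized study-hours array from the scaling sketch above and the corresponding test scores, this would be called as `beta0, beta1 = fit_linear_regression(x, y)`.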
This approach allows us to find the optimal values for β0 and β1 iteratively, even for large datasets where direct calculation might be computationally expensive.