
Gradient Descent for Linear Regression

Gradient descent is an optimization algorithm we use to find the best values for $\beta_0$ and $\beta_1$ in our linear regression model.

The Process

  1. Start with initial guesses for $\beta_0$ and $\beta_1$
  2. Calculate the current error (cost function)
  3. Calculate the gradient (direction of steepest increase) of the error
  4. Take a step in the opposite direction of the gradient
  5. Repeat steps 2-4 until the error stops decreasing significantly

The Math Behind It

Our cost function (mean squared error) is:

$$R(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$

Where:

  • $n$ is the number of data points
  • $y_i$ is the actual test score
  • $x_i$ is the scaled study hours
  • $\beta_0 + \beta_1 x_i$ is our predicted test score
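As a quick sketch, this cost function can be computed with NumPy. The function and variable names, and the small data arrays, are assumptions for illustration:

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Mean squared error R with the 1/(2n) factor from the formula above."""
    n = len(x)
    residuals = y - (beta0 + beta1 * x)
    return np.sum(residuals ** 2) / (2 * n)

# Hypothetical scaled study hours and test scores
x = np.array([-1.0, 0.0, 1.0])
y = np.array([60.0, 70.0, 80.0])

print(cost(70.0, 10.0, x, y))  # a perfect fit gives cost 0.0
print(cost(70.0, 0.0, x, y))   # a worse fit gives a larger cost
```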

The gradients (partial derivatives) of this function are:

$$\frac{\partial R}{\partial \beta_0} = -\frac{1}{n} \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right)$$

$$\frac{\partial R}{\partial \beta_1} = -\frac{1}{n} \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_i)\right) x_i$$
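These two partial derivatives translate directly into code. A minimal NumPy sketch (names and data are my own, for illustration):

```python
import numpy as np

def gradients(beta0, beta1, x, y):
    """Partial derivatives of the cost R with respect to beta0 and beta1."""
    n = len(x)
    residuals = y - (beta0 + beta1 * x)
    grad_b0 = -np.sum(residuals) / n
    grad_b1 = -np.sum(residuals * x) / n
    return grad_b0, grad_b1

# Hypothetical data: at a perfect fit, both gradients vanish
x = np.array([-1.0, 0.0, 1.0])
y = np.array([60.0, 70.0, 80.0])
g0, g1 = gradients(70.0, 10.0, x, y)
```

A useful sanity check: at the optimum the residuals are zero, so both gradients are zero and the updates below would leave the parameters unchanged.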


Gradient Descent Algorithm

We use gradient descent to minimize the cost function $R$ and find the optimal values for $\beta_0$ and $\beta_1$. The algorithm works as follows:

Repeat until convergence: simultaneously update all $\beta_j$ (for $j = 0$ and $j = 1$)

$$\beta_0 := \beta_0 - \alpha \frac{\partial}{\partial \beta_0} R(\beta_0, \beta_1)$$

$$\beta_1 := \beta_1 - \alpha \frac{\partial}{\partial \beta_1} R(\beta_0, \beta_1)$$

Where:

  • $\alpha$ is the learning rate
  • $R(\beta_0, \beta_1)$ is our cost function

In the linear regression case:

$$\frac{\partial R}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) \times (-1)$$

$$\frac{\partial R}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) \times (-x_i)$$

Therefore, our update rules become:

$$\beta_0 := \beta_0 - \alpha \frac{1}{n} \sum_{i=1}^n (\beta_0 + \beta_1 x_i - y_i)$$

$$\beta_1 := \beta_1 - \alpha \frac{1}{n} \sum_{i=1}^n (\beta_0 + \beta_1 x_i - y_i) x_i$$
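One simultaneous update following these rules can be sketched in NumPy. Note that both gradients are computed from the *old* parameter values before either parameter is changed (data and names here are assumptions):

```python
import numpy as np

def gd_step(beta0, beta1, x, y, alpha):
    """Apply one simultaneous gradient-descent update to both parameters."""
    n = len(x)
    errors = (beta0 + beta1 * x) - y          # predicted minus actual
    new_beta0 = beta0 - alpha * np.sum(errors) / n
    new_beta1 = beta1 - alpha * np.sum(errors * x) / n
    return new_beta0, new_beta1

# Hypothetical data; one step from (0, 0) with alpha = 0.1
x = np.array([-1.0, 0.0, 1.0])
y = np.array([60.0, 70.0, 80.0])
b0, b1 = gd_step(0.0, 0.0, x, y, alpha=0.1)
```

Returning both new values at once (rather than overwriting `beta0` before computing the `beta1` gradient) is what makes the update simultaneous.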


Practical Considerations

  1. Scaling: Bring your features to a similar scale, e.g. $ x_i := \frac{x_i - \mu}{\sigma} $, where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.
  2. Learning rate: Don't use a rate that is too small (slow convergence) or too large (the updates can overshoot the minimum and diverge).
  3. $R$ should decrease after each iteration.
  4. Declare convergence when $R$ decreases by less than a small threshold $\epsilon$ in one iteration.
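The scaling step from point 1 is a one-liner in NumPy (the study-hours values below are made up):

```python
import numpy as np

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical raw study hours
x = (hours - hours.mean()) / hours.std()      # zero mean, unit standard deviation
```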

Implementing Gradient Descent

To implement gradient descent for our study hours vs. test scores example:

  1. Initialize $\beta_0$ and $\beta_1$ to small random values or zeros.
  2. Choose a learning rate $\alpha$ (e.g., 0.01).
  3. For each iteration:
     a. Calculate predictions: $\hat{y}_i = \beta_0 + \beta_1 x_i$ for each data point.
     b. Calculate the gradients using the formulas above.
     c. Update $\beta_0$ and $\beta_1$ simultaneously.
     d. Calculate the new cost $R$.
  4. Stop when the change in $R$ is smaller than a predefined threshold or after a set number of iterations.
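These steps can be sketched end to end in NumPy. The data and hyperparameters below are assumptions for illustration (the data is generated from the line $y = 70 + 10x$, so gradient descent should recover those values):

```python
import numpy as np

# Hypothetical scaled study hours and test scores (true line: y = 70 + 10x)
x = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
y = np.array([55.0, 65.0, 70.0, 75.0, 85.0])
n = len(x)

beta0, beta1 = 0.0, 0.0     # step 1: initialize
alpha = 0.1                 # step 2: learning rate (an assumption)
epsilon = 1e-9              # convergence threshold on the change in R

def cost(b0, b1):
    return np.sum((y - (b0 + b1 * x)) ** 2) / (2 * n)

prev_cost = cost(beta0, beta1)
for _ in range(10_000):                      # step 4: cap on iterations
    errors = (beta0 + beta1 * x) - y         # step 3a: predictions minus actuals
    grad_b0 = np.sum(errors) / n             # step 3b: gradients
    grad_b1 = np.sum(errors * x) / n
    beta0 -= alpha * grad_b0                 # step 3c: simultaneous update
    beta1 -= alpha * grad_b1                 #          (grad_b1 uses the old beta0)
    new_cost = cost(beta0, beta1)            # step 3d: new cost
    if prev_cost - new_cost < epsilon:       # step 4: convergence check
        break
    prev_cost = new_cost

print(beta0, beta1)   # should approach 70 and 10
```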

This approach allows us to find the optimal values for $\beta_0$ and $\beta_1$ iteratively, even for large datasets where direct calculation might be computationally expensive.