• An optimization algorithm specifies how we minimize the loss function in order to tune the model's parameters.

Gradient Descent

  • Let $f : \mathbb{R}^n \to \mathbb{R}$ be some multivariate function, $\mathbf{x} \in \mathbb{R}^n$ be a vector such that $\nabla f(\mathbf{x})$ is defined, and $\eta > 0$ be a specified learning rate.

    Gradient descent is performed by updating the value of $\mathbf{x}$ as follows (a code sketch appears at the end of this section):

    $$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x})$$

  • We may think of this process as repeatedly stepping in the direction of steepest descent (the negative gradient), so that each step decreases $f$ as quickly as possible.

  • Limitation: Even with a good learning rate, the algorithm may converge slowly on large datasets, since the parameters are updated only once per full pass over the dataset.
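
  • A minimal NumPy sketch of the update rule above; the function names and the toy quadratic objective are invented for illustration:

    import numpy as np

    def gradient_descent(grad_f, x0, lr=0.1, n_steps=100):
        # Repeatedly step against the gradient: x <- x - eta * grad_f(x).
        x = np.asarray(x0, dtype=float)
        for _ in range(n_steps):
            x = x - lr * grad_f(x)
        return x

    # Example: minimize f(x) = ||x||^2, whose gradient is 2x.
    x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
    print(x_min)  # approaches [0, 0]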

Stochastic Gradient Descent

  • Stochastic Gradient Descent - A variation of Gradient Descent that is optimized for speed. In particular, rather than updating the model only after reading the entire dataset, we update after reading each individual entry.

  • Let $\theta$ be a parameter that we wish to find an optimum for, $f(\mathbf{x}; \theta)$ be a multivariate function with parameter $\theta$, $\mathbf{x}_1, \dots, \mathbf{x}_n$ be a list of vectors corresponding to our dataset, $\eta > 0$ be a specified learning rate, and $n$ be the number of samples that we have.

    We update the value of $\theta$ using the following algorithm:

    • Shuffle the entries of the dataset $\mathbf{x}_1, \dots, \mathbf{x}_n$.
    • For each entry $\mathbf{x}_i$, in shuffled order, perform the update (a code sketch appears at the end of this section):

      $$\theta \leftarrow \theta - \eta \nabla_\theta f(\mathbf{x}_i; \theta)$$
  • Limitation: While the algorithm solves the issue of the model updating too infrequently, it is computationally slow, since we must update the parameter in a serial manner: each update depends on the previous one, so the work cannot be vectorized or parallelized.

  • Limitation: Also, this approach is unusable if the function we are tuning takes a sequence of observations as input (i.e., it consumes more than one entry at a time).
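
  • A minimal sketch of the algorithm above; the per-sample loss and all names here are invented for illustration:

    import numpy as np

    def sgd(grad_f, theta0, data, lr=0.01, n_epochs=100, seed=0):
        # grad_f(x_i, theta): gradient of the per-sample loss w.r.t. theta.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_epochs):
            for i in rng.permutation(len(data)):            # shuffle the entries
                theta = theta - lr * grad_f(data[i], theta)  # one update per entry
        return theta

    # Toy problem: fit the mean of the data under the per-sample
    # loss (theta - x)^2, whose gradient is 2 * (theta - x).
    data = np.array([1.0, 2.0, 3.0, 4.0])
    theta_hat = sgd(lambda x, th: 2 * (th - x), theta0=0.0, data=data)
    print(theta_hat)  # approaches the sample mean, 2.5

    Note that the inner loop is inherently serial: each update to $\theta$ must finish before the next begins, which is the first limitation noted above.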

Minibatch Stochastic Gradient Descent

  • Minibatch SGD is a compromise between gradient descent and stochastic gradient descent.

    We partition the dataset into minibatches and perform SGD-style updates on the minibatches.

  • Let $\theta$ be a parameter that we wish to find an optimum for, $f(\mathbf{x}; \theta)$ be a multivariate function with parameter $\theta$, $\mathbf{x}_1, \dots, \mathbf{x}_n$ be a list of vectors corresponding to our dataset, $\eta > 0$ be a specified learning rate, and $k$ be the number of minibatches that we wish to use. Let $\mathcal{B} = \{B_1, \dots, B_k\}$ be the set of batches of $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$.

    We update the value of $\theta$ using the following algorithm:

    • Shuffle the entries of the dataset $\mathbf{x}_1, \dots, \mathbf{x}_n$.

    • $\mathcal{B} = \{B_1, \dots, B_k\}$ is the set obtained by batching $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ into $k$ minibatches.

    • Perform the following update for each batch, where $B_j$ is the $j$-th batch (a code sketch follows below):

      $$\theta \leftarrow \theta - \frac{\eta}{|B_j|} \sum_{\mathbf{x} \in B_j} \nabla_\theta f(\mathbf{x}; \theta)$$
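
  • A minimal sketch of the above, assuming we average the per-sample gradients within each batch (the names and toy objective are invented for illustration):

    import numpy as np

    def minibatch_sgd(grad_f, theta0, data, lr=0.01, batch_size=2,
                      n_epochs=100, seed=0):
        # grad_f(x_i, theta): gradient of the per-sample loss w.r.t. theta.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        n = len(data)
        for _ in range(n_epochs):
            order = rng.permutation(n)               # shuffle the entries
            for start in range(0, n, batch_size):    # walk over the batches B_j
                batch = data[order[start:start + batch_size]]
                # Average the per-sample gradients over the batch, then step once.
                g = np.mean([grad_f(x, theta) for x in batch], axis=0)
                theta = theta - lr * g
        return theta

    # Same toy problem as before: fit the mean under the loss (theta - x)^2.
    data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    theta_hat = minibatch_sgd(lambda x, th: 2 * (th - x), theta0=0.0, data=data)
    print(theta_hat)  # approaches the sample mean, 3.5

    Because the per-sample gradients within a batch can be computed in one vectorized step, this recovers much of the speed lost to SGD's serial updates while still updating $\theta$ many times per pass over the data.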