Gradient Descent Method: Theory and Example

The gradient descent method is one of the basic numerical methods used to find the minimum of a function of several variables. It is often applied when we need to move step by step toward a point where the value of the function becomes smaller. In this article, we will look at the main idea of the method, write down the iterative formula, explain the role of the step length, and show how the gradient can be calculated when partial derivatives are not given explicitly.

Gradient Descent Method: The Main Idea of Minimization

Consider a function of several variables

\[
f(\bar{x})=f(x_1,x_2,\dots,x_n),
\]

where $ \bar{x} $ is a vector of variables.

To minimize such a function means to find a point $ \bar{x} $ at which the value of $ f(\bar{x}) $ is the smallest, or at least locally the smallest in a certain region. The word “locally” is important here. In general, the gradient descent method does not always find the global minimum. It usually leads to a local minimum, which depends on the starting point and on the properties of the function.

To understand where to move, we use the gradient of the function. For a function of several variables, it has the form

\[
\nabla f(\bar{x})=
\left(
\frac{\partial f}{\partial x_1},
\frac{\partial f}{\partial x_2},
\dots,
\frac{\partial f}{\partial x_n}
\right).
\]

Each element of this vector is a partial derivative. In other words, it shows how the function changes when one variable changes while all the other variables remain fixed.

The gradient points in the direction of the fastest increase of the function. But we need to reduce the value of the function, right? That is why we should move in the opposite direction: $ -\nabla f(\bar{x}) $. This direction is called the negative gradient direction. It shows which way we should take a step so that the value of the function decreases as quickly as possible near the current point.

So, at the first stage, it is important to remember one simple idea: the method starts from a certain point, finds the gradient at that point, and then moves in the direction opposite to it.

Initial Approximation: How the Iterative Process Is Built

Now let’s move from the main idea to the algorithm. First, we need to choose an initial approximation:

\[
\bar{x}^{(0)}=
\left(
x_1^{(0)},
x_2^{(0)},
\dots,
x_n^{(0)}
\right).
\]

This is the starting point from which the calculations begin. The choice of this point matters. If the function is complex and has several local minima, then different starting points may lead to different results.

After choosing $ \bar{x}^{(0)} $, we calculate the gradient at this point: $ \nabla f(\bar{x}^{(0)}) $. Then we take the first step in the direction of the negative gradient:

\[
\bar{x}^{(1)}
=
\bar{x}^{(0)}-\alpha\cdot \nabla f(\bar{x}^{(0)}),
\qquad \alpha>0.
\]

Here, $ \alpha $ is the current step length. It shows how far we move from the starting point in the direction where the function decreases.

After that, we obtain a new point

\[
\bar{x}^{(1)}=
\left(
x_1^{(1)},
x_2^{(1)},
\dots,
x_n^{(1)}
\right).
\]

At the next iteration, we no longer work with the initial point, but with the new approximation. So we calculate the gradient again, now at the point $ \bar{x}^{(1)} $, and move to the point $ \bar{x}^{(2)} $:

\[
\bar{x}^{(2)}
=
\bar{x}^{(1)}-\alpha\cdot \nabla f(\bar{x}^{(1)}),
\qquad \alpha>0.
\]

In this way, step by step, we build a sequence of approximations. In general form, it can be written as follows:

\[
\bar{x}^{(k+1)}
=
\bar{x}^{(k)}-\alpha\cdot \nabla f(\bar{x}^{(k)}),
\qquad k=0,1,2,\dots
\]

where ( \alpha ) is the current step length used when moving from $ \bar{x}^{(k)} $ to $ \bar{x}^{(k+1)} $.

This formula is the foundation of the method. It shows how to move from the current point $ \bar{x}^{(k)} $ to the next point $ \bar{x}^{(k+1)} $. At the same time, each move is made with the direction of function decrease in mind.

Step Length: How to Avoid an Unsuccessful Move

After writing the iterative formula, an important practical question appears: how should we choose the step length $ \alpha $? After all, it determines how far we move in the negative gradient direction.

If the step is too small, the method may move very slowly. But if the step is too large, we may move too far and get a larger value instead of a smaller one. That is why, after each move, it is useful to check whether the function has really decreased:

\[
f(\bar{x}^{(k+1)})<f(\bar{x}^{(k)}).
\]

If this condition is satisfied, then the new point can be accepted. In this case, the current value of $ \alpha $ can be kept for the next iteration.

But if the value of the function has not decreased, or has even increased, then the step length must be reduced. One simple way to do this is to divide the step by two:

\[
\alpha \leftarrow \frac{\alpha}{2}.
\]

After that, the new point is calculated again, but now with a smaller step:

\[
\bar{x}^{(k+1)}
=
\bar{x}^{(k)}-\alpha\cdot \nabla f(\bar{x}^{(k)}).
\]

This check helps make the method more reliable. We do not simply move mechanically to a new point. Instead, we control whether we are really moving in the direction where the function decreases.

However, sooner or later, the changes become very small. That is why we need a stopping condition. For example, we can stop the calculations if the norm of the gradient becomes smaller than a given accuracy:

\[
\|\nabla f(\bar{x}^{(k)})\|<\varepsilon.
\]

We can also use the condition that two neighboring approximations are close to each other:

\[
\|\bar{x}^{(k+1)}-\bar{x}^{(k)}\|<\varepsilon.
\]

Or we can use the condition that the change in the function value is small:

\[
|f(\bar{x}^{(k+1)})-f(\bar{x}^{(k)})|<\varepsilon.
\]

Here, $ \varepsilon $ is the given accuracy of the calculations.

From a theoretical point of view, if a local minimum is reached inside the region, then the following condition holds at that point:

\[
\nabla f(\bar{x})=0.
\]

However, it is important to remember one thing: this condition is necessary, but not always sufficient. A point with a zero gradient can be not only a minimum point, but also a maximum point or a saddle point. So the result should be evaluated while taking the properties of the function into account.

Calculating the Gradient: What to Do Without Explicit Derivatives

In simple educational examples, partial derivatives are often found analytically. In other words, we take the given function, differentiate it with respect to each variable, and obtain a ready-made formula for the gradient.

But what should we do if the partial derivatives are difficult to find? Or if the function is given in such a way that its value can be calculated, but the explicit form of the derivatives is inconvenient? In this case, the elements of the gradient can be calculated approximately using numerical methods.

For the variable $ x_i $, we can use the forward difference formula:

\[
\frac{\partial f}{\partial x_i}\approx
\frac{
f(x_1,\dots,x_i+\Delta x_i,\dots,x_n)-
f(x_1,\dots,x_i,\dots,x_n)
}{
\Delta x_i
},
\qquad i=1,2,\dots,n.
\]

Here, $ \Delta x_i $ is a small increment of the variable $ x_i $. The idea is simple: we slightly change one variable, keep all the others unchanged, and look at how the value of the function changes.

For greater accuracy, the central difference formula is sometimes used:

\[
\frac{\partial f}{\partial x_i}\approx
\frac{
f(x_1,\dots,x_i+\Delta x_i,\dots,x_n)-
f(x_1,\dots,x_i-\Delta x_i,\dots,x_n)
}{
2\Delta x_i
}.
\]

This formula usually gives a more accurate approximation, but it requires more calculations of function values. So, in practice, the choice of method depends on what is more important: speed or accuracy.

Therefore, the gradient descent method can be used not only when the gradient is given explicitly. It can also be applied in numerical form if we know how to calculate the value of the function itself.

Practical Example: Gradient Descent Method for a Function of Two Variables

Now let’s look at how the gradient descent method works for a specific function of two variables. In this example, we need not only to write down the formula, but also to calculate the gradient step by step, find new approximations, and check the stopping condition. This is exactly where we can clearly see how the theoretical scheme turns into a real computational process.

Example 1. Find the minimum value of the function $ f(x_1,x_2)=(x_1-1)^2+(x_2+2)^2 $ with accuracy $ \varepsilon=0.05 $ using the gradient descent method

$Graph of the function $ f(x_1,x_2)=(x_1-1)^2+(x_2+2)^2 $$

Consider the function:

\[
f(x_1,x_2)=(x_1-1)^2+(x_2+2)^2.
\]

This is a function of two variables, $ x_1 $ and $ x_2 $. It has a simple analytical form, so it is convenient for showing how the gradient descent method is applied step by step.

Let us take the initial approximation

\[
\bar{x}^{(0)}=(3,2).
\]

We also choose the current step length $ \alpha=0.25 $. We will check the stopping condition using the rule

\[
\|\bar{x}^{(k+1)}-\bar{x}^{(k)}\|<0.05.
\]

For the gradient descent method, we use the formula

\[
\bar{x}^{(k+1)}
=
\bar{x}^{(k)}
-\alpha\cdot \nabla f(\bar{x}^{(k)}).
\]

First, let us find the gradient of the given function. To do this, we calculate the partial derivatives:

\[
\frac{\partial f}{\partial x_1}=2\cdot (x_1-1),
\qquad
\frac{\partial f}{\partial x_2}=2\cdot (x_2+2).
\]

Therefore,

\[
\nabla f(x_1,x_2)=
\left(
2\cdot (x_1-1),
2\cdot (x_2+2)
\right).
\]

We begin the first iteration. At the initial point $ \bar{x}^{(0)}=(3,2) $, the gradient is

\[
\nabla f(3,2)=
\left(
2\cdot (3-1),
2\cdot (2+2)
\right)
=
(4,8).
\]

Then we calculate the new approximation as follows:

\[
\bar{x}^{(1)}
=
(3,2)-0.25\cdot(4,8).
\]

Hence,

\[
\bar{x}^{(1)}
=
(3,2)-(1,2)=(2,0).
\]

Now let’s check whether the value of the function has decreased. At the initial point, we have

\[
f(3,2)=(3-1)^2+(2+2)^2=20.
\]

At the new point, we get

\[
f(2,0)=(2-1)^2+(0+2)^2=5.
\]

Since

\[
5<20,
\]

the move was successful. Therefore, the current value $ \alpha=0.25 $ can be kept for the next iteration.

Now let’s check the accuracy condition:

\[
\|\bar{x}^{(1)}-\bar{x}^{(0)}\|
=
\sqrt{(2-3)^2+(0-2)^2}.
\]

So,

\[
\|\bar{x}^{(1)}-\bar{x}^{(0)}\|
=\sqrt{1+4}
=\sqrt{5}
\approx 2.236.
\]

Since

\[
2.236>0.05,
\]

we continue the calculations.

We move on to the second iteration. We take the point

\[
\bar{x}^{(1)}=(2,0).
\]

We calculate the gradient at this point:

\[
\nabla f(2,0)=
\left(
2\cdot (2-1),
2\cdot (0+2)
\right)
=
(2,4).
\]

Then

\[
\bar{x}^{(2)}
=
(2,0)-0.25\cdot(2,4).
\]

Therefore,

\[
\bar{x}^{(2)}
=
(2,0)-(0.5,1)=(1.5,-1).
\]

The function value at this point is

\[
f(1.5,-1)=(1.5-1)^2+(-1+2)^2.
\]

We get

\[
f(1.5,-1)=0.5^2+1^2=1.25.
\]

Since

\[
1.250<5,
\]

the second step is also successful. In this example, after each completed move, the value of the function decreases, so there is no need to reduce the step $ \alpha $.

Let’s check the accuracy:

\[
\|\bar{x}^{(2)}-\bar{x}^{(1)}\|
=
\sqrt{(1.5-2)^2+(-1-0)^2}
\approx 1.118.
\]

Since

\[
1.118>0.05,
\]

we move on to the next iteration.

Continuing the Iterative Process

For the third iteration, we have

\[
\bar{x}^{(2)}=(1.5,-1).
\]

We calculate the gradient:

\[
\nabla f(1.5,-1)=
\left(
2\cdot (1.5-1),
2\cdot (-1+2)
\right)
=
(1,2).
\]

Then

\[
\bar{x}^{(3)}
=
(1.5,-1)-0.25\cdot(1,2).
\]

Hence,

\[
\bar{x}^{(3)}
=
(1.5,-1)-(0.25,0.5)=(1.25,-1.5).
\]

Now let’s calculate the function value:

\[
f(1.25,-1.5)=(1.25-1)^2+(-1.5+2)^2.
\]

So,

\[
f(1.25,-1.5)=0.25^2+0.5^2=0.313.
\]

We have

\[
0.313<1.25.
\]

Therefore, the value of the function has decreased again.

Now let’s check the accuracy condition:

\[
\|\bar{x}^{(3)}-\bar{x}^{(2)}\|
=
\sqrt{(1.25-1.5)^2+(-1.5+1)^2}
\approx 0.559.
\]

Since

\[
0.559>0.05,
\]

we continue the calculations.

For the fourth iteration, we take

\[
\bar{x}^{(3)}=(1.25,-1.5).
\]

We find the gradient:

\[
\nabla f(1.25,-1.5)=
\left(
2\cdot (1.25-1),
2\cdot (-1.5+2)
\right)
=
(0.5,1).
\]

Then

\[
\bar{x}^{(4)}
=
(1.25,-1.5)-0.25\cdot(0.5,1).
\]

We get

\[
\bar{x}^{(4)}
=
(1.25,-1.5)-(0.125,0.25)=(1.125,-1.75).
\]

The function value is

\[
f(1.125,-1.75)=(1.125-1)^2+(-1.75+2)^2.
\]

After calculation, we have

\[
f(1.125,-1.75)=0.125^2+0.25^2=0.078.
\]

Since

\[
0.078<0.313,
\]

we accept the move.

Now we check the accuracy:

\[
\|\bar{x}^{(4)}-\bar{x}^{(3)}\|
=
\sqrt{(1.125-1.25)^2+(-1.75+1.5)^2}
\approx 0.280.
\]

Since

\[
0.280>0.05,
\]

we need to perform one more iteration.

For the fifth iteration, we have

\[
\bar{x}^{(4)}=(1.125,-1.75).
\]

We calculate the gradient:

\[
\nabla f(1.125,-1.75)=
\left(
2\cdot (1.125-1),
2\cdot (-1.75+2)
\right)
=
(0.25,0.5).
\]

Then

\[
\bar{x}^{(5)}
=
(1.125,-1.75)-0.25\cdot(0.25,0.5).
\]

Hence,

\[
\bar{x}^{(5)}
=
(1.125,-1.75)-(0.063,0.125)=(1.063,-1.875).
\]

The function value at this point is

\[
f(1.063,-1.875)=(1.063-1)^2+(-1.875+2)^2.
\]

Therefore,

\[
f(1.063,-1.875)
\approx 0.063^2+0.125^2
\approx 0.02.
\]

We have

\[
0.02<0.078,
\]

so we accept the point $ \bar{x}^{(5)} $.

Now we check the stopping condition:

\[
\|\bar{x}^{(5)}-\bar{x}^{(4)}\|
=
\sqrt{(1.063-1.125)^2+(-1.875+1.75)^2}
\approx 0.14.
\]

Since

\[
0.14>0.05,
\]

we do not stop the iterative process yet.

Next, we perform two more iterations in the same sequence: calculate the gradient, take a step in the negative gradient direction, check the decrease of the function, and check the accuracy condition.

For the sixth iteration, we take

\[
\bar{x}^{(5)}=(1.063,-1.875).
\]

The gradient at this point is

\[
\nabla f(1.063,-1.875)
\approx
\left(
2\cdot (1.063-1),
2\cdot (-1.875+2)
\right)
=
(0.125,0.25).
\]

Then

\[
\bar{x}^{(6)}
=
(1.063,-1.875)-0.25\cdot(0.125,0.25).
\]

We obtain

\[
\bar{x}^{(6)}
=
(1.063,-1.875)-(0.031,0.063)
=
(1.031,-1.938).
\]

The function value is

\[
f(1.031,-1.938)
=
(1.031-1)^2+(-1.938+2)^2
\approx 0.005.
\]

Again, we see that the value of the function has decreased:

\[
0.005<0.02.
\]

Now let’s check the accuracy:

\[
\|\bar{x}^{(6)}-\bar{x}^{(5)}\|
=
\sqrt{(1.031-1.063)^2+(-1.938+1.875)^2}
\approx 0.07.
\]

Since

\[
0.07>0.05,
\]

we perform one more iteration.

For the seventh iteration, we have

\[
\bar{x}^{(6)}=(1.031,-1.938).
\]

We calculate the gradient:

\[
\nabla f(1.031,-1.938)
\approx
\left(
2\cdot (1.031-1),
2\cdot (-1.938+2)
\right)
=
(0.063,0.125).
\]

Then

\[
\bar{x}^{(7)}
=
(1.031,-1.938)-0.25\cdot(0.063,0.125).
\]

Hence,

\[
\bar{x}^{(7)}
=
(1.031,-1.938)-(0.016,0.031)
=
(1.016,-1.969).
\]

Now let’s calculate the function value:

\[
f(1.016,-1.969)
=
(1.016-1)^2+(-1.969+2)^2
\approx 0.001.
\]

Since

\[
0.001<0.005,
\]

the last move also decreased the value of the function.

Now we check the accuracy condition again:

\[
\|\bar{x}^{(7)}-\bar{x}^{(6)}\|
=
\sqrt{(1.016-1.031)^2+(-1.969+1.938)^2}
\approx 0.035.
\]

Since

\[
0.035<0.05,
\]

the iterative process can be stopped.

The last approximation found is

\[
\bar{x}^{(7)}=(1.016,-1.969).
\]

Therefore, the approximate minimum point can be written as

\[
\bar{x}^{*}\approx(1.016,-1.969).
\]

Now let’s find the approximate minimum value of the function:

\[
f_{\min}\approx f(1.016,-1.969).
\]

Substitute the found approximation:

\[
f(1.016,-1.969)
=
(1.016-1)^2+(-1.969+2)^2.
\]

Therefore,

\[
f(1.016,-1.969)
=
0.016^2+0.031^2
\approx 0.001.
\]

Thus, using the gradient descent method, we obtain

\[
\bar{x}^{*}\approx(1.016,-1.969),
\qquad
f_{\min}\approx 0.001.
\]

For convenience, let’s summarize the main iteration results in one table. It clearly shows how the point gradually approaches the minimum, while the value of the function decreases.

$ k $	$ \bar{x}^{(k)} $	$ f(\bar{x}^{(k)}) $	$ \\|\bar{x}^{(k)}-\bar{x}^{(k-1)}\\| $
$ 0 $	$ (3,2) $	$ 20 $	—
$ 1 $	$ (2,0) $	$ 5 $	$ 2.236 $
$ 2 $	$ (1.5,-1) $	$ 1.25 $	$ 1.118 $
$ 3 $	$ (1.25,-1.5) $	$ 0.313 $	$ 0.559 $
$ 4 $	$ (1.125,-1.75) $	$ 0.078 $	$ 0.28 $
$ 5 $	$ (1.063,-1.875) $	$ 0.02 $	$ 0.14 $
$ 6 $	$ (1.031,-1.938) $	$ 0.005 $	$ 0.07 $
$ 7 $	$ (1.016,-1.969) $	$ 0.001 $	$ 0.035 $

Comparison with the Exact Answer

Let’s check how the obtained result agrees with the exact answer. To do this, we return to the function itself:

\[
f(x_1,x_2)=(x_1-1)^2+(x_2+2)^2.
\]

The square of any number cannot be negative. Therefore, the value of the function will be the smallest when both expressions in parentheses are equal to zero:

\[
x_1-1=0,
\qquad
x_2+2=0.
\]

From this, we get

\[
x_1=1,
\qquad
x_2=-2.
\]

So, the exact minimum point has the form

\[
\bar{x}^{*}=(1,-2).
\]

The corresponding minimum value of the function is

\[
f_{\min}=f(1,-2).
\]

Substitute the exact values:

\[
f(1,-2)=(1-1)^2+(-2+2)^2=0.
\]

Therefore,

\[
\bar{x}^{*}=(1,-2),
\qquad
f_{\min}=0.
\]

The approximation obtained by the gradient descent method,

\[
\bar{x}^{*}\approx(1.016,-1.969)
\]

is close to the exact point $ (1,-2) $. So the result of the calculations agrees well with the exact answer. The method started from the point $ (3,2) $, consistently reduced the value of the function, and stopped the calculations when the distance between two neighboring approximations became smaller than the given accuracy.

What to Explore Next: Useful Topics for Further Study

The gradient descent method clearly shows the basic idea of moving toward a minimum with the help of the gradient. However, numerical optimization also includes other approaches. These methods help us better compare different ways of minimizing functions of several variables.

Coordinate Descent Method: Step-by-Step Minimization — This article will discuss how to find a minimum by changing separate coordinates of a function of several variables one after another.
Newton’s Method: Minimizing a Function of Several Variables — This material will show how minimization uses not only the gradient, but also information about the curvature of the function.
Newton’s Method with a Parameter: Controlled Movement Toward a Minimum — This article will explain how the step parameter helps make Newton’s method more controlled during minimization.

Gradient Descent Method in Code: From a Scheme to Your Own Program

After the theory and the practical example, it is worth taking one more useful step — trying to implement the method yourself. The ready-made flowchart from this section will help turn the mathematical idea into a clear programming algorithm for finding the minimum value of a function of several variables using the gradient descent method.

You can use it as a basis and write a small program in your favorite programming language. This way, you will not only check the calculation result, but also better understand how the choice of the starting point, step length, and stopping condition affects the work of the method.

\( k \)	\( \bar{x}^{(k)} \)	\( f(\bar{x}^{(k)}) \)	\( \\|\bar{x}^{(k)}-\bar{x}^{(k-1)}\\| \)
\( 0 \)	\( (3,2) \)	\( 20 \)	—
\( 1 \)	\( (2,0) \)	\( 5 \)	\( 2.236 \)
\( 2 \)	\( (1.5,-1) \)	\( 1.25 \)	\( 1.118 \)
\( 3 \)	\( (1.25,-1.5) \)	\( 0.313 \)	\( 0.559 \)
\( 4 \)	\( (1.125,-1.75) \)	\( 0.078 \)	\( 0.28 \)
\( 5 \)	\( (1.063,-1.875) \)	\( 0.02 \)	\( 0.14 \)
\( 6 \)	\( (1.031,-1.938) \)	\( 0.005 \)	\( 0.07 \)
\( 7 \)	\( (1.016,-1.969) \)	\( 0.001 \)	\( 0.035 \)

Gradient Descent Method: How to Reduce a Function Value Step by Step

Gradient Descent Method: The Main Idea of Minimization

Initial Approximation: How the Iterative Process Is Built

Step Length: How to Avoid an Unsuccessful Move

Calculating the Gradient: What to Do Without Explicit Derivatives

Practical Example: Gradient Descent Method for a Function of Two Variables

Example 1. Find the minimum value of the function \( f(x_1,x_2)=(x_1-1)^2+(x_2+2)^2 \) with accuracy \( \varepsilon=0.05 \) using the gradient descent method

Continuing the Iterative Process

Comparison with the Exact Answer

What to Explore Next: Useful Topics for Further Study

Gradient Descent Method in Code: From a Scheme to Your Own Program

Leave a Reply Cancel reply