Newton's Method: Multivariable Minimization

Minimizing a function of several variables is one of the important topics in numerical methods, because many problems require us to find the point at which a function attains its smallest value with respect to several parameters at once. Newton’s method is often used for exactly these kinds of problems. Why is this approach so important? The reason is that it uses not only the value of the function, but also its first- and second-order derivatives. As a result, it becomes possible to build a more accurate local approximation of the minimum point. In this article, we will gradually examine the main idea of the method, the role of the gradient and the Hessian matrix, and also derive the main iterative formula.

Newton’s Method: What Is the Main Idea Behind It

Suppose we are given a function of several variables

\[
f(\bar{x})=f(x_1,x_2,\dots,x_n),
\]

where \( \bar{x} \) is the vector of variables. We need to find a point at which the function attains a local minimum. But how do we move toward this point in practice? This is exactly where Newton’s method is used.

Its main idea is to replace the original function near the current point \( \bar{x}_k \) with a simpler local model. In particular, a quadratic approximation is used, obtained from the second-order Taylor expansion of the function. So, at each step, the method works not with the whole function directly, but with its quadratic model near the current point.

Then, near the point \( \bar{x}_k \), the function \( f(\bar{x}) \) is approximately written as

\[
f(\bar{x}) \approx f(\bar{x}_k)
+ \nabla f(\bar{x}_k)^T \cdot (\bar{x}-\bar{x}_k)
+ \frac{1}{2}\cdot (\bar{x}-\bar{x}_k)^T \cdot H(\bar{x}_k)\cdot (\bar{x}-\bar{x}_k).
\]

In this expression:

\( \nabla f(\bar{x}_k) \) is the gradient of the function at the point \( \bar{x}_k \), which describes the direction of the fastest change of the function.
\( H(\bar{x}_k) \) is the Hessian matrix at the point \( \bar{x}_k \), which describes the curvature of the function near this point.

So, the linear term in the formula takes into account the direction in which the function changes, while the quadratic term reflects how the surface of the function bends near \( \bar{x}_k \). That is exactly why the quadratic model gives much more information than an ordinary linear approximation.

Gradient and Hessian Matrix: What Quantities the Method Uses

First of all, to build the method, we need to know the gradient of the function. The gradient is a vector made up of the first partial derivatives:

\[
\nabla f(\bar{x})=
\begin{pmatrix}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{pmatrix}.
\]

It shows how the function changes with respect to each variable. However, for Newton’s method, this alone is not enough. We also need to take into account how the shape of the function’s surface changes. That is why second-order derivatives are used here.

The matrix of second partial derivatives is called the Hessian matrix:

\[
H(\bar{x})=
\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}.
\]

So, the Hessian matrix describes the curvature of the function. And this is very important, because to find a minimum, it is not enough to know only the direction in which the function decreases. We also need to understand how the surface behaves near the current point. That is why Newton’s method, unlike first-order methods, uses not only the gradient but also second-order derivatives.

However, one more point is important here. If the Hessian matrix at the current point is positive definite, then the quadratic model has a local minimum, and the Newton step will be directed toward decreasing the function. If the Hessian matrix is not positive definite, the resulting direction may fail to lead to a minimum. So, the properties of the Hessian matrix strongly affect how the method works.

If the second derivatives are difficult to find analytically, the elements of the Hessian matrix can be computed approximately using finite-difference formulas. For example, for the diagonal elements, the following approximation is used:

\[
H_{ii}(\bar{x}) \approx
\frac{f(x_1,\dots,x_i+h,\dots,x_n)-2\cdot f(x_1,\dots,x_i,\dots,x_n)+f(x_1,\dots,x_i-h,\dots,x_n)}{h^2}.
\]

For the mixed derivatives, that is, when \( i \ne j \), we can use the following approximation:

\[
H_{ij}(\bar{x}) \approx
\frac{1}{4\cdot h\cdot k}
\Bigl(
f(x_1,\dots,x_i+h,\dots,x_j+k,\dots,x_n)-f(x_1,\dots,x_i+h,\dots,x_j-k,\dots,x_n)
\\[6pt]
-f(x_1,\dots,x_i-h,\dots,x_j+k,\dots,x_n)+f(x_1,\dots,x_i-h,\dots,x_j-k,\dots,x_n)
\Bigr).
\]

So, even without exact analytical computation of the second derivatives, the Hessian matrix can often be built numerically. And then a natural question arises: how exactly do we use these quantities to obtain the formula for moving to a new point?

Newton’s Method: How the Main Iterative Formula Is Derived

Now let us move to the main step. Let us denote the increment of the argument by

\[
\Delta \bar{x}_k=\bar{x}_{k+1}-\bar{x}_k.
\]

Then the quadratic model of the function near the point \( \bar{x}_k \) can be rewritten as

\[
f(\bar{x}_{k+1}) \approx f(\bar{x}_k)
+ \nabla f(\bar{x}_k)^T \cdot \Delta \bar{x}_k
+ \frac{1}{2}\cdot \Delta \bar{x}_k^T \cdot H(\bar{x}_k)\cdot \Delta \bar{x}_k.
\]

Next, we need to find such an increment \( \Delta \bar{x}_k \) that this quadratic model reaches its minimum value. In other words, at the current step we are looking not for the minimum of the original function as a whole, but for the minimum of its local approximation near the point \( \bar{x}_k \).

To do this, we find the gradient of the quadratic model with respect to the vector \( \Delta \bar{x}_k \) and set it equal to zero:

\[
\nabla f(\bar{x}_k)+H(\bar{x}_k)\cdot \Delta \bar{x}_k=0.
\]

From this, we obtain:

\[
H(\bar{x}_k)\cdot \Delta \bar{x}_k=-\nabla f(\bar{x}_k).
\]

So, at each step of the method, we need to solve a system of linear algebraic equations in which the coefficient matrix is the Hessian matrix, and the right-hand side is the negative gradient.

If the matrix \( H(\bar{x}_k) \) is invertible, then

\[
\Delta \bar{x}_k=-H(\bar{x}_k)^{-1}\cdot \nabla f(\bar{x}_k).
\]

Now let us substitute this expression into the update formula

\[
\bar{x}_{k+1}=\bar{x}_k+\Delta \bar{x}_k.
\]

As a result, we obtain the main iterative formula:

\[
\bar{x}_{k+1}=\bar{x}_k-H(\bar{x}_k)^{-1}\cdot \nabla f(\bar{x}_k).
\]

This formula is the foundation of Newton’s method for minimizing a function of several variables. It shows how to move from the current approximation to the next one by using both the gradient and the Hessian matrix at the same time. On the one hand, the gradient gives the direction of the local change of the function. On the other hand, the Hessian matrix takes the curvature of the surface into account. So, the move to the new point becomes more accurate than in methods that rely only on first-order derivatives.

If the function is quadratic, then under suitable conditions the method can lead to the solution very quickly. But in the general case, the process is carried out step by step:

\[
\bar{x}_0,\ \bar{x}_1,\ \bar{x}_2,\ \dots,\ \bar{x}_k,\ \dots
\]

The iterations continue until one of the stopping conditions is satisfied, for example,

\[
\|\nabla f(\bar{x}_k)\| \le \varepsilon,
\]

where \( \varepsilon>0 \) is a preset accuracy.

This condition means that at the point \( \bar{x}_k \), the norm of the gradient is already small enough. So, near this point, the function changes only slightly, and the computation process can be stopped. After that, the last point found, \( \bar{x}_k \), is taken as an approximate value of the local minimum point, and the value \( f(\bar{x}_k) \) is taken as an approximate minimum value of the function.

So, in problems of multivariable minimization, Newton’s method is based on a step-by-step scheme: first, a quadratic model of the function is built near the current point, then the minimum of this model is found, and after that the method moves to a new approximation. It is exactly this structure that makes it one of the most important second-order methods.

Practical Part: How Newton’s Method Works in a Specific Example

The theoretical scheme of the method becomes much clearer when we look at it through a specific problem. That is why we will now go step by step through all the main stages for a function of two variables. This makes it easier to see how the formula of Newton’s method is applied in practice.

Example 1. Find the minimum value of the function \( f(x_1,x_2)=x_1^2+x_2^2-x_1\cdot x_2-2\cdot x_1-4\cdot x_2+6 \) with accuracy \( \varepsilon=0.001 \), using Newton’s method

Let us consider the function

\[
f(x_1,x_2)=x_1^2+x_2^2-x_1\cdot x_2-2\cdot x_1-4\cdot x_2+6.
\]

For this function, we will apply the formula of Newton’s method step by step. Let us take the initial approximation

\[
\bar{x}_0=
\begin{pmatrix}
0\\
0
\end{pmatrix},
\]

because this point is simple for calculations and convenient for starting the iterative process.

First, let us find the gradient of the function:

\[
\nabla f(x_1,x_2)=
\begin{pmatrix}
\frac{\partial f}{\partial x_1}\\
\frac{\partial f}{\partial x_2}
\end{pmatrix}
=
\begin{pmatrix}
2\cdot x_1-x_2-2\\
2\cdot x_2-x_1-4
\end{pmatrix}.
\]

Next, let us find the Hessian matrix:

\[
H(x_1,x_2)=
\begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\partial x_2}
\\[6pt]
\frac{\partial^2 f}{\partial x_2\partial x_1} & \frac{\partial^2 f}{\partial x_2^2}
\end{pmatrix}
=
\begin{pmatrix}
2 & -1\\
-1 & 2
\end{pmatrix}.
\]

We can see that the Hessian matrix is constant, which means it does not depend on the point. Now let us calculate the gradient at the initial point \( \bar{x}_0=(0,0)^T \):

\[
\nabla f(0,0)=
\begin{pmatrix}
2\cdot 0-0-2\\
2\cdot 0-0-4
\end{pmatrix}
=
\begin{pmatrix}
-2\\
-4
\end{pmatrix}.
\]

Let us find the inverse of the Hessian matrix:

\[
H^{-1}=
\begin{pmatrix}
2 & -1\\
-1 & 2
\end{pmatrix}^{-1}
=
\frac{1}{3}
\cdot
\begin{pmatrix}
2 & 1\\
1 & 2
\end{pmatrix}.
\]

Then, by the formula of Newton’s method

\[
\bar{x}_{k+1}=\bar{x}_k-H^{-1}(\bar{x}_k)\cdot \nabla f(\bar{x}_k)
\]

for the first iteration we get

\[
\bar{x}_1=\bar{x}_0-H^{-1}\cdot \nabla f(\bar{x}_0).
\]

Now let us substitute the values we found:

\[
\bar{x}_1=
\begin{pmatrix}
0\\
0
\end{pmatrix}
–
\frac{1}{3}
\cdot
\begin{pmatrix}
2 & 1\\
1 & 2
\end{pmatrix}
\cdot
\begin{pmatrix}
-2\\
-4
\end{pmatrix}.
\]

First, let us multiply the matrix by the vector:

\[
\begin{pmatrix}
2 & 1\\
1 & 2
\end{pmatrix}
\cdot
\begin{pmatrix}
-2\\
-4
\end{pmatrix}
=
\begin{pmatrix}
2\cdot(-2)+1\cdot(-4)\\
1\cdot(-2)+2\cdot(-4)
\end{pmatrix}
=
\begin{pmatrix}
-8\\
-10
\end{pmatrix}.
\]

So,

\[
\bar{x}_1=
\begin{pmatrix}
0\\
0
\end{pmatrix}
–
\frac{1}{3}
\begin{pmatrix}
-8\\
-10
\end{pmatrix}
=
\begin{pmatrix}
\frac{8}{3}\\
\frac{10}{3}
\end{pmatrix}.
\]

Now let us find the gradient at the point \( \bar{x}_1 \):

\[
\nabla f\left(\frac{8}{3},\frac{10}{3}\right)=
\begin{pmatrix}
2\cdot\frac{8}{3}-\frac{10}{3}-2
\\[6pt]
2\cdot\frac{10}{3}-\frac{8}{3}-4
\end{pmatrix}
=
\begin{pmatrix}
0\\
0
\end{pmatrix}.
\]

So,

\[
\|\nabla f(\bar{x}_1)\|=0.
\]

Since

\[
0<0.001,
\]

the iterative process can be stopped after the very first step.

So, we take the last point found as the approximate value of the minimum point:

\[
\bar{x}^*=
\begin{pmatrix}
\frac{8}{3}\\
\frac{10}{3}
\end{pmatrix}.
\]

Since the Hessian matrix

\[
H=
\begin{pmatrix}
2 & -1\\
-1 & 2
\end{pmatrix}
\]

is positive definite, the stationary point found corresponds to a local minimum. Moreover, for this quadratic function, this minimum is also global.

Now let us calculate the minimum value of the function:

\[
f_{\min}\approx f\left(\frac{8}{3},\frac{10}{3}\right).
\]

Substituting the values found, we get:

\[
f\left(\frac{8}{3},\frac{10}{3}\right)
=
\left(\frac{8}{3}\right)^2+\left(\frac{10}{3}\right)^2-\frac{8}{3}\cdot\frac{10}{3}-2\cdot\frac{8}{3}-4\cdot\frac{10}{3}+6.
\]

Now let us calculate step by step:

\[
f\left(\frac{8}{3},\frac{10}{3}\right)
=
\frac{64}{9}+\frac{100}{9}-\frac{80}{9}-\frac{16}{3}-\frac{40}{3}+6.
\]

Let us rewrite the expression using a common denominator:

\[
f\left(\frac{8}{3},\frac{10}{3}\right)=\frac{64+100-80-48-120+54}{9}=\frac{-30}{9}=-\frac{10}{3}.
\]

So, using Newton’s method, we obtain

\[
\bar{x}^*=
\begin{pmatrix}
\frac{8}{3}\\
\frac{10}{3}
\end{pmatrix},
\qquad
f_{\min}=-\frac{10}{3}.
\]

Thus, this example consistently shows all the main steps of Newton’s method: choosing the starting point, calculating the gradient and the Hessian matrix, finding the new approximation, checking the accuracy condition, and computing the minimum value of the function at the point found.

What to Explore Next: Topics for Further Study

We have now looked at Newton’s method for a function of several variables in a clear and step-by-step way. But where should you go next? A very natural next step is to explore other approaches that help you better understand the main features of multivariable minimization.

Newton’s Method with a Parameter: How the Step Changes — This article will explain how an additional parameter affects the movement toward the minimum and why, in some cases, it can make the method more stable.
Coordinate Descent: How Alternating Search Works — Here, you will see how the minimum is found step by step with respect to individual variables and why this approach is convenient in many problems.
Gradient Descent: How to Move in the Direction of Decrease — This topic will discuss one of the best-known minimization methods, where new approximations are found by moving in the direction of the steepest decrease of the function.

Newton’s Method and Code: Try Building Your Own Program

The diagram below is not only an addition to the article, but also a very practical hint for independent work. Why not use it as a basis and create a small application in your favorite programming language that finds the minimum value of a function of several variables using Newton’s method? When you turn a mathematical idea into code, the topic becomes much clearer: formulas stop being only theoretical material and turn into a clear sequence of actions. Besides that, it is a good way to test yourself in practice and to see how the method works for different examples.

Newton’s Method: Theory and Practice of Multivariable Minimization