If $\phi$ is a convex functional on a (possibly infinite-dimensional) vector space $V$, then we say $\phi_{\mathrm{sg}}$ is a subgradient for $\phi$ at $v \in V$ if $\phi_{\mathrm{sg}}$ is a linear functional on $V$, and we have
\[
\phi(z) \geq \phi(v) + \phi_{\mathrm{sg}}(z - v) \quad \text{for all } z \in V. \tag{13.3}
\]
The subdifferential $\partial\phi(v)$ consists of all subgradients of $\phi$ at $v$; note that it is a set of linear functionals on $V$.
If $V = \mathbf{R}^n$, then every linear functional on $V$ has the form $z \mapsto g^T z$ for some vector $g \in \mathbf{R}^n$, and our two definitions of subgradient are therefore the same, provided we ignore the distinction between the vector $g \in \mathbf{R}^n$ and the linear functional on $\mathbf{R}^n$ given by the inner product with $g$.
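As a concrete numerical check of the finite-dimensional case, the following Python snippet verifies the subgradient inequality (13.3) for the $\ell_1$ norm; the function, the point $x$, and the particular subgradient $g$ are illustrative choices of our own, not taken from the text. (For $f(x) = \|x\|_1$, any $g$ with $g_i = \operatorname{sign}(x_i)$ where $x_i \neq 0$ and $g_i \in [-1,1]$ where $x_i = 0$ is a subgradient.)

```python
import numpy as np

# Illustrative example (not from the text): for f(x) = ||x||_1, the vector
# g below is a subgradient at x.  We check the defining inequality
#     f(z) >= f(x) + g^T (z - x)   for all z
# on a batch of random test points.

rng = np.random.default_rng(0)

def f(x):
    return np.abs(x).sum()

x = np.array([1.0, -2.0, 0.0])
g = np.array([1.0, -1.0, 0.3])   # 0.3 is in [-1, 1] for the zero component

for _ in range(1000):
    z = rng.normal(size=3) * 5
    assert f(z) >= f(x) + g @ (z - x) - 1e-12

print("subgradient inequality holds at all test points")
```

Changing the third component of $g$ to any value outside $[-1, 1]$ would make the check fail, since such a $g$ is no longer a subgradient.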
13.1.2  Quasigradients
For quasiconvex functions, there is a concept analogous to the subgradient. Suppose $\phi : \mathbf{R}^n \to \mathbf{R}$ is quasiconvex, which we recall from section 6.2.2 means that
\[
\phi(\lambda x + (1-\lambda)\tilde{x}) \leq \max\{\phi(x), \phi(\tilde{x})\} \quad \text{for all } 0 \leq \lambda \leq 1, \; x, \tilde{x} \in \mathbf{R}^n.
\]
We say that $g$ is a quasigradient for $\phi$ at $x$ if
\[
\phi(z) \geq \phi(x) \quad \text{whenever } g^T(z - x) \geq 0. \tag{13.4}
\]
This simply means that the hyperplane $g^T(z - x) = 0$ forms a simple cut for $\phi$, exactly as in figure 13.2: if we are searching for a minimizer of $\phi$, we can rule out the half-space $g^T(z - x) > 0$.
If $\phi$ is differentiable and $\nabla\phi(x) \neq 0$, then $\nabla\phi(x)$ is a quasigradient; if $\phi$ is convex, then (13.2) shows that any subgradient is also a quasigradient. It can be shown that every quasiconvex function has at least one quasigradient at every point. Note that the length of a quasigradient is irrelevant (for our purposes): all that matters is its direction, or equivalently, the cutting-plane for $\phi$ that it determines.
Any algorithm for convex optimization that uses only the cutting-planes that are determined by subgradients will also work for quasiconvex functions, if we substitute quasigradients for subgradients. It is not possible to form any deep-cut for a quasiconvex function.
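To sketch how a quasigradient yields a valid cut, the following Python check uses $f(x) = \sqrt{|x|}$, a quasiconvex but nonconvex scalar function of our own choosing (its sublevel sets are intervals, but it has no subgradient at most points). At $x = 4$, $g = 1$ is a quasigradient, and we verify condition (13.4) on a grid.

```python
import math

# Illustrative example (not from the text): f(x) = sqrt(|x|) is
# quasiconvex but not convex.  At x = 4 the scalar g = 1 is a
# quasigradient: whenever g*(z - x) >= 0 we must have f(z) >= f(x).
# That half-space is exactly what the simple cut rules out when
# searching for a minimizer.

def f(x):
    return math.sqrt(abs(x))

x, g = 4.0, 1.0
for i in range(-100, 101):
    z = i / 5.0               # grid over [-20, 20]
    if g * (z - x) >= 0:      # the half-space ruled out by the cut
        assert f(z) >= f(x)

print("quasigradient cut (13.4) verified on the grid")
```

Note that, as the text observes, only the direction of $g$ matters here: replacing $g = 1$ by any positive multiple leaves the cut, and the check, unchanged.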
In the infinite-dimensional case, we will say that a linear functional $\phi_{\mathrm{qg}}$ on $V$ is a quasigradient for the quasiconvex functional $\phi$ at $v \in V$ if
\[
\phi(z) \geq \phi(v) \quad \text{whenever } \phi_{\mathrm{qg}}(z - v) \geq 0.
\]
As discussed above, this agrees with our definition for $V = \mathbf{R}^n$, provided we do not distinguish between vectors and the associated inner-product linear functionals.
13.1.3  Subgradients and Directional Derivatives
In this section we briefly discuss the directional derivative, a concept of differential calculus that is more familiar than the subgradient. We will not use this concept in the optimization algorithms we present in the next chapter; we mention it because it is used in descent methods, the most common algorithms for optimization.
We define the directional derivative of $\phi$ at $x$ in the direction $\delta x$ as
\[
\phi'(x, \delta x) = \lim_{h \searrow 0} \frac{\phi(x + h\,\delta x) - \phi(x)}{h}
\]
(the notation $h \searrow 0$ means that $h$ converges to 0 from above). It can be shown
that for convex $\phi$ this limit always exists. Of course, if $\phi$ is differentiable at $x$, then
\[
\phi'(x, \delta x) = \nabla\phi(x)^T \delta x.
\]
We say that $\delta x$ is a descent direction for $\phi$ at $x$ if $\phi'(x, \delta x) < 0$.
The directional derivative tells us how $\phi$ changes if $x$ is moved slightly in the direction $\delta x$, since for small $h$,
\[
\phi\left( x + h\,\frac{\delta x}{\|\delta x\|} \right) \approx \phi(x) + h\,\frac{\phi'(x, \delta x)}{\|\delta x\|}.
\]
The steepest descent direction of $\phi$ at $x$ is defined as
\[
\delta x_{\mathrm{sd}} = \operatorname*{argmin}_{\|\delta x\| = 1} \phi'(x, \delta x).
\]
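Continuing the small example above (our own, not from the text), the steepest descent direction can be approximated by brute force: for $f(x) = \max(x_1, x_2)$ at $x = (1, 1)$, minimizing $f'(x, d) = \max(d_1, d_2)$ over the unit circle gives $d = (-1/\sqrt{2}, -1/\sqrt{2})$.

```python
import math

# Brute-force sketch: minimize the directional derivative
# f'(x, d) = max(d1, d2) of f(x) = max(x1, x2) at x = (1, 1)
# over unit vectors d = (cos t, sin t).

best, best_d = float("inf"), None
for k in range(100000):
    t = 2 * math.pi * k / 100000
    d = (math.cos(t), math.sin(t))
    val = max(d)                 # exact f'(x, d) at the kink
    if val < best:
        best, best_d = val, d

# The minimizer is d = (-1/sqrt(2), -1/sqrt(2)), with value -1/sqrt(2).
assert abs(best - (-1 / math.sqrt(2))) < 1e-3
assert abs(best_d[0] - best_d[1]) < 1e-2
print("steepest descent direction approx:", best_d)
```

This illustrates the remark that follows: finding the steepest descent direction is an optimization problem in its own right, whereas producing a single subgradient is usually much cheaper.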
In general the directional derivatives, descent directions, and the steepest descent direction of $\phi$ at $x$ can be described in terms of the subdifferential at $x$ (see the Notes and References at the end of the chapter). In many cases it is considerably more difficult to find a descent direction or the steepest descent direction of $\phi$ at $x$ than a single subgradient of $\phi$ at $x$.
If $\phi$ is differentiable at $x$, and $\nabla\phi(x) \neq 0$, then $-\nabla\phi(x)$ is a descent direction for $\phi$ at $x$. It is not true, however, that the negative of any nonzero subgradient provides a descent direction: we can have $g \in \partial\phi(x)$, $g \neq 0$, but $-g$ not a descent direction for $\phi$ at $x$. As an example, the level curves of a convex function $\phi$ are shown in figure 13.4(a), together with a point $x$ and a nonzero subgradient $g$. Note that $\phi$ increases for any movement along the directions $\pm g$, so, in particular, $-g$ is not a descent direction. Negatives of the subgradients at non-optimal points are, however, descent directions for the distance to a (or any) minimizer, i.e., if $\phi(x^\star) = \min_z \phi(z)$ and $\psi(z) = \|z - x^\star\|$, then $\psi'(x, -g) < 0$ for any $g \in \partial\phi(x)$. Thus, moving slightly in the direction $-g$ will decrease the distance to the (any) minimizer $x^\star$, as shown in figure 13.4(b).
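These two properties can be seen numerically in a small example of our own (not the one in figure 13.4): for $f(x) = |x_1| + 2|x_2|$, with unique minimizer $x^\star = 0$, the vector $g = (1, 2)$ is a subgradient at $x = (1, 0)$, yet a small step along $-g$ increases $f$ while decreasing the distance to $x^\star$.

```python
import math

# Illustrative example: f(x) = |x1| + 2*|x2| has minimizer x* = 0.
# At x = (1, 0), g = (1, 2) is a subgradient (the second component may
# be anything in [-2, 2] since x2 = 0).  A small step along -g is NOT
# a descent direction for f, but it DOES decrease ||x - x*||.

def f(x):
    return abs(x[0]) + 2 * abs(x[1])

x, g, t = (1.0, 0.0), (1.0, 2.0), 0.01
step = (x[0] - t * g[0], x[1] - t * g[1])

assert f(step) > f(x)                      # f increases along -g ...
assert math.hypot(*step) < math.hypot(*x)  # ... but distance to 0 drops
print("f:", f(x), "->", f(step), "; dist:",
      math.hypot(*x), "->", math.hypot(*step))
```

This is exactly the behavior exploited in the next chapter: the iterates approach a minimizer even though the functional values need not decrease monotonically.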
A consequence of these properties is that the optimization algorithms described
in the next chapter do not necessarily generate sequences of decreasing functional
values (as would a descent method).