The radial basis function network (RBFN) is a special type of neural network
with several distinctive features
[Park and Sandberg, 1991; Poggio and Girosi, 1989; Ghosh and Nag, 2000; Mitchell, 1997; Orr, 1996; Kecman, 2001].
Since its first proposal, the RBFN has attracted a high degree of interest in
research communities. An RBFN consists of three layers, namely the input
layer, the hidden layer, and the output layer. The input layer broadcasts the
coordinates of the input vector to each of the nodes in the hidden layer. Each
node in the hidden layer then produces an activation based on the associated
radial basis function. Finally, each node in the output layer computes a
linear combination of the activations of the hidden nodes. How an RBFN reacts
to a given input stimulus is completely determined by the activation functions
associated with the hidden nodes and the weights associated with the links
between the hidden layer and the output layer. The general mathematical form
of the output nodes in an RBFN is as follows:
$$g_j(\mathbf{x}) = \sum_{i=1}^{k} w_{ji}\,\phi\big(\|\mathbf{x}-\boldsymbol{\mu}_i\|;\,\sigma_i\big), \qquad j = 1, 2, \dots, c,$$

where $g_j$ is the function corresponding to the $j$-th output unit (class-$j$) and is a linear combination of $k$ radial basis functions $\phi$ with centers $\boldsymbol{\mu}_i$ and bandwidths $\sigma_i$. Also, $\mathbf{w}_j$ is the weight vector of class-$j$, and $w_{ji}$ is the weight corresponding to the $j$-th class and the $i$-th center. The general architecture of an RBFN is shown below.
[Figure: General architecture of radial basis function networks]
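To make the forward computation concrete, here is a minimal sketch (not QuickRBF's actual code; it assumes Gaussian kernels, the usual choice) of the hidden-layer activations and the linear output layer:

```python
import numpy as np

def rbf_activations(x, centers, sigma):
    """Hidden-layer outputs: one Gaussian kernel per center,
    phi_i(x) = exp(-||x - mu_i||^2 / (2 * sigma_i^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distance to each center
    return np.exp(-d2 / (2.0 * sigma ** 2))   # shape (k,)

def rbfn_outputs(x, centers, sigma, W):
    """Output-layer responses g_j(x) = w_j . h(x), one per class."""
    h = rbf_activations(x, centers, sigma)    # h(x), shape (k,)
    return W.T @ h                            # shape (c,); W is k x c

# Toy example: k = 3 centers in 2-D, c = 2 classes.
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
sigma = np.full(3, 5.0)                       # one bandwidth per center
W = np.random.randn(3, 2)                     # weights (learned in practice)
print(rbfn_outputs(np.array([0.5, 0.5]), centers, sigma, W))
```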
We can see that constructing an RBFN involves determining the number of centers, $k$, the center locations, $\boldsymbol{\mu}_i$, the bandwidth of each center, $\sigma_i$, and the weights, $w_{ji}$. That is, training an RBFN involves determining the values of three sets of parameters, the centers ($\boldsymbol{\mu}_i$), the bandwidths ($\sigma_i$), and the weights ($w_{ji}$), in order to minimize a suitable cost function.
In the QuickRBF package, we focus on the calculation of the weights, so we adopt the simplest methods to determine the centers and bandwidths. Accordingly, the package only offers a tool that selects the centers randomly. Likewise, we employ the simplest bandwidth scheme, using a fixed bandwidth for every kernel function, which we set to 5.
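A minimal sketch of this strategy (the function name and interface are illustrative, not QuickRBF's actual API):

```python
import numpy as np

def select_centers(X, k, seed=0):
    """Pick k training instances at random to serve as kernel centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

X = np.random.randn(100, 4)          # 100 training instances, 4 features
centers = select_centers(X, k=10)    # 10 randomly chosen centers
sigma = 5.0                          # fixed bandwidth for every kernel
```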
After the centers and bandwidths of the kernel functions in the hidden layer have been determined, the transformation between the inputs and the corresponding outputs of the hidden units is fixed. The network can thus be viewed as an equivalent single-layer network with linear output units. We then use the least mean square error (LMSE) method to determine the weights associated with the links between the hidden layer and the output layer.
Assume $\mathbf{h}(\mathbf{x})$ is the output of the hidden layer:

$$\mathbf{h}(\mathbf{x}) = \big[\phi_1(\mathbf{x}),\, \phi_2(\mathbf{x}),\, \dots,\, \phi_k(\mathbf{x})\big]^T,$$

where $k$ is the number of centers and $\phi_1(\mathbf{x})$ is the output value of the first kernel function with input $\mathbf{x}$. Then, the discriminant function $g_j(\mathbf{x})$ of class-$j$ can be expressed as follows:

$$g_j(\mathbf{x}) = \mathbf{w}_j^T\, \mathbf{h}(\mathbf{x}), \qquad j = 1, \dots, c,$$

where $c$ is the number of classes and $\mathbf{w}_j$ is the weight vector of class-$j$. We can write $\mathbf{w}_j$ as:

$$\mathbf{w}_j = \big[w_{j1},\, w_{j2},\, \dots,\, w_{jk}\big]^T.$$
After calculating the discriminant function value of each class, we choose the class with the largest discriminant function value as the classification result. We discuss how to obtain the weight vectors with the least mean square error method in the following sections.
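In code, the decision rule is straightforward once the weight matrix is available; a self-contained sketch (Gaussian kernels assumed, names illustrative):

```python
import numpy as np

def classify(x, centers, sigma, W):
    """Assign x to the class whose discriminant g_j(x) = w_j . h(x) is largest."""
    h = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))  # h(x)
    g = W.T @ h                    # discriminant values g_1(x), ..., g_c(x)
    return int(np.argmax(g))       # index of the class with the largest g_j(x)
```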
For a classification problem with $c$ classes, let $\mathbf{e}_j$ designate the $j$-th column vector of a $c \times c$ identity matrix, and let $W$ be a $k \times c$ matrix of weights:

$$W = \big[\mathbf{w}_1,\, \mathbf{w}_2,\, \dots,\, \mathbf{w}_c\big].$$
Then the objective function to be minimized is

$$J(W) = \sum_{j=1}^{c} P_j\, E_j\!\left[\big\|W^T \mathbf{h}(\mathbf{x}) - \mathbf{e}_j\big\|^2\right],$$

where $P_j$ and $E_j$ are the a priori probability and the expected value of class-$j$, respectively. To find the optimal $W$ that minimizes $J(W)$, we set the gradient of $J(W)$ to zero:

$$\frac{\partial J(W)}{\partial W} = 2 \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\big(W^T \mathbf{h}(\mathbf{x}) - \mathbf{e}_j\big)^T\right] = O, \tag{1}$$

where $O$ is a $k \times c$ null matrix.
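Carrying the expectation through the two terms of Eq. 1 and rearranging yields the normal equations, which the next step states compactly:

$$\sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\,\mathbf{h}(\mathbf{x})^T\right] W = \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\right]\mathbf{e}_j^T.$$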
Let $R_j$ denote the class-conditional matrix of the second-order moments of $\mathbf{h}(\mathbf{x})$, i.e.

$$R_j = E_j\!\left[\mathbf{h}(\mathbf{x})\,\mathbf{h}(\mathbf{x})^T\right].$$

If $R$ denotes the matrix of the second-order moments under the mixture distribution, we have

$$R = \sum_{j=1}^{c} P_j R_j.$$

Then Eq. 1 becomes

$$R\,W = \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\right]\mathbf{e}_j^T. \tag{2}$$

If $R$ is nonsingular, the optimal $W$ can be calculated by

$$W = R^{-1} \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\right]\mathbf{e}_j^T.$$
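In practice the expectations are estimated from the training set: with $H$ the $n \times k$ matrix of hidden-layer outputs and $Y$ the $n \times c$ matrix of one-hot target vectors $\mathbf{e}_{y_i}^T$, we have $R \approx H^T H / n$ and the right-hand side $\approx H^T Y / n$. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def moment_matrices(H, y, c):
    """Estimate R = E[h h^T] and the right-hand side of Eq. 2 from data.

    H : (n, k) hidden-layer outputs for the n training instances
    y : (n,)   integer class labels in {0, ..., c-1}
    """
    n = len(H)
    Y = np.eye(c)[y]      # one-hot targets: row i is e_{y_i}^T
    R = H.T @ H / n       # empirical second-moment matrix, (k, k)
    B = H.T @ Y / n       # empirical right-hand side, (k, c)
    return R, B
```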
However, there is a critical drawback of this method: $R$ may be singular, and that would crash the whole procedure. Observe that the matrix $\mathbf{h}(\mathbf{x})\,\mathbf{h}(\mathbf{x})^T$ is a symmetric positive semi-definite (PSD) matrix of rank 1. Since $R$ is a weighted sum of $\mathbf{h}(\mathbf{x}_i)\,\mathbf{h}(\mathbf{x}_i)^T$ over the $n$ training instances, $R$ is also a PSD matrix with rank $\leq n$. A PSD matrix may nevertheless be singular, so we add a regularization term to make sure the matrix is invertible.
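A quick numerical illustration of the problem and the fix: with fewer training instances than centers ($n < k$), the empirical $R$ is rank-deficient, while $R + \lambda I$ is invertible (the value of $\lambda$ below is arbitrary, for demonstration only):

```python
import numpy as np

n, k, lam = 5, 10, 1e-3
H = np.random.randn(n, k)      # n hidden-layer output vectors, n < k
R = H.T @ H / n                # PSD with rank <= n, hence singular here

print(np.linalg.matrix_rank(R))                    # 5  -> singular (k = 10)
print(np.linalg.matrix_rank(R + lam * np.eye(k)))  # 10 -> invertible
```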
Regularization theory addresses this by replacing the objective function with

$$J_\lambda(W) = \sum_{j=1}^{c} P_j\, E_j\!\left[\big\|W^T \mathbf{h}(\mathbf{x}) - \mathbf{e}_j\big\|^2\right] + \lambda\,\|W\|^2,$$

where $\lambda$ is the regularization parameter. Then Eq. 2 becomes

$$(R + \lambda I)\,W = \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\right]\mathbf{e}_j^T. \tag{3}$$

If we set $\lambda > 0$, $R + \lambda I$ will be a positive definite (PD) matrix and therefore nonsingular, so the optimal $W$ can be calculated by

$$W = (R + \lambda I)^{-1} \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\right]\mathbf{e}_j^T.$$
Moreover, a PD matrix has many good properties, one of which is that it admits a special and efficient triangular decomposition, the Cholesky decomposition. Using the Cholesky decomposition, we can factor the matrix $R + \lambda I$ as

$$R + \lambda I = L\,L^T,$$

where $L$ is a lower triangular matrix. Then Eq. 3 becomes

$$L\,L^T W = \sum_{j=1}^{c} P_j\, E_j\!\left[\mathbf{h}(\mathbf{x})\right]\mathbf{e}_j^T,$$

and we can solve this linear system efficiently by using substitution twice: a forward substitution with $L$ followed by a back substitution with $L^T$.
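A minimal sketch of this solve step (with $R$, $B$, and $\lambda$ as in the earlier sketches; SciPy's `cho_solve` performs exactly the two triangular substitutions):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_weights(R, B, lam):
    """Solve (R + lam*I) W = B via Cholesky factorization.

    cho_factor computes L with R + lam*I = L L^T; cho_solve then runs the
    forward substitution with L and the back substitution with L^T.
    """
    k = R.shape[0]
    factor = cho_factor(R + lam * np.eye(k), lower=True)
    return cho_solve(factor, B)    # W, shape (k, c)
```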
Finally, we obtain the optimal $\mathbf{w}_j$ for class-$j$ from $W$, and the optimal discriminant function $g_j(\mathbf{x})$ for class-$j$ follows. By using regularization theory, the optimal weights can be obtained analytically and efficiently.
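Putting the pieces together, here is an end-to-end sketch of the procedure described above (illustrative code, not QuickRBF itself; it assumes Gaussian kernels, random center selection, the fixed bandwidth 5, and a small regularization parameter):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def train_rbfn(X, y, k=10, sigma=5.0, lam=1e-3, seed=0):
    """Train an RBFN classifier: random centers, fixed bandwidth, LMSE weights."""
    rng = np.random.default_rng(seed)
    c = int(y.max()) + 1
    centers = X[rng.choice(len(X), size=k, replace=False)]      # random centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k) distances^2
    H = np.exp(-d2 / (2.0 * sigma ** 2))                        # hidden outputs
    n = len(X)
    R = H.T @ H / n                                             # second moments
    B = H.T @ np.eye(c)[y] / n                                  # right-hand side
    W = cho_solve(cho_factor(R + lam * np.eye(k), lower=True), B)
    return centers, W

def predict(X, centers, W, sigma=5.0):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.argmax(H @ W, axis=1)    # class with the largest discriminant wins
```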