Radial Basis Function Network

The radial basis function network (RBFN) is a special type of neural network with several distinctive features [Park and Sandberg, 1991, Poggio and Girosi, 1989, Ghosh and Nag, 2000, Mitchell, 1997, Orr, 1996, Kecman, 2001]. Since its first proposal, the RBFN has attracted a high degree of interest in the research community. An RBFN consists of three layers: the input layer, the hidden layer, and the output layer. The input layer broadcasts the coordinates of the input vector to each node in the hidden layer. Each node in the hidden layer then produces an activation based on its associated radial basis function. Finally, each node in the output layer computes a linear combination of the activations of the hidden nodes. How an RBFN responds to a given input is completely determined by the activation functions associated with the hidden nodes and by the weights on the links between the hidden layer and the output layer. The general mathematical form of the output nodes in an RBFN is as follows: $$c_{j}(x)=\sum_{i=1}^{k}w_{ji}\,\phi\!\left(\frac{\left\Vert x-\mu_{i}\right\Vert }{\sigma_{i}}\right),\qquad j=1,\dots,m,$$ where $c_{j}(x)$ is the function corresponding to the $j$-th output unit (class-$j$) and is a linear combination of $k$ radial basis functions $\phi(\cdot)$ with centers $\mu_{i}$ and bandwidths $\sigma_{i}$. Also, $w_{j}$ is the weight vector of class-$j$ and $w_{ji}$ is the weight corresponding to the $j$-th class and the $i$-th center. The general architecture of an RBFN is shown below.

Figure: General architecture of radial basis function networks.


From the figure, we can see that constructing an RBFN involves determining the number of centers $k$, the center locations $\mu_{i}$, the bandwidth of each center $\sigma_{i}$, and the weights $w_{ji}$. In other words, training an RBFN means determining the values of three sets of parameters, the centers ($\mu_{i}$), the bandwidths ($\sigma_{i}$), and the weights ($w_{ji}$), so as to minimize a suitable cost function.
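As a minimal sketch of this forward computation (assuming Gaussian kernels and purely illustrative helper names; an actual implementation may use a different kernel form):

```python
import numpy as np

def rbf_activations(x, centers, bandwidths):
    # Hidden-layer outputs phi_i(x) for one input vector x of shape (d,).
    # Gaussian kernel assumed: phi_i(x) = exp(-||x - mu_i||^2 / (2 * sigma_i^2)).
    dists = np.linalg.norm(centers - x, axis=1)          # ||x - mu_i|| for i = 1..k
    return np.exp(-dists ** 2 / (2.0 * bandwidths ** 2))

def rbfn_outputs(x, centers, bandwidths, W):
    # Output-layer values c_j(x) = w_j^T h, where W is the k x m weight matrix
    # whose j-th column is w_j.
    h = rbf_activations(x, centers, bandwidths)          # shape (k,)
    return W.T @ h                                       # shape (m,)
```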

Least Mean Square Error Method

In the QuickRBF package, we focus on the calculation of the weights, so we adopt the simplest methods for determining the centers and bandwidths. The package therefore only offers a tool that selects the centers randomly. Likewise, we use a fixed bandwidth for every kernel function and set it to 5.
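A sketch of these two choices, assuming the centers are drawn at random from the training instances (helper names are hypothetical):

```python
import numpy as np

def select_centers(X, k, seed=None):
    # Randomly pick k distinct training instances (rows of the n x d matrix X)
    # to serve as the centers mu_1, ..., mu_k.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

def fixed_bandwidths(k, sigma=5.0):
    # Every kernel function gets the same fixed bandwidth (5, as in the text).
    return np.full(k, sigma)
```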

After the centers and bandwidths of the kernel functions in the hidden layer have been determined, the transformation between the inputs and the corresponding outputs of the hidden units is fixed. The network can thus be viewed as an equivalent single-layer network with linear output units. We then use the least mean square error (LMSE) method to determine the weights associated with the links between the hidden layer and the output layer.

Let $h$ be the output vector of the hidden layer, $$h=\left[\phi_{1}(x),\phi_{2}(x),\dots,\phi_{k}(x)\right]^{T},$$ where $k$ is the number of centers and $\phi_{i}(x)$ is the output value of the $i$-th kernel function for input $x$. The discriminant function $c_{j}(x)$ of class-$j$ can then be expressed as $$c_{j}(x)=w_{j}^{T}h,\qquad j=1,\dots,m,$$ where $m$ is the number of classes and $w_{j}$ is the weight vector of class-$j$, $$w_{j}=\left[w_{j1},w_{j2},\dots,w_{jk}\right]^{T}.$$

After calculating the discriminant function value of each class, we choose the class with the largest discriminant value as the classification result. We discuss how to obtain the weight vectors with the least mean square error method in what follows.
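Combining the two steps, a minimal classification sketch (again assuming Gaussian kernels; names are illustrative):

```python
import numpy as np

def classify(x, centers, bandwidths, W):
    # h = [phi_1(x), ..., phi_k(x)]^T, c_j(x) = w_j^T h, prediction = argmax_j c_j(x).
    dists = np.linalg.norm(centers - x, axis=1)
    h = np.exp(-dists ** 2 / (2.0 * bandwidths ** 2))
    c = W.T @ h                        # discriminant value of every class
    return int(np.argmax(c))           # index of the class with the largest c_j(x)
```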

For a classification problem with $m$ classes, let $V_{j}$ denote the $j$-th column vector of the $m\times m$ identity matrix and let $W$ be a $k\times m$ matrix of weights: $$W=\left[w_{1},w_{2},\dots,w_{m}\right].$$ The objective function to be minimized is then $$J(W)=\sum_{j=1}^{m}P_{j}\,E_{j}\!\left\{ \left\Vert W^{T}h-V_{j}\right\Vert ^{2}\right\},$$ where $P_{j}$ and $E_{j}\{\cdot\}$ are the a priori probability and the class-conditional expectation of class-$j$, respectively.
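In practice, the a priori probabilities and class-conditional expectations are replaced by their sample estimates ($P_{j}=n_{j}/n$ and class-wise sample means), so $J(W)$ reduces to the average squared error between $W^{T}h$ and the one-hot target over the whole training set. A sketch of this empirical objective, assuming $H$ is the $n\times k$ matrix of hidden-layer outputs and $y$ holds zero-based class indices:

```python
import numpy as np

def empirical_objective(H, y, W, m):
    # Sample estimate of J(W): mean of ||W^T h_i - V_{y_i}||^2 over all instances.
    V = np.eye(m)[y]                        # n x m matrix of one-hot targets V_{y_i}
    residual = H @ W - V                    # i-th row is (W^T h_i - V_{y_i})^T
    return np.mean(np.sum(residual ** 2, axis=1))
```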

To find the optimal $W$ that minimizes $J$, we set the gradient of $J(W)$ to zero: $$\nabla_{W}J(W)=2\sum_{j=1}^{m}P_{j}\,E_{j}\!\left\{ h\left(W^{T}h-V_{j}\right)^{T}\right\} =[0],\qquad\text{(Equation 1)}$$ where $[0]$ is a $k\times m$ null matrix.

Let $K_{i}$ denote the class-conditional matrix of second-order moments of $h$, i.e. $$K_{i}=E_{i}\!\left\{ hh^{T}\right\}.$$ If $K$ denotes the matrix of second-order moments under the mixture distribution, we have $$K=\sum_{i=1}^{m}P_{i}K_{i}.$$ Then Equation 1 becomes $$KW=M,\qquad\text{(Equation 2)}$$ where $$M=\left[P_{1}E_{1}\{h\},\;P_{2}E_{2}\{h\},\;\dots,\;P_{m}E_{m}\{h\}\right].$$ If $K$ is nonsingular, the optimal $W$ can be calculated by $$W=K^{-1}M.$$
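Under the same sample estimates as above, $K\approx\frac{1}{n}H^{T}H$ and the $j$-th column of $M$ is $P_{j}E_{j}\{h\}\approx\frac{1}{n}\sum_{i:y_{i}=j}h_{i}$, i.e. $M\approx\frac{1}{n}H^{T}V$. A sketch of this unregularized computation (it fails when $K$ is singular, as discussed next):

```python
import numpy as np

def normal_equations(H, y, m):
    # Empirical K (k x k) and M (k x m) of Equation 2, using P_j = n_j / n
    # and sample means in place of the class-conditional expectations.
    n = len(y)
    V = np.eye(m)[y]                  # one-hot targets, n x m
    K = H.T @ H / n                   # second-order moment matrix of h
    M = H.T @ V / n                   # j-th column approximates P_j E_j{h}
    return K, M

# Unregularized solution of K W = M (valid only when K is nonsingular):
# K, M = normal_equations(H, y, m)
# W = np.linalg.solve(K, M)
```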

However, this method has a critical drawback: $K$ may be singular, in which case the whole procedure breaks down. Observe that each matrix $hh^{T}$ is a symmetric positive semi-definite (PSD) matrix of rank one. Since the empirical $K$ is a sum of $hh^{T}$ terms over the training instances, $K$ is also a PSD matrix of rank at most $n$, the number of training instances. A PSD matrix may still be singular, so we add a regularization term to guarantee that the matrix is invertible.

Regularization theory replaces the objective function with $$J_{\lambda}(W)=\sum_{j=1}^{m}P_{j}\,E_{j}\!\left\{ \left\Vert W^{T}h-V_{j}\right\Vert ^{2}\right\} +\lambda\left\Vert W\right\Vert ^{2},$$ where $\lambda$ is the regularization parameter. Equation 2 then becomes $$\left(K+\lambda I\right)W=M.\qquad\text{(Equation 3)}$$
If we set $\lambda>0$, $(K+\lambda I)$ is a positive definite (PD) matrix and is therefore nonsingular. The optimal $W^{\ast}$ can be calculated by $$W^{\ast}=\left(K+\lambda I\right)^{-1}M.$$
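A direct sketch of this regularized solution (the default value of $\lambda$ below is only an illustrative choice):

```python
import numpy as np

def solve_regularized(K, M, lam=1e-6):
    # Solve (K + lambda * I) W = M; any lambda > 0 makes the matrix positive definite.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), M)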

Moreover, a PD matrix has many useful properties, one of which is that it admits an efficient triangular factorization, the Cholesky decomposition. Using the Cholesky decomposition, we can factor the matrix $(K+\lambda I)$ as $$K+\lambda I=LL^{T},$$ where $L$ is a lower triangular matrix. Equation 3 then becomes $$LL^{T}W=M.$$

This linear system can be solved efficiently with two triangular solves: a forward substitution for $LY=M$ followed by a back substitution for $L^{T}W=Y$.
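The same solution via the Cholesky factorization, with the two triangular solves made explicit (a sketch using NumPy's Cholesky routine and SciPy's triangular solver):

```python
import numpy as np
from scipy.linalg import solve_triangular

def solve_regularized_cholesky(K, M, lam=1e-6):
    A = K + lam * np.eye(K.shape[0])
    L = np.linalg.cholesky(A)                    # A = L L^T, L lower triangular
    Y = solve_triangular(L, M, lower=True)       # forward substitution: L Y = M
    W = solve_triangular(L.T, Y, lower=False)    # back substitution: L^T W = Y
    return W
```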

Finally, we obtain the optimal $w_{j}^{\ast}$ for class-$j$ from $W^{\ast}$, which yields the optimal discriminant function $c_{j}(x)$ for class-$j$. With regularization, the optimal weights can thus be obtained analytically and efficiently.