\documentclass[conference]{IEEEtran}
% If the IEEEtran.cls has not been installed into the LaTeX system files,
% manually specify the path to it: e.g.,
% \documentclass[conference]{../sty/IEEEtran}

\usepackage{graphicx,times,psfig,amsmath} % Add all your packages here

% correct bad hyphenation here
\hyphenation{op-tical net-works semi-conduc-tor IEEEtran}

\IEEEoverridecommandlockouts    % to create the author's affiliation portion
                % using \thanks

\textwidth 178mm    % <------ These are the adjustments we made 10/18/2005
\textheight 239mm   % You may or may not need to adjust these numbers again
\oddsidemargin -7mm \evensidemargin -7mm \topmargin -6mm \columnsep
5mm

\begin{document}

\title{Predicting protein-protein interface residues using various approaches}
    \author{Seyna}
    \maketitle

\begin{abstract}

More and more protein structures of unknown function have been
determined in recent years. Accurate, reliable, automated predictors
of protein function would help biologists annotate these structures
properly in reasonable time and at reasonable cost. Identifying the
interface between two interacting proteins provides preliminary but
important clues to the function of a protein, and thus helps in
understanding protein networks. In this report we experiment with
various approaches and evaluate their prediction performance. The
results show that combining the sequence profile with information on
non-surface residues, secondary structure information, and some
physical-chemical and conservation properties of residues lets the
prediction cover more of the true interface residues and also
increases the precision.
\end{abstract}

\section{INTRODUCTION}

To decipher the networks of interacting proteins, large-scale
efforts have been made to identify interacting pairs experimentally,
in the wake of the completion of many genomes. Protein-protein
interactions play a pivotal role in protein function. Identifying
the residues at protein-protein interaction sites helps explain the
specificity and strength of these interactions, and has broad
applications ranging from drug design to the analysis of metabolic
and signal transduction networks. However, experimental detection of
residues in protein-protein interaction surfaces requires
determining the structure of protein-protein complexes, and the
determination of complex structures by X-ray and NMR methods cannot
keep pace with the growing number of known protein sequences. Hence,
there is a need for an accurate, reliable computational method.

Over the years, many studies have tried to predict interface
residues based on the characteristics of known protein-protein
interactions, since it has been shown that binding sites share
common properties that distinguish them from the rest of the
protein. For example, hydrophobic residues cluster at some
interfaces (Glaser et al., 2001; Young et al., 1994), while other
interfaces have a significant number of polar residues (Jones and
Thornton, 1996; Lo Conte et al., 1999; Larsen et al., 1998). Jones
and Thornton (1997a) also indicated that shape and solvent
accessibility are useful in distinguishing binding sites from the
rest of the protein surface. Based on these properties, several
methods have been proposed for predicting these sites. For example,
Gallet et al. (2000) analyzed the hydrophobicity distribution around
a target residue, and Jones and Thornton (1997b) analyzed and
combined several physical-chemical properties, such as solvation
potential, residue interface propensity, hydrophobicity, planarity,
protrusion and accessible surface area.

In 2004, Yan et al. proposed a two-stage prediction method for
protein-protein interacting sites. In the first stage, interface
residues are identified primarily on the basis of protein sequence
information: a sliding window of 9 residues produces the training
input to a support vector machine (SVM). The results of the first
stage then serve as the input to the second stage, which uses a
Bayesian network classifier.

In this report, we study the same set of 77 proteins used by Yan et
al. We employ and combine various approaches to predict the
interacting sites between proteins. These approaches use sequence
information, secondary structure information of the proteins, and
some physical-chemical properties. Finally, we apply a support
vector machine (SVM) to obtain the result.

\section{METHOD AND MATERIALS}
\subsection{Datasets}
We extracted individual proteins from a set of 70 protein-protein
heterocomplexes used in the study of Chakrabarti and Janin (2002).
After removal of redundant proteins and molecules with fewer than 10
residues, we obtained a dataset of 77 individual proteins with
sequence identity $<30\%$. These proteins represent six different
categories of protein-protein interfaces, classified according to
the scheme of Chakrabarti and Janin (2002). The six categories are:
antibody-antigen; protease-inhibitor; enzyme complexes; large
protease complexes; G-proteins, cell cycle and signal transduction;
and miscellaneous.

\subsection{\textbf{Definition of surface and interface residues}}
The definition of interface residues used in this study is based on
the reduction of solvent accessible surface area (ASA) upon complex
formation. ASA was computed for each residue in the unbound molecule
(MASA) and in the complex (CASA) using the DSSP program (Kabsch and
Sander, 1983). A residue is defined to be a surface residue if its
MASA is at least $25\%$ of its nominal maximum area as defined by
Rost and Sander (1994). A surface residue is defined to be an
interface residue if its calculated ASA in the complex is less than
that in the monomer by at least 1~\AA$^{2}$ (Jones and Thornton,
1996). Surface residues were extracted and divided into interface
residues and non-interface residues, using structural information
from Protein Data Bank (PDB) files. We obtained a total of 2340
positive examples corresponding to interface residues and 5091
negative examples corresponding to non-interface residues.
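
In symbols, and only as a restatement of the two thresholds above
rather than an additional criterion, residue $i$ is a surface residue
when
\[
\mathrm{MASA}_i \geq 0.25\,\mathrm{ASA}^{\max}_i ,
\]
and a surface residue is an interface residue when
\[
\mathrm{MASA}_i - \mathrm{CASA}_i \geq 1~\mbox{\AA}^{2} .
\]
Here $\mathrm{ASA}^{\max}_i$ is our shorthand, introduced for
illustration, for the nominal maximum area of the residue type at
position $i$. With the counts above, interface residues make up
$2340/(2340+5091) = 2340/7431 \approx 31.5\%$ of the surface
residues, so the two classes are imbalanced.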

\subsubsection{\textbf{Approach using binary sequence encoding (O-seq)}}
In this approach, the input to the SVM is an encoding of the
identities of eleven contiguous amino acid residues, corresponding
to a window containing the target residue and five neighboring
residues on either side of it. Each of the eleven residues in the
window is represented by a 20-bit vector (one bit for each letter of
the 20-letter amino acid alphabet). The SVM classifier produces a
Boolean output, in which 1 denotes an interface residue and 0
denotes a non-interface residue.
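
As a sketch of this encoding (the symbols $a_i$, $e(\cdot)$ and $x_i$
are our notation, introduced only for illustration), let $a_i$ be the
amino acid at position $i$ and let $e(a)\in\{0,1\}^{20}$ be the
vector with a single 1 at the index of $a$. The input for target
residue $i$ is then the concatenation
\[
x_i = \bigl(e(a_{i-5}),\, \ldots,\, e(a_i),\, \ldots,\,
e(a_{i+5})\bigr) \in \{0,1\}^{220},
\]
since $11 \times 20 = 220$. When all eleven window positions fall
inside the sequence, $x_i$ contains exactly eleven 1s. How windows
that extend past the ends of a chain are filled is not specified
here; padding such positions with all-zero vectors is one common
choice.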

\subsubsection{\textbf{Approach using PSSM (L-PSSM)}}
In this approach, we use Position-Specific Iterated BLAST
(PSI-BLAST) to generate a profile, which is produced from local
alignments of the most highly scoring hits in the initial BLAST
results by calculating position-specific scores for every position
in the alignment. A highly conserved position receives a high score,
whereas weakly conserved positions receive scores near zero. We then
apply the standard logistic function to convert each
position-specific score to a value between 0 and 1.
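
For reference, the standard logistic function maps a score $s$ to
\[
\sigma(s) = \frac{1}{1+e^{-s}},
\]
so that, for example, $\sigma(0)=0.5$, $\sigma(5)\approx 0.993$ and
$\sigma(-5)\approx 0.007$. We assume here that the raw
position-specific score is used directly as $s$, without additional
scaling.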

\subsubsection{\textbf{O-seq/L-PSSM with non-surface information (NSO-seq, NSL-PSSM)}}

Besides using the sequence information of the proteins or the
profile calculated by PSI-BLAST, we add non-surface information to
help the SVM make more accurate and sensitive predictions. It is
reasonable to conjecture that a target residue has a higher
probability of being an interface residue if most of its neighbors
in the window are exposed as surface residues. Since the DSSP
program tells us which residues are not exposed as surface residues,
we set all position-specific scores of each non-surface residue to
zero. With a window of eleven residues and 20 values per residue,
the input vector has 220 components.
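
One way to write this masking step, in notation of our own, is with
a surface indicator $m_k$ for each window position $k$:
\[
\tilde{v}_k = m_k\, v_k, \qquad
m_k = \begin{cases}
1, & \text{residue $k$ is a surface residue},\\
0, & \text{otherwise},
\end{cases}
\]
where $v_k$ is the 20-component vector of position $k$ (the binary
encoding for NSO-seq, the transformed PSSM row for NSL-PSSM). The
input keeps its $11 \times 20 = 220$ components; only the blocks of
buried residues are replaced by zeros. Applying the mask to the
binary encoding as well is our reading of the NSO-seq variant, since
the step is described explicitly only for position-specific scores.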

\subsubsection{\textbf{NSL-PSSM with secondary structure information (SNSL-PSSM)}}
In many cases, an interacting region contains only one residue. For
example, because of the shape of the $\alpha$-helix, it is possible
that one residue lies in the interacting region of the protein while
the next one does not. Therefore, we need more structural
information to compensate for the insufficiency of sequence
information alone.

PSIPRED is a simple and reliable secondary structure prediction
method, incorporating two feed-forward neural networks which perform
an analysis on output obtained from PSI-BLAST. In this approach we
use PSIPRED to obtain the secondary structure information of the
proteins. Each residue is represented by a 3-bit vector: 100 stands
for $\alpha$-helix, 010 for $\beta$-strand, and 001 for all other
states.

\subsubsection{\textbf{NSL-PSSM with conservation information (CNSL-PSSM)}}
Besides the position-specific scoring matrix, PSI-BLAST also
reports, for a given target residue, the number of hits to each of
the 20 residue types in the multiple sequence alignment. We call
this number FSSM. For each value $j$ of FSSM we compute the score
\begin{equation}
\label{eq1}
CS_{j}=\frac{\#FSSMI_{j}\times N}{\#FSSMS_{j}}, \qquad j = 0, \ldots, 100,
\end{equation}
where $\#FSSMI_{j}$ is the number of residues with $FSSM=j$ in the
interacting region, $\#FSSMS_{j}$ is the number of residues with
$FSSM=j$ in the non-interacting region, and $N$ is (total number of
surface residues in the non-interacting region)$/$(total number of
surface residues in the interacting region). We have found that
$CS_{j}$ increases slowly when FSSM is above 20.
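
As a rough worked example, and assuming that $N$ is computed from
the surface-residue counts of the dataset described earlier (5091
non-interface and 2340 interface residues), the normalization factor
is
\[
N = \frac{5091}{2340} \approx 2.18 .
\]
Writing $T_I$ and $T_S$ (our notation) for the two totals that enter
$N$, the score can be rearranged as
\[
CS_{j} = \frac{\#FSSMI_{j} / T_I}{\#FSSMS_{j} / T_S},
\]
so $CS_{j}=1$ means that the value $FSSM=j$ occurs in the two regions
in proportion to their sizes, while $CS_{j}>1$ means that it is
over-represented among the residues counted by $\#FSSMI_{j}$.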

We extract FSSM from PSI-BLAST and divide each residue's FSSM by 100
to obtain a feature value between 0 and 1. Furthermore, because
$CS_{j}$ behaves abnormally when FSSM is less than 20, we add a
second, binary feature, in which 1 indicates that FSSM is greater
than 20 and 0 indicates that FSSM is less than 20.

\subsubsection{\textbf{NSL-PSSM with various physical-chemical properties}}
In this approach, we add physical-chemical properties to the
previously generated NSL-PSSM one at a time and evaluate the
outcome. The properties are aliphatic, aromatic, positive, small,
and hydrophobic.

\subsubsection{\textbf{Performance measures}}
Let TP be the number of true positives (residues predicted to be
interface residues that actually are interface residues), FP the
number of false positives (residues predicted to be interface
residues that are in fact not interface residues), TN the number of
true negatives, FN the number of false negatives, and $N = TP + TN +
FP + FN$ the total number of examples.
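
The tables in the Results section report sensitivity, precision,
accuracy (ACC), the Matthews correlation coefficient (MCC) and the
F-score. Their standard definitions, which we assume are the ones
used here, are
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP+FN}, \\
\text{Precision} &= \frac{TP}{TP+FP}, \\
\text{ACC} &= \frac{TP+TN}{N}, \\
\text{MCC} &= \frac{TP\cdot TN - FP\cdot FN}
  {\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}, \\
\text{F-score} &= \frac{2\cdot\text{Precision}\cdot\text{Sensitivity}}
  {\text{Precision}+\text{Sensitivity}} .
\end{align*}
As a check on the last definition, the L-PSSM row of the first
results table gives $2\times 0.324\times 0.558/(0.324+0.558)\approx
0.410$, which matches the reported F-score.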

\section{RESULTS}
\begin{table}[t]
\caption{Performance of the sequence-based encodings with and without non-surface information}
\centering
\begin{tabular}{lrrrrr}
  \hline & Sens. & Prec. & Acc. & MCC & F-score \\
  \hline O-seq & 0.534 & 0.318 & 0.558 & 0.091 & 0.400 \\
  \hline L-PSSM & 0.558 & 0.324 & 0.559 & 0.104 & 0.410 \\
  \hline NSO-seq & 0.561 & 0.360 & 0.606 & 0.166 & 0.438 \\
  \hline NSL-PSSM & 0.587 & 0.361 & 0.601 & 0.174 & 0.447 \\
  \hline
\end{tabular}
\end{table}

\begin{table}[t]
\caption{Performance of NSL-PSSM with conservation or secondary structure information added}
\centering
\begin{tabular}{lrrrrr}
  \hline & Sens. & Prec. & Acc. & MCC & F-score \\
  \hline NSL-PSSM & 0.587 & 0.361 & 0.601 & 0.174 & 0.447 \\
  \hline NSL+CNSL-PSSM & 0.604 & 0.373 & 0.612 & 0.197 & 0.461 \\
  \hline NSL+SNSL-PSSM & 0.583 & 0.368 & 0.611 & 0.183 & 0.451 \\
  \hline
\end{tabular}
\end{table}

\begin{table}[t]
\caption{Performance of the NSL-PSSM baseline}
\centering
\begin{tabular}{lrrrrr}
  \hline & Sens. & Prec. & Acc. & MCC & F-score \\
  \hline NSL-PSSM & 0.587 & 0.361 & 0.601 & 0.174 & 0.447 \\
  \hline
\end{tabular}
\end{table}

\section{CONCLUSION}
We have shown how to extract useful information from the large
amount of data on protein-protein interacting regions by means of
machine learning, and how to improve prediction performance in
identifying interacting residues. In this report, we studied the
composition of protein-protein interacting sites and their
conservation during biological evolution. In addition, we added
secondary structure information.

After many experiments, our results confirm that some biochemical
and biophysical properties contribute to improved prediction
performance.
\begin{enumerate}
\item \textbf{There is a clustering phenomenon in protein-protein
binding sites:}

When we vary the number of neighbors in the window, the experimental
results show that prediction performance improves slowly as the
window grows. Once the window size reaches eleven or more, the
change in prediction accuracy is no longer noticeable, which is in
accord with the conclusion of Ofran's study~[10].
\item \textbf{Prediction with information of known surface residues:}

Predicting protein-protein interacting regions from protein sequence
information alone cannot achieve good performance. We therefore
assume that knowing whether a given residue is exposed as a surface
residue can contribute to machine learning. Consequently, we assign
the same feature vector to all non-surface residues. The
experimental results show that this approach indeed benefits
prediction performance.

\item \textbf{Prediction with physical-chemical properties:}

Previous work by other groups has shown that binding sites share
common properties that distinguish them from the rest of the
protein. We therefore compiled statistics from the proteins in FSSP
and analyzed the difference in composition between protein-protein
interacting sites and non-interacting sites. The statistics indicate
that hydrophobic residues, as well as non-polar residues, are more
likely to appear in the interacting region. We then tried several
other residue properties as prediction features and subsequently
improved the performance. The experimental results are in accord
with previous studies, confirming that there exist some common
properties which can be used to distinguish interface residues.

\item \textbf{Prediction with conservation information:}

Protein sequences with a significant similarity are expected to be
homologous, hence they are also expected to share a common
(approximate) three-dimensional structure and similar functions.
When multiple sequences from a family of proteins are available, it
is also possible to infer functionally important amino acid residues
by constructing a multiple sequence alignment and by identifying
conserved regions in the sequences. The basic assumption in such
approaches is that functionally important residues are well
conserved and functionally less important residues are liable to
frequent substitutions (Kimura, 1983). Adding the conservation
information to the input for machine learning confirms that the
conservation of the interacting region does help improve prediction
accuracy.

\item \textbf{Prediction with secondary structure information:}

In recent years, the accuracy of predicting the secondary structure
of proteins from sequence profiles has risen above $78\%$. In Zhou's
study, the sequence profiles of neighboring residues from PSI-BLAST
and their solvent exposure were used as input to a neural network
predictor. Similarly, we use the sequence profile from PSI-BLAST,
but in contrast to Zhou's method we use an SVM as the classifier.
Our result confirms that secondary structure information can
contribute to prediction performance.
\end{enumerate}

In this report, we started from the raw protein sequences and then
added, in turn, the information from multiple alignments, the
information on non-surface residues, the physical-chemical
properties of residues, the conservation of residues, and finally
the secondary structure information. The sensitivity, precision and
F-score increased from the original 0.543, 0.318 and 0.4 to 0.614,
0.381 and 0.471, respectively.

\end{document}

