Alternative Two Sample Tests in Bioinformatics
Xiaohui Zhong and Kevin Daimi
Department of Mathematics, Computer Science and Software Engineering
University of Detroit Mercy,
4001 McNichols Road, Detroit, MI 48221
{zhongk, daimikj}@udmercy.edu
Abstract— Bioinformatics is a multidisciplinary field.
Statistics is gaining immense popularity in bioinformatics
research. The goal of this paper is to present a survey of
two-sample tests applied to bioinformatics. The vast
majority of these methods do not follow the classical two
sample test techniques, which require strict assumptions.
Thus, unlike other classical surveys, this paper will
emphasize the justifications behind the deviations from the
standard approach, and the implementation of such
deviations.
Index Terms— Statistical Methods, Sequence Analysis,
Microarray, Two-sample Testing, Bootstrap Hypothesis
Testing, Non-traditional Hypothesis Testing
I. INTRODUCTION
Bioinformatics is a rapidly growing discipline that has
matured from the fields of Molecular Biology,
Computer Science, mathematics, and Statistics. It refers
to the use of computers to store, compare, retrieve,
analyze and predict the sequence or the structure of
molecules. According to Cohen [2], “The underlying
motivation for many of the bioinformatics approaches is
the evolution of organisms and the complexity of
working with incomplete and noisy data.”
Bioinformatics is a multidisciplinary field in which
teams from Biology, Biochemistry, Mathematics,
Computer Science, and Statistics work together to
gain insight into the functions of the cell [3], [10].
More precisely, Bioinformatics is the
marriage of biology and computer
science in order to analyze biological data and
consequently solve biological problems [12].
The need for collaboration in bioinformatics research
and teaching is inevitable. “The explosive increase in
biological information produced by large-scale genome
sequencing and gene/protein expression projects has
created a demand that greatly exceeds the demand for
researchers trained both in biology and in computer
science” [4]. According to the European Bioinformatics
Institute [5], “Bioinformatics is an interdisciplinary
research area that is the interface between the biological
and computational sciences. The ultimate goal of
bioinformatics is to uncover the wealth of biological
information hidden in the mass of data and obtain a
clearer insight into the fundamental biology of
organisms. This new knowledge could have profound
impacts on fields as varied as human health, agriculture,
environment, energy and biotechnology.”
The field of statistics plays a vital role in
bioinformatics. Modified statistical techniques are
constantly evolving. Statistics is the science of
collection, organization, presentation, analysis, and
interpretation of data [16], [18]. Statistical methods
that summarize and present data are referred to as
descriptive statistics. Data modeling methods that
account for randomness and uncertainty in the
observations, and that draw inferences about the
population of interest, belong to inferential statistics.
When the focus is on biological and health science
information, biostatistics is applicable [18].
The statistical techniques that have been applied
include hypothesis tests, ANOVA, Bayesian methods,
the Mann–Whitney test, and regressions tailored
mainly to microarray data sets, which take into account
multiple comparisons or cluster analysis and beyond. In
bioinformatics, microarrays readily lend themselves to
statistics, resulting in a number of techniques being
applied [15], [22].
The above-mentioned methods
assess statistical power based on the variation present in
the data and the number of experimental replicates.
They also help to minimize Type I and Type II errors in
demanding analyses. While these methods sound
familiar to people with a statistics background, they might
be foreign to researchers in the field of bioinformatics.
On the other hand, statisticians will benefit from
seeing how these techniques are applied in
bioinformatics once they become familiar with DNA
and protein sequences.
This paper
aims to survey some basic statistical techniques,
especially different kinds of hypothesis testing
techniques that have been developed lately and used in
the context of bioinformatics. The goal of this survey is
to pinpoint the motivations for modifying the classical
two-sample tests when applied to bioinformatics by
researchers. The classical two-sample tests have strict
assumptions. The reasons that forced researchers to
relax or violate some of these assumptions will be
explored.
II. CLASSICAL TWO-SAMPLE t-TESTS
The classical two-sample t-test has been applied to only
a few bioinformatics problems. The reason for that
should be clear shortly. An example is the following
scenario. When measuring the level of gene expression
in a segment of DNA, the process usually requires
several repeated experiments in order to obtain the
measurements of one cell type. This is due to biological
and experimental variability. The objective is to
compare the levels of the gene expression between two
types of DNA based on the measured levels of gene
expressions for these two types of DNA’s. Such a
procedure is a typical classical two sample t-test.
Assuming that $M_{t,i_t}$ are the measurements from type
$t = 1, 2$ respectively, with $1 \le i_t \le n_t$, the null
hypothesis $H_0: \mu_1 = \mu_2$ is tested against the alternative
hypothesis $\mu_1 \ne \mu_2$. The appropriate test statistic is
$$t = \frac{\bar{M}_1 - \bar{M}_2}{S} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \tag{1}$$
where
$$S = \left( \frac{\sum_{t=1}^{2} \sum_{i=1}^{n_t} (M_{t,i} - \bar{M}_t)^2}{n_1 + n_2 - 2} \right)^{1/2}.$$
Using the assumption that $M_{t,i_t}$ are $n_t$ NID($\mu_t, \sigma_t^2$)
random variables, the statistic $t$ follows a $t$-distribution
with $n_1 + n_2 - 2$ degrees of freedom if the
null hypothesis is true. While this test procedure is
very simple, it requires very strict assumptions. Some
or all of these assumptions cannot be met in real-life
applications. In some cases, it is either not known or
hard to confirm whether the variables $M_{t,i_t}$ are normally
distributed. Even if they are normally distributed, the
requirement that both normal populations share a
common variance could be hard to fulfill. Another
requirement is that these variables be
independent, which is generally true in many gene
expression measurements. In practice, some or all of
these conditions are not satisfied, but a decision on the
equality of the two means is still needed.
Thus,
alternatives to the standard classical t-test are required.
In this paper, we survey several modified tests
appearing in recent bioinformatics literature.
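As a minimal sketch of formula (1), the following Python function computes the pooled two-sample t statistic together with its degrees of freedom. The function name `pooled_t_statistic` and the toy measurements in the usage note are illustrative assumptions and do not come from the surveyed papers.

```python
import math


def pooled_t_statistic(sample1, sample2):
    """Return (t, degrees of freedom) for the pooled two-sample t statistic of formula (1)."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sum of squared deviations, each sample taken about its own mean.
    ss = sum((x - mean1) ** 2 for x in sample1) + sum((x - mean2) ** 2 for x in sample2)
    s = math.sqrt(ss / (n1 + n2 - 2))  # pooled standard deviation S
    t = (mean1 - mean2) / s * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2
```

For example, `pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])` gives a negative statistic with 4 degrees of freedom, because the second sample has the larger mean. The sketch assumes both samples are non-constant, so that S is positive.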
III. TWO SAMPLE TEST WITH INTRA-DEPENDENCY
Gilbert et al. [9] compared the genetic diversity of the
virus between two groups of children who were infected
with HIV at birth. The children were classified into a
group of 9 slow/non-progressors (group 1) and a group
of 12 progressors (group 2). Between 3 and 7 HIV gag
P17 sequences were sampled from each child, and pairwise
sequence distances were derived for each child's
sample as the measures of diversity within a child. The
goal was to assess whether the level of HIV genetic
diversity differed between the two groups in order to
help identify the role of viral evolution in HIV
pathogenesis. In what follows, we show why the
authors had to deviate from the standard two-sample
test. We first introduce and explain their statistical
model.
Let $M^g_{kij}$ represent the distance between sequences $i$
and $j$ of child $k$ in group $g$, $g = 1$ or $2$. It was found
that if a sequence is involved in two distances of a
child's sequences, then the two distances are positively
correlated. Contrasts involving a common
individual are also positively correlated. Therefore, the
conditions for the classical t-test described in Section II are
violated, and applying that
procedure would produce biased results. The natural option is
to perform the test on a subset of independent
samples, but then not all of the available information is
used. Thus, a new two-sample test that takes
the correlations between samples into account was
proposed. The details are as follows:
Assume that there are $n_g$ children in group $g$,
$g = 1$ or $2$, and that child $k$ has $m_k^g$ sequences
sampled. Then there are
$$N^g = \sum_{k=1}^{n_g} m_k^g (m_k^g - 1)/2$$
pairwise distances from each group. There are also
$$Q^g = \sum_{k=1}^{n_g} 2(m_k^g - 2)$$
covariances between
the distances for the individuals in each group. The test
statistic is similar to (1) above,
$$t = \frac{\bar{M}^1 - \bar{M}^2}{\hat\sigma(\bar{M}^1 - \bar{M}^2)}.$$
The main idea is to estimate the standard deviation
$\hat\sigma(\bar{M}^1 - \bar{M}^2)$ with the correlations between the $M^g_{kij}$
taken into account, assuming the null hypothesis $H_0: \mu_1 = \mu_2$ is true.
Here, the mean distances are defined as
$$\bar{M}^g = (N^g)^{-1} \sum_{k=1}^{n_g} \sum_{i<j} M^g_{kij}.$$
Note that
the correlation only occurs within a group, and in
particular within individuals, so the estimate of the
variance within one group can be discussed without
the indices $g$ and $k$. Since there are $n(n-1)/2$ pairwise
distances, the standard estimate for the variance of
$\bar{M}$ is
$$\hat\sigma^2(\bar{M}) = (n(n-1)/2 - 1)^{-1} \sum_{i<j} (M_{ij} - \bar{M})^2.$$
But this estimate is too small because it does not
account for the positive correlations between distances
sharing the same sequences.
Another option is
$$\hat\sigma^2(\bar{M}) = (n-1)^{-1} \sum_{i<j} (M_{ij} - \bar{M})^2.$$
However, this
one is too large unless the correlations between the
sequences are perfectly linear. Therefore, something in
between these two estimates could be a more accurate
estimate of the variance. Because the correlation only
occurs between the pairwise distances sharing the same
sequence, this variance can be estimated by calculating
the covariance in two parts:
$$\sigma^2(\bar{M}) = (n(n-1)/2)^{-1} \{ 2(n-2)\sigma_1^2 + \sigma_2^2 \},$$
where $\sigma_1^2 = \mathrm{cov}(M_{ij}, M_{il})$ is the covariance of the
pairwise distances that share the same sequence, and
$\sigma_2^2 = \mathrm{var}(M_{ij})$ is the variance of all pairwise
distances.
The empirical estimates of these two variances are:
$$\hat\sigma_1^2 = \frac{\sum_{i<j<l} \{ (M_{ij} - \bar{M})(M_{il} - \bar{M}) + (M_{ij} - \bar{M})(M_{jl} - \bar{M}) \}}{n(n-1)(n-2)/3 - 1}, \tag{2}$$
$$\hat\sigma_2^2 = (n(n-1)/2 - 1)^{-1} \sum_{i<j} (M_{ij} - \bar{M})^2. \tag{3}$$
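The single-group estimate can be sketched in Python as follows. The function combines the empirical covariance (2) and variance (3) into the two-part expression for the variance of the mean pairwise distance. The function name `mean_distance_variance`, the matrix input format, and the toy distances in the usage note are assumptions made for illustration; at least three sequences are required.

```python
from itertools import combinations


def mean_distance_variance(dist):
    """dist: symmetric n x n matrix (list of lists) of pairwise distances, n >= 3.

    Returns (mean, cov1, var2, var_mean), where cov1 follows formula (2),
    var2 follows formula (3), and
    var_mean = (n(n-1)/2)^(-1) * (2(n-2)*cov1 + var2).
    """
    n = len(dist)
    pairs = list(combinations(range(n), 2))
    n_pairs = len(pairs)  # n(n-1)/2 pairwise distances
    mean = sum(dist[i][j] for i, j in pairs) / n_pairs
    # Formula (3): variance of all pairwise distances.
    var2 = sum((dist[i][j] - mean) ** 2 for i, j in pairs) / (n_pairs - 1)
    # Formula (2): two cross products per triple i < j < l.
    cross = 0.0
    for i, j, l in combinations(range(n), 3):
        d_ij = dist[i][j] - mean
        cross += d_ij * (dist[i][l] - mean) + d_ij * (dist[j][l] - mean)
    cov1 = cross / (n * (n - 1) * (n - 2) / 3 - 1)
    var_mean = (2 * (n - 2) * cov1 + var2) / n_pairs
    return mean, cov1, var2, var_mean
```

For three sequences with distances 1, 1, and 4, i.e. `mean_distance_variance([[0, 1, 1], [1, 0, 4], [1, 4, 0]])`, the mean distance is 2, the covariance term is -1, the variance term is 3, and the resulting variance of the mean is 1/3.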
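Since there are two groups, the estimate can be modified
to
$$\hat\sigma^2(\bar{M}^1 - \bar{M}^2) = \sum_{g=1}^{2} \sum_{k=1}^{n_g} (N^g)^{-1} \{ 2(m_k^g - 2)\hat\sigma_{g,1}^2 + \hat\sigma_{g,2}^2 \}, \tag{4}$$
where
$$\hat\sigma_{g,1}^2 = \Big( \sum_{k=1}^{n_g} m_k^g (m_k^g - 1)(m_k^g - 2)/3 - 1 \Big)^{-1} \sum_{k=1}^{n_g} \sum_{i<j<l} \{ (M^g_{kij} - \bar{M}^g)(M^g_{kil} - \bar{M}^g) + (M^g_{kij} - \bar{M}^g)(M^g_{kjl} - \bar{M}^g) \}$$
and
$$\hat\sigma_{g,2}^2 = (N^g - 1)^{-1} \sum_{k=1}^{n_g} \sum_{i<j} (M^g_{kij} - \bar{M}^g)^2.$$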
Modified this way, the test statistic
$$t = \frac{\bar{M}^1 - \bar{M}^2}{\hat\sigma(\bar{M}^1 - \bar{M}^2)}$$
will have an asymptotic normal distribution, provided that
$$\sum_{g=1}^{2} \frac{N^g (N^g - 1)/2}{2(N^g - 2)\rho_g^2 + 1}$$
is large enough, where $\rho_g$ is
the correlation coefficient of the two pairwise distances
sharing the same sequence in group $g$.
The authors provided comparative results for the
DNA sequences of the 21 children described earlier.
The classical two-sample t-test was performed on the
differences based on synonymous distance, with sample
means $\bar{D}^1 = 0.0113$ and $\bar{D}^2 = 0.00713$, and sample
sizes $N^1 = 387$ and $N^2 = 523$ respectively. The result
suggested a difference between the two groups with
$p = 2.2 \times 10^{-6}$.
However, it was estimated that the
correlations of the pairwise distances within individuals
are $\rho_1 = 0.55$ and $\rho_2 = 0.61$ respectively. The classical
t-test ignored these positive correlations, which
resulted in a smaller estimated variance for the
difference of the means. When the newly developed
procedure was applied, it produced $p = 0.56$, which
indicates that the difference between the mean distances
of the two groups is not significant.
The above two-sample test provides an
alternative to the traditional two-sample t-test that
accommodates situations where data within a group
may be correlated. This approach can have a significant
impact on many areas. First, a new method for the
existing statistical tests is introduced. This method can
be applied not only in bioinformatics, but
also in other fields, such as finance,
engineering, chemistry, and behavioral science. Most
importantly, it can have distinct significance in the
bioinformatics domain. For example, in the analysis of
DNA sequences [6], one of the tasks is to test the
similarity, or the differentially expressed genes, of two
sequences by matching subsequences. One of the
assumptions for such matching rules is that the
occurrences of the nucleotides must be independent.
Such an assumption was found to be inaccurate in many
DNA sequences. This method provides an alternative
formula for the test statistic by calculating the variance
of the mean of data that might be dependent on each
other. Furthermore, the method for calculating the
variance can be extended to building statistical models
from data that might be interdependent.
IV. BOOTSTRAP AND PERMUTATION METHODS
The test discussed in the last section dealt with
comparing means from two samples.
With the
advancements of biology and other biosciences,
collections of microscopic DNA spots attached to a
solid surface, called microarrays, are studied. With the
power of computation, scientists use DNA microarrays
to measure the expression levels of large numbers of
genes simultaneously. One of their objectives is to
detect differentially expressed genes between two types
of cells.
Suppose we have two types of cells. Associated with
each cell type are a number of microarrays. Let the numbers
of microarrays be $n_1$ and $n_2$ respectively. The $n_1$ arrays
contain $m$ genes from the first type of cells, and the $n_2$
arrays have $m$ genes from the second type. Let $M^c_{ij}$ be
the expression value of the $i$th gene in the $j$th array in
cell type $c$, $c = 1, 2$. Let $t_i$, $i = 1, 2, \ldots, m$, be the two-sample test
statistics calculated using formula (1).
The null
hypothesis for each test is $H_{i0}: \mu_{i1} = \mu_{i2}$. When this
hypothesis is rejected (a positive result), gene $i$
is declared differentially expressed. Assuming the
cumulative distribution function of $t_i$ is $D_i(t)$ when the
null hypotheses are true, the p-value of each test can be
calculated as $p_i = 1 - D_i(|t_i|) + D_i(-|t_i|)$. These p-values
are arranged in ascending order
$p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$. The null hypothesis for any gene with a p-value
below a certain threshold is rejected (indicating the
test is positive). These genes are ranked in the order of
their p-values, with the smallest value as the most significant,
for further study. The remaining task is to find the
distributions $D_i(t)$.
There are many different ways to identify these
distribution functions. Under the classical assumptions
that all $M^c_{ij}$ are normally and identically distributed, the
distributions are either Student's t-distribution or the
standard normal distribution. As discussed in the last
section, such an assumption is either unrealistic or
difficult to verify. As a result of increasing computing
power, resampling methods, such as permutation and
bootstrap methods, are being widely used. These
methods generate empirical distributions $D_i$, from which
the p-values $p_i$ are obtained.
The classical bootstrapping/permutation resampling
scheme is described as follows.
• Calculate the test statistic $t_i$ from the original sample
for each gene using formula (1).
• Put all $n_1 + n_2$ arrays in the same pool. Randomly draw
$n_1$ arrays to be assigned to type 1
cells and $n_2$ arrays to be
assigned to type 2 cells.
• If the draws are with replacement, the bootstrap
method is used. If the draws are without
replacement, the permutation method is
applied.
• Repeat the above steps $B$ times. In the case of
permutation, not all possible permutations have to be
considered. In this paper, the two methods are
treated similarly.
• Calculate the t-statistics $t_i^b$, $b = 1, 2, \ldots, B$, using
formula (1) for each sample.
• Under the null hypotheses that there is no
differentially expressed gene, the t-statistics
should have the same distribution regardless of how
the arrays are arranged. Hence, the empirical p-values can be calculated by:
$$p_i = \frac{1}{B} \sum_{b=1}^{B} \frac{\#\{ j : |t_j^b| \ge |t_i|, \; j = 1, 2, \ldots, m \}}{m}. \tag{5}$$
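The steps above can be sketched in Python as follows. This is an illustrative implementation of formula (5) under stated assumptions, not the code of any cited paper: each array is represented as a list of $m$ expression values, the names `pooled_t`, `gene_t_stats`, and `resampling_p_values` are hypothetical, and a degenerate resample with zero pooled variance is assigned a statistic of 0 purely as a convention of this sketch.

```python
import math
import random


def pooled_t(x, y):
    """Pooled two-sample t statistic of formula (1)."""
    n1, n2 = len(x), len(y)
    mean1, mean2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - mean1) ** 2 for v in x) + sum((v - mean2) ** 2 for v in y)
    s = math.sqrt(ss / (n1 + n2 - 2))
    if s == 0.0:
        return 0.0  # convention of this sketch for degenerate resamples
    return (mean1 - mean2) / s * math.sqrt(n1 * n2 / (n1 + n2))


def gene_t_stats(arrays1, arrays2):
    """One t statistic per gene; each array is a list of m expression values."""
    m = len(arrays1[0])
    return [pooled_t([a[i] for a in arrays1], [a[i] for a in arrays2])
            for i in range(m)]


def resampling_p_values(arrays1, arrays2, n_resamples=200, replace=False, seed=0):
    """Empirical p-values of formula (5).

    replace=False draws without replacement (permutation);
    replace=True draws with replacement (bootstrap).
    """
    rng = random.Random(seed)
    n1 = len(arrays1)
    pool = arrays1 + arrays2
    observed = gene_t_stats(arrays1, arrays2)
    m = len(observed)
    counts = [0] * m
    for _ in range(n_resamples):
        if replace:
            drawn = rng.choices(pool, k=len(pool))
        else:
            drawn = rng.sample(pool, len(pool))
        # First n1 drawn arrays play the role of type 1, the rest type 2.
        null_abs = [abs(t) for t in gene_t_stats(drawn[:n1], drawn[n1:])]
        for i, t_obs in enumerate(observed):
            counts[i] += sum(1 for t in null_abs if t >= abs(t_obs))
    # Average over the B resamples and the m genes, as in formula (5).
    return [c / (m * n_resamples) for c in counts]
```

Because formula (5) pools the resampled statistics over all genes, a gene with a larger observed $|t_i|$ never receives a larger p-value than a gene with a smaller one, so the ranking by p-value agrees with the ranking by $|t_i|$.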
This scheme was discussed and applied in a number of
papers [1], [8], [13], [19]-[21]. An
alternative, described in [13], is the Posterior Mixing
Scheme:
• Resample $n_1$ arrays from type 1 cells and assign them
to type 1, and resample $n_2$ arrays from type 1
cells and assign them to type 2.
• Using the data in question, calculate $t_{i1}^b$, $b = 1, 2, \ldots, B$,
for each sample using formula (1).
• Repeat the above two steps on the arrays from type
2 cells and obtain $t_{i2}^b$, $b = 1, 2, \ldots, B$. Finally,
calculate
$$t_i^b = \frac{n_1}{n_1 + n_2} t_{i1}^b + \frac{n_2}{n_1 + n_2} t_{i2}^b. \tag{6}$$
Then the $p_i$'s are calculated with formula (5). It was
concluded that the Posterior Mixing Scheme has
better power [1 - P(Type II error)] than the classical
one. To our knowledge, this formula for calculating the
test statistic has not yet been employed in the
bioinformatics literature. It should
appeal to researchers who wish to investigate and
validate it further, and to obtain more accurate results when
identifying differentially expressed genes.
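A minimal Python sketch of the mixing step in formula (6) for a single gene is given below. It assumes that "resample" means drawing with replacement within one cell type, which the text does not state explicitly; the names `pooled_t` and `posterior_mixing_stats`, and the zero-variance convention, are likewise assumptions of this sketch.

```python
import math
import random


def pooled_t(x, y):
    """Pooled two-sample t statistic of formula (1)."""
    n1, n2 = len(x), len(y)
    mean1, mean2 = sum(x) / n1, sum(y) / n2
    ss = sum((v - mean1) ** 2 for v in x) + sum((v - mean2) ** 2 for v in y)
    s = math.sqrt(ss / (n1 + n2 - 2))
    if s == 0.0:
        return 0.0  # convention of this sketch for degenerate resamples
    return (mean1 - mean2) / s * math.sqrt(n1 * n2 / (n1 + n2))


def posterior_mixing_stats(x1, x2, n_resamples=100, seed=0):
    """Resampled statistics t^b of formula (6) for one gene.

    x1, x2: measurements of the gene in type 1 and type 2 cells.
    """
    rng = random.Random(seed)
    n1, n2 = len(x1), len(x2)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mixed = []
    for _ in range(n_resamples):
        # Both pseudo-groups are drawn from type 1 data only.
        t_b1 = pooled_t(rng.choices(x1, k=n1), rng.choices(x1, k=n2))
        # Both pseudo-groups are drawn from type 2 data only.
        t_b2 = pooled_t(rng.choices(x2, k=n1), rng.choices(x2, k=n2))
        mixed.append(w1 * t_b1 + w2 * t_b2)
    return mixed
```

The returned list plays the role of the $t_i^b$ values for one gene. Repeating the call for every gene yields the statistics that enter formula (5).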
Mukherjee et al. [17] took the bootstrap method for
calculating these statistics a step further. From the
bootstrap schemes described above, the bootstrap $t_i$'s
are assumed to be normally distributed with empirical
mean $\bar{t}_i = \frac{1}{B} \sum_{b=1}^{B} t_i^b$ and standard deviation $\sigma$.
Formula (5) was not used to calculate the p-values.
Instead, the expected p-value was calculated by
$$\bar{p}_i = E(p_i) = \int_{-\infty}^{\infty} \left( 1 - D_i(|x|) + D_i(-|x|) \right) G(x, \bar{t}_i, \sigma) \, dx,$$
where $D_i$ is the cumulative distribution function of $t_i$
and $G$ is the Gaussian.
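The expected p-value integral can be approximated numerically. The sketch below uses the trapezoidal rule and, as a simplifying assumption, takes the null distribution $D_i$ to be the standard normal rather than a t-distribution; any other cumulative distribution function can be passed in through the `null_cdf` parameter. The function name, the integration width, and the number of steps are illustrative choices.

```python
from statistics import NormalDist


def expected_p_value(t_mean, sigma, null_cdf=NormalDist().cdf,
                     n_steps=4000, width=8.0):
    """Approximate E(p) = integral of (1 - D(|x|) + D(-|x|)) G(x, t_mean, sigma) dx.

    null_cdf: CDF D of the test statistic under the null hypothesis
              (standard normal by default, an assumption of this sketch).
    The integral is truncated to t_mean +/- width * sigma.
    """
    gauss = NormalDist(mu=t_mean, sigma=sigma)
    lower = t_mean - width * sigma
    h = 2.0 * width * sigma / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        x = lower + k * h
        p_x = 1.0 - null_cdf(abs(x)) + null_cdf(-abs(x))  # two-sided p-value at x
        weight = 0.5 if k in (0, n_steps) else 1.0        # trapezoid end points
        total += weight * p_x * gauss.pdf(x)
    return total * h
```

As a sanity check, when the bootstrap mean is 0 and $\sigma = 1$ with a standard normal null, the two-sided p-value is uniformly distributed, so the expected p-value should be close to 0.5. A bootstrap mean far from 0 gives a much smaller value.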
This procedure was applied to some widely analyzed
microarray data with $D_i$ replaced by the t-distribution with
$n_1 + n_2 - 2$ degrees of freedom, and the variance $\sigma^2$
was set between 1 and 3. The gene rankings produced
by this proposed bootstrap method and by the classical
two-sample t-test were compared. It was found that
genes subsequently confirmed as differentially expressed
by further costly tests were ranked an
average of 25.5 places higher by the bootstrap method than by the
classical method [17]. This suggests that the bootstrap
method provides a powerful alternative to the classical
method by estimating the p-values more accurately.
The bootstrap two-sample test is widely used by many
researchers in identifying differentially expressed
genes. This method is particularly suitable for cases
where the underlying distributions are unknown. For
example, Troyanskaya et al. [21] used this procedure to
perform 50,000 permutations on a data set comprised of
normal lung and squamous cell lung tumor specimens,
with Bonferroni-corrected p-values. The result of
this method was compared to the results of the rank sum test
and the ideal discriminator method. It was concluded that
the bootstrap two-sample test is most appropriate for a
high-sensitivity test [21]. Many other researchers, such
as Pan [19], Ge [8], and Abul [1], also used this method
as an integral part of their more comprehensive studies of
microarrays.
The procedure of bootstrapping requires intensive
computation. Computer packages and algorithms have also been
developed to tackle the issues related to computation
time, storage, and efficiency. Li et al. [14] developed an
algorithm, FastPval, to efficiently calculate very low p-values
from a large number of resampled data. The
software package SAFEGUI was designed to perform bootstrap
resampling t-tests for testing gene categories [7].
V. MULTIPLE TESTING WITH Q-VALUES
The modified two-sample tests and bootstrap two-sample
tests introduced in the last section concentrated
on finding the p-values of the tests so that genes can be
ranked accordingly. Notice that the p-value is only the
probability that the test statistic falls in the critical
region controlled by the maximum tolerance of Type I
error for one test. In the case of multiple tests, such as
the gene expressions in microarrays, the Type I error
can be inflated. For example, assume that 1000 genes
are represented in each array of the two types of cells,
and 20 out of the 1000 genes are differentially
expressed. To find these 20 genes, two-sample t-tests
are performed among 1000 pairs of genes using a p-value
for 5% Type I error. This will produce 49 [5% of
(1000-20)] misidentified genes, which is even more
than the number of actual differentially expressed genes. Thus, the
Type I error for the entire array is greater than 5%,
which is an undesirable result. A well-known classical
procedure to correct this problem is the Bonferroni
correction, which replaces the cut-off Type I error $\alpha$ by $\alpha / m$,
where $m$ is the total number of tests [6], [21]. For
$m = 1000$ and $\alpha = 0.05$, the Type I error cut-off becomes 0.00005, which forces
the test to miss most of the significantly differentially
expressed genes. The possible outcomes of any
multiple tests can be described in tabular format
(Table 1). The numbers in parentheses represent
the intended scenario.
TABLE 1
POSSIBLE OUTCOMES FOR 1000 GENES WITH 5% P-VALUE
A variety of measurement schemes were proposed in the
development of procedures dealing with microarray data.
These include the per-comparison error
rate (PCER), family-wise error rate (FWER), false
discovery rate (FDR), and positive false discovery rate
(pFDR). With $V$ denoting the number of false positives
and $R$ the total number of positive tests, they are stated as [8]:
$$\mathrm{PCER} = \frac{E(V)}{m}$$
$$\mathrm{FWER} = \Pr(V > 0)$$
$$\mathrm{FDR} = E\left( \frac{V}{R} \,\Big|\, R > 0 \right) \Pr(R > 0)$$
$$\mathrm{pFDR} = E\left( \frac{V}{R} \,\Big|\, R > 0 \right)$$
Among these four measures, the most commonly used
is the pFDR. Since this quantity is only meaningful and
useful when $R$ is positive, this rate is usually written as
$\mathrm{FDR} = E(V/R)$, which is the symbol used here. A
straightforward estimate for FDR is $\mathrm{FDR} \approx V/R$, which
represents the ratio of the number of false positives to the
total number of positive tests. While traditional multiple tests
have to deal with thousands of tests with only one cut-off
value for the p-values, the false discovery rate takes
into account the joint behavior of all the p-values. The
false discovery rate is therefore a useful measure of the
overall accuracy of a set of significant tests. We will
discuss a method using a q-value developed by Storey
et al. [20]. The q-value method takes the FDR into
consideration, balancing the identification of as many
significant features as possible against keeping a
relatively low proportion of false positives. This
method and an important application of it [1]
are discussed below.
A value similar to the p-value is defined by Storey et
al. [20] as the q-value corresponding to a particular
p-value. Assume the p-values for each test are calculated
as $p_i$ by one of the methods introduced in previous
sections. Then the q-value is calculated by:
$$q(p_i) = \min_{p_i \le \alpha \le 1} \mathrm{FDR}(\alpha) = \min_{p_i \le \alpha \le 1} \left\{ \frac{V(\alpha)}{S(\alpha)} \right\},$$
where $V(\alpha) = \#\{\text{false positives} \mid p_i \le \alpha, \; i = 1, 2, \ldots, m\}$
and $S(\alpha) = \#\{p_i \le \alpha, \; i = 1, 2, \ldots, m\}$. The objective is to
simultaneously control the q-value and the p-value so
that the FDR will not be out of proportion. A procedure
for finding the q-values and the criteria for selecting
the threshold in a sequential procedure are described
below [20].
1) Assume the test statistics are calculated by (1), with
p-values $p_i$ calculated by (5), for $i = 1, 2, \ldots, m$.
2) Arrange the p-values in ascending order
$p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)}$, which is also the order of the genes
in terms of their evidence against the null hypotheses.
3) Use one of the options described below to estimate
the value of $\hat\pi_0$.
4) Estimate
$$\hat{q}(p_{(m)}) = \min_{t \ge p_{(m)}} \frac{\hat\pi_0 m t}{\#(p_i \le t)} = \hat\pi_0 \, p_{(m)}.$$
5) For $j = m-1, m-2, \ldots, 1$, estimate
$$\hat{q}(p_{(j)}) = \min_{t \ge p_{(j)}} \left\{ \frac{\hat\pi_0 m t}{\#(p_i \le t)}, \; \hat{q}(p_{(j+1)}) \right\} = \min \left\{ \frac{\hat\pi_0 m p_{(j)}}{j}, \; \hat{q}(p_{(j+1)}) \right\}.$$
Now, two lists of p-values and q-values are
simultaneously formed:
$$p_{(1)} \le p_{(2)} \le \ldots \le p_{(m)},$$
$$q(p_{(1)}) \le q(p_{(2)}) \le \ldots \le q(p_{(m)}).$$
One can select the maximum index $1 \le k \le m$ in the
above lists so that the p-values and q-values up to
the $k$th gene satisfy both thresholds.
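Steps 2, 4, and 5 of the procedure can be sketched in Python as follows, taking the estimate $\hat\pi_0$ from step 3 as an input. The function name `q_values` and the toy p-values in the usage note are assumptions for illustration.

```python
def q_values(p_values, pi0):
    """Estimate q-values from p-values, given an estimate pi0 of the null proportion.

    Implements q(p_(m)) = pi0 * p_(m) and, for j = m-1, ..., 1,
    q(p_(j)) = min(pi0 * m * p_(j) / j, q(p_(j+1))).
    The result is returned in the original order of p_values.
    """
    m = len(p_values)
    # Indices of the genes sorted by ascending p-value (step 2).
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = float("inf")
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pi0 * m * p_values[i] / rank
        running_min = min(candidate, running_min)
        q[i] = running_min
    return q
```

For example, `q_values([0.01, 0.04, 0.03, 0.5], pi0=1.0)` assigns q-values of 0.04, about 0.0533, about 0.0533, and 0.5. The second and third genes share a q-value because the running minimum enforces the monotone ordering of the q-value list. Genes can then be selected by keeping those whose p-value and q-value both fall below the chosen thresholds.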
The quantity $\pi_0$ in step 3 is the proportion of null
genes (no difference between the two cell types) among
the total number $m$ of genes tested. Although this
quantity is difficult to estimate, three different ways
have been developed to estimate it [20].
A. Rule of Thumb Method
Let $\hat\pi_0(\lambda) = \frac{\#(p_i > \lambda)}{m(1 - \lambda)}$ for some $\lambda$, $0 \le \lambda < 1$.
The rationale for this estimate is that the null p-values
are uniformly distributed beyond a certain value
$\lambda$. A simple rule of thumb is to choose $\lambda = 0.5$.
This implies that the value of $\pi_0$ is estimated by
$$\hat\pi_0 = \frac{\#(p_i > 0.5)}{0.5\, m}.$$
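The rule-of-thumb estimate is a one-line computation. The Python sketch below uses the hypothetical name `estimate_pi0` and made-up p-values in the usage note.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of null genes as #(p_i > lam) / (m * (1 - lam))."""
    m = len(p_values)
    return sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
```

For instance, `estimate_pi0([0.01, 0.02, 0.03, 0.7])` returns 0.5, since one of the four p-values exceeds 0.5. The estimate can exceed 1 for some data, which is why the other methods cap it at 1.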
B. Bootstrap Method
Assume that all p-values are calculated from the
original set of data. Calculate
$$\hat\pi_0(\lambda_k) = \frac{\#(p_i > \lambda_k)}{m(1 - \lambda_k)}$$
for $\lambda_k = k\Delta$, $k = 0, 1, \ldots, M$, $\Delta = \lambda_{\max}/M$, $0 < \lambda_{\max} < 1$,
from these p-values. Here, $\lambda_{\max}$ is close to 1 and
$M$ is the number of desired points. Let
$\bar\pi = \min_{0 \le k \le M} \{\hat\pi_0(\lambda_k)\}$. Resample the data $B$ times,
calculating
$$\hat\pi_0^b(\lambda_k) = \frac{\#(p_i^b > \lambda_k)}{m(1 - \lambda_k)}$$
from the
bootstrap p-values for all $\lambda_k$ each time. Define the
mean square error to be:
$$\mathrm{MSE}(\lambda_k) = \frac{\sum_{b=1}^{B} (\hat\pi_0^b(\lambda_k) - \bar\pi)^2}{B}.$$
Then the estimate of the proportion of null genes
will be:
$\hat\pi_0 = \min_{\lambda_k \in \Lambda} \{\hat\pi_0(\lambda_k), 1\}$, where $\Lambda$ is the collection of
$\lambda_k$'s such that $\mathrm{MSE}(\lambda_k)$ is minimum.
A simple
algorithm was given in [1].
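The bootstrap method can be sketched in Python as below. This is not the algorithm of [1]; it is a direct transcription of the steps above under one stated assumption: the bootstrap p-values are obtained by resampling the original p-values with replacement, rather than by resampling the raw expression data. The names `pi0_at` and `bootstrap_pi0` and the default parameter values are hypothetical.

```python
import random


def pi0_at(p_values, lam):
    """pi0 estimate at a given lambda: #(p_i > lam) / (m * (1 - lam))."""
    m = len(p_values)
    return sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))


def bootstrap_pi0(p_values, lam_max=0.95, n_points=19, n_boot=100, seed=0):
    """Choose lambda by minimum bootstrap MSE and return min(pi0(lambda), 1)."""
    rng = random.Random(seed)
    m = len(p_values)
    step = lam_max / n_points
    lambdas = [k * step for k in range(n_points + 1)]  # lambda_k = k * Delta
    estimates = [pi0_at(p_values, lam) for lam in lambdas]
    pi_bar = min(estimates)
    sq_err = [0.0] * len(lambdas)
    for _ in range(n_boot):
        # Assumption of this sketch: bootstrap the p-values themselves.
        boot = rng.choices(p_values, k=m)
        for k, lam in enumerate(lambdas):
            sq_err[k] += (pi0_at(boot, lam) - pi_bar) ** 2
    mse = [s / n_boot for s in sq_err]
    min_mse = min(mse)
    # Lambda collects every lambda_k attaining the minimum MSE.
    candidates = [estimates[k] for k in range(len(lambdas)) if mse[k] == min_mse]
    return min(min(candidates), 1.0)
```

Fixing the seed makes the result reproducible. The final `min(..., 1.0)` keeps the estimate a valid proportion even when the raw estimate exceeds 1.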
C. Curve Fitting Method
The ideal choice of $\hat\pi_0(\lambda)$ is $\hat\pi_0(\lambda_{\max})$, where
$\lambda_{\max}$ is close to 1, since genes should be null in this
region. However, the value of $\hat\pi_0(\lambda)$ is very sensitive to
changes in $\lambda$. To obtain a stable estimate, it is suggested
that a natural cubic spline $f(\lambda)$ be fitted to the points
$\{(\lambda_k, \hat\pi_0(\lambda_k)) \mid \lambda_k \in \{0, \Delta, \ldots, \lambda_{\max}\}\}$; the estimate is then
$\hat\pi_0 = f(1)$. There were two suggestions for fitting the
curve. Storey et al. [20] suggested that the curve fitting
should be weighted by $(1 - \lambda)$ to control the instability
near 1. However, Abul et al. [1] suggested that the result
with no weighting is better, to avoid underestimation.
For any new set of data, both weighted and unweighted
fitting should be tried and the better estimate used.
The procedure for estimating $\pi_0$ was extended to one-sided
hypotheses [1] with some adjustments. For
example, if the tests are right-sided (up-regulated), the
formula for the t-statistics remains the same as (1).
The corresponding p-values can also be calculated by
the bootstrap process described in the last section.
However, formula (5) for calculating the p-values
should be modified to
$$p_i = \frac{1}{B} \sum_{b=1}^{B} \frac{\#\{ j : t_j^b \ge t_i, \; j = 1, 2, \ldots, m \}}{m}. \tag{5'}$$
This change will make $\lim_{\lambda \to 1} \hat\pi_0(\lambda) > 1$, which is
meaningless. The adjustment is to set $\lambda_{\max}$ as the
upper bound of $\lambda$ for which $\hat\pi_0(\lambda) \le 1$. This results in
$\lambda_{\max} = \sup\{0 \le \lambda < 1 \mid \hat\pi_0(\lambda) \le 1\}$. Bootstrap or curve
fitting is then deployed to estimate $\hat\pi_0$, which is needed
for finding the q-values.
Experiments on some artificial data demonstrated that
this approach could provide very accurate estimates.
The procedure described above can guide researchers to
fine-tune the selection of genes for further experiments. By
bounding false discoveries, the amount of wasted time and
cost can also be bounded beforehand at the same rate as the
false discoveries. This procedure has many
applications in microarray experiments and gene analysis.
VI. CONCLUSIONS
Bioinformatics is being used in many fields such as
molecular medicine, preventative medicine, gene
therapy, drug development, and waste cleanup. The
interdisciplinary nature of bioinformatics demands close
collaboration between biologists, computer scientists,
mathematicians, and statisticians. Statistics is playing a
significant role in various applications of
bioinformatics. One of the important areas of statistics
that has been heavily used is two-sample testing. These
tests classically have rigorous assumptions. Researchers
involved in bioinformatics concluded that these tests are
not readily suitable for their work due to the nature of
many bioinformatics applications. Consequently,
they were forced to weaken some or all of these
assumptions. This paper surveyed a number of methods
that led researchers to relax these constraints.
The assumptions that were relaxed and the reasons behind
their relaxation were presented.
It is our future goal to introduce studies dealing with
variations of the formulas for two-sample tests, a variety of
methods for controlling the false discovery rate, such as
selecting a proper sample size, methods taking into
account the dependency of sample data, and the extension
of these techniques to multi-sample testing. While
many of these techniques were proposed based on
a certain data set or on artificial data, work needs to be
done on different data sets to validate the results. More
importantly, statisticians can help in seeking theoretical
justification or support for these methods. Computer
scientists can assist in developing more efficient
algorithms to implement these techniques. It is hoped
that these methods can spark new ideas in future
research in bioinformatics.
REFERENCES
[1] O. Abul, R Alhajj, and F Polat, “A Powerful
Approach for Effective Finding of Significantly
Differentially Expressed Genes,” IEEE/ACM
Transactions on Computational Biology and
Bioinformatics, Vol. 3, No. 3, pp. 220-231, 2006.
[2] J. Cohen, “Bioinformatics: An Introduction for
Computer Scientists,” ACM Computing Surveys,
Vol. 36, No. 2, pp. 122-158, 2004.
[3] J. Cohen, “Computer Science and Bioinformatics,”
Communications of the ACM, Vol. 48, No. 3, pp.
72-78, 2004.
[4] Editorial, “Training for Bioinformatics and
Computational Biology,” Bioinformatics, Vol. 17,
No. 9, pp. 761-762, 2001.
[5] European Bioinformatics Institute, Available:
http://www.ebi.ac.uk/2can/home.html.
[6] W. J. Ewens and G. R. Grant, Statistical Methods in
Bioinformatics: An Introduction, New York:
Springer-Verlag, 2001.
[7] D. M. Gatti, M. Sypa, I. Rusyn, F. A. Wright, and
W. T. Barry, “SAFEGUI: Resampling-Based Tests
of Categorical Significance in Gene Expression
Data Made Easy,” Bioinformatics, Vol. 25, No. 4,
pp. 541-542, 2009.
[8] Y. Ge, S. Dudoit, and T. P. Speed, “Resampling-Based
Multiple Testing for Microarray Data
Analysis,” Dept. of Statistics, University of
California, Berkeley, Tech. Rep. 633, 2003.
[9] P. B. Gilbert, A. J. Rossini, and R. Shankarappa,
“Two-Sample Tests for Comparing Intra-Individual
Genetic Sequence Diversity between Populations,”
Biometrics, Vol. 61, No. 1, pp. 106-117, 2005.
[10] S. Gopal, A. Haake, R. P. Jones, and P. Tymann,
Bioinformatics: A Computing Perspective, New
York: McGraw Hill, 2009.
[11] Y. Ji, Y. Lu, and G. Mills, “Bayesian Models Based
on Test Statistics for Multiple Hypothesis Testing
Problems,” Bioinformatics, Vol. 24, No. 7, pp. 943-949, 2008.
[12] M. LeBlanc, and B. Dyer, “Bioinformatics and
Computing Curricula 2001: Why Computer Science
is Well Positioned in a Post Genomic World,” ACM
SIGCSE Bulletin, Vol. 36, No. 4, pp. 64-68, 2004.
[13] S. Lele and E. Carlstein, “Two-Sample Bootstrap
Tests: When to Mix?” Department of Statistics,
University of North Carolina at Chapel Hill, Tech. Rep.
2031.
[14] M. J. Li, P. C. Sham, and J. Wang, “FastPval: A
Fast and Memory Efficient Program to Calculate
Very Low P-values from Empirical Distribution,”
Bioinformatics, Vol. 26, No. 22, pp. 2897-2899,
2010.
[15] P. Liu, J. T. Hwang, “Quick Calculations for
Sample Size While Controlling False Discovery
Rate with Application to Microarray Analysis,”
Bioinformatics, Vol. 23, No. 6, pp. 739-746, 2007.
[16] V. Mantzapolis, and X. Zhong, Probability and
Statistics, Dubuque: Kendall Hunt Publishing
Company, 2010.
[17] S. N. Mukherjee, P. Sykacek, S. J. Roberts, and S.
J. Gurr, “Gene Ranking Using Bootstrapped P-Values,”
SIGKDD Explorations, Vol. 5, No. 2, pp.
16-22, 2003.
[18] M. Pagano, and K. Gauvreau, Principles of
Biostatistics, Belmont: Brooks/Cole, 2000.
[19] W. Pan, “A Comparative Review of Statistical
Methods for Discovering Differentially Expressed
Genes in Replicated Microarray Experiments,”
Bioinformatics, Vol. 18, No. 4, pp. 546-554, 2002.
[20] J. Storey and R. Tibshirani, “Statistical Significance
for Genome-wide Experiments,” Proceedings of the
National Academy of Sciences of the United States
of America, Vol. 100, No. 16, pp. 9440-9445,
2003.
[21] O. G. Troyanskaya, M. E. Garber, P. O. Brown, D.
Botstein, and R. B. Altman, “Nonparametric
Methods for Identifying Differentially Expressed
Genes in Microarray Data,” Bioinformatics, Vol.
18, No. 11, pp. 1454-1461, 2002.
[22] Y. Zhao and W. Pan, “Modified Nonparametric
Approaches to Detecting Differently Expressed
Genes in Replicated Microarray Experiments,”
Bioinformatics, Vol. 19, No. 9, pp. 1046-1054,
2003.