"Analysis of Expression Data: An Overview". In: Current Protocols in

Transcription
Analysis of Expression Data: An Overview
Transcriptome profiling, the simultaneous measurement and analysis of expression
changes in thousands of RNAs, has been
enabled largely by microarray technology
(Schena et al., 1995). Technologies such as
serial analysis of gene expression (SAGE)
or, more recently, massively parallel signature sequencing (MPSS; Brenner et al.,
2000), provide alternative means to measure mRNA abundance. While all three technologies generate high volume data that
require serious consideration of methods of
analysis, this overview focuses on analysis
of data from the most widely used platform,
microarrays.
Since the inception of microarrays in 1989, the underlying technology and DNA probe content
applied to them have diversified considerably. Probes may be deposited or directly
synthesized on a silica chip, on a glass slide,
in a gel matrix, or on a bead. Probes vary
in length from as short as 20 nucleotides (often synthesized in situ) to as long as several
thousand bases (using spotted cDNA technology). In fact, the DNA does not even have
to correspond exclusively to transcribed sequences, such as when microarrays are used
for comparative genome hybridization or single nucleotide polymorphism (SNP) analysis.
One-Color Versus Two-Color Arrays
Microarray technologies can generally be
divided into two categories, one-color arrays
and two-color arrays. In one-color arrays, one
RNA sample is processed, labeled with fluorescent dye, and applied to one microarray. Thus, one raw intensity score is generated from each feature or spot on the array
(a feature or a spot refers to the immobilized
DNA in one physical location on a microarray that corresponds to one sequence). With
two-color arrays, two RNA samples are applied to each microarray (Shalon et al., 1996).
A separate raw intensity score can still be obtained for each sample from each feature because the RNA samples are labeled with dyes
that fluoresce across different spectral ranges.
An advantage of two-color arrays is that
one of the two labeling fluors can be used
as an internal standard. Conversely, one-color arrays are often easier to use in large
cohort studies, where obtaining a common
reference sample for all subjects is difficult.
Ultimately, the choice of technologies is
very often influenced by the experimental
design.
EXPERIMENTAL DESIGN
The term “experimental design” refers to
the structure of the experiment and dictates
the structure of the resulting data as well
as the types of analyses that can be performed.
Because of this, the design phase takes place
before any experimentation is carried out, and
should be carefully considered to maximize
the usefulness of the experiment.
The primary consideration in designing an
experiment should be what question or questions the experiment will attempt to answer.
Typical goals of a microarray experiment are to
find genes that are differentially expressed between two treatments, or to study how expression levels in a set of genes change across a set
of treatments. Because baseline gene expression levels are not usually known beforehand,
most microarray experiments involve comparing gene expression levels between treatment samples and control
samples.
Once treatment and control conditions have
been decided upon, the next step is specifying
the units or samples to which the treatments
will be applied and the rules by which these
treatments are to be allocated. In the simplest
case, mRNA samples are prepared for both
treatment and control conditions, and then hybridized either to separate one-color chips, or
together on a two-color chip. More sophisticated designs are also possible, such as dye-swap and loop designs (Kerr and Churchill,
2001); these have the advantage of increased
control over non-biological sources of variation, such as labeling efficiencies, dye effects,
hybridization differences, and other sources of
error arising from the measurement process.
Since conventional statistical techniques
depend on the existence of replicates, the number and structure of replicates need to be determined during the design phase. Replication structure will typically be dictated by
the goals of the experiment; the number of
replicates to use can be estimated by deciding which statistical analyses will be used and
performing a priori power calculations using
data from a pilot experiment. The goal of these
calculations is to estimate the number of
Contributed by Anoop Grewal, Peter Lambert, and Jordan Stockton
Current Protocols in Bioinformatics (2007) 7.1.1-7.1.12
Copyright © 2007 by John Wiley & Sons, Inc.
UNIT 7.1
Analyzing Expression Patterns
Supplement 17
replicates that will be needed to detect
specified differences in gene expression levels. Frequently, the cost of microarrays or tissue availability prohibits the researcher from
conducting a pilot study or using an optimal
number of biological replicates recommended
by a statistician. While using more biological replicates is always better, most reviewers
currently accept experimental designs with at
least three biological replicates per experimental condition. Bear in mind that variability between replicate samples will depend on the
model system. For instance, an in vitro yeast
study employing cells with identical genetic
backgrounds is likely to yield more significant
results with three replicates per group than a
study measuring expression from blood samples of humans whose genetic backgrounds
and daily routines add variability between
samples. The latter study will almost certainly
necessitate many more biological replicates
per treatment to yield significant results.
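The a priori power calculation mentioned above can be illustrated with a minimal sketch. It assumes a simple two-group comparison of a single gene on the log scale and uses the normal approximation; the function name, the standard deviation `sigma` (which would come from pilot data), and the difference to detect `delta` are all hypothetical. It ignores multiple testing and tends to underestimate the number of replicates needed when that number is small, so it should be read as a rough guide rather than a replacement for a statistician's advice.

```python
import math
from statistics import NormalDist

def replicates_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-group comparison.

    delta: smallest difference in mean log expression worth detecting
    sigma: per-gene standard deviation estimated from pilot data
    Uses the normal approximation n = 2 * ((z_a + z_b) * sigma / delta)^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

# Hypothetical pilot estimates: detect a 2-fold change (delta = 1 on the log2 scale)
# in a noisy system (sigma = 1.0) and in a tightly controlled one (sigma = 0.5).
print(replicates_per_group(delta=1.0, sigma=1.0))
print(replicates_per_group(delta=1.0, sigma=0.5))
```

The contrast between the two calls mirrors the point made above: a more variable model system requires many more biological replicates to detect the same difference.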
The end goal of any microarray experiment
is to precisely measure some type of biological variation; an optimal design will facilitate
this while also controlling for as many non-relevant sources of variation as possible. To
this end, early interaction and communication
between biologists and statisticians is desirable, and can result in higher-quality data that
is more suitable for answering the questions of
interest.
RAW DATA OUTPUT
Scanner Data Processing
Virtually all array scanners include software that converts a scanned image into a tabular set of signals, each corresponding to a
feature on the array.
In the case of Affymetrix GeneChips, each
transcript is measured by multiple features
consisting of 25-mer sequences that map
to different regions of the same transcript
(Lipshutz et al., 1999). To provide some additional information about nonspecific hybridization and background, each of these features is paired with a control mismatch feature
that matches the ‘perfect match’ 25-mer at all
positions except for a single nucleotide difference or ‘mismatch’ at the central or 13th
position. A collection of 11 to 20 pairs of such
features mapping to a common transcript is
referred to as a probeset. Thus, the raw data
gathered at this stage needs to be processed further in a step referred to as ‘probe-level analysis’ to yield a single expression measurement
for a given RNA transcript. Probe-level analysis
is discussed in greater detail below (see
Data Normalization).
In addition to the summary expression
score obtained for each RNA transcript, scanner software may produce many additional
statistics to allow the data analyst to validate
the overall RNA sample quality, to measure
sample-to-sample consistency, to assess spot-to-spot variability, and to assess other sources
of technical variation. Scanner software may
also include normalization options to correct
for technical variations across dyes (for two-color arrays), features, and arrays.
Flags
One type of data value designed to indicate
the quality of a probe score or spot is referred
to as the “flag.” Most, but not all, scanners
will generate at least one flag call, usually a
nominal value, per feature.
Earlier-generation Affymetrix software
(MAS 4.0 and MAS 5.0) reports a detection
call that may be treated as a ‘flag’ for each
probeset (i.e., RNA transcript) by some software, but this call is meant to be interpreted
differently from other scanner vendors’ flags.
Here, the flag is a call to assess whether a gene
is expressed at a detectable level (flag value =
Present) or not (flag value = Absent). A third
possible value, ‘Marginal,’ designates a situation where a definitive call of ‘Present’ or
‘Absent’ cannot be made.
Because of the scanner-to-scanner variability in what a flag value means, microarray data
analysis software packages supporting multiple platforms make limited use of flag information.
When flag values are imported, they can be
used in quality control assessment to remove
associated feature measurements when the flag
indicates spot quality problems. In the case of
Affymetrix data, flag values can be used to restrict genes of interest at a first pass to those
that are ‘Present’ in at least one or some other
minimal number of samples in the study.
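The first-pass restriction described above can be sketched as follows. The probeset identifiers, the single-letter call codes ("P", "M", "A"), and the function name are hypothetical and chosen only for illustration; real detection calls would be read from the scanner software output.

```python
# Hypothetical detection calls: one list of per-sample calls for each probeset.
# "P" = Present, "M" = Marginal, "A" = Absent (identifiers are made up).
detection_calls = {
    "probeset_1": ["P", "P", "A", "P"],
    "probeset_2": ["A", "A", "A", "A"],
    "probeset_3": ["A", "M", "P", "A"],
}

def filter_by_present_calls(calls, min_present=1):
    """Keep probesets called Present in at least `min_present` samples."""
    return [
        probeset
        for probeset, sample_calls in calls.items()
        if sample_calls.count("P") >= min_present
    ]

# Probesets that are Present in at least one sample of the study.
print(filter_by_present_calls(detection_calls, min_present=1))
```

Raising `min_present` makes the filter stricter; whether Marginal calls should count toward the minimum is a judgment left to the analyst and is not handled here.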
Microarray Data Analysis Software
Software specialized for microarray data
is typically used to analyze the raw microarray data output. These software solutions are
preferred to more generic statistical software
packages because of the inclusion of normalization methods, clustering algorithms, and
tools for biological interpretation, such as gene
ontology analysis, gene annotation management, and pathway analysis. Thus, as data
analysis techniques are discussed below, companion tables provide a guide to software solutions that offer the described techniques. Note
that products are listed by features that are current as of the publication of this unit and that not all
product features are discussed below. Additional features are likely to be added to products, so check the product Website for current
capabilities.
Table 7.1.1 lists commonly used software
solutions for microarray data analysis according to the general microarray platforms
(Affymetrix, other one-color, and two-color)
they support.

Table 7.1.1 Microarray Platforms Supported by Common Data Analysis Software Packages (a)
Software (manufacturer or citation)             Affymetrix (b)  Other one-color   Two-color
Acuity (Molecular Devices)                      X               X                 X
ArrayAssist (Stratagene)                        X               X                 X
ArrayStat (Imaging Research)                    X               X                 X
Avadis (Strand Genomics)                        X               X                 X
BioConductor (Gentleman et al., 2004)           X               X                 X
ChipInspector (Genomatix)                       X
d-Chip (Li and Wong, 2001)                      X
Expressionist (Genedata)                        X               X                 X
GeneLinker Gold (Improved Outcomes Software)    X               X                 X
GeneSifter (VizX Labs)                          X               X                 X
GeneSight (BioDiscovery)                        X               X                 X
GeneSpring (Agilent)                            X               X                 X
J-Express Pro (Molmine)                         X               X                 X
Partek Genomics Suite (Partek)                  X               X                 X
Rosetta Resolver System (Rosetta Biosoftware)   X               X                 X
S+ArrayAnalyzer (Insightful)                    X               X                 X
TeraGenomics (IMC)                              X
TM4 (Saeed et al., 2003)                        X               X                 X
(a) Check with software vendor for automated recognition of specific scanner software output files.
(b) Affymetrix is listed separately from other one-color arrays because some software only supports Affymetrix data.
DATA NORMALIZATION
Gene expression data come from a variety of sources and are measured in a variety of ways. The measurements are usually
in arbitrary units, so normalization is necessary to compare values between different
genes, samples, or experiments. The goal of
normalization is usually to produce a dimensionless number that describes the relative
abundance of each gene in each sample or
experiment. Ideally, any difference in raw expression scores for a gene across two conditions could be interpreted as having biological
significance. However, technical variation can
be introduced at many steps of experimental
sample processing. Thus, normalization methods are required to minimize the impact of
technical variation.
Affymetrix probe-level analysis methods
usually apply normalization steps during the
course of calculating a summary score and are,
therefore, discussed in detail below.
One-Color Arrays
A gene expression score corresponding to
the fluorescence intensity from an array feature cannot be easily converted to a measurement with biologically relevant units (such as
transcripts per cell or microgram RNA). Since
the goal of gene expression studies is generally to identify differentially expressed genes
between or among conditions, biologically relevant units are not necessary. Rather, it is necessary to apply normalizations, i.e., mathematical
techniques that minimize bias introduced by
technical variation, so that the measurements
can nevertheless be compared across arrays
and even across genes.
Common global normalization methods
employed to correct for chip-to-chip variation
include median-centering normalization (dividing all measurements on an array by the median feature measurement), mean-centering
normalization, Z-score normalization, quantile normalization, median polishing, and
selected-gene-set (i.e., housekeeping genes)–
based normalizations.
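Two of the global methods listed above, median-centering and quantile normalization, can be sketched in a few lines. This is an illustrative sketch on made-up intensities, not the implementation used by any particular software package; the function names are hypothetical, and ties and missing values are not given any special treatment.

```python
from statistics import median

def median_center(array_values):
    """Divide every measurement on one array by that array's median."""
    m = median(array_values)
    return [v / m for v in array_values]

def quantile_normalize(arrays):
    """Force all arrays (equal-length lists) to share one distribution."""
    n_features = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_features)
    ]
    normalized = []
    for a in arrays:
        # Feature indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_features), key=lambda i: a[i])
        out = [0.0] * n_features
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference quantile
        normalized.append(out)
    return normalized

# Hypothetical raw intensities for three features on two arrays.
print(median_center([2.0, 4.0, 8.0]))
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

Median-centering only rescales each array, whereas quantile normalization changes the shape of each array's distribution, which is a stronger assumption about the data.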
Affymetrix Data Pre-Processing and
Normalization: Probe Level Analysis
Because multiple methods exist for calculating a summary score for the probes that correspond to a single transcript on an Affymetrix
GeneChip, and because such methods can produce different results when the output is used
to perform secondary analyses, an overview of
the most common available methods is provided. Table 7.1.2 indicates the probe-level
processing capabilities of data analysis software packages.
The original method for probe-level
analysis, the average difference method
(Affymetrix Microarray Analysis Suite, MAS
version 4.0, Santa Clara) was applied to most
Affymetrix data dating from before 2001.
MAS 4.0 calculated a robust average from the
set of differences between the perfect match
(PM) features and their respective mismatch (MM) features
over a probeset.
When this method is assessed for precision, consistency,
specificity, and sensitivity using data from a
GeneLogic study (2002) in which several different cRNA sequences were spiked into RNA
samples at known amounts, as well as a companion study in which RNA samples were applied
to Affymetrix arrays in a dilution series, the results
indicate that the summary scores calculated for
a single gene do not correlate linearly over a
range of spike-in or dilution concentrations.
To reduce the dependence on variance,
which is not equal over the intensity range (contrary to what
the original Affymetrix model assumes), and to
prevent the generation of meaningless negative
signal statistics, Affymetrix MAS 5.0 probe-level
analysis (Affymetrix Microarray Analysis Suite, version 5.0, Santa Clara) includes
an adjustment to avoid calculating negative
numbers and calculates a Tukey-biweight-derived robust mean over log-transformed values.
In both MAS versions, array-to-array technical variation is reduced by scaling all measurements
further to a trimmed mean over the array.
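The robust averaging step can be illustrated with a one-step Tukey biweight estimate. The sketch below covers only that step: the background and mismatch adjustments that precede it in MAS 5.0 are not shown, the input values are made-up log2 probe-pair signals, and the tuning constants `c = 5` and `epsilon = 0.0001` are assumptions based on commonly cited defaults rather than values taken from this unit.

```python
from statistics import median

def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight estimate of the center of `values`.

    Values far from the median (relative to the median absolute
    deviation) receive low or zero weight, so outliers have little influence.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - m) / (c * mad + epsilon)     # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) <= 1 else 0.0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical log2 probe-pair signals for one probeset, with one outlier.
log_signals = [8.0, 8.1, 7.9, 8.2, 12.0]
print(tukey_biweight(log_signals))
print(sum(log_signals) / len(log_signals))  # ordinary mean, pulled up by the outlier
```

Comparing the two printed values shows why a robust mean is preferred for probesets in which a single probe pair behaves aberrantly.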
Li and Wong (2001) have since developed
the Model Based Expression Index (MBEI)
for calculating summary scores, and Irizarry
et al. have proposed the Robust Multi-chip Averages method (RMA; Irizarry et al., 2003).
The RMA method is notable for employing
quantile normalization that forces the distributions of probe-level measurements to be
equal across multiple arrays before probe-set
summaries are calculated. A modification on
RMA, GC-RMA (Wu et al., 2004) uses the
mismatch data that RMA ignores to model
Table 7.1.2 Software Solutions Offering Affymetrix Probe-Level Analysis
Software (manufacturer or citation)
GC-RMA
MAS5
MBEI
PLIER
Acuity (Molecular Devices)
X
ArrayAssist (Stratagene)
X
X
X
X
ArrayAssist Lite
(Affymetrix/Stratagene)
X
X
X
X
Avadis (Strand Genomics)
X
X
X
X
Bioconductor
X
X
X
X
dChip (Li and Wong, 2001)
DecisionSite for Microarray
Analysis (Spotfire)
GeneSpring (Agilent Technologies)
X
X
X
Expressionist (GeneData)
X
X
X
X
Genowiz (Ocimum Biosolutions)
S+ArrayAnalyzer (Insightful)
TeraGenomics (IMC)
X
X
X
X
RMAExpress 0.4.1 (Bolstad, 2006)
RMA
X
X
X
X
X
X
X
the effects of GC-content on nonspecific
binding. Affymetrix has introduced the PLIER
algorithm that applies similar corrections to
RMA and GC-RMA. The results of these algorithms and other algorithmic variations, as
applied to the GeneLogic spike-in and dilution series datasets, can be viewed at the
Affycomp Website of Cope et al. (2004):
http://affycomp.biostat.jhsph.edu.
Two-Color Arrays
In a two-color array, two measurements
are obtained for each feature. Typically, these
measurements are derived from the hybridization of control and experimental samples, each
with different-colored dyes. Because of differences in the rates of dye incorporation, and
different detection efficiencies of the scanners
for different wavelengths, it is often worthwhile to normalize with respect to the overall
expression of each dye, in the same way as
the per-chip normalization described above
for one-color measurements. A normalization that uses Lowess curve fitting (Yang et al.,
2002) works well in this situation.
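As a simpler stand-in for Lowess, the sketch below normalizes with respect to the overall dye balance by subtracting the median log ratio from every feature. It is a global correction only; a Lowess-based normalization would replace the single median with a fit that varies with feature intensity. The intensities and the function name are hypothetical.

```python
import math
from statistics import median

def normalize_log_ratios(red, green):
    """Return log2(red/green) per feature, shifted so the median is zero."""
    m_values = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(m_values)  # overall dye bias on this array
    return [m - shift for m in m_values]

# Hypothetical intensities in which the red channel is uniformly twice as bright.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
print(normalize_log_ratios(red, green))
```

The correction rests on the assumption that most features are not differentially expressed, so the typical log ratio on an array should be zero.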
DATA ANALYSIS
Identifying Differentially Expressed
Genes
Once normalizations similar to those described above have been applied to gene expression data, the results are ratios that can be
analyzed using standard statistical techniques.
While earlier microarray data analyses have
focused on using fold change as the sole criterion for differential expression, the limitations
of this approach have become apparent. Traditional data-driven statistical techniques, such
as ANOVA and t tests, base the significance of
differential expression on the variation across
replicate groups versus that within replicate
measurements. Most of these techniques depend upon the existence of replicate measurements to estimate variability and error. When
replicates are not available, there are alternative methods that can be used to estimate error;
these are described below.
When analyzing ratios resulting from microarray data, it is generally a good idea to first
apply a log transform. This is because treatment effects on gene expression levels are generally believed to be multiplicative, so that
after a log transform they fit an additive model. The log
transform therefore places the data on a linear scale, and the resulting values are symmetric about zero. The most straightforward
method of identifying differential expression
is to apply a series of t tests on the log ratios,
on a gene-by-gene basis. For two-color data,
where the ratios contain expression values for
both the control and treatment, a one-sample t
test can be performed. When comparing across
more than two treatments, e.g., in a time-series
experiment, this approach can be generalized
by using ANOVA instead of simple t tests. For
each gene, the means in all treatment conditions are compared simultaneously and a single p value is generated. If the p value falls below the threshold, the gene being tested can be
considered differentially expressed. More sophisticated ANOVA models that analyze the
variance across an entire experiment at once
have also been suggested (Kerr et al., 2000).
One particular statistical method, Significance
Analysis of Microarrays (SAM), has been devised by Tusher et al. (2001). This popular
statistical method for finding genes exhibiting statistically significant differences in expression involves comparing mean differences
and standard deviations in defined groups to
those obtained when the same data is randomly
permuted.
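The gene-by-gene one-sample t test on log ratios can be sketched as follows. The ratios are made up, the function name is hypothetical, and only the t statistic and its degrees of freedom are computed; the p value would then be obtained from a t distribution with those degrees of freedom, which is not part of the Python standard library.

```python
import math
from statistics import mean, stdev

def one_sample_t(ratios):
    """t statistic testing whether the mean log2 ratio differs from zero.

    ratios: treatment/control ratios for one gene across replicate arrays.
    At least two replicates with non-identical ratios are required;
    otherwise the standard deviation is undefined or zero.
    """
    logs = [math.log2(r) for r in ratios]
    n = len(logs)
    t = mean(logs) / (stdev(logs) / math.sqrt(n))
    return t, n - 1  # statistic and degrees of freedom

# Hypothetical ratios for one gene measured on three replicate two-color arrays.
t_stat, df = one_sample_t([2.0, 4.0, 8.0])
print(t_stat, df)
```

Because the test is applied on the log scale, a two-fold increase and a two-fold decrease contribute equal and opposite values, which is the symmetry about zero noted above.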
Microarray experiments typically result in
expression information for thousands of genes.
When performing univariate tests for every
gene in an experiment, inflated experiment-wise error rates and false positives become issues that need to be addressed. In this situation,
it is generally a good idea to apply some sort of
multiple testing correction (MTC). A method
for controlling the false discovery rate, such
as Benjamini-Hochberg, represents a reasonable approach to controlling the yield of false
positives and false negatives (Benjamini and
Hochberg, 1995).
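The Benjamini-Hochberg procedure can be sketched as the computation of adjusted p values, which are then compared with the desired false discovery rate. The p values below are made up and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    n = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values never
    # increase as the raw p value decreases.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene p values from univariate tests.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
# Genes whose adjusted p value falls at or below a 10% false discovery rate.
print([i for i, q in enumerate(adjusted) if q <= 0.10])
```

With thousands of genes, the adjustment is considerably less conservative than a Bonferroni-style correction, which is one reason false discovery rate control is widely used for microarray data.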
Error models
In cases where replicate measurements are
not available, standard statistical formulas for
variance and standard error cannot be used.
It is still possible to estimate variances under
these circumstances through the technique of
pooling residuals from many different genes. It
has been observed that gene expression variability is a function of the “normal” expression level. This quantity can be measured using control samples, or through normalization
procedures similar to those detailed above (see
Normalization). Because of this dependence,
the pooling of error information is usually
done locally. Another approach is to apply a
variance stabilizing transformation to the entire sample or experiment. Once this has been
done, all genes can be assumed to have similar
variance and thus all measurements in a given
sample can be used to compute a common
variance (Durbin et al., 2002). These techniques can also be used in cases where very
few replicates are available, where they can
lead to more reliable estimates of error. Nevertheless,
true replicate measurements are always
the best source of information about error and
variance.
Volcano Plots
A popular graphic to display results from a
pairwise statistical comparison between conditions is the volcano plot. By displaying the
p value from a statistical test, such as the two-sample t test, against the average fold-change,
usually on logarithmic axes in a 2-dimensional plot, the
volcano plot summarizes analysis results by
the two desired criteria, statistical significance
and fold-change. This technique provides an
easy method for users to identify differentially
expressed genes that are both statistically significant and
exhibit differences that the user considers biologically meaningful.
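The coordinates of a volcano plot, and the selection of genes that satisfy both criteria, can be sketched as follows. The gene names, fold changes, p values, and cutoffs are hypothetical; plotting itself is left to whatever graphics tool is at hand.

```python
import math

def volcano_points(results, fold_cutoff=2.0, p_cutoff=0.05):
    """Compute volcano plot coordinates for each gene.

    results: mapping of gene -> (average fold change, p value).
    Returns tuples (gene, log2 fold change, -log10 p value, selected).
    """
    points = []
    for gene, (fold_change, p_value) in results.items():
        x = math.log2(fold_change)   # horizontal axis: effect size
        y = -math.log10(p_value)     # vertical axis: statistical significance
        selected = abs(x) >= math.log2(fold_cutoff) and p_value <= p_cutoff
        points.append((gene, x, y, selected))
    return points

# Hypothetical results from a pairwise comparison.
results = {
    "gene_a": (4.0, 0.001),    # large and significant change
    "gene_b": (1.1, 0.0001),   # significant but small change
    "gene_c": (0.25, 0.5),     # large but not significant change
}
for point in volcano_points(results):
    print(point)
```

Only genes in the upper left and upper right regions of the plot meet both criteria, which in this made-up example is gene_a alone.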
Table 7.1.3 provides a non-exhaustive list of
statistical features offered by microarray data
analysis software providers.
Identifying Expression of Splice
Variants
Exon arrays attempt to include sets of features from RNA transcripts that can differentiate between specific transcripts where a
gene is known to have multiple splice variants. In this case, additional analysis is required to determine if differential expression
is present in the form of the expression of different splice variants. Software such as Exon
Array Computation Tool (Affymetrix), ChipInspector Exon Array (Genomatix), and Partek
Genomics Suite (Partek) offer exon array analysis to investigate this specific case of differential expression.
Table 7.1.3 Software Solutions Offering Statistical Analysis Features for Microarray Data
Software (manufacturer or citation)                           Statistical features (a)
ArrayAssist (Stratagene)                                      ANOVA, MTC, t tests, volcano plots
ArrayStat (Imaging Research)                                  ANOVA, MTC, EM, power analysis, t tests
Avadis (Strand Genomics)                                      ANOVA, paired t tests, volcano plots
Bioconductor (Gentleman et al., 2004)                         ANOVA, MTC, power analysis, SAM, t tests
ChipInspector (Genomatix)                                     SAM-like t test
d-Chip (Li and Wong)                                          ANOVA, MTC, t tests
Expressionist (Genedata)                                      ANOVA, t tests
GeneLinker Gold/Platinum (Improved Outcomes Software)         ANOVA, MTC, t tests
GeneSifter (VizX Labs)                                        ANOVA, MTC, t tests
GeneSight (BioDiscovery)                                      ANOVA, MTC, t tests
GeneSpring (Agilent)                                          ANOVA, EM, MTC, t tests, volcano plots
J-Express Pro (Molmine)                                       ANOVA, SAM
Partek Genomics Suite (Partek)                                ANOVA, MTC, t tests, volcano plots
Rosetta Resolver System (Rosetta Biosoftware)                 ANOVA, EM, MTC, t tests
S+ArrayAnalyzer (Insightful)                                  ANOVA, EM, MTC, paired t test, volcano plots
SAM (Tusher et al., 2001)                                     SAM
SAS Microarray (SAS)                                          ANOVA, MTC, power analysis, t tests
Spotfire’s Decision Site for Microarray Analysis (Spotfire)   ANOVA, t tests
TeraGenomics (IMC)                                            t tests
TM4 (Saeed et al., 2003)                                      ANOVA, SAM,
(a) Abbreviations: ANOVA, analysis of variance; EM, error model(s) to estimate variance; MTC, multiple testing
corrections; SAM, significance analysis of microarrays.
Other techniques involve using arrays with
probes that represent both exons and exon-exon junctions. These techniques have the
advantage that they can unambiguously
identify the presence of exon skipping events
and other post-transcriptional modifications
(Fehlbaum et al., 2005). Blencowe and Frey
have developed the GenASAP algorithm to
work with data generated from similar arrays.
They are able to identify the frequency of excluded exons with remarkably high fidelity
(Pan et al., 2004).
Clustering
Clustering is a generic name applied to the
idea of grouping genes, usually based upon expression profiles. The general idea is that genes
with similar expression profiles are likely to
have a similar function or share other properties. To do this, the concept of “similarity” of
expression profiles needs to be defined.
The objective is to define a function that
produces a score of the similarity of expression
patterns of two genes. There are various ways
to do this; the most common are distance formulas and various correlations. The simplest
method for finding similar genes is to compare the expression pattern for a single gene
against all the other genes in the experiment.
This finds genes that have an expression profile similar to the gene of interest. Hopefully,
the similar genes will somehow be related.
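The simplest approach, scoring every gene against a single gene of interest, can be sketched with the Pearson correlation as the similarity function. The gene names and expression profiles are made up, and the function names are hypothetical; a distance formula could be substituted for the correlation without changing the overall structure.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def most_similar(target, profiles, top=2):
    """Rank all other genes by correlation with the target gene's profile."""
    scores = {
        gene: pearson(profiles[target], profile)
        for gene, profile in profiles.items()
        if gene != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical expression profiles across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],   # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # mirror image of g1
    "g4": [1.0, 3.0, 2.0, 4.0],   # loosely follows g1
}
print(most_similar("g1", profiles, top=2))
```

Because correlation ignores the absolute level of expression, g2 scores as highly similar to g1 even though its values are roughly twice as large; a Euclidean distance would rank the genes differently.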
Often, the goal is to find distinct groups
of genes that have a certain, similar pattern.
When one has no idea of what to look for
in advance, all the genes can be divided up
according to how similar they are to each
other. There are many clustering algorithms;
two of the most common are k means and self-organizing maps. In both algorithms, the number of groups desired is roughly specified, and
the genes are divided into approximately that
number of hopefully distinct expression patterns. These algorithms are computationally
intensive and are typically performed using
software (e.g., UNIT 7.3).
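The k means procedure can be sketched as an alternation between assigning each gene to its nearest cluster center and recomputing the centers. This is a minimal illustration on made-up profiles: it uses squared Euclidean distance, a fixed number of iterations, and a naive initialization (the first k profiles), all of which are simplifying assumptions that dedicated software handles more carefully.

```python
def kmeans(profiles, k, iterations=20):
    """Group expression profiles into k clusters.

    profiles: list of equal-length lists of expression values.
    Returns (cluster index per profile, list of cluster centroids).
    """
    centroids = [list(p) for p in profiles[:k]]  # naive initialization
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        for i, p in enumerate(profiles):
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical profiles: two rising genes and two falling genes.
profiles = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [1.1, 2.1, 2.9],
    [2.9, 1.9, 1.2],
]
assignment, centroids = kmeans(profiles, k=2)
print(assignment)
```

The result of k means depends on the initialization and on the chosen k, so in practice the algorithm is usually run several times with different starting points.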
Another common method for clustering expression data is called “hierarchical clustering” or “tree building.” When a phylogenetic
tree (see Chapter 6) is constructed, organisms
with similar properties are clustered together.
A similar approach can be used to
make a tree of genes, such that genes with similar expression patterns are grouped together.
The more similar the expression patterns, the
further down on the tree those genes will be
joined. A similar tree can be made for experiments or samples, where experiments or samples
that affect genes in similar ways can be
clustered together. This technique has an advantage over the above-mentioned methods in
that the number of groups does not need to be
specified in advance. Groups of genes can then
be extracted as branches of the tree.
Table 7.1.4 lists software solutions according to whether each of the three most common clustering methods for microarray gene
clustering is offered; however, several other
equally valid clustering algorithms are also offered. Note, too, that even when the same clustering algorithm is provided, parameters such
as the linkage method in the case of hierarchical
clustering can vary, leading to nonidentical results from the same input data.
Principal Components Analysis
Principal Components Analysis (PCA) attempts to reduce the dimensionality of high-dimensional microarray data by finding the dominant trends among genes or samples and expressing each gene or sample as a sum of a
small number of profiles. PCA is more commonly performed on samples than genes to
assess whether samples of a common class
or treatment cluster together when plotted according to their correlation by the first two
or three principal components. PCA may also
help in identifying suspect samples that do not
group with their biological replicate cohorts.
Table 7.1.4 summarizes software solutions
according to the availability of common clustering algorithms and principal components
analysis.
Classification
Classification of tumors and other tissues is
another potentially useful application of gene-expression data. These techniques can be used
to find genes that are good predictors for cancers and other conditions, to verify tissue classifications obtained by other means, and for diagnostic purposes. The clustering techniques
mentioned previously (see Clustering) can be
applied to samples instead of genes; when used
in this context, they are examples of unsupervised learning, i.e., the identification of new or
unknown classes using gene-expression profiles. The term supervised learning refers to
the classification of samples or tissues into
known classes. In this setting, a set of tissue
samples where the classification is previously
known, e.g., cancerous tumors, is analyzed in
a microarray experiment. The resulting gene-expression data can then be used to classify or
predict the class of new samples based on their
gene-expression levels.
Table 7.1.4 Software Solutions Offering Tools for Clustering and PCAa
Software (manufacturer or citation)
HC
k means
PCA
SOMs
Acuity (Molecular Devices)
X
X
X
X
ArrayAssist (Stratagene)
X
X
X
X
Avadis (Strand Genomics)
X
X
X
X
Bioconductor (Gentleman et al., 2004)
X
X
X
d-Chip (Li and Wong, 2001)
X
Expressionist (Genedata)
X
X
X
X
GeneLinker Gold (Improved Outcomes Software)
X
X
X
X
GeneSifter (VizX Labs)
X
X
X
GeneSight (BioDiscovery)
X
X
X
X
GeneSpring (Agilent)
X
X
X
X
Genowiz (Ocimum Biosolutions)
X
X
X
X
J-Express Pro (Molmine)
X
X
X
X
Partek Genomics Suite (Partek)
X
X
X
Rosetta Resolver System (Rosetta Biosoftware)
X
X
X
X
S+ArrayAnalyzer (Insightful)
X
X
SAS Microarray (SAS)
X
X
Spotfire’s Decision Site for Microarray Analysis (Spotfire)
X
X
X
X
TM4 (Saeed et al., 2003)
X
X
X
X
a Tools are tabulated for the three most common clustering methods; some software solutions contain additional
clustering algorithm options. Hierarchical clustering (abbreviated HC) is most often available for both genes and
samples, while k means and self-organizing maps (SOMs) are usually only available for genes. PCA may be provided
for genes and/or samples.
There are a number of statistical techniques
and algorithms that can be applied to perform
class prediction. They include various types of
discriminant analyses, nearest-neighbor techniques, classification trees, and others that fall
under the general heading of machine learning (Dudoit et al., 2000). In all of these techniques, the basic steps followed are similar.
First, predictor genes are chosen based on their
expression profiles in the samples with known
classification. These tend to be genes whose
expression profiles are significantly different
between the classes of interest, and thus are
good for discriminating between those classes.
Next, the expression profiles for these genes in
the samples of unknown classification are examined. This information is then used to place
the new or unknown samples into the appropriate classes. If the set of samples being classified has already been classified by alternative
clinical methods, this can be used as a validation or verification of those methods. For samples not yet classified, this information is potentially valuable for diagnostic purposes. The
following software packages contain one or
more methods for class prediction: ArrayAssist (Stratagene), Avadis (Strand Genomics),
PAM (Tibshirani et al., 2002), dChip (Li
and Wong, 2001), Expressionist (GeneData),
GeneLinker Platinum (Improved Outcomes),
GeneSpring (Agilent Technologies), Genowiz
(Ocimum Biosolutions), Partek Genomics
Suite (Partek), Rosetta Resolver (Rosetta Inpharmatics), S+ ArrayAnalyzer (Insightful),
SAS Microarray (SAS), and TM4 (Saeed et al.,
2003).
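The basic steps of class prediction can be sketched with a nearest-centroid rule, one of the simplest members of the nearest-neighbor family. The sketch assumes that predictor genes have already been chosen, so each sample is represented only by its expression values for those genes. The class labels, expression values, and function names are hypothetical, and this is not the shrunken-centroid method implemented in PAM or the method of any other package listed above.

```python
def train_centroids(samples, labels):
    """Average the predictor-gene profiles of the samples in each known class."""
    grouped = {}
    for sample, label in zip(samples, labels):
        grouped.setdefault(label, []).append(sample)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in grouped.items()
    }

def predict(centroids, sample):
    """Assign a new sample to the class with the nearest centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical training set: two predictor genes measured in four classified samples.
training_samples = [[5.0, 1.0], [6.0, 1.2], [1.0, 5.0], [1.5, 6.0]]
training_labels = ["tumor", "tumor", "normal", "normal"]
centroids = train_centroids(training_samples, training_labels)

# Classify two samples of unknown class.
print(predict(centroids, [5.5, 0.9]))
print(predict(centroids, [1.2, 5.5]))
```

In practice, the accuracy of such a classifier should be estimated on samples that were not used to select the predictor genes or to compute the centroids.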
Sequence Analysis
Genes with similar expression profiles
may be regulated by common transcription
factors. For organisms whose genomes are
completely sequenced and mapped (e.g., S.
cerevisiae and C. elegans), high-throughput
computation now enables the search for
candidate DNA binding sites in upstream
regions of genes clustered based on expression profile similarities (Wolfsberg et al.,
1999). Software solutions offering some form of
sequence analysis include AlignACE (Hughes
et al., 2000), Gene2Promoter (Genomatix),
GeneSpring (Agilent Technologies),
MotifSampler (Thijs et al., 2002), and
TOUCAN2 (Stein et al., 2005).
Table 7.1.5 Comprehensive Microarray Data Analysis Solutions Offering Gene Ontology and/or Pathway Analysis
Software                                            Manufacturer or citation
ArrayAssist                                         Stratagene
dChip                                               Li and Wong (2001)
GeneSifter                                          VizX Labs
Genowiz                                             Ocimum Biosolutions
Rosetta Resolver System                             Rosetta Biosoftware
Avadis                                              Strand Genomics
Expressionist                                       Genedata
GeneSpring                                          Agilent Technologies
J-Express Pro                                       Molmine
Spotfire’s Decision Site for Microarray Analysis    Spotfire
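The search for candidate binding sites in the upstream regions of co-expressed genes, as described in the Sequence Analysis section, can be illustrated with a deliberately simple word-counting sketch. It assumes that upstream sequences have already been extracted, and it only reports short words (k-mers) present in most sequences of a cluster together with their frequency in a background set. It does not reproduce the Gibbs-sampling approaches used by tools such as AlignACE or MotifSampler, and all sequences and names below are hypothetical toy data.

```python
from collections import Counter

def kmers_present(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_sequence_counts(seqs, k):
    """For each k-mer, the number of sequences that contain it at least once."""
    counts = Counter()
    for seq in seqs:
        counts.update(kmers_present(seq.upper(), k))
    return counts

def enriched_kmers(cluster_seqs, background_seqs, k, min_fraction=0.75):
    """K-mers found in at least min_fraction of the cluster sequences,
    ordered by how much more frequent they are in the cluster than in the background."""
    in_cluster = kmer_sequence_counts(cluster_seqs, k)
    in_background = kmer_sequence_counts(background_seqs, k)
    results = []
    for kmer, n in in_cluster.items():
        frac_cluster = n / len(cluster_seqs)
        frac_background = in_background.get(kmer, 0) / len(background_seqs)
        if frac_cluster >= min_fraction:
            results.append((kmer, frac_cluster, frac_background))
    return sorted(results, key=lambda r: (r[2] - r[1], r[0]))

# Hypothetical upstream regions of genes from one expression cluster.
cluster = ["TTGACGCGTAAC", "CCACGCGTTGGA", "GATACGCGTCCT", "AAGGACGCGTCA"]
# Hypothetical upstream regions of genes outside the cluster.
background = ["TTGACCAGTAAC", "CCATGCATTGGA", "GATAGGCCTCCT", "AAGGTCTAGTCA"]

hits = enriched_kmers(cluster, background, k=6)
for kmer, frac_cluster, frac_background in hits:
    print(kmer, frac_cluster, frac_background)
```

A real analysis would use much longer upstream regions, allow degenerate positions, and attach a significance estimate to each candidate word.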
Pathway and Ontology Analysis
Once genes are identified as strong candidates for differential expression across the
conditions of interest, the results need to be
interpreted in the context of known biology
to extend molecular data to an understanding
of higher-level biological effects. Comparing
the list of gene profiles of interest against previously assembled lists of genes grouped by
function, pathway of action, or cellular localization can provide useful insights.
Facilitating the effort, the Gene Ontology
(GO) Consortium (UNIT 7.2) has established
standard hierarchical classifications for genes
grouped by biological process, cellular localization, or molecular function, with a fixed
and controlled vocabulary for class names
(Ashburner et al., 2000). Furthermore, the group
has embarked on gene curation efforts to assign genes to the defined classes. Investigators
can now mine NCBI’s LocusLink (UNIT 1.3) for
gene-classification information and effectively
set up classifications based on GO annotations.
Table 7.1.5 lists comprehensive microarray data analysis solutions that offer gene
ontology analysis. An extensive list of the
many other dedicated tools for Gene Ontology analysis can be found on the Website of the Gene Ontology Consortium at
http://www.geneontology.org/GO.tools.shtml.
Several online pathway databases have also
come to the fore, specifically, the Kyoto Encyclopedia of Genes and Genomes (KEGG;
Kanehisa et al., 2002; also see Internet Resources), Biocarta (see Internet Resources),
and GenMAPP, which includes analysis software (Dahlquist et al., 2002). Finally, statistics relating the expected probability of overlap to observed overlap between gene sets
can further be brought to bear to examine
the significance of potential relationships. See
Table 7.1.6 for a list of software solutions that
use data from these resources, their own manually curated data, and/or results from natural-language-processing (NLP)-based algorithms
to derive gene interaction information that can
be used to make sense of the results from microarray statistical analyses.
Table 7.1.6 Dedicated Pathway-Analysis Software and Resources for Microarray Data Analysis
Software                        Manufacturer or citation
BiblioSphere PathwayEdition     Genomatix
GenMAPP                         Dahlquist et al. (2002)
PathwayArchitect                Stratagene
Cytoscape                       Shannon et al. (2003)
Ingenuity Pathway Analysis      Ingenuity
PathwayStudio                   Ariadne Genomics
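The comparison of expected and observed overlap between gene sets mentioned above is commonly framed as a hypergeometric (one-sided Fisher) test. The sketch below is one way to compute it with the Python standard library; the text does not prescribe a particular test, so the choice of the hypergeometric model, the function names, and the example counts are assumptions made for illustration.

```python
from math import comb

def expected_overlap(n_universe, n_category, n_selected):
    """Overlap expected by chance when n_selected genes are drawn at random."""
    return n_selected * n_category / n_universe

def overlap_p_value(n_universe, n_category, n_selected, n_overlap):
    """Upper-tail hypergeometric probability of observing at least n_overlap
    category genes among n_selected genes drawn from n_universe genes."""
    upper = min(n_category, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_category, i) * comb(n_universe - n_category, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 10,000 genes on the array, 100 of them in a pathway,
# 200 genes called differentially expressed, 10 of which are in the pathway.
expected = expected_overlap(10000, 100, 200)   # 2.0 genes expected by chance
p_value = overlap_p_value(10000, 100, 200, 10)
print(expected, p_value)
```

When many categories or pathways are tested against the same gene list, the resulting p-values should be adjusted for multiple testing, for example by controlling the false discovery rate (Benjamini and Hochberg, 1995).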
INFORMATICS AND DATABASES
The primary challenges of archiving and
retrieving gene expression data are a result of
the speed at which such data can be generated.
The cost of performing array-based expression experiments has dropped significantly in
the past few years, so that even a medium-sized
microarray facility can produce data from hundreds of arrays each month. For each of these
arrays, there exist clinical and experimental
parameters that are invaluable for interpreting
the resulting expression data.
In such an environment, it becomes necessary to be able to query the data based on these
experimental parameters as well as based on
actual measurements of gene expression. A
typical query might ask to find all of the genes
that are significantly up-regulated in any sample treated with compound X. Such a question is nontrivial because it asks both a statistical question (significantly up-regulated) and a
historical question (sample treatment). In general, there are different types of tools to answer these two different types of questions.
The ability to integrate the archival tools with
the analysis tools is key to building a truly
useful informatics system.
When performing many experiments, especially in a large organization, a laboratory information management system (LIMS)
database is useful to keep track of who has
done what to which experiments, and other
useful information. This database is often custom built to work with the flow of a particular laboratory, but several commercial suppliers offer preconfigured LIMS databases. In
a high-throughput work environment, a LIMS
system can help by tracking a sample from its
creation/isolation through to the data collected
from a hybridized microarray. Often these data
are useful for quality control, e.g., for tracking contaminated reagents. LIMS systems are
often connected directly to array scanners so
that the sample annotation, data collection, and
subsequent data analysis are directed from a
single platform.
Archiving Data
In any database for microarray data, the
actual results from microarray experiments
should be stored in association with parameters that describe the experiments (i.e., the
difference between the experiments). In addition, it is necessary to archive the results of
statistical analyses (e.g., lists of genes with interesting behaviors across specific experimental parameters). There are two common techniques for this, as described in the following
sections (also see UNIT 9.1).
Text
The data may be stored as text files that have
been produced by the image analysis software.
This is an inexpensive, fast, and convenient
method for individual scientists; however, it
can make working in groups difficult, and can
increase the chances of losing historical data.
Storing data in text files lacks the facility of
a LIMS to provide a detailed description of
the experiment associated with it. For the laboratory that is confined to using flatfile data
storage, archiving can be improved by using a
document management system like Pharmatrix Base4 or a flatfile data repository like
GeneSpring WorkGroup from Agilent Technologies (see Internet Resources).
Access via Structured Query Language (SQL)
Data may be stored in an SQL-compliant
database (UNIT 9.2), preferably associated with
the LIMS for tracking production if it exists.
A variety of analysis tools can then extract
data from the database. This tends to be slow
and expensive, but it makes backing up and
archiving more reliable. The AADM database
from Affymetrix is such a database, and it integrates with the Affymetrix LIMS system. If
parameters of the experiments are stored in
this database, then they can be retrieved automatically by a number of data-analysis packages. However, such databases provide little
functionality for storing the results of statistical analyses. A dedicated enterprise-level expression repository like Agilent Technologies’
GeneSpring WorkGroup can both store raw
expression data and display the results of statistical analyses (e.g., hierarchical clustering
dendrograms) in a single package.
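The kind of query discussed earlier (all genes significantly up-regulated in any sample treated with compound X) combines experimental parameters with expression measurements, and an SQL database can answer it with a join. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The two-table schema, the column names, the thresholds (log2 ratio of at least 1, p-value of at most 0.05), and the data are all hypothetical, and they do not describe the schema of AADM or of any other product named here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    treatment TEXT NOT NULL           -- experimental parameter
);
CREATE TABLE result (
    sample_id  INTEGER REFERENCES sample(sample_id),
    gene       TEXT NOT NULL,
    log2_ratio REAL NOT NULL,         -- expression measurement
    p_value    REAL NOT NULL          -- stored statistical result
);
""")

conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [(1, "compound X"), (2, "compound X"), (3, "vehicle")],
)
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?)",
    [
        (1, "geneA", 2.3, 0.001),
        (1, "geneB", 0.2, 0.60),
        (2, "geneA", 1.9, 0.004),
        (2, "geneC", 1.4, 0.03),
        (3, "geneB", 1.8, 0.01),
        (3, "geneC", 0.1, 0.80),
    ],
)

# The "historical" part of the question is the treatment filter;
# the "statistical" part is the pair of thresholds.
query = """
SELECT DISTINCT r.gene
FROM result AS r
JOIN sample AS s ON s.sample_id = r.sample_id
WHERE s.treatment = ?
  AND r.log2_ratio >= ?
  AND r.p_value <= ?
ORDER BY r.gene
"""
up_genes = [row[0] for row in conn.execute(query, ("compound X", 1.0, 0.05))]
print(up_genes)
```

Keeping sample annotation and measurements in separate tables linked by a sample identifier is what allows the same data to be queried by either kind of criterion.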
Making Data Globally Accessible
Many people want to publish data on the
Web, and a growing number of journals mandate that expression analysis data included in
published papers be available to all readers
on the Internet. There are several methods for
making such results globally accessible as described in the following sections.
FTP
Raw data files can be placed on an FTP
server. This method is simple for the experimenter, but hard for others to use, as it requires
a detailed description of the data structure for
the data to be useful.
Public databases
The NCBI and EBI have created similar public databases (see Internet Resources).
These solutions make the data available to
anyone on the Web, and so are reasonable
for academics. In addition, Agilent’s GeneSpring WorkGroup provides users with a Web-accessible repository that can be placed outside the firewall of a particular institution, so
that guest users can access selected data via a
Web browser.
CONCLUSION
Over the past few years, the hardware and
technologies underlying microarray experiments have become more readily available
to scientists interested in working with gene-expression data, and have matured to the point
at which data acquisition and quality control
are no longer the limiting factors. The focus
is increasingly on analysis and interpretation,
along with data management, storage, and accessibility. Contributions from statistics and
computational biology have led to the availability of a wide variety of models and analyses
for scientists working with microarray data.
At the same time, the ongoing development of
specialized software solutions that combine all
these aspects of the microarray experimental
process is allowing scientists to investigate
the basic biological questions that this technology was designed to address.
LITERATURE CITED
Ashburner, M., Ball, C.A., Blake, J.A., Botstein,
D., Butler, H., Cherry, J.M., Davis, A.P.,
Dolinski, K., Dwight, S.S., Eppig, J.T., Harris,
M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A.,
Lewis, S., Matese, J.C., Richardson, J.E.,
Ringwald, M., Rubin, G.M., and Sherlock, G.
2000. Gene Ontology: Tool for the unification
of biology. The Gene Ontology Consortium.
Nature Genet. 25:25-29.
Benjamini, Y. and Hochberg, Y. 1995. Controlling
the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist.
Soc. B 57:289-300.
Bolstad, B.M. 2006. RMAExpress. URL:
http://rmaexpress.bmbolstad.com/.
Brenner, S., Johnson, M., Bridgham, J., Golda, G.,
Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S.,
Foy, M., Ewan, M., Roth, R., George, D., Eletr,
S., Albrecht, G., Vermaas, E., Williams, S.R.,
Moon, K., Burcham, T., Pallas, M., DuBridge,
R.B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K. 2000. Gene expression analysis by
massively parallel signature sequencing (MPSS)
on microbead arrays. Nat. Biotechnol. 18:630-634.
Cope, L.M., Irizarry, R.A., Jaffee, H.A., Wu, Z., and
Speed, T.P. 2004. A benchmark for Affymetrix
GeneChip expression measures. Bioinformatics
20:323-331.
Dahlquist, K.D., Salomonis, N., Vranizan, K.,
Lawlor, S.C., and Conklin, B.R. 2002.
GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat.
Genet. 31:19-20.
Dudoit, S., Fridlyand, J., and Speed, T. 2000. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression
Data. Tech. Rep. 576, Dept. of Statistics, University of California, Berkeley.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., and
Rocke, D.M. 2002. A variance-stabilizing transformation for gene expression microarray data.
Bioinformatics 18:S105-S110.
Fehlbaum, P., Guihal, C., Bracco, L., and Cochet,
O. 2005. A microarray configuration to quantify expression levels and relative abundance of
splice variants. Nucleic Acids Res. 33:e47.
GeneLogic. 2002. Datasets.
http://www.genelogic.com/newsroom/studies/index.cfm.
Gentleman, R.C., Carey, V.J., Bates, D.J., Bolstad,
B.M., Dettling, M., Dudoit, S., Ellis, B.,
Gautier, L., Ge, Y., Gentry, J., Hornik, K.,
Hothorn, T., Huber, W., Iacus, S., Irizarry, R.,
Leisch, F., Li, C., Maechler, M., Rossini, A.J.,
Sawitzki, G., Smith, C., Smyth, G.K., Tierney,
L., Yang, Y.H., and Zhang, J. 2004. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol.
5:R80.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church,
G.M. 2000. Computational identification of cis-regulatory elements associated with groups of
functionally related genes in Saccharomyces
cerevisiae. J. Mol. Biol. 296:1205-1214.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope,
L.M., Hobbs, B., and Speed, T.P. 2003. Summaries of Affymetrix GeneChip probe level
data. Nucleic Acids Res. 31:e15.
Kanehisa, M., Goto, S., Kawashima, S., and
Nakaya, A. 2002. The KEGG databases at
GenomeNet. Nucleic Acids Res. 30:42-46.
Kerr, M.K. and Churchill, G.A. 2001. Statistical
design and the analysis of gene expression microarrays. Genet. Res. 77:123-128.
Kerr, M.K., Martin, M., and Churchill, G.A. 2000.
Analysis of variance for gene expression microarray data. J. Comput. Biol. 7:819-837.
Li, C. and Wong, W. 2001. Model-based analysis of
oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad.
Sci. U.S.A. 98:31-36.
Lipshutz, R., Fodor, S., Gingeras, T., and Lockhart,
D. 1999. High density synthetic oligonucleotide
arrays. Nat. Genet. 21:20-24.
Pan, Q., Shai, O., Misquitta, C., Zhang, W.,
Saltzman, A.L., Mohammad, N., Babak, T., Siu,
H., Hughes, T.R., Morris, Q.D., Frey, B.J., and
Blencowe, B.J. 2004. Revealing global regulatory features of mammalian alternative splicing
using a quantitative microarray platform. Mol.
Cell. 16:929-941.
Saeed, A.I., Sharov, V., White, J., Li, J., Liang,
W., Bhagabati, N., Braisted, J., Klapa, M.,
Currier, T., Thiagarajan, M., Sturn, A., Snuffin,
M., Rezantsev, A., Popov, D., Ryltsov, A.,
Kostukovich, E., Borisovsky, I., Liu, Z.,
Vinsavich, A., Trush, V., and Quackenbush,
J. 2003. TM4: A free, open-source system
for microarray data management and analysis.
Biotechniques 34:374-378.
Schena, M., Shalon, D., Davis, R.W., and Brown,
P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA
microarray. Science 270:467-470.
Shalon, D., Smith, S.J., and Brown, P.O. 1996. A
DNA microarray system for analyzing complex
DNA samples using two-color fluorescent probe
hybridization. Genome Res. 6:639-645.
Shannon, P., Markiel, A., Ozier, O., Baliga,
N.S., Wang, J.T., Ramage, D., Amin, N.,
Schwikowski, B., and Ideker, T. 2003. Cytoscape: A software environment for integrated
models of biomolecular interaction networks.
Genome Res. 13:2498-2504.
Stein, A., Van Loo, P., Thijs, G., Mayer, H.,
de Martin, R., Moreau, Y., and De Moor, B.
2005. TOUCAN2: The all-inclusive open source
workbench for regulatory sequence analysis.
Nucleic Acids Res. 33:W393-W396.
Thijs, G., Marchal, K., Lescot, M., Rombauts,
S., De Moor, B., Rouze, P., and Moreau, Y.
2002. A Gibbs sampling method to detect overrepresented motifs in upstream regions of coexpressed genes. J. Comput. Biol. 9:447-464.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu,
G. 2002. Diagnosis of multiple cancer types by
shrunken centroids of gene expression. Proc.
Natl. Acad. Sci. U.S.A. 99:6567-6572.
Tusher, V.G., Tibshirani, R., and Chu, G. 2001.
Significance analysis of microarrays applied
to the ionizing radiation response. Proc. Natl.
Acad. Sci. U.S.A. 98:5116-5121.
Wolfsberg, T.G., Gabrielian, A.E., Campbell,
M.J., Cho, R.J., Spouge, J.L., and Landsman,
D. 1999. Candidate regulatory sequence elements for cell cycle–dependent transcription in
Saccharomyces cerevisiae. Genome Res. 9:775-792.
Wu, Z., LeBlanc, R., and Irizarry, R.A. 2004.
Stochastic Models Based on Molecular Hybridization Theory for Short Oligonucleotide
Microarrays. Technical report, Johns Hopkins
University, Dept. of Biostatistics Working
Papers.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V.,
Ngai, J., and Speed, T.P. 2002. Normalization
for cDNA microarray data: A robust composite method addressing single and multiple slide
systematic variation. Nucleic Acids Res. 30:e15.
INTERNET RESOURCES
http://www.ncbi.nlm.nih.gov/geo/
The Gene Expression Omnibus (GEO)
is a public database of expression data derived from
a number of different expression analysis technologies.
http://www.ebi.ac.uk/arrayexpress/
ArrayExpress is a public repository for gene expression data, focused on providing a rich source
of experimental background for each experiment
set.
http://www.biocarta.com/genes/index.asp
Web site for Biocarta Pathways: interactive
graphic models of molecular and cellular pathways.
http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes.
Contributed by Anoop Grewal and
Peter Lambert
NextBio
Cupertino, California
Jordan Stockton
Agilent Technologies
Santa Clara, California
The Gene Ontology (GO) Project:
Structured Vocabularies for Molecular
Biology and Their Application to Genome
and Expression Analysis
UNIT 7.2
Judith A. Blake (1) and Midori A. Harris (2)
(1) The Jackson Laboratory, Bar Harbor, Maine
(2) EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, United Kingdom
ABSTRACT
Scientists wishing to utilize genomic data have quickly come to realize the benefit of standardizing
descriptions of experimental procedures and results for computer-driven information retrieval
systems. The focus of the Gene Ontology project is three-fold. First, the project goal is to compile
the Gene Ontologies: structured vocabularies describing domains of molecular biology. Second,
the project supports the use of these structured vocabularies in the annotation of gene products.
Third, the gene product-to-GO annotation sets are provided by participating groups to the public
through open access to the GO database and Web resource. This unit describes the current
ontologies and what is beyond the scope of the Gene Ontology project. It addresses the issue of
how GO vocabularies are constructed and related to genes and gene products. It concludes with
a discussion of how researchers can access, browse, and utilize the GO project in the course of
their own research. Curr. Protoc. Bioinform. 23:7.2.1-7.2.9. © 2008 by John Wiley & Sons, Inc.
Keywords: Gene Ontology • functional annotation • bioOntology
INTRODUCTION
With the age of whole genome analysis,
systems biology, and modeling of whole cells
upon us, scientists continue to work towards
the integration of vast amounts of biological
information. The goal, of course, is not the integration itself, but the ability to traverse this
information space in the quest for knowledge.
We want to construct knowledge systems so
that we can infer new knowledge from existing and emerging information. With technological advances permitting expression analysis for tens of thousands of genes at a time,
researchers seek clarity in finding and validating information.
Recently, much interest has focused on
the semantics used by information systems to
report on biological knowledge, such as
molecular function, or the parameters of experimental systems, such as with microarray
experiments. The problem has been the multiplicity of ways that the same phenomena can
be described in the literature or in database
annotations. While it is difficult to persuade
laboratory scientists to employ standardized
descriptions of experimental procedures and
results in their publications, those wishing to
utilize genomic data have quickly come to realize the significance and utility of such standards to computer-driven information retrieval
systems.
WHAT ARE ONTOLOGIES AND
WHY DO WE NEED THEM?
Ontologies, in one sense used today in the
fields of computer science and bioinformatics,
are “specifications of a relational vocabulary”
(Gruber, 1993; http://www-ksl.stanford.edu/
kst/what-is-an-ontology.html). Simply put, ontologies are vocabularies of terms used in a
specific domain, definitions for those terms,
and defined relationships between the terms.
Ontologies provide a vocabulary for representing and communicating knowledge about
some topic, and a set of relationships that
hold among the terms of the vocabulary. They
can be structurally very complex, or relatively simple. There is a rich field of study in
ontologies in computer science and philosophy (Schulze-Kremer, 1998; Jones and Paton,
1999). Most importantly, ontologies capture
domain knowledge in a computationally
accessible form (Stevens et al., 2000). Because
the terms in an ontology and the relationships
between the terms are carefully defined, the
use of ontologies facilitates making standard
annotations, improves computational queries,
and can support the construction of inference
statements from the information at hand.
Current Protocols in Bioinformatics 7.2.1-7.2.9, September 2008.
Published online September 2008 in Wiley Interscience (www.interscience.wiley.com).
DOI: 10.1002/0471250953.bi0702s23
Copyright © 2008 John Wiley & Sons, Inc. All rights reserved.
Ontology-Based Enhancement of
Bioinformatics Resources
Bioinformatics systems have long employed keyword sets to group and query
information. Journals typically provide keywords, which subsequently permit indexing of
the published articles. Hierarchical classifications (e.g., taxonomies, Enzyme Commission
Classification) have been used extensively in
biology, and molecular function classifications
started to appear with the work of Monica
Riley in the early 1990s (Riley, 1993, 1998;
Karp et al., 1999). The Unified Medical
Language System (UMLS) incorporates
multiple vocabularies in the area of medical
informatics.
In recent years, bio-information providers
have increasingly focused on the development
of bio-ontologies for capture and sharing of
data (Baker et al., 1999; Stevens et al., 2000;
Sklyar, 2001). Bio-ontologies support a shared
understanding of biological information. The
development of these ontologies has paralleled
the technological advances in data generation.
Genomic sequencing projects and microarray
experiments, alike, produce electronically
generated data flows that require computer
accessible systems to work with the information. As systems that make domain knowledge
available to both humans and computers,
bio-ontologies are essential to the process of
extracting biological insight from enormous
sets of data.
WHAT IS THE GENE ONTOLOGY
(GO) CONSORTIUM?
The Gene Ontology Consortium includes
people from many of the model organism
database groups and from other bioinformatics research groups who have joined together
to build GOs and to support their use. The
GOs, annotations to GO, and tools to support
the use of GO, are in the public domain. Information, documentation, and access to various
components of GO can be found at the GO
Web site (http://www.geneontology.org) or in
supporting publications (The Gene Ontology
Consortium, 2000, 2001, 2008).
WHAT ARE THE OBJECTIVES OF
THE GO PROJECT?
The focus of the GO project is three-fold.
The first project goal is to compile and provide the GOs: structured vocabularies describing domains of molecular biology. The three
domains under development were chosen as
ones that are shared by all organisms: Molecular Function, Biological Process, and Cellular Component. These domains are further described below. A later ontology developed by
the GO Consortium is the Sequence Ontology
(Eilbeck et al., 2005).
Second, the project supports the use of these
structured vocabularies in the annotation of
gene products. Gene products are associated
with the most precise GO term supported by
the experimental evidence. Structured vocabularies are hierarchical, allowing both attributions and queries to be made at different levels
of specificity.
Third, the gene product-to-GO annotation
sets are provided by participating groups to the
public through open access to the GO database
and Web resource. Thus, the community can
access standardized annotations of gene products across multiple species and resources. The
GO Consortium supports the development of
GO tools to query and modify the vocabularies, to provide community access to the annotation sets, and to support data exploration.
WHAT ARE THE CURRENT
ONTOLOGIES SUPPORTED BY
THE GO PROJECT?
The current ontologies of the GO project
are Molecular Function, Biological Process,
and Cellular Component. These three areas
are considered orthogonal to each other, i.e.,
they are treated as independent domains. The
ontologies are developed to include all terms
falling into these domains without consideration of whether the biological attribute is restricted to certain taxonomic groups. Therefore, biological processes that occur only in
plants (e.g., photosynthesis) or mammals (e.g.,
lactation) are included.
Molecular Function
Molecular Function refers to the elemental
activity or task performed, or potentially performed, by individual gene products. Enzymatic activities such as “nuclease,” as well
as structural activities such as “structural
constituent of chromatin” are included in
Molecular Function. An example of a broad
functional term is “transporter activity” (enabling the directed movement of substances,
such as macromolecules, small molecules, and
ions, into, out of, or within a cell). An example
of a more detailed functional term is “protein-glutamine gamma-glutamyltransferase activity,” which cross-links adjacent polypeptide
chains by the formation of the N6-(L-isoglutamyl)-L-lysine isopeptide; the gamma-carboxymide groups of peptide-bound glutamine residues act as acyl donors, and the
6-amino-groups of peptidyl- and peptide-bound lysine residues act as acceptors,
to give intra- and inter-molecular N6-(5-glutamyl)lysine cross-links.
Biological Process
Biological Process refers to the broad biological objective or goal in which a gene product participates. Biological Process includes
the areas of development, cell communication,
physiological processes, and behavior. An example of a broad process term is “mitosis” (the
division of the eukaryotic cell nucleus to produce two daughter nuclei that, usually, contain
the identical chromosome complement to their
mother). An example of a more detailed process term is “calcium-dependent cell-matrix
adhesion” (the binding of a cell to the extracellular matrix via adhesion molecules
that require the presence of calcium for the
interaction).
Cellular Component
Cellular Component refers to the location
of action for a gene product. This location may
be a structural component of a cell, such as the
nucleus. It can also refer to a location as part
of a molecular complex, such as the ribosome.
WHY DOES THE GO PROJECT
REFER TO GENE PRODUCTS?
GO vocabularies are built to support annotation of particular attributes of gene products.
Gene products are physical things, and may be
transcripts, proteins, or RNAs. The term “gene
product” covers the suite of biological (physical) objects that are being associated with GO
terms. Gene products may be polypeptides that
associate into complex entities, or “gene product groups.” These gene product groups may
be relatively simple, e.g., a heterodimeric enzyme, or very complex assemblies of a variety
of different gene products, e.g., a ribosome.
In addition, in most of the model organism
database systems, the biological object being
annotated is a loosely defined “gene” object
with the potential of producing a protein or
other molecule that could engage in a molecular function or be located in or at a particular
cellular component. The use of the term
“gene product” encompasses all these physical objects. Further development of biological
databases and information systems will support more precise descriptions of gene products, such as splice variants or modified proteins. GO vocabularies can be used to assign
attributes to any of them.
WHAT IS BEYOND THE SCOPE OF
THE GO PROJECT?
Almost as important as understanding the
scope of the GO project is understanding what
the GO project is not. The most common misapprehensions are (1) that GO is a system for
naming genes and proteins and (2) that GO attempts to describe all of biology. GO neither
names genes or gene products, nor attempts
to provide structured vocabularies beyond the
three domains described above.
GO is Not a Nomenclature for Genes
or Gene Products
The vocabularies describe molecular
phenomena, not biological objects (e.g.,
proteins or genes). Sharing gene product
names would entail tracking evolutionary
histories and reflecting both orthologous
and paralogous relationships between gene
products. Different research communities
have different naming conventions. Different
organisms have different numbers of members
in gene families. The GO project focuses on
the development of vocabularies to describe
attributes of biological objects, not on the
naming of the objects themselves.
This point is particularly important to understand because many genes and gene products are named for their function. For example, enzymes are often named for their function; the protein DNA Helicase is a physical
object that exerts the function “DNA helicase
activity,” a term in the GO molecular function
ontology (GO:0003678).
GO is Neither a Dictated Standard Nor
a Means to Unify Biological Databases
The members of the GO consortium have
chosen to work cooperatively to define and
implement the GO system in their databases.
However, the commitment is to the development of GO, the use of a common syntax
for constructing GO annotation datasets, and
the support of tools and the GO database for
community access to GO and GO association
files. Model organism databases and others
using GO do so within the context of their
own informatics systems. While GO was not
developed to unify biological databases, it is
true that the more GO is used in annotation systems, the easier it will be to navigate
bioinformation space and to harness the power
and potential of computers and computational
systems.
GO Does Not Define Evolutionary
Relationships
Shared annotations of gene products to GO
terms reflect shared association with a defined
molecular phenomenon. Multiple biological objects (proteins) can share function or cellular
location or involvement in a larger biological
process, and not be evolutionarily related in the
sense of shared ancestry. That said, many proteins that share molecular function attributes,
in particular, do share ancestry. However, the
property of shared ancestry is separate from
the property of function assignment and is not
reflected explicitly in GO associations to gene
products.
Other Ontologies Under Development
Complement GO
GO vocabularies do not describe attributes
of sequence such as intron/exon parameters, protein domains, or structural features.
They do not model protein-protein interactions. They do not describe mutant or disease phenotypes. There are efforts under way
to develop ontologies for each of these domains. The GO consortium has played a leading role in the Open Biomedical Ontologies
(OBO) effort (http://obofoundry.org/), which
aims to support the development of ontologies in the biomedical domain, with particular emphasis on a core set of interoperable ontologies, the “OBO Foundry,” which
meet a number of inclusion criteria. The
OBO Foundry requirements, detailed on the
Web (http://www.obofoundry.org/crit.shtml),
include that the ontology be orthogonal to existing ontologies, that the terms and relationships be defined, that they be publicly available, and that they be structured in a common
syntax, such as OWL or OBO format.
HOW ARE GO VOCABULARIES
CONSTRUCTED?
GO vocabularies are updated and modified
on a regular basis. A small number of GO
curators are empowered to make additions to
and deletions from GO. Currently, a Concurrent Versions System (CVS) is employed to
regulate and track changes. Those interested
can request e-mail notification of any changes.
Each committed set of changes is versioned
and archived. Suggestions from the community for additional terms or for other improvements are welcome (details below).
Properties of GO Vocabularies
GO vocabularies are DAGs
GO vocabularies are structured as directed
acyclic graphs (DAGs), wherein any term may
have one or more parents as well as zero, one,
or more children (Fig. 7.2.1). Within each vocabulary, terms are defined, and parent-child
relationships between terms are specified. A
child term is a subset of its parent(s). Thus,
for example, the fact that the nucleolus is part
of the nuclear lumen, which in turn is part of
the (eukaryotic) cell, can be captured; further,
the DAG structure permits GO to represent
“endoribonuclease” as a subcategory of both
“endonuclease” and “ribonuclease.”
Figure 7.2.1 The GO vocabularies are sets of defined terms and specifications of the relationships between them. As indicated in this diagram, the GO vocabularies are directed acyclic
graphs: there are no cycles, and “children” can have more than one “parent.” In this example,
germ cell migration has two parents; it is “part of” gamete generation and “is a” (is a subtype of)
cell migration. The GO uses these elementary relationships in all vocabularies.
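As a minimal sketch of multiple parentage, the Python snippet below stores the germ cell migration example as a child-to-parents table and collects every ancestor of a term. The dictionary layout, the function name `ancestors`, and the direct links to a single root term are simplifying assumptions made for illustration; they are not the GO data model.

```python
# Toy graph: child -> list of (relationship, parent).
# Term names come from the text; the links to the root are simplified.
PARENTS = {
    "germ cell migration": [
        ("part_of", "gamete generation"),
        ("is_a", "cell migration"),
    ],
    "gamete generation": [("is_a", "biological process")],
    "cell migration": [("is_a", "biological process")],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


germ_cell_ancestors = ancestors("germ cell migration")
```

Because the graph is acyclic, the traversal always terminates; the `seen` set only prevents a shared ancestor from being visited twice.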
GO terms with their definitions are
accessioned
Each accession ID is tracked by GO. More
precisely, the accession ID belongs to the
definition rather than to the term name. Thus, if a term name changes (e.g., from
“chromatin” to “structural component of chromatin”), but the definition for the term does
not change, the accession ID will remain the
same. Terms can become obsolete. Obsolete
terms continue to be maintained and tracked
in the GO database system.
True-path rule
The multiple parentage allowed by the
DAG structure is critical for accurately representing biology. GO developers impose an
additional constraint on the parent-child relationships specified in the vocabularies. Every
possible path from a specific node back to the
root (most general) node must be biologically
accurate.
Because some functions, processes, and
cellular components are not found in all
species, many terms will not be used to annotate gene products from a given organism. The
general working rule is that terms are included
if they apply to more than one taxonomic class.
In accordance with the true-path rule, however,
relationships between terms must be specified,
such that the path from any term leads to the
root only via parent terms that are relevant to
the organism in question. A parent term must
never be specific to a narrower taxon than any
of its children.
Relationship types
At present, GO vocabularies define two semantic relationships between parent and child
terms: “is a” and “part of.” The is a relationship is used when the child is a subclass of
the parent, e.g., “endonuclease activity” is a
subcategory of “nuclease activity.”
The part of relationship is used when the
child is a component of the parent, such as
a subprocess (“DNA replication initiation” is
part of “DNA dependent DNA replication”) or
physical part (“nucleolus” is part of “nuclear
lumen”). Further, the relationship’s meaning
is restricted in one direction but not the other:
part of means “necessarily part of” but not
“necessarily has part.” In other words, the parent need not always encompass the child, but
the child exists only as part of the parent. For
example, in the cellular component ontology,
“prereplicative complex” is a part of “nucleoplasm” although it is only present in the nucleoplasm at particular times during the cell
cycle. Whenever the prereplicative complex
is present, however, it is in the nucleoplasm.
In addition, any term may be a subtype of
one parent and part of another, e.g., “nuclear
membrane” is part of “nuclear envelope” (and
therefore also part of the nucleus) and “is a” (is a subtype of)
“organelle membrane.”
Initially, GO curators used the part of relationship to link the regulation terms with the
processes being regulated. We are now implementing the replacement of these “part of” relationships with a new relationship type called
“regulates.” We have also created a regulates
hierarchy in the graph with “regulation of biological process,” “regulation of molecular
function,” and “regulation of biological quality” as the parent nodes. These terms have been
used to create the appropriate subsumption hierarchy for terms that describe regulation of
biological processes, molecular functions, and
measurable biological attributes. In the cases
of biological process and molecular function,
automated reasoning has been used to ensure
that the regulates portion of the graph and the
portion of the graph describing the processes
being regulated are consistent. The introduction of this new relationship type better reflects
the underlying biology. Users can now choose
to exclude or include in their analyses gene
products that play a regulatory role in a biological process.
One of the limitations of GO is the paucity
of relationship types. As noted above, the is a
and part of relationships can be seen to contain
several sub-relationships. Further development and formalization of GO should result in
more robust analysis and representation of relationships among the terms; GO will use relationships drawn from the OBO Relations Ontology (http://www.obofoundry.org/ro/; Smith,
2005).
HOW DO GO VOCABULARIES
RELATE TO OTHER RESOURCES
SUCH AS INTERPRO?
Various other classification schemes have
been indexed to GO, including the SwissProt
keyword set and MetaCyc Pathways and Reactions (UNIT 1.17). These mappings are provided
to the public at the GO Web site (http://www.geneontology.org/GO.indices.shtml). They are
reviewed and updated as needed.
HOW ARE GENES AND GENE
PRODUCTS ASSOCIATED WITH
GO TERMS?
Genes and gene products can obviously
be associated with GO terms by whoever
wishes to do so. For the groups participating
in the GO Consortium, some general rules
concerning gene associations to GO have
been formulated. A gene product may be
annotated to zero or more nodes of each
ontology, and may be annotated to any level
within the ontology. A well-characterized
RNA or protein might be annotated using very
specific terms, whereas a little-studied gene
product might be annotated using only general
terms. All GO terms associated with a gene
product should refer to its normal activity
and location. Functions, processes, or localizations observed only in mutant or disease
states are therefore not included. Participating
databases contribute sets of GO annotations
to the GO site, providing a set of data in a
consistent format. Details of these conventions can be found in the GO Annotation Guide
(http://www.geneontology.org/GO.annotation.html).
Evidence Codes and Citations
Every association made between a GO
term and a gene product must be attributed to a source, and must indicate
the evidence supporting the annotation. A
simple controlled vocabulary is used to
record evidence types; it is described in the
GO Evidence Codes document (http://www.geneontology.org/GO.evidence.shtml). For a
single gene product, there may be strong evidence supporting annotation to a general term,
and less reliable evidence supporting annotation to a more specific term.
Many of the evidence codes represent certain types of experimental data, such as inferred from mutant phenotype (IMP) or inferred from direct assay (IDA), which might be
found in the literature describing a gene product. One evidence code, inferred from electronic analysis (IEA), is distinguished from
the rest in that it denotes annotations made by
computational methods, the results of which
are not usually checked individually for accuracy. Annotations using the “IEA” code are
therefore generally less reliable than those that
have other types of evidence supporting them.
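Since IEA annotations are generally less reliable, an analysis may want to set them aside. The following Python sketch shows one way to do so; the tuple layout and the GO IDs are placeholders invented for the example and do not refer to real annotations.

```python
# Each annotation is (gene_product, go_id, evidence_code).
# All identifiers below are placeholders, not real annotations.
ANNOTATIONS = [
    ("geneA", "GO:0000001", "IDA"),
    ("geneA", "GO:0000002", "IEA"),
    ("geneB", "GO:0000001", "IMP"),
]


def drop_electronic(annotations):
    """Keep only annotations whose evidence code is not IEA."""
    return [a for a in annotations if a[2] != "IEA"]


curated = drop_electronic(ANNOTATIONS)
```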
HOW DO I BROWSE GO AND FIND
GO ANNOTATIONS FOR “MY”
GENES?
Several browsers have been created for
browsing the GO and finding GO associations
for genes and gene products. These can be
accessed at the GO Web site. The AmiGO
browser, as an example, allows searches by
both GO term (or portion thereof) and gene
products. The results include the GO hierarchy for the term, definition and synonyms
for the term, external links, and the complete set of gene product associations for
the term and any of its children (Fig. 7.2.2;
http://amigo.geneontology.org/ ).
CAN I DOWNLOAD GO?
GO vocabularies, association tools, and
documentation are freely available and have
been placed in the public domain. GO is
copyrighted to protect the integrity of the
vocabularies, which means that changes to
GO vocabularies need to be done by GO
developers. However, anyone can download
GO and use the ontologies in their annotation
or database system. The GO vocabularies are
available in several formats, including OBO
(GO, 2004), OWL (Horrocks, 2003), OBO
XML, and RDF-XML. Monthly snapshots
of the OBO v1.0 format GO file are also
saved and posted on the GO Web site, which
provide other information systems with a
stable version of GO and the ability to plan for
regular updates of GO in their systems. GO is
also available in a MySQL database (UNIT 9.2);
the database schema accommodates both
vocabulary and gene association data, and
downloads with and without the gene associations are available. More information about
downloading GO can be found on the Web site
(http://www.geneontology.org/GO.downloads.shtml), as can the citation policy (http://www.geneontology.org/GO.cite.shtml).
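To give a feel for what a downloaded OBO-format file contains, here is a deliberately small Python reader for `[Term]` stanzas. It is a sketch that handles only the `id`, `name`, and `is_a` tags; the sample text, the placeholder IDs, and the function name `parse_obo_terms` are assumptions made for illustration, and a real application would normally rely on an existing OBO parser.

```python
def parse_obo_terms(text):
    """Collect id, name, and is_a parents from [Term] stanzas."""
    terms = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0].strip()  # drop trailing comment
            if tag == "id":
                current["id"] = value
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                current["is_a"].append(value)
    return terms


# Invented sample in OBO flat-file style; the IDs are placeholders.
SAMPLE = """\
format-version: 1.2

[Term]
id: GO:0000002
name: example child term
is_a: GO:0000001 ! example parent term

[Term]
id: GO:0000001
name: example parent term

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE)
```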
WHERE CAN I ACCESS AND/OR
OBTAIN THE COMPLETE GENE
PRODUCT/GO ASSOCIATION
SETS?
As with the vocabularies, the gene product/GO association sets from contributing
groups are available at the GO Web site.
Tab-delimited files of the associations between gene products and GO terms that
are made by the member organizations are
available from their individual FTP sites,
or from a link on the Current Annotations
table. The “gene association” file format is
described in the Annotation Guide (http://www.geneontology.org/GO.format.annotation.shtml). These files store IDs for objects
(genes/gene products) in the database that
contributed the file (e.g., FlyBase IDs, Swiss-Prot accession IDs for proteins), as well
as the citation and evidence data described
above. The FTP directory is found at ftp://ftp.geneontology.org/pub/go/gene-associations/.
There are also files containing Swiss-Prot/TrEMBL protein sequence identifiers for
gene products that have been annotated using
GO terms; they are available via FTP from
ftp://ftp.geneontology.org/pub/go/gp2protein/.
Figure 7.2.2 The AmiGO browser provides access to the GO and to contributed gene association sets. Queries can start with GO terms or gene product terms, and results can be filtered in
various ways. The AmiGO browser was developed by the Berkeley Drosophila Genome Project.
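The tab-delimited gene association files can be read with a few lines of Python. The sketch below assumes the column order of the gene association format (database, object ID, symbol, qualifier, GO ID, reference, evidence code, followed by further columns that are ignored here) and that comment lines start with `!`; the field labels, the sample rows, and all identifiers are invented for illustration.

```python
import csv
import io

# Labels for the first seven columns; the names are chosen for this sketch.
FIELDS = ["db", "object_id", "symbol", "qualifier", "go_id", "reference", "evidence"]


def read_associations(handle):
    """Yield one dict per annotation line, skipping '!' comment lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        # zip stops after the seven labelled columns.
        yield dict(zip(FIELDS, row))


# Invented sample rows; none of these identifiers is real.
SAMPLE_LINES = [
    "! example comment line",
    "\t".join(["EXDB", "EX:0001", "abc1", "", "GO:0000001", "EXREF:1", "IDA", "", "P"]),
    "\t".join(["EXDB", "EX:0002", "abc2", "", "GO:0000002", "EXREF:2", "IEA", "", "F"]),
]

records = list(read_associations(io.StringIO("\n".join(SAMPLE_LINES))))
```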
WHERE CAN I FIND GO
ANNOTATIONS FOR
TRANSCRIPTS AND SEQUENCES?
Gene objects in a model organism database
typically have multiple nucleotide sequences
from the public databases associated with
them, including ESTs and one or more protein
sequences. There are two ways to obtain sets of sequences with GO annotations:
(1) from the model organism databases or
(2) from the annotation sets for transcripts and
proteins contributed to GO by Compugen and
Swiss-Prot.
Obtaining GO Annotations for Model
Organism Sequence Sets
In gene association files, GO terms are associated with an accession ID for a gene or
gene product from the contributing data resource. Usually, files associating each
gene with its sequence IDs are also available from
the contributing model organism database.
For example, the Mouse Genome Informatics FTP site (ftp://ftp.informatics.jax.org/pub/
infomatics/reports) includes the gene association files contributed to GO, and other reports
that include official mouse gene symbols and
names and all curated gene sequence ID
associations.
Obtaining GO Annotations for
Transcripts and Proteins in General
Large transcript and protein sequence
datasets are annotated to GO by SwissProt/TrEMBL. These files can be downloaded
directly from the GO Web site. Species of origin for the sequence is included in the association files.
HOW CAN GO BE USED IN
GENOME AND EXPRESSION
ANALYSIS?
Using Gene Association Sets in
Annotation of New Genes
Genome and full-length cDNA sequence
projects often include computational (putative) assignments of molecular function based
on sequence similarity to annotated genes or
sequences. A common tactic is to use a computational approach to establish some threshold sequence similarity to a Swiss-Prot sequence. Then GO associations to the SwissProt sequence can be retrieved and associated
with the gene model. Under GO guidelines,
the evidence code for this event would be
IEA. For example, various permutations of this
approach were used in the functional annotation of 21,000 mouse cDNAs (The RIKEN
Genome Exploration Research Group Phase II
Team and the FANTOM Consortium, 2001).
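The similarity-based transfer described above can be sketched in Python as follows. This is only an illustration of the idea: the input layout (one best hit per gene model with a percent identity), the 50% threshold, the function name, and all identifiers are assumptions, and the sketch is not the procedure used by any particular annotation project.

```python
def transfer_annotations(best_hits, go_by_protein, min_identity=50.0):
    """Copy GO IDs from a sufficiently similar protein, tagged as IEA."""
    transferred = []
    for gene_model, (protein_id, identity) in best_hits.items():
        if identity < min_identity:
            continue  # similarity below the chosen threshold: no transfer
        for go_id in go_by_protein.get(protein_id, []):
            transferred.append((gene_model, go_id, "IEA"))
    return transferred


# Invented inputs: gene model -> (best-hit protein, percent identity).
best_hits = {"model1": ("P00001", 82.5), "model2": ("P00002", 31.0)}
go_by_protein = {
    "P00001": ["GO:0000001", "GO:0000002"],
    "P00002": ["GO:0000003"],
}

result = transfer_annotations(best_hits, go_by_protein)
```

Recording the evidence code alongside each transferred term keeps these computational assignments distinguishable from curated ones later on.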
One aspect of the use of GO for annotation of large datasets is the ability to group
gene products to some high-level term. While
gene products may be precisely annotated as
having a role in a particular function in carbohydrate metabolism (e.g., glucose catabolism),
in the summary documentation of the dataset,
all gene products functioning in carbohydrate
metabolism could be grouped together as being involved in the more general phenomenon
carbohydrate metabolism. Various sets of GO
terms have been used to summarize experimental datasets in this way. The published sets
of high-level GO terms used in genome annotations and publications can be archived at the
GO site.
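Grouping precise annotations under a chosen set of high-level terms amounts to walking up the graph from each annotated term. The Python sketch below illustrates this with a toy hierarchy; the term links, the gene names, and the choice of high-level set are assumptions made for the example.

```python
from collections import defaultdict

# Toy hierarchy: child -> parents (simplified for illustration).
PARENTS = {
    "glucose catabolism": ["carbohydrate metabolism"],
    "glycogen biosynthesis": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
    "metabolism": [],
}
HIGH_LEVEL = {"carbohydrate metabolism"}


def with_ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    found = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def summarize(annotations, high_level, parents):
    """Map each high-level term to the genes annotated at or below it."""
    groups = defaultdict(set)
    for gene, term in annotations:
        for slim_term in with_ancestors(term, parents) & high_level:
            groups[slim_term].add(gene)
    return dict(groups)


annotations = [
    ("geneA", "glucose catabolism"),
    ("geneB", "glycogen biosynthesis"),
    ("geneC", "metabolism"),
]
summary = summarize(annotations, HIGH_LEVEL, PARENTS)
```

Here geneC is annotated only to the more general term "metabolism", so it is not counted under "carbohydrate metabolism".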
Using the Gene Association Sets in
Annotation of Expression Information
The inclusion of GO annotation in microarray datasets can often reveal aspects of why a
particular group of genes share similar expression patterns. Sets of co-expressed genes can
encode products that are involved in a common
biological process, and may be localized to the
same cellular component. In cases where a few
uncharacterized genes are co-expressed with
well-characterized genes annotated to identical or similar GO process terms, one can infer
that the “unknown” gene product is likely to
act in the same process.
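One common way to judge whether a GO term is shared by a co-expressed group more often than expected is a hypergeometric test. This test is not described in the text above; it is offered here as an assumed, illustrative technique, and the function and parameter names are invented for the sketch.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, annotated_in_cluster):
    """Probability of seeing at least `annotated_in_cluster` annotated genes
    when `cluster_size` genes are drawn at random from `population` genes,
    of which `annotated` carry the GO term."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return favorable / total


# 10 genes in total, 3 annotated to the term, cluster of 3, all 3 annotated.
p_all_three = enrichment_p_value(10, 3, 3, 3)
```

A small value suggests the term is over-represented in the cluster; when many terms are tested, a multiple-testing correction would also be needed.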
Software for manipulating and
analyzing microarray gene expression data
that incorporates access to GO annotations
for genes has recently become available. For example, the Expression Profiler is a Web-based
set of tools for the clustering and analysis of
gene expression data developed by Jaak Vilo
at the European Bioinformatics Institute (EBI;
for review, see Quackenbush, 2001). One of
the tools available in this set is the EP:GO,
a tool that allows users to search GO vocabularies and extract genes associated with
various GO terms to assist in the interpretation of expression data. The GO Consortium
provides a Web presence where developers
can provide access to their GO tools (http://www.geneontology.org/GO.tools.shtml).
HOW CAN I SUGGEST
ADDITIONAL TERMS OR
CONTRIBUTE TO THE GO
PROJECT?
For changes to the ontologies, a page at
the SourceForge site allows GO users to
submit suggestions to GO curators (http://sourceforge.net/projects/geneontology). This
system allows the submitter to track the status
of their suggestion, both online and by e-mail,
and allows other users to see what changes are
currently under consideration.
GO also welcomes biologists to join
Curator Interest Groups or to participate in
meetings devoted to specific areas of the vocabularies. Both interest groups and meetings
provide mechanisms to bring GO curators
and community experts together to focus
on areas of the ontology that may require
extensive additions or revisions. Curator
Interest Groups are listed on the GO Web site
(http://www.geneontology.org/GO.interests.shtml); ontology content meetings are organized as the need arises and as biological
experts become available to participate.
GO also has several mailing lists, covering general questions and comments, the
GO database and software, and summaries
of changes to the ontologies. The lists are
described at http://www.geneontology.org/GO.mailing.lists.shtml. Any questions about
contributing to the GO project should be
directed to the GO helpdesk (gohelp@geneontology.org).
SUMMARY
The development of GO is a practical and
on-going approach to the need for consistent
and defined structured vocabularies for
biological annotations. Originating from the
biological community, the project continues
to be enhanced through the involvement
of the ontology engineers and through the
availability of software tools for access to
GO and to GO association datasets. GO
is one example of several emerging bioontology and biological standards projects
that include the work of the MGED group
(http://www.cbil.upenn.edu/Ontology/MGED
ontology.html),
various species-specific anatomies (Bard and Winter, 2001), and structured vocabularies for phenotypes and disease
states. This work facilitates both research in
comparative genomics and proteomics and
the interconnection of bioinformatics
and medical informatics systems. The GO
project continues to provide a vital and illuminating example of community development
of an information resource that benefits all
biological research.
ACKNOWLEDGEMENTS
We thank Martin Ringwald, Carol Bult, and
Jane Lomax for careful reading and useful suggestions. This work summarizes the efforts of
all the people working together as part of the
Gene Ontology Consortium. The Gene Ontology Consortium is supported by a grant to the
GO Consortium from the National Institutes
of Health (HG02273) and by donations from
AstraZeneca Inc. and Incyte Genomics.
LITERATURE CITED
Baker, P.G., Goble, C.A., Bechhofer, S., Paton,
N.W., Stevens, R., and Brass, A. 1999. An ontology for bioinformatics applications. Bioinformatics 15:510-520.
Bard, J. and Winter, R. 2001. Ontologies of developmental anatomy: Their current and future
roles. Brief. Bioinformatics 2:289-299.
Eilbeck, K., Lewis, S.E., Mungall, C.J., Yandell, M.,
Stein, L., Durbin, R., and Ashburner, M. 2005.
The Sequence Ontology: A tool for the unification of genome annotations. Genome Biol.
6:R44.
The Gene Ontology Consortium. 2000. Gene Ontology: Tool for the unification of biology. Nat.
Genet. 25:25-29.
The Gene Ontology Consortium. 2001. Creating the
gene ontology resource: Design and implementation. Genome Res. 11:1425-1433.
The Gene Ontology Consortium. 2008. The Gene
Ontology Project in 2008. Nucleic Acids Res. 36
(Database issue): D440-D444.
GO. 2004. The OBO Flat File Format Specification. Version 1.2 available at http://www.geneontology.org/GO.format.obo-1_2.shtml.
Gruber, T.R. 1993. A translational approach to
portable ontologies. Know Acq. 5:199-220.
Horrocks, I., Patel-Schneider, P.F., and van
Harmelen, F. 2003. From SHIQ and RDF
to OWL: The Making of a Web Ontology
Language. Web Semant. 1:7-26.
Jones, D.M. and Paton, R.C. 1999. Toward principles for the representation of hierarchical knowledge in formal ontologies. Data Knowl. Eng.
31:102-105.
Karp, P.D., Riley, M., Paley, S.M., Pellegrini-Toole,
A., and Krummenacker, M. 1999. EcoCyc:
Encyclopedia of Escherichia coli genes and
metabolism. Nucleic Acids Res. 27:55-58.
Quackenbush, J. 2001. Expression Profiler: A suite
of Web-based tools for the analysis of microarray gene expression data. Brief. Bioinform.
2:388-404.
The RIKEN Genome Exploration Research Group
Phase II Team and the FANTOM Consortium.
2001. Functional annotation of a full-length
mouse cDNA collection. Nature 409:685-690.
Riley, M. 1993. Functions of the gene products of
Escherichia coli. Microbiol. Rev. 57:862-952.
Riley, M. 1998. Systems for categorizing functions
of gene products. Curr. Opin. Struct. Biol. 8:388-392.
Schulze-Kremer, S. 1998. Ontologies for molecular
biology. Pacific Symp. Biocomput. 3:695-706.
Sklyar, N. 2001. Survey of existing Bio-ontologies.
Technical Report 5/2001, Department of Computer Science, University of Leipzig, Germany.
Smith, B., Ceusters, W., Klagges, B., Köhler, J.,
Kumar, A., Lomax, J., Mungall, C., Neuhaus,
F., Rector, A.L., and Rosse, C. 2005. Relations
in biomedical ontologies. Genome Biol. 6:R46.
Stevens, R., Goble, C.A., and Bechhofer, S. 2000.
Ontology-based knowledge representation for
bioinformatics. Brief. Bioinform. 1:398-414.
INTERNET RESOURCES
http://www.nlm.nih.gov/research/umls/
UMLS Unified Medical Language System.
http://www.geneontology.org/
The Gene Ontology Web site.
http://www.mged.sourceforge.net/ontologies/
The Microarray Gene Expression Data (MGED)
Society Ontology Working Group (OWG) Web site.
http://dol.uni-leipzig.de/pub/2001-30/en
A survey of existing bio-ontologies (Sklyar, 2001).
http://www.w3.org/TR/owl-guide
OWL Web Ontology Language Guide.
Analysis of Gene-Expression Data Using
J-Express
UNIT 7.3
Anne Kristin Stavrum,1,2 Kjell Petersen,3 Inge Jonassen,1,3 and Bjarte Dysvik4
1 University of Bergen, Bergen, Norway
2 Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
3 Computational Biology Unit, BCCS, University of Bergen, Bergen, Norway
4 MolMine AS, Thormoehlens, Bergen, Norway
ABSTRACT
The J-Express package has been designed to facilitate the analysis of microarray data
with an emphasis on efficiency, usability, and comprehensibility. The J-Express system
provides a powerful and integrated platform for the analysis of microarray gene expression data. It is platform-independent in that it requires only the availability of a Java
virtual machine on the system. The system includes a range of analysis tools and a
project management system supporting the organization and documentation of an analysis project. This unit describes the J-Express tool, emphasizing central concepts and
principles, and gives examples of how it can be used to explore gene expression data
sets. Curr. Protoc. Bioinform. 21:7.3.1-7.3.25. © 2008 by John Wiley & Sons, Inc.
Keywords: gene expression · J-Express · microarray · spot intensity quantitation
INTRODUCTION
The J-Express package has been designed to facilitate the analysis of microarray data with
an emphasis on efficiency, usability, and comprehensibility. An early version of J-Express
was described in an article in Bioinformatics in 2001 (Dysvik and Jonassen, 2001). This
unit describes the J-Express tool, emphasizing central concepts and principles. Examples
show how it can be used to explore gene-expression data sets.
The J-Express system provides a powerful and integrated platform for the analysis of
microarray gene-expression data. It is platform-independent in that it requires only the
availability of a Java virtual machine on the system. The system includes a range of analysis tools, and, importantly, a project-management system supporting the organization
and documentation of an analysis project.
The package can be used not only for analysis of microarray gene-expression data,
but also to analyze any set of objects where each measurement is represented by a
multidimensional vector. For example, it has been used to analyze data from 2-D gel
experiments.
J-Express allows the user to import output files from spot-quantitation programs such
as GenePix and Scanalyze and to take the data through filtering and normalization procedures to produce log-ratio data (see Basic Protocol 1). Alternatively, the user can
input externally processed gene-expression data. These data can be log-ratio type data
(relative quantitation of mRNA abundances) or more direct mRNA quantitations produced, for example, using Affymetrix technology. The program offers a choice of different unsupervised analysis methods, including clustering and projection methods (see
Basic Protocol 2). Supervised analysis methods include differential expression analysis
and gene-set enrichment approaches. For a discussion of supervised and unsupervised
analysis methods, see Background Information.
Current Protocols in Bioinformatics 7.3.1-7.3.25, March 2008
Published online March 2008 in Wiley Interscience (www.interscience.wiley.com).
DOI: 10.1002/0471250953.bi0703s21
Copyright © 2008 John Wiley & Sons, Inc.
J-Express automatically keeps track of the processing and analysis steps through which
the user takes the data. This helps the user to keep track of his/her own project and allows
documentation of produced results and visualizations.
BASIC PROTOCOL 1
CREATE A GENE-EXPRESSION MATRIX FROM SPOT INTENSITY DATA WITH J-EXPRESS
In order to analyze microarray data, J-Express creates a gene-expression matrix from
spot-intensity data. The program accepts multiple spot-quantitation file formats, and
then filters and normalizes the data before creating the final gene-expression matrix. This
protocol discusses loading and filtering raw data, as well as the normalization options.
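The overall idea of turning spot intensities into a gene-expression matrix can be sketched outside J-Express in a few lines of Python. This is not J-Express code: the intensity threshold, the use of base-2 logarithms, and the data layout are assumptions for illustration, and the normalization step is left out for brevity.

```python
from math import log2


def log_ratios(spots, min_intensity=100.0):
    """Return {gene: log2(ch1/ch2)} for spots passing a simple intensity filter."""
    ratios = {}
    for gene, ch1, ch2 in spots:
        if ch1 < min_intensity or ch2 < min_intensity:
            continue  # filtered spot: it gets no value in the matrix
        ratios[gene] = log2(ch1 / ch2)
    return ratios


def expression_matrix(arrays):
    """Rows are genes, columns are arrays; filtered spots become None."""
    per_array = {name: log_ratios(spots) for name, spots in arrays.items()}
    genes = sorted({g for ratios in per_array.values() for g in ratios})
    columns = list(per_array)
    rows = {g: [per_array[c].get(g) for c in columns] for g in genes}
    return columns, rows


# Invented intensities: (gene, channel 1, channel 2) per spot.
arrays = {
    "t0": [("g1", 400.0, 200.0), ("g2", 50.0, 300.0)],
    "t1": [("g1", 200.0, 200.0), ("g2", 800.0, 200.0)],
}
columns, rows = expression_matrix(arrays)
```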
Necessary Resources
Hardware
Suggested minimum requirements for PC system: 1 GHz processor and 1 GB
RAM. Large data sets may, however, require more resources.
Software
The J-Express software can be obtained from the Web site http://www.molmine.com/.
A license key needs to be installed; such licenses are available from the
MolMine Web site. Version 2.7 is free for academic users; see Web site for more
detailed information. The software runs on top of a Java environment, which can
be downloaded for most operating systems (e.g., Linux, Windows, and Solaris).
Most operating systems are shipped with Java installed, and upgrading or
installing new versions is very easy.
Files
A number of spot-quantitation file formats are accepted, and readers for new file
formats are easily added. As long as the files are tab-delimited text files with a
number of measured and calculated quantities for each spot, customized loaders
can be manually set up in a data format tool included in the framework.
As an example, the following protocol uses a synthetic data set, after an idea from
Quackenbush (2001). The data were generated by creating seven seed profiles
and applying noise to these. The sources and actual data are shown in
Figure 7.3.1.
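A data set of this kind can be produced with a short Python script. The sketch below follows the description in the caption of Figure 7.3.1 (uniform noise between -0.5 and 0.5 added at each time point), but the two seed profiles, the number of copies per seed, and the random seed are invented; the actual seed profiles used for the figure are not reproduced here.

```python
import random


def make_synthetic(seed_profiles, copies_per_seed, noise=0.5, rng_seed=0):
    """Generate noisy copies of each seed profile with uniform noise."""
    rng = random.Random(rng_seed)
    data = []
    for label, profile in seed_profiles.items():
        for _ in range(copies_per_seed):
            noisy = [value + rng.uniform(-noise, noise) for value in profile]
            data.append((label, noisy))
    return data


# Two illustrative seed profiles (log-ratios over six time points).
seeds = {
    "flat": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "rising": [-1.0, -1.0, 0.0, 1.0, 2.0, 2.0],
}
profiles = make_synthetic(seeds, copies_per_seed=5)
```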
1. Download the software from http://www.molmine.com. Install and start the program.
An installation program is downloaded and executed. The installation program unpacks
the J-Express program and places the files in a directory that can be chosen by the user.
The procedure is self-explanatory and straightforward.
Load spot intensity data
2. Load “raw data (SpotPix Suite)” using a flexible data-import wizard (see Fig. 7.3.2).
Add all your arrays to the experiment by dragging them into the experiment table.
Rename each row to correspond to the sample value.
More documentation on setting up your experiment can be found in the documentation
and as additional PDF files installed together with J-Express.
The wizard allows the user to set up the experiment by adding virtual arrays in a table
representing the experiment. Each row in the table represents a sample measurement,
and for each sample measurement, a number of replicate arrays can be added (these
will become additional columns in the experiment table). The first column in each row
contains the name of the measurement. This is also the name that will appear at the top of
each sample column in the final expression matrix. When the experiment has been defined
by adding arrays and linking them to data files, the next step is to perform a quality
control of the data associated with each array.
Figure 7.3.1 Synthetic data were generated from seven seed profiles by addition of (white)
noise. To the left are shown the seed profiles and to the right the resulting synthetic data. The
color of each profile is that of the seed profile from which it was generated. If the profiles are
thought of as generated from a time-series experiment, the x axis corresponds to the time points.
The y axis gives the log-ratio of a gene’s expression level (logarithm of the expression level of a
gene at a certain time point divided by its expression level in a reference sample). For example
the “black genes” have an expression level that does not change much during the time course,
whereas the “red genes” are unchanged during the first few time steps (but below reference level),
then increase through a number of time steps, and stay the same for the last few time steps. The
data were derived by defining the seven template profiles and generating profiles by adding noise,
specifically by adding random numbers between –0.5 and 0.5 (uniform probability) to each gene
at each time point. For the color version of this figure go to http://www.currentprotocols.com.
Figure 7.3.2 Data-import pipeline. Spot-intensity data are loaded from a file. A subset of the
genes is selected through a filtering step, the intensity values for the remaining genes are normalized, and log-ratios are calculated. The prepared data set is a gene-expression data matrix that
can be analyzed using, e.g., clustering methods.
Simple quality control of the array
3. To check the quality of the physical array, it is possible to use the “quality control”
tool in J-Express. The user can choose various fields in the array output file and plot
these according to their spatial location. For instance, a popular control is to plot the
background intensity to see if there is a correlation between background contribution
and spatial location.
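Besides plotting, a spatial trend in the background can be checked numerically. The Python sketch below computes a Pearson correlation between spot row position and background intensity; it is independent of J-Express, and the spot layout and values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (row, column, background) triples; the values are invented.
spots = [
    (0, 0, 100.0), (0, 1, 102.0),
    (1, 0, 120.0), (1, 1, 118.0),
    (2, 0, 140.0), (2, 1, 141.0),
]
rows = [s[0] for s in spots]
background = [s[2] for s in spots]
row_trend = pearson(rows, background)
```

A coefficient close to 1 or -1 would point to a background gradient across the array, which is the kind of artifact the quality control plot is meant to reveal.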
Perform preprocessing of each array
The Process tab enables customized routines for preprocessing each array in the experiment. Although it is possible to create an individual processing procedure for each array,
it is recommended that all arrays receive the same treatment during low-level preparation.
An easy way of doing this is to define a certain sequence (stack) of processes, try them on
a single array, and then use the “copy to all” option. What we refer to here as processes
are generally a number of routines available in the framework which can be added to
a list (the process stack). An example of such a process is the filter process, which can
read any statistics in the array output file and compare them to a value. If the comparison
is valid (or invalid, as defined by the user) the measurement (spot) is tagged as filtered.
Another process is the plot process, which can, for instance, be added to the end of each
process stack to view a scatter plot of the whole processing procedure. Normalization
processes can be added when two-channel arrays are used (see below for inter-array
normalization procedures). By adding a plot before and after a normalization process (or
moving the same plot from before the normalization process to after), it is easy to see
in which way the normalization has changed the data. Another important process worth
mentioning is the scripting process. This provides a Jython (Python in Java) interface,
which can be used to manually manipulate the data or tag the measurements as filtered.
The scripting interface can also be used as a programming interface and enables users to
develop their own Java classes for data manipulation (e.g., data transformation, filtering,
or normalization) and plug them directly into the preprocessing framework.
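The filter process described above compares one field of each spot to a value and tags the spot when the comparison holds. The plain-Python sketch below mimics that behavior; it does not use the J-Express scripting interface, and the field names, comparison symbols, and function name are hypothetical.

```python
import operator

# Supported comparisons; the symbols are a convention chosen for this sketch.
COMPARISONS = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def apply_filter(spots, field, comparison, threshold):
    """Tag every spot whose `field` satisfies the comparison as filtered."""
    compare = COMPARISONS[comparison]
    for spot in spots:
        if compare(spot[field], threshold):
            spot["filtered"] = True
    return spots


# Invented spot records with a hypothetical "intensity" field.
spots = [
    {"name": "s1", "intensity": 50.0, "filtered": False},
    {"name": "s2", "intensity": 500.0, "filtered": False},
]
apply_filter(spots, "intensity", "<", 100.0)
```

Tagging rather than deleting spots keeps the array layout intact, so several such filters can be stacked one after another.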
4. Go to the Process tab and add filtering processes to remove all unwanted measurements. Use the String filter process to remove spots that are not a part of your
experiment (e.g., control spots, spikes, and empty spots). If this is a two-channel
experiment, add the global lowess normalization process. When satisfied with the
process stack, add a plot (one- or two-channel plot) to the end and click “run to” on
the plot. Check the distribution of the measurements. If the plot looks OK, click the
“copy to all” button to copy this process stack to all arrays.
Further information about normalization
5. For two-channel arrays, normalization consists of a transformation of channel 1
that makes it comparable to channel 2. For one-channel arrays, a reference array can
take the place of the second channel. The transformation corrects for unequal quantities of hybridization material labeled with each of the two dyes, or for unequal labeling or hybridization
properties of the dyes. J-Express currently offers the user a choice
between four alternative normalization transformations for two-channel
arrays (Fig. 7.3.3) and one for one-channel arrays.
All normalization methods (substeps a to d) available in J-Express can be instructed to
use only a subset of genes in finding the normalization transformation to be applied to the
data. This subset is chosen by the user and may contain genes that are expected to change
little in expression values through the experiment.
a. The first and simplest, median transformation, multiplies all intensities in channel 1
by a number that makes the median of the intensities in each of the two channels
identical. The underlying assumption is that most genes have similar expression
level in both samples.
b. The second normalization is a linear-regression method termed MPI, since this was
supplied by the collaborating group of Martin Vingron, now at the Max Planck Institute
in Berlin (Beißbarth et al., 2001). This method also assumes that most genes have
unchanged expression levels between the two samples. First, a percentile (i.e.,
the value that lies above x% of the intensity values) is subtracted from each
channel to correct for unequal global background. Second, a multiplicative factor
is found to scale the first channel so that most of the highly expressed genes are
transformed to lie near the diagonal in a plot of intensity values.
c. The third and most popular method is the lowess normalization method, which
makes sure the mean intensity within any window along the intensity axis is equal
in the two channels (Cleveland, 1979). In addition, a procedure to account for
outliers is applied, and often this is repeated in iterations.
d. The fourth and final normalization option is to use splines (short curved segments)
to find a flexible mean between the two channels and process the data so that the
splines are as straight as possible (Workman et al., 2002). This method is similar
to the lowess method, but the implementation gives more control to the user.
Figure 7.3.3 Screen shot of the SpotPix suite in J-Express including a visualization of a lowess normalization.
The SpotPix suite shows the experimental design linking data files to samples. For each data file, a sequence of
processes (including filters, plots, and normalization procedures) can be edited and executed. The figure shows
a process batch including a plot that has been executed so that the plot is included in the screenshot. The plot
shows the “regression line” defined by the lowess normalization procedure just above the plot in the process
batch.
Try several normalization methods including lowess, change the parameters, and inspect
the results visually. The normalization window includes plots visualizing the input and
the output of the normalization algorithm (see Figure 7.3.3). The plots also provide an
indication of the quality of the intensity values and their normalization.
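To make the two-channel normalization options more concrete, the following sketch shows the median transformation and a much simplified intensity-dependent correction. The second function is only a running-mean stand-in for the idea behind lowess (equalizing the mean within a window along the intensity axis); it uses no local regression weights and no outlier iterations, and it is not the J-Express implementation. The function names, the window size, and the example intensities are assumptions made for illustration.

```python
import math
from statistics import median

def median_normalize(channel1, channel2):
    """Scale channel 1 so that its median equals the median of channel 2."""
    factor = median(channel2) / median(channel1)
    return [x * factor for x in channel1], factor

def running_mean_center(channel1, channel2, window=3):
    """Subtract a windowed mean log ratio along the mean log intensity axis.

    Simplified stand-in for lowess: plain running mean, no robustness step.
    Returns the centered log2 ratios in the original spot order.
    """
    a = [(math.log2(x) + math.log2(y)) / 2 for x, y in zip(channel1, channel2)]
    m = [math.log2(x) - math.log2(y) for x, y in zip(channel1, channel2)]
    order = sorted(range(len(a)), key=lambda i: a[i])  # spots sorted by intensity
    centered = [0.0] * len(m)
    half = window // 2
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        local = [m[order[j]] for j in range(lo, hi)]
        centered[i] = m[i] - sum(local) / len(local)
    return centered

ch1 = [100.0, 200.0, 400.0, 800.0, 1600.0]
ch2 = [150.0, 310.0, 600.0, 1150.0, 2500.0]
scaled, factor = median_normalize(ch1, ch2)  # factor is 600 / 400 = 1.5
```

Plotting channel 1 against channel 2 before and after such a transformation is the kind of visual check recommended above.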
Global normalization
6. The normalization procedures described above are typical single-array procedures
and perform normalization only on two comparable data sources (measurements
minus reference measurements, or the channels in a two-channel array). Another
class of normalization methods can normalize sets of data sources and use batches
of arrays instead of single pairs. Examples are the RMA (Irizarry et al., 2003)
procedure for Affymetrix arrays and quantile normalization (quantile normalization
is actually included in the RMA procedure).
RMA is short for Robust MultiChip Average and is available in the main J-Express file
menu. It lets the user select a batch of Affymetrix chips (as CEL files) and a definition
file (CDF). It then performs a background correction to correct for nonspecific binding, a
quantile normalization step, and finally a summarization step that uses median polish to compute a probe-set summary of
the log-transformed, normalized probe-level data. The quantile normalization method generally unifies
the expression distribution across all arrays in the experiment.
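The way quantile normalization unifies the distributions can be illustrated with a small sketch. It covers only the quantile step, not background correction or probe-set summarization, and it ignores ties; the function name and the toy values are assumptions for illustration.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of intensities per array.
    The k-th smallest value of each array is replaced by the mean of the
    k-th smallest values across all arrays. Ties are not treated specially.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step, every array contains the same set of values; only the assignment of values to probes differs between arrays.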
Compiling and post-processing the expression matrix
7. When a suitable process stack has been created for all the arrays (e.g., by constructing
it for one array and copying to the others), the next step is to compile the expression
files into a single expression matrix. Before we start this process, we must decide
what to do with measurements removed by our filtering processes.
8. Click the “Post compilation” tab and choose an imputation method.
The simplest methods are the row or column mean, but studies show that they can have
unwanted effects on your expression matrix. A better approach is to use the LSImpute
methods or the KNNimpute method (Troyanskaya et al., 2001; Bø et al., 2004).
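The idea behind nearest-neighbor imputation can be sketched as follows. This is a simplified, unweighted variant written for illustration (the published KNNimpute method uses a weighted average), and the function name, the choice of k, and the toy matrix are assumptions.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with the mean of the k most similar rows.

    Similarity is the root mean squared difference over columns that are
    present in both rows. Rows are genes, columns are arrays.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for h, other in enumerate(matrix):
                if h == i or other[j] is None:
                    continue  # a neighbor must have a value in column j
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

matrix = [
    [1.0, 2.0, None],   # value to impute
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [5.0, 6.0, 9.0],    # dissimilar profile, ignored when k=2
]
filled = knn_impute(matrix, k=2)
```

In this toy matrix the imputed value is close to 3.1, whereas the plain mean of the observed values in that column would be pulled to about 5.1 by the dissimilar fourth profile.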
9. Finally, go back to the data tab and select in what form you want the resulting data.
For two-channel arrays, log ratio matrices are mostly used; for one-channel data, it
is a good idea to use NONE, or log ratio if a reference sample has been chosen. When
“compile” is clicked, array replicates and (if chosen by the user) within-array replicates will
automatically be combined by the chosen method (the combine method). Each sample
will be added as a column and each measurement will be added as a row of an expression
matrix. This matrix will be added to a data set wrapper and put into the main project tree.
All information about the preprocessing is stored together with the expression matrix, so
that opening the raw data loader (SpotPix Suite) when this data is selected will bring up
the low-level project. The various processing procedures used can also be viewed in the
meta info window.
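The compilation step can be pictured with the following sketch, which builds a log ratio matrix with one column per sample and one row per measurement, combining replicate arrays by their mean. The array names, the `sample_of` mapping, and the use of the mean as combine method are assumptions for illustration only.

```python
import math
from statistics import mean

# Hypothetical two-channel intensities per array: gene -> (channel 1, channel 2).
arrays = {
    "sample_A_rep1": {"g1": (800.0, 200.0), "g2": (100.0, 100.0)},
    "sample_A_rep2": {"g1": (400.0, 200.0), "g2": (50.0, 100.0)},
    "sample_B_rep1": {"g1": (100.0, 400.0), "g2": (300.0, 150.0)},
}
# Which sample each array is a replicate of.
sample_of = {"sample_A_rep1": "A", "sample_A_rep2": "A", "sample_B_rep1": "B"}

def compile_matrix(arrays, sample_of, combine=mean):
    """Return (sample names, {gene: [combined log2 ratio per sample]})."""
    samples = sorted(set(sample_of.values()))
    genes = sorted({g for spots in arrays.values() for g in spots})
    matrix = {}
    for gene in genes:
        row = []
        for sample in samples:
            ratios = [math.log2(spots[gene][0] / spots[gene][1])
                      for name, spots in arrays.items()
                      if sample_of[name] == sample and gene in spots]
            row.append(combine(ratios) if ratios else None)  # None marks a missing value
        matrix[gene] = row
    return samples, matrix

samples, matrix = compile_matrix(arrays, sample_of)
```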
BASIC PROTOCOL 2
ANALYZE A GENE-EXPRESSION MATRIX USING J-EXPRESS
The J-Express program can be used to explore a gene-expression data set contained in
a gene-expression data matrix in the J-Express system. For example, one may find sets
of genes behaving in a similar manner through a time-series experiment. Most of the
methods below analyze and compare gene-expression profiles. A profile signifies a list
of expression measurements associated with one gene (a row in the gene-expression
matrix).
Profile similarity search window
This window allows the user to select one expression profile (expression measurements
for one gene through a set of experiments or time steps) and to find other genes with similar
expression profiles using any of the defined dissimilarity measures (see Background
Information). Figure 7.3.4 shows the window and illustrates the difference between two
dissimilarity measures.
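The difference between the two dissimilarity measures can be reproduced with a small sketch. The profiles below are invented for illustration: one has the same trend as the query but a much larger amplitude, the other has similar expression levels but a different trend.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_dissimilarity(x, y):
    """1 minus the Pearson correlation coefficient of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

query = [1.0, 2.0, 3.0, 4.0]
same_shape = [10.0, 20.0, 30.0, 40.0]   # same trend, much larger amplitude
close_level = [1.0, 2.0, 2.0, 1.0]      # similar magnitude, different trend

by_euclidean = min([same_shape, close_level], key=lambda p: euclidean(query, p))
by_pearson = min([same_shape, close_level], key=lambda p: pearson_dissimilarity(query, p))
```

With Euclidean distance the profile at a similar expression level is the closest match; with the correlation-based measure the profile with the same shape wins, regardless of amplitude.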
User-defined profile search
J-Express also allows the user to define a search profile and to search with it to find
all matching expression profiles in a gene-expression data matrix. The search profile
simply defines lower and upper bounds on the expression level for each array. The user
defines a search profile by using the mouse to move the lower and upper limits on the
allowed expression levels for each array. The search returns the list of genes for which all
expression values fall within the specified limits. A special feature of the profile search
is that it allows the user to “cycle” the expression profile, that is, to shift the lower/upper
bounds cyclically. This is primarily designed for time-series experiments, where it can be
interesting to see sets of genes behaving similarly but with a time difference. Figure 7.3.5
illustrates the Profile Search window.
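A minimal sketch of the bounded profile search, including the cyclic shift of the bounds, is given below. The function names and the toy time series are assumptions; in J-Express the bounds are set interactively with the mouse.

```python
def matches(profile, lower, upper):
    """True if every expression value lies within its lower/upper bound."""
    return all(lo <= v <= hi for v, lo, hi in zip(profile, lower, upper))

def cycle(bounds, shift):
    """Shift a list of bounds cyclically to the right by `shift` positions."""
    shift %= len(bounds)
    return bounds[-shift:] + bounds[:-shift] if shift else list(bounds)

def profile_search(matrix, lower, upper, shift=0):
    """Return the genes whose profiles fall within the (shifted) bounds."""
    lo, hi = cycle(lower, shift), cycle(upper, shift)
    return [gene for gene, profile in matrix.items() if matches(profile, lo, hi)]

matrix = {
    "early": [2.0, 0.1, -0.2, 0.0],   # peak at the first time point
    "late": [0.0, 2.1, 0.2, -0.1],    # same shape, one time step later
    "flat": [0.0, 0.1, 0.0, 0.1],
}
lower = [1.5, -0.5, -0.5, -0.5]
upper = [2.5, 0.5, 0.5, 0.5]

hits_unshifted = profile_search(matrix, lower, upper)          # finds "early"
hits_shifted = profile_search(matrix, lower, upper, shift=1)   # finds "late"
```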
Exploring the data using clustering and projection methods
Given a gene-expression matrix, one natural question to ask is whether the genes (rows)
and/or the arrays (columns) form groups. In other words, one can search for gene sets
having similar expression profiles under a given set of conditions. Such genes may be
hypothesized to participate in the same biological processes in the cell, for example,
taking part in the same metabolic pathway. Also, it is interesting to identify a set of
arrays that have similarities in gene-expression measurements, for example, to identify
relationships between different tumor (cancer) samples and potentially identify subtypes
represented in a cancer study.
Figure 7.3.4 The profile similarity search in J-Express allows the user to find the profiles most
similar to a query profile when a particular dissimilarity measure is used. The figure illustrates
the difference between (A) Euclidean distance, and (B) Pearson correlation–based dissimilarity
measure (mathematically, the dissimilarity measure is 1 minus the correlation coefficient). See
Background Information for more about dissimilarity measures.
In general, given a set of objects and a measure of their dissimilarity, it is reasonable
to ask whether the set can be divided into groups so that objects within each group
are relatively similar to each other, and there is less similarity between the groups.
Partitional clustering methods such as the K-means algorithm will create nonoverlapping
groups, which together include the complete set of objects. Alternatively, one may want
to organize the objects in a tree. In the tree, very similar objects are grouped together in
tight subtrees. As one moves to larger and larger subtrees (up to and including the whole
tree), more and more dissimilar objects are included. The tree structure is relatively easy
to interpret, and many biologists are used to looking at trees (e.g., phylogenetic trees).
However, one should remember that the algorithm imposes a tree structure on the data
set even though the data set may be better explained using other structures.
Figure 7.3.5 The user-defined Profile Search window in J-Express allows the user to define
a search profile consisting of a lower and upper limit on the expression values and to find all
profiles matching that profile. The search profile is defined by the red/green barred boxes and
the matching expression profiles are shown in black. For the color version of this figure go to
http://www.currentprotocols.com.
An alternative to using a clustering method is to project the objects into a two- or three-dimensional space and allow the users to visually analyze the objects in this space.
Projection methods include principal component analysis and multidimensional scaling.
The main objective of projection is to preserve (as much as possible) the information
in the lower dimensional space. Self-organizing maps (SOMs) provide an intermediate
between clustering and projection. SOMs group similar objects together, and at the same
time the groups are organized in a structure (e.g., a grid) so that groups close to each
other on the structure (e.g., neighbor nodes on the grid) contain similar objects.
Hierarchical clustering
This is a conceptually simple and attractive method. An early application of hierarchical
clustering to microarray gene-expression data was provided by Eisen et al. (1998). It
introduced an intuitive way of visualizing the expression profiles of the genes along
the edge of the resulting dendrogram. In J-Express, the user can perform hierarchical
clustering on a data set by choosing (clicking) the data set of interest in the project tree
and then choosing hierarchical clustering on the Methods pull-down menu (alternatively
a button with a tree icon can be clicked). The user then selects which distance measure
to use to calculate the tree (see Background Information). Additionally, the user can
choose which linkage rules to apply. The alternatives are single linkage, average linkage, and complete linkage. Average linkage also comes in two variants, weighted and
unweighted, corresponding, respectively, to the WPGMA and UPGMA methods well
known in clustering.
Additionally, the user can choose whether only the rows or both the rows and the columns
of the gene-expression matrix are to be clustered. The user is also given a high level of
control in defining how the results should be displayed on the screen (or in the file if the
graphics are saved to file). Figure 7.3.6 shows the results of hierarchical clustering of the
synthetic data set using J-Express and three different linkage rules.
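The effect of the linkage rule can be seen in a compact sketch of agglomerative clustering. It is a naive implementation written for readability rather than speed, it covers single, complete, and unweighted average (UPGMA) linkage but not the weighted variant, and the function names and the four toy profiles are assumptions.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Each rule reduces the list of pairwise distances between two clusters.
LINKAGE = {
    "single": min,                           # closest pair of members
    "complete": max,                         # most distant pair of members
    "average": lambda d: sum(d) / len(d),    # unweighted average (UPGMA)
}

def hierarchical_clustering(profiles, linkage="average"):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    rule = LINKAGE[linkage]
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            pair_distances = [euclidean(profiles[i], profiles[j]) for i in a for j in b]
            d = rule(pair_distances)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

profiles = [[0.0, 0.0], [0.0, 1.0], [0.0, 3.0], [0.0, 7.0]]
heights = {name: [m[2] for m in hierarchical_clustering(profiles, name)]
           for name in LINKAGE}
```

The merge order is the same for all three rules on this toy data, but the heights at which the subtrees are joined differ, which changes the shape of the dendrogram.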
K-means clustering
K-means clustering is a very simple algorithm for dividing a set of objects into K groups.
The parameter K needs to be defined prior to clustering. The algorithm chooses K points
as initial center points (centroids), one for each cluster. It then alternates between two
operations. First, given a set of centroids, allocate each data point to the cluster associated
with the closest centroid. Then, given sets of data points allocated to each of the K clusters,
calculate new centroids for each of the clusters. If, in two consecutive iterations, the same
points are allocated to each of the clusters, the algorithm has converged. The algorithm
may not converge in all cases, and it is convenient to define a maximum number of
iterations.
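The two alternating operations can be written down directly. The following is a minimal sketch with Euclidean distance, with initial centroids supplied by the caller; the function names and the example points are assumptions for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def k_means(points, initial_centroids, max_iterations=100):
    """Return (cluster index per point, final centroids)."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Operation 1: allocate each point to the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # same allocation in two consecutive iterations: converged
        assignment = new_assignment
        # Operation 2: recompute the centroid of each cluster.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = centroid(members)
    return assignment, centroids

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
assignment, centroids = k_means(points, initial_centroids=[points[0], points[3]])
```

Here the initial centroids are simply two of the data points; other starting points can lead to a different final partition.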
While the K-means algorithm is conceptually simple, it does have certain weaknesses.
One is that the user needs to define the number of clusters beforehand, and in most
cases the user will not have sufficient information. Another weakness is the initialization,
since the final result depends strongly on this. As a remedy for this second problem,
different heuristic methods have been proposed to find “good starting points,” including
the random approach, Forgy approach, MacQueen approach, and Kaufman approach
(Peña et al., 1999).
In J-Express the user starts a K-means analysis by choosing it from the Methods pull-down menu (or alternatively by clicking a short-cut button). The user needs to specify
the number of clusters, and may choose between a range of distance measures and
initializing methods. The most natural distance measure to use is the Euclidean, since the
centroids are calculated under the assumption of a Euclidean space. If one seeks clusters
of genes with correlated expression profiles, one should, instead of using a correlation-based distance measure, perform mean and variance normalization, and use a Euclidean
distance measure in the K-means analysis. Figure 7.3.7 shows the menu allowing the
user to start a K-means analysis in J-Express, including control over all the parameters
discussed.
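The reason mean and variance normalization can replace a correlation-based measure is that, for standardized profiles of length n, the squared Euclidean distance equals 2n(1 - r), where r is the Pearson correlation of the original profiles. The sketch below checks this numerically; it assumes the population variance (division by n) is used for standardization, and the two profiles are invented.

```python
import math

def standardize(profile):
    """Mean and variance normalization: zero mean, unit population variance."""
    n = len(profile)
    mu = sum(profile) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in profile) / n)
    return [(v - mu) / sd for v in profile]

def pearson(x, y):
    """Pearson correlation, computed as the mean product of standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

x = [1.0, 3.0, 2.0, 5.0]
y = [2.0, 7.0, 3.0, 9.0]
n = len(x)
lhs = squared_euclidean(standardize(x), standardize(y))
rhs = 2 * n * (1 - pearson(x, y))   # equal to lhs up to rounding error
```

Because the distance is a decreasing function of r, ranking standardized profiles by Euclidean distance gives the same order as ranking by correlation, while the centroids remain well defined.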
Principal component analysis
Principal component analysis (PCA) involves mathematical procedures that transform
a number of (possibly) correlated variables into a (smaller) number of uncorrelated
variables called principal components (Jolliffe, 1986). This approach has been popular
for analyzing gene-expression data (Holter et al., 2000; Raychaudhuri et al., 2000). The
main principle is to linearly transform a set of data points in a space of high dimensionality
to a set of data points in a space of lower dimensionality, while keeping as much of the
variance information in the data set as possible. Conceptually, the one axis through the
original space that explains most of the variation in the data set is first found. The variance
explained by an axis can be calculated by projecting all data points onto the axis and
calculating the variance of this set of (one-dimensional) numbers. Next, one removes the
contribution of this axis to the data points (by subtracting the component along the first
axis) and repeats the analysis on this new data set. This is continued until all data points
coincide in a single point, that is, until no variance remains. The axes identified in each of
these analyses constitute the principal components, and each explains a maximal amount of
variance while being orthogonal to (and thus uncorrelated with) the others.
Figure 7.3.6 Example showing hierarchical clustering of the synthetic data set using: (A) single
linkage, (B) average linkage, and (C) complete linkage. The column at the far right of each clustering
shows the seed from which each profile was generated (this is shown using the gene group visualization
functionality in the dendrogram window of J-Express).
Figure 7.3.7 (A) K-means dialog box; (B) K-means result when clustering the synthetic data set.
Each cluster is represented by its mean profile and by the bars showing the variation within the
cluster at each data point.
The PCA functionality in J-Express allows the user to project the expression profiles
of interest down to two or three dimensions in order to get a visual impression of the
similarity relationships in the data set. Flexible two- and three-dimensional visualization
functions allow the user to visually study the data points and to interactively select objects
to study. In this way, the user can access the expression profiles of any subset of data
points. For example, a two-dimensional view can give an impression of existing clusters
as well as outliers in the data set. Inspection of the shape of the principal components
themselves can also be informative.
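The conceptual procedure for finding principal components (find the axis of maximal variance, remove its contribution, repeat) can be sketched with power iteration and deflation. This is an illustrative pure-Python version, not the algorithm used by J-Express; it assumes the start vector is not orthogonal to the axis being sought, and the function names and the small data set are invented.

```python
import math

def center(data):
    """Subtract the column means so that each variable has zero mean."""
    n = len(data)
    means = [sum(row[j] for row in data) / n for j in range(len(data[0]))]
    return [[v - m for v, m in zip(row, means)] for row in data]

def project(data, axis):
    """One-dimensional coordinates of the data points along an axis."""
    return [sum(v * a for v, a in zip(row, axis)) for row in data]

def variance(values):
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def first_axis(data, iterations=200):
    """Power iteration on X^T X: converges to the axis of maximal variance."""
    dim = len(data[0])
    axis = [1.0 / math.sqrt(dim)] * dim   # assumed not orthogonal to the target axis
    for _ in range(iterations):
        scores = project(data, axis)
        new = [sum(s * row[j] for s, row in zip(scores, data)) for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:
            break  # no variance left
        axis = [v / norm for v in new]
    return axis

def deflate(data, axis):
    """Remove the component along `axis` from every data point."""
    scores = project(data, axis)
    return [[v - s * a for v, a in zip(row, axis)] for row, s in zip(data, scores)]

def principal_components(data, n_components):
    """Return a list of (axis, explained variance) pairs."""
    residual = center(data)
    components = []
    for _ in range(n_components):
        axis = first_axis(residual)
        components.append((axis, variance(project(residual, axis))))
        residual = deflate(residual, axis)
    return components

data = [[2.0, 1.0], [4.0, 2.2], [6.0, 2.9], [8.0, 4.1], [10.0, 5.0]]
(axis1, var1), (axis2, var2) = principal_components(data, 2)
```

Projecting the centered data onto the first two or three axes gives the coordinates used in a two- or three-dimensional PCA plot.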
Figure 7.3.8 (A) PCA window with applied density map and a selected green area. (B) Result
from PCA selection. For the color version of this figure go to http://www.currentprotocols.com.
The PCA window comes with a set of options for customizing and controlling the results.
For instance, one may want to apply a density map in order to easily see where in the plot
the data are the most dense (see Fig. 7.3.8). In Figures 7.3.8 and 7.3.9, the density map
is used together with a density threshold so that data points in areas that are less dense
are visualized only as spots. This makes it very easy to identify and group outlier genes,
which in this case are genes that correlate well with the selected principal components.
Figure 7.3.9 (A) PCA window with over 6000 points (genes). (B) The same number of points
with density threshold to find outliers.
In J-Express, a principal component analysis can be started from the Methods pull-down
menu or by clicking a button marked with a coordinate system.
Self-organizing map analysis
Self-organizing maps (SOMs), as originally proposed by Kohonen (1997), have been used
to study gene-expression data (Tamayo et al., 1999; Törönen et al., 1999). An attractive
feature of SOMs is that they provide a clustering of data points and simultaneously
organize the clusters themselves so that clusters with similar expression profiles are close
to each other on the map. The SOMs are trained to adapt to the expression profiles under
study, a training procedure that is affected by choice of a large number of parameters.
For example, there are parameters controlling the “stiffness” of the map. For the user to
understand the effects of changing parameter values, J-Express visualizes the training
of the SOM by projecting both the data points (expression profiles) and the neurons in
the map into a two- or three-dimensional plot. The projection is done using the most
significant principal components. Since the user can see the adaptation of the map during
the training phase, he or she can get an impression of the effect of altering the parameter
values. Of course, the user should be aware that the two- or three-dimensional plots do
not display the complete information in the data set. The program displays the proportion
of the variance explained by the utilized principal components. See Figure 7.3.10 for an
example.
Figure 7.3.10 (A) SOM training control window. (B) SOM visualized in PCA window.
After the training of the SOM, the data points are distributed between the neurons in a so-called sweep phase. In this phase, the user chooses whether the object groups collected
by the neurons should be disjoint or whether they should be allowed to overlap. The
user also sets the maximum distance between a neuron and a data point for the data
point to be associated with the neuron. If this threshold is set to a low value, one will
get “dense” clusters (low within-cluster variance), but at the same time run the risk that some
data points are not associated with any of the neurons. The visualization provided by
J-Express facilitates the understanding of such effects. A simpler interface of this method
has also been added. This interface only asks the user to select the number of neurons the
algorithm should use. When finished, the data points are allocated to the neurons using
an exclusive sweep. The SOM is started from the Methods pull-down menu in J-Express.
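The sweep phase can be illustrated with a short sketch that takes already trained neurons as given. The function name `sweep`, the neuron positions, and the data points are assumptions for illustration; training of the map itself is not shown.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sweep(points, neurons, max_distance, exclusive=True):
    """Associate data points with neurons after training.

    Returns (groups, unassigned): for each neuron the indices of the points
    associated with it, and the indices of points too far from every neuron.
    exclusive=True gives disjoint groups; exclusive=False allows overlap.
    """
    groups = [[] for _ in neurons]
    unassigned = []
    for i, p in enumerate(points):
        distances = [euclidean(p, n) for n in neurons]
        within = [k for k, d in enumerate(distances) if d <= max_distance]
        if not within:
            unassigned.append(i)
        elif exclusive:
            groups[min(within, key=lambda k: distances[k])].append(i)
        else:
            for k in within:
                groups[k].append(i)
    return groups, unassigned

neurons = [[0.0, 0.0], [2.0, 0.0]]
points = [[0.1, 0.0], [1.0, 0.0], [1.9, 0.1], [5.0, 5.0]]

disjoint = sweep(points, neurons, max_distance=1.2, exclusive=True)
overlapping = sweep(points, neurons, max_distance=1.2, exclusive=False)
dense = sweep(points, neurons, max_distance=0.5, exclusive=True)
```

Lowering the threshold from 1.2 to 0.5 tightens the groups but leaves the point midway between the two neurons unassigned, which is the trade-off described above.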
Significance analysis of microarrays
Significance analysis of microarrays (SAM) is a method that is used to find genes that
are differentially expressed between paired or unpaired groups of samples. The genes are
scored on the basis of change in gene expression between the states, combined with the
standard deviation of repeated measurements. It was developed by Tusher et al. (2001)
and uses as score a regularized version of the well known Student t test.
Since the distribution of the SAM scores is unknown, significance is estimated by randomly assigning the samples to the sample groups (permutations) and repeating the
calculation of SAM scores. This process is repeated a number of times to estimate the
distribution of random scores. The original SAM scores are compared to the distribution
of SAM scores obtained for the permuted data sets, and used to calculate false discovery
rates (FDR). The FDR is calculated for a list of genes with the highest SAM scores, and
reflects how many false positives should be expected to be among these.
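The scoring and permutation steps can be sketched as follows for an unpaired two-group comparison. This is a deliberately simplified illustration: the score follows the regularized t-like form described by Tusher et al. (2001), but the FDR estimate here is just the mean number of genes exceeding a fixed threshold in permuted data divided by the number exceeding it in the real data. The constant `s0`, the threshold, the function names, and the toy data are assumptions.

```python
import math
import random
from statistics import mean

def sam_score(group1, group2, s0=0.1):
    """Regularized t-like score: mean difference over (standard error + s0)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    pooled = (sum((v - m1) ** 2 for v in group1) +
              sum((v - m2) ** 2 for v in group2)) / (n1 + n2 - 2)
    s = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / (s + s0)

def all_scores(matrix, labels):
    """Score every gene (row) for the grouping given by labels (0 or 1)."""
    scores = []
    for row in matrix:
        g1 = [v for v, lab in zip(row, labels) if lab == 0]
        g2 = [v for v, lab in zip(row, labels) if lab == 1]
        scores.append(sam_score(g1, g2))
    return scores

def estimate_fdr(matrix, labels, threshold, n_permutations=200, seed=0):
    """Simplified FDR estimate for genes with |score| >= threshold."""
    observed = all_scores(matrix, labels)
    called = sum(1 for d in observed if abs(d) >= threshold)
    rng = random.Random(seed)
    false_calls = []
    for _ in range(n_permutations):
        permuted = labels[:]
        rng.shuffle(permuted)          # randomly reassign samples to groups
        false_calls.append(
            sum(1 for d in all_scores(matrix, permuted) if abs(d) >= threshold))
    fdr = min(1.0, mean(false_calls) / called) if called else 0.0
    return observed, called, fdr

matrix = [
    [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],   # clearly different between the groups
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    [3.0, 2.8, 3.1, 3.2, 2.9, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
observed, called, fdr = estimate_fdr(matrix, labels, threshold=5.0)
```

The constant s0 in the denominator keeps genes with very small variance from receiving extreme scores, which is the regularization referred to above.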
The SAM procedure is simple to use. The only parameters the user has to provide
are information about which groups the different samples belong to, which groups the
analysis should be performed on, and the number of permutations. SAM is available
from the Methods menu.
Rank product
Rank product is a very simple and straightforward method that can be used to find
differentially expressed genes in a data set. It was developed by Breitling et al. (2004).
Rank Products is based upon the ranks that a gene gets after calculating fold change
between pairs of samples, and then scoring the ranks the genes obtain in the different
comparisons. A gene that is ranked high in all of the comparisons will get a good score.
The simplicity of the method also makes it a suitable option for doing meta-analysis
(e.g., analysis of a set of gene rankings each based on analysis of separate data sets).
One of the drawbacks of the Rank Products method is that it may score a gene to be both
significantly up- and down-regulated. Rank Products can be started from the Methods
menu.
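The scoring can be sketched as the geometric mean of the ranks a gene obtains across comparisons. Only the score for up-regulation is shown (reversing the sort order gives the score for down-regulation), and no significance estimate is computed; the function name and the fold-change values are invented for illustration.

```python
import math

def rank_product(fold_changes):
    """Rank product per gene.

    fold_changes: gene -> list of log fold changes, one per pairwise comparison.
    Rank 1 is the most strongly up-regulated gene in a comparison, so small
    rank products indicate consistent up-regulation.
    """
    genes = list(fold_changes)
    n_comparisons = len(next(iter(fold_changes.values())))
    ranks = {g: [] for g in genes}
    for c in range(n_comparisons):
        ordered = sorted(genes, key=lambda g: fold_changes[g][c], reverse=True)
        for rank, g in enumerate(ordered, start=1):
            ranks[g].append(rank)
    # Geometric mean of the ranks over all comparisons.
    return {g: math.prod(r) ** (1.0 / len(r)) for g, r in ranks.items()}

fold_changes = {
    "g1": [2.5, 3.0, 2.8],
    "g2": [0.1, -0.2, 2.9],
    "g3": [-1.5, -2.0, -1.0],
    "g4": [1.0, 0.5, 0.2],
}
rp = rank_product(fold_changes)
best = min(rp, key=rp.get)   # gene most consistently ranked near the top
```

Because only ranks enter the score, rankings obtained from separate data sets can be combined in the same way.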
Gene ontology analysis
The Gene Ontology (GO) is a set of well defined terms used to describe genes. The
terms are organized in hierarchies (actually directed acyclic graphs, or DAGs)
reflecting the relationships between the terms. The GO component in J-Express can be
used to understand more about the gene lists resulting from the other analysis steps
performed. In addition to browsing the GO tree to see which processes the genes in
the lists are involved in, its real power becomes apparent when the GO tree of a list of
interesting genes (containing the number of genes mapped to each term in the tree) is
compared to the GO tree of a reference list (e.g., a list containing all the genes expressed
on the array). When comparing the two trees, the reference list (together with the relative
size of the list of interesting genes compared to the reference list) is used to calculate
expected values for the different entries in the GO tree and thereby to see whether the
list of interesting genes is enriched, that is, has more genes belonging to a particular entry
than would be expected by chance.
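The expected value for a GO entry follows directly from the relative list sizes, as the sketch below shows. The hypergeometric tail probability is added as one common way to judge whether the observed count exceeds expectation; whether J-Express uses this particular test is an assumption, and the function names and gene counts are invented.

```python
from math import comb

def expected_count(term_in_reference, reference_size, list_size):
    """Number of genes expected at a GO entry if the list were a random sample."""
    return list_size * term_in_reference / reference_size

def enrichment_p_value(term_in_list, list_size, term_in_reference, reference_size):
    """Hypergeometric upper tail: probability of seeing at least term_in_list
    annotated genes when list_size genes are drawn at random from the reference."""
    total = comb(reference_size, list_size)
    upper = min(list_size, term_in_reference)
    favorable = sum(
        comb(term_in_reference, k) *
        comb(reference_size - term_in_reference, list_size - k)
        for k in range(term_in_list, upper + 1)
    )
    return favorable / total

# Hypothetical counts: 50 of 1000 reference genes map to the entry,
# and 15 of the 100 interesting genes do.
expected = expected_count(50, 1000, 100)          # 5.0 genes expected by chance
p_value = enrichment_p_value(15, 100, 50, 1000)
```

When many GO entries are examined at once, the resulting p-values would normally need a correction for multiple testing.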
A file containing DAGs describing the relations between the different GO
terms can be downloaded from http://www.geneontology.org and saved to the JExpress/resources/go folder. The association between the data set and the GO
terms must be described in an association file. Association files for many organisms
can also be downloaded from the Gene Ontology Web resource and placed in the directory resources/go/goassociations under the J-Express directory in your
local J-Express installation. The user must then make sure that the identifier used in the
association file also exists as a column in the data set. The J-Express Annotation Manager
can be used to add annotation. See the J-Express Help for further details. The GO tree
method can be started from the Methods menu.
Gene set enrichment analysis
Gene set enrichment analysis (GSEA; Subramanian et al., 2005) uses external information
in a search for groups of genes that follow the same trends in a data set. It represents
an attempt to overcome some of the shortcomings of other more traditional methods
by avoiding preselection of the number of clusters and cut-offs. It can be performed on
either categorical or continuous data. Categorical data is of the type “before treatment”
versus “after treatment,” while continuous data can be time series data.
The external information used by GSEA is gene sets. Gene sets can be defined to capture
the biological relationships that the investigator is interested in. For example, a gene
set can be a list of genes sharing some terms in their descriptions, such as “apoptosis”
or “receptor activity,” and can be created from almost any source. Gene Ontology is a
commonly used source.
GSEA starts by ranking the genes using a method selected by the user. For categorical
data, this can be Golub score or SAM, which are methods used to calculate differential
expression between two groups of samples. For continuous data, the genes are ranked
according to their correlation with a particular search profile. Next the predefined gene
sets are scored by calculating an Enrichment Score (ES). The ES for a particular gene
set is calculated by starting at the top of the ranked gene list and adding a score to the
ES every time a gene that is a member of the gene set is encountered, and subtracting a
penalty from the ES every time a gene that is not a member of the gene set is encountered.
This creates what is referred to as a running sum, which can be seen in Figure 7.3.11. The
maximum score reached during the calculation of the running sum (or the minimum, when the
largest deviation is negative, as in Figure 7.3.11) is used as the ES
for the particular gene set. The significance of the gene set scores is estimated by data
permutation.
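The running sum can be sketched in its simplest, unweighted form, in which every member of the gene set adds the same score and every non-member subtracts the same penalty, chosen so that the walk ends at zero. The published GSEA method weights the increments by the ranking metric; that weighting, and the permutation step, are omitted here. The function name and the gene identifiers are assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted enrichment score for one gene set.

    ranked_genes: genes ordered from the top to the bottom of the ranking.
    Returns (ES, running sum), where ES is the value of the running sum
    with the largest deviation from zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    step_up = 1.0 / n_hits      # score added for a member of the gene set
    step_down = 1.0 / n_miss    # penalty subtracted for a non-member
    running, running_sum = 0.0, []
    for gene in ranked_genes:
        running += step_up if gene in members else -step_down
        running_sum.append(running)
    es = max(running_sum, key=abs)
    return es, running_sum

ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
es_top, walk_top = enrichment_score(ranked, {"g1", "g2", "g4"})
es_bottom, walk_bottom = enrichment_score(ranked, {"g6", "g7", "g8"})
```

A gene set concentrated at the top of the ranking gives a positive ES, and one concentrated at the bottom gives a negative ES.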
There are a few methods available that use gene sets. GSEA is well known in the
community and has been implemented in J-Express to facilitate excellent interactivity
with the gene expression data itself (through synchronization of the different viewers
in J-Express) when interpreting the gene set enrichment analysis results. GSEA can be
started from the Methods menu.
Scripting interface
The script interface really adds functionality to the software, and is available for use both
at the preprocessing and high-level analysis steps. It allows the user to automate the data
analysis and can thus save time when performing repetitive operations on the data sets.
The user gets access to the data objects through the script interface and may manipulate
the data matrices by using some of the methods built into J-Express, or by writing his
or her own scripts. The user can also connect data from J-Express with his/her own Java
classes.
Figure 7.3.11 Results from a GSEA analysis (window bottom right). The figure in the middle of the GSEA
window shows the path of the running sum used to find the Enrichment Score (ES) for a gene set. The ES is
determined by the highest (or in this case the lowest) point along this walk. The genes that appear at or before this
point are the ones contributing to the score and are referred to as the “Leading Edge.” These may be important
genes. The data set used shows the life cycle of Plasmodium falciparum (Bozdech et al., 2003). The genes in the
dataset were ranked according to their correlation to the search profile shown in the top right-hand corner, and
Gene Ontology was used to create gene sets (window top middle). When browsing the GSEA result window, the
corresponding GO terms will be selected in the GO tree and the gene profiles will be selected in the Gene Graph
window (window bottom left).
Both JavaScript and Jython scripts are available in J-Express as of version 2.7. Jython
scripting enables full support for Python, but also enables use of Java objects directly
in the scripts. Some example scripts come with the installation, and more scripts are
available at the J-Express forum: http://www.molmine.com/forum. A script is executed
on a particular data set by selecting the data set in the project tree and pressing the
Execute button in the script window. Both script types can be started from the Data Set menu,
and Jython scripts can also be added to the process list in SpotPix Suite and started from
a button with a “play” icon on the J-Express tools panel.
GUIDELINES FOR UNDERSTANDING RESULTS
The Basic Protocols of this unit describe how J-Express can be used to filter and normalize
the results from a set of microarray scans to obtain a gene-expression matrix, and how
the different analysis methods in J-Express can be used to explore a gene-expression
matrix. J-Express facilitates the interpretation of the results by allowing the user to
visually explore the results within J-Express and to export textual representations of the
results that can then be imported into external programs. In the protocols above, the
different methods have been illustrated by using an artificial data set. It has been shown
that different dissimilarity measures can give quite different viewpoints on the data. It
is important to choose a measure that is appropriate for a particular analysis, and to
view filtering and normalization methods in conjunction with the choice of dissimilarity
measure. The authors have also tried to illustrate the difference between some of the
most popular clustering and projection methods. It is important that one have at least a
basic understanding of the methods before drawing conclusions regarding the results.
In general, microarray experiments can provide an overview of the phenomena under
study and form the basis for hypotheses that can be tested, potentially, using other
types of (often low-throughput) technology. In order to maximize the benefits from the
experiments, a set of powerful analysis methods should be applied and their results
compared and assessed. The J-Express package provides some of the most useful and
popular analysis methods and allows for comparison between the results.
COMMENTARY
Background Information
Central concepts
An important concept in J-Express is that
of a data set. This is the central object that
the user provides as input for analyses. It may
also be queried and stored. The relationships
between different data sets are automatically
recorded and maintained as part of the project-management system. The system keeps track
of the data sets loaded into the system and of
the sets later generated by the user through
operations on the data and through analyses
(Fig. 7.3.12).
A data set can be one of two types. The most
important is the gene-expression data matrix.
This can be input to a selection of clustering and visualization methods (see Basic
Protocol 2). The other type is spot-intensity
data. This can be input to a filtering and normalization procedure giving, as a result, a
gene-expression data matrix (see Basic Protocol 1).
Another important concept is that of
metadata. For each data set stored in the
project-management system, J-Express generates metadata that document what steps the
user has taken in order to produce the data set.
These data can, for example, include information regarding from which file(s) the data were
loaded, filtering and normalization procedures
followed, and clustering and selection operations performed. The principle is that given the
metadata, the user should be able to repeat the
steps needed to reproduce the result.
Figure 7.3.12 Data flow. Data are loaded from a data medium (typically a hard disk) through
a loader/saver module and maintained within the J-Express system as a data set. The project-management system holds the different data sets loaded, as well as derived data sets produced
by the user through analysis and processing (e.g., normalization/filtering) steps. The system also
stores information on relationships between data sets.
The gene-expression data matrix and object
sets
A gene-expression data matrix is a rectangular matrix containing one row per gene and
one column per array. Entry (i, j) contains a
number quantifying the expression value of
gene i in array j. If the data matrix has been
obtained through J-Express’ own normalization procedure applied to a set of two-channel
microarrays (see Basic Protocol 1), the value
is the log (base 2) ratio of channel 1 divided by
channel 2 intensity values for spot i on array
j. If applied to a set of one-channel microarrays, the value typically is a normalized form
of the log-intensity of the spot (or spots) corresponding to one gene. The analysis routines
in J-Express treat the data as numerical values,
and their semantics (or scales) are not explicitly used in any of the analyses. For this reason,
the program can also be used to analyze types
of data other than gene-expression data.
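The layout described above can be illustrated with a minimal sketch. It assumes two hypothetical intensity tables (one per channel, genes as rows and arrays as columns) and omits the filtering and normalization steps of Basic Protocol 1; the function name and the numbers are invented for illustration and are not part of J-Express.

```python
import math

def log_ratio_matrix(channel1, channel2):
    """Return a genes-by-arrays matrix of log2(channel 1 / channel 2)."""
    matrix = []
    for row1, row2 in zip(channel1, channel2):
        matrix.append([math.log2(a / b) for a, b in zip(row1, row2)])
    return matrix

# Hypothetical raw intensities: 3 genes (rows) x 2 arrays (columns).
ch1 = [[800.0, 400.0], [100.0, 100.0], [50.0, 200.0]]
ch2 = [[200.0, 400.0], [100.0, 400.0], [100.0, 100.0]]

# expression[i][j] is the value for gene i on array j.
expression = log_ratio_matrix(ch1, ch2)
```

With these numbers, gene 0 on array 0 has a fourfold excess in channel 1 and therefore a value of 2, while equal intensities give 0.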
In addition to the numerical values, the
gene-expression data matrix can also contain
textual information about each row (gene) and
each column (array)—collectively referred to
as objects. Each object normally has an identifier, and, optionally, a set of information fields
in the form of character strings. For genes, the
identifier could be a GenBank identifier and
the information fields could, for example, contain characterization of the gene’s function or
its chromosomal location. The identifiers are
also the primary keys used in the J-Express
annotation manager (see Basic Protocol 2).
Associated with a gene-expression data matrix, one can also have a number of object sets,
each containing a subset of the genes or the
columns. These can be used to specify a set
of genes (or columns) sharing annotation information or grouped by the user, for example, on the basis of clustering analysis results
(see Basic Protocol 2). The gene sets can be
used to color graphical entities (e.g., expression profiles drawn as line graphs or dots in a
projection visualization) representing the objects in visual displays. For example, the user
can specify that all genes whose annotation
matches “heat shock” be colored red while all
genes belonging to a certain cluster be colored
blue.
Supervised and unsupervised analysis
Unsupervised analysis of gene expression
data has the goal of identifying groups of genes
(or arrays) that are similar to each other, effectively reducing the dimensionality of the data
set. For example, a possible goal might be to
obtain groups of genes that show similarity in
their expression values over all or over a subset of the arrays. It can then be hypothesized
that such gene sets are biologically related,
and, depending on availability of data, this can
be automatically analyzed. Also, hypotheses
about a gene’s function can be based on functional properties of other genes found in the
same cluster.
In the case of supervised analysis, a set
of objects (either genes or columns, e.g., expression profiles from different patients) is
given labels. When the samples are divided
into groups, a primary goal is to identify genes
and (predefined) gene sets that show differential expression between the sample groups. For
this purpose, methods such as SAM (Tusher
et al., 2001), rank products, and gene set enrichment analysis (Subramanian et al., 2005)
are applied. The key in all of these methods is
that they report for each gene (or gene set) a
statistic reflecting differential expression, and,
together with this, a p value reflecting how
likely this or a more extreme value of the reported statistic (e.g., a t score) is to be found,
assuming that the gene is expressed at the same
level in both groups. Methods for taking into
account multiple testing (the analysis is done
not for one gene but typically for thousands
of genes) include Bonferroni, False Discovery
Rate (FDR), and Q-values (Cui and Churchill,
2003). In J-Express p values are reported together with FDR values or Q-values.
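To make the multiple-testing adjustment concrete, the following sketch applies a Bonferroni correction and a Benjamini-Hochberg step-up adjustment to a short list of p values. The text does not state which FDR procedure J-Express uses, so the choice of Benjamini-Hochberg is an assumption, and the function names and input values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
p_bonferroni = bonferroni(p)
p_fdr = benjamini_hochberg(p)
```

The Bonferroni values are never smaller than the Benjamini-Hochberg values, which reflects the more conservative nature of the former.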
Another goal in supervised analysis is to
develop a classifier that is able to predict the
labels of as yet unlabeled examples. For example, one may wish to develop a method
to predict functional properties of genes (e.g.,
Brown et al., 2000) or cancer subtype of a
patient (e.g., Golub et al., 1999). Techniques
applied here include support vector machines,
K-nearest-neighbor classifiers, and artificial
neural networks. For a fuller discussion of supervised versus unsupervised analysis, see, for
instance, Brazma and Vilo (2000).
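As an illustration of the classification task, a K-nearest-neighbor classifier can be sketched in a few lines. This is a generic textbook version using squared Euclidean distance and majority voting, not the J-Express implementation; the training profiles and the labels "A" and "B" are hypothetical.

```python
from collections import Counter

def knn_predict(train_profiles, train_labels, query, k=3):
    """Predict a label by majority vote among the k closest training profiles."""
    # Squared Euclidean distance preserves the ordering of true distances.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(profile, query)), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled expression profiles (e.g., one per patient sample).
train = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]]
labels = ["A", "A", "A", "B", "B"]

prediction_low = knn_predict(train, labels, [0.1, 0.1])
prediction_high = knn_predict(train, labels, [3.0, 3.0])
```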
Expression-profile dissimilarity measures
An expression profile describes the
(relative) expression levels of a gene across a
set of arrays (i.e., a row in the gene-expression
matrix) or the expression levels of a set of
genes in one array (i.e., a column in the matrix). In cluster analysis (see Basic Protocol 2),
one seeks to find sets of objects (genes or arrays) with similar expression profiles, and for
this one needs to quantify to what degree two
expression profiles are similar (or dissimilar).
Clustering is more easily explained by using
dissimilarity (or distance) measures, and this
terminology will be used in this unit.
Figure 7.3.13 Illustration of distance measures for pairs of points in a two-dimensional space.
(A) Euclidean distance; (B) Manhattan (city block) distance.
One can measure expression dissimilarities
in a number of different ways. A very simple measure is Euclidean distance, which is
simply the length of the straight line connecting the two points in multidimensional space
(where each element in the expression profile
gives the coordinate along one of the axes).
Another simple measure is often referred to as
city block or Manhattan distance. This simply
sums the absolute differences in expression values,
with the sum taken over all
the dimensions. Other measures quantify the
similarity in expression-profile shape (e.g., if
the genes go up and down in a coordinated
fashion across the arrays), and are based on
measures of correlation. Figure 7.3.13 illustrates two representative distance concepts in
two dimensions.
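The three kinds of measure mentioned above can be written out as a short sketch. The correlation-based dissimilarity is taken here as one minus the Pearson correlation coefficient, which is one common convention and an assumption rather than a statement of how J-Express defines it; the profiles are hypothetical.

```python
import math

def euclidean(x, y):
    """Length of the straight line between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences over all dimensions (city block)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_dissimilarity(x, y):
    """1 - Pearson correlation; undefined for a profile with zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

# Two hypothetical profiles with the same shape but different amplitude.
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]

d_euclidean = euclidean(g1, g2)
d_manhattan = manhattan(g1, g2)
d_shape = pearson_dissimilarity(g1, g2)
```

The two profiles are far apart by the Euclidean and Manhattan measures, yet their correlation-based dissimilarity is essentially zero because they rise in a coordinated fashion.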
In J-Express the user can, for each clustering method (see Basic Protocol 2), decide
which dissimilarity measure should be used. It
is a good idea for the user to explore the alternative measures separately in the expression-profile similarity search engine to become
familiar with the properties of each of the
measures.
Critical Parameters
Experimental design: Intra- and interarray
normalization
For normalization of data from two-channel
platforms using common reference design, it
is assumed that the experiment is designed so
that the reference sample (shared between the
arrays) is hybridized to channel 2 on each array, and the normalization is then carried out
for each array by normalizing channel 1 with
respect to channel 2 (see Fig. 7.3.14). If the
reference is not hybridized to channel 2 on
all arrays, the user can swap the data columns
to move the reference channel to the second
position for each array.
J-Express is designed to allow handling of
one-channel data. In this case, only channel
1 is used and the arrays are normalized using
quantile normalization (Bolstad et al., 2003).
Quantile normalization finds the average distribution over all of the arrays in the experiment and then transforms each array to get
the average distribution. Another option is to
use one array as the reference and to normalize the other
arrays with respect to it.
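The idea behind quantile normalization can be sketched as follows: each array (column) is sorted, the k-th smallest values are averaged across arrays to give the average distribution, and each value is then replaced by the average for its rank. This minimal version assumes complete data and breaks ties arbitrarily; it is an illustration of the principle, not the J-Express code.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes-by-arrays matrix (list of rows)."""
    n_genes = len(matrix)
    n_arrays = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_arrays)]
    sorted_cols = [sorted(col) for col in columns]
    # Average distribution: mean of the k-th smallest value across arrays.
    mean_quantiles = [
        sum(col[k] for col in sorted_cols) / n_arrays for k in range(n_genes)
    ]
    result = [[0.0] * n_arrays for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Indices of the genes ordered by their value on array j.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = mean_quantiles[rank]
    return result

# Hypothetical one-channel data: 3 genes (rows) x 2 arrays (columns).
data = [[5.0, 4.0], [2.0, 1.0], [3.0, 6.0]]
normalized = quantile_normalize(data)
```

After normalization every array contains the same set of values (1.5, 3.5, and 5.5 in this example), and only the assignment of values to genes differs between arrays.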
Selecting a clustering method
J-Express and other gene-expression analysis systems provide a choice of different clustering methods (Basic Protocol 2). It is difficult to provide any definite advice on which
method should be used in any one concrete
situation. The history of clustering theory in
general and of clustering of gene-expression
data shows that there is no one method that
outperforms all others on all data sets (Jain
and Dubes, 1988). Different investigators find
different methods and output representations
more useful and intuitive.
There are, however, some points that one
should keep in mind when considering alternative methods. For example, when using a hierarchical clustering method as presented here,
it is assumed that it is possible to find a binary
(bifurcating) tree that fits the structure of the
data well. This may not always be the case.
For example, it may be that there are more
complex similarity relationships between different clusters than what is naturally described
by such a tree.
Other methods also have their shortcomings. For example, in K-means clustering, the
user needs to select the number of clusters
beforehand, and the method does not give
any information about the relationships between the identified clusters. Also, using a
self-organizing map (SOM), the choice of underlying topology affects the result. For example, one may choose a two-dimensional grid
(as above) or a three- or four-dimensional one.
The different choices may produce quite different results.
Figure 7.3.14 Different experimental designs using a two-channel system. In a two-channel system, one typically uses either a common control hybridized to each array (in either one of the two
channels), or one performs competitive hybridizations between all (or a subset of) the pairs of
samples under analysis. Presently, J-Express supports the first experimental design (left). Note
that, on the left, all samples are hybridized together with a common control (referred to as A in the
example), while, if one uses the all-pairs approach, every possible pair of samples is hybridized
together.
All in all, it is probably a good policy to try
out more than one method using alternative
parameter values in order to get the most out
of a concrete data set. J-Express permits the
user to do this, and in the future the program
will be extended, with an even wider selection
of clustering methods complementing the currently included methods (see Suggestions for
Further Analysis).
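The dependence of K-means on the number of clusters chosen beforehand can be explored with a small sketch. The version below is a plain textbook K-means; as a simplifying assumption it takes the first k profiles as initial centroids (real implementations typically offer several initialization methods), and the profiles are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # simplistic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centroids

def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        sum((a - c) ** 2 for a, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Hypothetical profiles forming two well-separated groups.
profiles = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]

labels1, centroids1 = kmeans(profiles, 1)
labels2, centroids2 = kmeans(profiles, 2)
wss1 = within_cluster_ss(profiles, labels1, centroids1)
wss2 = within_cluster_ss(profiles, labels2, centroids2)
```

Comparing the within-cluster sum of squares for several values of k is one simple way to judge how many clusters a data set supports. The result also depends on the initial centroids, which is a further reason to try more than one run.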
Incorporation of information on gene
function
J-Express has been extended with multiple modules to allow incorporation of functional data. In particular, the user can utilize
the KEGG database (Kanehisa et al., 2002),
finding the pathways that significantly overlap
with user-defined gene sets (the user is warned
that KEGG has license requirements for some
user groups). J-Express can also import Gene
Ontology (GO) files and offers the user functions to identify significant overlaps between
gene groups and GO terms. While these functions are primarily used to help interpretation
of expression data analyses, the gene set enrichment analysis allows the user to include
biological information to guide the analysis
itself (see Basic Protocol 2).
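One common way to judge whether the overlap between a user-defined gene set and a pathway or GO term is significant is a one-sided hypergeometric test. The text does not say which test J-Express applies, so the sketch below is an assumption about the general approach; the function name and the counts are hypothetical. It requires Python 3.8 or later for math.comb.

```python
from math import comb

def overlap_p_value(n_total, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_total:    number of genes on the array
    n_pathway:  number of those genes annotated to the pathway or term
    n_selected: size of the user-defined gene set
    n_overlap:  number of selected genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_total, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_total - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 10 genes in total, 3 in the pathway, 3 selected,
# and all 3 selected genes fall in the pathway.
p_full_overlap = overlap_p_value(10, 3, 3, 3)
p_any_overlap = overlap_p_value(10, 3, 3, 0)
```

Because many pathways or terms are usually tested at once, such p values would normally be corrected for multiple testing.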
Suggestions for Further Analysis
The results obtained in an analysis of a data
set using J-Express can be stored, and further
analysis can be performed externally. It may be
desirable, for example, to perform a more in-depth analysis of the genes placed together in
a cluster by J-Express. For example, one may
wish to investigate whether genes with similar
expression profiles share statistically significant patterns in their regulatory regions, giving hints of a common regulatory mechanism
(see, for instance, Brazma et al., 1998) or to
analyze gene expression together with protein
expression or interaction data. The J-Express
tool will be extended, with more functionality
in this direction, in the future. In some cases
one may wish to design new experiments (e.g.,
knockout or RT-PCR experiments).
Adapting and extending the J-Express
system
The plug-in framework. Through a comprehensible plug-in interface, it is possible to
connect any Java class to the J-Express framework. This interface gives the opportunity to
create bridges between J-Express and existing
systems, as well as new ways to manipulate or
analyze the data. In short, the plug-in model
consists of a main plug-in Java class with a few
abstract methods that must be implemented
by the programmer (sub-classed). Some plug-ins, including high-level normalization, filtering, search, and sorting, are already available
with full source code, and can be downloaded
from the same Web pages as J-Express. Simple examples, together with an Application
Program Interface (API) and model description, are installed together with the main program package. Below, we briefly describe two
of the plug-ins available from the J-Express
Web pages (in the latest versions these
are also integrated into the main J-Express
framework).
Figure 7.3.15 The result of applying filters to the (original) synthetic data set: (A) requiring
at least 5 values with absolute value above 2; (B) a lower limit on standard deviation only.
Search tools. The search plug-in allows the
user to use regular expressions to search the
information fields in a gene-expression matrix. For example, the user can search for all
genes whose annotation matches “enzyme or
kinase,” or for all genes whose upstream sequences
(if included in the gene-expression
matrix) match the pattern [AT]AAAT exactly.
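The kind of search described above can be sketched with a standard regular-expression library. The gene records, field names, and sequences below are hypothetical, and "enzyme or kinase" is interpreted here as the alternation enzyme|kinase, which is an assumption about the intended pattern syntax.

```python
import re

# Hypothetical information fields for three genes.
genes = {
    "G1": {"annotation": "protein kinase C", "upstream": "GGTAAATCC"},
    "G2": {"annotation": "heat shock protein", "upstream": "CCAAAATGG"},
    "G3": {"annotation": "metabolic enzyme", "upstream": "GGCAAATCC"},
}

def search(records, field, pattern):
    """Return sorted identifiers whose field contains a match for pattern."""
    regex = re.compile(pattern)
    return sorted(gid for gid, info in records.items() if regex.search(info[field]))

annotation_hits = search(genes, "annotation", r"enzyme|kinase")
upstream_hits = search(genes, "upstream", r"[AT]AAAT")
```

Here G3 is not reported by the second search because its AAAT is preceded by a C rather than by an A or a T.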
High-level filtering and normalization. It is
sometimes appropriate to apply separate filtering and normalization routines to the gene-expression matrices. For example, one may
choose to remove the genes that show little variation in expression measurements. In
J-Express, this can be done using the available filtering plug-in, for example, to remove
the genes whose standard deviation is below
some threshold value (for an example, see
Fig. 7.3.15). Furthermore, one may want to
focus on the shape of the expression profiles
and not so much on the amplitude of the change
or on the offset of all values. In such cases,
one can use mean normalization or mean-and-variance normalization (see Fig. 7.3.16 for an
illustration). Both normalization procedures
operate on the expression profile of each gene
separately. While the first subtracts the mean
from each profile (so that each
profile gets a mean of zero), the second also
divides the resulting numbers by the standard deviation
of the profile (so that the expression profile
mean becomes zero and its variance becomes
one). The second is well suited if one seeks
to find genes behaving in a correlated manner
(e.g., increasing and decreasing in expression
level in a coordinated fashion), and allows one
to use simple (e.g., Euclidean) dissimilarity
measures also for this kind of analysis.
Figure 7.3.16 J-Express allows the user to normalize the expression profiles of genes (rows in
the gene-expression matrix). The example shows the results of normalizing the synthetic data set
by (A) mean normalization and (B) mean and variance normalization.
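The filtering and row-wise normalization steps can be sketched as follows. Dividing the mean-centered profile by its standard deviation is the step that yields unit variance. The sketch uses the population standard deviation (division by n), which is an assumption since the convention used by J-Express is not stated; the function names, threshold, and data are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation (divides by n)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def filter_by_sd(matrix, threshold):
    """Keep only the rows (genes) whose standard deviation reaches the threshold."""
    return [row for row in matrix if std_dev(row) >= threshold]

def mean_normalize(row):
    """Subtract the profile mean, giving a profile with mean zero."""
    m = mean(row)
    return [v - m for v in row]

def mean_variance_normalize(row):
    """Center the profile, then scale it to unit variance."""
    sd = std_dev(row)  # must be nonzero, so filter flat profiles first
    return [v / sd for v in mean_normalize(row)]

# Hypothetical matrix: two genes with the same shape, one nearly flat gene.
matrix = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.0, 5.1]]

variable_genes = filter_by_sd(matrix, 0.5)
centered = [mean_normalize(row) for row in variable_genes]
scaled = [mean_variance_normalize(row) for row in variable_genes]
```

After mean-and-variance normalization the first two genes have practically identical profiles, so their Euclidean distance is close to zero even though their amplitudes differ tenfold.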
Scripting language. J-Express has a separate module supporting scripting in Jython (a
Java implementation of Python). This allows
users to describe their standard analysis operations as a program and also to add, for example, simple data transformation and analysis
functions to J-Express.
Future plans for J-Express
The J-Express system provides a powerful and integrated platform for the analysis of microarray gene-expression data. It is
platform-independent in that it requires only
the availability of a Java virtual machine
on the system. The system includes a range
of analysis tools and, importantly, a project-management system supporting the organization and documentation of an analysis project.
J-Express is under development and extension,
and future versions will include new functionality as well as improved visualization and
management capabilities.
J-Express was one of the first tools to include functionality for importing and exporting MAGE-ML files (Spellman et al., 2002).
However, the functionality can be extended
to take advantage of, for example, the description of the experimental design in a
MAGE-ML file to automatically suggest or
execute analysis pipelines taking this information into account. We would also like to
develop functionality that allows the user to
consult other data sets when analyzing his/her
own data and to perform meta-analysis.
The scripting functionality of J-Express allows flexible addition of analysis modules,
and future work will include developing and
making available a larger set of scripts that
J-Express users can utilize and adapt to their
own needs.
Additional sources of information
To help users get started with J-Express, there are tutorials available at the
http://www.molmine.com Web site. In addition, the J-Express analysis guide MAGMA
shows the user step by step how to do different
types of analysis. MAGMA is available from
http://www.microarray.no/magma.
Literature Cited
Beißbarth, T., Fellenberg, K., Brors, B., Arribas-Prat, R., Boer, J.M., Hauser, N.C., Scheideler,
M., Hoheisel, J.D., Schütz, G., Poustka, A., and
Vingron, M. 2001. Processing and quality control of DNA array hybridization data. Bioinformatics 16:1014-1022.
Bø, T.H., Dysvik, B., and Jonassen, I. 2004. LSimpute: Accurate estimation of missing values in
microarray data with least squares methods.
Nucleic Acids Res. 32:e34.
Bolstad, B.M., Irizarry, R.A., Astrand, M., and
Speed, T.P. 2003. A comparison of normalization methods for high density oligonucleotide
array data based on variance and bias. Bioinformatics 19:185-193.
Bozdech, Z., Llinás, M., Pulliam, B.L., Wong, E.D.,
Zhu, J., and DeRisi, J.L. 2003. The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol. 1:001-016.
Brazma, A. and Vilo, J. 2000. Gene expression data
analysis. FEBS Lett. 480:17-24.
Brazma, A., Jonassen, I., Vilo, J., and Ukkonen,
E. 1998. Predicting gene regulatory elements in
silico on a genomic scale. Genome Res. 8:1202-1215.
Breitling, R., Armengaud, P., Amtmann, A., and
Herzyk, P. 2004. Rank products: A simple, yet
powerful, new method to detect differentially
regulated genes in replicated microarray experiments. FEBS Lett. 573:83-92.
Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini,
N., Sugnet, C.W., Furey, T.S., Ares, M. Jr., and
Haussler, D. 2000. Knowledge-based analysis
of microarray gene expression data by using
support vector machines. Proc. Natl. Acad. Sci.
U.S.A. 97:262-267.
Cleveland, W.S. 1979. Robust locally weighted regression and smoothing scatterplots. J. Am. Stat.
Assoc. 74:829-836.
Cui, X. and Churchill, G.A. 2003. Statistical tests
for differential expression in cDNA microarray
experiments. Genome Biol. 4:210.
Dysvik, B. and Jonassen, I. 2001. J-Express:
Exploring gene expression data using Java.
Bioinformatics 17:369-370.
Eisen, M.B., Spellman, P.T., Brown, P.O., and
Botstein, D. 1998. Cluster analysis and display
of genome-wide expression patterns. Proc. Natl.
Acad. Sci. U.S.A. 95:14863-14868.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard,
C., Gaasenbeek, M., Mesirov, J.P., Coller, H.,
Loh, M.L., Downing, J.R., Caligiuri, M.A.,
Bloomfield, C.D., and Lander, E.S. 1999.
Molecular classification of cancer: Class discovery and class prediction by gene expression
monitoring. Science 286:531-537.
Holter, N.S., Mitra, M., Maritan, A., Cieplak, M.,
Banavar, J.R., and Fedoroff, N.V. 2000. Fundamental patterns underlying gene expression
profiles: Simplicity from complexity. Proc. Natl.
Acad. Sci. U.S.A. 97:8409-8414.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope,
L.M., Hobbs, B., and Speed, T.P. 2003. Summaries of Affymetrix GeneChip probe level data.
Nucleic Acids Res. 31:e15.
Jain, A.K. and Dubes, R.C. 1988. Algorithms
for Clustering Data. Prentice Hall, Englewood
Cliffs, New Jersey.
Jolliffe, I.T. 1986. Principal Component Analysis.
Springer-Verlag, New York.
Kanehisa, M., Goto, S., Kawashima, S., and
Nakaya, A. 2002. The KEGG databases at
GenomeNet. Nucleic Acids Res. 30:42-46.
Kohonen, T. 1997. Self-Organizing Maps. Springer-Verlag, New York.
Peña, J.M., Lozano, J.A., and Larrañaga, P. 1999.
An empirical comparison of four initialization
methods for the k-means algorithm. Pattern
Recogn. Lett. 20:1027-1040.
Quackenbush, J. 2001. Computational analysis of
microarray data. Nat. Rev. Genet. 2:418-427.
Raychaudhuri, S., Stuart, J.M., and Altman, R.B.
2000. Principal components analysis to summarize microarray experiments: Application to
sporulation time series. Pacific Symposium on
Biocomputing, 455-466. Stanford Medical Informatics, Stanford University, Calif.
Spellman, P.T., Miller, M., Stewart, J., Troup,
C., Sarkans, U., Chervitz, S., Bernhart, D.,
Sherlock, G., Ball, C., Lepage, M., Swiatek,
M., Marks, W.L., Goncalves, J., Markel, S.,
Iordan, D., Shojatalab, M., Pizarro, A., White,
J., Hubley, R., Deutsch, E., Senger, M., Aronow,
B.J., Robinson, A., Bassett, D., Stoeckert, C.J.
Jr., and Brazma, A. 2002. Design and implementation of microarray gene expression markup
language (MAGE-ML). Genome Biol. 3:RESEARCH0046.
Subramanian, A., Tamayo, P., Mootha, V.K.,
Mukherjee, S., Ebert, B.L., Gillette, M.A.,
Paulovich, A., Pomeroy, S.L., Golub, T.R.,
Lander, E.S., and Mesirov, J.P. 2005. Gene
set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A.
102:15545-15550.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q.,
Kitareewan, S., Dmitrovsky, E., Lander, E.S.,
and Golub, T.R. 1999. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc.
Natl. Acad. Sci. U.S.A. 96:2907-2912.
Törönen, P., Kolehmainen, M., Wong, G., and
Castrén, E. 1999. Analysis of gene expression
data using self-organizing maps. FEBS Lett.
451:142-146.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown,
P., Hastie, T., Tibshirani, R., Botstein, D., and
Altman, R.B. 2001. Missing value estimation
methods for DNA microarrays. Bioinformatics
17:520-525.
Tusher, V.G., Tibshirani, R., and Chu, G. 2001. Significance analysis of microarrays applied to the
ionizing radiation response. Proc. Natl. Acad.
Sci. U.S.A. 98:5116-5121.
Workman, C., Jensen, L.J., Jarmer, H., Berka,
R., Gautier, L., Nielsen, H.B., Saxild, H.-H.,
Nielsen, C., Brunak, S., and Knudsen, S. 2002. A
new non-linear normalization method for reducing variability in DNA microarray experiments.
Genome Biol. 3:research0048.
DRAGON and DRAGON View: Information
Annotation and Visualization Tools for
Large-Scale Expression Data
UNIT 7.4
The Database Referencing of Array Genes ONline (DRAGON) database system consists
of information derived from publicly available databases including UniGene
(http://www.ncbi.nlm.nih.gov/UniGene/), SWISS-PROT (http://www.expasy.ch/sprot/),
Pfam (http://www.sanger.ac.uk/Software/Pfam/; UNIT 2.5), and the Kyoto Encyclopedia of
Genes and Genomes (KEGG, http://www.genome.ad.jp/kegg/). The DRAGON Annotate
tool makes use of relational databasing technology in order to allow users to rapidly join
their input gene list and expression data with a wide range of information that has been
gathered from the abovementioned multiple public-domain databases (Bouton and
Pevsner, 2001), rapidly supplying information pertaining to a range of biological
characteristics of all the genes in any large-scale gene expression data set. The subsequent inclusion of this information during data analysis and visualization allows for
deeper insight into gene expression patterns. The Annotate tool makes it easy for any
user with access to the Internet and an Internet browser to annotate large gene lists with
information from these multiple public databases simultaneously. Subsequent to annotation with the Annotate tool, the DRAGON View visualization tools allow users to
analyze their expression data in relation to biological characteristics (Bouton and
Pevsner, 2002).
The set of DRAGON View tools provides methods for the analysis and visualization of
expression patterns in relation to annotated information. Instead of incorporating the
standard set of clustering and graphing tools available in many large-scale expression
data analysis software packages, DRAGON View has been specifically designed to allow
for the analysis of expression data in relation to the biological characteristics of gene
sets.
PREPARING DATA FOR USE WITH THE DRAGON DATABASE AND
ANALYZING DATA WITH DRAGON VIEW
BASIC
PROTOCOL
This protocol describes how to prepare a tab-delimited text file for use in the DRAGON
database, how to understand the resulting data set, and then how to use the DRAGON
View visualization tools in order to analyze the data set in relation to the annotated
information gained from DRAGON. To demonstrate this process, the freely available
data set associated with the Iyer et al. (1999) study examining the response of human
fibroblasts to serum starvation and exposure is used. This is a good example data set
because it is freely available, concerns the expression of human genes across a time
course, has been well documented, and is sufficiently large to yield some interesting
results.
For all stages of this demonstration more information can be found on the Learn page of
the DRAGON Web site (http://pevsnerlab.kennedykrieger.org/learn.htm).
Necessary Resources
Hardware
Windows, Linux, Unix, or Macintosh computer with Internet connection
(preferably broadband connection, e.g., T1, T3, cable, or DSL service)
Contributed by Christopher M.L.S. Bouton and Jonathan Pevsner
Current Protocols in Bioinformatics (2003) 7.4.1-7.4.22
Copyright © 2003 by John Wiley & Sons, Inc.
7.4.1
Supplement 2
Software
Internet browser: e.g., MS Internet Explorer 5 (or higher) or Netscape 6 (or higher)
on Windows or Macintosh systems; Opera, Netscape 6 (or higher), or Mozilla
on Linux-based systems. Internet Explorer 5 or higher and Netscape 6 or higher
are preferred, because Netscape 4.x is not capable of supporting all of the
functionality provided in the DRAGON Paths tool.
Also required:
Spreadsheet program: e.g., MS Excel on Windows or Macintosh systems or Sun
Microsystems Star Office suite on Linux systems.
Text editor: e.g., TextPad (http://www.textpad.com/) or Notepad on Windows
systems; XEmacs (http://www.xemacs.org) on Linux systems.
Finally, for advanced users who may want to have more flexibility in the
manipulation of their text files, the Perl programming language is powerful and
easy to use and allows the user to perform automated text-formatting,
file-creation, and file-alteration functions that are useful when analyzing large
data sets. Activestate (http://www.activestate.com) has developed a version of
Perl available for Windows computers (http://www.activestate.com/Products/
ActivePerl/). Otherwise, the http://www.perl.com Web site provides downloads of
Perl for Linux, Unix, and Macintosh computers.
Files
The Iyer et al. (1999) example data files were obtained from the Stanford
Microarray data Web site (http://genome-www.stanford.edu/serum/data.html).
The two files used for demonstration purposes in this unit may be downloaded
respectively at the following URLs:
http://genome-www.stanford.edu/serum/fig2data.txt
http://genome-www.stanford.edu/serum/data/fig2clusterdata.txt
Both files are also available at the Current Protocols Web site:
http://www3.interscience.wiley.com/c_p/cpbi_sampledatafiles.htm
Optional: The DRAGON database is generated through the automated parsing of
flat files provided by publicly available databases (see the DRAGON Web site
for a list of the database flat files used by DRAGON). The information in these
files is then loaded into a back-end MySQL (http://www.mysql.com; UNIT 9.2)
relational database for use by DRAGON (Fig. 7.4.1). Although it may be easy
and more intuitive for most users to access the information in these files via the
DRAGON Web site, some readers may want to use this information in their
own relational databases. For these purposes, all of the tables used in the
DRAGON database are provided for download on the DRAGON Web site
(http://pevsnerlab.kennedykrieger.org/download.htm) or can be ordered on CD
if desired (http://pevsnerlab.kennedykrieger.org/order.htm).
1. Acquire and prepare data in a master matrix file (see Background Information).
Preparing data for analysis is a critical first step in large-scale gene expression data
experiments. This preparation requires putting the data into a format that allows for the
comparison of gene expression values across all of the conditions in a given experiment.
Many expression analysis software packages attempt to make this job easier through
automated data-importing functions that keep track of the association of expression values
with gene identifiers. However, most researchers will still want to generate what can be
called a “master matrix text flat file” or “master matrix file” (see Background Information)
at some point during their analysis process.
The two example data files downloaded from the Stanford Microarray Web site (associated
with the study of Iyer et al., 1999) are both examples of master matrix files. To begin using
these files for the DRAGON demonstration, information from the two files must be merged.
The expression data in the fig2clusterdata.txt file is the correct type of ratio data
for demonstration purposes; however, only the fig2data.txt file has the GenBank
7.4.2
Supplement 2
Current Protocols in Bioinformatics
Figure 7.4.1 The DRAGON home page provides links to all available tools and data sources contained in DRAGON
and DRAGON View. The page also contains links to all of the public data files that are used by DRAGON to generate
its database.
accession numbers available for each sequence on the array. Due to the presence of a
common unique gene id field in both files, merger of information between the two files
is simple. Both files are opened in MS Excel. All columns in both files are sorted in
ascending order by the unique gene id (UNIQID) column. Because the same set of genes
is represented in each file, the sorted information between the two files matches perfectly.
If the sets of genes provided in the user’s files do not perfectly match, a merging of data is
still possible, but requires a relational database management system such as MS Access,
Oracle, or MySQL in order to allow for an appropriate SQL “join” of the common data
columns in each file (UNIT 9.2). The GenBank accession number column from the
fig2data.txt file is then pasted into the fig2clusterdata.txt file. The resulting file,
figure2_combined_data.txt, is used for all further steps in the demonstration.
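When the two files do not list the same genes in the same order, the merge can also be scripted rather than done by sorting and pasting. The following is a minimal sketch in Python and is not part of the DRAGON tools: the function name merge_on_key, the GENBANK column header, and the toy rows are assumptions made for illustration, whereas UNIQID is the shared identifier column described above. The sketch behaves like an SQL left join, leaving the accession field empty for genes that are absent from the second file.

```python
import csv
import io


def merge_on_key(cluster_text, data_text, key="UNIQID", extra="GENBANK"):
    """Append column `extra` from data_text to every row of cluster_text.

    Both inputs are tab-delimited text with a header row. Rows are matched
    on the `key` column, so the two inputs need not be sorted identically.
    The header name "GENBANK" is a placeholder, not the real file's header.
    """
    data_rows = csv.DictReader(io.StringIO(data_text), delimiter="\t")
    lookup = {row[key]: row[extra] for row in data_rows}

    reader = csv.DictReader(io.StringIO(cluster_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + [extra],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Genes missing from the second file get an empty accession field.
        row[extra] = lookup.get(row[key], "")
        writer.writerow(row)
    return out.getvalue()


# Toy data (made up): the second file lists the genes in a different order.
cluster = "UNIQID\tNAME\t15MIN\nG1\tgene one\t1.2\nG2\tgene two\t0.8\n"
data = "UNIQID\tGENBANK\nG2\tACC_B\nG1\tACC_A\n"
merged = merge_on_key(cluster, data)
print(merged)
```

Writing the returned text to a file would give a table comparable to the one assembled by hand in MS Excel, provided the real column headers are substituted for the placeholders.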
2. Connect to the DRAGON Web site. The DRAGON and DRAGON View tools can
be accessed on the DRAGON Web site at http://pevsnerlab.kennedykrieger.org/dragon.htm.
Each of the tools available on the DRAGON site is listed on the front
page of the site (Fig. 7.4.1).
All of the tools in DRAGON and DRAGON View are based on Common Gateway Interface
(CGI) scripts written in the Perl programming language. As a result, they can all be used
via an Internet browser. The DRAGON tools include Search, Annotate, and Compare
(Compare is still under construction). The DRAGON View tools include Families, Order,
and Paths. In addition to DRAGON and DRAGON View, DRAGON Map, developed by
George W. Henry, is a set of tools that allow users to inspect the global expression properties
of sequences defined in the UniGene database. Also, a powerful set of normalization tools
for microarray data called SNOMAD (Colantuoni et al., 2002) accompanies the DRAGON
Web site.
Following normalization with methods such as those provided by SNOMAD, the tools that
users use most often on the DRAGON Web site are DRAGON Annotate and, then, the
DRAGON View Families visualization tool. The usage of DRAGON Annotate is demonstrated in the following steps of this protocol; the usage of DRAGON Families is demonstrated in the Support Protocol.
In Figure 7.4.1, Internet Explorer 6 is pointed to
http://pevsnerlab.kennedykrieger.org/dragon.htm, and the home page is displayed.
Annotate the data set
3. Click on the Annotate link on the main page directly under the title, slightly to the
left (Fig. 7.4.1). The Annotate Web page is structured to help the user through the
annotation process by guiding the selection of variables through five sections of the
Web page. The general flow of usage for most of the tools in DRAGON and
DRAGON View is: Upload or Paste data, choose output variables, choose output
format (HTML, text, or E-mail) and submit analysis.
An overview of all five sections is provided at the top of the page in the Introduction section.
Additionally, a synopsis of the Annotate tool is provided in the Blurb section at the top of
the page.
4. At the top of the Annotate page (Fig. 7.4.2, panel A) the first thing to do is to import
the user’s data into the system. This is accomplished by first defining for the system
what sort of method is going to be used (section 1) and then either uploading a master
matrix text file (section 2a) or pasting the contents of an input file into the Web page’s
text box (section 2b). The deciding factor for whether one uploads or pastes data
should be the size of the data set to be annotated. In general, there is a size limit set
by Internet browsers on the amount of text that can be pasted into a Web page text
box. One therefore needs to be cautious when entering data sets of more than a few
hundred rows into the text box on the Annotate Web page, because data sets of
this size or greater may be cropped by the text box without warning.
The exact point at which the text box crops a data set depends on how much data is provided
in each row of the input data set. Because of this cropping problem, it is recommended
that the upload function in section 2a of the page be used to import any data set of more
than a few hundred rows. If the paste function is used, the user should check the very
bottom of the pasted data set after it is in the text box in order to confirm
that the entire data set has been pasted.
In the example for demonstration, because of the size of figure2_combined_data.txt,
the upload option is the method of choice for importing data. In section 1, the
“I am going to upload my data using the file entry field below (Go to 2a).” radio button
option is selected. In section 2a, the Browse button is clicked and the
figure2_combined_data.txt file is selected from the appropriate directory. Supplying an
identifier in the text box in section 2a is optional but aids in identifying resulting files if
they are E-mailed.
5. Once the data has been imported, the user then needs to choose the types of
information with which the data set will be annotated (section 3; see Fig. 7.4.2, panel
B). Currently, information is available from four public database sources (see
Background Information). These are UniGene, SWISS-PROT, Pfam (UNIT 2.5) and the
Kyoto Encyclopedia of Genes and Genomes (KEGG). Different types of information
derived from these databases represent various biological attributes of the gene, its
encoded protein, the protein’s functional domains, the protein’s cellular functions,
and the protein’s participation in cellular pathways (see Background Information).
The user chooses different types of data by simply selecting the check boxes to the
left of each type of information desired.
For the example, in section 3 of the annotation page (Fig. 7.4.2, panel B), the UniGene
Cluster ID, Cytoband, LocusLink, UniGene Name, and SWISS-PROT Keywords options
are checked.
Figure 7.4.2 The DRAGON Annotate page. (A) The user is allowed to input data into a dialog box,
or a tab-delimited text file can be uploaded from a local file. (B) The user selects options, then sends
a request for annotation to the DRAGON database. Results may be returned as an HTML table, as
a tab-delimited text file (suitable for import into a spreadsheet such as Microsoft Excel), or as an
E-mail.
6. Section 4 of the Annotation page (Fig. 7.4.2, panel B) allows the user to define certain
criteria related to the format of the imported data set. The most important criterion
defined here is the column in which the GenBank accession numbers provided in the
input data set are located, because, as mentioned above, GenBank accession numbers
are required for the proper functioning of the Annotate tool. For example, if the
GenBank accession numbers are located in the second column of the data set, then a
2 would be entered into the “Column number containing GenBank numbers (with
the farthest left column being 1):” text field. In addition, the user can specify what
sort of delimiter is being used in the input file (default is tab). If something other than
a tab, such as a comma, were being used, then a “,” (without the quotes) would be
entered in the “Text to use as field delimiter (assumed to be a tab “\t” if left blank):”
text area; if a pipe character were being used then a “|” (without quotes) would be
entered. Finally, the newline character being used can be specified (the default is
recognized as either a Windows or Linux/Unix newline character).
An important point about the choices made in section 3 of the Annotation page is that the
different data sources used to annotate the input data set can be more or less limiting. For
example, in any given data set there are only a small number of genes that will be annotated
with information from the KEGG database. In contrast, a much larger set of genes will
be annotated with information from the Pfam database. Because of this, it is often best to
perform multiple annotations of a given data set, each time with a single type of information.
For example, with the fibroblast data set in the example from Iyer et al. (1999), three
successive annotations are performed. First, the data set is annotated with all of the
UniGene database information. A second copy of the data set is annotated with SWISS-PROT
keywords, and a third copy of the data set is annotated with Pfam family names and
accession numbers. Performing successive annotations like this, instead of one query
including all three data types, prevents the loss of one type of information (e.g., Pfam
numbers) because another type of information (e.g., SWISS-PROT keywords) is not
available for a given gene.
Since the GenBank accession numbers are contained in column 3 of the
figure2_combined_data.txt file, a 3 is placed in the “Column number containing GenBank
numbers (with the farthest left column being 1):” text area in section 4.
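The column-number and delimiter settings in section 4 can be pictured with a short Python sketch. It only illustrates 1-based column numbering and delimiter handling and is not DRAGON's own Perl code; the function name accession_column and the sample rows are hypothetical.

```python
def accession_column(text, column_number, delimiter="\t"):
    """Return the values found in the 1-based column `column_number`.

    `delimiter` defaults to a tab, mirroring the default described for
    section 4 of the Annotate page.
    """
    values = []
    # splitlines() accepts both Windows (\r\n) and Unix (\n) line endings.
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        # The form counts the leftmost column as 1; Python lists start at 0.
        values.append(fields[column_number - 1])
    return values


# Toy rows (made up) with the accession in the third column.
sample = "G1\tgene one\tACC_A\r\nG2\tgene two\tACC_B\n"
print(accession_column(sample, 3))                       # ['ACC_A', 'ACC_B']
print(accession_column("G1|ACC_A\n", 2, delimiter="|"))  # ['ACC_A']
```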
7. Section 5 of the Annotation page (Fig. 7.4.2, panel B) allows the user to select two
things, the desired output file format and how the annotated data is to be added to the
input data set. The three possible output file formats are HTML, text, and E-mail.
There are distinct benefits to each type of output format. The HTML-based output allows
the user to link out to additional information about both the original input data and the
additional information that was provided by the annotation. The drawback of the HTML-based
output is that it is useful only for smaller data sets.
The text-based output is also useful only for smaller data sets, but provides the option of
downloading the output file to one’s computer.
The E-mail output format is the best option for most data sets for several reasons. The first
is that the user is able to receive data sets of any size using this output format. It is important
to note that if a large data set is used with another output format, it is possible that the
processing of the data will extend beyond the “time-out” period set by the Internet
browser being used. If this occurs, the user will receive an error message from the browser
stating something to the effect of, “The page cannot be displayed,” and the user may think
that there has been an error in the running of the Annotate tool. This is not the case; the
data processing has simply taken longer than the user’s browser is set to wait for a signal
back from the Annotate tool. A second reason why the E-mail output format is preferable
in most cases is that multiple data sets can be rapidly entered into the Annotate tool without
waiting for each of the analyses to finish.
In the example the Output to E-mail Address option is selected in section 5 and an E-mail
address (e.g., foo@bar.com) is entered into the text area.
8. In addition to the output format, the user can set the “All values on one line” or
“Multiple rows, one value per row” criteria in section 5. The user’s choice here
determines, to a great extent, what can be done with the output data file, and is
integrally related to the user’s intended purpose in using the Annotate tool. This is
because there are two primary uses for the Annotate tool. First, a user may simply
want to know more about each of the genes on the gene list. For example, one might
simply want to peruse the information while further studying one’s gene list, or one
might be required to include this information in a publication or patent filing. Either
way, one would want each row in the file to remain constant and to simply have
additional pieces of information added to the end of that gene row. If this is the user’s
intent, then the “All values on one line” option should be selected. The second use of
the Annotate tool is where one wants to use the output to analyze expression data in
relation to the functional characteristics of the genes in the data set. This can be
performed using the DRAGON View tools or by employing other gene expression
data analysis software packages such as Partek or GeneSpring. In order to accomplish
this, the “Multiple rows, one value per row” option should be selected.
The “Multiple rows, one value per row” option is selected from the drop-down menu.
9. After the entry of each input data set, the user presses “Submit Gene List” and
immediately receives a message stating, “Your data is being processed,
and will be mailed to foo@bar.com.” As soon as this message is
received, the user can go back and enter another data set for processing. All of the
output files will be E-mailed to the address supplied by the user as soon as they
become available.
10. Open and inspect results (see Guidelines for Understanding Results).
In the example used for this unit, the results of the annotation of the
figure2_combined_data.txt file would be received in the E-mail inbox of the address
specified. The output data file would be received as an attachment to the E-mail message
with the subject line DRAGONOutput. The message must be opened and allowed to fully
load. It is important to allow the message to fully load into the E-mail client; otherwise
the attached file can sometimes be truncated. Different E-mail clients provide different
methods for downloading and saving E-mail attachments to the computer. For this
demonstration, the attached output file would be clicked on with the right mouse button
on a Windows computer. Right-clicking on the file opens a directory-browsing window
that allows for the choice of where to save the file on the computer’s hard drive. In
subsequent discussion (see Support Protocol), this downloaded file is named
figure2_combined_data_KWS.txt, where KWS stands for “Keywords.”
11. Analyze results with DRAGON Families (see Support Protocol).
SUPPORT PROTOCOL
ANALYZING DATA WITH THE DRAGON Families TOOL
This protocol and accompanying demonstration focus on the use of the DRAGON
Families tool. Refer to the Learn page of the DRAGON Web site
(http://pevsnerlab.kennedykrieger.org/learn.htm) or the paper describing DRAGON View (Bouton and
Pevsner, 2002) for more information on the other DRAGON View tools (also see
Background Information).
Necessary Resources
Hardware
Windows, Linux, Unix, or Macintosh computer with an Internet connection
(preferably broadband connection, e.g., T1, T3, cable, or DSL service)
Software
Internet browser: e.g., MS Internet Explorer 5 (or higher) or Netscape 6 (or higher)
on Windows or Macintosh systems; Opera, Netscape 6 (or higher) or Mozilla on
Linux-based systems. Internet Explorer 5 or higher and Netscape 6 or higher are
preferred, because Netscape 4.x is not capable of supporting all of the
functionality provided in the DRAGON Paths tool.
Also required:
Spreadsheet program: e.g., MS Excel on Windows or Macintosh systems or Sun
Microsystems Star Office suite on Linux systems.
Text editor: e.g., TextPad (http://www.textpad.com/) or Notepad on Windows
systems; XEmacs (http://www.xemacs.org) on Linux systems.
Files
An annotated master matrix file created by running the DRAGON Annotate tool
(figure2_combined_data_KWS.txt; see Basic Protocol)
Prepare and format the data
1. Generate an annotation file by using the DRAGON Annotate tool (see Basic Protocol).
In the example used in this unit, the annotation file is named
figure2_combined_data_KWS.txt.
2. Before submitting the figure2_combined_data_KWS.txt file to the
DRAGON Families tool, the tab-delimited file must be converted into a comma-delimited file.
a. The tab-delimited file figure2_combined_data_KWS.txt is opened in
MS Excel.
b. The File menu option is selected, then Save As... is selected.
c. In the “Save as type:” drop-down menu at the bottom of the Save As window, “CSV
(Comma delimited) (*.csv)” is selected.
d. The file figure2_combined_data_KWS.csv is saved to the same directory
as figure2_combined_data_KWS.txt.
This process simply replaces the tab delimiter in the file with comma delimiters.
3. It is then necessary to get rid of any commas that may exist in the data. In MS Excel
the entire data set is selected by pressing Ctrl+A. Then the Replace function
is opened by pressing Ctrl+H (Replace can also be selected in the Edit menu option).
A comma is typed in the “Find what:” box and nothing is typed in the “Replace with:”
box. The Replace All button is then clicked.
This deletes all commas from the data in the file such as those that might be present in gene
names or SWISS-PROT keywords.
4. Finally a new file is created for each time point in the file. In the example here, each
file contains four columns: the GenBank accession numbers, the gene names, the
SWISS-PROT keywords, and the expression values for that time point. For each file,
these four columns are selected in the figure2_combined_data_KWS.csv
file by holding down the Ctrl key and clicking at the top of each column. Then, a new
spreadsheet is opened with Ctrl+N (or using the File menu option) and the four
columns are pasted into the new spreadsheet with Ctrl+V (or using the Edit menu
option).
Figure 7.4.3 The DRAGON Families page.
5. In each new file, the expression values need to be the first column in the file. This is
accomplished by selecting the expression values column in each file, cutting it by
pressing Ctrl+X (or using the Edit menu option), and pasting it into the first column
by holding down the right mouse button over the first column and selecting Insert
Cut Cells.
6. Each of these new files is saved as a .csv file in the same directory as the
figure2_combined_data_KWS.csv file. Multiple .csv files are saved using
this method and each is named by the time point data it contains (in the example here,
15mins.csv, 1hr.csv, 6hrs.csv, 24hrs.csv).
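The spreadsheet steps above (tab-to-comma conversion, comma removal, and one file per time point with the expression values first) can also be sketched as a small Python function. This is an illustrative sketch under stated assumptions, not a tool supplied with DRAGON: the function name, the column headers (GENBANK, NAME, KEYWORD, 15MIN, 1HR), and the toy row are hypothetical, the input is assumed to have a header row and the same number of fields in every row, and the header row is simply carried through because the text does not say whether DRAGON Families expects one.

```python
import csv
import io


def prepare_family_files(tab_text, annotation_columns, time_point_columns):
    """Return {time_point: csv_text}, each table with expression values first."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    # Step 3: delete commas inside values so every remaining comma is a delimiter.
    rows = [{name: value.replace(",", "") for name, value in row.items()}
            for row in reader]

    files = {}
    for time_point in time_point_columns:
        # Steps 4-5: keep the annotation columns plus one time point, and
        # place that time point's expression values in the first column.
        lines = [",".join([time_point] + annotation_columns)]
        for row in rows:
            fields = [row[time_point]] + [row[name] for name in annotation_columns]
            lines.append(",".join(fields))
        files[time_point] = "\n".join(lines) + "\n"
    return files


# Toy annotated row (made up); the gene name contains a comma on purpose.
example = ("GENBANK\tNAME\tKEYWORD\t15MIN\t1HR\n"
           "ACC_A\tgene one, alpha\tCytokine\t1.1\t2.0\n")
files = prepare_family_files(example, ["GENBANK", "NAME", "KEYWORD"],
                             ["15MIN", "1HR"])
print(files["1HR"])
```

Each value in the returned dictionary could then be written to its own .csv file (step 6), named after the time point it contains.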
Run DRAGON Families
7. Start DRAGON Families by opening the main DRAGON page and selecting
DRAGON Families from the links at the top of the page.
To get a sense of what the input data should look like, one or both of the example files
on the page can be viewed. These files should have a format similar to that of
figure2_combined_data_KWS.csv when that file is
opened in a text editor such as TextPad (see “Software” above) instead of MS Excel.
8. The DRAGON Families site (Fig. 7.4.3) is designed in a manner similar to the
DRAGON Annotate site. As described for the annotate page (see Basic Protocol), the
flow of data entry on the site is guided through sections of the page where a specific
task is accomplished in each section. For this demonstration the upload option is
selected in section 1.
Figure 7.4.4 As the final step in the analysis of the demonstration data, each time point contained in the Iyer et
al. (1999) data set, after having been associated with SWISS-PROT keyword information by DRAGON Annotate,
is analyzed using the DRAGON Families tool. The most coordinately up-regulated gene families are shown here
for three time points (15 min, 6 hr and 24 hr). Each gene is represented in its corresponding family as a box that
is clickable and hyperlinked to the NCBI LocusLink entry for that gene. Across each row, all the boxes correspond
to genes in a given family. Each box is also color-coded on a scale from red (up-regulated) to green (down-regulated). A scale at the top of the analysis page (not shown) gives the association of colors with ratio values. For all
the functional families that are annotated, the program returns the families ranked in order according to the average
ratio expression value for all of the genes in that group. Note that overall there is less differential regulation occurring
at the 15-min time point since there are no bright red squares present. By 6 hr certain gene families, particularly
those associated with inflammatory responses, are coordinately up-regulated. Finally by 24 hr, cell cycle and mitotic
gene families are coordinately differentially regulated, indicating that the cells are progressing through the cell
cycle. This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure
go to http://currentprotocols.com/colorfigures.
9. In section 2a (Fig. 7.4.3), the Browse button is used to find and select the first time
point .csv file, 15mins.csv. The name 15mins is typed into the text area below
the Browse button.
10. In section 3 (see Fig. 7.4.3), various parameters can be selected describing the data
set. The Red/Green Output, Human, and SWISS-PROT Keywords are selected for
this file.
11. In section 4, the columns containing the required data are defined as 1 for the
expression data, 3 for the GenBank accession numbers, and 4 for the Type values,
which in this case are SWISS-PROT keywords.
12. The text areas in Section 5 are left blank, thereby defining them as default variables.
If another delimiter, such as a pipe (|) is used, this can be entered in the delimiters
text area in order to replace the default comma (,) delimiter. Similarly, if another
type of newline character is used, this can be entered in the newline text area. The
Analyze Data button is clicked. A new page appears showing the results of the
DRAGON Families analysis.
Save results
13. As the first step in saving the page, especially if a DRAGON Families output page
is going to be used often, it is a good idea to save the page as an HTML file to the
local computer’s hard drive. This is accomplished by selecting the File menu option
in the Internet browser and choosing Save As.
Some browsers will warn that the page may not be displayed properly after being saved.
Click OK to this warning. The DRAGON Families HTML output pages have been designed
so that even if viewed locally on the user’s computer they will still be viewed correctly if
the computer is connected to the Internet at the time.
14. As a second step in saving the page, with the browser window selected (select a
window by simply clicking on any portion of it), press the Alt+Print Screen
keys to capture a copy of the image. This image is then pasted by pressing Ctrl+V
into MS PowerPoint for presentation and publication purposes.
Figure 7.4.4 displays such an image.
15. Once the output HTML page has been saved and any images of the analysis desired
have been captured and stored in another software system (e.g., MS PowerPoint), click
the Back button in the Internet browser to go back to the data input page; all of the
user’s settings remain selected. A second file can be input by clicking the Browse
button again and selecting the next desired file. This process of input, analysis and
output storage can be repeated as many times as is desired.
GUIDELINES FOR UNDERSTANDING RESULTS
DRAGON Annotate Results
The general structure of an annotated text output file is simple. All of the original data
provided in the user’s input data set are preserved and additional information derived from
the annotation process is added to the far right-hand side of the data set as additional
columns in the file. The delimiter that the user has defined for the data set is used to delimit
the newly added information.
The only major difference that may occur in the output file is dictated by whether the user
selected the “All values on one line” option or the “Multiple rows, one value per row”
option (Fig. 7.4.2, panel B). If the former option is selected, then any information provided
will be added on the same row as all of the input gene’s information. However, if the latter
option is selected, then the user may note a significant difference in the structure of this
output file. Specifically, if any of the types of information with which one has chosen to
annotate contains more than one value for a given gene, then the information for that gene
in the input data set will be duplicated on as many rows as there are values for that gene.
Each value will be added to the end of one row. For example, multiple SWISS-PROT
Keywords and/or Pfam families can be associated with a gene and its encoded protein. If
the “Multiple rows, one value per row” option is chosen when annotating with either of
these types of information, then in the output file the gene’s input data will be duplicated
on multiple rows with a new SWISS-PROT keyword or Pfam family number at the end
of each row.
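The difference between the two layouts can be sketched in a few lines of Python. This is an illustration only: the function name annotate, the "; " separator used for the one-line layout, and the handling of genes without any annotation (kept with an empty value) are assumptions, not documented DRAGON behavior.

```python
def annotate(rows, annotations, key_index, multiple_rows=False, joiner="; "):
    """Append annotation values to each input row.

    rows          -- list of lists (one list of fields per gene)
    annotations   -- dict mapping accession -> list of values (e.g., keywords)
    key_index     -- 0-based position of the accession within each row
    multiple_rows -- False: "All values on one line"
                     True:  "Multiple rows, one value per row"
    """
    out = []
    for row in rows:
        values = annotations.get(row[key_index], [])
        if not multiple_rows:
            # One output row per gene; all values share a single added field.
            out.append(row + [joiner.join(values)])
        else:
            # The gene's input data is duplicated once per value.
            for value in values or [""]:
                out.append(row + [value])
    return out


# Toy input (made up): one gene with two keywords, one gene with none.
rows = [["1.5", "ACC_A"], ["0.7", "ACC_B"]]
keywords = {"ACC_A": ["Cytokine", "Chemotaxis"]}
one_line = annotate(rows, keywords, key_index=1)
per_value = annotate(rows, keywords, key_index=1, multiple_rows=True)
print(one_line)
print(per_value)
```

In the second layout the gene with two keywords occupies two rows, which is the form that lends itself to sorting or grouping by keyword.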
This type of output allows the user to view data in reference to the biological characteristics of the genes and the encoded proteins in the data set. For example, once
downloaded to the user’s computer the output file can be opened in a spreadsheet program
such as MS Excel and the column containing the newly annotated biologically relevant
information can be used to sort the entire data set. The result is that the expression data
present in the input data set are now categorized according to biological function instead
of individual genes. As a result the expression patterns of sets of genes related by
functional properties such as the SWISS-PROT keywords “Nuclear protein,” “Calcium-binding,”
or “Proteoglycan” can be rapidly identified and analyzed for coordinate
regulation or other interesting properties.
This type of analysis is difficult to perform without visualization tools that make use of
the annotated information in order to define categories of genes related by certain
properties (e.g., shared keywords, functional protein domain classifications, and chromosomal localization). Some expression data software packages allow for this type of
categorical analysis. For example, the Partek Pro software package
(http://www.partek.com) allows for the color-coding of genes and experiments according to categorical
information that is incorporated into clustering and principal component analysis (PCA)
views of the data. However, in an effort to make these types of tools more readily
accessible to the users of DRAGON, the set of DRAGON View tools has been developed.
The design and implementation of the DRAGON View tools is ongoing and upgrades will
be documented in updates to this chapter as well as on the DRAGON and DRAGON View
Web sites.
DRAGON Families Results
Each of the time points in the Iyer et al. (1999) data set was analyzed in DRAGON
Families as described above. The most up-regulated gene families identified by DRAGON
Families for three of these time points are shown in Figure 7.4.4. The lack of any bright
red squares in the 15-min data corresponds with the early phase in the experiment.
Interestingly, though, even at this early stage, three families can be noted that are more
dramatically up-regulated later in the experiment. These are the “Inflammatory Response,”
“Chemotaxis,” and “Cytokine” families. As would be expected, the predominant
gene families being coordinately regulated at the 6-hr time point have to do with the
inflammatory response. Primarily, these families are “Inflammatory Response,”
“Chemotaxis,” and “Cytokine.” This result agrees well with the findings of Iyer et al. (1999).
Additional families identified at the 6-hr time point also agree with what was reported in
the original paper (see Figures 4 and 5 in Iyer et al., 1999). These families include
angiogenesis (“Angiogenesis”) and blood coagulation (“Blood Coagulation”) families.
Finally, at the 24-hr time point it is apparent that cell cycle and proliferation mechanisms
are at work. Genes in these families include members of the “Cyclins,” “Mitosis,” “DNA
Repair,” and “Cell Division” families. This result also agrees well with what was reported
in the original paper.
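The ranking stated in the Figure 7.4.4 caption (families ordered by the average ratio of their member genes) can be sketched as follows. The function name rank_families and the toy ratios are hypothetical, and a plain arithmetic mean is assumed; whether DRAGON Families averages raw or transformed ratios is not stated here.

```python
from collections import defaultdict


def rank_families(rows):
    """Rank families by the mean expression ratio of their member genes.

    rows -- iterable of (ratio, family) pairs, one pair per gene/keyword row,
            as in the "Multiple rows, one value per row" layout.
    Returns a list of (family, mean_ratio), highest mean first.
    """
    by_family = defaultdict(list)
    for ratio, family in rows:
        by_family[family].append(float(ratio))
    means = {family: sum(values) / len(values)
             for family, values in by_family.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy ratios (made up) for two families.
toy_rows = [(2.0, "Cytokine"), (4.0, "Cytokine"),
            (1.0, "Mitosis"), (0.5, "Mitosis")]
ranking = rank_families(toy_rows)
print(ranking)  # [('Cytokine', 3.0), ('Mitosis', 0.75)]
```

Running such a ranking separately on each time-point file would give one ordered family list per time point, which is the comparison made across the panels of Figure 7.4.4.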
The key point associated with these findings is that results similar to those of the
original Iyer et al. (1999) study were obtained even though an alternative, but complementary, approach
was used to derive them. Instead of starting with hierarchical clustering methods
and then manually searching the clustered genes for similar functionality, all genes in the
data set were first annotated with functional attributes and were then analyzed for
coordinate function. If this type of analysis had been run early in the original fibroblast study,
the investigators would have very rapidly gotten a sense of the types of biological
processes that were occurring in their data. This early knowledge could have informed
their further, in-depth analysis of the clustering of expression profiles. These two methods
are not mutually exclusive; instead they are complementary, and use of both acts to provide
a more rapid, comprehensive understanding of the biological patterns in a large-scale
expression data set.
COMMENTARY
Background Information
Why use DRAGON and the DRAGON
Annotate tool?
Researchers conducting large-scale gene expression research using technologies such as
Serial Analysis of Gene Expression (SAGE;
Velculescu et al., 1995) and microarrays (Bowtell, 1999; Cheung et al., 1999; Duggan et al.,
1999; Lipshutz et al., 1999) often find themselves wanting to rapidly and simultaneously
acquire information relating to the accession
numbers, biological characteristics, and other
attributes of large numbers of gene sequences.
Examples of these situations might be:
1. Before starting a microarray experiment,
an investigator would like to compare different
microarray technologies in order to assess
which platform best represents a functional
class of genes in which they are interested.
2. As part of the analysis of a large-scale
gene expression experiment, an investigator
hypothesizes that genes in a particular chromosomal region or of a given functional class
should be differentially regulated.
3. An investigator may want to acquire the
most up-to-date information in the public databases regarding the genes on their microarray
platform.
In these instances, the ability to “click”
through numerous Web sites and copy and paste
information for single genes into a spreadsheet
is not helpful when one has to do the same thing
for thousands of genes. This can take tens to
hundreds of hours, and is unnecessary due to
the availability of computational methods such
as those provided by the DRAGON Annotate
tool (see Basic Protocol).
The DRAGON Annotate tool associates biologically relevant information derived from numerous public databases with gene expression
information from microarray experiments. The
subsequent analysis process includes the association of relevant information with microarray
data and the visualization of microarray data in
the context of associated biological characteristics. To illustrate the use of DRAGON, the
authors of this unit have applied it to a microarray data set available via the Web. During the
analysis of this data set visual analysis methods
were used to examine the correlation between
gene expression patterns and biological characteristics such as membership in protein families and description by keywords. Results in the
demonstration data set using the DRAGON and
DRAGON View approaches closely matched
those reported in the original study that generated the demonstration data, and suggest that
these methods are complementary to the exploratory statistical approaches typically employed when examining large-scale gene expression data.
By integrating biologically relevant information with the analysis of large-scale expression data, certain types of gene expression phenomena can be discerned more easily and examined in light of the experimental paradigm
being tested. A comprehensive definition of
biological data regarding each gene on a microarray
list through the interconnection of as
many public databases as possible is an eventual goal in the development of DRAGON and
DRAGON View. DRAGON would then be able
to supply a multidimensional network of information related to the expression patterns and
biological characteristics of all genes on a microarray. This growth of DRAGON is dependent upon the continued integration of public
databases (Frishman et al., 1998). Along with
the question of database integration comes the
crucial matter of data integrity within and
across databases (Macauley et al., 1998).
Utility of DRAGON families and other
DRAGON tools
One of the first questions that might be asked
when considering the relationship between
functional relatedness and expression patterns
is whether genes that are functionally related
are also coordinately regulated. Often this type
of question is addressed using descriptive or
exploratory statistical tools. A variety of methods can be applied to the entire set of gene
expression data to describe expression patterns
or signatures within the data, including K-means and hierarchical clustering algorithms
(Michaels et al., 1998; Wen et al., 1998), principal component analysis, genetic network
analysis (Liang et al., 1998; Somogyi et al.,
1997; Szallasi, 1999; UNIT 7.3) and self-organizing maps (Tamayo et al., 1999; Toronen et al.,
1999; UNIT 7.3). These methods identify similarity in the expression patterns of groups or clusters of genes across time or sample. By grouping genes according to their expression patterns, investigators can then attempt to draw
inferences about the functional similarity of
coordinately regulated genes. Previous studies
have found that characteristics such as promoter elements, transcription factors, chromosomal loci, and cellular functions of encoded
proteins are associated with the coordinate expression of genes (Chu et al., 1998;
Eisen et al., 1998; Gawantka et al., 1998; Heyer
et al., 1999; Zhang, 1999; Spellman and Rubin,
2002). Others have used this assumption in
order to test the effectiveness of various clustering methods (Gibbons and Roth, 2002).
Starting with exploratory statistical methods and attempting to identify shared function
is a powerful method with many benefits. However, using large-scale annotation systems such
as DRAGON Annotate, expression data can
now be explored from the opposite direction.
Instead of starting with data and inferring function, the investigator can start with known
shared function and test for coordinate expression patterns. Neither method obviates the need
for the other; instead these two methods can
provide complementary analyses of a data set
which, when paired, provide deeper insight into
the significant biological findings of the experimental system being examined.
Whereas clustering starts with expression data and infers similar biological characteristics,
annotation with DRAGON and subsequent analysis with the DRAGON View tools
allow the investigator to start with biological
characteristics in order to identify which of
those characteristics are associated with coordinate gene expression. This approach to analyzing expression data is not typically used, because the task of understanding the biological
characteristics of the thousands of genes typically presented in a microarray data set is usually
left to the investigator’s knowledge of the system in question, literature searches, and the
tedious process of researching individual genes
in public databases via the World Wide Web. As
discussed, the DRAGON Annotate tool (see
Basic Protocol) solves this problem, thereby
making functional class–based gene expression
analysis possible. Currently, the primary tool
with which to perform this type of analysis is
DRAGON Families (see Support Protocol), which
is found on the DRAGON View Web site.
The DRAGON Families tool (see Support
Protocol) sorts several hundred functional
groups of genes to reveal families that have been
coordinately up-regulated or down-regulated.
DRAGON Families represents each gene in its
corresponding family as a box that is clickable
and hyperlinked to the NCBI LocusLink entry
for that gene. Across each row, all the boxes
correspond to genes in a given family. Each box
is also color coded on a scale from red (up-regulated) to green (down-regulated). Furthermore,
for all the functional families that are annotated,
the program returns the families ranked in order
according to the average ratio expression value
for all of the genes in that group (Fig. 7.4.5, panel
A, values in parentheses).
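The ranking step described above can be sketched in a few lines of Python. This is a minimal illustration and not the DRAGON Families implementation: the function name rank_families, the input layout (one (family, ratio) pair per gene), and the example values are all assumptions made for the sketch.

```python
from collections import defaultdict


def rank_families(annotated_genes):
    """Return (family, mean ratio, gene count) tuples, most up-regulated first.

    annotated_genes: iterable of (family_name, ratio) pairs, one per gene.
    This layout is hypothetical; DRAGON's internal format is not shown here.
    """
    ratios_by_family = defaultdict(list)
    for family, ratio in annotated_genes:
        ratios_by_family[family].append(ratio)

    ranked = [
        (family, sum(ratios) / len(ratios), len(ratios))
        for family, ratios in ratios_by_family.items()
    ]
    # Sort by average ratio, highest (most up-regulated) first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Illustrative values only (not taken from the demonstration data set).
example = [
    ("kinase", 2.0), ("kinase", 4.0),
    ("histone", 0.5), ("histone", 0.25),
    ("zinc finger", 1.0),
]

for family, mean_ratio, n_genes in rank_families(example):
    print(f"{family}\t{n_genes} genes\t({mean_ratio:.2f})")
```

Averaging raw ratios follows the description in the text; a user who prefers log ratios could transform the values before calling the function.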
Two other tools are currently available for
use in DRAGON View. The DRAGON Order
tool is similar to DRAGON Families in that it
visualizes a user’s gene expression data
sorted into functional
groups (Fig. 7.4.5, panel B). However,
DRAGON Order automatically presorts data
based on ratio expression values. For each functional group the tool generates a series of bars
(vertical lines), each of which represents a protein in that functional group. The position of
the vertical bar indicates the extent to which
that gene is up- or down-regulated. An equal
distribution of vertical lines across the whole
row means that there is no significant coexpression of a set of genes in that group. However,
clusters of lines at either the far left or the far
right of any given row are potentially interesting because they indicate that a set of related
genes are all up- or down-regulated. This kind
of information would be difficult to detect by
manual inspection of microarray data sets.
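The layout that DRAGON Order draws can be approximated as follows. This is a rough sketch under stated assumptions: the function name order_positions and both input dictionaries are hypothetical, and the left-to-right orientation (most up-regulated at the left) is taken from the Figure 7.4.5 caption rather than from the tool's source code.

```python
def order_positions(gene_ratios, gene_groups):
    """Map each functional group to the relative positions of its genes.

    gene_ratios: dict of gene id -> ratio expression value (hypothetical layout).
    gene_groups: dict of gene id -> functional group name (hypothetical layout).
    Positions run from 0.0 (most up-regulated, far left) to 1.0
    (most down-regulated, far right).
    """
    # Sort the whole gene list by ratio, most up-regulated first.
    ordered = sorted(gene_ratios, key=gene_ratios.get, reverse=True)
    last_index = max(len(ordered) - 1, 1)

    positions = {}
    for index, gene in enumerate(ordered):
        group = gene_groups.get(gene)
        if group is not None:
            positions.setdefault(group, []).append(index / last_index)
    return positions


# Illustrative values only.
ratios = {"g1": 8.0, "g2": 4.0, "g3": 1.0, "g4": 0.5, "g5": 0.25}
groups = {"g1": "group A", "g2": "group A", "g3": "group B", "g5": "group B"}

for group, marks in order_positions(ratios, groups).items():
    print(group, marks)
```

A group whose positions all fall near 0.0 or near 1.0 corresponds to the clusters of lines at one end of a row that the text describes as potentially interesting.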
The DRAGON Paths tool relies on cellular
pathway diagrams downloaded by file transfer
protocol from the KEGG database (Kanehisa
and Goto, 2000; Kanehisa et al., 2002).
DRAGON Paths maps gene expression values
onto cellular pathway diagrams (Fig. 7.4.5,
panel C). By viewing the expression levels
derived from microarray data within the context
of cellular pathways, the user is able to detect
patterns of expression as they relate to networks
of genes associated by cellular pathways.
DRAGON and DRAGON View architecture
The general structure of DRAGON and
DRAGON View is that of a relational database
(UNIT 9.1) accessed via the Internet through Common Gateway Interface (CGI) scripts written in
the Perl programming language (http://www.perl.com).
The CGI scripts handle user requests
and take care of the updating and management
of the data contained in the database (Fig.
7.4.6). The database acts as a repository of the
information collected from the public databases and provides rapid, flexible access to this
information. Rather than using BLAST
(http://www.ncbi.nlm.nih.gov/BLAST/; UNITS 3.3
& 3.4), Blat (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start),
or other sequence similarity searching methods to associate
the user’s input gene list with other database
information, DRAGON performs all annotation using the GenBank accession numbers provided in the user’s
gene lists. These accession numbers are joined
with other accession numbers via association
tables provided by the public databases (Fig.
7.4.7). For example, the Pfam database
(http://www.sanger.ac.uk/Software/Pfam/; UNIT
2.5) provides a table containing a list of every
SWISS-PROT number that is contained in a
Pfam family. The SWISS-PROT database
(http://www.expasy.ch/sprot/) provides a table
of GenBank accession numbers that are associated with a given SWISS-PROT accession
number. Thus, the correct combination of these
tables allows for the association of the user’s
GenBank accession numbers with SWISS-
Figure 7.4.5 Examples of the graphical outputs of the three types of DRAGON View tools. (A) DRAGON Families
produces rows of green (down-regulated), red (up-regulated), and gray (unchanged) boxes (see scale for the range
of ratio values represented by each color). Each box represents one gene and is hyperlinked to its corresponding
UniGene entry. Each row has a type identifier to its right that is hyperlinked to its description. To the far right is the
average ratio expression value for all of the genes in that family. All rows are sorted from the most up-regulated family
to the most down-regulated family. (B) DRAGON Order produces rows of black lines. Each line represents one gene
and its location in the row represents its position on a gene list sorted by ratio expression values. Lines at the far left
represent the most up-regulated genes (+) and lines at the far right represent the most down-regulated genes (–). Each
row’s type (e.g., SWISS-PROT keywords) is listed to the right. (C) DRAGON Paths maps the location and ratio
expression value of genes from the submitted gene list on to KEGG cellular pathway diagrams. A green (down-regulated), red (up-regulated) or gray (unchanged) circle followed by the ratio expression value is mapped to the upper
left corner of each corresponding protein box. Each protein box is hyperlinked to its corresponding LocusLink entry.
This black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure, go to
http://currentprotocols.com/colorfigures.
[Figure 7.4.6 diagram, labels only: Web-accessible databases (UniGene: Hs.data.Z, Mm.data.Z, Rn.data.Z; SWISS-PROT: sprot.dat.Z; Pfam: Pfam-A.full.gz; KEGG: hsa_gene_map.tab); automated Perl scripts (parseunigene.pl, parseswissprot.pl, parsepfam.pl, parsekegg.pl); DRAGON database (MySQL database); CGI scripts (query.cgi, querytable.cgi); Web site (user interaction via annotate and search pages). The server is a Dell PowerEdge 6300 running Red Hat Linux 6.2, Apache Web Server, and Perl 5.6.]
Figure 7.4.6 Database architecture for DRAGON. The data contained in the DRAGON database is derived
from Web-accessible databases that are downloaded by FTP, parsed using Perl scripts, and stored in tables in
the MySQL relational database management system. The DRAGON database is housed on a Dell PowerEdge
6300 dual processor server. The front end consists of a Web site that is searched using Perl (.cgi) scripts to
allow for user-defined queries of the database.
PROT accession numbers and any associated
Pfam family numbers and names. While being
dependent on the completeness and accuracy
of the association tables provided by the public
databases, this method provides for faster, more
efficient annotation. Furthermore, by annotating a gene list with various types of accession
numbers, a spectrum of information concerning the biological characteristics of the genes,
their associated proteins, and the proteins’ participation
in cellular pathways can be gained
(Fig. 7.4.8).
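The chain of association tables described above (GenBank to SWISS-PROT to Pfam) amounts to a pair of relational joins. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the MySQL database; the table names, column names, and accession values are placeholders invented for illustration and do not reflect the actual DRAGON schema.

```python
import sqlite3

# In-memory database standing in for the DRAGON MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_genes (genbank_acc TEXT);
    CREATE TABLE swissprot_genbank (swissprot_acc TEXT, genbank_acc TEXT);
    CREATE TABLE pfam_swissprot (pfam_acc TEXT, pfam_name TEXT, swissprot_acc TEXT);
""")

# Placeholder identifiers, not real accession numbers.
conn.executemany("INSERT INTO user_genes VALUES (?)",
                 [("GB0001",), ("GB0002",), ("GB0003",)])
conn.executemany("INSERT INTO swissprot_genbank VALUES (?, ?)",
                 [("SP0001", "GB0001"), ("SP0002", "GB0002")])
conn.executemany("INSERT INTO pfam_swissprot VALUES (?, ?, ?)",
                 [("PF0001", "example_domain", "SP0001")])

# LEFT JOINs keep every gene in the user's list, even when a public
# association table has no entry for it (the annotation is then NULL/None).
rows = conn.execute("""
    SELECT u.genbank_acc, s.swissprot_acc, p.pfam_acc, p.pfam_name
    FROM user_genes AS u
    LEFT JOIN swissprot_genbank AS s ON s.genbank_acc = u.genbank_acc
    LEFT JOIN pfam_swissprot AS p ON p.swissprot_acc = s.swissprot_acc
    ORDER BY u.genbank_acc
""").fetchall()

for row in rows:
    print(row)
```

The rows with missing values illustrate the dependence on the completeness of the association tables noted in the text: a gene absent from an upstream table cannot be annotated by any table further along the chain.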
Querying DRAGON
GenBank accession numbers are used as the
sole type of input accession number for the
DRAGON Annotate tool for several reasons.
First, GenBank accession numbers are the most
common type of accession number provided
with microarray gene lists. Second, GenBank
Figure 7.4.7 Overview of the information in DRAGON. This diagram represents a subset of the tables now
available in DRAGON and the possible connections between them. Depending upon what type of information is
desired, different sets of tables are joined with the tables containing microarray gene expression data (for
example, “Incyte Array Data” and “Incyte Numbers” in this diagram). Two “UniGene Human Numbers” tables are
used to expand the “GenBank #s” from the “Incyte Numbers” table into all “GenBank #s” associated with each
“UniGene ID” thereby providing a bridge between “GenBank #s” from the “Incyte Numbers” table and the “Swissprot
Numbers”, “TrEMBL Numbers”, “Transfac Factors” and “Transfac Sites” tables. Further characterization of the
proteins that genes from the microarray encode occurs by joining with tables derived from the SWISS-PROT, Pfam,
Interpro and OMIM databases.
accession numbers are not retired or drastically
changed over time like some other types of
accession numbers. Finally, although GenBank
numbers represent sequence fragments, these
fragments are collected with other fragments
into clusters by the NCBI’s UniGene database
(http://www.ncbi.nlm.nih.gov/UniGene/) or the
TIGR Gene Index (http://www.tigr.org/tdb/tgi/hgi/)
in order to identify their representation
of genes. All of these reasons make GenBank
accession numbers among the best types of
accession numbers to use for input into the
DRAGON Annotate tool. Work is under way
to allow the use of other
types of input accession numbers (e.g., TIGR
accession numbers or LocusLink accession
numbers).
Master matrix file
A number of the more widely used expression analysis software packages such as
GeneSpring (http://www.silicongenetics.com/cgi/SiG.cgi/index.smf),
Partek Pro (http://www.partek.com),
and Cluster/Treeview (http://rana.lbl.gov/EisenSoftware.htm; UNIT 6.2) use
master matrix files as one of their primary formats for data importing. Because of their ease
of use, simple integration with existing expression data analysis software packages, and human-readable nature, master matrix text files are
relied on, exclusively, for data importing and
analysis by DRAGON and DRAGON View.
An overview of the structure of a master
matrix file will aid in understanding how to use
DRAGON and DRAGON View more effectively. Specifically, there are a few important
attributes of a master matrix file that make it
useful during expression data analysis. First,
this type of file can be thought of as a “master”
file because it usually contains all of the data
available for a given experiment. For example,
in the fibroblast data set used for demonstration
[Figure 7.4.8 diagram, labels only: microarray; gene (GenBank no., UniGene ID, UniGene Name, chromosomal localization, Transfac site no., Transfac factor no., disease-related? (OMIM no.)); protein (SWISS-PROT no., protein name, keywords, sequence, function, Pfam no. 1 and Pfam family name, Pfam no. 2 and Pfam family name, Interpro no. 1, Interpro no. 2); involvement in cellular pathways (Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway no.).]
Figure 7.4.8 DRAGON uses accession numbers to define biological characteristics of genes and proteins. A
microarray is a regular array of thousands of unique cDNAs or oligonucleotides spotted on a solid support. Each
spot contains cDNA corresponding to a specific gene that encodes a protein. Accession numbers derived from
publicly available databases provide information about the biological characteristics of both the gene and its
corresponding protein. At the gene level, “Transfac Site” and “Transfac Factor” numbers indicate the presence
of promoter regions on the gene and factors that bind to those promoter regions respectively. The “GenBank
no.” and “UniGene ID” refer to EST sequences corresponding to fragments of the gene and a cluster of those
EST sequences respectively. The “UniGene Cytoband” indicates the chromosomal location of the gene. The
“UniGene Name” is the name of the gene. The “OMIM no.” indicates whether the gene is known to be involved
in any human diseases. At the protein level, “Pfam no.” and “Interpro no.” indicate which functional domains the
protein contains. The “SWISS-PROT no.” is a unique identifier for the protein and can be derived from either the
SWISS-PROT or TrEMBL databases. “SWISS-PROT Keywords” are derived from a controlled vocabulary of 827
words that are assigned to proteins in the SWISS-PROT database according to their function(s). “SWISS-PROT
Sequence” is the amino acid sequence for the protein. “SWISS-PROT Name” is the SWISS-PROT database
name for the protein.
purposes in this unit, the fig2clusterdata.txt file can be considered a master
matrix file because it contains data from all of
the time points monitored in the Iyer et al.
(1999) experiment.
Secondly, these are “matrix” files because
they contain data from experimental conditions
(i.e., time points, disease versus control, treated
versus untreated) as columns in the file, and
from gene sequences represented on the microarray as rows in the file.
As a side note, each row in a master matrix
file should contain a unique id that can be used
to identify all of the data in that row as belonging to a given sequence. This unique id is
particularly important when multiple elements
on a given microarray represent different sequence fragments derived from the same gene.
For example, in the fig2clusterdata.txt
file, the unique clone ids (“UNIQID”) are
used to identify each gene sequence. Alternatively, in Affymetrix GeneChip data sets, there
are often numerous element sets on a GeneChip
that represent different portions of the same
gene (see http://www.affymetrix.com for an in-depth discussion of the structure of Affymetrix
GeneChips). Primarily, this redundancy is designed into the chip in order to provide internal
controls for the expression data measured by
the chip. One would most often expect to see
separate element sets representing the same
gene displaying similar expression levels and
profiles (unless of course something like the
differential regulation of alternative splice
forms is occurring). In order to identify the
unique sequences representing a given gene,
Affymetrix provides a set of unique identifiers
that consist of a GenBank or RefSeq accession
number followed by a series of underscores and
letters such as AA123345_i_at. This
method of unique sequence identification
achieves two important goals: first, the GenBank accession number is provided, allowing
for a link to the public databases; second, the
underscored tag at the end maintains the
uniqueness of each sequence representing a
given gene.
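The identifier scheme just described can be taken apart programmatically. The sketch below is illustrative only: the regular expression is an assumption based on the single example given in the text (AA123345_i_at), it recognizes only GenBank-style accessions of one or two letters followed by five or six digits, and it does not attempt to handle RefSeq-style accessions that themselves contain an underscore.

```python
import re

# Assumed pattern: GenBank-style accession, then one or more underscore-
# separated letter tags (e.g. "AA123345_i_at" -> "AA123345" + "i_at").
_ID_PATTERN = re.compile(r"^([A-Z]{1,2}\d{5,6})_((?:[A-Za-z]+_)*[A-Za-z]+)$")


def split_element_id(element_id):
    """Return (accession, tag) for a recognized identifier, else None."""
    match = _ID_PATTERN.match(element_id)
    if match is None:
        return None
    return match.group(1), match.group(2)


def group_by_accession(element_ids):
    """Group element identifiers that share the same accession number."""
    groups = {}
    for element_id in element_ids:
        parsed = split_element_id(element_id)
        if parsed is not None:
            groups.setdefault(parsed[0], []).append(element_id)
    return groups


print(split_element_id("AA123345_i_at"))
print(group_by_accession(["AA123345_i_at", "AA123345_at", "M12345_at"]))
```

Grouping by accession makes it straightforward to check whether separate element sets representing the same gene show similar expression profiles, as the text suggests they usually should.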
Finally, master matrix files are “text flat
files” because they are normally stored as
either comma-delimited files in which commas are used to indicate column boundaries
or tab-delimited files in which tabs are used
to indicate column boundaries. Such text files
that do not have any relational structure are
referred to as “flat.” In other words, they are
not stored as multiple tables in a relational
database such as MySQL (http://www.mysql.com; UNIT 9.2),
Oracle (http://www.oracle.com),
or MS Access (http://www.microsoft.com/office/access/).
Instead, all of the data is contained in
one simple text file.
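The three attributes of a master matrix file (all conditions in one file, genes as rows with conditions as columns, and a flat delimited text layout) can be seen in a short Python sketch. The UNIQID column name comes from the fig2clusterdata.txt example mentioned above; the other column names, the row contents, and the function name read_master_matrix are assumptions made for illustration.

```python
import csv
import io


def read_master_matrix(handle, id_column="UNIQID", delimiter="\t"):
    """Read a delimited master matrix into {unique id: {column: value}}.

    Raises ValueError if a unique id appears on more than one row, since
    each row is expected to be identifiable by its own id.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    matrix = {}
    for row in reader:
        unique_id = row.pop(id_column)
        if unique_id in matrix:
            raise ValueError(f"duplicate unique id: {unique_id}")
        matrix[unique_id] = row
    return matrix


# A tiny tab-delimited example; the values are illustrative only.
sample = io.StringIO(
    "UNIQID\tNAME\t0HR\t1HR\n"
    "clone_1\texample gene A\t1.00\t2.50\n"
    "clone_2\texample gene B\t1.00\t0.40\n"
)

matrix = read_master_matrix(sample)
print(matrix["clone_1"]["1HR"])
```

Passing delimiter="," would read a comma-delimited file instead; values are returned as text and would need converting to numbers before analysis.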
Further research and development for
DRAGON and DRAGON View
Development of DRAGON and DRAGON
View is an ongoing process at many levels.
There are numerous “bug fixes” that are constantly being addressed. In addition, novel tools
and methods are being developed that will allow for additional methods of annotation and
analysis. For example, the integration of additional databases including the Gene Ontology
database (GO; http://www.geneontology.org;
UNIT 7.2), Interpro (http://www.ebi.ac.uk/interpro/),
the International Protein Index (IPI;
http://www.ensembl.org/IPI/), the Ensembl
genome databases (http://www.ensembl.org),
the TRANSFAC database (http://www.cbi.pku.edu.cn/TRANSFAC/),
the Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu/),
along with other databases, is planned for the
DRAGON Annotate tool. In addition, a system
allowing for the use of other types of input
accession numbers (e.g., those of TIGR, LocusLink, and SWISS-PROT) is being developed. Numerous additions and upgrades are
also planned for the DRAGON View tools. For
example, a batch import mode is required for
most of the tools so that users do not have to
break their master matrix files up into numerous
data input files, as had to be done for this
demonstration. Additionally, as a complement
to the visual methods already in use, quantitative statistical methods are being developed to
allow for the identification of the statistical
significance of the coordinate differential regulation of annotated gene families.
As discussed further below (see Critical Parameters and Troubleshooting), proteomic
technologies—such as antibody microarrays
(Lal et al., 2002), high-throughput mass spectroscopy methods, fluorescent differential 2-D
PAGE methods, and transfected cell microarrays (Bailey et al., 2002)—which allow for the
measurement of large-scale protein expression
patterns will eventually supplant the need for
the measurement of gene expression patterns.
The types of methods and analyses that can be
carried out using DRAGON and DRAGON
View are just as applicable to large-scale protein expression analysis as they are to large-scale gene expression analysis. In fact, many of
the types of information annotated by
DRAGON (e.g., Pfam functional domains, cellular functions, and cellular pathway participation) are more directly related to proteins than
they are to their encoding genes. The primary
difference, were the DRAGON and DRAGON
View tools to be used with proteomics data,
would simply be the type of accession numbers
that would be used in the input data set. Given
this, future developments of DRAGON will
allow for the use of protein accession numbers
such as SWISS-PROT accession numbers
(http://www.expasy.ch/sprot/) and International Protein Index accession numbers
(http://www.ebi.ac.uk/IPI/IPIhelp.html).
The DRAGON and DRAGON View tools
were originally developed to help answer a
simple question for which there was no good
method for investigation. In keeping with the
intent of their original development, it is the
hope of the authors that the methods provided
by the DRAGON and DRAGON View tools
will continue to facilitate novel types of research and analysis concerning large-scale
gene and eventually protein expression data.
Critical Parameters and
Troubleshooting
Often, errors with DRAGON and
DRAGON View tools are due to formatting
issues in the input data text file. The critical
points to remember with formatting are: (1)
what type of delimiter is being used (i.e.,
comma or tab) and (2) where is the critical
information in the file (i.e., what columns
contain GenBank accession numbers and other
data of interest). An important point about
delimiters is that the input data file will be read
incorrectly if the character being used as the
delimiter is found anywhere else in the input
file besides the separation between columns.
For example, in the demonstration described
in the Support Protocol, all of the commas in
the figure2_combined_data.csv file
were replaced with nothing. This was done
because commas in gene names and type information (e.g., Pfam family names, SWISS-PROT keywords) will be read as delimiters
when importing a .csv file into DRAGON or
DRAGON View tools.
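The delimiter problem described above can be checked for before a file is submitted. The following sketch is one possible approach rather than part of DRAGON: the function names strip_delimiter and inconsistent_rows and the example record are hypothetical. The first function removes the delimiter character from individual fields (the "replace with nothing" step), and the second reports lines whose column count differs from that of the first line.

```python
def strip_delimiter(fields, delimiter=","):
    """Remove the delimiter character from every field of one record."""
    return [field.replace(delimiter, "") for field in fields]


def inconsistent_rows(lines, delimiter=","):
    """Return 1-based numbers of lines whose column count differs from line 1."""
    expected = None
    bad = []
    for number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\n").split(delimiter))
        if expected is None:
            expected = count
        elif count != expected:
            bad.append(number)
    return bad


# Illustrative record whose name field contains a comma.
header = ["ID", "Name", "Ratio"]
record = ["id_1", "example protein, putative", "0.5"]

naive = [",".join(header), ",".join(record)]
cleaned = [",".join(header), ",".join(strip_delimiter(record))]

print(inconsistent_rows(naive))    # the embedded comma adds a column on line 2
print(inconsistent_rows(cleaned))  # no mismatched lines after cleaning
```

Saving the file as tab-delimited text is another way to avoid the problem, provided that no field contains a tab character.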
In order to check the validity of the information provided by the DRAGON Annotate
tool, users can perform one of several quality
control measures. First, random genes from the
annotated gene set can be selected and searched
for on the Web sites that were used in the
annotation process. For example, if a gene was
annotated with UniGene and SWISS-PROT
information, then the user can search for the
gene on those two Web sites in order to confirm
the information derived from DRAGON Annotate. Alternatively, there are other large-scale
annotation tools available that can be used instead of, or in addition to, DRAGON Annotate.
These include TIGR Manatee (http://manatee.sourceforge.net/),
Resourcerer (http://pga.tigr.org/tigr-scripts/magic/r1.pl),
and Affymetrix
NetAffx (http://www.affymetrix.com/analysis/index.affx).
These tools have strengths and limitations relative to the DRAGON Annotate tool.
For example, the Affymetrix NetAffx Web site
provides annotation for all of the Affymetrix
GeneChips. However, the limitation with the
site is that the user is only allowed to annotate
500 genes at a time and can only annotate genes
that are on one of the Affymetrix GeneChips.
Depending on their ease-of-use, these tools can
be used in addition to DRAGON Annotate in
order to compare the output of the systems. If
errors are discovered in the information annotated by DRAGON relative to the public database Web sites or other annotation tools, they
may be due to a need for updating of the
DRAGON data or to the fact that DRAGON is
under construction. In either case, feedback
concerning this type of matter is greatly appreciated and should be sent via E-mail to the
questions address provided on the DRAGON
Web site.
One major assumption that is made in the
use of DRAGON and DRAGON View, and,
indeed, in the analysis of most large-scale gene
expression data, is that the information associated with the expression data relates to the gene
whose expression patterns are being directly
measured. This of course is not the case for
many types of information. For example,
SWISS-PROT information and Pfam information are associated with the encoded protein,
not the gene being measured. Perhaps in the
majority of cases it is safe to assume that gene
expression levels are indicative of changes in
encoded protein expression levels. However,
eventual widespread use of proteomics technologies allowing for the large-scale measurement of protein expression levels will make this
assumption unnecessary. When this becomes
the case, it will be possible to apply the same
annotation and analysis methods provided by
DRAGON and DRAGON View to large-scale
protein expression data just as easily as to large-scale gene expression data.
Suggestions for Further Analysis
There is no fundamental reason why the
DRAGON View tools need to be the only tools
used for the analysis of expression data in
relation to annotated functional classes derived from the DRAGON Annotate tool. Once
the user obtains the annotated output data file,
any one of a number of analyses can be performed. This flexibility in analysis options is
a critical design feature of the DRAGON and
DRAGON View systems. For example, instead
of using DRAGON Families to search for the
coordinate regulation of functionally related
gene groups, a K-means clustering strategy
can be used to perform the same type of analysis (e.g., UNIT 7.3). An annotated expression data
set can be clustered using a K-means clustering
algorithm by gene expression values over time.
Following clustering, the user can search for
genes that are both clustered into the same
group and are associated with the same type of
annotated functional information. This is just
one example of the type of analysis that can be
performed once the user has access to the large
amounts of functionally relevant information
concerning all of the members of a gene expression data set that DRAGON Annotate provides.
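The post-clustering search described above can be sketched as a simple cross-tabulation. The clustering itself (e.g., by a K-means algorithm) is assumed to have been done already; the function name shared_cluster_and_annotation, the two input dictionaries, and the example values are hypothetical.

```python
from collections import defaultdict


def shared_cluster_and_annotation(cluster_of, annotation_of, min_genes=2):
    """Find genes that share both a cluster and an annotation.

    cluster_of: dict of gene id -> cluster label from a prior clustering run.
    annotation_of: dict of gene id -> functional annotation (e.g., a family name).
    Returns {(cluster, annotation): sorted gene ids} for every combination
    that contains at least min_genes genes.
    """
    groups = defaultdict(list)
    for gene, cluster in cluster_of.items():
        annotation = annotation_of.get(gene)
        if annotation is not None:
            groups[(cluster, annotation)].append(gene)
    return {
        key: sorted(genes)
        for key, genes in groups.items()
        if len(genes) >= min_genes
    }


# Illustrative values only.
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 0}
annotation_of = {"g1": "kinase", "g2": "kinase", "g3": "kinase",
                 "g4": "histone", "g5": "histone"}

print(shared_cluster_and_annotation(cluster_of, annotation_of))
```

This sketch only lists co-occurrences; judging whether a given overlap is larger than expected by chance would require a separate statistical test.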
Literature Cited
Bailey, S.N., Wu, R.Z., and Sabatini, D.M. 2002.
Applications of transfected cell microarrays in
high-throughput drug discovery. Drug Discov.
Today 7:S113-S118.
Bouton, C.M. and Pevsner, J. 2001. DRAGON:
Database Referencing of Array Genes Online.
Bioinformatics 16:1038-1039.
Bouton, C.M. and Pevsner, J. 2002. DRAGON
View: Information visualization for annotated
microarray data. Bioinformatics 18:323-324.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of
genome-wide expression patterns. Proc. Natl.
Acad. Sci. U.S.A. 95:14863-14868.
Frishman, D., Heumann, K., Lesk, A., and Mewes,
H-W. 1998. Comprehensive, comprehensible,
distributed and intelligent databases: Current
status. Bioinformatics 14:551-561.
Gawantka, V., Pollet, N., Delius, H., Vingron, M.,
Pfister, R., Nitsch, R., Blumenstock, C., and
Niehrs, C. 1998. Gene expression screening in
Xenopus identifies molecular pathways, predicts
gene function and provides a global view of
embryonic gene expression. Mech. Dev. 77:95-141.
Gibbons, F.D. and Roth, F.P. 2002. Judging the
quality of gene expression-based clustering
methods using gene annotation. Genome Res.
12:1574-81.
Heyer, L.J., Kruglyak, S., and Yooseph, S. 1999.
Exploring expression data: Identification and
analysis of coexpressed genes. Genome Res.
9:1106-1115.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G.,
Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M.,
Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The
transcriptional program in the response of human fibroblasts to serum. Science 283:83-87.
Kanehisa, M. and Goto, S. 2000. KEGG: Kyoto
encyclopedia of genes and genomes. Nucleic
Acids Res. 28:27-30.
Kanehisa, M. et al. 2002. The KEGG databases at
GenomeNet. Nucleic Acids Res. 30:42-46.
Lal, S.P., Christopherson, R.I., and dos Remedios,
C.G. 2002. Antibody arrays: An embryonic but
rapidly growing technology. Drug Discov. Today
7:S143-S149.
Liang, S., Fuhrman, S., and Somogyi, R. 1998.
Reveal, a general reverse engineering algorithm
for inference of genetic network architectures.
Pac. Symp. Biocomput. 3:18-29.
Bowtell, D.D.L. 1999. Options available-from start
to finish-for obtaining expression data by microarray. Nat. Genet. Suppl. 21:25-32.
Lipshutz, R.J., Fodor, S.P.A., Gingeras, T.R., and
Lockhart, D.J. 1999. High density synthetic oligonucleotide arrays. Nat. Genet. Suppl. 21:20-24.
Cheung, V.G., Morley, M., Aguilar, F., Massimi, A.,
Kucherlapati, R., and Childs, G. 1999. Making
and reading microarrays. Nat. Genet. Suppl.
21:15-19.
Macauley, J., Wang, H., and Goodman, N. 1998. A
model system for studying the integration of
molecular biology databases. Bioinformatics
14:575-582.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., and Herskowitz, I. 1998.
The transcriptional program of sporulation in
budding yeast. Science 282:699-705.
Michaels, G.S., Carr, D.B., Askenazi, M., Fuhrman,
S., Wen, X., and Somogyi, R. 1998. Cluster
analysis and data visualization of large-scale
gene expression data. Pac. Symp. Biocomput.
3:42-53.
Colantuoni, C., Henry, G., Zeger, S., and Pevsner, J.
2002. SNOMAD (Standardization and NOrmalization of MicroArray Data): Web-accessible
gene expression data analysis. Bioinformatics
18:1540-1541.
Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P., and
Trent, J.M. 1999. Expression profiling using
cDNA microarrays. Nat. Genet. Suppl. 21:10-14.
Somogyi, R., Fuhrman, S., Askenazi, M., and Wuensche, A. 1997. The gene expression matrix: Towards the extraction of genetic network architectures. Proc. Second World Cong. Nonlinear Analysts 1996. 30:1815-1824.
Spellman, P.T. and Rubin, G.M. 2002. Evidence for
large domains of similarly expressed genes in the
Drosophila genome. J. Biol. 1:5.1-5.8.
Szallasi, Z. 1999. Genetic network analysis in light
of massively parallel biological data acquisition.
Pac. Symp. Biocomput. 4:5-16.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and
Golub, T.R. 1999. Interpreting patterns of gene
expression with self-organizing maps: Methods
and applications to hematopoietic differentiation.
Proc. Natl. Acad. Sci. U.S.A. 96:2907-2912.
Toronen, P., Kolehmainen, M., Wong, G., and Castren,
E. 1999. Analysis of gene expression data using
self-organizing maps. FEBS Lett. 451:142-146.
Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. 1995. Serial analysis of gene expression. Science 270:484-7.
Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B.,
Smith, S., Barker, J.L., and Somogyi, R. 1998.
Large-scale temporal gene expression mapping
of central nervous system development. Proc.
Natl. Acad. Sci. U.S.A. 95:334-339.
Zhang, M.Q. 1999. Large-scale gene expression
data analysis: A new challenge to computational
biologists. Genome Res. 9:681-688.
Key References
Bouton and Pevsner, 2001. See above.
Original publication concerning the DRAGON database.
Bouton and Pevsner, 2002. See above.
Original publication concerning the DRAGON View
visualization tools.
Bouton, C.M., Hossain, M.A., Frelin, L.P., Laterra,
J., and Pevsner, J. 2001. Microarray analysis of
differential gene expression in lead-exposed astrocytes. Toxicol. Appl. Pharmacol. 176:34-53.
Research publication that reports use of DRAGON
and DRAGON View in the context of a toxicogenomic microarray study.
Iyer et al., 1999. See above.
Reports the microarray study from which the example data sets for this unit were derived.
Contributed by Christopher M.L.S. Bouton
LION Bioscience Research
Cambridge, Massachusetts
Jonathan Pevsner
Kennedy Krieger Institute and
Johns Hopkins University School of
Medicine
Baltimore, Maryland
Integrating Whole-Genome Expression
Results into Metabolic Networks with
Pathway Processor
UNIT 7.6
Genes never act alone in a biological system, but participate in a cascade of networks. As
a result, analyzing microarray data from a pathway perspective leads to a new level of
understanding the system. The authors’ group has recently developed Pathway Processor
(http://cgr.harvard.edu/cavalieri/pp.html), an automatic statistical method to determine
which pathways are most affected by transcriptional changes and to map expression data from
multiple whole-genome expression experiments on metabolic pathways (Grosu et al., 2002).
The Pathway Processor package (Fig. 7.6.1) consists of three programs, Data File
Checker, Pathway Analyzer (see Basic Protocol), and Expression Mapper (see Support
Protocol). The final protocol in the unit presents a method for comparing the results from
multiple experiments (see Alternate Protocol).
The first program included with the Pathway Processor package, called Data File Checker,
examines the input microarray data and checks whether it has the correct format for
Pathway Analyzer and Expression Mapper. The output from Data File Checker is a text file
called data.txt that constitutes the input of the two other programs.
SCORING BIOCHEMICAL PATHWAYS WITH PATHWAY PROCESSOR
Pathway Analyzer is a new method that uses the Fisher Exact Test to score biochemical
pathways according to the probability that as many genes or more in a pathway would
appear significantly altered by chance alone as are observed to be altered in a given experiment. Results
from multiple experiments can be compared, reducing the analysis from the full set of
individual genes to a limited number of pathways of interest.
BASIC
PROTOCOL
This tool is the first to include a statistical test to determine automatically the probability
that the genes of any of a large number of pathways are significantly altered in a given
experiment. Pathway Processor also provides a user-friendly interface, called Expression
Mapper (see Support Protocol), which automatically associates expression changes with
genes organized into metabolic maps (Grosu et al., 2002).
The Pathway Processor program, initially designed for the analysis of yeast and B. subtilis
expression data, can readily be adapted to the metabolic networks of other organisms.
The program can also be adapted to metabolic pathways other than those reported in
KEGG.
Necessary Resources
Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium
PC with 512 Mb of RAM performs very well.
Software
Pathway Processor is written completely in Sun Microsystems Java. It is freely
available on the Web page of the Bauer Center for Genomics Research
(http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul
Grosu (paul_grosu@harvard.edu) or Duccio Cavalieri (dcavalieri@cgr.harvard.edu).
The program can be downloaded from the Web together with the
detailed User’s Instruction Manual.
Contributed by Duccio Cavalieri and Paul Grosu
Current Protocols in Bioinformatics (2004) 7.6.1-7.6.19
Copyright © 2004 by John Wiley & Sons, Inc.
Figure 7.6.1 Flowchart of the Pathway Processor Project, including a screenshot of the directory structure of Pathway
Processor.
Files
The tab-delimited data text file is the file where one’s expression data will reside.
This data file must have the name data.txt, and will need to reside in the
data folder of the programs for which it will be used (this will be described in
greater detail later on; see step 1). This is the file used by Pathway Analyzer and
Expression Mapper. The file must contain normalized data in the format of
ratios. Data should not be log-transformed, since the programs will take care of
that where necessary.
The file must not have any headers and is of the following format: (1) the first
column must contain the yeast ORF names (for B. subtilis, use the SubtiList
accession numbers; e.g., BG11037; see note below); (2) the last column must
contain the normalized ratios; (3) there can be as many columns in between as
desired, but the authors recommend that only locus names be placed as the
middle column; this provides a quicker identification of the ORF in Expression
Mapper. Figure 7.6.2 shows an example.
There are some requirements and restrictions on the data file, i.e.: (a) the data file
must not contain any empty ORFs or ratios; (b) the data file must not contain
any 0 ratios, since the log of 0 is undefined;
(c) the data file must not contain duplicate ORFs since the statistics will be
skewed; (d) the data file must not contain any blank rows or columns; (e) the
data file must not contain any header columns nor extra lines or spaces except
for the text that is in each cell. Each cell must contain only one line and cannot
be spread across multiple lines.
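The format rules above lend themselves to an automated pre-check before the file is handed to Data File Checker. The following is a minimal Python sketch, not part of Pathway Processor: the function name validate_data_lines and its messages are hypothetical, and rejecting negative ratios is an added assumption (the text only forbids 0 ratios).

```python
import math


def validate_data_lines(lines):
    """Return (line_number, message) pairs for rows that break the data.txt rules."""
    problems = []
    seen_orfs = set()
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\r\n")
        if not row.strip():
            problems.append((number, "blank row"))
            continue
        cells = row.split("\t")
        if len(cells) < 2:
            problems.append((number, "need an ORF column and a ratio column"))
            continue
        orf, ratio_text = cells[0], cells[-1]
        # Rule (a), (d), (e): no empty cells and no stray spaces around cell text.
        if any(cell == "" or cell != cell.strip() for cell in cells):
            problems.append((number, "empty cell or stray spaces"))
        # Rule (c): duplicate ORFs would skew the statistics.
        if orf in seen_orfs:
            problems.append((number, "duplicate ORF " + orf))
        seen_orfs.add(orf)
        try:
            ratio = float(ratio_text)
        except ValueError:
            problems.append((number, "ratio is not a number"))
            continue
        # Rule (b): 0 ratios cannot be log-transformed.
        # Assumption: negative or non-finite ratios are rejected as well.
        if not math.isfinite(ratio) or ratio <= 0:
            problems.append((number, "ratio must be a positive number"))
    return problems
```

Calling validate_data_lines(open("data.txt").readlines()) would return an empty list for a file that satisfies every rule checked here; the real Data File Checker may apply additional checks.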
Figure 7.6.2 A valid data.txt file.
NOTE: For Bacillus subtilis it is necessary to use the SubtiList Accession numbers in
data.txt. For example, instead of using aadK, one needs to use BG11037. This is a
procedure that can easily be performed in Microsoft Access where one has a table of one’s
data and another table that associates the gene names (e.g., aadK) with the corresponding
SubtiList Accession number (in this case, BG11037). Such associations can be entered
as a table in Microsoft Access from different locations available freely on the Internet.
Feel free to contact either Paul Grosu (paul_grosu@harvard.edu) or Duccio Cavalieri
(dcavalieri@cgr.harvard.edu).
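The same name-to-accession substitution can be done outside Microsoft Access. The sketch below is one possible approach in Python; the function name to_accession_rows is hypothetical, and the lookup table is assumed to have been assembled by the user from a public source. Only the aadK/BG11037 pair comes from the note above.

```python
def to_accession_rows(rows, name_to_accession):
    """Replace the gene name in the first column with its SubtiList accession.

    rows: list of lists, first cell is a gene name (e.g., "aadK").
    name_to_accession: dict built by the user, e.g., {"aadK": "BG11037"}.
    Returns (converted_rows, unmatched_names).
    """
    converted = []
    unmatched = []
    for row in rows:
        accession = name_to_accession.get(row[0])
        if accession is None:
            # Keep track of names with no known accession so they can be reviewed.
            unmatched.append(row[0])
        else:
            converted.append([accession] + list(row[1:]))
    return converted, unmatched
```

Rows whose gene name has no entry in the lookup table are reported rather than silently dropped, so that the user can decide how to handle them before writing data.txt.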
Installing Pathway Processor
1. Pathway Processor comes as a compressed file called pathway_processor.zip.
The user will need to unzip this file, and all the proper directories and files will
be created.
All three programs (Pathway Analyzer, Expression Mapper, and Data File Checker) have
the same directory architecture. For each program there exists one main directory and three
subdirectories (data, library, results; Fig. 7.6.1).
The program and the three subdirectories reside in the Main Folder. In the data folder,
the user will put the data.txt file. The library folder contains data that the program
will use to process the user’s data. The results folder will receive all of the user’s output.
The JRE1.3.1 folder is used by the program to start running.
2. After performing the operations described in step 1, run the data file through the Data
File Checker program (steps 3 to 5). This program will remove any ORFs that are
not present in the pathway matrix against which the data are compared to perform
the statistics (this will be explained in more detail later), as well as any rows that contain
0 ratios.
Running the Data File Checker
3. Place the data.txt file in the data subdirectory of the data_file_checker
folder.
4. Go to the data_file_checker folder and double-click on the run.bat file.
Click the Process Request button in the dialog box that appears.
The program will parse the data.txt file and remove any ORFs that have 0 ratios or
that are not part of the latest SGD ORF listing. This SGD ORF listing is used by Pathway
Analyzer in a matrix form to do the statistical calculations. Updates to the pathway matrix
file will be done on a weekly basis. The pathway matrix file is called pathway_file.txt
and resides in the following subdirectories:
For the Data_File_Checker:
pathway_processor\\data_file_checker\\library\\pathway_file
For Pathway_Analyzer:
pathway_processor\\pathway_analyzer\\library\\pathway_file
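The filtering performed in step 4 can be approximated as follows. This is an illustrative Python sketch only: the layout of pathway_file.txt is not described in this unit, so the set of known ORFs is passed in directly as a hypothetical argument (known_orfs) rather than parsed from that file.

```python
def filter_rows(rows, known_orfs):
    """Keep rows whose ORF is known and whose ratio is not 0.

    rows: list of lists; first cell is the ORF name, last cell is the ratio.
    known_orfs: set of ORF names (assumed to be extracted from the pathway matrix).
    """
    kept = []
    for row in rows:
        orf = row[0]
        ratio = float(row[-1])
        if ratio == 0 or orf not in known_orfs:
            continue  # dropped, mirroring the behavior described for Data File Checker
        kept.append(row)
    return kept
```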
Figure 7.6.3 (A) Screen shot of the message window one receives when the Data File Checker
application has successfully parsed one’s data file. (B) Screenshot of the message window one
receives when the Data File Checker application has encountered an error while parsing one’s data
file. This message will alert the user to the row (line number) at which the error has occurred. The
user will need to open the file, usually with Microsoft Excel, and make the correction and rerun the
Data File Checker application. The data files always need to be saved as tab-delimited text files.
5a. Scenario 1: If the data.txt file was of the correct format, the message shown in Figure
7.6.3A will come up.
In the results folder, the new processed data.txt file will be found. This can be
placed in the data directory of pathway_analyzer or expression_mapper (see
Support Protocol).
5b. Scenario 2: If the data.txt file was not of the correct format, then the message
shown in Figure 7.6.3B will come up.
The next step would be to correct the data file where the error has occurred and then try
to run data_file_checker again on the new data file.
Running Pathway Analyzer
6. Place the data.txt file (from step 5a) in the data subdirectory of the pathway_analyzer
folder.
7. Go to the pathway_analyzer folder and double-click on the run.bat file.
The screen shown in Figure 7.6.4 will come up.
Figure 7.6.4 Screenshot of the Pathway Analyzer application main window.
8. The next step is to set the appropriate fold change cutoff. Pathway Analyzer will start
with a preset fold change cutoff for the Fisher Exact Test Statistic. The user should
choose the fold change based on the number of replicates that are combined to create
the data set, on the confidence that he or she has in the data, and on the type of
experiment. The Fisher Exact Test is based on the number of genes that pass the cutoff,
without considering the variance. In the experiment used as an example, the 1.8 fold
change was chosen also by looking at the Gaussian distribution of the fold changes
in the experimental data set. It was observed that, in this particular data set, the number
of genes with fold changes between 1.5 and 1.6 was much larger than the number between
1.8 and 1.9, and could
be the result of noise or variability in the measurements. It is suggested that the
analysis be performed with different cutoffs and that the one that gives the best values
of the Fisher Exact Test be chosen.
In Pathway Analyzer, the user specifies the magnitude of the difference in ORF expression
that is to be regarded as above background. The program uses the expression “fold change”
to indicate the relative change in gene expression, represented as the multiplier by which
the level of expression of a particular ORF is increased or decreased in an experiment.
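The relation between a ratio and a fold change can be illustrated with a short Python sketch. The signed convention used here (ratios below 1 are reported as negative fold changes) is an assumption inferred from the negative values shown in Table 7.6.1, and treating the cutoff as inclusive is likewise an assumption; the function names are hypothetical.

```python
def fold_change(ratio):
    """Convert an expression ratio to a signed fold change.

    Assumed convention: ratio 2.0 -> 2.0 (up), ratio 0.5 -> -2.0 (down).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return ratio if ratio >= 1 else -1.0 / ratio


def passes_cutoff(ratio, cutoff=1.8):
    """True if the magnitude of the fold change reaches the cutoff (assumed inclusive)."""
    return abs(fold_change(ratio)) >= cutoff
```

With a cutoff of 1.8, a ratio of 0.5 passes (fold change of -2.0) while a ratio of 1.5 does not. Working with the absolute value of the log2 ratio would give the same decision, since the convention is symmetric on the log scale.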
9. Click on the Process Request button. The status bar will then change from “Waiting
for process request” to “Working...Please wait for job to finish...”. When the program
is finished, the status bar will change to “Job done. Waiting for process request.”
The program will parse the data file and compare it to pathway_file.txt (pathway
matrix file). From this comparison it will compute the Fisher Exact Test values. All ratios are
transformed to log base 2 values before any analysis is performed. The first set of
tab-delimited text files that are generated are the following, all of which will be saved in
the results subdirectory of the pathway_analyzer directory:
gene_expression_pathway_summary_file.txt
pathway_summary_file.txt
The gene_expression_pathway_summary_file.txt will list, per pathway, all
the genes, with the associated fold change, that passed the cutoff. Table 7.6.1 is a small
sample of what it will look like.
The KEGG map number of each pathway is also listed in the header (first) row. This will
come in handy for the Expression Mapper (see Support Protocol).
The second file (Table 7.6.2), pathway_summary_file.txt, is the file containing
the unsigned and signed Fisher Exact Test values. Table 7.6.3 contains a
description of the content of the columns of the pathway_summary_file.txt.
The Signed Fisher Exact Test values will come in handy when doing pathway analysis
among multiple experiments, which is described in the Alternate Protocol.
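To make the scoring concrete, the sketch below computes a one-sided Fisher Exact Test p value for a single pathway as the upper tail of a hypergeometric distribution. It is an illustration written under assumptions, not the Pathway Analyzer source: the one-sided form is inferred from the phrase "as many or more genes", and the function and argument names are hypothetical. It requires Python 3.8 or later for math.comb.

```python
from math import comb


def fisher_upper_tail(total_genes, total_passing, pathway_genes, pathway_passing):
    """P(X >= pathway_passing) when pathway_genes are drawn from total_genes,
    of which total_passing exceed the fold change cutoff (hypergeometric tail)."""
    denominator = comb(total_genes, pathway_genes)
    upper = min(pathway_genes, total_passing)
    tail = sum(
        # Ways to pick k passing genes and (pathway_genes - k) non-passing genes.
        comb(total_passing, k) * comb(total_genes - total_passing, pathway_genes - k)
        for k in range(pathway_passing, upper + 1)
    )
    return tail / denominator
```

A pathway in which no gene passes the cutoff gets a p value of 1 under this definition, which agrees with the rows of Table 7.6.2 that list 0 genes exceeding the cutoff. The other values in that table cannot be reproduced here because the genome-wide totals are not given.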
Table 7.6.1 Visualization of a Detail of Two Columns of gene_expression_pathway_summary_file.txt, Opened Using Microsoft Excel
Pentose and glucuronate interconversions map40
YBR204C *** Fold Change: 1.82
YKL140W - TGL1 *** Fold Change: 2.05
YKL035W - UGP1 *** Fold Change: 2.38
Fructose and mannose metabolism map51
YDL055C - PSA1 *** Fold Change: −2.18
YGL253W - HXK2 *** Fold Change: −1.82
YDR368W - YPR1 *** Fold Change: 1.85
YCL040W - GLK1 *** Fold Change: 1.93
YKR009C - FOX2 *** Fold Change: 2.29
YJR159W - SOR1 *** Fold Change: 2.39
YIL107C - PFK26 *** Fold Change: 2.87
YJL155C - FBP26 *** Fold Change: 3.60
YDL243C - AAD4 *** Fold Change: 3.83
YCR107W - AAD3 *** Fold Change: 5.00
YFL056C - AAD6 *** Fold Change: 7.25
YJR155W - AAD10 *** Fold Change: 10.05
Table 7.6.2 Visualization of a Detail of the Second File Obtained from Pathway Analyzer, pathway_summary_file.txt, Opened Using Microsoft Excel
Genes in pathway present in the data file | Genes exceeding fold change cutoff (−1.8, 1.8) | Fisher Exact Test (−1.8, 1.8) | Up-regulation/Down-regulation of pathway (−1.8, 1.8) | Signed Fisher Exact Test (−1.8, 1.8) | Pathway
39 | 18 | 0.0084793 | 0.134658288 | 0.0084793 | Glycolysis/Gluconeogenesis, map10
3 | 0 | 1 | 0 | 1 | Styrene degradation map11
23 | 18 | 4.80E-07 | 2.477863857 | 4.80E-07 | Citrate cycle (TCA cycle) map 20
22 | 13 | 0.0016249 | 0.909243073 | 0.0016249 | Pentose phosphate cycle map30
8 | 3 | 0.3770615 | 1.78955758 | 0.3770615 | Pentose and glucuronate interconversions map 40
36 | 14 | 0.085082 | 0.539119765 | 0.085082 | Fructose and mannose metabolism map 51
29 | 8 | 0.5521999 | 1.257629699 | 0.5521999 | Galactose metabolism map 52
13 | 3 | 0.7303117 | 2.138632217 | 0.7303117 | Ascorbate and aldarate metabolism map 53
4 | 0 | 1 | 0 | 1 | Fatty acid biosynthesis (path 1) map 61
Table 7.6.3 Description of the Content of the Columns of the pathway_summary_file.txt
Column name | Column description
Genes in pathway present in the data file | Lists the number of genes that are in the particular pathway (named in the last column of that row) and also present in the data file.
Genes exceeding fold change cutoff (−1.8, 1.8) | Lists the number of genes that passed the cutoff in the particular pathway, listed in the last column in that row. In parentheses is listed the fold change range that was entered when the program was run.
Fisher Exact Test (−1.8, 1.8) | Lists the Fisher Exact Test value. In parentheses is listed the fold change range that was entered when the program was run.
Up-regulation/down-regulation of pathway (−1.8, 1.8) | Calculates the difference between the means of the log2 ratios of the genes that passed the cutoff within the pathway and all the genes that passed the cutoff. If the number is greater than zero, the pathway is up-regulated compared to the rest of the genome. If the number is less than zero, the pathway is down-regulated. If it is zero, it is not significant. In parentheses is listed the fold change range that was entered when the program was run.
Signed Fisher Exact Test (−1.8, 1.8) | Takes the sign of the up-regulation/down-regulation column, only if it is non-zero, and multiplies it by the Fisher Exact Test column value. If the up-regulation/down-regulation column is 0, then the value is automatically set to 0. If the value is greater than or equal to −0.0001 and less than 0, then the value is automatically set to −0.0001. This is done so that colors can be plotted correctly, since colors in ranges very close to zero in the negative region are considered 0 by some visualization programs. In parentheses is listed the fold change range that was entered when the program was run.
Pathway | This column lists the pathway name. The KEGG map number, which can be used in Expression Mapper, is listed in parentheses.
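The two derived columns described in Table 7.6.3 can be sketched in Python as shown below. The code follows the wording of Table 7.6.3 literally and is not taken from the program itself; the function names are hypothetical. Note that the sample rows in Table 7.6.2 show a Signed Fisher Exact Test value of 1 (not 0) for pathways whose up-regulation/down-regulation value is 0, so the actual program may treat that case differently from this sketch.

```python
from math import log2


def regulation_score(pathway_ratios, all_passing_ratios):
    """Mean log2 ratio of the passing genes in the pathway minus the mean
    log2 ratio of all passing genes; 0.0 if either list is empty."""
    if not pathway_ratios or not all_passing_ratios:
        return 0.0
    pathway_mean = sum(log2(r) for r in pathway_ratios) / len(pathway_ratios)
    overall_mean = sum(log2(r) for r in all_passing_ratios) / len(all_passing_ratios)
    return pathway_mean - overall_mean


def signed_fisher(p_value, score):
    """Attach the sign of the regulation score to the p value (Table 7.6.3 wording)."""
    if score == 0:
        return 0.0  # as stated in Table 7.6.3; Table 7.6.2 sample rows show 1 instead
    value = p_value if score > 0 else -p_value
    if -0.0001 <= value < 0:
        value = -0.0001  # keep tiny negative values distinguishable from 0 when plotting
    return value
```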
10. The extent of the alteration of the genes that show major changes, and their
position in the pathways identified as being of greatest interest with Pathway Analyzer,
can now be visualized. It is sufficient to note the KEGG map number of the pathway of
interest, as reported next to the pathway name in the Pathway
column (Table 7.6.2), and proceed to the analysis with Expression Mapper (see
Support Protocol).
DETAILED ANALYSIS WITH EXPRESSION MAPPER
Expression Mapper allows a detailed examination of the relationships among genes
in the pathways of interest. This program features a unique graphical output, displaying differences in expression on metabolic charts of the biochemical pathways to
which the ORFs are assigned. The gene names are visualized on the metabolic chart
together with the fold change, next to the biological step to which the gene has been
associated.
SUPPORT
PROTOCOL
The letters are colored in red if the gene is up-regulated and green if the gene is
down-regulated; the color intensity is proportional to the extent of the change in
expression.
Single pathways of interest can then be studied in detail using Expression Mapper.
Figure 7.6.5 Screenshot of the Expression Mapper application’s Map Manipulation Area window
using the new, checked data.txt file. The figure reports the Glycolysis/Gluconeogenesis pathway
(KEGG map 10). The text is colored in red if the gene is up-regulated or green
if it is down-regulated. The intensity of the color is proportional to the magnitude of the differential expression.
The presence of a gray box indicates that the corresponding step in the biochemical pathway
requires multiple gene products. This black and white facsimile of the figure is intended only
as a placeholder; for full-color version of figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.
Necessary Resources
Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium
PC with 512 Mb of RAM performs very well.
Software
Pathway Processor is written completely in Sun Microsystems Java. It is freely
available on the Web page of the Bauer Center for Genomics Research
(http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul
Grosu (paul_grosu@harvard.edu) or Duccio Cavalieri (dcavalieri@cgr.harvard.edu).
The program can be downloaded from the Web together with
the detailed User’s Instruction Manual.
Figure 7.6.6 The same screenshot as Figure 7.6.5, with the exception that the user has
dragged out the per-gene fold-changes. This black and white facsimile of the figure is intended
only as a placeholder; for full-color version of figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.
Files
data.txt file (see Basic Protocol)
1. After completing the Basic Protocol, place the data.txt file in the data subdirectory of the expression_mapper folder.
2. Go to the expression_mapper folder and double-click on the run.bat file.
3. Enter the KEGG map number of interest in the dialog box that appears, then click on
the Process Request button.
Be sure to type in a map that exists. If unsure, check the Pathway column from the
pathway_summary_file.txt file (Table 7.6.2) for the KEGG map number of
greatest interest.
4. A window will come up that will look similar to Figure 7.6.5 (this is for KEGG map
number 10).
Figure 7.6.7 JPEG output file that is saved from Figure 7.6.6 when one closes the Map Manipulation Area window. This
black and white facsimile of the figure is intended only as a placeholder; for full-color version of figure go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
Figure 7.6.8 This is a portion of the Map Manipulation Area window using the B. subtilis version
of the program. This black and white facsimile of the figure is intended only as a placeholder; for
full-color version of figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.
The user will notice that the fold changes of some of the ORFs are listed. These are locations
where only one ORF is present. The ORFs are colored with shades of red if they are
up-regulated and shades of green if they are down-regulated. The lighter the shade, the
more up-regulated or down-regulated the gene is. The darker the shade, the less up-regulated or down-regulated the gene is. The ratios are transformed into fold changes and
written on the graph.
The user will notice that some locations contain gray boxes. These are locations where
more than one gene is located. To view these genes, click on the gray box
and drag it out (Fig. 7.6.6).
5. Finally, once satisfied with the way the pathway layout looks, one can save
the image by closing the window. By closing the window, an output file of the
corresponding map number will be created in the results folder under the
expression_mapper directory. The file name will be created from the
template[mapnumber] output.jpg. For instance, if KEGG map number 10
is used, the output file will be 10 output.jpg. The output of KEGG map 10 is
shown in Figure 7.6.7.
For B. subtilis, there will be boxes which are green; these boxes are prerendered green by
KEGG to indicate that they contain B. subtilis genes. When they are seen on screen, it means
that the data do not contain those genes since they are left green and not overwritten with
either a gray box or with a specific B. subtilis gene and its associated fold change. Figure
7.6.8 shows an example. In that figure, EC number 6.3.2.4 is left green and not overwritten.
Remember that the green in 6.3.2.4 does not necessarily mean that the gene is down-regulated.
COMPARATIVE VISUALIZATION OF PATHWAY ANALYSIS FROM
MULTIPLE EXPERIMENTS
ALTERNATE
PROTOCOL
It is possible to use Pathway Analyzer to perform pathway analysis across multiple
experiments. To do this, first run the Pathway Analyzer program on each experiment of
interest with the same cutoffs on each of them (see Basic Protocol). Next, take the Signed
Fisher Exact Test column of each experiment and place them into one Excel spreadsheet.
Everything can then be sorted by the most interesting experiment, such that the most
up-regulated pathways are at the top and most down-regulated pathways are at the bottom.
From this, one can make a contour plot.
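The spreadsheet assembly described above can also be scripted. The following Python sketch merges the Signed Fisher Exact Test columns of several experiments and sorts the rows by one chosen experiment, with the most significant up-regulated pathways first and the most significant down-regulated pathways last. The function names and the dictionary-based input layout are hypothetical choices for illustration, and placing zero values in the middle is an assumption.

```python
def sort_key(signed_value):
    """Order: small positive p values first, then zeros, then negative values,
    ending with the negative values closest to zero (most significant down)."""
    if signed_value > 0:
        return (0, signed_value)
    if signed_value < 0:
        return (2, signed_value)
    return (1, 0.0)


def combine_experiments(columns, sort_by):
    """columns maps an experiment label to {pathway name: signed Fisher value}.

    Returns (header, rows) with one row per pathway of the sort_by experiment.
    Pathways missing from another experiment get None in that column.
    """
    labels = list(columns)
    reference = columns[sort_by]
    pathways = sorted(reference, key=lambda name: sort_key(reference[name]))
    rows = [[name] + [columns[label].get(name) for label in labels] for name in pathways]
    return ["Pathway"] + labels, rows
```

The resulting header and rows can be written out as tab-delimited text and opened in Excel for plotting.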
Figure 7.6.9 shows an example of such a plot, performed on data from the paper on
time-course expression during yeast diauxic shift (DeRisi et al., 1997), showing only the
top 10 most up-regulated and down-regulated Signed Fisher Exact Test pathways. The
data were downloaded from the Pat Brown Laboratory Web site
(http://cmgm.stanford.edu/pbrown/explore/array.txt).
Figure 7.6.9 Surface graph obtained using Microsoft Excel to plot all the Signed Fisher Exact Test
column values of the different pathway_summary_file.txt files.
To obtain line plots, one can preserve the sign of each Signed Fisher Exact Test value, subtract
its absolute value from 1, and plot the results in Microsoft Excel; Figure 7.6.10 shows the result for the top 10
up-regulated and down-regulated pathways. Figure 7.6.11 shows the figure modified from
the paper (DeRisi et al., 1997) itself. There is a very good correlation between the two
figures, indicating that Pathway Processor has automatically identified the more relevant
features of the process.
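The adjustment used for the line plots (keep the sign, subtract the absolute value from 1) can be written as a small Python function. This is an illustrative sketch with a hypothetical function name; mapping a value of 0 to 0 is an assumption, since the text does not say how unsigned zeros are handled.

```python
def plot_value(signed_value):
    """Map a signed p value to sign * (1 - |p|), so that significant pathways
    plot far from zero and non-significant ones plot near zero."""
    if signed_value == 0:
        return 0.0  # assumption: a zero entry carries no direction
    magnitude = 1.0 - abs(signed_value)
    return magnitude if signed_value > 0 else -magnitude
```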
According to the researcher’s preferences, the results can be visualized with different
visualization programs. The starting point is always an Excel file containing the values
from the Signed Fisher Exact Test column from different experiments.
Figure 7.6.10 Time course of the experiment described in DeRisi et al. (1997). The figure reports
an XY (Scatter) graph using Microsoft Excel to plot all the Signed Fisher Exact Test column values
of the different pathway_summary_file.txt files. The p values have been adjusted to plot with
large numbers at low p values and vice versa to show that one can get the same result as the
original DeRisi figure. This black and white facsimile of the figure is intended only as a placeholder;
for full-color version of figure go to http://www.interscience.wiley.com/c_p/colorfigures.htm.
Figure 7.6.11 Time course of the experiment described in DeRisi et al. (1997). The hours on the
horizontal bar indicate the time, during diauxic shift, at which the mRNA has been extracted. The
experiment compares differential expression at the indicated time with respect to a common reference.
This is a redrawing of the original figure appearing in DeRisi et al. (1997). This black and white
facsimile of the figure is intended only as a placeholder; for full-color version of figure go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
To generate similar heatmaps and view them with Eisen’s clustering programs (Eisen
et al., 1998), one would go through the following steps:
1. Take the sorted file and convert every value which is greater than −0.01 and less than
0.01 to 0.01 with the appropriate sign.
2. Take the reciprocal of every value.
3. Rename the Pathway column to Name. The file would look similar in format to Table 7.6.4.
4. Save the file as a tab-delimited text file.
5. Open Mike Eisen’s TreeView program.
6. Go to the File menu and select Load.
7. Select the type of file to be text (*.TXT) and select the file. The result will look
similar to Figure 7.6.12.
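Steps 1 to 4 above can be scripted instead of being carried out by hand in Excel. The Python sketch below is one way to do it; the function names are hypothetical, the choice of a positive sign for an exact 0 is an assumption (zero has no sign of its own), and the number formatting is arbitrary.

```python
def treeview_value(signed_value, floor=0.01):
    """Steps 1 and 2: clamp values inside (-floor, floor) to +/-floor, then invert."""
    if abs(signed_value) < floor:
        # Assumption: an exact 0 is treated as positive.
        signed_value = -floor if signed_value < 0 else floor
    return 1.0 / signed_value


def treeview_lines(header, rows):
    """Steps 3 and 4: rename the first column to Name and emit tab-delimited lines."""
    lines = ["\t".join(["Name"] + list(header[1:]))]
    for row in rows:
        cells = [row[0]] + [format(treeview_value(v), ".6g") for v in row[1:]]
        lines.append("\t".join(cells))
    return lines
```

Writing "\n".join(treeview_lines(header, rows)) to a .txt file gives a file that can be loaded as described in steps 5 to 7. The transformed values fall between 1 and 100 in magnitude, which matches the range seen in Table 7.6.4.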
Table 7.6.4 Detail of the Visualization of the Results of the Comparison of the Seven Experiments in the Time Course
of the DeRisi et al. (1997) Experiment^a
Name
9 hr
11 hr
13 hr
15 hr
Ribosome map 3010
Purine metabolism map 230
1
1
1
1
1
100
Pyrimidine metabolism map 240
RNA polymerase map 3020
Aminoacyl-tRNA biosynthesis
map 970
Methionine metabolism map 271
Selenoamino acid metabolism
map 450
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
17 hr
19 hr
21 hr
1
−1.005134
−1.220797 10.70821
−100
−31.96548
−100
−100
1
1
1
−1.881824
1
1
−5.985468
−8.275396
−100
−100
−100
−100
1
1
−100
−100
−29.94031
−43.05598
−100
−75.23784
^a Hours in the top row indicate the time, during diauxic shift, at which the mRNA has been extracted; the experiment compares differential expression
at the indicated time with respect to a common reference.
Figure 7.6.12 A screenshot of Mike Eisen’s TreeView program using the reciprocally adjusted
Signed Fisher Exact Test values to show how one can quickly visualize the results of multi-experiment pathway analysis.
Data from different experiments analyzed using Pathway Analyzer can be visualized with
the open-source visualization software OpenDX (http://www.opendx.org). This visualization program allows an elegant and detailed examination of the expression levels
observed in the experiment, according to pathways.
Figure 7.6.13 Picture representing the down-regulated pathways in the diauxic shift (DeRisi et
al., 1997), with the Fisher Exact Test Results visualized using OpenDX. The values of the Signed
Fisher Exact Test of the 21-hr data set have been sorted according to the value of the Fisher Exact
Test; the results of the other data sets for the affected pathways are also reported. The color of the
cube indicates the extent of the variation, according to the p values, with red being up-regulated,
green down-regulated, and yellow unchanged. The opacity visually represents the statistical
significance of the variation: the greater the opacity, the greater the significance of the p value. The
color of the cube depends on the p value in the following way: from 1 to 0.15 the color remains
yellow, from 0.15 to 0 with overexpression (+) it goes from yellow to red, and from 0.15 to 0 with
under-expression (−) it goes from yellow to green. This black and white facsimile of the figure is
intended only as a placeholder; for full-color version of figure go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
The advantage of OpenDX is that it visualizes data in three dimensions. An example is
shown in Figure 7.6.13. The input of the program consists of three files: one with the
pathway names, another with the Signed Fisher Exact Test, and a third with the header
row. The program represents each value graphically as a cube. The color of the cube
indicates the extent of the variation, based on the magnitude of the p values and the sign,
with red being up-regulated, green down-regulated, and yellow no change. The correspondence
between the color of the cube and the p value can be modulated according to the
user’s preferences; the authors suggest that the visualization be tuned in the following
way: from 1 to 0.15, the color remains yellow; from 0.15 to 0 with over-expression (+),
it goes from yellow to red; from 0.15 to 0 with under-expression (−), it goes from yellow
to green. To allow the eye to focus on the most significant results, it is also suggested that
the opacity be changed so that the greater the significance of the variation, the greater the
opacity (Fig. 7.6.13). The use of the program is not intuitive, and its application to
visualization of microarray data classified using Pathway Processor needs some fine
tuning. A detailed description of OpenDX itself is beyond the scope of this manual;
documentation and manuals on how to use OpenDX can be found at
http://www.opendx.org/support.html and http://www.opendx.org/index2.php. A book
with a more extensive tutorial can be found at http://www.vizsolutions.com/paths.html.
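The suggested color tuning amounts to a simple piecewise mapping from p value and direction to a color. The Python sketch below only illustrates that arithmetic; it does not use or represent any OpenDX interface. Linear interpolation in RGB space and the opacity formula (1 minus p) are assumptions, since the text specifies only the end points and the direction of the trend.

```python
def cube_color(p_value, up_regulated, threshold=0.15):
    """Return an (r, g, b) tuple with components between 0 and 1."""
    if p_value >= threshold:
        return (1.0, 1.0, 0.0)  # yellow: no notable change
    strength = 1.0 - p_value / threshold  # 0 at the threshold, 1 at p = 0
    if up_regulated:
        return (1.0, 1.0 - strength, 0.0)  # fades from yellow to red
    return (1.0 - strength, 1.0, 0.0)  # fades from yellow to green


def cube_opacity(p_value):
    """Assumed form: more significant (smaller p) means more opaque."""
    return 1.0 - p_value
```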
GUIDELINES FOR UNDERSTANDING RESULTS
Two tab-delimited text files are generated from the comparison files in Pathway Analyzer.
One, called gene_expression_pathway_summary_file.txt (Table 7.6.1),
contains all the genes that pass the cutoff, organized by pathway, and can be used to
retrieve lists of the genes with their fold changes, subdivided according to the KEGG
Pathway organization. The other, comb_pathway_summary_file.txt (Table 7.6.2),
contains the summary of the statistics for each pathway, which can be imported
into Microsoft Excel to enable the user to sort the results by various columns, to
determine the effect on the different pathways, and to be used as input in different
visualization tools.
The Signed Fisher Exact Test column of comb_pathway_summary_file.txt
(Table 7.6.2) allows the sorting of up-regulated or down-regulated pathways. The value
in this column is composed of two distinct parts. The first part carries the signs + or −,
indicating whether the particular pathway contains genes that are up- or down-regulated.
The second part of each entry is a positive real number (between 0 and 1), corresponding
to the p value of the Fisher Exact Test for the pathway. The sign is calculated by subtracting
the mean relative expression of all genes that pass the cutoff and are in the pathway from
the mean relative expression of the genes that pass the cutoff and are not within the
pathway (up-regulation/down-regulation column, Table 7.6.2). If there are no genes above
the cutoff in a pathway, the sign is arbitrarily set to +. This step is done only for
convenience, as the p values for such pathways will always be non-significant.
Sorting for the Signed Fisher Exact Test is done so that the most significant values are at
the top of the column (Table 7.6.2) for the up-regulated pathways and at the bottom for
the down-regulated pathways. In the middle are the least significant pathways. The values
of the Fisher Exact Test vector can be used to compare different experiments using
Microsoft Excel (Table 7.6.2), and the comparison among the different experiments can
be represented graphically. Programs for the graphical representation range from Excel
to more sophisticated packages; one example is OpenDX
(http://www.opendx.org), an open-source visualization software package (Fig. 7.6.13).
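The sort order described above (most significant up-regulated pathways at the top, most significant down-regulated pathways at the bottom, least significant in the middle) can be reproduced outside Excel. The following is a minimal sketch; the tuple layout (pathway, sign, p value) and the function name are assumptions made for illustration and do not reflect the file format itself.

```python
def sort_signed_results(results):
    """Order (pathway, sign, p_value) tuples as described for the Signed Fisher column.

    sign is '+' or '-'; p_value is the unsigned Fisher Exact Test value.
    """
    def key(item):
        _, sign, p_value = item
        if sign == "+":
            # Up-regulated block comes first, smallest p value at the top.
            return (0, p_value)
        # Down-regulated block comes second, smallest p value at the bottom.
        return (1, -p_value)

    return sorted(results, key=key)


# Hypothetical values, for illustration only.
example = [
    ("pathway A", "+", 0.5),
    ("pathway B", "-", 0.01),
    ("pathway C", "+", 0.001),
    ("pathway D", "-", 0.7),
]
ordered = [name for name, _, _ in sort_signed_results(example)]
```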
The resulting set of p values for all pathways is finally used to rank the pathways according
to the magnitude and direction of the effects. The Pathway Processor results from multiple
experiments can be compared, reducing the analysis from the full set of individual genes
to a limited number of pathways of interest. The probability that a given pathway is
affected is necessary to weigh the relative contribution of the biological process at work
to the phenotype studied.
COMMENTARY
Background Information
DNA microarrays provide a powerful technology for genomics research. The multistep,
data-intensive nature of this approach has created unprecedented challenges for the development of proper statistics and new bioinformatic
tools. It is of the greatest importance to integrate
information on the genomic scale with the biological information accumulated through years
of research on the molecular genetics, biochemistry, and physiology of the organisms that
researchers investigate.
A genomic approach for the understanding
of fundamental biological processes enables
the simultaneous study of expression patterns
of all genes for branch-point enzymes. Similarly, one can look for patterns of expression
variation in particular classes of genes, such as
those involved in metabolism, cytoskeleton,
cell-division control, apoptosis, membrane
transport, sexual reproduction, and so forth.
Interpreting such a huge amount of data requires a deep knowledge of metabolism and
cellular signaling pathways.
The availability of properly annotated pathway databases is one of the requirements for
analyzing microarray data in the context of
biological pathways. Efforts to establish proper
gene ontology (UNIT 7.2; Ashburner et al., 2000)
and pathway databases are continuing, and several resources are available publicly and commercially.
Efforts have also been made to integrate
functional genomic information into databases,
such as ArrayDB (Ermolaeva et al., 1998), SGD
(Ball et al., 2000; Ball et al., 2001), YPD, WormPD,
PombePD, and CalPD (Costanzo et al.,
2000), and KEGG (Nakao et al., 1999). The
ability to display information on pathway maps
is also extremely important. The Kyoto Encyclopedia of Genes and Genomes (KEGG;
Kanehisa et al., 2002), the Alliance for Cellular
Signalling, BioCarta, EcoCyc (Karp et al.,
2002a), MetaCyc (Karp et al., 2002b),
PathDB, and MIPS all organize existing metabolic information in easily accessible pathway
maps. Pathway databases will become more
useful as a unique and detailed annotation for
all the genes in the sequenced genomes becomes available. In this respect, the situation
for yeast contrasts with that for human, mouse,
and rat, for which the systematic and detailed
annotation and description of open reading
frame (ORF) function is still in progress.
The visualization of expression data on cellular process charts is also important. Many
authors have manually mapped transcriptional
changes to metabolic charts, and others have
developed automatic methods to assign genes
showing expression variation to functional
categories, focusing on single pathways.
KEGG, MetaCyc, and EcoCyc display expression data from some experiments on their
maps. Some commercial microarray analysis
packages, such as Rosetta Resolver (Rosetta
Biosoftware) have also integrated a feature enabling the display of expression of a given gene
in the context of a metabolic map.
MAPPFinder and GenMAPP (http://www.genmapp.org/; UNIT 7.5) are recently developed
tools allowing the display of expression results
on metabolic or cellular charts (Doniger et al.,
2003). The program represents the pathways in
a special file format called MAPPs, independent of the gene expression data, and enables the grouping of genes by any organizing
principle. The novel feature is that it both allows
visualization of networks according to the
structure reported in the current pathway databases and also provides the ability to modify
pathways ad hoc; it also makes it possible to
design custom maps and exchange pathway-related data among investigators. The map contains hyperlinks to the expression information
and to the information on every gene available
on public databases. Still, the program is limited because the information provided on the
map is not quantitative but only qualitative and
based on color coding, with a repressed gene
or an activated gene reported, respectively, as
green or red. This does not automatically indicate the pathways of greatest interest. Microarray papers tend to discuss the fact that a given
pathway is activated or repressed, based on the
number of genes activated or repressed, or just
on intuition, and too often the researcher finds
only what she or he expected, or already knew.
A recent paper from Castillo-Davis and
Hartl (2003) describes a method to assess the
significance of alteration in expression of diverse cellular pathways, taken from GO, MIPS,
or other sources. The program is extremely
useful but does not include a visualization tool
enabling one to map the results on pathway
charts, and does not address the problem arising
from applying hypergeometric distributions to
the analysis of highly interconnected pathways
with a high level of redundancy as defined
according to the GO terms.
Ideally, methods analyzing expression data
according to a pathway-based logic should give
an indication of the statistical significance of
the conclusions, provide a user-friendly interface, and be able to encompass the largest
number of possible interconnections between
genes. At the same time, the ability to separate larger
pathways into smaller independent subpathways is important when developing methods
that assess the statistical significance of up- or down-regulation of a pathway.
The Pathway Analyzer algorithm
Pathway Analyzer implements a statistical
method in Java, automatically identifying
which metabolic pathways are most affected by
differences in gene expression observed in a
particular experiment. The method associates
an ORF with a given biochemical step according to the information contained in 92 pathway
files from KEGG (http://www.genome.ad.jp/kegg/). Pathway Analyzer scores KEGG biochemical pathways, measuring the probability
that the genes of a pathway are significantly
altered in a given experiment. KEGG has been
chosen for the concise and clear way in which
the genes are interconnected, and for its curators’ great effort in keeping the information up
to date.
In deriving scores for pathways, Pathway
Analyzer takes into account the following factors: (1) the number of ORFs whose expression
is altered in each pathway; (2) the total number
of ORFs contained in the pathway; and (3) the
proportion of the ORFs in the genome contained in a given pathway.
In the first step of the analysis, the user
specifies the magnitude of the difference in
ORF expression that should be regarded as
above background. The relative change in gene
expression is the multiplier by which the level
of expression of a particular ORF is increased
or decreased in an experiment. Hence, Pathway
Analyzer allows the study of differences
smaller than a given fold change, but which
affect a statistically significant number of ORFs
in a particular metabolic pathway. Consistent
differential expression of a number of ORFs in
the same pathway can have important biological implications; for example, it may indicate
the existence of a set of coordinately regulated
ORFs. The program uses the Fisher Exact Test
to calculate the probability that a difference in
ORF expression in each of the 92 pathways
could be due to chance. A statistically significant probability means that a particular pathway contains more affected ORFs than would
be expected by chance. The program allows the
user to choose different values of the Fisher
Exact Test.
Fisher Exact Test
The analysis performed with the Fisher Exact Test provides a quick and user-friendly way
of determining which pathways are the most
affected. The one-sided Fisher Exact Test calculates a p value, based on the number of genes
whose expression exceeds a user-specified cutoff in a given pathway. This p value is the
probability that by chance the pathway would
contain as many affected genes as or more
affected genes than actually observed, the null
hypothesis being that the relative changes in
gene expression of the genes in the pathway are
a random subset of those observed in the experiment as a whole.
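The one-sided p value described above is the upper tail of a hypergeometric distribution. The sketch below computes it with the Python standard library (Python 3.8 or later for math.comb). It is an illustration of the test as described in the text, not the Java code used by Pathway Analyzer, and the function and parameter names are hypothetical.

```python
from math import comb


def one_sided_fisher_p(genes_total, affected_total, pathway_size, affected_in_pathway):
    """Probability of seeing at least `affected_in_pathway` affected genes by chance.

    genes_total:         genes present in the data file
    affected_total:      genes in the data file that pass the cutoff
    pathway_size:        genes of the pathway present in the data file
    affected_in_pathway: genes of the pathway that pass the cutoff
    """
    if not 0 <= affected_total <= genes_total:
        raise ValueError("affected_total must lie between 0 and genes_total")
    if not 0 <= pathway_size <= genes_total:
        raise ValueError("pathway_size must lie between 0 and genes_total")
    upper = min(pathway_size, affected_total)
    if not 0 <= affected_in_pathway <= upper:
        raise ValueError("affected_in_pathway is out of range")

    # Null hypothesis: the pathway is a random draw of `pathway_size` genes.
    denominator = comb(genes_total, pathway_size)
    numerator = sum(
        comb(affected_total, k) * comb(genes_total - affected_total, pathway_size - k)
        for k in range(affected_in_pathway, upper + 1)
    )
    return numerator / denominator
```

A pathway with no genes above the cutoff gets a p value of 1 under this definition, which agrees with the rows of Table 7.6.2 that list 0 genes exceeding the cutoff.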
Literature Cited
Ashburner, M., Ball, C.A., Blake, J.A., Botstein,
D., Butler, H., Cherry, J.M., Davis, A.P.,
Dolinski, K., Dwight, S.S., Eppig, J.T., Harris,
M.A., Hill, D.P., Issel-Tarver, L., Kasarskis,
A., Lewis, S., Matese, J.C., Richardson, J.E.,
Ringwald, M., Rubin, G.M., and Sherlock, G.
2000. The Gene Ontology Consortium. Nat.
Genet. 25:25-29.
Ball, C.A., Dolinski, K., Dwight, S.S., Harris,
M.A., Issel-Tarver, L., Kasarskis, A., Scafe,
C.R., Sherlock, G., Binkley, G., Jin, H.,
Kaloper, M., Orr, S.D., Schroeder, M., Weng,
S., Zhu, Y., Botstein, D., and Cherry, J.M.
2000. Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res. 28:77-80.
Ball, C.A., Jin, H., Sherlock, G., Weng, S.,
Matese, J.C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Schroeder, M., Botstein, D., and
Cherry, J.M. 2001. Saccharomyces genome
database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 29:80-81.
Castillo-Davis, C.I. and Hartl, D.L. 2003. GeneMerge-post-genomic analysis, data mining,
and hypothesis testing. Bioinformatics
19:891-892.
Costanzo, M.C., Hogan, J.D., Cusick, M.E.,
Davis, B.P., Fancher, A.M., Hodges, P.E.,
Kondu, P., Lengieza, C., Lew-Smith, J.E.,
Lingner, C., Roberg-Perez, K.J., Tillberg, M.,
Brooks, J.E., and Garrels, J.I. 2000. The yeast
proteome database (YPD) and Caenorhabditis
elegans proteome database (WormPD): Comprehensive resources for the organization and
comparison of model organism protein information. Nucleic Acids Res. 28:73-76.
DeRisi, J.L., Iyer, V.R., and Brown, P.O. 1997.
Exploring the metabolic and genetic control of
gene expression on a genomic scale. Science
278:680–686.
Doniger, S.W., Salomonis, N., Dahlquist, K.D.,
Vranizan, K., Lawlor, S.C., and Conklin, B.R.
2003. MAPPFinder: Using gene ontology and
GenMAPP to create a global gene-expression
profile from microarray data. Genome Biol.
4:R7.
Eisen, M.B., Spellman, P.T., Brown, P.O., and
Botstein, D. 1998. Cluster analysis and display
of genome-wide expression patterns. Proc.
Natl. Acad. Sci. U.S.A. 95:14863-14868.
Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler,
G.D., Bittner, M.L., Chen, Y., Simon, R.,
Meltzer, P., Trent, J.M., and Boguski, M.S.
1998. Data management and analysis for gene
expression arrays. Nat. Genet. 20:19-23.
Grosu, P., Townsend, J.P., Hartl, D.L., and Cavalieri, D. 2002. Pathway processor: A tool for
integrating whole-genome expression results
into metabolic networks. Genome Res.
12:1121-1126.
Kanehisa, M., Goto, S., Kawashima, S., and
Nakaya, A. 2002. The KEGG databases at
GenomeNet. Nucleic Acids Res. 30:42-46.
Karp, P.D., Riley, M., Saier, M., Paulsen, I.T.,
Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., and Gama-Castro, S.
2002a. The EcoCyc database. Nucleic Acids
Res. 30:56-58.
Karp, P.D., Riley, M., Paley, S.M., and Pellegrini-Toole, A. 2002b. The MetaCyc database. Nucleic Acids Res. 30:59-61.
Nakao, M., Bono, H., Kawashima, S., Kamiya, T.,
Sato, K., Goto, S., and Kanehisa, M. 1999.
Genome-scale gene expression analysis and
pathway reconstruction in KEGG. Genome Inform. Ser. Workshop Genome Inform. 10:94-103.
Internet Resources
http://www.cgr.harvard.edu/cavalieri/pp.html
Duccio Cavalieri CGR Web site.
http://www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes
(KEGG) home page.
http://www.proteome.com/databases/YPD/YPDsearch-quick.html
The yeast proteome database (YPD) home page.
http://www.ncgr.org/genex/
GeneX, Gene expression home page at the National
Center for Genome Resources.
http://www.opendx.org
http://www.opendx.org/index2.php
The open-source visualization software, OpenDX,
along with manuals and tutorials.
http://cmgm.stanford.edu/pbrown/
The Pat Brown Laboratory Web site.
http://genome-www.stanford.edu/Saccharomyces/
The Saccharomyces Genome database (SGD) home
page.
Contributed by Duccio Cavalieri and
Paul Grosu
Bauer Center for Genomics Research
Harvard University
Cambridge, Massachusetts
Integrating Whole-Genome Expression
Results into Metabolic Networks with
Pathway Processor
UNIT 7.6
Genes never act alone in a biological system, but participate in a cascade of networks. As
a result, analyzing microarray data from a pathway perspective leads to a new level of
understanding the system. The authors’ group has recently developed Pathway Processor
(http://cgr.harvard.edu/cavalieri/pp.html), an automatic statistical method to determine
which pathways are most affected by transcriptional changes and to map expression data from
multiple whole-genome expression experiments on metabolic pathways (Grosu et al., 2002).
The Pathway Processor package (Fig. 7.6.1) consists of three programs, Data File
Checker, Pathway Analyzer (see Basic Protocol), and Expression Mapper (see Support
Protocol). The final protocol in the unit presents a method for comparing the results from
multiple experiments (see Alternate Protocol).
The first program included with the Pathway Processor package, called Data File Checker,
examines the input microarray data and checks whether it has the correct format for
Pathway Analyzer and Expression Mapper. The output from Data File Checker is a text file
called data.txt that constitutes the input of the two other programs.
BASIC PROTOCOL
SCORING BIOCHEMICAL PATHWAYS WITH PATHWAY PROCESSOR
Pathway Analyzer is a new method that uses the Fisher Exact Test to score biochemical
pathways according to the probability that as many or more genes in a pathway would be
significantly altered in a given experiment by chance alone. Results
from multiple experiments can be compared, reducing the analysis from the full set of
individual genes to a limited number of pathways of interest.
This tool is the first to include a statistical test to determine automatically the probability
that the genes of any of a large number of pathways are significantly altered in a given
experiment. Pathway Processor also provides a user-friendly interface, called Expression
Mapper (see Support Protocol), which automatically associates expression changes with
genes organized into metabolic maps (Grosu et al., 2002).
The Pathway Processor program, initially designed for the analysis of yeast and B. subtilis
expression data, can readily be adapted to the metabolic networks of other organisms.
The program can also be adapted to metabolic pathways other than those reported in
KEGG.
Necessary Resources
Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium
PC with 512 MB of RAM performs very well.
Software
Pathway Processor is written completely in Sun Microsystems Java. It is freely
available on the Web page of the Bauer Center for Genomics Research
(http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul
Grosu (paul_grosu@harvard.edu) or Duccio Cavalieri (dcavalieri@cgr.harvard.edu).
The program can be downloaded from the Web together with the
detailed User’s Instruction Manual.
Contributed by Duccio Cavalieri and Paul Grosu
Current Protocols in Bioinformatics (2004) 7.6.1-7.6.19
Copyright © 2004 by John Wiley & Sons, Inc.
Figure 7.6.1 Flowchart of the Pathway Processor Project, including a screenshot of the directory structure of Pathway
Processor.
Files
The tab-delimited data text file is the file where one’s expression data will reside.
This data file must have the name data.txt, and will need to reside in the
data folder of the programs for which it will be used (this will be described in
greater detail later on; see step 1). This is the file used by Pathway Analyzer and
Expression Mapper. The file must contain normalized data in the format of
ratios. Data should not be log-transformed, since the programs will take care of
that where necessary.
The file must not have any headers and is of the following format: (1) the first
column must contain the yeast ORF names (for B. subtilis, use the SubtiList
accession numbers; e.g., BG11037; see note below); (2) the last column must
contain the normalized ratios; (3) there can be as many columns in between as
desired, but the authors recommend that only locus names be placed as the
middle column; this provides a quicker identification of the ORF in Expression
Mapper. Figure 7.6.2 shows an example.
There are some requirements and restrictions on the data file: (a) the data file
must not contain any empty ORFs or ratios; (b) the data file must not contain
any 0 ratios, since the logarithm of 0 is undefined;
(c) the data file must not contain duplicate ORFs, since the statistics will be
skewed; (d) the data file must not contain any blank rows or columns; (e) the
data file must not contain any header rows, nor extra lines or spaces except
for the text that is in each cell. Each cell must contain only one line and cannot
be spread across multiple lines.
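The restrictions listed above can be checked before running Data File Checker. The following Python sketch is not the Data File Checker program itself; the function name and messages are hypothetical, and treating negative ratios as invalid is an added assumption (the text only forbids 0 ratios).

```python
def validate_data_lines(lines):
    """Return a list of (line_number, message) problems found in data.txt content."""
    problems = []
    seen_orfs = set()
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text.strip():
            problems.append((number, "blank row"))
            continue
        cells = text.split("\t")
        if len(cells) < 2:
            problems.append((number, "need at least an ORF column and a ratio column"))
            continue
        if any(not cell.strip() for cell in cells):
            problems.append((number, "empty cell"))
            continue
        orf, ratio_text = cells[0].strip(), cells[-1].strip()
        try:
            ratio = float(ratio_text)
        except ValueError:
            # A header row would typically be caught here.
            problems.append((number, "last column is not a number"))
            continue
        if ratio <= 0:
            # Assumption: ratios must be strictly positive so that log2 is defined.
            problems.append((number, "ratio must be greater than 0"))
        if orf in seen_orfs:
            problems.append((number, "duplicate ORF"))
        seen_orfs.add(orf)
    return problems
```

A typical use would be to read the file with open("data.txt", encoding="utf-8") and pass the file object to validate_data_lines; an empty result means none of the listed problems was detected.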
Figure 7.6.2 A valid data.txt file.
NOTE: For Bacillus subtilis it is necessary to use the SubtiList accession numbers in
data.txt. For example, instead of using aadK, one needs to use BG11037. This
conversion can easily be performed in Microsoft Access, where one has a table of one’s
data and another table that associates the gene names (e.g., aadK) with the corresponding
SubtiList accession number (in this case, BG11037). Such associations can be entered
as a table in Microsoft Access from different locations available freely on the Internet.
For assistance, feel free to contact either Paul Grosu (paul_grosu@harvard.edu) or Duccio Cavalieri
(dcavalieri@cgr.harvard.edu).
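The same name-to-accession conversion can be done without Microsoft Access. The sketch below assumes the association table has already been loaded into a Python dictionary; the function name and the row layout (gene name, ratio) are hypothetical, and only the aadK/BG11037 pair comes from the text.

```python
def to_accessions(rows, name_to_accession):
    """Replace gene names with SubtiList accession numbers.

    rows: iterable of (gene_name, ratio) pairs.
    Returns (converted, unmapped): converted rows are (accession, gene_name, ratio),
    matching the ORF / locus name / ratio column order of data.txt; unmapped
    lists the gene names that had no entry in the association table.
    """
    converted, unmapped = [], []
    for gene_name, ratio in rows:
        accession = name_to_accession.get(gene_name)
        if accession is None:
            unmapped.append(gene_name)
        else:
            converted.append((accession, gene_name, ratio))
    return converted, unmapped


# The single association below is the example given in the NOTE.
table = {"aadK": "BG11037"}
converted, unmapped = to_accessions([("aadK", 2.1), ("unknownGene", 0.7)], table)
```

Genes reported as unmapped should be resolved or dropped before the file is passed to Data File Checker.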
Installing Pathway Processor
1. Pathway Processor comes as a compressed file called pathway_processor.zip.
The user will need to unzip this file and all the proper directories and files will
be created.
All three programs (Pathway Analyzer, Expression Mapper, and Data File Checker) have
the same directory architecture. For each program there exists one main directory and three
subdirectories (data, library, results; Fig. 7.6.1).
The program and the three subdirectories reside in the Main Folder. In the data folder,
the user will put the data.txt file. The library folder contains data that the program
will use to process the user’s data. The results folder will receive all of the user’s results.
The JRE1.3.1 folder (the bundled Java Runtime Environment) is used by the program to start running.
2. After performing the operations described in step 1, run the data file through the Data
File Checker program (steps 3 to 5). This program will remove any ORFs that are
not present in the pathway matrix against which the data are compared to perform
the statistics (this will be explained in more detail later), as well as any entries that contain
0 ratios.
Running the Data File Checker
3. Place the data.txt file in the data subdirectory of the data_file_checker
folder.
4. Go to the data_file_checker folder and double-click on the run.bat file.
Click the Process Request button in the dialog box that appears.
The program will parse the data.txt file and remove any ORFs that have 0 ratios or
that are not part of the latest SGD ORF listing. This SGD ORF listing is used by Pathway
Analyzer in a matrix form to do the statistical calculations. Updates to the pathway matrix
file will be done on a weekly basis. The pathway matrix file is called
pathway_file.txt and resides in the following subdirectories:
For the Data_File_Checker:
pathway_processor\data_file_checker\library\pathway_file
For Pathway_Analyzer:
pathway_processor\pathway_analyzer\library\pathway_file
Figure 7.6.3 (A) Screenshot of the message window one receives when the Data File Checker
application has successfully parsed one’s data file. (B) Screenshot of the message window one
receives when the Data File Checker application has encountered an error while parsing one’s data
file. This message will alert the user to the row (line number) at which the error has occurred. The
user will need to open the file, usually with Microsoft Excel, make the correction, and rerun the
Data File Checker application. The data files always need to be saved as tab-delimited text files.
5a. Scenario 1: If the data.txt file was of the correct format, the message shown in Figure
7.6.3A will come up.
In the results folder, the new processed data.txt file will be found. This can be
placed in the data directory of pathway_analyzer or expression_mapper (see
Support Protocol).
5b. Scenario 2: If the data.txt file was not of the correct format, then the message
shown in Figure 7.6.3B will come up.
The next step would be to correct the data file where the error has occurred and then try
to run data_file_checker again on the new data file.
Running Pathway Analyzer
6. Place the data.txt file (from step 5a) in the data subdirectory of the
pathway_analyzer folder.
7. Go to the pathway_analyzer folder and double-click on the run.bat file.
The screen shown in Figure 7.6.4 will come up:
Figure 7.6.4 Screenshot of the Pathway Analyzer application main window.
8. The next step is to set the appropriate fold change cutoff. Pathway Analyzer will start
with a preset fold change cutoff for the Fisher Exact Test statistic. The user should
choose the fold change based on the number of replicates that are combined to create
the data set, on the confidence that he or she has in the data, and on the type of
experiment. The Fisher Exact Test is based on the number of genes that pass the cutoff,
without considering the variance. In the experiment used as an example, the 1.8 fold
change was chosen also by looking at the Gaussian distribution of the fold changes
in the experimental data set. It was observed that, in this particular data set, the number
of genes with fold changes between 1.5 and 1.6 was much larger than the number between
1.8 and 1.9, and could be the result of noise or variability in the measurements. It is
suggested that the analysis be performed with different cutoffs and that the one that
gives the best values of the Fisher Exact Test be chosen.
In Pathway Analyzer, the user specifies the magnitude of the difference in ORF expression
that is to be regarded as above background. The program uses the expression “fold change”
to indicate the relative change in gene expression, represented as the multiplier by which
the level of expression of a particular ORF is increased or decreased in an experiment.
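The relationship between a normalized ratio, its fold change, and the cutoff can be sketched as follows. This is an illustration under stated assumptions, not the program's own code: the signed convention (a ratio below 1 reported as a negative fold change) is inferred from the negative values in Table 7.6.1, and treating a value exactly at the cutoff as passing is an assumption.

```python
from math import log2


def signed_fold_change(ratio):
    """Express a ratio as a signed multiplier (e.g., 0.5 becomes -2.0)."""
    if ratio <= 0:
        raise ValueError("ratio must be greater than 0")
    return ratio if ratio >= 1.0 else -1.0 / ratio


def passes_cutoff(ratio, cutoff=1.8):
    """True if the change is at least `cutoff`-fold in either direction.

    The comparison is done on log base 2 values, so increases and
    decreases of the same magnitude are treated symmetrically.
    """
    if ratio <= 0:
        raise ValueError("ratio must be greater than 0")
    return abs(log2(ratio)) >= log2(cutoff)
```

For example, with the 1.8 cutoff used in this unit, a ratio of 1.85 and a ratio of 0.5 both pass, while a ratio of 1.5 does not.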
9. Click on the Process Request button. The status bar will then change from “Waiting
for process request” to “Working...Please wait for job to finish...”. When the program
is finished, the status bar will change to “Job done. Waiting for process request.”
The program will parse the data file and compare it to pathway_file.txt (pathway
matrix file). From this comparison it will compute the Fisher Exact Test values. All ratios are
transformed to log base 2 values before performing any kind of analysis. The first set of
tab-delimited text files that are generated are the following, all of which will be saved in
the results subdirectory of the pathway_analyzer directory:
gene_expression_pathway_summary_file.txt
pathway_summary_file.txt
The gene_expression_pathway_summary_file.txt will list, per pathway, all
the genes, with the associated fold change, that passed the cutoff. Table 7.6.1 is a small
sample of what it will look like.
The KEGG map number of each pathway is also listed in the header (first) row. This will
come in handy for the Expression Mapper (see Support Protocol).
The second file (Table 7.6.2), pathway_summary_file.txt, is the file containing
the signed and unsigned Fisher Exact Test values. Table 7.6.3 contains a
description of the content of the columns of the pathway_summary_file.txt.
The Signed Fisher Exact Test values will come in handy when doing pathway analysis
among multiple experiments, which is described in the Alternate Protocol.
Table 7.6.1 Visualization of a Detail of Two Columns of gene_expression_pathway_summary_file.txt, Opened Using Microsoft Excel
Pentose and glucuronate interconversions map40
YBR204C *** Fold Change: 1.82
YKL140W - TGL1 *** Fold Change: 2.05
YKL035W - UGP1 *** Fold Change: 2.38
Fructose and mannose metabolism map51
YDL055C - PSA1 *** Fold Change: −2.18
YGL253W - HXK2 *** Fold Change: −1.82
YDR368W - YPR1 *** Fold Change: 1.85
YCL040W - GLK1 *** Fold Change: 1.93
YKR009C - FOX2 *** Fold Change: 2.29
YJR159W - SOR1 *** Fold Change: 2.39
YIL107C - PFK26 *** Fold Change: 2.87
YJL155C - FBP26 *** Fold Change: 3.60
YDL243C - AAD4 *** Fold Change: 3.83
YCR107W - AAD3 *** Fold Change: 5.00
YFL056C - AAD6 *** Fold Change: 7.25
YJR155W - AAD10 *** Fold Change: 10.05
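Entries of the kind shown in Table 7.6.1 can be parsed with a regular expression when the gene lists need to be reused elsewhere. The pattern below is a sketch inferred only from the sample rows in the table; the actual file may contain variations that it does not cover, and the function name is hypothetical.

```python
import re

# ORF, optional " - NAME", then "*** Fold Change: <number>".
# The number may carry an ASCII hyphen or a typographic minus sign (U+2212).
ENTRY = re.compile(
    r"^(?P<orf>\S+)(?:\s+-\s+(?P<name>\S+))?\s+\*\*\*\s+"
    r"Fold Change:\s+(?P<fold>[-\u2212]?\d+(?:\.\d+)?)$"
)


def parse_entry(line):
    """Return (orf, name_or_None, fold_change), or None if the line is not a gene entry."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    fold = float(match.group("fold").replace("\u2212", "-"))
    return match.group("orf"), match.group("name"), fold
```

Pathway header lines such as "Fructose and mannose metabolism map51" do not match the pattern and yield None, which makes it easy to split the file into per-pathway blocks.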
Table 7.6.2 Visualization of a Detail of the Second File Obtained from Pathway Analyzer, pathway_summary_file.txt, Opened Using Microsoft Excel
Genes in pathway present in the data file | Genes exceeding fold change cutoff (−1.8, 1.8) | Fisher Exact Test (−1.8, 1.8) | Up-regulation/Down-regulation of pathway (−1.8, 1.8) | Signed Fisher Exact Test (−1.8, 1.8) | Pathway
39 | 18 | 0.0084793 | 0.134658288 | 0.0084793 | Glycolysis/Gluconeogenesis, map10
3 | 0 | 1 | 0 | 1 | Styrene degradation map11
23 | 18 | 4.80E-07 | 2.477863857 | 4.80E-07 | Citrate cycle (TCA cycle) map 20
22 | 13 | 0.0016249 | 0.909243073 | 0.0016249 | Pentose phosphate cycle map30
8 | 3 | 0.3770615 | 1.78955758 | 0.3770615 | Pentose and glucuronate interconversions map 40
36 | 14 | 0.085082 | 0.539119765 | 0.085082 | Fructose and mannose metabolism map 51
29 | 8 | 0.5521999 | 1.257629699 | 0.5521999 | Galactose metabolism map 52
13 | 3 | 0.7303117 | 2.138632217 | 0.7303117 | Ascorbate and aldarate metabolism map 53
4 | 0 | 1 | 0 | 1 | Fatty acid biosynthesis (path 1) map 61
Table 7.6.3 Description of the Content of the Columns of the pathway_summary_file.txt
Column name | Column description
Genes in pathway present in the data file | Lists the number of genes in the particular pathway (listed in the last column in that row) that are also present in the data file.
Genes exceeding fold change cutoff (−1.8, 1.8) | Lists the number of genes that passed the cutoff in the particular pathway, listed in the last column in that row. In parentheses is listed the fold change range that was entered when the program was run.
Fisher Exact Test (−1.8, 1.8) | Lists the Fisher Exact Test value. In parentheses is listed the fold change range that was entered when the program was run.
Up-regulation/down-regulation of pathway (−1.8, 1.8) | Calculates the difference between the means of the log2 ratios of the genes that passed the cutoff within the pathway and all the genes that passed the cutoff. If the number is greater than zero, the pathway is up-regulated compared to the rest of the genome. If the number is less than zero, the pathway is down-regulated. If it is zero, it is not significant. In parentheses is listed the fold change range that was entered when the program was run.
Signed Fisher Exact Test (−1.8, 1.8) | Takes the sign of the up-regulation/down-regulation column (only if it is non-zero) and multiplies it by the Fisher Exact Test column value. If the up-regulation/down-regulation column is 0, then the value is automatically set to 0. If the value is greater than or equal to −0.0001 and less than 0, then the value is automatically set to −0.0001. This is done so that colors can be plotted correctly, since colors in ranges very close to zero in the negative region are considered 0 by some visualization programs. In parentheses is listed the fold change range that was entered when the program was run.
Pathway | This column lists the pathway name. The KEGG map number, which can be used in Expression Mapper, is listed in parentheses.
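The two derived columns described in Table 7.6.3 can be sketched as follows. The code follows the wording of the table literally; how the program itself handles edge cases (for example, pathways with no genes above the cutoff) may differ, and the function and parameter names are hypothetical.

```python
from math import log2


def mean(values):
    return sum(values) / len(values)


def signed_fisher(pathway_ratios, all_ratios, p_value):
    """Return (regulation, signed_value) for one pathway.

    pathway_ratios: ratios of genes in the pathway that passed the cutoff
    all_ratios:     ratios of all genes that passed the cutoff
    p_value:        unsigned Fisher Exact Test value for the pathway
    """
    if not pathway_ratios or not all_ratios:
        # Assumption: no genes above the cutoff is treated as no regulation.
        regulation = 0.0
    else:
        regulation = mean([log2(r) for r in pathway_ratios]) - mean(
            [log2(r) for r in all_ratios]
        )

    if regulation == 0:
        return regulation, 0.0

    signed = p_value if regulation > 0 else -p_value
    # Small negative values are clamped so that plotting tools do not treat them as 0.
    if -0.0001 <= signed < 0:
        signed = -0.0001
    return regulation, signed
```

For instance, a down-regulated pathway with a p value of 4.80E-07 would be reported as −0.0001 rather than −4.80E-07 under this rule.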
10. The extent of the alteration of the genes that show major changes, and their
position in the pathways identified as of greatest interest with Pathway Analyzer,
can now be visualized. It is sufficient to note the KEGG map number of the pathway of
interest, as reported next to the pathway name in the Pathway
column (Table 7.6.2), and proceed to the analysis with Expression Mapper (see
Support Protocol).
SUPPORT PROTOCOL
DETAILED ANALYSIS WITH EXPRESSION MAPPER
Expression Mapper allows a detailed examination of the relationships among genes
in the pathways of interest. This program features a unique graphical output, displaying differences in expression on metabolic charts of the biochemical pathways to
which the ORFs are assigned. The gene names are visualized on the metabolic chart
together with the fold change, next to the biological step to which the gene has been
associated.
The letters are colored in red if the gene is up-regulated and green if the gene is
down-regulated; the color intensity is proportional to the extent of the change in
expression.
Single pathways of interest can then be studied in detail using Expression Mapper.
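The red/green coloring rule can be illustrated with a short sketch. It is not Expression Mapper's actual rendering code: the function name, the use of log base 2 magnitudes, the black starting color, and the saturation_fold parameter (the fold change at which the color reaches full intensity) are all assumptions.

```python
from math import log2


def expression_color(ratio, saturation_fold=4.0):
    """Return an (R, G, B) tuple with channels from 0 to 255.

    Red for ratios of 1 or more, green for ratios below 1; intensity grows
    with the magnitude of the change and is capped at `saturation_fold`.
    """
    if ratio <= 0:
        raise ValueError("ratio must be greater than 0")
    if saturation_fold <= 1:
        raise ValueError("saturation_fold must be greater than 1")
    magnitude = min(abs(log2(ratio)) / log2(saturation_fold), 1.0)
    level = round(255 * magnitude)
    return (level, 0, 0) if ratio >= 1 else (0, level, 0)
```

Capping the intensity keeps a few extreme fold changes from washing out the differences among the remaining genes on the map.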
Figure 7.6.5 Screenshot of the Expression Mapper application’s Map Manipulation Area window
using the new, checked data.txt file. The figure reports the Glycolysis/Gluconeogenesis pathway
(KEGG map 10). The text is colored in red if the relative change in gene expression is ≥1 or green
if it is <1. The intensity of the color is proportional to the magnitude of the differential expression.
The presence of a gray box indicates that the corresponding step in the biochemical pathway
requires multiple gene products. This black and white facsimile of the figure is intended only
as a placeholder; for full-color version of figure go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
Necessary Resources
Hardware
PC running Microsoft Windows. The authors have found that a 700 MHz Pentium
PC with 512 Mb of RAM performs very well.
Software
Pathway Processor is written completely in Sun Microsystems Java. It is freely
available on the Web page of the Bauer Center for Genomics Research
(http://www.cgr.harvard.edu/cavalieri/pp.html), or by contacting either Paul
Grosu (paul_grosu@harvard.edu) or Duccio Cavalieri (dcavalieri@cgr.harvard.edu).
The program can be downloaded from the Web together with
the detailed User’s Instruction Manual.
Figure 7.6.6 The same screenshot as Figure 7.6.5, with the exception that the user has
dragged out the per-gene fold-changes. This black and white facsimile of the figure is intended
only as a placeholder; for the full-color version of the figure, go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
Files
data.txt file (see Basic Protocol)
1. After completing the Basic Protocol, place the data.txt file in the data subdirectory of the expression_mapper folder.
2. Go to the expression_mapper folder and double-click on the run.bat file.
3. Enter the KEGG map number of interest in the dialog box that appears, then click on
the Process Request button.
Be sure to enter a map number that exists. If unsure, check the Pathway column of the
pathway_summary_file.txt file (Table 7.6.2) for the KEGG map number of
greatest interest.
4. A window will come up that will look similar to Figure 7.6.5 (this is for KEGG map
number 10).
Figure 7.6.7 JPEG output file that is saved from Figure 7.6.6 when one closes the Map Manipulation Area window. This
black and white facsimile of the figure is intended only as a placeholder; for the full-color version of the figure, go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
Figure 7.6.8 A portion of the Map Manipulation Area window using the B. subtilis version
of the program. This black and white facsimile of the figure is intended only as a placeholder; for
the full-color version of the figure, go to http://www.interscience.wiley.com/c_p/colorfigures.htm.
The user will notice that the fold changes of some of the ORFs are listed. These are locations
where only one ORF is present. The ORFs are colored with shades of red if they are
up-regulated and shades of green if they are down-regulated. The lighter the shade, the
more up-regulated or down-regulated the gene is. The darker the shade, the less up-regulated or down-regulated the gene is. The ratios are transformed into fold changes and
written on the graph.
The user will notice that some locations contain gray boxes. These are locations where
more than one gene is located. To view these genes, click on the gray box
and drag it out (Fig. 7.6.6).
5. Finally, once satisfied with the way the pathway layout looks, one can save
the image by closing the window. By closing the window, an output file of the
corresponding map number will be created in the results folder under the
expression_mapper directory. The file name will be created from the
template[mapnumber] output.jpg. For instance, if KEGG map number 10
is used, the output file will be 10 output.jpg. The output of KEGG map 10 is
shown in Figure 7.6.7.
For B. subtilis, there will be boxes which are green; these boxes are prerendered green by
KEGG to indicate that they contain B. subtilis genes. When they are seen on screen, it means
that the data do not contain those genes since they are left green and not overwritten with
either a gray box or with a specific B. subtilis gene and its associated fold change. Figure
7.6.8 shows an example. In that figure, EC number 6.3.2.4 is left green and not overwritten.
Remember that the green in 6.3.2.4 does not necessarily mean that the gene is down-regulated.
COMPARATIVE VISUALIZATION OF PATHWAY ANALYSIS FROM
MULTIPLE EXPERIMENTS
ALTERNATE
PROTOCOL
It is possible to use Pathway Analyzer to perform pathway analysis across multiple
experiments. To do this, first run the Pathway Analyzer program on each experiment of
interest with the same cutoffs on each of them (see Basic Protocol). Next, take the Signed
Fisher Exact Test column of each experiment and place them into one Excel spreadsheet.
Everything can then be sorted by the most interesting experiment, such that the most
up-regulated pathways are at the top and most down-regulated pathways are at the bottom.
From this, one can make a contour plot.
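The spreadsheet step described above can also be scripted. The following is a minimal Python sketch, not part of Pathway Processor. It assumes each experiment's summary file is tab-delimited with columns headed Pathway and Signed Fisher Exact Test; the exact header strings, the function names, and the sample values are illustrative assumptions.

```python
import csv
import io

def read_signed_column(handle, name_col="Pathway",
                       value_col="Signed Fisher Exact Test"):
    """Return {pathway name: signed p value} from one tab-delimited summary file.

    The column names are assumptions; adjust them to the real file headers.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[name_col]: float(row[value_col]) for row in reader}

def combine_experiments(columns):
    """Merge per-experiment columns into rows of [pathway, value1, value2, ...].

    `columns` maps an experiment label to the dict returned by
    read_signed_column; a pathway missing from an experiment gets None.
    """
    labels = list(columns)
    pathways = []
    for label in labels:
        for name in columns[label]:
            if name not in pathways:
                pathways.append(name)
    header = ["Pathway"] + labels
    rows = [[name] + [columns[label].get(name) for label in labels]
            for name in pathways]
    return header, rows

# In-memory stand-ins for two summary files (made-up values, for illustration).
exp_a = io.StringIO("Pathway\tSigned Fisher Exact Test\nRibosome\t-0.001\nPurine\t0.5\n")
exp_b = io.StringIO("Pathway\tSigned Fisher Exact Test\nRibosome\t-0.2\nPurine\t0.03\n")
header, rows = combine_experiments({
    "9 hr": read_signed_column(exp_a),
    "21 hr": read_signed_column(exp_b),
})
```

With real data, each `io.StringIO` object would be replaced by an opened summary file, and the combined rows could be written back out as a tab-delimited file for plotting.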
Figure 7.6.9 shows an example of such a plot, performed on data from the paper on
time-course expression during yeast diauxic shift (DeRisi et al., 1997), showing only the
top 10 most up-regulated and down-regulated Signed Fisher Exact Test pathways. The
data were downloaded from the Pat Brown Laboratory Web site
(http://cmgm.stanford.edu/pbrown/explore/array.txt).
Figure 7.6.9 Surface graph obtained using Microsoft Excel to plot all the Signed Fisher Exact Test
column values of the different pathway_summary_file.txt files.
It is possible to preserve the sign and subtract the absolute value from 1, and then plot the
line plots in Microsoft Excel and get the result shown in Figure 7.6.10 for the top 10
up-regulated and down-regulated pathways. Figure 7.6.11 shows the figure modified from
the paper (DeRisi et al., 1997) itself. There is a very good correlation between the two
figures, indicating that Pathway Processor has automatically identified the more relevant
features of the process.
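The transformation described above (keep the sign, subtract the absolute value from 1) can be written in a few lines. This is a sketch with a hypothetical function name and made-up input values; it is not taken from Pathway Processor itself.

```python
import math

def signed_significance(signed_p):
    """Map a signed p value to sign * (1 - |p|).

    Small p values end up near +1 (up-regulated) or -1 (down-regulated),
    and non-significant values end up near 0.
    """
    return math.copysign(1.0 - abs(signed_p), signed_p)

# Made-up signed p values, for illustration only.
values = [0.002, -0.001, 0.5, -0.9]

# Sorting on the transformed value puts the most significant up-regulated
# entries first and the most significant down-regulated entries last.
ranked = sorted(values, key=signed_significance, reverse=True)
```

The same transformed value works as a sort key, which reproduces the ordering used in the summary table: most significant up-regulated pathways at the top, least significant in the middle, most significant down-regulated at the bottom.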
According to the researcher’s preferences, the results can be visualized with different
visualization programs. The starting point is always an Excel file containing the values
from the Signed Fisher Exact Test column from different experiments.
Figure 7.6.10 Time course of the experiment described in DeRisi et al. (1997). The figure reports
an XY (Scatter) graph using Microsoft Excel to plot all the Signed Fisher Exact Test column values
of the different pathway_summary_file.txt files. The p values have been adjusted to plot with
large numbers at low p values and vice versa to show that one can get the same result as the
original DeRisi figure. This black and white facsimile of the figure is intended only as a placeholder;
for the full-color version of the figure, go to http://www.interscience.wiley.com/c_p/colorfigures.htm.
Figure 7.6.11 Time course of the experiment described in DeRisi et al. (1997). The hours on the
horizontal bar indicate the time, during diauxic shift, at which the mRNA has been extracted. The
experiment compares differential expression at the indicated time with respect to a common reference.
This is a redrawing of the original figure appearing in DeRisi et al. (1997). This black and white
facsimile of the figure is intended only as a placeholder; for the full-color version of the figure, go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
To generate similar heatmaps and view them with Eisen’s clustering programs (Eisen
et al., 1998), one would go through the following steps:
1. Take the sorted file and convert every value which is greater than −0.01 and less than
0.01 to 0.01 with the appropriate sign.
2. Take the reciprocal of every value.
3. Rename the Pathway column to Name. The file would look similar in format to Table 7.6.4.
4. Save the file as a tab-delimited text file.
5. Open Mike Eisen’s TreeView program.
6. Go to the File menu and select Load.
7. Select the type of file to be text (*.TXT) and select the file. The result will look
similar to Figure 7.6.12.
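Steps 1 to 4 above can be sketched in Python as follows. The function names, the in-memory output, and the sample row are assumptions made for illustration; only the clamping to ±0.01, the reciprocal, the Name column, and the tab-delimited layout come from the steps themselves.

```python
import io
import math

def treeview_value(signed_p, floor=0.01):
    """Clamp |p| to at least `floor`, then take the reciprocal, keeping the sign.

    With floor=0.01 the result always lies between 1 and 100 in magnitude.
    """
    magnitude = max(abs(signed_p), floor)
    return math.copysign(1.0 / magnitude, signed_p)

def write_treeview_table(header, rows, handle):
    """Write a tab-delimited table whose first column is named 'Name'."""
    handle.write("\t".join(["Name"] + list(header[1:])) + "\n")
    for row in rows:
        cells = [format(treeview_value(v), ".7g") for v in row[1:]]
        handle.write("\t".join([row[0]] + cells) + "\n")

# Made-up example row; a real run would use the combined spreadsheet values.
out = io.StringIO()
write_treeview_table(["Pathway", "9 hr", "21 hr"],
                     [["Ribosome map 3010", 1.0, -0.0001]], out)
```

Passing an opened text file instead of `io.StringIO` would produce a file that can be loaded as described in steps 5 to 7.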
Table 7.6.4 Detail of the Visualization of the Results of the Comparison of the Seven Experiments in the Time Course
of the DeRisi et al. (1997) Experiment^a
Name
9 hr
11 hr
13 hr
15 hr
Ribosome map 3010
Purine metabolism map 230
1
1
1
1
1
100
Pyrimidine metabolism map 240
RNA polymerase map 3020
Aminoacyl-tRNA biosynthesis
map 970
Methionine metabolism map 271
Selenoamino acid metabolism
map 450
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
17 hr
19 hr
21 hr
1
−1.005134
−1.220797 10.70821
−100
−31.96548
−100
−100
1
1
1
−1.881824
1
1
−5.985468
−8.275396
−100
−100
−100
−100
1
1
−100
−100
−29.94031
−43.05598
−100
−75.23784
^a Hours in the top row indicate the time, during diauxic shift, at which the mRNA has been extracted; the experiment compares differential expression
at the indicated time with respect to a common reference.
Figure 7.6.12 A screenshot of Mike Eisen’s TreeView program using the reciprocally adjusted
Signed Fisher Exact Test values to show how one can quickly visualize the results of multi-experiment pathway analysis.
Data from different experiments analyzed using Pathway Analyzer can be visualized with
the open-source visualization software OpenDX (http://www.opendx.org). This visualization program allows an elegant and detailed examination of the expression levels
observed in the experiment, according to pathways.
Figure 7.6.13 Picture representing the down-regulated pathways in the diauxic shift (DeRisi et
al., 1997), with the Fisher Exact Test results visualized using OpenDX. The values of the Signed
Fisher Exact Test of the 21-hr data set have been sorted according to the value of the Fisher Exact
Test; the results of the other data sets for the affected pathways are also reported. The color of the
cube indicates the extent of the variation, according to the p values, with red being up-regulated,
green down-regulated, and yellow unchanged. The opacity visually represents the statistical
significance of the variation: the greater the opacity, the greater the significance of the p value. The
color of the cube depends on the p value in the following way: from 1 to 0.15 the color remains
yellow, from 0.15 to 0 with over-expression (+) it goes from yellow to red, and from 0.15 to 0 with
under-expression (−) it goes from yellow to green. This black and white facsimile of the figure is
intended only as a placeholder; for the full-color version of the figure, go to
http://www.interscience.wiley.com/c_p/colorfigures.htm.
The advantage of OpenDX is that it visualizes data in three dimensions. An example is
shown in Figure 7.6.13. The input of the program consists of three files: one with the
pathway names, another with the Signed Fisher Exact Test, and a third with the header
row. The program represents each value graphically as a cube. The color of the cube
indicates the extent of the variation, based on the magnitude of the p values and the sign,
with red being up-regulated, green down-regulated, and yellow no change. The correspondence
between the color of the cube and the p value can be modulated according to the
user’s preferences; the authors suggest that the visualization be tuned in the following
way: from 1 to 0.15, the color remains yellow; from 0.15 to 0 with over-expression (+),
it goes from yellow to red; from 0.15 to 0 with under-expression (−), it goes from yellow
to green. To allow the eye to focus on the most significant results, it is also suggested that
the opacity be changed so that the greater the significance of the variation, the greater the
opacity (Fig. 7.6.13). The use of the program is not intuitive, and its application to
visualization of microarray data classified using Pathway Processor needs some fine
tuning. A detailed description of OpenDX itself is beyond the scope of this manual; a
detailed description of the program and manuals on how to use OpenDX can be found at
http://www.opendx.org/support.html and http://www.opendx.org/index2.php. A book
with a more extensive tutorial can be found at http://www.vizsolutions.com/paths.html.
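The suggested color and opacity scheme can be expressed independently of OpenDX. The sketch below is plain Python and does not use any OpenDX interface. The 0.15 threshold and the yellow-to-red and yellow-to-green ranges come from the text; the linear interpolation between colors and the choice of opacity as 1 minus the p value are assumptions, as are the function names.

```python
import math

def cube_color(signed_p, threshold=0.15):
    """Return an (r, g, b) tuple in the range 0..1 for a signed p value.

    p from 1 down to `threshold`: yellow.
    p from `threshold` down to 0: yellow fades linearly (an assumption)
    to red for a positive sign or to green for a negative sign.
    """
    p = abs(signed_p)
    if p >= threshold:
        return (1.0, 1.0, 0.0)                # yellow: no significant change
    t = (threshold - p) / threshold           # 0 at the threshold, 1 at p = 0
    if math.copysign(1.0, signed_p) > 0:
        return (1.0, 1.0 - t, 0.0)            # toward red: up-regulated
    return (1.0 - t, 1.0, 0.0)                # toward green: down-regulated

def cube_opacity(signed_p):
    """Opacity grows with significance; 1 - |p| is one simple assumed choice."""
    return 1.0 - abs(signed_p)
```

For example, `cube_color(0.5)` gives yellow, while values close to 0 give nearly pure red or green depending on the sign.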
GUIDELINES FOR UNDERSTANDING RESULTS
Two tab-delimited text files are generated from the comparison files in Pathway Analyzer.
One, called gene_expression_pathway_summary_file.txt (Table 7.6.1),
contains all the genes that pass the cutoff, organized by pathway, and can be used to
retrieve lists of the genes with their fold changes, subdivided according to the KEGG
Pathway organization. The other, comb_pathway_summary_file.txt (Table
7.6.2), contains the summary of the statistics for each pathway, which can be imported
into Microsoft Excel to enable the user to sort the results by various columns, to
determine the effect on the different pathways, and to be used as input in different
visualization tools.
The Signed Fisher Exact Test column of comb_pathway_summary_file.txt
(Table 7.6.2) allows the sorting of up-regulated or down-regulated pathways. The value
in this column is composed of two distinct parts. The first part carries the signs + or −,
indicating whether the particular pathway contains genes that are up- or down-regulated.
The second part of each entry is a positive real number (between 0 and 1), corresponding
to the p value of the Fisher Exact Test for the pathway. The sign is calculated by subtracting
the mean relative expression of all genes that pass the cutoff and are in the pathway from
the mean relative expression of the genes that pass the cutoff and are not within the
pathway (up-regulation/down-regulation column, Table 7.6.2). If there are no genes above
the cutoff in a pathway, the sign is arbitrarily set to +. This step is done only for
convenience, as the p values for such pathways will always be non-significant.
Sorting for the Signed Fisher Exact Test is done so that the most significant values are at
the top of the column (Table 7.6.2) for the up-regulated pathways and at the bottom for
the down-regulated pathways. In the middle are the least significant pathways. The values
of the Fisher Exact Test vector can be used to compare different experiments using
Microsoft Excel (Table 7.6.2), and the comparison among the different experiments can
be represented graphically. Programs for the graphical representation range from Excel
to more sophisticated tools; one interesting option is OpenDX
(http://www.opendx.org), an open-source visualization software package (Fig. 7.6.13).
The resulting set of p values for all pathways is finally used to rank the pathways according
to the magnitude and direction of the effects. The Pathway Processor results from multiple
experiments can be compared, reducing the analysis from the full set of individual genes
to a limited number of pathways of interest. The probability that a given pathway is
affected is necessary to weigh the relative contribution of the biological process at work
to the phenotype studied.
COMMENTARY
Background Information
DNA microarrays provide a powerful technology for genomics research. The multistep,
data-intensive nature of this approach has created unprecedented challenges for the development of proper statistics and new bioinformatic
tools. It is of the greatest importance to integrate
information on the genomic scale with the biological information accumulated through years
of research on the molecular genetics, biochemistry, and physiology of the organisms that
researchers investigate.
A genomic approach for the understanding
of fundamental biological processes enables
the simultaneous study of expression patterns
of all genes for branch-point enzymes. Similarly, one can look for patterns of expression
variation in particular classes of genes, such as
those involved in metabolism, cytoskeleton,
cell-division control, apoptosis, membrane
transport, sexual reproduction, and so forth.
Interpreting such a huge amount of data requires a deep knowledge of metabolism and
cellular signaling pathways.
The availability of properly annotated pathway databases is one of the requirements for
analyzing microarray data in the context of
biological pathways. Efforts to establish proper
gene ontology (UNIT 7.2; Ashburner et al., 2000)
and pathway databases are continuing, and several resources are available publicly and commercially.
Efforts have also been made to integrate
functional genomic information into databases,
such as ArrayDB (Ermolaeva et al., 1998), SGD
(Ball et al., 2000; Ball et al., 2001), YPD, WormPD,
PombePD, and CalPD (Costanzo et al.,
2000), and KEGG (Nakao et al., 1999). The
ability to display information on pathway maps
is also extremely important. The Kyoto Encyclopedia
of Genes and Genomes (KEGG;
Kanehisa et al., 2002), the Alliance for Cellular
Signalling, BioCarta, EcoCyc (Karp et al.,
2002a), MetaCyC (Karp et al., 2002b),
PathDB, and MIPS all organize existing metabolic information in easily accessible pathway
maps. Pathway databases will become more
useful as a unique and detailed annotation for
all the genes in the sequenced genomes becomes available. In this respect, the situation
for yeast contrasts with that for human, mouse,
and rat, for which the systematic and detailed
annotation and description of open reading
frame (ORF) function is still in progress.
The visualization of expression data on cellular process charts is also important. Many
authors have manually mapped transcriptional
changes to metabolic charts, and others have
developed automatic methods to assign genes
showing expression variation to functional
categories, focusing on single pathways.
KEGG, MetaCyC, and EcoCyC display expression data from some experiments on their
maps. Some commercial microarray analysis
packages, such as Rosetta Resolver (Rosetta
Biosoftware) have also integrated a feature enabling the display of expression of a given gene
in the context of a metabolic map.
MAPPFinder and GenMAPP (http://www.genmapp.org/; UNIT 7.5) are recently developed
tools allowing the display of expression results
on metabolic or cellular charts (Doniger et al.,
2003). The program represents the pathways in
a special file format called MAPPs, independent of the gene expression data, and enables the grouping of genes by any organizing
principle. The novel feature is that it both allows
visualization of networks according to the
structure reported in the current pathway databases and also provides the ability to modify
pathways ad hoc; it also makes it possible to
design custom maps and exchange pathway-related data among investigators. The map contains hyperlinks to the expression information
and to the information on every gene available
on public databases. Still, the program is limited because the information provided on the
map is not quantitative but only qualitative and
based on color coding, with a repressed gene
or an activated gene reported, respectively, as
green or red. This does not automatically indicate the pathways of greatest interest. Microarray papers tend to discuss the fact that a given
pathway is activated or repressed, based on the
number of genes activated or repressed, or just
on intuition, and too often the researcher finds
only what she or he expected, or already knew.
A recent paper from Castillo-Davis and
Hartl (2003) describes a method to assess the
significance of alteration in expression of diverse cellular pathways, taken from GO, MIPS,
or other sources. The program is extremely
useful but does not include a visualization tool
enabling one to map the results on pathway
charts, and does not address the problem arising
from applying hypergeometric distributions to
the analysis of highly interconnected pathways
with a high level of redundancy as defined
according to the GO terms.
Ideally, methods analyzing expression data
according to a pathway-based logic should give
an indication of the statistical significance of
the conclusions, provide a user-friendly interface, and be able to encompass the largest
number of possible interconnections between
genes, although the ability to separate larger
pathways into smaller independent subpathways is important when developing methods
assessing the analysis of the statistical significance of up- or down-regulation of a pathway.
The Pathway Analyzer algorithm
Pathway Analyzer implements a statistical
method in Java, automatically identifying
which metabolic pathways are most affected by
differences in gene expression observed in a
particular experiment. The method associates
an ORF with a given biochemical step according to the information contained in 92 pathway
files from KEGG (http://www.genome.ad.jp/kegg/). Pathway Analyzer scores KEGG biochemical pathways, measuring the probability
that the genes of a pathway are significantly
altered in a given experiment. KEGG has been
chosen for the concise and clear way in which
the genes are interconnected, and for its curators’ great effort in keeping the information up
to date.
In deriving scores for pathways, Pathway
Analyzer takes into account the following factors: (1) the number of ORFs whose expression
is altered in each pathway; (2) the total number
of ORFs contained in the pathway; and (3) the
proportion of the ORFs in the genome contained in a given pathway.
In the first step of the analysis, the user
specifies the magnitude of the difference in
ORF expression that should be regarded as
above background. The relative change in gene
expression is the multiplier by which the level
of expression of a particular ORF is increased
or decreased in an experiment. Hence, Pathway
Analyzer allows the study of differences
smaller than a given fold change, but which
affect a statistically significant number of ORFs
in a particular metabolic pathway. Consistent
differential expression of a number of ORFs in
the same pathway can have important biological implications; for example, it may indicate
the existence of a set of coordinately regulated
ORFs. The program uses the Fisher Exact Test
to calculate the probability that a difference in
ORF expression in each of the 92 pathways
could be due to chance. A statistically significant probability means that a particular pathway contains more affected ORFs than would
be expected by chance. The program allows the
user to choose different values of the Fisher
Exact Test.
Fisher Exact Test
The analysis performed with the Fisher Exact Test provides a quick and user-friendly way
of determining which pathways are the most
affected. The one-sided Fisher Exact Test calculates a p value, based on the number of genes
whose expression exceeds a user-specified cutoff in a given pathway. This p value is the
probability that by chance the pathway would
contain as many affected genes as or more
affected genes than actually observed, the null
hypothesis being that the relative changes in
gene expression of the genes in the pathway are
a random subset of those observed in the experiment as a whole.
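The one-sided p value described above is the upper tail of a hypergeometric distribution. The following Python sketch shows that calculation under the stated null hypothesis. The function and argument names are hypothetical, and the code illustrates the statistic rather than reproducing Pathway Analyzer's Java implementation.

```python
from math import comb

def fisher_one_sided(affected_in_pathway, pathway_size,
                     affected_total, genome_size):
    """Probability of seeing at least `affected_in_pathway` affected genes
    in a pathway of `pathway_size` genes, when `affected_total` of the
    `genome_size` genes pass the cutoff and the pathway is a random subset.
    """
    k, n, K, N = affected_in_pathway, pathway_size, affected_total, genome_size
    upper = min(n, K)                 # cannot exceed pathway size or affected total
    total = comb(N, n)                # number of ways to draw the pathway's genes
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy numbers: 10 genes, 5 affected, a 5-gene pathway with all 5 affected.
p_value = fisher_one_sided(5, 5, 5, 10)
```

In this toy case only one of the 252 possible 5-gene subsets contains all five affected genes, so the p value is 1/252. `math.comb` requires Python 3.8 or later.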
Literature Cited
Ashburner, M., Ball, C.A., Blake, J.A., Botstein,
D., Butler, H., Cherry, J.M., Davis, A.P.,
Dolinski, K., Dwight, S.S., Eppig, J.T., Harris,
M.A., Hill, D.P., Issel-Tarver, L., Kasarskis,
A., Lewis, S., Matese, J.C., Richardson, J.E.,
Ringwald, M., Rubin, G.M., and Sherlock, G.
2000. The Gene Ontology Consortium. Nat.
Genet. 25:25-29.
Ball, C.A., Dolinski, K., Dwight, S.S., Harris,
M.A., Issel-Tarver, L., Kasarskis, A., Scafe,
C.R., Sherlock, G., Binkley, G., Jin, H.,
Kaloper, M., Orr, S.D., Schroeder, M., Weng,
S., Zhu, Y., Botstein, D., and Cherry, J.M.
2000. Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res. 28:77-80.
Ball, C.A., Jin, H., Sherlock, G., Weng, S.,
Matese, J.C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S.S., Harris, M.A., Issel-Tarver, L., Schroeder, M., Botstein, D., and
Cherry, J.M. 2001. Saccharomyces genome
database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 29:80-81.
Castillo-Davis, C.I. and Hartl, D.L. 2003. GeneMerge: Post-genomic analysis, data mining,
and hypothesis testing. Bioinformatics
19:891-892.
Costanzo, M.C., Hogan, J.D., Cusick, M.E.,
Davis, B.P., Fancher, A.M., Hodges, P.E.,
Kondu, P., Lengieza, C., Lew-Smith, J.E.,
Lingner, C., Roberg-Perez, K.J., Tillberg, M.,
Brooks, J.E., and Garrels, J.I. 2000. The yeast
proteome database (YPD) and Caenorhabditis
elegans proteome database (WormPD): Comprehensive resources for the organization and
comparison of model organism protein information. Nucleic Acids Res. 28:73-76.
DeRisi, J.L., Iyer, V.R., and Brown, P.O. 1997.
Exploring the metabolic and genetic control of
gene expression on a genomic scale. Science
278:680-686.
Doniger, S.W., Salomonis, N., Dahlquist, K.D.,
Vranizan, K., Lawlor, S.C., and Conklin, B.R.
2003. MAPPFinder: Using gene ontology and
GenMAPP to create a global gene-expression
profile from microarray data. Genome Biol.
4:R7.
Eisen, M.B., Spellman, P.T., Brown, P.O., and
Botstein, D. 1998. Cluster analysis and display
of genome-wide expression patterns. Proc.
Natl. Acad. Sci. U.S.A. 95:14863-14868.
Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler,
G.D., Bittner, M.L., Chen, Y., Simon, R.,
Meltzer, P., Trent, J.M., and Boguski, M.S.
1998. Data management and analysis for gene
expression arrays. Nat. Genet. 20:19-23.
Grosu, P., Townsend, J.P., Hartl, D.L., and Cavalieri, D. 2002. Pathway processor: A tool for
integrating whole-genome expression results
into metabolic networks. Genome Res.
12:1121-1126.
Kanehisa, M., Goto, S., Kawashima, S., and
Nakaya, A. 2002. The KEGG databases at
GenomeNet. Nucleic Acids Res. 30:42-46.
Karp, P.D., Riley, M., Saier, M., Paulsen, I.T.,
Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., and Gama-Castro, S.
2002a. The EcoCyC database. Nucleic Acids
Res. 30:56-58.
Karp, P.D., Riley, M., Paley, S.M., and Pellegrini-Toole, A. 2002b. The MetaCyC database. Nucleic Acids Res. 30:59-61.
Nakao, M., Bono, H., Kawashima, S., Kamiya, T.,
Sato, K., Goto, S., and Kanehisa, M. 1999.
Genome-scale gene expression analysis and
pathway reconstruction in KEGG. Genome Inform. Ser. Workshop Genome Inform. 10:94-103.
Internet Resources
http://www.cgr.harvard.edu/cavalieri/pp.html
Duccio Cavalieri CGR Web site.
http://www.genome.ad.jp/kegg/
The Kyoto Encyclopedia of Genes and Genomes
(KEGG) home page.
http://www.proteome.com/databases/YPD/YPDsearch-quick.html
The yeast proteome database (YPD) home page.
http://www.ncgr.org/genex/
Gene X, Gene expression home page at the National
Center for Genome Resources.
http://cmgm.stanford.edu/pbrown/
The Pat Brown Laboratory Web site.
http://www.opendx.org
http://www.opendx.org/index2.php
The open-source visualization software, OpenDX,
along with manuals and tutorials.
http://genome-www.stanford.edu/Saccharomyces/
The Saccharomyces Genome database (SGD) home
page.
Contributed by Duccio Cavalieri and
Paul Grosu
Bauer Center for Genomics Research
Harvard University
Cambridge, Massachusetts
An Overview of Spotfire for
Gene-Expression Studies
Spotfire DecisionSite (Spotfire, Inc.;
http://www.spotfire.com) is a powerful data
mining and visualization application with use
in many disciplines. Modules are available for
use in various areas of research, e.g., support
of gene expression analysis, proteomics,
general statistical analysis, chemical lead
discovery analysis, and geology. Here the
focus is on Spotfire’s utility in analyzing
gene expression data obtained from DNA
microarray experiments (Spotfire DecisionSite
for Functional Genomics).
Since its advent in the middle of the last decade
(Schena et al., 1995; Kozal et al., 1996), DNA
microarray technology has revolutionized the
way gene expression is measured (Schena
et al., 1998). The ability to quantitatively measure the levels of expression of thousands of
genes in a single experiment is allowing investigators to make significant advances in the
fields of stress response, transcriptional analysis, disease detection and treatment, gene therapy, and many others (Iyer et al., 1999; Lee
et al., 2002; Yeoh et al., 2002; Cheok et al.,
2003). Analysis of microarray data can be
overwhelming and confusing, due to the enormous amount of data generated (Leung and
Cavalieri, 2003). In addition, because of technical considerations or genetic or biological
variability within the subjects, these measurements can be noisy and error prone (Smyth
et al., 2003). To obtain statistically meaningful results, one needs to replicate the experiments several times, adding to the dimensionality and scope of the problem (Kerr and
Churchill, 2001). With rapidly emerging microarray analysis methods, scientists are faced
with the challenge of incorporating new information into their analyses quickly and easily. Mining data of this magnitude requires
software-based solutions able to handle and
manipulate such data.
UNIT 7.7
DecisionSite for Functional Genomics
(henceforth referred to as Spotfire) is a solution for accessing, analyzing, and visualizing data. This platform combines state-of-the-art data access, gene expression analysis
methods, guided workflows, dynamic visualizations, and extensive computational tools. It
is a client-server based system that can be run
from a Unix-based server and can service either a single client using a single license or a
large group of users with an institution-wide
site license. The authors of this unit currently
use version 7.2 of this application. Spotfire is
designed to allow biologists with little or no
programming or statistical skills to transform
(UNIT 7.8), process, and analyze (UNIT 7.9) microarray data. Tools and Guides are the components of the Spotfire platform. Tools are the analytical access components added to the interface,
while Guides connect Tools together to initiate
suggested analysis paths.
Spotfire can extract data from a variety of
databases and data sources such as Oracle,
SQL server, Informix, Sybase, and networked
or even local drives. This ability to retrieve
data from multiple databases allows the application to function as a “virtual data warehouse”
and offers tremendous potential in integrating
data from various sources for data-mining purposes. Once the data have been extracted, Spotfire can be used to interactively query and filter
it using multiple overlapping criteria without
knowledge of a query language such as SQL.
Results of data manipulation can be instantly
visualized using a number of graphics options,
including two- and three-dimensional scatter
plots, bar charts, profiles, heat maps, Venn diagrams and others. Spotfire stores data internally in a proprietary file format allowing for
quick response times, and has a series of built-in heuristics and algorithms to perform most
of the basic tasks that a user with microarray
data would like to perform. These include importing data from two-color microarray experiments (here the focus is on data produced using the popular GenePix scanners from Axon,
Inc.; UNIT 7.8); importing data from Affymetrix
one-color microarray experiments (UNIT 7.8);
filtering and preprocessing to remove unreliable data (UNIT 7.8); log-transformation of array data (UNIT 7.8); scaling and normalization
of array data (UNIT 7.8); identification of differentially expressed genes using statistical tests
such as the t test/ANOVA algorithm; calculation of fold change as ratios of given signal values; and other calculations (UNIT 7.9). A
variety of clustering algorithms are available
in the package—e.g., K-means Clustering,
Contributed by Deepak Kaushal and Clayton W. Naeve
Current Protocols in Bioinformatics (2004) 7.7.1-7.7.21
Copyright © 2004 by John Wiley & Sons, Inc.
Supplement 6
Self-Organizing Maps (SOMs), Principal
Components Analysis (PCA), and Hierarchical Clustering (UNIT 7.9). It is also possible to
join expression data with gene annotations and
Gene Ontology (UNIT 7.2) descriptions using
the Web browser functionality within the DecisionSite Navigator (UNIT 7.9).
Finally, an application-programming interface
is available for Spotfire that allows one to customize the application. Spotfire DecisionSite
Developer offers the ability to add new microarray data sources, normalization methods,
and algorithms within a guided workflow that
can be rapidly deployed to all users. The platform can be updated and extended to incorporate both the expertise gained by the users
as well as the tremendous advances occurring
in the field of genomics on a daily basis. The
authors have been able to integrate S-Plus algorithms into Spotfire to enhance its capabilities.
This interface makes it possible for end-users
to write code for their own Guides or Tools
and incorporate them into the DecisionSite environment for everyday use.
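Two of the basic calculations listed above, log transformation and fold change as a ratio of signal values, can be illustrated with a minimal Python sketch. This is not Spotfire code: the function names and the signal values are hypothetical and serve only to show the arithmetic.

```python
import math

def log2_transform(signals):
    """Return log2 of each positive signal; non-positive values become None."""
    return [math.log2(s) if s > 0 else None for s in signals]

def fold_change(treated, control):
    """Fold change as the simple ratio treated/control (control must be nonzero)."""
    return treated / control

# Made-up signal values for one gene measured in two conditions.
control_signal = 250.0
treated_signal = 1000.0

ratio = fold_change(treated_signal, control_signal)
log_ratio = math.log2(ratio)
print(ratio, log_ratio)  # 4.0 2.0
```

On the log2 scale a fourfold increase becomes +2 and a fourfold decrease becomes -2, which is one reason log-transformed ratios are convenient to compare.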
NECESSARY REQUIREMENTS
FOR USING THE FUNCTIONAL
GENOMICS MODULE OF
SPOTFIRE
Hardware
The recommended minimal hardware requirements are modest.
For Windows systems: The software will
run on an Intel Pentium or equivalent with a
100 MHz processor, 64 Mb RAM, 20 Mb disk
space; VGA or better display, and a video display of 800 × 600 pixels. However, most microarray experiments yield large output files
and most experimental designs require several
data files to be analyzed simultaneously, so the
user will benefit from both a much higher RAM
and a significantly better processor speed.
For Macintosh: Requirements include
Macintosh PowerPC with 8 Mb of available
memory, 3Mb of free disk space, and 256 color
(or better) video display. A network interface
card (NIC) is required for network connections
to MetaFrame servers.
Software
For Windows systems: Windows 98 or
higher, Windows NT with service pack 4.0
or higher, Windows Millennium or Windows
2000.
A standard install of Microsoft Internet Explorer v. 5.0 to 6.0.
MDAC (Microsoft Data Access Components) versions 2.1 sp2 (2.1.2.4202.3) to
2.5 (2.50.4403.12).
A Web connection to the Spotfire server
(http://home.spotfire.net) or a local customer-specific Spotfire Server. A Web connection is
also required to take advantage of Web links
for the purpose of querying databases and Web
sites on the Internet using columns of data residing in Spotfire (UNIT 7.9).
Microsoft PowerPoint, Word and Excel are
required to take advantage of a number of features available within Spotfire related to export
of text results or visualizations (UNIT 7.9).
Spotfire (v. 6.2 or above) is required. An
evaluation copy of Spotfire DecisionSite for
Functional Genomics can be downloaded by
contacting the regional account manager at
Spotfire Inc. (http://www.spotfire.com). A single license can be purchased and is installed
through the Web (the client receives a link where
the software can be downloaded using a
vendor-supplied key). A site license is required
for multiple users within an institution to be
able to use the software and these users share
a local customer-specific Spotfire server. The
site license is typically installed on site by the
Spotfire support team. The vendor offers
academic pricing, which covers a one-year subscription for use of the software along with
Spotfire technical support. It is also possible
to download a software program called System
Checker from the same vendor that checks the
system on which Spotfire will be installed to
ensure compliance. Spotfire must be installed
under a Windows user account with full Administrator privileges. Antivirus software must
be disabled during installation, as well as while
downloading server-delivered Functional Genomics applications. In order to properly operate, Spotfire requires Microsoft Data Access
Components (MDAC) version 2.1 SP1 through
2.7 SP1.
For Macintosh systems: MacOS 7.5.3 or
later (including OS X) is required. In order
to operate Spotfire on Apple’s Macintosh system, users need to install and configure Citrix
ICA client (http://www.citrix.com) on the system. Open Transport TCP/IP Version 1.1.1 or
later is required.
Files
Spotfire (Functional Genomics module)
can import data in nearly any format, but the
focus here is on two popular microarray platforms, the commercial GeneChip microarray
data (Affymetrix, Inc.) and two-color spotted
microarray data produced using GenePix software (Axon, Inc.). Spotfire facilitates the
seamless import of Affymetrix output files
(.met) from Affymetrix MAS v. 4.0 or v. 5.0
software. The .met file is a tab-delimited text
file containing information about attributes
such as probe set level, gene expression levels (signal), and detection quality controls
(p value and Absence/Presence calls). In the
illustration below, MAS 5.0 .met files are
used as an example. Several types of spotted arrays and corresponding data types exist, including those from commercial vendors
(Agilent, Motorola, and Mergen) that supply
spotted microarrays for various organisms, and
those from facilities that manufacture their
own chips. Several different scanners and scanning software packages are available. One of
the more commonly used scanners is the Axon
GenePix (Axon, Inc.). GenePix data files are
in a tab-delimited text format (.gpr), which can
be directly imported into a Spotfire session.
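Because both file types are tab-delimited text, their general shape can be illustrated outside Spotfire with a short Python sketch. The sample table, its column names (Probe Set, Signal, Detection, Detection p-value), and the helper functions are assumptions made for illustration; they are not the exact headers written by MAS 5.0 or GenePix, and real .gpr files typically carry header records before the column titles that would have to be skipped first.

```python
import csv
import io

# Made-up stand-in for a tab-delimited expression file; the column names
# are illustrative assumptions, not the exact MAS 5.0 headers.
SAMPLE = (
    "Probe Set\tSignal\tDetection\tDetection p-value\n"
    "1000_at\t512.3\tP\t0.001\n"
    "1001_at\t18.7\tA\t0.420\n"
    "1002_f_at\t95.0\tM\t0.055\n"
)

def read_tab_delimited(handle):
    """Read a tab-delimited table into a list of dicts, one per record."""
    return list(csv.DictReader(handle, delimiter="\t"))

def present_calls(records, call_column="Detection"):
    """Keep only records whose detection call is 'P' (present)."""
    return [r for r in records if r[call_column] == "P"]

records = read_tab_delimited(io.StringIO(SAMPLE))
kept = present_calls(records)
print(len(records), [r["Probe Set"] for r in kept])  # 3 ['1000_at']
```

Values are read as strings; a real import step would also convert the signal and p-value columns to numbers.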
OVERVIEW OF SPOTFIRE
VISUALIZATION WINDOW
Upon launch, the Spotfire DecisionSite application window appears as illustrated below (Fig. 7.7.1). One can choose which
DecisionSite module to open by clicking on
the orange Navigate button. Upon selection of
the Functional Genomics module, a suite of
applications geared to microarray data analysis is made available, including database
access, data preprocessing, statistical analysis,
and domain-specific tasks.
The Spotfire application window contains
four functionally different areas (Fig. 7.7.1):
DecisionSite Navigator on the left side,
which is a Web browser;
A visualization window in the center containing one or more visualizations of the data
(the user needs to import data (UNIT 7.8) into a
Spotfire session for visualizations to appear);
Query devices on the top right, which allow
manipulation of the data (UNIT 7.8);
“Details-on-Demand” on the bottom right,
which allows cross-examination of selected
data (UNIT 7.8).
DecisionSite NAVIGATOR
DecisionSite Navigator is a Web browser
that is fully integrated into the Spotfire DecisionSite environment. The navigator is used
to connect to the Spotfire DecisionSite server,
providing access to several powerful tools,
guides and resources (Fig. 7.7.2).
The DecisionSite Navigator is opened by
default when the user launches Spotfire. The
user can open and close the Navigator by clicking the “bulls eye” button on the far left of
the toolbar. The Navigation toolbar is used to
navigate the content of the DecisionSite Navigator, and has the same basic features of any
popular Web browser such as Back, Forward,
Stop, Refresh, and Home buttons. The Navigator toolbar can be hidden or revealed by selecting Navigation toolbar from the View menu.
Figure 7.7.1 Various components of Spotfire DecisionSite.
Figure 7.7.2 Different components of the DecisionSite Navigator.
The DecisionSite Navigator contains three
different Panes, or windows, each with a different collection of hyperlinks:
1. The Guides pane, which contains stepby-step directions to perform selected tasks or
workflows, most useful for beginners. Guides
are a good way to learn the function of tools
that are used repetitively and that are needed
to achieve a specific goal.
2. The Tools pane, which contains direct
links to tools and functions that directly affect manipulation of microarray data. These
include tools for importing, pre-processing,
normalizing, clustering, and statistically validating the data. For further discussion of these
tools see UNIT 7.8 and UNIT 7.9. These tools
are arranged in a hierarchical manner in the
Navigator, depending on the type of tool. The
header for each level of hierarchy is boldfaced. Clicking on any header expands the
tree and reveals the tools underneath. Clicking again will collapse the tree. The following
tools are available in Spotfire DecisionSite for
Functional Genomics:
Portfolio (allows users to save subsets of
data for overlap and Venn-logic comparison);
Access (allows users to import multiple
files of Affymetrix and GenePix data with
Tools such as Import Affymetrix data, Import
GenePix data, Import GEML file, Celera discovery system, Add columns, and Weblinks);
Analysis (contains tools for actual manipulation of array data such as Normalization, Row Summarization, Pivot and Transpose data, Hierarchical clustering, K-means
clustering, Principal Components Analysis,
Treatment comparison with ANOVA, profile
search, and coincidence testing);
Reporting (allows users to export data and
visualizations to external software such as
PowerPoint or Word).
3. The Resources pane, which contains links to various information sources such as the “Functional
Genomics companion,” which maintains a collection of articles and Webcasts about Spotfire;
the Resources section also contains a detailed
user's manual.
VISUALIZATIONS
Visualizations are a key to the analytical
power of Spotfire (UNIT 7.9). Spotfire can display nine types of visualizations, each one providing a unique view of the data. Different
visualizations are linked and updated automatically when query devices are manipulated.
Visualizations allow high-dimension data to
be readily displayed and enhanced by manipulating values that control visual attributes
such as size, color, shape, rotation, and text
labels.
Some examples include:
2-D scatter plots
3-D scatter plots
Histograms
Bar charts
Profile charts
Line charts
Pie charts
Heat maps
Tables.
Figure 7.7.3 The Properties Dialog Box.
By setting properties such as color, shape,
and size, each visualization can be tailored to
personal taste and the specific task at hand.
New visualizations are created by clicking one
of the toolbar buttons, or by selecting a visualization from the Visualization menu. Visualization properties are controlled from the
Properties dialog (Fig. 7.7.3). Click the toolbar
button, or select Properties from the Edit menu
to go to this dialog.
Initial Visualization: 2-D Scatter Plot
Data are imported into Spotfire as a table
with rows and columns (Short/Wide format).
Each record (Spot or Probe set in microarray
data) is assigned a row in the table. Various
measurements made for every record are assigned columns in the table. When working
with a single data set, it is possible to look at
the expression behavior of a particular gene. It
is also possible to analyze several experiments
with the same number of records simultaneously.
When a data set is loaded in Spotfire, the program produces an initial default visualization
in the form of a two-dimensional scatter plot.
The columns to be displayed as x and y axes
are initially suggested by the program and may
not be particularly helpful. Any data column
that has been imported into the Spotfire session can, however, be used to display a scatter plot. Users can specify columns to be used
in the scatter plot by selecting the appropriate columns from the X- and Y-axis column
selectors (Fig. 7.7.4).
New two-dimensional scatter plots can be
created by clicking the “2-D” button on the
toolbar, by pressing Ctrl-1, or by selecting
New 2D Scatter Plot from the Visualization
menu (Fig. 7.7.4). The user can zoom in and
out of a scatter plot in many ways, e.g., using the zoom bars or mouse. Dragging the end
arrows of the zoom bars (along the edges of
the visualization window) zooms in on a portion of the visualization. Dragging the bar itself
(by placing the mouse pointer on the yellow bar
and dragging) pans across different areas of the
entire visualization. The pale yellow area represents the selected range of values, whereas
the bright yellow area represents the range of
existing values within the selected range. The
zoom bar can be adjusted to encompass only
the currently selected data.
The user can set the Data Range to Selected
records by:
Moving the drag box of the zoom bar to
narrow the selection;
Right-clicking on the zoom bar; and then
Choosing Select Data Range from Zooming
from the pop-up menu.
The zoom bar expands to its full width, but
with Data Range set to encompass only the
selected records. Three dots are displayed to
indicate that the range is not the original full
range.
Figure 7.7.4 Various features of the Scatter plot visualization in Spotfire.
To reset the Data Range, right-click the
zoom bar and select Reset Data Range.
Coloring plots
Markers can be colored to reflect the value
of a particular attribute. There are three modes
for coloring: Fixed, Continuous and Categorical (Fig. 7.7.3).
Fixed coloring means that all markers are
the same color (except deselected, empty and
marked).
Categorical coloring means that each value
in the chosen column is given its own color.
Categorical coloring makes most sense if there
are fewer than ten unique values. To control
which color is assigned to each value, click
Customize.
Continuous coloring means that the maximum and minimum values in the selected column are each assigned a color. Intermediate
values are then assigned colors on a scale ranging between the two extreme colors. In scatter
plots, any column can be used for continuous
coloring. Colors representing minimum and
maximum values are set with the Customize
dialog. Begin and End categories define the
color limits. When one of the categories is selected, one can choose which color will represent that end of the value range. A line with the
color scale is displayed below the corresponding query device.
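The categorical and continuous coloring modes described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Spotfire implements it: the function names, the RGB colors, and the palette are hypothetical, and the continuous scale is assumed to be a simple linear interpolation between the Begin and End colors.

```python
def continuous_color(value, vmin, vmax, begin_rgb, end_rgb):
    """Linearly interpolate between two RGB colors for a value in [vmin, vmax]."""
    if vmax == vmin:
        return begin_rgb
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside the range
    return tuple(round(b + t * (e - b)) for b, e in zip(begin_rgb, end_rgb))

def categorical_colors(values, palette):
    """Assign one palette color per unique value, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = palette[len(mapping) % len(palette)]
    return mapping

BEGIN, END = (0, 0, 200), (200, 0, 0)  # hypothetical Begin and End colors
print(continuous_color(5.0, 0.0, 10.0, BEGIN, END))  # (100, 0, 100)
print(categorical_colors(["P", "A", "P", "M"], ["red", "green", "blue"]))
# {'P': 'red', 'A': 'green', 'M': 'blue'}
```

The palette wraps around when there are more unique values than colors, which illustrates why categorical coloring is most readable when the number of unique values is small.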
By default, deselected records (i.e., records
that have been filtered out using the query devices)
are invisible. It is possible to keep them
visible, but colored differently. Check the box
labeled “Show deselected.” Set the color by
pressing Customize (Fig. 7.7.5).
Regardless of coloring mode, the choice of
colors can be controlled by clicking Customize
on the Markers tab of the Properties dialog.
Depending on the current coloring mode, the
top-most list will display the fixed color, Begin and End colors (continuous mode), or the
color associated with each category (Categorical mode). The other list displays colors associated with deselected, empty, and marked
records (“Empty” refers to records for which
no value is specified in the column used for coloring). To change a color, click the category to
modify, then click a color in the palette. To revert to default coloring, click Default Colors.
To select a color from the complete palette,
click “Other. . .”.
Marker size and other properties
The size of markers can be made to reflect
the value of a particular column. Select a column from the drop-down list under Size. Moving the slider changes the size of all markers,
while maintaining the size ratio of different
markers. It is possible to customize the size of
different markers.
The shape of markers can be fixed, or made
to reflect the value of a particular column. Click
Fixed or By to alternate between these modes.
Only columns with fewer than 16 distinct values
can be used for controlling shapes. Click Customize to choose appropriate shapes for each
value (Fig. 7.7.6).
Figure 7.7.5 The Customize Colors window for scatter plot (and other visualizations).
Figure 7.7.6 The Customize Shapes window for scatter plot.
It is possible to customize the size of the
markers in a scatter plot. This overrides the
usual size slider in the properties Marker tab.
To customize the marker size, select a value,
then select a shape for that value. Next, check
the “Specify size” check box. Enter Width and
Height. These values are relative to the scale
used in the current visualization. Look at the
scale used in the current visualization and determine how large the markers are to be.
It is also possible to specify the order in
which the markers of a scatter plot are drawn,
to use rotation of markers to reflect the value of
a column, and to tag each marker with a label
showing the value of a particular column, using
one of the following options:
“None”: no labels are visible;
“Marked records”: only records that are
marked will have labels next to them (maximum 1000 records);
“All records, max”: all records (up to a configurable maximum number) will have labels
next to them.
Figure 7.7.7 The 3D Scatter Plot visualization in Spotfire. This black and white facsimile
of the figure is intended only as a placeholder; for full-color version of figure go to
http://www.interscience.wiley.com/c p/colorfigures.htm.
3-D Scatter Plots
3-D scatter plots allow even more information to be encoded into visualizations. They are
especially useful when analyzing data that is
not clustered along any of the axes (columns)
of the data set.
A new 3-D scatter plot is created in one
of the following ways: click the “3D” button on the toolbar, press Ctrl-2, or select
New 3D Scatter Plot from the Visualization
menu (Fig. 7.7.7). While navigating 3-D visualizations, the zoom bars are used as in
2-D. Additionally, holding down Ctrl and
dragging the right mouse button allows users
to rotate the graph, while holding down Shift
and dragging using the right mouse button allows zooming. 3-D scatter plots are ideal visualizations for viewing results of Principal
Components Analysis (UNIT 7.9).
Histograms and Bar Charts
Histograms and bar charts can effectively
analyze very large data sets. New Histograms
are made in one of the following ways: click
the Histogram button on the toolbar, press
Ctrl-3, or select New Histogram from the Visualization menu.
Bar charts are created in one of the following ways: click the Bar Chart button on
the toolbar, press Ctrl-4, or select New Bar
Chart from the Visualization menu. In traditional bar charts, the height of the bar is the
sum of the values of the records in a certain column. In histogram-type visualizations,
heights of bars are proportional to the number of records specified by the “X axis” column. The attributes of both histograms and bar
charts can be altered from the Bars tab in the
Properties dialog.
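The difference between the two bar heights (a sum of values in a traditional bar chart, a count of records in a histogram-type visualization) can be made concrete with a small Python sketch. The records, column names, and function names below are made up for illustration and are not part of Spotfire.

```python
from collections import defaultdict

def bar_chart_heights(records, category_col, value_col):
    """Traditional bar chart: sum of value_col for each category."""
    heights = defaultdict(float)
    for r in records:
        heights[r[category_col]] += r[value_col]
    return dict(heights)

def histogram_heights(records, category_col):
    """Histogram-type chart: number of records for each X-axis value."""
    counts = defaultdict(int)
    for r in records:
        counts[r[category_col]] += 1
    return dict(counts)

# Made-up records: a detection call and a signal value per probe set.
records = [
    {"call": "P", "signal": 500.0},
    {"call": "P", "signal": 250.0},
    {"call": "A", "signal": 20.0},
]
print(bar_chart_heights(records, "call", "signal"))  # {'P': 750.0, 'A': 20.0}
print(histogram_heights(records, "call"))            # {'P': 2, 'A': 1}
```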
Profile Charts
A profile chart maps each record as a line, or
profile. Each attribute of a record is represented
by a point on the line. This makes profile charts
similar in appearance to line charts, but the way
data are translated into a plot is substantially
different. Profile charts are an ideal visualization for t test/ANOVA calculations (UNIT 7.9),
as they provide a good (if somewhat simplified) overview of characteristics. To create a
profile chart, press Ctrl-7, click the New Profile Chart button on the toolbar, or select New
Profile Chart (Fig. 7.7.8) from the Visualization menu. Next, go to the axis selector of the
x axis and uncheck solitary columns that are
not to be included in the chart, such as identifier columns, or go to the Profile Columns
tab of the Properties dialog to change multiple
columns. The Properties dialog can be used to
adjust the various properties of the chart.
Figure 7.7.8 The Profile Chart visualization in Spotfire. This black and white
facsimile of the figure is intended only as a placeholder; for full-color version of figure go
to http://www.interscience.wiley.com/c p/colorfigures.htm.
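As a rough illustration of how one record becomes one profile line, the following Python sketch turns a single row into a sequence of points while leaving out an identifier column. The record, its column names, and the helper function are hypothetical.

```python
def record_profile(record, profile_columns):
    """Return the (column, value) points that make up one record's profile line."""
    return [(col, record[col]) for col in profile_columns]

# One made-up record with an identifier column and three measurements.
record = {"gene": "geneA", "t0": 1.0, "t1": 2.5, "t2": 0.5}

# The identifier column "gene" is excluded, as one would uncheck it in the axis selector.
profile = record_profile(record, ["t0", "t1", "t2"])
print(profile)  # [('t0', 1.0), ('t1', 2.5), ('t2', 0.5)]
```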
Heat Map Plots
Heat Map plots are also known as Intensity plots or Matrix plots. A Heat Map can be
likened to a spreadsheet, where the values in
the cells are represented by colors instead of
numbers. More specifically, a Heat Map is a
type of plot in which the pivoted (short/wide)
data are presented as a matrix of rows and
columns, where the cells are of equal size and
the information represented by the color of the
cells is the most important property. Heat Maps
can be used to identify clusters of records with
similar values, as these are displayed as “areas”
of similar color. New Heat Maps are created
in one of the following ways: click the Heat
Map button on the toolbar, press Ctrl-8, or
select New Heat Map from the Visualization
menu (Fig. 7.7.9). Heat Maps are controlled
from two tabs in the Properties dialog: the Heat
Map Columns tab, where one selects which
columns are to be included in the visualization,
and the Colors tab, where one can customize
the coloring of the Heat Map.
The Heat Map Columns tab of the Properties dialog is used to organize the columns
in the heat map. Fig. 7.7.10 shows the Properties dialog with the Heat Map Columns tab
selected. The list on the right-hand side shows
the columns that are included in the visualization, while the one on the left shows those that
are not. Use the Add and Remove buttons to move columns between the two lists,
or click Remove All to remove all columns
from the “Columns in heat map” list. The list
of available columns can be sorted by checking
the box labeled “List columns alphabetically,”
or by clicking the Column field in the column
heading.
The Colors tab of the Properties dialog is
used to modify the color range of the heat
map. The default color range is set to green for
minimum values, black for intermediate values, and red for maximum values. To apply a
specific color range to one or more columns,
select the appropriate column(s) from the list,
then choose a range from the “Color range”
drop-down list box, and finally click the Apply to column(s) button. Use Shift or Ctrl
to select several columns at a time.
Figure 7.7.9 The Heat Map visualization in Spotfire. This black and white facsimile
of the figure is intended only as a placeholder; for full-color version of figure go to
http://www.interscience.wiley.com/c p/colorfigures.htm.
Figure 7.7.10 Heat Map Properties dialog box.
Figure 7.7.11 The Edit Color Range dialog box allows users to choose the colors for their heat
map visualization. This black and white facsimile of the figure is intended only as a placeholder;
for full-color version of figure go to http://www.interscience.wiley.com/c p/colorfigures.htm.
To change the color range of one or more
columns, it is necessary to create a new range.
Click on the New button to open the Create
New Color Range dialog. Enter a new range
name in the text field at the top, then select
Categorical Coloring or Continuous Coloring
(Fig. 7.7.11). Depending on which radio button is selected, the lower part of the window
will change to show the relevant suboptions.
Categorical Coloring means that each unique
value in the heat map is represented by its own
color. This is most useful when dealing with
a smaller number of varying values or when
looking for identical values in a heat map.
If one has a certain value and one wishes to
change its color, select that value from the list,
and then choose a new color for it from the
palette. Continuous Coloring means that the
color range is linear from one specific color to
another color, via a third middle color. By default, this is set to show low values in shades of
green, intermediate values going toward black,
and high values in shades of red. Select new
colors to represent the Min, Mid, or Max values
by clicking on their corresponding color button and picking a new color from the palette
that appears. Continuous Coloring is further
divided into three sub-options: “Shared custom range” means that it is possible to specify an
exact Min, Mid, and Max value for the color
range, instead of these values being automatically determined. “Shared” means that all selected columns will be colored according to
these values regardless of their own individual
Min and Max values; “Shared auto range”
means that the Min, Mid, and Max value for
the range is automatically set to the lowest,
middle, and highest value that exists in all the
selected columns. It also means that all selected columns will be colored according to
these values regardless of their own individual
Min and Max values. “Individual auto range”
means that the Min, Mid, and Max values for
the range are automatically set to the lowest,
middle, and highest values, respectively, that
exist in each individual column. This means
that all selected columns will be colored according to their own individual Min and Max
values.
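The three Continuous Coloring sub-options differ only in where the Min, Mid, and Max values come from, which the following Python sketch tries to make explicit. It is a simplified model rather than Spotfire's implementation: the function names and sample columns are hypothetical, and Mid is assumed here to be the midpoint between Min and Max, which the text does not state.

```python
def auto_range(values):
    """Min, Mid, Max of a list; Mid is ASSUMED to be the midpoint of Min and Max."""
    lo, hi = min(values), max(values)
    return lo, (lo + hi) / 2.0, hi

def column_ranges(columns, mode, custom=None):
    """Map each column name to the (Min, Mid, Max) used to color it."""
    if mode == "shared custom":
        return {name: custom for name in columns}
    if mode == "shared auto":
        shared = auto_range([v for vals in columns.values() for v in vals])
        return {name: shared for name in columns}
    if mode == "individual auto":
        return {name: auto_range(vals) for name, vals in columns.items()}
    raise ValueError(mode)

def three_point_color(value, vmin, vmid, vmax, c_min, c_mid, c_max):
    """Interpolate Min -> Mid -> Max colors piecewise linearly."""
    if value <= vmid:
        lo, hi, a, b = vmin, vmid, c_min, c_mid
    else:
        lo, hi, a, b = vmid, vmax, c_mid, c_max
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    return tuple(round(x + t * (y - x)) for x, y in zip(a, b))

GREEN, BLACK, RED = (0, 255, 0), (0, 0, 0), (255, 0, 0)
columns = {"array1": [-2.0, 0.0, 2.0], "array2": [-1.0, 0.0, 4.0]}

shared = column_ranges(columns, "shared auto")
individual = column_ranges(columns, "individual auto")
print(shared["array1"])      # (-2.0, 1.0, 4.0)
print(individual["array1"])  # (-2.0, 0.0, 2.0)
print(three_point_color(-2.0, -2.0, 0.0, 2.0, GREEN, BLACK, RED))  # (0, 255, 0)
```

With a shared range the same value gets the same color in every column, whereas with individual ranges each column uses its full color scale even if its values span a narrower interval.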
Making a record active or marking several
records in a heat map plot differs somewhat
from the method used with other plots. In a
heat map, one row always equals one record.
Consequently, one always selects or marks one
or more entire rows, equaling one or many
records. When one clicks on a row, a black
triangle appears at both ends of the selected
row to indicate that it is active. Information
about the row is displayed in the Details-on-Demand window. By clicking and holding the
mouse button while the mouse pointer is on
a row and dragging it to cover several rows,
these rows all become marked. This is indicated by a small bar shown at the left and
right of the rows in question. Details on these
records are shown in the Details-on-Demand
window.
Tables
The Table visualization presents the data as
a table of rows and columns. The Table can
handle the same number of rows and columns
as any other visualization in DecisionSite. In
the Table, a row represents a record. By clicking on a row, that record is made active, and by
holding down the mouse button and dragging
the pointer over several rows, it is possible to
mark them. One can sort the rows in the
table according to different columns by clicking on the column headers, or filter out unwanted records by using the query devices. To
create a Table, press Ctrl-9, click the New
Table button on the toolbar, or select New
Table (Fig. 7.7.12) from the Visualization
menu. One can then click on the header of the
column by which the rows are to be sorted, or
rearrange the order of the columns by dragging
and dropping the column headers horizontally.
Use the Properties dialog to further adjust the
various properties of the chart.
Annotating Visualizations And
Changing Visualization Columns
It is possible to give any visualization a title
and an annotation. The title will appear as the
caption of the window. It can also appear in
the heading of printouts. The annotation will
appear as a tool tip when the mouse pointer
is placed over the paper clip at the bottom-left corner of the visualization. To set title and
annotation:
1. Go to the Annotations tab of the Properties dialog (Fig. 7.7.13);
2. Enter a title and/or an annotation;
3. Check “Append axes name to visualization title” if the current axes are to be appended
to the title.
One can type in a great deal of text in the
Annotation field, as well as cut and paste to
and from other Windows applications.
Which and how many columns should be
included in a visualization can be controlled
by the Properties dialog in most visualizations
(Fig. 7.7.14). In scatter plot, the Columns tab
houses this information. One or more columns
can be selected by clicking (or Ctrl-clicking);
the selected columns can then be deleted,
moved up or down, or renamed, or their scale
can be reset.
Figure 7.7.12 The Table visualization allows users to view data in a sortable spreadsheet
format. Like other visualizations, Table is also dynamically linked to the query devices and to
other visualizations.
Figure 7.7.13 Annotations can be appended to most visualizations (example shown here with
the scatter plot) through the Properties dialog box.
Figure 7.7.14 The number and type of columns in a scatter plot can be controlled via the Columns
tab in the Properties dialog.
Handling Multiple Visualizations
Users will often need to work simultaneously with multiple visualizations in Spotfire.
Bar charts and histograms are powerful tools
for analyzing aggregate data, while scatter
plots can reveal trends and correlation. Specific
Tools like hierarchical clustering (UNIT 7.9) will
generate specific visualizations such as heat
maps.
Spotfire DecisionSite is able to show multiple visualizations, each one as a window
presenting the same data, but in different ways.
The visualizations may have dissimilar coloring or axes—or even be of different types—one
a 3-D scatter plot, another a bar chart. Each visualization can fill the entire window, all can
be seen simultaneously, or each can reside on
its own tab of a workbook. When operating
the query devices, all visualizations are simultaneously updated, showing alterations in all
visualizations when a factor is changed. When
a marker is highlighted, it is highlighted in all
visualizations simultaneously.
New visualizations are created by selecting
the appropriate icon or a command from the
Visualization menu, or by using the keyboard
shortcuts shown in the same menu. There are
several ways to reposition windows; the commands governing these functions all reside in
the Window menu:
Auto Hide Axis Selectors: When the visualization is small enough, this option automatically hides the zoom bars and the axis
selectors.
Hide Window Frame: Hides the title bar
giving more space to the visualizations; this
option is only available when several visualizations have been tiled.
Auto Tile: Arranges all the windows on
screen according to an internal algorithm. The
active visualization will be given leftmost, uppermost, and size priority (Fig. 7.7.15).
Cascade: Arranges the visualization windows so that they partially overlap each other,
leaving each window accessible by clicking on
the title bar.
Tile Horizontal: Splits the window area horizontally according to the number of visualizations, giving each visualization equal area.
Tile Vertical: Splits the window area vertically according to the number of visualizations, giving each visualization equal
area.
Figure 7.7.15 The auto-tile feature allows all the visualizations present in a particular Spotfire
session to be viewed at once.
QUERY DEVICES
Spotfire DecisionSite automatically creates
query devices when a data set is loaded. One
device is created for each column of data. This
section describes query devices within Spotfire, and how they can be used. A Spotfire
DecisionSite query device is a visual tool for
performing dynamic queries against an underlying data set.
Query Devices are used to filter microarray data without the need to know Structured
Query Language (SQL). Query devices include sliders, check boxes, or other graphic
controls used to filter the data shown in the visualization, as described in the following paragraphs. A query device is always associated
with a specific data column (Fig. 7.7.16).
Range Slider
A range slider is used to select records with
values in a certain range. The left and right
drag box can be used to change the lower and
upper limit of the range—meaning that only
records with values within the chosen range
are selected and are therefore visible in the visualization. Labels above the slider indicate
the selected span. The range can also be adjusted with the arrow keys when the query device is active: left and right arrows move the
lower limit (left drag box), and up and down
arrow keys move the upper limit. A minimum
or maximum value can be typed into a range
slider. The user can double-click on the minimum or maximum number above the drag box
and then enter the desired value in the edit
field. Alternatively, one can click on the left
or right drag box. No edit field will appear, but
by simply typing the desired value, the slider
will adjust.
The currently selected interval of the
range slider can be grabbed and moved
to pan the selected range; this provides a
powerful way of sweeping over different
“slices” of a data set. Click and drag the yellow
portion of the range slider to do this. Observing the reactions of the other sliders to such a
sweep can give some interesting clues to correlation between parameters in the data set.
If other query devices impose further restrictions, then the result may be that parts of the
interval of the range slider are unpopulated.
This area is indicated with a pale yellow color,
as opposed to the bright yellow color that indicates the populated interval.
An important feature of the range slider is
that the values are distributed on a linear scale
according to the values of the data. Therefore,
if values are unevenly distributed, this will be
reflected in the range slider. This is not the case
with item sliders, where values are evenly distributed along the range of the slider, regardless
of what values appear in the column. The range
slider can be set to span the current selection
of data by double-clicking at the center of the
range slider.
Figure 7.7.16 Various types of query devices are assigned to different data columns.
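The behavior of a range slider can be pictured as a simple filter over one column. The following Python sketch is only an illustration of that idea, not Spotfire code: the record layout, the column names, and the choice of inclusive limits are all assumptions made for the example.

```python
def range_filter(records, column, low, high):
    """Keep records whose value in `column` lies within [low, high].
    Inclusive limits are an assumption made for this sketch."""
    return [r for r in records if low <= r[column] <= high]


def pan(low, high, shift):
    """Move the selected interval without changing its width,
    as when the yellow portion of the slider is dragged."""
    return low + shift, high + shift


def populated_interval(records, column):
    """Smallest and largest values still visible in `column`
    (the 'bright yellow' part of the slider), or None if empty."""
    values = [r[column] for r in records]
    return (min(values), max(values)) if values else None


# Hypothetical records; the names and numbers are made up.
records = [
    {"gene": "g1", "signal": 120.0, "p_value": 0.01},
    {"gene": "g2", "signal": 450.0, "p_value": 0.20},
    {"gene": "g3", "signal": 800.0, "p_value": 0.03},
    {"gene": "g4", "signal": 1500.0, "p_value": 0.04},
]

# Two query devices act together: a record must pass both to stay visible.
by_signal = range_filter(records, "signal", 100, 900)
visible = range_filter(by_signal, "p_value", 0.0, 0.05)

# Sweeping the signal slider to a different slice of the data.
new_low, new_high = pan(100, 900, 600)
swept = range_filter(records, "signal", new_low, new_high)
```

Because each query device only narrows the set passed on by the others, the populated interval of one slider can shrink when another slider is moved, which is the effect described above.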
Item Slider
An item slider is used to select individual
values in a column. In an item slider query
device, data items are evenly distributed on
a continuous linear scale, but only a single
item is selected at a time. The
selected value is displayed as a label above the
slider. As a special case, when the drag box is
moved to its leftmost position, all values for the
slider are selected, as indicated by the label (All)
above the slider. A specific value can be typed into
an item slider, either by double-clicking on the number above the
slider and then typing the desired value, or by
clicking on the drag box itself and then typing the desired value. In the latter case no edit field
appears; simply type the value after clicking, and the item
slider will adjust itself to the nearest available value. The scope of an item
slider depends on the settings of other
query devices, so the item slider
range changes as one manipulates
the query devices. When the input focus is set
on the slider (marked by a dotted line), the
arrow keys on the keyboard can be used to adjust the slider to the exact position of an entry:
up and right arrows move to the next record,
down and left to the previous one.
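The two item slider behaviors described above (snapping to the nearest available value, and stepping with the arrow keys) can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, ties are resolved toward the lower value, and stepping simply stops at either end, whereas in Spotfire the leftmost position selects (All).

```python
def snap_to_nearest(values, typed):
    """Return the available value closest to the number typed.
    On a tie the lower value wins (an arbitrary choice for this sketch)."""
    unique = sorted(set(values))
    return min(unique, key=lambda v: abs(v - typed))


def step(values, current, direction):
    """Move to the next (+1) or previous (-1) distinct value,
    stopping at the ends of the list."""
    unique = sorted(set(values))
    i = unique.index(current) + direction
    return unique[max(0, min(i, len(unique) - 1))]


# Hypothetical column with unevenly spaced values.
column = [1, 2, 10, 1000, 10, 2]

nearest = snap_to_nearest(column, 7)   # 10 is closer to 7 than 2 is
after_right_arrow = step(column, 10, +1)
after_left_arrow = step(column, 1, -1)
```

Note that stepping moves between neighboring entries regardless of how far apart their values are, which reflects the even spacing of items on this type of slider.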
Check Boxes
Check boxes, one for each value appearing
in the corresponding column in the data set,
are used to select or deselect the values for appearance in the visualization. Check boxes are
typically used when the record field holds just
a few distinct values. In a check box query
device, each unique value is represented by a
check box, which is either checked (selected in
the local context) or unchecked (deselected).
If all records with a certain value are deselected by some other query device, the label of
that value becomes red. Coloring is set to categorical; ticking a check box causes all records
of that particular color to show (unless they
are deselected by another query device). By
default, Spotfire DecisionSite assigns check
boxes to any column containing ten or fewer
values. Initially, all boxes are checked, which
makes all records in the data set visible. For
quick checking or unchecking of all the values, right-click on the check boxes query device
and select All or None from the pop-up menu.
Like radio buttons, check boxes provide options that are either On or Off. Check boxes differ from radio buttons in that check boxes are
typically used for independent or nonexclusive
choices.
Radio Buttons
Radio buttons are similar to check boxes,
but enable only one choice among the alternatives. In a radio button query device, a radio button represents each unique value. Radio buttons, also referred to as option buttons,
represent a single choice within a limited set
of mutually exclusive choices; that is, in any
group of option buttons, only one option
can be set. However, an All option is always present among the radio buttons, which
makes it possible to select all the records in that
column. If all records with a certain value are
deselected (by this or some other query device),
the label of that value becomes red.
Full Text Search
Full text search query devices permit searching
for a specific string of alphanumeric
characters, optionally combined with Boolean operators.
The device allows users
to search for (sub)strings within columns. It
also allows one to search for a pattern by using regular expressions. For example, one can
enter a pattern that means “a letter followed
by two digits.” Alternatively, users can search
for strings that do not contain regular expressions
in normal-text search. The search can be made
as complex as desired by use of the logical operators AND (&) and OR (blank space). Search
expressions are evaluated from left to right.
Once the search string has been entered, pressing the Enter key on the keyboard executes the search. All records matching the
search criteria will be shown in the visualization window. The full-text search query
device also supports Cut/Copy/Paste of text
strings using the Ctrl-X, Ctrl-C, and Ctrl-V
keystroke combinations.
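As an illustration of how such a search expression might be read, the Python sketch below treats & as AND and a blank space as OR, and combines terms strictly from left to right with no operator precedence. That reading, the case-insensitive substring matching, and the function name are assumptions made for the example; the regular expression is written in Python syntax, which may differ from the dialect Spotfire accepts.

```python
import re


def matches(expression, text):
    """Evaluate a search expression against one text value.
    '&' means AND, a blank space means OR; terms are combined strictly
    left to right (an assumption about how the device behaves)."""
    tokens = expression.replace("&", " & ").split()
    result = None
    operator = "or"
    for token in tokens:
        if token == "&":
            operator = "and"
            continue
        hit = token.lower() in text.lower()  # case-insensitive substring
        if result is None:
            result = hit
        elif operator == "and":
            result = result and hit
        else:
            result = result or hit
        operator = "or"  # a plain space before the next term means OR
    return bool(result)


# Hypothetical gene descriptions.
descriptions = ["human kinase 3", "yeast kinase", "mouse actin"]

both_terms = [d for d in descriptions if matches("kinase & human", d)]
mixed = [d for d in descriptions if matches("kinase & human mouse", d)]

# "A letter followed by two digits", written as a Python regular expression.
letter_two_digits = re.compile(r"[A-Za-z][0-9]{2}")
has_pattern = bool(letter_two_digits.search("CDC28"))
```

Under the left-to-right reading, the second query is grouped as (kinase AND human) OR mouse, so it returns both the human kinase and the mouse actin records.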
Spotfire’s default choice of query devices is
based on the column content and the number
of unique values present in the data set for
that attribute. If a column contains ten or fewer unique
values, check boxes are assigned as the
query device. For columns containing more
than ten values, an item slider is chosen for alphanumeric (string) attributes, such as names
and descriptions. Range sliders are assigned to
numeric columns like date, time, and decimal
or integer values. Users can change the type of
query device to use for the column, with one
restriction: check boxes and radio buttons can
only be used for columns having fewer than 500
unique values. The currently selected query
device is marked with a bullet. To change the
type of query device, right-click the query
device to make the pop-up menu appear. Select
the appropriate query device option from the
pop-up menu or select the Columns tab of the
Properties dialog. This tab contains a list of all
the columns in the data set. Mark a column and
select the type of query device to use for
that column. From the Columns tab of the
Properties dialog, one can also make new
columns from expressions or by binning,
as well as delete columns from the Spotfire
DecisionSite internal database.
Figure 7.7.17 (A) Records can be marked by left-clicking the mouse and dragging the cursor
around the desired region. (B) Marking records in an irregular shape (by lasso) can be achieved
by pressing Shift while left-clicking the mouse and dragging the cursor around the desired region.
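The default assignment of query devices described above amounts to a small decision rule, which can be sketched in Python as follows. The function names are hypothetical and the type test is a simplification (dates and times, for instance, are not handled); the thresholds of ten and 500 unique values are taken from the text.

```python
def default_query_device(values):
    """Pick a default query device for a column, following the rule in the
    text: ten or fewer unique values -> check boxes; otherwise a range
    slider for numeric data and an item slider for strings."""
    unique = set(values)
    if len(unique) <= 10:
        return "check boxes"
    if all(isinstance(v, (int, float)) for v in unique):
        return "range slider"
    return "item slider"


def device_allowed(device, values):
    """Check boxes and radio buttons require fewer than 500 unique values;
    the other device types carry no such restriction."""
    if device in ("check boxes", "radio buttons"):
        return len(set(values)) < 500
    return True


# Hypothetical columns.
calls = ["A", "P", "M"] * 5
signals = [float(i) for i in range(100)]
names = [f"gene{i}" for i in range(50)]
many_ids = [f"probe{i}" for i in range(600)]
```

With these made-up columns, the detection calls get check boxes, the signal values get a range slider, and the gene names get an item slider; the column of 600 identifiers could not be switched to radio buttons.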
The order of the query devices can be
sorted in four ways: by original order, by
annotation, by name, or by type. For example,
users can group all range sliders together, or
sort the query devices in alphabetical order. To
sort the query devices, right-click on a query
device, select Sort from the pop-up menu,
and select Original, by Annotation, by Name,
or by Type. Alternatively, users may want
to regroup query devices and rearrange their
order to avoid having to scroll up and down to
keep track of the changes. The initial order of
the query devices depends on the structure of
the data set loaded into Spotfire DecisionSite
or the SQL query (UNIT 9.2) that was used to
acquire data. This can be changed as needed
by rearranging columns in the originating
spreadsheet program or by writing the SQL
query in a certain order.
DETAILS-ON-DEMAND
The Details-on-Demand window allows
the user to display all data linked to a particular record or a set of records in text or HTML
format. Records of interest can be marked using the left mouse button to create a rectangular
box (Fig. 7.7.17A) around the records or by using a lasso (Fig. 7.7.17B), i.e., by surrounding
the records with a line drawn in an arbitrary
shape by pressing the Shift key and using the
left mouse button to drag an arbitrary line.
For all marked records, the Details-on-Demand window shows linked data in text format (Fig. 7.7.18).
Instead of marking a set of records, the
Details-on-Demand window can also be used
to display data linked to an individual record
by simply clicking on that record to highlight
it. This creates a circle around that record and
the Details-on-Demand window displays data
corresponding only to that record (Fig. 7.7.19).
The Details-on-Demand feature can export
marked data to Excel; data can also be exported as an HTML file (Fig. 7.7.20A) within
Spotfire or in an external browser window
(Fig. 7.7.20B and C).
Figure 7.7.18 Details-on-Demand window shows a snapshot of the marked data. Data shown
in this window can be exported to Excel or as text/HTML data.
Figure 7.7.19 Details-on-Demand window can also be used to exhibit data for a single highlighted
record.
Figure 7.7.20 (A) Details-on-Demand (HTML) format. (B) Selecting the external Web browser
option from the View tab allows export of the HTML data to an external browser window (C).
STRENGTHS AND WEAKNESSES
OF SPOTFIRE AS A DESKTOP
MICROARRAY ANALYSIS
SOFTWARE
Spotfire is a powerful tool for data mining and visualization. Its strengths include
the ability to import data from a number of
databases for visualization in a single session.
This feature makes it a “virtual data warehouse” and allows evaluation of, for example,
gene expression data and corresponding proteomics data in a single session. Its ability to
manipulate visualizations is superb and it has a
substantial collection of the most frequently used
statistical analysis algorithms. However, Spotfire does have limitations. It is currently unable
to perform LOWESS (intensity-dependent)
normalization, which has particular relevance
to the two-color microarray system. It is unable to perform a false-discovery rate calculation after obtaining an ANOVA-based list
of significantly differentially expressed genes, and it lacks
powerful statistical tools such as multiple testing corrections and mixed-model ANOVAs. It
is unable to directly upload Affymetrix .cel
files to obtain and analyze data at the probe-set
level, and it lacks publication-quality graphics,
among other deficits. Fortunately, Spotfire is
also easily modified to incorporate new algorithms via its Application Programming Interface or through custom development by Spotfire staff. For example:
1. Spotfire has recently offered a custom-package upgrade to integrate powerful gene-expression data analysis tools from the statistical language R (http://www.bioconductor.org),
allowing users to access and deploy these R
scripts to perform, e.g., LOWESS normalization or mixed-model ANOVA, false-discovery
rate, or Bonferroni correction.
2. Spotfire now offers a tool that interacts
with the pathway-generating tool from Jubilant
Biosciences, thereby allowing users to scan
their expression data for pathway-based relationships.
3. Spotfire has entered into a partnership with Rosetta Inpharmatics that allows
licensed users to benefit from the error-modeling algorithms of Rosetta Resolver.
4. Spotfire is now offering a custom-package upgrade to analyze several types of
proteomics data in a manner similar to gene-expression data.
Spotfire thus offers much of the functionality one would typically require in a desktop gene expression analysis application, along
with significant flexibility in adapting the application to one’s own environment and needs.
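To make concrete what the multiple testing corrections mentioned above compute, here is a minimal Python sketch of two standard adjustments: Bonferroni, and the Benjamini-Hochberg procedure, which is one common way of controlling the false-discovery rate. It is offered only as an illustration under those assumptions; it is not the code used by Spotfire or by the R packages referred to in item 1, and the p values are made up.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values.
    Each sorted p value is scaled by m / rank, and a running minimum taken
    from the largest p value downward keeps the adjusted values monotone."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p values for four genes from an ANOVA.
p = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(p)
bh = benjamini_hochberg(p)

# Genes passing a 0.05 cutoff under each correction.
passing_bonf = [i for i, q in enumerate(bonf) if q <= 0.05]
passing_bh = [i for i, q in enumerate(bh) if q <= 0.05]
```

In this small example only the first gene survives either correction at 0.05, although three of the four raw p values were below that cutoff, which is why a gene list taken straight from an uncorrected ANOVA should be treated with caution.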
Literature Cited
Cheok, M.H., Yang, W., Pui, C.H., Downing, J.R.,
Cheng, C., Naeve, C.W., Relling, M.V., and
Evans, W.E. 2003. Treatment-specific changes
in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet.
34:85–90.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G.,
Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M.,
Hudson, J. Jr., Boguski, M.S., Lashkari, D.,
Shalon, D., Botstein, D., and Brown, P.O. 1999.
The transcriptional program in the response of
human fibroblasts to serum. Science 283:83–
87.
Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183–201.
Kozal, M.J., Shah, N., Shen, N., Yang, R., Fucini,
R., Merigan, T.C., Richman, D.D., Morris, D.,
Hubbell, E., Chee, M., and Gingeras, T.R. 1996.
Extensive polymorphisms observed in HIV1 clade B protease gene using high-density
oligonucleotide arrays. Nat. Med. 2:753–759.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T.,
Bar-Joseph, Z., Gerber, G.K., Hannett, N.M.,
Harbison, C.T., Thompson, C.M., Simon, I.,
Zeitlinger, J., Jennings, E.G., Murray, H.L.,
Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B.,
Volkert, T.L., Fraenkel, E., Gifford, D.K., and
Young, R.A. 2002. Transcriptional regulatory
networks in Saccharomyces cerevisiae. Science
298:799–804.
Leung, Y.F. and Cavalieri, D. 2003. Fundamentals of
cDNA microarray data analysis. Trends Genet.
19:649–659.
Schena, M., Shalon, D., Davis, R.W., and Brown,
P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA
microarray. Science 270:467–470.
Schena, M., Heller, R.A., Theriault, T.P., Konrad,
K., Lachenmeier, E., and Davis, R.W. 1998.
Microarrays: Biotechnology’s discovery platform for functional genomics. Trends Biotechnol. 16:301–306.
Smyth, G.K., Yang, Y.H., and Speed, T. 2003. Statistical issues in cDNA microarray data analysis.
Methods Mol. Biol. 224:111–136.
Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams,
W.K., Patel, D., Mahfouz, R., Behm, F.G.,
Raimondi, S.C., Relling, M.V., Patel, A., Cheng,
C., Campana, D., Wilkins, D., Zhou, X., Li,
J., Liu, H., Pui, C.H., Evans, W.E., Naeve,
C., Wong, L., and Downing, J.R. 2002. Classification, subtype discovery, and prediction
of outcome in pediatric acute lymphoblastic
leukemia by gene expression profiling. Cancer
Cell 1:133–143.
Contributed by Deepak Kaushal and
Clayton W. Naeve
St. Jude Children’s Research Hospital
Memphis, Tennessee
UNIT 7.8
Loading and Preparing Data for Analysis in Spotfire
Microarray data exist in a variety of formats, which often depend on the particular array
technology and detection instruments used. These data can easily be loaded into Spotfire
DecisionSite (UNIT 7.7) by a number of methods, including copying/pasting
from a spreadsheet, direct loading of text or comma-separated (.csv) files, or
direct loading of Microsoft Excel files. Data can also be loaded via preconfigured or
ad hoc queries of relational databases and from proprietary databases and export file
formats from microarray manufacturers such as Affymetrix (see Alternate Protocol 1)
and Agilent, or scanner manufacturers such as GenePix (see Basic Protocol 1). Once
the data are loaded, it is necessary to filter and preprocess the data prior to analysis (see
Support Protocol 1).
Subsequently, data transformation and normalization are critical to performing
microarray data mining correctly. These steps extract or enhance meaningful
data characteristics and prepare the data for the application of certain analysis methods,
such as statistical tests to compute significance and clustering methods (UNIT 7.9), most of which
require the data to be normally distributed. A typical example of a transformation
method is calculating the logarithm of raw signal values (see Support Protocol 2). Normalization is a type of transformation that accounts for systematic biases that abound in
microarray data. One may then wish to normalize the data within an experiment (see
Basic Protocol 2) or between multiple experiments (see Basic Protocol 3). During these
processes it may be useful to combine data from multiple rows (see Basic Protocol 4).
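The effect of a logarithmic transformation followed by a simple normalization can be seen in a short Python sketch. The choice of base 2 and of median centering as the normalization are assumptions made for illustration; they are not necessarily the methods applied in the protocols of this unit, and the intensities are made up.

```python
import math
from statistics import median


def log2_ratios(cy5, cy3):
    """Log2 of the Cy5/Cy3 ratio for each spot. After the transformation,
    a twofold increase is +1 and a twofold decrease is -1."""
    return [math.log2(r / g) for r, g in zip(cy5, cy3)]


def median_center(values):
    """Subtract the median so the typical spot has a log ratio of 0.
    This is one simple global normalization, used here as an example."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical background-corrected intensities for three spots.
cy5 = [200.0, 400.0, 800.0]
cy3 = [100.0, 100.0, 100.0]

raw = log2_ratios(cy5, cy3)
normalized = median_center(raw)
```

Here every spot appears at least twofold up before normalization, a pattern more likely to reflect a dye or scanner bias than biology; centering on the median removes that overall shift while preserving the differences between spots.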
NOTE: UNIT 7.7 provides a general introduction to the Spotfire program and environment.
This unit strictly focuses on data preparation within Spotfire. Readers unfamiliar with
Spotfire are encouraged to read UNIT 7.7.
BASIC PROTOCOL 1
UPLOADING GenePix DATA INTO SPOTFIRE
Spotfire allows the user to upload multiple spotted microarray data files in GenePix format
(.gpr files) using a script that can retrieve the files from a database or from a network
drive. While the original script was set up to retrieve version 3.0 .gpr files, modifications
can be made to it to allow it to recognize and import data from newer versions of GenePix
data files such as 4.0, 4.1, or 5.0. The script reads a .gpr file and ignores the header part
based on the information provided in the .gpr file header about the number of rows and
columns in the data file. It then allows the user to pick and choose the relevant columns
of data from a .gpr file to upload to Spotfire.
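The header-skipping step described for the import script can be sketched in Python. The sketch assumes the Axon Text File layout, in which the first line identifies the format, the second line gives the number of header records and the number of data columns, and the column titles follow the header records. The function name, the header records, and the data values are made up, and this is not the script that Spotfire itself uses.

```python
import csv


def read_gpr(text, wanted_columns):
    """Read the data table from .gpr text, keeping only the wanted columns.
    Assumes line 2 holds '<header records><TAB><data columns>'."""
    lines = text.splitlines()
    n_header, n_columns = (int(x) for x in lines[1].split("\t")[:2])
    table = lines[2 + n_header:]          # column titles, then data rows
    reader = csv.DictReader(table, delimiter="\t")
    if len(reader.fieldnames) != n_columns:
        raise ValueError("column count does not match the file header")
    return [{c: row[c] for c in wanted_columns} for row in reader]


# A tiny made-up file: two header records, five data columns.
sample = "\n".join([
    "ATF\t1.0",
    "2\t5",
    '"Type=hypothetical example"',
    '"Comment=made-up header record"',
    '"Block"\t"Name"\t"ID"\t"F635 Median"\t"F532 Median"',
    '1\t"geneA"\t"id1"\t1200\t300',
    '1\t"geneB"\t"id2"\t150\t640',
])

rows = read_gpr(sample, ["Name", "F635 Median", "F532 Median"])
```

Reading the counts from the second line, rather than searching for the column titles, is what lets the same approach cope with files whose header length differs between GenePix versions. Values are returned as text and would need converting to numbers before analysis.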
Necessary Resources
Hardware
The recommended minimal hardware requirements are modest. The software will
run on an Intel Pentium or equivalent with a 100-MHz processor, 64 MB RAM, and
20 MB disk space; a VGA or better display with 800 × 600 pixel resolution is
needed. However, most microarray experiments yield large output files and most
experimental designs require several data files to be analyzed simultaneously, so
the user will benefit from both much more RAM and a significantly faster
processor.
Contributed by Deepak Kaushal and Clayton W. Naeve
Current Protocols in Bioinformatics (2004) 7.8.1-7.8.25
Copyright © 2004 by John Wiley & Sons, Inc.
Software
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
A standard install of Microsoft Internet Explorer; v. 5.0 through 6.0 may be used
MDAC (Microsoft Data Access Components); versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12) may be used
A Web connection to the Spotfire server (http://home.spotfire.net; UNIT 7.7) or a local
customer specific Spotfire Server. A Web connection is also required to take
advantage of Web Links for the purpose of querying databases and Web sites on
the Internet using columns of data residing in Spotfire
Microsoft PowerPoint, Word, and Excel are required to take advantage of a number
of features available within Spotfire related to export of text results or
visualizations (UNIT 7.9)
Spotfire (6.2 or above) is required (see UNIT 7.7)
Files
Spotfire (Functional Genomics module) can import data in nearly any format, but
the authors focus here on the two-color spotted microarray data produced using
GenePix software (Axon, Inc.). Several types of spotted arrays, scanners,
scanning software packages, and their corresponding data types exist, including
those from commercial vendors (Agilent, Motorola, and Mergen) that supply
spotted microarrays for various organisms, as well as those from facilities that
manufacture their own chips. GenePix data files are in a tab-delimited text format
(.gpr), which can be directly imported into a Spotfire session.
1. Run Spotfire (UNIT 7.7) and ensure that access is available to the .gpr files from
either a network drive or a database.
Depending on the type of setup, it may be necessary to log in to the Spotfire application
as well as the data source. Systems and database administrators may be able to provide
more information. In this example, a GenePix version 3.0 data file is used.
2. In the Tools pane on the left-hand side of the screen, click on Access, then on Import
GenePix files (Fig. 7.8.1). The Import GenePix Files dialog appears (Fig. 7.8.2A).
Figure 7.8.1 Tools pane with the Import GenePix Files tab highlighted.
Figure 7.8.2 (A) The Import GenePix Files dialog allows users to specify files to be uploaded
into a Spotfire session. (B) The Data Import Options allow users to choose all or any columns from
the data set.
3. Click Add. Point to the directory where the files to be analyzed are located, and
double-click on the desired file. It is possible to load either a single file or multiple
files with the help of the Shift key. The user may upload as many as seven files at
one time. Uploading more than seven files will require repeating the process. The
filename will appear in the center of the dialog box.
4. Specify the file(s) and click on the Columns button (Fig. 7.8.2A) to specify the data
columns (Fig. 7.8.2B) to upload. One can choose to upload the entire file (requiring
longer upload times).
The 43 columns listed in Figure 7.8.2B are generated by the GenePix software and are
related to the position (Block, Column, Row, X, and Y), identification (Name, ID), and
morphology (Diameter) of the spot and its intensity in either the Cy5 or Cy3 channel
(all other columns). B represents Background and F represents Fluorescence. 635 and
532 represent the two wavelengths used during scanning (532 for Cy3 and 635 for Cy5).
Suggested columns to upload include F635 Median, B635 Median, F532 Median, B532
Median, Ratio of Medians, F635 Median-B635, F532 Median-B532, Flags, Norm Ratio
of Medians, and Norm Flags.
5. Check all columns to import, then click OK. The Import GenePix files window will
appear again. Click OK again. Data will begin loading into Spotfire. This could
take several minutes depending on the size and number of the data columns being
uploaded and RAM/processor speeds.
At the end of the data-upload process, Spotfire will automatically display an initial
visualization where each record is represented by a marker, along with a number of
query devices for manipulating the visualization. Alternative visualizations (UNIT 7.7)
can be opened by clicking on appropriate visualization toolbars, choosing Visualization
from the File menu, or using the shortcuts Ctrl-1 through Ctrl-9 on the keyboard for
various visualizations.
6. Filter and preprocess the data as described in Support Protocols 1 and 2.
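The relationship between the columns suggested in step 4 can be illustrated with a short Python sketch that subtracts the local background from each channel and forms the ratio of the two results. It assumes the ratio is taken as the 635-nm (Cy5) channel over the 532-nm (Cy3) channel and that non-positive background-subtracted values should be treated as unusable; the function names and the numbers are made up.

```python
def background_subtracted(row):
    """Return (Cy5, Cy3) median intensities minus their local backgrounds."""
    cy5 = row["F635 Median"] - row["B635 Median"]
    cy3 = row["F532 Median"] - row["B532 Median"]
    return cy5, cy3


def ratio_of_medians(row):
    """Cy5/Cy3 ratio of background-subtracted medians, or None when either
    channel is at or below background (an assumption of this sketch)."""
    cy5, cy3 = background_subtracted(row)
    if cy5 <= 0 or cy3 <= 0:
        return None
    return cy5 / cy3


# Hypothetical spots.
bright_spot = {"F635 Median": 1200, "B635 Median": 200,
               "F532 Median": 300, "B532 Median": 50}
dim_spot = {"F635 Median": 210, "B635 Median": 200,
            "F532 Median": 45, "B532 Median": 50}

bright_ratio = ratio_of_medians(bright_spot)
dim_ratio = ratio_of_medians(dim_spot)
```

Spots such as the second one, where a channel does not rise above its background, are the kind of records that the filtering in Support Protocol 1 is meant to remove before ratios are interpreted.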
ALTERNATE PROTOCOL 1
UPLOADING AFFYMETRIX TEXT DATA INTO SPOTFIRE
Support for standard microarray platforms, such as Affymetrix, is integrated within DecisionSite for Functional Genomics. Spotfire allows the user to upload multiple Affymetrix
data files in the metric text format (.met files) using a script that can retrieve these files
from a database or from a network drive. A guide is available to upload data from both
MAS 4.0 and MAS 5.0 versions. The MAS 5.0 guide also works with the latest Affymetrix
software, GCOS 1.1. The script reads a .met file while largely ignoring the information
provided in the header. It then allows the user to pivot the relevant columns of data from
the .met file(s) to upload.
Necessary Resources
Hardware
The recommended minimal hardware requirements are modest. The software will
run on an Intel Pentium or equivalent with a 100-MHz processor, 64 MB RAM, and
20 MB disk space; a VGA or better display with 800 × 600 pixel resolution
is needed. However, most microarray experiments yield large output files and
most experimental designs require several data files to be analyzed
simultaneously, so the user will benefit from both much more RAM and a
significantly faster processor.
Software
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
A standard install of Microsoft Internet Explorer; v. 5.0 through 6.0 may be used
MDAC (Microsoft Data Access Components); versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12) may be used
A Web connection to the Spotfire server (http://home.spotfire.net; UNIT 7.7) or a local
customer specific Spotfire Server. A Web connection is also required to take
advantage of Web Links for the purpose of querying databases and Web sites on
the Internet using columns of data residing in Spotfire
Microsoft PowerPoint, Word, and Excel are required to take advantage of a number
of features available within Spotfire related to export of text results or
visualizations (UNIT 7.9)
Spotfire (6.2 or above) is required (see UNIT 7.7)
Files
Spotfire (Functional Genomics module) can import data in nearly any format, but
the authors focus here on the commercial GeneChip microarray data
(Affymetrix, Inc.). Spotfire facilitates the seamless import of Affymetrix output
files (.met) from Affymetrix MAS v. 4.0 or v. 5.0 software. The .met file is a
tab-delimited text file containing information about attributes such as probe set
level, gene expression levels (signal), and detection quality controls (p value and
Absence/Presence calls). In the illustration below, MAS 5.0 .met files will be
used as an example.
1. Run Spotfire (UNIT 7.7) and ensure that access is available to the .met files from
either a network drive or a database.
Depending on the type of setup, it may be necessary to log in to the Spotfire application
as well as the data source. Systems and database administrators may be able to provide
more information.
2. In the Tools pane on the left-hand side, a plus sign (+) in front of the script Access
indicates that it can be expanded to explore other items under this directory. Click
on Access, then on Import Affymetrix Data. This reveals all options available for
importing Affymetrix data (from version 4.0 or 5.0 MAS files on a network drive,
or from a local or remote database). Click on Import Affymetrix V5 Files (Fig. 7.8.3).
Figure 7.8.3 Tools pane with the Import Affymetrix v5 Files tab highlighted.
Figure 7.8.4 (A) The Import Affymetrix Files dialog allows users to specify files to be uploaded
into a Spotfire session. (B) The Data Import Options allow users to choose all or any columns from
the data set.
3. Clicking on Import Affymetrix V5 Files will open a window for the user to specify
the files to upload to Spotfire (Fig. 7.8.4A).
4. Click Add. Point to the directory where the files to be analyzed are located, and
double-click on the desired file. It is possible to load either a single file or multiple
files with the help of the Shift key. The user may upload as many as seven files at
one time. Uploading more than seven files will require that the process be repeated.
The filename will appear in the center of the dialog box.
5. Specify the file(s) and click on the Columns button (Fig. 7.8.4A) to specify the data
columns (Fig. 7.8.4B) to upload. One can choose to upload the entire file (requiring
longer upload times).
6. Check all columns to import, then click OK. The Import Affymetrix Files window
will appear again. Click OK again. Data will begin loading into Spotfire. This could
take several minutes depending on the size and number of the data columns being
uploaded and RAM/processor speeds.
At the end of the data-upload process, Spotfire will automatically display an initial
visualization where each record is represented by a marker, along with a number of query
devices for manipulating the visualization. Alternative visualizations can be opened by
clicking on appropriate visualization toolbars, choosing Visualization from the File
menu, or using the shortcuts Ctrl-1 through Ctrl-9 on the keyboard for various
visualizations.
7. Filter and preprocess the data as described in Support Protocols 1 and 2.
SUPPORT PROTOCOL 1
FILTERING AND PREPROCESSING MICROARRAY DATA
Successfully completing microarray experiments includes assessing the quality of the
array design, the experimental design, the experimental execution, the data analysis, and
the biological interpretation. At each step, data quality and data integrity should be maintained by minimizing both systematic and random measurement errors. Before embarking
on the actual analysis of data, it is important to perform filtering and preprocessing, and
other kinds of transformations, to remove systematic biases that are present in microarray
data. It is not uncommon for users to overlook the importance of such quality-control
measures. Typical filtering operations include removing genes with background levels of
expression from the data, as these would likely confound later transformations and cause
spurious effects during fold-change calculations and significance analysis. This can be
readily achieved by filtering on the basis of absence/presence calls and detection p value.
Query devices are assigned to every field of data and allow the user to perform filtering
with multiple selection criteria, resulting in updates of all visualizations to display the
results of this cumulative filtering. Guides can be used to perform such repetitive tasks
quickly or to initiate a series of specific steps in the analysis. Throughout analysis, filtering
using any data-field query device can be used to subset data and limit the number of genes
that are included in further calculations and visualizations. Genes can be filtered on the
basis of detection p value, Affymetrix signal, GenePix signal, GenePix signal-to-noise
ratio, fold change, standard deviation, and modulation (frequency crossing a threshold).
For example, filtering genes on modulation by setting a 0.05 p value threshold will split
genes out by the number of times they fall above the 0.05 limit in the selected experiments.
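Filtering on modulation, as described above, amounts to counting for each gene how many experiments cross a threshold and then grouping genes by that count. The Python sketch below illustrates the idea with made-up detection p values; the function names and the data layout (one list of p values per gene) are assumptions, not Spotfire structures.

```python
from collections import defaultdict


def modulation_count(p_values, threshold=0.05):
    """Number of experiments in which the p value falls above the threshold."""
    return sum(1 for p in p_values if p > threshold)


def split_by_modulation(table, threshold=0.05):
    """Group gene names by how many times they exceed the threshold."""
    groups = defaultdict(list)
    for gene, p_values in table.items():
        groups[modulation_count(p_values, threshold)].append(gene)
    return dict(groups)


# Hypothetical detection p values across three experiments.
detection_p = {
    "geneA": [0.01, 0.02, 0.03],
    "geneB": [0.20, 0.01, 0.40],
    "geneC": [0.50, 0.60, 0.70],
}

groups = split_by_modulation(detection_p)

# Keep only genes that exceed the threshold in at most one experiment.
kept = [g for g, ps in detection_p.items() if modulation_count(ps) <= 1]
```

Grouping by count rather than applying a single pass/fail test lets the user decide afterward how strict to be, for example by excluding only the genes that exceed the threshold in every experiment.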
To preprocess Affymetrix text data
1a. Initiate a Spotfire session (UNIT 7.7) and upload Affymetrix text (.met) files as
described in Alternate Protocol 1.
2a. Pay careful attention to the query devices as a default visualization is loaded. A
query device appears for every column of data that is uploaded and can be used to
manipulate data visualization. In the Guides pane on the top-left corner, click on
the link for Data Analysis, then on “Analyze Affymetrix absence/presence calls”
(Fig. 7.8.5).
3a. This script allows one to choose Detection columns containing Absent (A), Marginal
(M), and Present (P) calls. Click on all the detection columns to be considered from
the display in the Guides pane, then click on Continue.
Figure 7.8.5 Guides pane with the Analyze Affymetrix absence/presence calls guide highlighted.
Figure 7.8.6 The data are binned on the basis of the number of times a particular probe set was
called Absent, Present, or Marginal, and a histogram displays the results.
4a. The frequency of absent, present, and marginal occurrences is then calculated across
the selected experiments for each gene. It is possible to filter data using three new
query devices: Absent Count, Present Count, and Marginal Count. A histogram
may be created to view the distribution of Absent, Present, and Marginal counts
using the Histogram Guide (Fig. 7.8.6).
This display allows users to quickly identify those genes that are repeatedly called Absent.
In the above example, there are eight metric text files (Fig. 7.8.6). The histogram displays
all genes based on how many times they were binned into the P category in these eight
Figure 7.8.7 The data generated from the use of the Affymetrix absence/presence guide are added
to the Spotfire session as a new column, and a corresponding new query device is generated.
experiments. The distribution ranges from 0-1, which identifies genes that are always
or almost always Absent, to 7-8, which identifies genes that are almost always called
Present.
5a. Using the above histogram it is possible to exclude genes in one or more groups.
Similar results can be obtained by sending “Absent call” results to different bins.
When the histogram is displayed, associated data are linked to parent data in the
Spotfire session and a new query device is created for this column of data (Fig. 7.8.7).
6a. By default, the query device is in the range-slider format. Right-click on the center
of the Query device and choose Check Boxes (Fig. 7.8.8).
7a. Uncheck the check box for category 0-1. Notice how the number of visible records
on the activity line changes from 6352 to 4441, reflecting the 1911 genes that were
filtered out using this method (Fig. 7.8.9).
Records under the histogram 0-1 pertain to those genes that were called Present either 0
or 1 time out of a total of 8 Affymetrix chips in this particular experiment. This indicates
that these genes are not reliably detected under these conditions. Filtering out these
genes allows further calculations and transformations to be performed on the rest of the
data set without any effect from these genes.
8a. Alternatively, data may be filtered based on criteria (detection p value or raw signal)
other than Absence/Presence calls. To do so, click on Data Preparation in the Guides
pane, followed by Filter Genes. Users can filter genes by “Standard deviation,” “Fold
change,” or “Modulation.” To filter genes by “Standard deviation,” it is necessary to
normalize data based on Z-score calculations (see Basic Protocol 3 and Background
Information). Similarly, genes can only be filtered by “Fold change” when the
appropriate normalization has been applied to the data (see Basic Protocol 3 and
Background Information). Genes can also be filtered by modulation or frequency
of crossing a threshold. In a set of 12 .met files, for example, one can query how
many times a certain gene has a detection p value greater than 0.05. This calculation
can be carried out for every gene in the dataset and groups of genes can be removed
based on a particular frequency.
Figure 7.8.8 Query Device for a particular column of data can be modified from one type to
another.
Figure 7.8.9 Clearing check box corresponding to “Binned Present count 0-1” alters the number
of visible records (shown on the Activity Line).
9a. Choose Modulation. Next, choose all the p value columns to be considered from
the display in the Guides pane. Hit Continue (Fig. 7.8.10).
10a. Select a modulation threshold. If interested in filtering out genes on the basis of
a p value cutoff of 0.05, for example, type 0.05. Click on Filter by Modulation
(Fig. 7.8.11).
Figure 7.8.10 The Filter Genes guide helps users to perform data preprocessing in a stepwise
fashion.
Figure 7.8.11 The Filter Genes by Modulation guide bins data by the number of times a record
(gene) crosses the specified threshold in the given experiments.
11a. The frequency of p value occurrences above 0.05 across the selected experiments
is displayed for each gene. It is possible to filter data using the new query device
or from the histogram or trellis display (Fig. 7.8.12).
Similar filtering may be performed on raw signal data.
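The counting that underlies the absence/presence and modulation guides can be sketched outside Spotfire as follows. This is a minimal Python illustration under stated assumptions, not Spotfire's own implementation: the function names, the probe set identifiers, and all values are hypothetical.

```python
# Illustrative sketch only: probe set IDs and values are made up.

def count_calls(calls, label="P"):
    """Count how many experiments gave a probe set the given detection call."""
    return sum(1 for call in calls if call == label)

def modulation(values, threshold):
    """Count how many values lie above the threshold."""
    return sum(1 for v in values if v > threshold)

# Detection calls and detection p values for three hypothetical probe sets
# measured on eight chips.
detection = {
    "probe_1": ["P", "P", "P", "P", "P", "P", "P", "P"],
    "probe_2": ["A", "A", "P", "A", "A", "A", "A", "A"],
    "probe_3": ["P", "M", "P", "A", "P", "P", "A", "P"],
}
p_values = {
    "probe_1": [0.001, 0.002, 0.001, 0.004, 0.003, 0.001, 0.002, 0.001],
    "probe_2": [0.50, 0.62, 0.03, 0.44, 0.71, 0.38, 0.55, 0.49],
    "probe_3": [0.01, 0.06, 0.02, 0.30, 0.01, 0.02, 0.12, 0.03],
}

present_count = {g: count_calls(c, "P") for g, c in detection.items()}
above_cutoff = {g: modulation(p, 0.05) for g, p in p_values.items()}

# Keep probe sets called Present more than once, i.e., drop the "0-1" bin.
kept = [g for g, n in present_count.items() if n > 1]
```

Here `present_count` plays the role of the Present Count column, and `above_cutoff` corresponds to modulation at a 0.05 threshold on the detection p value columns.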
To preprocess spotted array (GenePix) data
1b. Initiate a Spotfire session (UNIT 7.7) and upload appropriate columns from GenePix
(.gpr) files as described in Basic Protocol 1.
It is useful to retrieve data from the raw signal columns and background-corrected
signal columns. In addition, GenePix data contain indicators of data quality in Signal
Figure 7.8.12 A new data column and a new query device are added to the Spotfire session,
based on the Filter Genes>Modulation>p-value selection.
to Noise Ratio columns for every channel and a Flags column for every slide. It is
useful to retrieve these data. In the example below, six cDNA microarray experiments
(12 channels of signal data) are uploaded to Spotfire.
2b. In the Guides pane on the top-left corner, click on the link for Data Preparation and
then on Filter Genes (Fig. 7.8.13).
3b. Click on Modulation. Filtering can be performed to remove bad data from GenePix
files using data contained in the Flags columns and/or the Signal to Noise Ratio
column. Choose Flags columns for any number of arrays to be mined, then hit
Continue (Fig. 7.8.14).
GenePix software provides the ability to flag individual features with quality indicators
such as Good, Bad, Absent, or Not Found. In the text data file, these indicators are
converted to numeric data. Features with a Bad flag are designated −100, Good features
are flagged as +100, Absent features are flagged as −75, and Not Found features as
−50. All other genes are designated as 0 in the .gpr file. By modulating data on the
Flags column at a setting of 0, it is possible to identify those genes that are consistently
good or bad.
4b. The frequency of various flagged occurrences is then calculated across the selected
experiments for each gene. It is possible to filter the data using query devices for the
newly generated columns. A histogram can be created using the Histogram Guide
to better view the distribution. When the histogram is created, associated data are
linked to parent data in the Spotfire session and a new query device is created for
this column of data (Fig. 7.8.15).
This display allows users to quickly identify those genes that are repeatedly flagged.
In the above example, there are six GenePix files. The histogram displays all genes based
on how many times they were binned into the Flag category from six columns of data.
The distribution ranges from 0, which identifies genes that are never flagged Bad or
Figure 7.8.13 Clicking on the Filter Genes Guide allows users to perform preprocessing on
GenePix data.
Figure 7.8.14 Preprocessing can be performed on GenePix data using the Flags or the SNR
columns.
Not Found or Absent (hence the good genes), to 6, which identifies genes that are most
frequently flagged and need to be filtered out of the data set.
5b. Using the above histogram it is possible to exclude genes in one or more groups.
By default, the query device is in the range slider format. Right click on the center
of the Query device and choose Check Boxes. By filtering the “flagged 6 times
group,” 2937 genes are filtered out (Fig. 7.8.16).
6b. Users may also filter GenePix data based on criteria other than Flag, such as Signal
to Noise Ratio (SNR), raw signal, or Background pixel saturation levels. Click on
Figure 7.8.15 A new data column and a new query device are added to the Spotfire session,
based on the Filter Genes>Modulation>Flags selection.
Figure 7.8.16 Clearing check box corresponding to Modulation by Flags column (category 6)
alters the number of visible records (shown on the Activity Line).
Data Preparation in the Guides pane, followed by Filter Genes. Users can filter
genes by “Standard deviation,” “Fold change,” or “Modulation.” In order to filter
genes by “Standard deviation,” it is necessary to normalize data based on Z-score
calculations (see Basic Protocol 3 and Background Information). Similarly, genes
can only be filtered by “Fold change” when the appropriate normalization has been
applied to the data (see Basic Protocol 3 and Background Information). Genes can
be filtered by modulation or frequency of crossing a threshold. In a set of 12 GenePix
files, for example, one can ask how many times a certain gene has an SNR value
greater than 1.5. This calculation can be carried out for every gene in the dataset
and groups of genes can be removed based on a particular frequency.
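The flag-based and SNR-based filtering described above can be sketched in Python as follows. This is only an illustration: the feature names and values are hypothetical, and treating every negative flag code as a problem flag is an assumption drawn from the flag values listed in the text, not a description of Spotfire's internal logic.

```python
# Illustrative sketch only: feature names and values are made up.
# Numeric GenePix flag codes as listed in the text above.
FLAG_CODES = {"Good": 100, "Bad": -100, "Absent": -75, "Not Found": -50}

def times_flagged(flags):
    """Count arrays in which a feature carries a negative (problem) flag."""
    return sum(1 for f in flags if f < 0)

def times_above(values, threshold):
    """Count values greater than the threshold (e.g., SNR > 1.5)."""
    return sum(1 for v in values if v > threshold)

# Flags and signal-to-noise ratios for three hypothetical features on six arrays.
flags = {
    "feature_1": [0, 0, 100, 0, 0, 0],
    "feature_2": [-100, -50, -75, -100, -50, -75],
    "feature_3": [0, -50, 0, 0, -100, 0],
}
snr = {
    "feature_1": [4.2, 3.9, 5.1, 4.4, 3.7, 4.0],
    "feature_2": [0.8, 1.1, 0.9, 1.2, 1.0, 0.7],
    "feature_3": [2.0, 1.4, 2.2, 1.9, 1.3, 2.5],
}

flag_count = {f: times_flagged(v) for f, v in flags.items()}
snr_count = {f: times_above(v, 1.5) for f, v in snr.items()}

# Drop features that were flagged on all six arrays.
n_arrays = 6
kept = [f for f, n in flag_count.items() if n < n_arrays]
```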
LOG TRANSFORMATION OF MICROARRAY DATA
The logarithmic (henceforth referred to as log) function has been used to preprocess
microarray data from the very beginning (Yang et al., 2002). The range for raw intensity
values in microarray experiments spans a very large interval from zero to tens of thousands. However, only a small fraction of genes have values that high. This generates a long
tail in the distribution curve, making it asymmetrical and non-normal. Log transformation provides values that are easily interpretable and more meaningful from a biological
standpoint. The log transformation accomplishes the goal of defining directionality and
fold change, whereas raw signal numbers only demonstrate relative expression levels.
The log transformation also makes the distribution of values symmetrical and almost
normal by removing the skew caused by the long tail of high-intensity values.
SUPPORT
PROTOCOL 2
1. Open an instance of Spotfire (UNIT 7.7). Upload (see Basic Protocol 1 or Alternate
Protocol 1) and prefilter (see Support Protocol 1) microarray data.
2. In the Guides pane of the DecisionSite Navigator (see UNIT 7.7), click on Data
preparation>Transform columns to log scale. A new window is opened within the
Guides pane (Fig. 7.8.17).
3. Select the columns on which to perform log transformation. These would typically
be the signal columns in Affymetrix data and Cy-3 and Cy-5 signal data in the case
Figure 7.8.17 The “Transform columns to log scale” guide allows the user to convert any numeric
data column to its logarithm counterpart, allowing the user to choose log to base 2 or 10.
of two-color arrays. Hold down the Ctrl key in order to select multiple columns. In
order to select all the columns displayed in the guide, select the first column, hold
down the Shift key and then select the last column (Fig. 7.8.17).
4. Click Continue. The user is now presented with the option of transforming log to
the base 10 or 2.
5. Click on “log10” or “log2.” Most microarray users have a preference for log2. New
data columns are generated and added to the data set. Query Devices for these newly
generated columns are also added and can be used to manipulate visualizations. Log
transformed values for input values less than or equal to zero are not calculated and
are left empty.
6. Load the Guides pane again by clicking on Back to Contents.
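The transformation performed in the steps above can be illustrated with a short Python sketch. The function name and the sample values are hypothetical; the handling of non-positive inputs mirrors the behavior described in step 5, where such values are left empty.

```python
import math

def log_transform(values, base=2):
    """Log-transform a column; missing or non-positive inputs stay empty (None)."""
    log = math.log2 if base == 2 else math.log10
    return [log(v) if v is not None and v > 0 else None for v in values]

# Hypothetical raw signal column, including a zero and a negative value.
signal = [1024.0, 256.0, 0.0, -12.5, 64.0]
log2_signal = log_transform(signal, base=2)
```

On the log2 scale, a doubling of signal corresponds to an increase of one unit and a halving to a decrease of one unit, which is why many microarray users prefer base 2.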
BASIC
PROTOCOL 2
NORMALIZATION OF MICROARRAY DATA WITHIN AN EXPERIMENT
Experimental comparisons of expression are only valid if the data are corrected for
systemic biases such as the technology used, protocol used and investigator. Since these
biases are regularly detected in raw microarray data, it is imperative that some sort of
normalization procedure be used to address this issue (Smyth and Speed, 2003). At this
time, however, there is no consensus way to perform normalization.
Several methods are available in the normalization module of Spotfire. These can broadly
be divided into two categories: those that make experiments comparable (i.e., within
experiments) and those that make the genes comparable (i.e., between experiments, see
Basic Protocol 3). “Normalize by mean,” “Normalize by trimmed mean,” “Normalize by
percentile,” “Scale between 0 and 1,” and “Subtract the mean or median” are all examples
of the former category, which is particularly relevant for spotted arrays but rarely needed
for Affymetrix chips (see Background Information).
1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1) and
prefilter (see Support Protocol 1) microarray data.
To normalize by mean (also see Background Information)
2a. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens (Fig. 7.8.18).
3a. Choose the “Normalize by mean” radio button and then click the Next> button.
The normalization dialog box 2(2) opens (Fig. 7.8.19).
4a. Select the “Value columns” on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5a. Click a radio button to select whether to work with “All records” or “Selected
records.”
6a. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing values to the interpolated value between the two neighboring values
in the row.
7a. Set one of the columns to be used for normalization as a baseline by selecting
from the “Baseline for rescaling” drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious
examples. Select None if no baseline is needed.
Figure 7.8.18 The Normalization dialog 1(2) allows the users to choose from several Normalization options.
Figure 7.8.19 The Normalization dialog 2(2) allows the users to choose Value column on which
to perform Normalization and other variables.
8a. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
9a. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
10a. Click Finish. Normalized columns are computed and added to the data set.
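As a rough illustration of what normalization by mean does, the following Python sketch rescales each column so that its mean matches a target mean. It is written under assumptions: the function names and data are hypothetical, and using the average of the column means as the target when no baseline is chosen is one plausible reading of "mutually adjusted," not a confirmed description of Spotfire's calculation.

```python
def mean(values):
    """Mean of the non-empty values in a column."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def normalize_by_mean(columns, baseline=None):
    """Rescale every column so that its mean equals a common target mean.

    columns  -- dict mapping column name to a list of intensities
    baseline -- name of the control column, or None for mutual adjustment
    """
    if baseline is not None:
        target = mean(columns[baseline])
    else:
        # Assumption: without a baseline, use the average of the column means.
        target = mean([mean(col) for col in columns.values()])
    result = {}
    for name, col in columns.items():
        factor = target / mean(col)
        result[name] = [None if v is None else v * factor for v in col]
    return result

# Hypothetical two-color data with Cy3 as the control channel.
columns = {"Cy3": [100.0, 200.0, 300.0], "Cy5": [200.0, 400.0, 600.0]}
scaled = normalize_by_mean(columns, baseline="Cy3")
mutual = normalize_by_mean(columns)
```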
To normalize by trimmed mean (also see Background Information)
2b. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3b. Choose the “Normalize by trimmed mean” radio button and then click the Next>
button. The normalization dialog box 2(2) opens.
4b. Select the “Value columns” on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5b. Click a radio button to select whether to work with “All records” or “Selected
records.”
6b. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing values to the interpolated value between the two neighboring values
in the row.
7b. Set one of the columns to be used for normalization as a baseline by selecting
from the “Baseline for rescaling” drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious
examples. Select None if no baseline is needed.
8b. Enter a “Trim value.” If a trim value of 10% is entered, the highest and the lowest
5% of the values are excluded when calculating the mean.
9b. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
10b. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
11b. Click Finish. Normalized columns are computed and added to the data set.
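The effect of the trim value can be seen in a small Python sketch. The function name and data are hypothetical, and truncating the number of values dropped from each tail to a whole number is an assumption; Spotfire may round differently.

```python
def trimmed_mean(values, trim_percent):
    """Mean after dropping trim_percent/2 percent of the values from each tail."""
    ordered = sorted(values)
    # Number of values removed from each end (assumption: truncated to an integer).
    k = int(len(ordered) * trim_percent / 200.0)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

# Twenty hypothetical intensities, one of which is an extreme outlier.
values = [float(i) for i in range(1, 20)] + [1000.0]
plain = sum(values) / len(values)
trimmed = trimmed_mean(values, 10)  # drops the lowest 5% and the highest 5%
```

With a trim value of 10%, the single highest and single lowest of the twenty values are excluded, so the outlier no longer dominates the mean used for rescaling.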
To normalize by percentile (also see Background Information)
2c. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3c. Choose “Normalize by percentile value” and then click the Next> button. The
normalization dialog box 2(2) opens.
4c. Select the “Value columns” on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5c. Click a radio button to select whether to work with “All records” or “Selected
records.”
6c. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing value to the interpolated value between the two neighboring values
in the row.
7c. Select one of the columns to be used for normalization as a baseline by selecting
from the “Baseline for rescaling” drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious
examples. Select None if no baseline is needed.
8c. Enter a Percentile. For example, “85-percentile” is the value that 85% of all values
in the data set are less than or equal to.
9c. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
10c. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
11c. Click Finish. Normalized columns are computed and added to the data set.
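The percentile definition given in step 8c can be made concrete with a short Python sketch. The function names and data are hypothetical; the nearest-rank rule is used here because it matches the stated definition, and dividing each value by the percentile value is an assumed way of "normalizing to" that percentile rather than a confirmed description of Spotfire's calculation.

```python
import math

def percentile(values, pct):
    """Smallest value v such that at least pct% of the values are <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]

def normalize_by_percentile(values, pct):
    """Divide every value by the chosen percentile value of the column."""
    ref = percentile(values, pct)
    return [v / ref for v in values]

# Hypothetical signal column: 10, 20, ..., 200.
signal = [float(v) for v in range(10, 210, 10)]
p85 = percentile(signal, 85)
normalized = normalize_by_percentile(signal, 85)
```

After this rescaling, the value at the chosen percentile equals 1 in every normalized column, which is what makes the columns comparable.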
To scale between 0 and 1 (also see Background Information)
2d. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3d. Choose “Scale between 0 and 1” and then click the Next> button. The normalization
dialog box 2(2) opens.
4d. Select the “Value columns” on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5d. Click a radio button to select whether to work with “All records” or “Selected
records.”
6d. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing value to the interpolated value between the two neighboring values
in the row.
7d. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
8d. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
9d. Click Finish. Normalized columns are computed and added to the data set.
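Scaling between 0 and 1 (min-max normalization) can be sketched per gene as follows. The function name and values are hypothetical, and mapping a gene with constant expression to 0 is an assumption made here to avoid division by zero.

```python
def scale_0_1(row):
    """Rescale one gene's values so the smallest becomes 0 and the largest 1."""
    lo, hi = min(row), max(row)
    if hi == lo:
        # Assumption: a gene with no variation is mapped to 0 throughout.
        return [0.0 for _ in row]
    return [(v - lo) / (hi - lo) for v in row]

# Hypothetical expression values for one gene across four experiments.
gene = [2.0, 4.0, 6.0, 10.0]
scaled = scale_0_1(gene)
```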
To subtract the mean (also see Background Information)
2e. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3e. Choose “Subtract the mean” and then click the Next> button. The normalization
dialog box 2(2) opens.
4e. Select the “Value columns” on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5e. Click a radio button to select whether to work with “All records” or “Selected
records.”
6e. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing value to the interpolated value between the two neighboring values
in the row.
7e. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
8e. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
9e. Click Finish. Normalized columns are computed and added to the data set.
To subtract the median (also see Background Information)
2f. In the Tools Pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3f. Choose “Subtract the median” and then click the Next> button. The normalization
dialog box 2(2) opens.
4f. Select the “Value columns” on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5f. Click a radio button to select whether to work with “All records” or “Selected
records.”
6f. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing values to the interpolated value between the two neighboring values
in the row.
7f. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
8f. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
9f. Click Finish. Normalized columns are computed and added to the data set.
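Mean and median centering can be illustrated with the following Python sketch, which assumes log-transformed input and centers each gene across experiments. The function name and values are hypothetical.

```python
from statistics import mean, median

def center(row, how="mean"):
    """Subtract the row mean (or median) from every value in the row."""
    centre = mean(row) if how == "mean" else median(row)
    return [v - centre for v in row]

# Hypothetical log-transformed values for one gene across four hybridizations.
log_values = [1.0, 2.0, 3.0, 10.0]
mean_centered = center(log_values, "mean")      # row mean is 4.0
median_centered = center(log_values, "median")  # row median is 2.5
```

The single high value pulls the mean upward but leaves the median nearly unchanged, which illustrates why median centering is less sensitive to outliers.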
NORMALIZATION OF MICROARRAY DATA BETWEEN EXPERIMENTS
Experimental comparisons of expression are valid only if the data are corrected for
systemic biases such as the technology used, protocol used, and investigator. Since these
biases are regularly detected in raw microarray data, it is imperative that some sort of
normalization procedure be used to address this issue (Smyth and Speed, 2003). At this
time, however, there is no consensus way to perform normalization. Fold change as signed
ratio, fold change as log ratio, fold change as log ratio in standard deviation units, and Z-score calculation are all examples of between-experiments normalization that are equally
applicable to both spotted and Affymetrix array platforms.
BASIC
PROTOCOL 3
1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1 or
Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.
To normalize by calculating fold change (as signed ratio, log ratio, or log ratio in
standard deviation units; also see Background Information)
2a. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3a. Select a radio button for “Fold change as signed ratio,” “Fold change as log ratio,” or “Fold change as log ratio in Standard Deviation units.” Click Next. The
Normalization dialog box 2(2) opens.
4a. Select the “Value columns” on which to perform the operation.
5a. Click a radio button to select whether to work with “All records” or “Selected
records.”
6a. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing values to the interpolated value between the two neighboring values
in the row.
7a. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
8a. Click a radio button to specify whether to calculate mean from “All genes” or
“Genes from Portfolio.” If Genes from Portfolio is selected, a portfolio dialog box
will open and the user can specify a number of records or list(s) from which to
calculate mean. Click OK.
9a. Click Finish. Normalized columns are computed and added to the data set.
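The three fold-change measures can be sketched in Python as follows. These definitions follow common microarray conventions and are assumptions: the text does not give Spotfire's formulas, the function names are hypothetical, and dividing log ratios by their standard deviation across genes is only one possible reading of "log ratio in standard deviation units."

```python
import math
from statistics import stdev

def signed_ratio(sample, control):
    """Ratio >= 1 is reported as is; ratio < 1 is reported as -1/ratio."""
    ratio = sample / control
    return ratio if ratio >= 1 else -1.0 / ratio

def log_ratio(sample, control):
    """Base-2 logarithm of the sample-to-control ratio."""
    return math.log2(sample / control)

def in_sd_units(log_ratios):
    """Assumed reading: divide each log ratio by the SD of all log ratios."""
    sd = stdev(log_ratios)
    return [lr / sd for lr in log_ratios]

# Hypothetical sample and control intensities for three genes.
sample = [400.0, 100.0, 250.0]
control = [100.0, 400.0, 250.0]
signed = [signed_ratio(s, c) for s, c in zip(sample, control)]
logs = [log_ratio(s, c) for s, c in zip(sample, control)]
scaled = in_sd_units(logs)
```

A fourfold increase and a fourfold decrease appear as 4 and -4 on the signed-ratio scale and as 2 and -2 on the log2 scale, so both measures treat up- and down-regulation symmetrically.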
For Z-score calculation (also see Background Information)
2b. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
preparation>Normalization. The normalization dialog box 1(2) opens.
3b. Click Z-score Normalization and then click the Next> button. The Normalization
dialog box 2(2) opens.
4b. Select the “Value columns” on which to perform the operation.
5b. Click a radio button to select whether to work with “All records” or “Selected
records.”
6b. Select a method from the “Replace empty values with” drop-down list. “Constant”
allows the user to replace empty values with a constant value; “Row average”
replaces empty values by the average for the entire row; and “Row interpolation”
sets the missing value to the interpolated value between the two neighboring values
in the row.
7b. Check the “Overwrite existing columns” check box if it is desirable to overwrite
the previous column generated by this method. If this check box is deselected, the
previous column is retained.
8b. Select the “Add mean column” check box if it is desirable to add a column with the
mean of each gene.
9b. Select the “Add standard deviation” check box if it is desirable to add a column
with the standard deviation of each gene.
10b. Select the “Add coefficient of variation” check box if it is desirable to add a column
with the coefficient of variation of each gene.
11b. Click a radio button to select whether to calculate the Z-scores from “All genes”
or “Genes from Portfolio.” Selecting the latter option opens a portfolio dialog box
where one can choose a number of records or lists from which to calculate Z-score.
Choose a list and go back to the Normalization dialog.
12b. Click Finish. Columns containing normalized data are added to the data set.
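The Z-score calculation, together with the optional mean, standard deviation, and coefficient of variation columns, can be sketched per gene as follows. The function names and values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which form Spotfire uses.

```python
from statistics import mean, stdev

def z_scores(row):
    """Z-score of each value: (value - row mean) / row standard deviation."""
    m, sd = mean(row), stdev(row)
    return [(v - m) / sd for v in row]

def coefficient_of_variation(row):
    """Standard deviation of the row divided by its mean."""
    return stdev(row) / mean(row)

# Hypothetical expression values for one gene across three experiments.
gene = [2.0, 4.0, 6.0]
z = z_scores(gene)  # row mean is 4.0, sample SD is 2.0
cv = coefficient_of_variation(gene)
```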
BASIC
PROTOCOL 4
ROW SUMMARIZATION
The row summarization tool allows users to combine values from multiple columns
(experiments) into a single column. Measures such as averages, standard deviations,
and coefficients of variation of groups of columns can be calculated. Since microarray
experiments are typically performed in multiple replicates, this tool serves to summarize
those experiments and determine the extent of variability.
1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1 or
Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.
2. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data
Preparation>Row summarization (Fig. 7.8.20). The “Row summarization” dialog
box (Fig. 7.8.21) is displayed.
3. Create the appropriate number of groups using the New Groups tool. Move the
desired value columns to suitable groups in the “Grouped value columns” list. To
determine the average per row of n columns, create a new group in the “Grouped
value columns” list, and then select it. Click to select all of the n columns in the
value columns list and then click the Add button. In this manner, several groups
can be summarized simultaneously. At least two value columns must be present in
any “Grouped value columns” for this tool to work. Clicking on “Delete group”
deletes the selected group and its contents (value columns) are transferred to the
bottom of the “Value columns” list (Fig. 7.8.21).
4. Select a group and click on Rename Group to edit the group name. This is important
because the default column names are names of the original columns followed by
the chosen comparison measure in parentheses. When dealing with a number of
experiments, this sort of nomenclature can be problematic. Therefore, it is advisable
to choose meaningful group names at this stage.
5. Click a radio button to select whether to work with “All records” or “Selected
records.”
Figure 7.8.20 The Row Summarization Tool is displayed.
Figure 7.8.21 Row Summarization dialog allows the users to choose the value columns on which
to perform the summarization, as well as other variables such as which measure (e.g., Average,
Standard Deviation) to use.
6. Select a method from the “Replace empty values” drop-down list. “Constant” allows
the user to replace empty values with a constant value; “Row average” replaces
empty values by the average for the entire row; and “Row interpolation” sets the
missing values to the interpolated value between the two neighboring values in the
row.
7. Select a “Summarization measure” (e.g., average, standard deviation, variance, min,
max, median) from the list box and click on OK.
8. Results are added to the dataset and new query devices created.
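Row summarization of replicate groups, including the “Row average” option for empty values, can be sketched in Python as follows. The function names, group names, and data are hypothetical, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def fill_row_average(row):
    """Replace empty values (None) with the average of the non-empty row values."""
    avg = mean(v for v in row if v is not None)
    return [avg if v is None else v for v in row]

def summarize(row, groups):
    """Summarize one gene per group of replicate columns.

    groups -- dict mapping group name to column indices (at least two each)
    """
    filled = fill_row_average(row)
    summary = {}
    for name, indices in groups.items():
        vals = [filled[i] for i in indices]
        avg, sd = mean(vals), stdev(vals)
        summary[name] = {"average": avg, "sd": sd, "cv": sd / avg}
    return summary

# One hypothetical gene measured in three control and three treated replicates,
# with one empty value among the treated replicates.
row = [10.0, 14.0, 12.0, 20.0, None, 28.0]
groups = {"control": [0, 1, 2], "treated": [3, 4, 5]}
result = summarize(row, groups)
```

Note that the row average used to fill the empty value is taken over the entire row, so it mixes both groups; this follows the wording of the “Row average” option above.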
COMMENTARY
Background Information
Normalization methods
Normalize by mean. The mean intensity of
one variable (in two-color arrays) is adjusted so
that it is equal to the mean intensity of the control variable (logR − logG = 0, where R and
G are the sum of intensities of each variable).
This can be achieved in two ways: rescaling
the experimental intensity to a baseline control
intensity that remains constant, or rescaling
without designating a baseline so that intensity
levels in both channels are mutually adjusted.
Normalize by trimmed mean. This method
works in a manner that is essentially similar to
normalization by mean, with the exception that
the trimmed mean for a variable is based on all
values except a certain percentage of the lowest
and the highest values of that variable. This
has the effect of reducing the effect of outliers
during normalization. Setting the trim value to
10%, for example, excludes the top 5% and the
bottom 5% values from the calculation. Once
again, the normalization can be performed with
and without a baseline.
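Normalization by mean or trimmed mean, with and without a baseline, can be sketched as below. The function names are hypothetical, and using the average of the two channel means as the common target when no baseline is designated is an assumption made for illustration; Spotfire's exact choice of target is not stated here.

```python
import numpy as np

def trimmed_mean(values, trim=0.10):
    """Mean after dropping trim/2 of the lowest and trim/2 of the highest values."""
    v = np.sort(np.asarray(values, dtype=float))
    k = int(np.floor(v.size * trim / 2))  # number of values cut from each tail
    return v[k:v.size - k].mean()

def normalize_to_baseline(experiment, control, trim=0.0):
    """Rescale `experiment` so its (trimmed) mean equals that of `control`."""
    factor = trimmed_mean(control, trim) / trimmed_mean(experiment, trim)
    return np.asarray(experiment, dtype=float) * factor

def normalize_mutually(ch1, ch2, trim=0.0):
    """Rescale both channels to a shared target (assumed: average of the two means)."""
    m1, m2 = trimmed_mean(ch1, trim), trimmed_mean(ch2, trim)
    target = (m1 + m2) / 2.0
    return (np.asarray(ch1, dtype=float) * (target / m1),
            np.asarray(ch2, dtype=float) * (target / m2))
```

With `trim=0.0` the functions reduce to plain normalization by mean; with `trim=0.10` the top 5% and bottom 5% of values are excluded when the scaling factor is computed, although every value is still rescaled.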
Normalize by percentile. The X-percentile
is the value in a data set that X% of the data
are lower than or equal to. One common way
to control for systemic bias in microarrays is
normalizing to the distribution of all genes—
i.e., normalizing by percentile value. Signal
strength of all genes in sample X is therefore
normalized to a specified percentile of all of the
measurements taken in sample X. If the chosen
percentile value is very high (∼85-percentile),
the corresponding data point lies sufficiently
far away from the origin that a good line can be
drawn through all the points. The slopes of the
line for each variable are then used to rescale
each variable. One caveat of this sort of normalization is that it assumes that the median
signal of the genes on the chip stays relatively
constant throughout the experiment. If the total number of expressed genes in the experiment changes dramatically due to true biological activity (causing the median of one chip
to be much higher than another), then the true
expression values have been masked by normalizing to the median of each chip. For such
an experiment, it may be desirable to consider
normalizing to something other than the median, or one may want to instead normalize to
positive controls.
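In its simplest form, percentile normalization divides every measurement in a sample by that sample's chosen percentile. The sketch below assumes a genes-by-samples matrix and uses NumPy's default interpolating percentile, which may differ slightly from the "lower than or equal to" definition given above; the function name is hypothetical.

```python
import numpy as np

def normalize_by_percentile(data, percentile=50.0):
    """Divide each column (sample) by its own percentile value.

    data: 2-D array, rows are genes and columns are samples.
    """
    data = np.asarray(data, dtype=float)
    ref = np.nanpercentile(data, percentile, axis=0)  # one reference value per sample
    return data / ref
```

Two samples that differ only by a constant scaling factor become identical after this step, which is the systematic bias the method is meant to remove. The caveat discussed above still applies: if the chosen percentile itself shifts for biological reasons, dividing by it masks real changes.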
Scale between 0 and 1. If the intent of a microarray experiment is to study the data using
clustering, the user may need to put different
genes on a single scale of variation. Normalizations that may accomplish this include scaling between 0 and 1. Gene expression values
are scaled such that the smallest value for each
gene becomes 0 and the largest value becomes
1. This method is also known as Min-Max normalization.
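Scaling between 0 and 1 can be written per gene as (value - min) / (max - min). A minimal sketch, assuming a genes-by-hybridizations matrix; mapping constant rows to 0 is an illustrative choice to avoid division by zero, not documented Spotfire behavior.

```python
import numpy as np

def scale_0_1(data):
    """Min-max scale each row (gene) so its smallest value is 0 and its largest is 1."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=1, keepdims=True)
    hi = data.max(axis=1, keepdims=True)
    span = hi - lo
    span[span == 0] = 1.0  # constant rows: avoid 0/0, result is all zeros
    return (data - lo) / span
```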
Subtract the mean. This method is generally used in the context of log-transformed
data. This will replace each value by [value –
mean (expression values of the gene across hybridizations)]. Mean and median centering are
useful transformations because they reduce the
effect of highly expressed genes on a dataset,
thereby allowing the researcher to detect interesting effects in weakly expressed genes.
Subtract the median. This method is also
generally used in the context of log-transformed data and has a similar effect to
mean centering, but is more robust and less
susceptible to the effect of outliers. This will
replace each value by [value – median (expression values of the gene across hybridizations)].
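Mean and median centering differ only in the statistic subtracted from each row. The sketch below assumes log-transformed data arranged as genes by hybridizations; the function name is hypothetical.

```python
import numpy as np

def center_rows(log_data, how="mean"):
    """Subtract each gene's mean or median (across hybridizations) from its values."""
    log_data = np.asarray(log_data, dtype=float)
    if how == "mean":
        center = log_data.mean(axis=1, keepdims=True)
    elif how == "median":
        center = np.median(log_data, axis=1, keepdims=True)
    else:
        raise ValueError(f"unknown centering: {how}")
    return log_data - center
```

For a row such as [1, 2, 9], the single large value pulls the mean to 4 but leaves the median at 2, which illustrates why median centering is less sensitive to outliers.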
Fold change as signed ratio. This is essentially similar to normalization by mean. A fold
change for a gene under two different conditions (or chips) is created. If there are n
genes and five variables A, B, C, D, and E,
assuming that variable A is considered baseline, the normalized value of $e_i$ for the variable
E in the $i$th gene is calculated as $\mathrm{Norm}\,e_i = e_i / a_i$,
where $a_i$ is the value of variable A in the
$i$th gene.
Fold change as log ratio. If there are n
genes and five variables (A, B, C, D, and E),
assuming that variable A is considered baseline, the normalized value of $e_i$ for the variable E
in the $i$th gene is calculated as $\mathrm{Norm}\,e_i = \log(e_i / a_i)$,
where $a_i$ is the value of variable A in the
$i$th gene.
Fold change as log ratio in standard deviation units. If there are n genes and five variables
(A, B, C, D, and E), assuming that variable A
is considered baseline, the normalized value of $e_i$
for the variable E in the $i$th gene is calculated
as $\mathrm{Norm}\,e_i = \frac{1}{\mathrm{Std}(x)} \log(e_i / a_i)$, where $\mathrm{Std}(x)$
is the standard deviation of a matrix of log ratios of all signal values for the corresponding
record.
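The three fold-change variants can be computed together from a genes-by-variables matrix. The sketch below makes several assumptions that the text leaves open: logarithms are taken to base 2, Std(x) is the sample standard deviation (ddof=1) of each record's log ratios over the non-baseline variables, and the function name is hypothetical.

```python
import numpy as np

def fold_changes(data, baseline=0):
    """Return (ratio, log2 ratio, log2 ratio in SD units) relative to a baseline column.

    data: 2-D array, rows are genes and columns are variables (A, B, C, ...).
    baseline: index of the baseline variable (A).
    """
    data = np.asarray(data, dtype=float)
    a = data[:, [baseline]]          # baseline value a_i for each gene
    ratio = data / a                 # e_i / a_i
    log_ratio = np.log2(ratio)       # log(e_i / a_i), base 2 assumed
    # Per-record SD of the log ratios, excluding the baseline column (always 0).
    others = np.delete(log_ratio, baseline, axis=1)
    sd = others.std(axis=1, ddof=1, keepdims=True)
    return ratio, log_ratio, log_ratio / sd
```

Records whose log ratios are all identical have a standard deviation of zero and yield infinite or undefined values in the third output; such records would need to be filtered beforehand.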
Z-score calculation. Z-score provides a way
of standardizing data across a wide range of
experiments and allows the comparison of microarray data independently of the original hybridization intensities. This normalization is
also typically performed in log space. Each
gene is normalized by subtracting the mean (or the median) of its expression levels
across all experiments from each expression level, and then dividing by the
standard deviation. This weights the expression
levels in favor of those records that have lower
variance.
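A per-gene z-score takes the form (value - center) / SD, where the center is the mean or median across experiments. The sketch below assumes log-transformed data and the sample standard deviation (ddof=1); both the function name and the choice of ddof are assumptions.

```python
import numpy as np

def z_score_rows(log_data, center="mean"):
    """Standardize each gene (row) across experiments (columns)."""
    x = np.asarray(log_data, dtype=float)
    if center == "mean":
        c = x.mean(axis=1, keepdims=True)
    elif center == "median":
        c = np.median(x, axis=1, keepdims=True)
    else:
        raise ValueError(f"unknown center: {center}")
    sd = x.std(axis=1, ddof=1, keepdims=True)
    return (x - c) / sd
```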
Literature Cited
Cheok, M.H., Yang, W., Pui, C.H., Downing, J.R., Cheng, C., Naeve, C.W., Relling, M.V., and Evans, W.E. 2003. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet. 34:85-90.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-537.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87.
Jolliffe, I.T. 1986. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, New York.
Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183-201.
Kozal, M.J., Shah, N., Shen, N., Yang, R., Fucini, R., Merigan, T.C., Richman, D.D., Morris, D., Hubbell, E., Chee, M., and Gingeras, T.R. 1996. Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nat. Med. 2:753-759.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K., and Young, R.A. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799-804.
Leung, Y.F. and Cavalieri, D. 2003. Fundamentals of cDNA microarray data analysis. Trends Genet. 19:649-659.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability 1967:281-297.
Sankoff, D. and Kruskal, J.B. 1983. Time Warps, String Edits, and Macromolecules. The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467-470.
Schena, M., Heller, R.A., Theriault, T.P., Konrad, K., Lachenmeier, E., and Davis, R.W. 1998. Microarrays: Biotechnology's discovery platform for functional genomics. Trends Biotechnol. 16:301-306.
Smyth, G.K. and Speed, T. 2003. Normalization of cDNA microarray data. Methods 31:265-273.
Smyth, G.K., Yang, Y.H., and Speed, T. 2003. Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. 224:111-136.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281-285.
Yang, Y., Buckley, M.J., Dudoit, S., and Speed, T.R. 2002. Comparison of methods for image analysis on cDNA microarray data. J. Comp. Stat. 11:108-136.
Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R. 2002. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1:133-143.
Contributed by Deepak Kaushal and
Clayton W. Naeve
Hartwell Center for Bioinformatics and
Biotechnology
St. Jude Children’s Research Hospital
Memphis, Tennessee
Analyzing and Visualizing Expression
Data with Spotfire
UNIT 7.9
Spotfire DecisionSite (http://hc-spotfire.stjude.org/spotfire/support/manuals/manuals.jsp)
is a powerful data mining and visualization program with applications in many disciplines. Modules are available in support of gene expression analysis, proteomics, general
statistical analysis, chemical lead discovery analysis, geology, as well as others. Here
the focus is on Spotfire’s utility in analyzing gene expression data obtained from DNA
microarray experiments. Other units in this manual present a general overview of the
Spotfire environment along with the hardware and software requirements for installing
it (UNIT 7.7), and how to load data into Spotfire for analysis (UNIT 7.8). This unit presents
numerous methods for analyzing microarray data. Specifically, Basic Protocol 1 and Alternate Protocol 1 describe two methods for identifying differentially expressed genes.
Basic Protocol 2 discusses how to conduct a profile search. Additional protocols illustrate various clustering methods, such as hierarchical clustering (see Basic Protocol
4 and Alternate Protocol 2), K-means clustering (see Basic Protocol 5), and Principal
Components Analysis (see Basic Protocol 6). A protocol explaining coincidence testing
(see Basic Protocol 3) allows the reader to compare the results from multiple clustering
methods. Additional protocols demonstrate querying the Internet for information based
on the microarray data (see Basic Protocol 7), mathematically transforming data within
Spotfire to generate new data columns (see Basic Protocol 8), and exporting final Spotfire
visualizations (see Basic Protocol 9).
Spotfire (Functional Genomics module) can import data in nearly any format, but the authors have focused here on two popular microarray platforms, the commercial GeneChip
microarray data (Affymetrix) and two-color spotted microarray data produced using
GenePix software (Axon). Spotfire facilitates the seamless import of Affymetrix output
files (.met) from Affymetrix MAS v4.0 or v5.0 software. The .met file is a tab-delimited
text file containing information about attributes such as probe set level, gene expression
levels (signal), detection quality controls (p-value and Absence/Presence calls), and so
forth. In the illustration below, the authors use MAS 5.0 .met files as an example. Several
types of spotted arrays and their corresponding data types exist, including commercial
vendors (e.g., Agilent, Motorola, and Mergen) that supply spotted microarrays for various
organisms as well as facilities that manufacture their own chips. Several different scanners
and scanning software packages are available. One of the more commonly used scanners
is the Axon GenePix. GenePix data files are in a tab-delimited text format (.gpr), which
can be directly imported into a Spotfire session.
NOTE: This unit assumes the reader is familiar with the Spotfire environment, has successfully installed Spotfire, and has uploaded and prepared data for analysis. For further
information regarding these tasks, please see UNITS 7.7 & 7.8.
BASIC
PROTOCOL 1
IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES USING
t-TEST/ANOVA
The treatment comparison tool provides methods for distinguishing between different
treatments for an individual record. There are two types of treatment comparison algorithms: t-test/ANOVA (Kerr and Churchill, 2001) and Multiple Distinction (Eisen et al.,
1998). Both algorithms seek to identify differentially expressed genes based on their
expression values.
Contributed by Deepak Kaushal and Clayton W. Naeve
Current Protocols in Bioinformatics (2004) 7.9.1-7.9.43
Copyright © 2004 by John Wiley & Sons, Inc.
The t-test is a commonly used method for evaluating whether the difference between the means of
two groups is statistically significant. Analysis of variance (ANOVA) works on the same principle but can be used
to differentiate between more than two groups. ANOVA calculates the variance within a
group and compares it to the variance between the groups. The null hypothesis
assumes that the mean expression level of a gene does not differ between the
groups. The null hypothesis is then either rejected or not rejected for each gene under consideration. The results are expressed in terms of a p-value, which is the observed significance
level, i.e., the probability of observing a difference at least this large in the mean
expression values of a given gene when in fact there is no difference. Concluding that a difference exists when there is none is a type I error. If the p-value is
below a certain threshold, usually 0.05, it is considered that a significant difference exists.
The lower the p-value, the stronger the evidence for a difference; the p-value does not by itself measure the size of the difference. The ANOVA algorithm in Spotfire has
a one-way layout; therefore it can only be used to discriminate between groups based
on one variable. Further, this algorithm assumes the following: (1) the data are normally
distributed and (2) the variances of the separate groups are similar. Violating these
assumptions can lead to erroneous results. One way to bring the data closer to a normal
distribution is to log transform the data (UNIT 7.8).
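The per-gene calculation can be reproduced outside Spotfire for checking purposes. The sketch below uses SciPy's `scipy.stats.ttest_ind` (equal-variance form) for two groups and `scipy.stats.f_oneway` for more than two; it is an illustration of the general technique, and whether Spotfire's tool matches these routines numerically is an assumption. The function name and the column-index representation of groups are hypothetical.

```python
import numpy as np
from scipy import stats

def treatment_p_values(data, groups):
    """Compute one p-value per gene from grouped replicate columns.

    data:   2-D array (genes x arrays), ideally log-transformed.
    groups: list of lists of column indices, one list per treatment group,
            each with at least two columns.
    """
    data = np.asarray(data, dtype=float)
    pvals = np.empty(data.shape[0])
    for i, row in enumerate(data):
        samples = [row[idx] for idx in groups]
        if len(samples) == 2:
            # Student's t-test, which assumes similar variances in both groups.
            _, p = stats.ttest_ind(samples[0], samples[1])
        else:
            # One-way ANOVA across all groups.
            _, p = stats.f_oneway(*samples)
        pvals[i] = p
    return pvals
```

When thousands of genes are tested at a threshold of 0.05, a substantial number of low p-values are expected by chance alone, so the resulting list is best treated as a ranking of candidates.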
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Click Analysis, followed by Pattern Detection, followed by Treatment Comparison
in the Tools pane of DecisionSite Navigator (Fig. 7.9.1).
Figure 7.9.1
The Treatment Comparison tool is shown.
The treatment comparison dialog-box is displayed and all available columns are listed in
the Value Columns field. (Note that if the tool has been used before, it retains the earlier
grouping and the user will have to delete it.) Value Columns are the original data columns
that have been uploaded into the Spotfire session. Any data column can be used as a value
column as long as it includes integers or real numbers.
2. Use the following procedure to move and organize the desired value columns into
the Grouped Value Columns field, which displays columns that the user has defined
as being part of a group (e.g., replicate microarrays) on which the calculation is to
be performed.
Note that at least two columns should be present in every group for the tool to be able to
perform its calculations.
a. Select the desired column. Click the Add button.
The column will end up in the selected group of the Grouped Value Columns field.
b. Click New Group to add a group or Delete Group to remove a group.
If the deleted group contained any value columns, they are moved back to the Value
Columns field (Fig. 7.9.2).
c. Click Rename Group to open the edit group name dialog box, which can be
used to rename a group.
It is useful to rename the groups to something meaningful because the default names
are Group1, Group2, and so on.
3. From the same dialog box, choose whether All Records or Selected Records are to
be used.
Choosing All Records causes all records that were initially uploaded into Spotfire to be
used for the calculations. If any preprocessing or filtering steps have been performed and
the user would like to exclude the filtered-out records from the calculations, the user should choose
Selected Records.
Figure 7.9.2 The Treatment Comparison dialog box allows the users to group various Value
Columns into different groups on which t-test/ANOVA is to be performed.
Figure 7.9.3 A profile chart is generated to display the results of t-test/ANOVA analysis. The “t-test/ANOVA Query Device” (a range slider) can be manipulated to identify highly significant genes.
The profile chart is colored in the Continuous Coloring mode based on the t-test/ANOVA p-values.
4. If there are empty values in the data, select a method to replace empty values from
the following choices in the drop-down list:
Choice                    Replaces empty values with
Constant Numeric Value    Specified value
Row Average               Average of all the values in the row
Row Interpolation         Interpolated value of the two neighboring values
5. Select “t-test/ANOVA” from the Comparison Measure list box.
6. Type a new identifier in the Column Name text box or use the default. Check the
Overwrite box to replace the values of a previously named column. If the user wishes
not to overwrite, make sure that the Overwrite check box is unchecked.
7. Click OK.
This will add a new column containing p-values to the data set and create a new Profile
Chart visualization. The profiles are ordered by the group with the lowest p-value setting
(Fig. 7.9.3).
ALTERNATE
PROTOCOL 1
IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES USING
DISTINCTION CALCULATION
The distinction calculation algorithm (Eisen et al., 1998) is slightly different from that
of t-test/ANOVA (see Basic Protocol 1). It is a measure of how distinct the expression
level is between two parts of a profile. The Distinction Calculation algorithm divides the
variables (columns) within a row into two groups. A distinction value is then calculated
for each row based on the two groups of values. The distinction value is a measure of how
distinct the difference in expression level is between two parts of the row (e.g., tumor
cells versus normal cells).
The algorithm divides the variables in the profile data into groups based on factors such as
type of tissue and tumor, and looks for genes that show a distinct difference in expression
level between them. The profiles are compared to an idealized pattern to identify genes
closely matching that pattern. One such idealized pattern could be where the expression
level is uniformly high for one group of experiments and uniformly low for another
group for the given gene. The calculated distinction value is a measure of
how similar each profile is to this ideal. Profiles that have high expression values in
the first group and low expression values in the second are given high positive distinction
values. Likewise, profiles that have low expression values in the first group and high
expression values in the second are given high negative distinction values.
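One statistic with the sign behavior described for the distinction value is the signal-to-noise measure of Golub et al. (1999): the difference of the two group means divided by the sum of the two group standard deviations. The sketch below implements that statistic as an illustration; whether Spotfire's Distinction Calculation uses exactly this formula is an assumption, as are the function name and the use of the sample standard deviation (ddof=1).

```python
import numpy as np

def distinction(data, group1, group2):
    """Signal-to-noise style distinction value for each gene (row).

    data:   2-D array (genes x experiments).
    group1: column indices of the first group (e.g., tumor samples).
    group2: column indices of the second group (e.g., normal samples).
    """
    data = np.asarray(data, dtype=float)
    g1, g2 = data[:, group1], data[:, group2]
    # Positive when group 1 is higher than group 2, negative when lower.
    numerator = g1.mean(axis=1) - g2.mean(axis=1)
    denominator = g1.std(axis=1, ddof=1) + g2.std(axis=1, ddof=1)
    return numerator / denominator
```

A gene that is uniformly high in the first group and uniformly low in the second has a large numerator and a small denominator, and therefore a high positive value; swapping the groups flips the sign.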
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Click Analysis, followed by Pattern Detection, followed by Treatment Comparison
in the Tools pane of DecisionSite Navigator (Fig. 7.9.1).
The Treatment Comparison dialog-box is displayed (Fig. 7.9.4) and all available columns
are listed in the Value Columns field. (Note that if the tool has been used before, it retains
the earlier grouping and the user will have to delete it.) Value Columns are the original
data columns that have been uploaded into the Spotfire session. Any data column can be
used as a value column provided it includes integers or real numbers.
2. Organize columns, choose records, and fill empty values as described (see Basic
Protocol 1, steps 2 to 4).
3. Select Distinction/Multiple Distinction from the Comparison Measure list box and
click OK (Fig. 7.9.4).
Figure 7.9.4 The Treatment Comparison dialog box allows the users to group various Value
Columns into different groups on which Multiple Distinction is to be performed.
Figure 7.9.5 Results of Multiple Distinction are originally displayed in a profile chart. The users
can however build a heat map based on these results. (A) A set of genes on the basis of which
eight experiments can be distinctly identified using the Multiple Distinction algorithm. (B) A zoomed
in version of the same heat map.
This will add new columns containing distinction values to the data set, and a new profile
visualization will be created. The profiles are ordered by the group with the lowest value
(highest distinction).
4. Use these results to order a heat map based on the results of the Distinction/Multiple
Distinction for better visualization and identification of genes with different profiles
in different samples (Fig. 7.9.5).
A heat map is a false color image of a data set (e.g., microarray data) which allows users
to detect the presence of certain patterns in the data. Heat maps resemble a spreadsheet in
which each row represents a gene present on the microarray and each column represents a
microarray experiment. By coloring the heat map according to signal or log ratio values,
trends in the behavior of genes across experiments can be identified.
BASIC
PROTOCOL 2
IDENTIFICATION OF GENES SIMILAR TO A GIVEN PROFILE:
THE PROFILE SEARCH
In a profile search, all profiles (i.e., all data-points or rows) are ranked according to
their similarity to a master. The similarity between each of the profiles and the master
is then calculated according to one of the available similarity measures. Spotfire adds a
new data column with values for each individual profile (index of similarity) and a rank
column, which enables users to identify numerous genes that have profiles similar to
the master-profile. In order to successfully use this algorithm, the user must specify the
following.
A gene to be used as a master-profile. A profile search is always based on a master profile.
Spotfire allows users to designate an existing and active profile as the master. Alternatively,
a new master-profile can be constructed by averaging several active profiles. It is possible
to edit the designated master-profile using the built-in editor function before embarking
on profile search (Support Protocol 1).
A similarity measure to be used. Similarity measures express the similarity between
profiles in numeric terms, thus enabling users to rank profiles according to their similarity.
Available methods include Euclidean Distance, Correlation, Cosine Correlation, City Block Distance, and Tanimoto (Sankoff and Kruskal, 1983).
Whether to include or exclude empty values from the calculation. If a profile contains a
missing value and the user opts to exclude empty values, the calculated similarity between
the profiles is then based only on the remaining part of the profile.
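The three choices above (master profile, similarity measure, and handling of empty values) can be illustrated with a short sketch. The formulas are the standard textbook definitions of each measure, with the continuous form of the Tanimoto coefficient; Spotfire may scale or transform them differently, and the function names are hypothetical. Empty values are assumed to be encoded as NaN and are excluded pairwise.

```python
import numpy as np

def similarity(master, profile, measure="euclidean"):
    """Similarity (or distance) between a master profile and one profile."""
    x = np.asarray(master, dtype=float)
    y = np.asarray(profile, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # exclude positions with empty values
    x, y = x[keep], y[keep]
    if measure == "euclidean":
        return np.sqrt(np.sum((x - y) ** 2))
    if measure == "city_block":
        return np.sum(np.abs(x - y))
    if measure == "correlation":
        return np.corrcoef(x, y)[0, 1]
    if measure == "cosine":
        return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    if measure == "tanimoto":
        return (x @ y) / (x @ x + y @ y - x @ y)
    raise ValueError(f"unknown measure: {measure}")

def profile_search(data, master, measure="euclidean"):
    """Score every row against the master and rank them (rank 1 = most similar)."""
    scores = np.array([similarity(master, row, measure) for row in data])
    # Distances rank ascending; correlation-like measures rank descending.
    distance_like = measure in ("euclidean", "city_block")
    order = np.argsort(scores if distance_like else -scores)
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return scores, ranks
```

A master profile built from several marked records could be obtained by averaging the corresponding rows, for example with `np.nanmean(data[marked_rows], axis=0)`, where `marked_rows` is a hypothetical list of row indices.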
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Figure 7.9.6 The Profile Search dialog box allows users to choose Value Columns to be used for
this calculation as well as variables such as Similarity Measure and Calculation Options.
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Activate the profile to be used as the master in the Profile Chart/Diagram view.
Alternatively, mark a number of profiles on which to base the master profile.
2. If changing the master profile is desired, or to create a totally new profile, edit the
master profile as described (see Support Protocol 1).
3. Click on Analysis, followed by Pattern Detection, followed by Profile Search in the
Tools pane of the DecisionSite Navigator.
A Profile Search dialog box will appear (Fig. 7.9.6).
4. Select the Value Columns on which to perform the profile search. For multiple selections, hold down the Ctrl key while continuing to click the desired columns.
5. Click a radio button to choose to work with All Records or Selected Records (see
Basic Protocol 1, step 3).
6. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol 1, step 4).
7. If both marked records and an active record exist, select whether to use profile from
the Active Record or Average from Marked Records.
Only one record can be activated at a time (by clicking on the record in any visualization).
An active record appears with a black circle around it. Several or all records present
can be marked by clicking and drawing around them in any visualization. Marked data
corresponding to these records can then be copied to the clipboard. See UNIT 7.7 for more
information.
Following this selection, the selected profile is displayed in the profile editor along with
its name. At this point, the profile and its name can be edited in any manner desired.
8. Select the Similarity Measure to be used.
For a detailed description on similarity measures, see Sankoff and Kruskal (1983).
9. Type a Column Name for the resulting column or use the default. Check the Overwrite
box if appropriate (see Basic Protocol 1, step 6).
10. Click OK.
This will cause the search to be performed and displayed in the editor, and the results
to be added to the dataset as a new column. Additionally, a new scatter plot is created
which displays rank versus similarity, and annotations containing information about the
calculation settings are added to the Visualization.
At the end of the profile search, selected profiles in the data are ranked according to their
similarity to the selected master profile.
11. If desired, create a scatter plot between Similarity and Similarity Rank.
In such a plot, the record that is most similar to the master profile will be displayed in the
lower left corner of the visualization.
SUPPORT
PROTOCOL 1
EDITING A MASTER PROFILE
Since the starting profile does not restrict the user in any fashion, one can modify existing
values to create a master profile of their choice.
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Figure 7.9.7 The Profile Search: Edit dialog box allows users to edit an existing profile to create
an imaginary profile upon which to base the search.
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Activate the profile to be used for creating an edited master profile by simply clicking
on the profile in the Profile Chart visualization.
2. Click Analysis, followed by Pattern Detection, followed by Profile Search in the
Tools pane of the DecisionSite Navigator.
A profile search dialog box will appear.
3. Select the Value Columns on which to perform the profile search. For multiple selections,
hold down the Ctrl key while continuing to click on the desired columns.
4. Click Edit.
This will open the profile search edit dialog box (Fig. 7.9.7).
5. Click directly in the editor to activate the variable to be changed. Drag the value
to obtain a suitable look on the profile. Delete any undesirable value(s) using the
Delete key on the keyboard.
The new value will be instantaneously displayed in the editor.
6. Type a profile name in the text box or use the default name.
7. Click OK.
This closes the editor and shows the edited profile in the profile search dialog box (Fig.
7.9.6).
8. If desired, revert to the original profile by clicking Use Profile From: Active Record.
The Edited radio button is selected by default.
BASIC
PROTOCOL 3
COINCIDENCE TESTING
This tool compares two category columns and determines whether the apparent similarity
between the two distributions is likely to be a coincidence. Essentially, the coincidence
testing tool calculates the probability, under the null hypothesis, of obtaining an outcome
at least as extreme as the one observed (Tavazoie et al., 1999). This tool is particularly
useful for comparing the results of several different clustering methods (e.g., see Basic
Protocols 4 and 5, and Alternate Protocol 2).
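The protocol does not state which statistic Spotfire uses internally. As a rough illustration only, the sketch below assumes a hypergeometric tail probability, the kind of test Tavazoie et al. (1999) applied to cluster enrichment. For each record it asks how likely it is that the record's two categories would share at least as many records as observed if the two columns were unrelated. The function names (`hypergeom_tail`, `coincidence_p_values`) and the data are hypothetical and not part of DecisionSite.

```python
from collections import Counter
from math import comb


def hypergeom_tail(total, in_a, in_b, overlap):
    """P(X >= overlap) when in_b records are drawn without replacement
    from `total` records, of which in_a belong to the first category."""
    ways = sum(
        comb(in_a, i) * comb(total - in_a, in_b - i)
        for i in range(overlap, min(in_a, in_b) + 1)
    )
    return ways / comb(total, in_b)


def coincidence_p_values(first_column, second_column):
    """One p-value per record, based on the overlap of its two categories."""
    total = len(first_column)
    count_a = Counter(first_column)
    count_b = Counter(second_column)
    count_ab = Counter(zip(first_column, second_column))
    return [
        hypergeom_tail(total, count_a[a], count_b[b], count_ab[(a, b)])
        for a, b in zip(first_column, second_column)
    ]


# Two clusterings of eight genes that agree perfectly (illustrative data).
method_1 = ["c1"] * 4 + ["c2"] * 4
method_2 = ["k1"] * 4 + ["k2"] * 4
p_values = coincidence_p_values(method_1, method_2)
print(p_values[0])  # a small value: such complete overlap is unlikely by chance
```

A small p-value suggests that the agreement between the two category columns is unlikely to be a coincidence; a value near 1 suggests that the overlap is no larger than chance would produce.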
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Click on Analysis, followed by Pattern Detection, followed by Coincidence Testing
in the Tools pane of the DecisionSite Navigator.
A dialog box will be displayed (Fig. 7.9.8).
2. Select the First Category Column.
For example, in comparing the results of two different clustering methods, select the first
one here.
3. Select the Second Category Column.
4. Select whether to work with All Records or Selected Records (see Basic Protocol 1,
step 3).
5. Type a Column Name for the resulting column or use the default.
Figure 7.9.8 The Coincidence Testing dialog box.
6. Select the Overwrite check box to overwrite a previous column with the same name
(see Basic Protocol 1, step 6).
7. Click OK.
A new results column containing p-values is added to the dataset. An annotation may also
be added.
HIERARCHICAL CLUSTERING
Hierarchical clustering arranges objects in a hierarchy with a tree-like structure based on
the similarity between the objects. The graphical representation of the resulting hierarchy
is known as a dendrogram (Eisen et al., 1998). In Spotfire DecisionSite, the vertical axis
of the dendrogram consists of the individual records and the horizontal axis represents
the clustering level. The individual records in the clustered data set are represented by
the right-most nodes in the row dendrogram. Each remaining node in the dendrogram
represents a cluster of all records that lie below it to the right in the dendrogram, thus
making the left-most node in the dendrogram a cluster that contains all records. Clustering
is a very useful data reduction technique; however, it can easily be misapplied. The
clustering results are highly affected by the choice of similarity measure and other input
parameters. If possible, the user should replicate the clustering analysis using different
methods.
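To see why the choice of similarity measure matters, the following sketch compares Euclidean distance with Pearson correlation on three made-up expression profiles. The data and the `pearson` helper are illustrative assumptions and do not reproduce Spotfire's implementation.

```python
from math import dist, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


low = [1.0, 2.0, 3.0]        # rising profile, low expression
high = [10.0, 20.0, 30.0]    # same shape, ten times the magnitude
falling = [3.0, 2.0, 1.0]    # opposite shape, similar magnitude

# Euclidean distance places `low` closer to `falling` (similar magnitude) ...
print(dist(low, falling), dist(low, high))
# ... whereas correlation pairs `low` with `high` (same shape).
print(pearson(low, high), pearson(low, falling))
```

The two measures group the same three profiles differently, which is why repeating the analysis with more than one measure is a sensible check.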
BASIC
PROTOCOL 4
The algorithm used in the Hierarchical Clustering tool is a hierarchical agglomerative
method. This means that the cluster analysis begins with each record in a separate cluster,
and in subsequent steps the two clusters that are the most similar are combined into a new
aggregate cluster. The number of clusters is thereby reduced by one in each iteration step.
Eventually, all records are grouped into one large cluster.
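The agglomerative procedure described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and average linkage; both are assumed settings, the function name `agglomerate` is hypothetical, and the naive pairwise search is far slower than a production implementation.

```python
from math import dist


def agglomerate(records):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(len(records))]  # one cluster per record
    merges = []

    def average_linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist(records[i], records[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar (closest) pair of clusters.
        d, i, j = min(
            (average_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two clusters by their union: one cluster fewer per step.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


records = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
merges = agglomerate(records)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 2))
```

With n records the loop always performs n - 1 merges, and the last merge joins everything into a single cluster, which corresponds to the left-most node of the dendrogram.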
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
Initiating hierarchical clustering in Spotfire DecisionSite
1. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering in
the Tools pane of the DecisionSite Navigator (Fig. 7.9.9).
The Hierarchical Clustering dialog box is displayed.
2. Select the Value Columns on which to base clustering. For multiple selections, hold
down the Ctrl key and click on the desired columns or click on one of the columns
and drag to select (Fig. 7.9.10).
3. Select whether to work with All Records or Selected Records (see Basic Protocol 1,
step 3).
4. Select a method to Replace Empty Values from the drop-down list (see Basic
Protocol 1, step 4).
5. Select which Clustering Method to use for calculating the similarity between two
clusters.
6. Select which Similarity Measure to use in the calculations (Sankoff and Kruskal,
1983).
Correlation measures are based on profile shape and are therefore often better suited to
complex microarray studies than measures such as Euclidean distance, which are based
only on the numeric magnitude of the values.
7. Select which Ordering Function to use while displaying results.
8. Use the default name or type a new column name in the text box. Check the Overwrite
box if overwriting a previously added column with the same name (see Basic Protocol
1, step 6).
Figure 7.9.9 The Hierarchical Clustering algorithm can be accessed from the Tools as well as
the Guides menu.
Figure 7.9.10 The Hierarchical clustering dialog box allows users to specify Value Columns to be
included in the clustering calculation and various other calculation options such as the Clustering
Method and Similarity Measure.
Figure 7.9.11 Hierarchical clustering results are displayed as a (default red-green) heat map
with an associated dendrogram.
9. Select the Calculate Column Dendrogram check box if creating a column dendrogram
is desired.
A column dendrogram arranges the most similar columns (experiments) next to each other.
10. Click OK.
The hierarchical clustering dialog box will close and clustering will be initiated. The results
are displayed according to the user’s preferences in the dialog box (Fig. 7.9.11).
11. If desired, add a clustering column to the dataset in order to compare the
clustering results with other methods (see Support Protocol 2).
Marking and activating nodes
12. To mark a node in the row-dendrogram to the left of the heat map, click just outside it,
drag to enclose the node within the frame that appears, and then release. Alternatively,
press Ctrl and click on the node to mark it. To mark more than one node, hold down
the Ctrl key and click on all the nodes to be marked. To unmark nodes, click and
drag an area outside the dendrogram.
When one or more nodes are marked, that part of the dendrogram is shaded in green. The
corresponding parts are also marked in the heat map and the corresponding visualizations.
13. To activate a node, click it in the dendrogram.
A black ring appears around the node. Only one node can be active at a given time. This
node remains active until another node is activated. It is possible to zoom in on the active
node by selecting Zoom to Active from the hierarchical clustering menu.
Zooming in and resizing a dendrogram
14. Zoom to a subtree in the row-dendrogram either by using the visualization zoom bar
or by right-clicking in the dendrogram and clicking Zoom to Active in the resulting
pop-up menu. Alternatively, double-click on a node.
15. To go one step back, double-click on an area in the dendrogram not containing any
part of a node. To return to the original zoom, click Reset Zoom.
16. If desired, adjust the space occupied by the dendrogram in the visualization by
holding down the Ctrl key and using the left/right arrow keys on the keypad to
slim or widen it.
Saving a dendrogram
NOTE: Dendrograms are not saved in the Spotfire data file (.sfs) but can be saved as
.xml documents.
17. To save, select Save, followed by Row Dendrogram or Column Dendrogram from
the Hierarchical Clustering menu.
18. Type the file name and save the file as a .dnd file.
Opening a saved dendrogram
19. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering in
the Tools pane of the DecisionSite Navigator to display the Hierarchical Clustering
dialog box.
20. Click on Open to display the Dendrogram Import dialog box.
21. Click on the Browse button by the Row Dendrogram field to display an Open File
dialog box.
ADDING A COLUMN FROM HIERARCHICAL CLUSTERING
The ordering column that is added to the dataset when hierarchical clustering is performed
is used only to display the row dendrogram and connect it to the heat map. In order to
compare the results of hierarchical clustering with those of another method such as K-means
clustering (see Basic Protocol 5), a clustering column should be added to the data.
SUPPORT
PROTOCOL 2
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Perform Hierarchical Clustering on a dataset as described in Basic Protocol 4 and
locate the row dendrogram, which can be found to the left of the heat map (Fig.
7.9.11).
2. If the cluster line is not visible, right click and select View from the resulting pop-up
menu, followed by Cluster Scale.
The cluster line, which is the dotted red line in the row dendrogram, enables users to
determine the number of clusters being selected.
3. Click on the red circle on the cluster slider above the dendrogram and drag it to
control how many clusters should be included in the data column. Alternatively, use
the left and right arrow keys on the keyboard to scroll through the different number
of clusters.
Figure 7.9.12 Hierarchical Clustering visualization allows users to zoom in and out of the heat
map as well as the dendrogram. Individual or a group of clusters can be marked and a data column
added to the Spotfire session.
All clusters for the current position on the cluster slider are shown as red dots within
the dendrogram. Positioning the red circle at the right-most position on the cluster
slider yields one cluster for every record. Positioning it at the left-most position,
on the other hand, places all records in a single cluster.
4. To retain a previously added cluster column, ensure that the Overwrite check box in
the hierarchical clustering dialog is unchecked (see Basic Protocol 1, step 6).
5. Select Clustering, followed by Add New Clustering Column from the Hierarchical
Clustering menu.
A column indicating the cluster to which each record belongs will be added to the
dataset. Note that records that are not included in the row dendrogram will have
empty values in the new clustering column (Fig. 7.9.12).
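Conceptually, moving the cluster slider corresponds to replaying only part of the merge history and labeling each record with the cluster it has reached at that point. The sketch below illustrates the idea under stated assumptions: the merge list follows the convention used by SciPy linkage matrices (records are numbered 0 to n - 1, and merge number t creates cluster n + t), and the function name `cut_tree` is hypothetical rather than a Spotfire function.

```python
def cut_tree(n_records, merges, n_clusters):
    """Label each record (1, 2, ...) after stopping at n_clusters clusters."""
    n_merges = max(0, n_records - n_clusters)   # each merge removes one cluster
    members = {i: [i] for i in range(n_records)}
    for step, (a, b) in enumerate(merges[:n_merges]):
        # Merge number `step` creates a new cluster with id n_records + step.
        members[n_records + step] = members.pop(a) + members.pop(b)
    labels = [None] * n_records
    ordered = sorted(members.values(), key=min)  # stable, readable numbering
    for label, cluster in enumerate(ordered, start=1):
        for record in cluster:
            labels[record] = label
    return labels


# Merge history for four records: (0,1) -> 4, (2,3) -> 5, (4,5) -> 6.
merges = [(0, 1), (2, 3), (4, 5)]
print(cut_tree(4, merges, 4))  # slider far right: one cluster per record
print(cut_tree(4, merges, 2))  # intermediate position: two clusters
print(cut_tree(4, merges, 1))  # slider far left: a single cluster
```

The returned list plays the role of the new clustering column: one label per record, ready to be compared with the output of another clustering method.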
HIERARCHICAL CLUSTERING ON KEYS
A structure key is a string that lists the substructures (for example various descriptions
in the gene ontology tree). Clustering on keys therefore implies grouping genes with a
similar set of substructures. Clustering on keys is based solely on the values within the key
column, which should contain comma-separated values for some, if not all, records in the
dataset. This is a valuable tool for determining whether there is an overlap between the
expression data and gene ontology descriptions (UNIT 7.2).
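The protocol does not specify how the similarity between two keys is computed. One natural choice for comma-separated, set-valued data is the Jaccard (Tanimoto) coefficient, sketched below as an assumption rather than as Spotfire's documented method. The helper names and the example annotation terms are hypothetical.

```python
def parse_key(key):
    """Split a comma-separated key string into a set of substructure terms."""
    return {part.strip() for part in key.split(",") if part.strip()}


def jaccard(key_a, key_b):
    """Shared terms divided by all distinct terms (0.0 if both keys are empty)."""
    a, b = parse_key(key_a), parse_key(key_b)
    if not a and not b:
        return 0.0  # assumption: two empty keys are treated as dissimilar
    return len(a & b) / len(a | b)


gene_1 = "kinase, membrane"
gene_2 = "membrane,transport"
print(jaccard(gene_1, gene_2))  # one shared term out of three distinct terms
print(jaccard(gene_1, gene_1))  # identical keys
```

Genes whose keys share most of their terms score close to 1 and would be joined early in the hierarchy; genes with no terms in common score 0.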
ALTERNATE
PROTOCOL 2
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering on
Keys in the Tools pane of the DecisionSite Navigator.
The Hierarchical Clustering dialog box will be displayed.
2. Select the Key Columns on which to base clustering.
The Key Column can be any string column in the data.
3. Select whether to work with All Records or Selected Records (see Basic Protocol 1,
step 3).
4. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol
1, step 4).
5. Select which Clustering Method to use for calculating the similarity between two
clusters.
6. Select which Similarity Measure to use in the calculations (Sankoff and Kruskal,
1983).
7. Select which Ordering Function to use while displaying results.
8. Type a New Column Name in the text box or use the default. If desired, check the
Overwrite check box to overwrite a previously added column with the same name
(see Basic Protocol 1, step 6).
9. Select the Calculate Column Dendrogram check box to create a column dendrogram,
if desired.
A column dendrogram arranges the most similar columns (experiments) next to each other.
10. Click OK.
The Hierarchical Clustering on Keys dialog box will be closed and clustering initiated.
The results are displayed according to the user’s preferences in the dialog box. A heat map
and a row-dendrogram visualization are displayed and added to the dataset.
BASIC
PROTOCOL 5
K-MEANS CLUSTERING
K-means clustering is a method for grouping objects into a predetermined number of
clusters based on their similarity (MacQueen, 1967). It is a type of nonhierarchical clustering
in which the user must specify the number of clusters into which the data will eventually be
divided. K-means clustering is an iterative process in which: (1) the user specifies the
number of clusters for a data set, (2) an initial centroid (the center point of a cluster) is
chosen for each cluster using one of several methods selected by the user, and
(3) each record in the data set is assigned to the cluster whose centroid is closest to that
record. The proximity of each record to a centroid is determined on the basis
of a user-defined similarity measure. The centroid of each cluster is then recomputed
from the current members of the cluster. The assignment and recomputation steps are
repeated until a steady state has been reached.
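The iteration described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the similarity measure and initial centroids drawn at random from the existing records; the function name `k_means`, the seed, and the data are hypothetical.

```python
import random
from math import dist


def k_means(records, n_clusters, seed=0, max_iterations=100):
    """Return (assignment, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialization: pick existing records as the starting centroids.
    centroids = rng.sample(records, n_clusters)
    assignment = None
    for _ in range(max_iterations):
        # Assign each record to the cluster with the nearest centroid.
        new_assignment = [
            min(range(n_clusters), key=lambda c: dist(record, centroids[c]))
            for record in records
        ]
        if new_assignment == assignment:
            break  # steady state: no record changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        for c in range(n_clusters):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


records = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(records, n_clusters=2)
print(assignment)
```

Because the result can depend on the starting centroids, it is worth rerunning the procedure with different seeds or initialization methods and comparing the assignments.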
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
Performing K-means clustering
1. To initiate K-means clustering, click on Analysis, followed by Clustering, followed
by K-means Clustering in the Tools pane of the DecisionSite Navigator.
The K-means clustering dialog box will be displayed (Fig. 7.9.13).
Figure 7.9.13 The K-means Clustering Tool dialog box allows the users to specify the number
of desired clusters, the method of choice for initiating centroids, the similarity measure, and other
variables.
2. Select the Value Columns on which to perform the analysis. For multiple selections,
hold down the Ctrl key and click on the desired columns or click on one column
at a time and drag.
3. Click on the radio button to specify whether to work with All Records or Selected
Records (see Basic Protocol 1, step 3).
4. Select a method to Replace Empty Values from the drop-down list (see Basic
Protocol 1, step 4).
5. Enter the Maximum Number of Clusters.
This is the number of clusters that the K-means tool will attempt to generate from the
given data set. However, if empty clusters are generated, they will be discarded and the
number of clusters displayed may be less than that specified.
6. Select a Cluster Initialization method from the drop-down menu.
The user must specify the number of clusters in which the data should be organized
and a method for initializing the cluster centroids. Among the methods available for
this purpose are the Data Centroid Based Search, Evenly Spaced Profiles, Randomly
Generated Profiles, Randomly Selected Profiles, and Marked Records. These methods are
summarized in Table 7.9.1.
7. Select a Similarity Measure to use from the drop-down menu.
Several different similarity measures are available to the K-means clustering tool. These
measures express the similarity between different records as numbers, thereby making it
possible to rank the records according to their similarity. These include Euclidean distance,
Correlation, Cosine Correlation, and City-Block distance (Sankoff and Kruskal, 1983).
Table 7.9.1 Cluster Initiation Methods
Method                        Description
Data Centroid Based Search    An average of all profiles in the data set is chosen to be the
                              first centroid. The similarity between this centroid and all
                              members of the cluster is calculated using the defined
                              similarity measure. The profile that fits this group least
                              well, i.e., the one least similar to the centroid, is then
                              assigned to be the centroid for the second cluster. The
                              similarity between the second centroid and all the remaining
                              profiles is then calculated, and all profiles that are more
                              similar to the second centroid than to the first are then
                              assigned to the second cluster. Of the remaining profiles,
                              the least similar profile is then chosen to be the third
                              centroid and the above process is repeated. This process
                              continues until the number of clusters specified by the user
                              is reached.
Evenly Spaced Profiles        This method generates profiles to be used as centroids that
                              are evenly distributed between the minimum and maximum
                              value for each variable in the profiles in the data set. The
                              centroids are calculated as the average values of each part
                              between the minimum and the maximum values.
Randomly Generated Profiles   Centroids are assigned from random values based on the data
                              set. Each value in the centroids is randomly selected as any
                              value between the maximum and minimum for each variable in
                              the profiles in the data set.
Randomly Selected Profiles    Randomly selected existing profiles (and not some derivation)
                              from the data set are chosen to be the centroids of different
                              clusters.
From Marked Records           Currently marked profiles (marked before initiating K-means
                              clustering) are used as centroids of different clusters.
Figure 7.9.14 K-means clustering results are displayed as a group of profile charts. Each group
is uniquely colored as specified by the check-box query device.
8. Type a new column name for the resulting column or use the default. Check the
Overwrite check box to overwrite any previously existing column with the same
name (see Basic Protocol 1, step 6).
9. Click OK.
The K-means dialog box will close and clustering will be initiated. At the end of clustering, the
results are added to the data set as new columns and graphical representation of the
results can be visualized (Fig. 7.9.14).
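Two of the initialization methods listed in Table 7.9.1 are simple enough to sketch directly. The code below is one possible reading of the table's descriptions, not Spotfire's implementation: for Evenly Spaced Profiles it divides the range of each variable into equal parts and takes the midpoint of each part, and for Randomly Generated Profiles it draws each value uniformly between the minimum and maximum of that variable. The function names are hypothetical.

```python
import random


def column_ranges(records):
    """Minimum and maximum of every variable (column) in the data set."""
    columns = list(zip(*records))
    return [min(col) for col in columns], [max(col) for col in columns]


def evenly_spaced_profiles(records, n_clusters):
    """Centroids at the midpoints of n_clusters equal parts of each range."""
    lows, highs = column_ranges(records)
    return [
        tuple(lo + (hi - lo) * (2 * c + 1) / (2 * n_clusters)
              for lo, hi in zip(lows, highs))
        for c in range(n_clusters)
    ]


def randomly_generated_profiles(records, n_clusters, seed=0):
    """Centroids whose values are drawn uniformly within each variable's range."""
    rng = random.Random(seed)
    lows, highs = column_ranges(records)
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in range(n_clusters)
    ]


records = [(0.0, 10.0), (4.0, 30.0)]
print(evenly_spaced_profiles(records, 2))
print(randomly_generated_profiles(records, 2))
```

Either list of centroids could be used as the starting point of the K-means iteration in place of randomly selected records.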
PRINCIPAL COMPONENTS ANALYSIS
Principal components analysis (PCA) is a tool for reducing the dimensionality of complex
data so that it can be interpreted more easily without significant loss of information
(Jolliffe, 1986). Often, this reduction in the dimensionality of data enables researchers to
identify new, meaningful, underlying variables.
BASIC
PROTOCOL 6
PCA involves a mathematical procedure that converts high-dimensional data containing a
number of (possibly) correlated variables into a new data set containing fewer uncorrelated
variables called principal components. The first principal component accounts for as much
of the variability in the data as possible, and each succeeding component accounts for
as much of the remaining variability as possible. New variables are linear combinations
of the original variables, thereby making it possible to ascribe meaning to what they
represent. This tool works best with transposed data (see Support Protocol 3).
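As a small worked illustration of these ideas, the sketch below finds the first principal component of a two-variable data set. It uses power iteration on the sample covariance matrix, a simple technique chosen here for brevity and not necessarily the one Spotfire uses; the function name and the data are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (direction, scores, explained_fraction) for the first component."""
    n, m = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - mu for x, mu in zip(row, means)] for row in rows]
    # Sample covariance matrix of the variables (m x m).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(m)) for i in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    # Scores: each record projected onto the component (a linear combination).
    scores = [sum(x * c for x, c in zip(row, v)) for row in centered]
    return v, scores, eigenvalue / total_variance


# Two strongly correlated variables: the second is roughly twice the first.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
direction, scores, explained = first_principal_component(rows)
print(direction, explained)
```

Because the two variables are almost perfectly correlated, a single component captures nearly all of the variability, so the two original columns can be replaced by one new column of scores with little loss of information.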
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
Performing PCA
1. To initiate PCA, click on Analysis, followed by Clustering, followed by Principal
Components Analysis in the Tools pane of the DecisionSite navigator.
The PCA dialog box will open (Fig. 7.9.15).
2. Select the Value Columns on which to perform PCA. For multiple selections, hold
down the Ctrl key and click on the desired columns or click on one column at a
time and drag.
3. Click on the radio button to specify whether to work with All Records or Selected
Records (see Basic Protocol 1, step 3).
4. Select a method to Replace Empty Values from the drop down list (see Basic Protocol
1, step 4).
5. Specify the number of Principal Components.
The number of Principal Components is the number of dimensions to which the user
wishes to reduce the original data.
The following remarks apply to choosing the number of clusters in K-means clustering
(see Basic Protocol 5, step 5). K-means clustering is an iterative process and
is most valuable when it is repeated several times, using different numbers of defined
clusters. There is no way to predict a good number of clusters for a given data set. A pattern
that is obvious with 20 clusters, and that the user might expect to be better defined
with 50 clusters, may in fact not appear at all with 50 clusters. It is sometimes helpful to
perform hierarchical clustering prior to K-means clustering: the heat map
and dendrogram generated by hierarchical clustering give the user some idea of
how many clusters to specify for K-means clustering.
6. Type a new Column Name for the resulting column or use the default name. If desired,
check the Overwrite box to overwrite a previously existing column with the same
name (see Basic Protocol 1, step 6).
Figure 7.9.15 The PCA dialog box allows the users to specify which Value Columns should be
included in the calculation. In addition, it allows users to define variables such as the number of
desired components.
Figure 7.9.16 PCA results are displayed as 2-D or 3-D plots according to the user’s specifications.
7. Select whether to create a 2D or a 3D scatter plot showing the Principal Components,
or to perform the PCA calculations without creating a scatter plot by clearing the
Create Scatter Plot check box. The 3D scatter plot can be rotated (Ctrl + right
mouse key) or zoomed (Shift + right mouse key) to assist visualization.
8. Check the Generate Report box.
This report is an HTML page that contains information about the calculation. If the user
does not wish to generate this report, this box can be left unchecked.
9. Click OK.
The Principal Components are now calculated and the results added to the data set as
new columns. A new scatter plot and report are created according to the settings chosen in
this protocol (Fig. 7.9.16). Note that the PCA tool in Spotfire is limited to 2000 columns
of transposed data (i.e., 2000 records in original data). If more records are present at the
time of running this Tool, they will be eliminated from the data.
SUPPORT
PROTOCOL 3
TRANSPOSING DATA IN SPOTFIRE DECISIONSITE
The Transpose data tool is used to rotate a dataset so that columns (measurements or
experiments) now become rows (genes) and vice-versa. Often, transposition is necessary
to present data for a certain type of visualization—e.g., Principal Components Analysis
(PCA; see Basic Protocol 6)—or just to get a good overview of the data. Consider
Table 7.9.2 as an example.
As more and more genes are added, the table will grow taller. (Most typical microarrays
contain thousands to tens of thousands of genes.) While useful during data collection,
this may not be the format of choice for certain types of visualizations or calculations.
Transposing this table yields the format shown in Table 7.9.3.
Table 7.9.2 Typical Affymetrix or Two-Color Microarray Data(a)
Gene Name       Experiment 1    Experiment 2    Experiment 3
Gene A                   250             283             219
Gene B                  1937              80            1655
Gene C                    71              84              77
Gene D                 47358             131           39155
Gene E                 28999           24107           24981
Gene F                   689             801             750
Gene G                  2004            2371            2205
(a) Analyzed microarray data typically consists of several rows, each representing a gene or a probe on
the array, and several columns, each corresponding to different experiments (e.g., different tumors or
treatments). This is the “tall-skinny” format.
Table 7.9.3 Microarray Data After Transposition(a)
Experiment        Gene A    Gene B    Gene C    Gene D    Gene E    Gene F    Gene G
Experiment 1         250      1937        71     47358     28999       689      2004
Experiment 2         283        80        84       131     24107       801      2371
Experiment 3         219      1655        77     39155     24981       750      2205
(a) After transposition, the data is flipped so that each row now represents an experiment whereas each column now represents
the observations for a gene. This “short-wide” data format is suitable for data visualization techniques like PCA.
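The transposition itself is a simple operation, illustrated below on the first three genes of Table 7.9.2. This is a minimal sketch: the function name `transpose` and its parameters are hypothetical, and the `max_records` argument merely imitates the 2000-column limit mentioned in this unit.

```python
def transpose(table, identifier="Experiment", max_records=2000):
    """Turn a gene-by-experiment table into an experiment-by-gene table.

    `table` is a list of rows; the first row is the header and the first
    column holds the identifier (gene name) of each record.
    """
    header, *records = table
    records = records[:max_records]  # records beyond the limit are dropped
    # Each identifier value becomes a column name in the transposed table.
    new_header = [identifier] + [record[0] for record in records]
    # Each original value column becomes one record (row).
    value_columns = zip(*(record[1:] for record in records))
    new_rows = [[name, *values] for name, values in zip(header[1:], value_columns)]
    return [new_header] + new_rows


table = [
    ["Gene Name", "Experiment 1", "Experiment 2", "Experiment 3"],
    ["Gene A", 250, 283, 219],
    ["Gene B", 1937, 80, 1655],
    ["Gene C", 71, 84, 77],
]
transposed = transpose(table)
for row in transposed:
    print(row)
```

Each gene name from the identifier column becomes a column name, and each experiment becomes a row, which matches the layout of Table 7.9.3.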
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
1. Open Transpose Data Wizard 1 by clicking Analysis, followed by Data Preparation,
followed by Transpose Data in the Tools pane of the DecisionSite Navigator.
2. Select an identifier column from the drop-down list.
Each value in this column will become a column name in the transposed dataset.
3. Select whether to create columns from All Records or Selected Records (see Basic
Protocol 1, step 3).
The transposed data will have exactly the same number of columns as records in the
original data with an upper limit of 2000. The rest of the data will be truncated.
4. Click on Next to open Transpose Data Wizard 2.
5. Select the columns to be included in the transposition and then click Add>>.
Each selected column will become a record in the new dataset.
6. Click on Next to open Transpose Data Wizard 3.
7. If needed, select Annotation Columns.
Each transposed column is annotated with the value of this column.
8. Click Finish.
A message box opens prompting the user to save previous work.
9. Click Yes to save data.
The transposed data now replaces the previous data set. Note that the user should save
the previous data set with a different file name to avoid losing that data set.
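The transposition performed by the wizard can be pictured with a small script. The following is only a sketch of the idea in Python, not Spotfire code: the function name `transpose_table` and the dictionary-based table layout are assumptions made for illustration, and the values are the first three genes of Table 7.9.3.

```python
# Illustrative sketch only; this is not how Spotfire implements the tool.

def transpose_table(rows, id_column):
    """Turn a tall-skinny table (one row per gene) into a short-wide
    table (one row per experiment). `rows` is a list of dicts."""
    if not rows:
        return []
    # Every column except the identifier becomes a record in the new table.
    value_columns = [name for name in rows[0] if name != id_column]
    transposed = []
    for column in value_columns:
        record = {"Experiment": column}
        for row in rows:
            # Each value of the identifier column becomes a column name.
            record[row[id_column]] = row[column]
        transposed.append(record)
    return transposed


tall_skinny = [
    {"Gene": "Gene A", "Experiment 1": 250, "Experiment 2": 283, "Experiment 3": 219},
    {"Gene": "Gene B", "Experiment 1": 1937, "Experiment 2": 80, "Experiment 3": 1635},
    {"Gene": "Gene C", "Experiment 1": 71, "Experiment 2": 84, "Experiment 3": 77},
]

short_wide = transpose_table(tall_skinny, id_column="Gene")
for record in short_wide:
    print(record)
```

The identifier column chosen in step 2 plays the role of `id_column` here, and the columns selected in step 5 correspond to the remaining keys of each row.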
BASIC
PROTOCOL 7
USING WEB LINKS TO QUERY THE INTERNET FOR USEFUL
INFORMATION
The Web Links tool enables users to send a query to an external Web site to search for
information about marked records. The search results are displayed in a separate Web
browser. The Web Links tool is shipped with a number of predefined Web sites that are
ready to use, though the user can easily set up new links to Web sites of their choice.
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
Sending a query using Web links
In order to send a query, the data must be in Spotfire DecisionSite. The query is sent for
the marked records in the visualizations. If more than one record is marked, the records
are separated by the Web link delimiter (specified under Web Links Options) in the query.
1a. In a particular visualization, mark those records for which information is desired.
2a. Click on Access, followed by Web Links in the Tools pane of the DecisionSite
Navigator.
The Web Links dialog box will be displayed (Fig. 7.9.17).
Figure 7.9.17 The Web Links dialog box allows users to specify the Web site to search and the
Identifier column from which to formulate the query.
3a. Click to select the link to the Web site where the query will be sent.
Some Web sites only allow searching for one item at a time.
4a. If there are no hits from a search, mark one record at a time in the visualizations and
try again.
5a. Select the Identifier Column to be used as input to the query.
Any column in the data set can be chosen.
6a. Click OK.
The query is sent to the Web site and the results are displayed in a new Web browser (Fig.
7.9.18).
Setting up a new Web link
1b. Click on Access, followed by Web Links in the Tools pane of the DecisionSite
Navigator.
The Web Links dialog box will be displayed.
2b. Click on Options to cause the Web Links Options dialog box to be displayed.
3b. Click on New.
A new Web Link will be created and selected in the list of Available Web Links. The Preview
shows what the finished query will look like when it is sent.
4b. Edit the name of the new link in the Web Link Name text box.
5b. Edit the URL to the Web link. Use a dollar sign within curly brackets {$} as a
placeholder for ID.
Anything entered between the left bracket and the dollar sign will be placed before each
ID in the query. In the same way, anything placed between the dollar sign and the right
bracket will be placed after each ID in the query.
Figure 7.9.18 Results of a Web Link query are displayed in a new Web browser window. In this
particular example, a significant outlier list of genes (GenBank accession numbers) was queried
using a Gene Annotation Database (created at the Hartwell Center for Bioinformatics and
Biotechnology) and the results returned included Gene Descriptions and Gene Ontologies (UNIT 7.2) for
the queried records.
6b. Enter the Delimiter to separate the IDs in a query.
The identifiers in a query with more than one record are put together in one search string
separated by the selected delimiter. The delimiters AND, OR, or ONLY can be used. The
ONLY delimiter is useful when specifying genes differentially expressed at one point of
time only, or genes that result in classification of a particular kind of tumor only.
7b. Click OK.
The new Web Link will be saved and displayed together with the other available Web Links
in the user interface.
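The placeholder and delimiter rules in steps 5b and 6b can be made concrete with a short sketch. This is an illustration in Python of the behavior described above, not Spotfire's own code: the helper name `build_query`, the `example.org` URL, and the identifiers `ID1` and `ID2` are all hypothetical, and no URL encoding is attempted.

```python
import re

# Matches {prefix$suffix}; prefix and suffix may be empty.
_PLACEHOLDER = re.compile(r"\{([^{}$]*)\$([^{}$]*)\}")


def build_query(url_template, ids, delimiter):
    """Expand the {$} placeholder of a Web link URL for several IDs."""
    match = _PLACEHOLDER.search(url_template)
    if match is None:
        raise ValueError("URL template has no {$} placeholder")
    prefix, suffix = match.group(1), match.group(2)
    # Text before the dollar sign precedes each ID, text after it follows
    # each ID, and the delimiter separates the IDs.
    joined = delimiter.join(prefix + identifier + suffix for identifier in ids)
    return url_template[:match.start()] + joined + url_template[match.end():]


# Plain placeholder: IDs are simply joined by the delimiter.
print(build_query("http://example.org/search?q={$}", ["ID1", "ID2"], " OR "))
# Quotation marks inside the brackets wrap every ID.
print(build_query('http://example.org/search?q={"$"}', ["ID1", "ID2"], " OR "))
```

The Preview field in the Web Links Options dialog box serves the same purpose as the printed strings here: it shows the finished query before it is sent.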
Editing a Web link
1c. Click on Access, followed by Web Links in the Tools pane of the DecisionSite
Navigator.
The Web Links dialog box will be displayed.
2c. Click on Options to display the Web Links Options dialog box.
3c. Click on the Web Link to be edited in the list of Available Web Links.
The Web Link Name, URL, and Delimiter for the selected Web Link will be displayed and
can be edited directly in the corresponding fields. All changes that are made are reflected
in Preview, which helps show what the finished query will look like.
4c. Make desired changes to the Web Link and click OK.
The Web Link will be updated according to the changes and the Web Links Options dialog
box will close.
Removing a Web link
1d. Click on Access, followed by Web Links in the Tools pane of the DecisionSite
Navigator.
The Web Links dialog box is displayed.
2d. Click on Options to display the Web Links Options dialog box.
3d. Click on the Web Link to be removed in the list of Available Web Links.
The Web Link Name, URL, and Delimiter for the selected Web link will be displayed in
the corresponding fields.
4d. Click Delete to clear all of the fields.
Many Web Links can be deleted at the same time if several Web Links are selected in the
list of Available Web Links and Delete is clicked. Press Ctrl and click on the Web Links
in the list to select more than one. If some of the default Web Links are deleted by mistake,
they can be retrieved by clicking the Add Defaults button. This adds all of the default links
to the Available Web Links list, regardless of whether or not the links already exist.
BASIC
PROTOCOL 8
GENERATING NEW COLUMNS OF DATA IN SPOTFIRE
New columns with numerical values can be computed from the current data set by using
mathematical expressions. This protocol describes how to create and evaluate such
expressions. Occasionally the columns included in a data set do not allow users to
perform all necessary operations, or to create the visualizations needed to fully explore the
data set. In many such cases, the necessary information can be computed from existing
columns. Spotfire provides the option to calculate new columns by applying mathematical
operators to existing values. For example, it may be necessary to compute the fold
change when dealing with multiple array experiments. It can be computed by dividing
the normalized signal values of the experimental array by the normalized signal values of
the control array for every gene. For a discussion of normalizing data see UNIT 7.8.
This protocol discusses dividing two columns as an example. Other calculations can be
similarly performed. Spotfire supports the functions listed in Table 7.9.4 in expressions
used for calculating new columns.
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)

Table 7.9.4 Description of Various Functions Available in Spotfire

Function   Format                           Description
ABS        ABS(Arg1)                        Returns the unsigned value of Arg1
ADD        Arg1 + Arg2                      Adds the two real number arguments and returns a real number result
CEIL       CEIL(Arg1)                       Returns Arg1 rounded up; that is, the smallest integer that is ≥ Arg1
COS        COS(Arg1)                        Returns the cosine of Arg1 [a]
DIVIDE     Arg1 / Arg2                      Divides Arg1 by Arg2 (real numbers) [b]
EXP        EXP(Arg1, Arg2) or Arg1 ^ Arg2   Raises Arg1 to the power of Arg2
FLOOR      FLOOR(Arg1)                      Returns the largest integer that is ≤ Arg1 (i.e., rounds down)
LOG        LOG(Arg1)                        Returns the base 10 logarithm of Arg1
LN         LN(Arg1)                         Returns the natural logarithm of Arg1
MAX        MAX(Arg1, Arg2, ...)             Returns the largest of the real number arguments (null arguments are ignored)
MIN        MIN(Arg1, Arg2, ...)             Returns the smallest of the real number arguments (null arguments are ignored)
MOD        MOD(Arg1, Arg2)                  Returns the remainder from integer division
MULTIPLY   Arg1 * Arg2                      Multiplies two real number arguments to yield a real number result
NEG        NEG(Arg1)                        Negates the argument
SQRT       SQRT(Arg1)                       Returns the square root of Arg1 [c]
SUBTRACT   Arg1 − Arg2                      Subtracts Arg2 from Arg1 (real numbers) to yield a real number result
SIN        SIN(Arg1)                        Returns the sine of Arg1 [a]
TAN        TAN(Arg1)                        Returns the tangent of Arg1 [a]

[a] The argument is in radians.
[b] If Arg2 is zero, this function results in an error. Examples: 7/2 yields 3.5, 0/0 yields #NUM, 1/0 yields #NUM.
[c] The result can also be attained by supplying an Arg2 of 0.5 using the EXP function.
Dividing two columns
1. Initiate a new Spotfire session and load data.
For example load a few data columns from a .gpr file.
Figure 7.9.19 Right clicking in the Query Devices window allows generation of new columns.
Figure 7.9.20 The New Columns dialog box.
2. Right click in the Query Devices window. From the resulting pop-up menu (Fig.
7.9.19), choose New Column, followed by From Expression.
A New Column dialog box will appear (Fig. 7.9.20).
3. From the Operators drop-down list, select “/” (Table 7.9.4).
4. Select the desired columns for Arguments 1 and 2.
For example, select the normalized 635 (Cy-5) signal column as Argument 1 and the
normalized 532 (Cy-3) signal column as Argument 2.
5. Click Insert Function.
6. Click Next >.
7. Enter a name for the new column, for example fold change. If the function just
created is likely to be used again later, save it as a Favorite by clicking Add To Favorites.
After being saved, it will appear in the list of Favorites, and can be used again by selecting
it and clicking Insert Favorite.
8. Click Finish.
A new column of data will be added to the session.
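For readers who want to check the result outside Spotfire, the same column division can be sketched in a few lines of Python. This is only an illustration: the function name `fold_change` and the signal values are made up, and `None` is used here as a stand-in for the #NUM result that Spotfire reports when the divisor is zero (Table 7.9.4).

```python
def fold_change(experimental, control):
    """Divide two columns element by element.

    A missing value or a zero divisor gives None instead of a number.
    """
    result = []
    for exp_value, ctl_value in zip(experimental, control):
        if exp_value is None or ctl_value is None or ctl_value == 0:
            result.append(None)
        else:
            result.append(exp_value / ctl_value)
    return result


# Hypothetical normalized signals: Argument 1 (635, Cy-5) and Argument 2 (532, Cy-3).
cy5_signal = [1200.0, 300.0, 7.0, 50.0]
cy3_signal = [600.0, 600.0, 2.0, 0.0]

print(fold_change(cy5_signal, cy3_signal))  # [2.0, 0.5, 3.5, None]
```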
BASIC
PROTOCOL 9
EXPORTING SPOTFIRE VISUALIZATIONS
Microarray data analysis techniques usually involve rigorous computation. Most steps
can be tracked and understood by novice users through the use of visualizations in two or
three dimensions with a striking use of colors to demonstrate changes or groupings. UNIT
7.7 provides a detailed discussion of modifying and enhancing visualizations. It is desirable that these visualizations be exported from within Spotfire to other applications.
Currently, Spotfire visualizations can be exported in four different fashions: to Microsoft
Word, to Microsoft PowerPoint, as a Web page, or copied to the clipboard.
Necessary Resources
Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM,
20 MB disk space, and VGA or better display with 800 × 600 pixels resolution
(user may benefit from much higher RAM and a significantly better processor
speed) or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space,
256 color (or better) video display, and a network interface card (NIC) for
network connections to MetaFrame servers
Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows
Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3)
through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local
customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to
export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)
Files
Data files (e.g., .met files, .gpr files)
Figure 7.9.21 The Microsoft Word Presentation dialog box.
Exporting visualizations to Word
The Microsoft Word export tool exports the active visualization(s) to a Microsoft Word
document. Each visualization is added to a new page in the document along with annotation, title, and legend. Note that Microsoft Word needs to be installed on the machine.
1a. Create Visualizations in Spotfire (UNIT 7.7) and if necessary, edit the Titles and Annotations.
2a. Click on Reporting, followed by Microsoft Word in the Tools pane of DecisionSite
Navigator.
A dialog box will be displayed listing all the visualizations that can be exported (Fig.
7.9.21).
3a. Click to select the visualizations to be exported. To select all, click on Select All. For
multiple selections hold down the Ctrl key and select desired visualizations.
4a. Click OK.
An instance of Microsoft Word will be displayed that contains the selected visualizations.
Exporting visualizations to PowerPoint
The Microsoft PowerPoint export tool exports the active visualization(s) to a Microsoft
PowerPoint document. Each visualization is added to a new page in the document along
with annotation, title, and legend. Note that Microsoft PowerPoint needs to be installed
on the machine.
1b. Create visualizations in Spotfire and if necessary, edit the Titles and Annotations.
2b. Click on Reporting, followed by Microsoft PowerPoint in the Tools pane of DecisionSite Navigator.
A dialog box will be displayed listing all the visualizations that can be exported similar
to the one displayed in Figure 7.9.21.
3b. Click to select the visualizations to be exported. To select all, click on Select all. For
multiple selections hold down the Ctrl key and select desired visualizations.
4b. Click OK.
An instance of Microsoft PowerPoint that contains selected visualizations will be displayed.
Exporting visualizations as a Web page
The Export as Web Page tool exports the current visualizations as an HTML file and a set
of images. The user can also include annotations, titles, and legends for the visualization.
1c. Create the desired visualizations and set the query devices. If multiple visualizations
are to be included, ensure that they are all visible and are in the right proportions.
This is important because unlike the export to Word or PowerPoint features where each
visualization is pasted on a new page (or slide) in the document, all visualizations are
exported to the same page in this case.
Visualizations are included in the report exactly as they are visible on the screen. Multiple
visualizations can be tiled by clicking Window, followed by Auto Tile.
2c. Click on Reporting, followed by Export as Web Page in the Tools pane of DecisionSite
Navigator.
The Export as Web page dialog box will be displayed (Fig. 7.9.22).
3c. Enter a report header.
This header will appear at the top of the Web Page Report.
4c. Check the options to include in the report.
These include Legend, Annotations, SQL query (corresponding to the current query devices
setting), and a table of currently marked records.
Figure 7.9.22 The Export as Web Page dialog box.
Figure 7.9.23 Data exported from a Spotfire session to the Web is displayed as a Web page
report containing all the images as well as marked records.
5c. Select a graphic output format for the exported images (.jpg or .png).
6c. Click Save As. Enter a file name and a directory where the report is to be saved.
The HTML report will be saved in the designated directory along with a subfolder containing the exported images.
7c. If desired, select View Report After Saving.
A browser window will be launched, displaying the report (Fig. 7.9.23).
Copying to clipboard
This tool enables users to copy any active visualization to the clipboard and paste it to
another application.
1d. Create the desired visualizations and set the Query Devices.
2d. From the menu bar, click on Edit, followed by Copy Special, followed by Visualization (Fig. 7.9.24).
The active visualization will be copied to clipboard.
3d. Open an instance of the desired application and paste from the clipboard.
Exporting visualization from the file menu
This option allows users to export the current visualization from the File menu as an image file (e.g., .jpg or .bmp).
1e. Create the desired visualizations and set the Query Devices.
Figure 7.9.24 Exporting the currently active visualization using the Copy Special, Visualization
mode.
Figure 7.9.25 The Export Visualization dialog box.
2e. Click on File, followed by Export, followed by Current Visualization.
The Export Visualization dialog box will open.
3e. Select whether to Include Title or use the default for the visualization to be exported.
The title is exported along with the visualization.
4e. Select Preserve Aspect Ratio or change the size of the visualization to be exported
by changing the aspect settings.
5e. Click OK (Fig. 7.9.25).
6e. Choose the directory in which to save the visualization from the ensuing window.
Also specify the format in which the visualization should be saved.
Available choices include bitmap (.bmp), JPEG image (.jpg), PNG image (.png), and
extended windows metafile (.emf).
7e. Click Save.
GUIDELINES FOR UNDERSTANDING RESULTS
The goal of most microarray experiments is to survey patterns of gene expression by
assaying the expression levels of thousands to tens of thousands of genes in a single assay.
Typically, RNA is first isolated from different tissues, developmental stages, disease states
or samples subjected to appropriate treatments. The RNA is then labeled and hybridized
to the microarrays using an experimental strategy that allows expression to be assayed
and compared between appropriate sample pairs. Common strategies include the use of
a single label and independent arrays for each sample (Affymetrix), or a single array
with distinguishable fluorescent dye labels for the individual RNAs (most homemade
two-color spotted microarray platforms).
Irrespective of the type of platform chosen, microarray data analysis is a challenge. The
hypothesis underlying microarray analysis is that the measured intensities for each arrayed
gene represent its relative expression level. Biologically relevant patterns of expression
are typically identified by comparing measured expression levels between different states
on a gene-by-gene basis. Before the levels can be compared appropriately, a number of
transformations must be carried out on the data to eliminate questionable or low-quality
measurements, to adjust the measured intensities to facilitate comparisons, and to select
genes that are significantly differentially expressed between classes of samples.
Most microarray experiments investigate relationships between related biological samples based on patterns of expression, and the simplest approach looks for genes that
are differentially expressed. Although ratios provide an intuitive measure of expression
changes, they have the disadvantage of treating up- and down-regulated genes differently. For example, genes up-regulated by a factor of two have an expression ratio of
two, whereas those down-regulated by the same factor have an expression ratio of 0.5.
The most widely used alternative transformation of the ratio is the logarithm base two,
which has the advantage of producing a continuous spectrum of values and treating up- and down-regulated genes in a similar fashion. For the two examples above, the log2(ratio) values are +1 and −1, respectively.
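The symmetry gained by the logarithmic transformation is easy to verify numerically. The short Python sketch below uses made-up ratios; only the standard-library function `math.log2` is involved.

```python
import math

# Hypothetical expression ratios: 4-fold up, 2-fold up, unchanged,
# 2-fold down, and 4-fold down.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

log_ratios = [math.log2(ratio) for ratio in ratios]
print(log_ratios)  # [2.0, 1.0, 0.0, -1.0, -2.0]
```

Up- and down-regulation by the same factor now differ only in sign, and unchanged genes sit at zero.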
Normalization adjusts the individual hybridization intensities to balance them appropriately so that meaningful biological comparisons can be made. There are a number of
reasons why data must be normalized, including unequal quantities of starting RNA,
differences in labeling or detection efficiencies between the fluorescent dyes used, and
systematic biases in the measured expression levels.
Expression data can be mined efficiently if the problem of similarity is converted into
a mathematical one by defining an expression vector for each gene that represents its
location in expression space. In this view of gene expression, each experiment represents
a separate, distinct axis in space and the log2(ratio) measured for that gene in that experiment represents its geometric coordinate. For example, if there are three experiments,
the log2(ratio) for a given gene in experiment 1 is its x coordinate, the log2(ratio) in
experiment 2 is its y coordinate, and the log2(ratio) in experiment 3 is its z coordinate.
It is then possible to represent all the information obtained about that gene by a point
in x-y-z-expression space. A second gene, with nearly the same log2(ratio) values for
each experiment will be represented by a (spatially) nearby point in expression space; a
gene with a very different pattern of expression will be far from the original gene. This
model can be generalized to any number of experiments. The dimensionality of
expression space equals the number of experiments. In this way, expression data can be
represented in n-dimensional expression space, where n is the number of experiments,
and each gene-expression vector is represented as a single point in that space.
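As a minimal sketch of this geometric view, the Python snippet below treats each gene as a vector of log2(ratio) values and measures how far apart two genes are. Euclidean distance is used as one common choice of metric; the text above does not prescribe a particular one, and the gene values are invented for illustration.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two expression vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical log2(ratio) values in three experiments (x, y, z).
gene_1 = [1.0, -0.5, 2.0]
gene_2 = [1.1, -0.4, 1.9]    # nearly the same pattern as gene_1
gene_3 = [-2.0, 1.5, -1.0]   # a very different pattern

print(euclidean_distance(gene_1, gene_2))  # small: nearby points
print(euclidean_distance(gene_1, gene_3))  # large: distant points
```

The same function works unchanged for any number of experiments, because the vectors simply gain one coordinate per experiment.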
Having been provided with a means of measuring distance between genes, clustering
algorithms sort the data and group genes together on the basis of their separation in
expression space. It should also be noted that if the interest is in clustering experiments,
it is possible to represent each experiment as an experiment vector consisting of the
expression values for each gene; these define an experiment space, the dimensionality of
which is equal to the number of genes assayed in each experiment. Again, by defining
distances appropriately, it is possible to apply any of the clustering algorithms defined
here to analyze and group experiments.
To interpret the results from any analysis of multiple experiments, it is helpful to have
an intuitive visual representation. A commonly used approach relies on the creation of
an expression matrix in which each column of the matrix represents a single experiment
and each row represents the expression vector for a particular gene. Coloring each of the
matrix elements on the basis of its expression value creates a visual representation of
gene-expression patterns across the collection of experiments. There are countless ways
in which the expression matrix can be colored and presented. The most commonly used
method colors genes on the basis of their log2(ratio) in each experiment, with log2(ratio)
values close to zero colored black, those with log2(ratio) values greater than zero colored
red, and those with negative values colored green. For each element in the matrix, the
relative intensity represents the relative expression, with brighter elements being more
highly differentially expressed. For any particular group of experiments, the expression
matrix generally appears without any apparent pattern or order. Programs designed to
cluster data generally re-order the rows, columns, or both, such that patterns of expression
become visually apparent when presented in this fashion.
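The red/green coloring convention can be sketched as a simple mapping from a log2(ratio) to an RGB color. This is an illustrative Python sketch, not the scheme used by any particular program: the function name `log_ratio_to_rgb` and the saturation value of 3.0 (the magnitude at which the color reaches full brightness) are assumptions.

```python
def log_ratio_to_rgb(log_ratio, saturation=3.0):
    """Map a log2(ratio) to an (R, G, B) tuple with components 0-255.

    Zero is black, positive values are red, negative values are green,
    and brightness grows with magnitude until `saturation` is reached.
    """
    intensity = min(abs(log_ratio) / saturation, 1.0)
    level = round(255 * intensity)
    if log_ratio > 0:
        return (level, 0, 0)
    if log_ratio < 0:
        return (0, level, 0)
    return (0, 0, 0)


# One row of a hypothetical expression matrix (one gene, four experiments).
row = [0.0, 3.0, -5.0, -1.0]
print([log_ratio_to_rgb(value) for value in row])
# [(0, 0, 0), (255, 0, 0), (0, 255, 0), (0, 85, 0)]
```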
Before clustering the data, there are two further questions that need to be considered. First,
should the data be adjusted in some way to enhance certain relationships? Second, what
distance measure should be used to group related genes together? In many microarray
experiments, the data analysis can be dominated by the variables that have the largest
values, obscuring other, important differences. One way to circumvent this problem is to
adjust or re-scale the data, and there are several methods in common use with microarray
data. For example, each vector can be re-scaled so that the average expression of each
gene is zero: a process referred to as mean centering. In this process, the basal expression
level of a gene is subtracted from each experimental measurement. This has the effect of
enhancing the variation of the expression pattern of each gene across experiments, without
regard to whether the gene is primarily up- or down-regulated. This is particularly useful
for the analysis of time-course experiments, in which one might like to find genes that
show similar variation around their basal expression level. The data can also be adjusted
so that the minimum and maximum are one or so that the ‘length’ of each expression
vector is one.
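Two of the adjustments mentioned above, mean centering and scaling to unit length, can be sketched as follows. The Python function names are illustrative, and the mean across experiments is taken as the basal expression level of the gene.

```python
import math


def mean_center(vector):
    """Subtract the gene's average expression from every measurement."""
    mean = sum(vector) / len(vector)
    return [value - mean for value in vector]


def scale_to_unit_length(vector):
    """Rescale the vector so that its Euclidean length is one."""
    length = math.sqrt(sum(value * value for value in vector))
    if length == 0:
        return list(vector)   # an all-zero vector cannot be rescaled
    return [value / length for value in vector]


# Hypothetical expression values for one gene across three time points.
print(mean_center([1.0, 2.0, 3.0]))      # [-1.0, 0.0, 1.0]
print(scale_to_unit_length([3.0, 4.0]))  # [0.6, 0.8]
```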
Various clustering techniques have been applied to the identification of patterns in gene-expression data. Most cluster analysis techniques are hierarchical; the resultant classification has an increasing number of nested classes and the result resembles a phylogenetic
classification. Nonhierarchical clustering techniques also exist, such as K-means clustering, which simply partition objects into different clusters without trying to specify
the relationship between individual elements. Clustering techniques can further be classified as divisive or agglomerative. A divisive method begins with all elements in one
cluster that is gradually broken down into smaller and smaller clusters. Agglomerative
techniques start with (usually) single-member clusters and gradually fuse them together.
Finally, clustering can be either supervised or unsupervised. Supervised methods use
existing biological information about specific genes that are functionally related to guide
the clustering algorithm. However, most methods are unsupervised and these are dealt
with first.
Although cluster analysis techniques are extremely powerful, great care must be taken
in applying this family of techniques. Even though the methods used are objective in the
sense that the algorithms are well defined and reproducible, they are still subjective in the
sense that selecting different algorithms, different normalizations, or different distance
metrics, will place different objects into different clusters. Furthermore, clustering unrelated data will still produce clusters, although they might not be biologically meaningful.
The challenge is therefore to select the data and to apply the algorithms appropriately so
that the classification that arises partitions the data sensibly.
Hierarchical Clustering
Hierarchical clustering is simple and the result can be visualized easily. It is an agglomerative type of clustering in which single expression profiles are joined to form groups,
which are further joined until the process has been carried to completion, forming a
single hierarchical tree. First, the pairwise distance matrix is calculated for all of the
genes to be clustered. Second, the distance matrix is searched for the two most similar
genes or clusters; initially each cluster consists of a single gene. This is the first true
stage in the clustering process. Third, the two selected clusters are merged to produce a
new cluster that now contains at least two objects. Fourth, the distances are calculated
between this new cluster and all other clusters. There is no need to calculate all distances
as only those involving the new cluster have changed. Last, steps two through four are
repeated until all objects are in one cluster. There are several variations on hierarchical
clustering that differ in the rules governing how distances are measured between clusters
as they are constructed. Each of these will produce slightly different results, as will any
of the algorithms if the distance metric is changed. Typically for gene-expression data,
average-linkage clustering gives acceptable results.
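The agglomerative procedure described above can be sketched in plain Python. This is a deliberately naive illustration, assuming Euclidean distance and average linkage; for brevity it recomputes every cluster-to-cluster distance in each round instead of updating only those that involve the new cluster. The function names and data are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(vectors):
    """Return the merges as (members_a, members_b, distance) tuples."""
    clusters = [(i,) for i in range(len(vectors))]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        # Search for the two most similar clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], vectors)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        # Merge the selected pair into one new cluster.
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Four hypothetical genes measured in two experiments.
genes = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
for members_a, members_b, distance in hierarchical_clustering(genes):
    print(members_a, members_b, round(distance, 3))
```

The order of the merges, together with the distance at which each occurs, is the information drawn as a tree (dendrogram) by clustering programs.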
K-Means Clustering
If there is advance knowledge about the number of clusters that should be represented
in the data, K-means clustering is a good alternative to hierarchical methods. In K-means clustering, objects are partitioned into a fixed number (K) of clusters, such that
the clusters are internally similar but externally dissimilar. First, all initial objects are
randomly assigned to one of K clusters (where K is specified by the user). Second, an
average expression vector is then calculated for each cluster and this is used to compute the
distances between clusters. Third, using an iterative method, objects are moved between
clusters and intra- and intercluster distances are measured with each move. Objects are
allowed to remain in the new cluster only if they are closer to it than to their previous
cluster. Fourth, after each move, the expression vectors for each cluster are recalculated.
Last, the shuffling proceeds until moving any more objects would make the clusters more
variable, increasing intracluster distances and decreasing intercluster dissimilarity.
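The following Python sketch illustrates the K-means idea with the common batch variant, in which all objects are reassigned to their nearest cluster average in each round rather than being moved one at a time. It is an illustration under stated assumptions, not the algorithm of any specific package: the function names, the optional `assignment` argument, the round-robin initialization, and the re-seeding of empty clusters are all choices made here for simplicity.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, k, assignment=None, max_rounds=100, seed=0):
    """Return a list giving the cluster index (0..k-1) of every vector."""
    rng = random.Random(seed)
    if assignment is None:
        # Random initial assignment; dealing shuffled objects round-robin
        # keeps every cluster non-empty at the start.
        order = list(range(len(vectors)))
        rng.shuffle(order)
        assignment = [0] * len(vectors)
        for position, index in enumerate(order):
            assignment[index] = position % k
    else:
        assignment = list(assignment)
    for _ in range(max_rounds):
        # (Re)calculate the average expression vector of each cluster.
        centroids = []
        for cluster in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == cluster]
            # An emptied cluster is re-seeded with a random object.
            centroids.append(mean_vector(members) if members
                             else list(rng.choice(vectors)))
        # Move every object to the cluster whose average is closest.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break   # no object moved, so the clusters are stable
        assignment = new_assignment
    return assignment


# Six hypothetical genes forming two obvious groups.
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]
print(k_means(genes, k=2, assignment=[0, 1, 0, 1, 0, 1]))  # [0, 0, 0, 1, 1, 1]
```

Passing an explicit starting assignment, as in the last line, makes the run reproducible; omitting it uses the random start described in the text.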
Self-Organizing Maps
A self-organizing map (SOM) is a neural-network-based divisive clustering approach that
assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition. Before initiating the analysis,
the user defines a geometric configuration for the partitions, typically a two-dimensional
rectangular or hexagonal grid. Random vectors are generated for each partition, but before genes can be assigned to partitions, the vectors are first trained using an iterative
process that continues until convergence so that the data are most effectively separated. In
choosing the geometric configuration for the clusters, the user is, effectively, specifying
the number of partitions into which the data is to be divided. As with K-means clustering,
the user has to rely on some other source of information, such as PCA, to determine the
number of clusters that best represents the available data.
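A very small SOM on a rectangular grid can be sketched as shown below. The sketch rests on several assumptions that are not taken from the text: training runs for a fixed number of steps instead of testing for convergence, the learning rate and neighborhood radius decay linearly, the neighborhood function is Gaussian, and all function and parameter names are illustrative.

```python
import math
import random


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train one reference vector per grid node; return {(row, col): vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    # Random reference vectors, one for each partition of the grid.
    nodes = {
        (r, c): [rng.uniform(lo, hi) for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    steps = epochs * len(vectors)
    start_radius = max(rows, cols) / 2
    for t in range(steps):
        frac = t / steps
        rate = 0.5 * (1 - frac)                        # learning rate decays to zero
        radius = max(start_radius * (1 - frac), 0.5)   # neighborhood shrinks over time
        v = rng.choice(vectors)
        # Best-matching node: the reference vector closest to this object.
        best = min(nodes, key=lambda n: math.dist(v, nodes[n]))
        for (r, c), ref in nodes.items():
            grid_d2 = (r - best[0]) ** 2 + (c - best[1]) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))
            # Pull the reference vector toward the object, scaled by grid proximity.
            for i in range(dim):
                ref[i] += rate * influence * (v[i] - ref[i])
    return nodes


def assign(vectors, nodes):
    """Assign each object to the partition with the closest reference vector."""
    return [min(nodes, key=lambda n: math.dist(v, nodes[n])) for v in vectors]
```

The `rows` and `cols` arguments play the role of the geometric configuration: a 2 x 3 grid, for example, divides the data into at most six partitions.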
Principal Component Analysis
An analysis of microarray data is a search for genes that have similar, correlated patterns
of expression. This indicates that some of the data might contain redundant information.
For example, if a group of experiments were more closely related than the researcher had
expected, it would be possible to ignore some of the redundant experiments, or use some
average of the information without loss of information.
Principal component analysis (PCA) is a mathematical technique that reduces the effective
dimensionality of gene-expression space without significant loss of information while also
allowing us to pick out patterns in the data. PCA allows the user to identify those views
that give the best separation of the data. This technique can be applied to both genes and
experiments as a means of classification. PCA is most useful when combined with another
classification technique, such as K-means clustering or SOMs, that requires the user to
specify the number of clusters.
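To make the idea of reducing dimensionality concrete, the sketch below finds only the first principal component, using power iteration on the covariance matrix. It is an illustration under assumptions, not a full PCA: production analyses normally use an eigen- or singular-value decomposition from a numerical library, and the function name, iteration count, and starting vector are choices made for this example.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (direction, fraction of variance explained, scores along it)."""
    n, dim = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix (dim x dim) of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges on the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in range(dim)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance in the data
        vec = [x / norm for x in nxt]
    variance = sum(vec[i] * cov[i][j] * vec[j] for i in range(dim) for j in range(dim))
    total = sum(cov[i][i] for i in range(dim))
    explained = variance / total if total else 0.0
    # Project each centered row onto the component to get its score.
    scores = [sum(x * w for x, w in zip(row, vec)) for row in centered]
    return vec, explained, scores
```

If the returned fraction of variance is close to 1, a single axis captures most of the structure, which is the sense in which redundant experiments can be summarized without significant loss of information.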
COMMENTARY
Background Information
DNA microarray analysis has become one
of the most widely used techniques in modern
molecular genetics, and protocols developed in the laboratory in recent years have
led to increasingly robust assays. The application of microarray technologies affords great
opportunities for exploring patterns of gene
expression and allows users to begin investigating problems ranging from deducing biological pathways to classifying patient populations.
As with all assays, the starting point for developing a microarray study is planning the
comparisons that will be made. The simplest
experimental designs are based on the comparative analysis of two classes of samples, using either
a series of paired case-control comparisons or comparisons to a common reference
sample, although other approaches have been
described. In every case, however, the fundamental purpose
of using arrays is generally a comparison of
samples to find genes that are significantly different in their patterns of expression.
Microarrays have led biological and pharmaceutical research to increasingly higher
throughput because of the value they bring in
measuring the expression of numerous genes
in parallel. The generation of all these data,
however, loses much of its potential value unless
important conclusions can be extracted
from large data sets quickly enough to interpret
the results and influence the next experimental and/or clinical steps. Generating and understanding robust and efficient tools for data
mining, including experimental design, statistical analysis, data visualization, data representation, and database design, is of paramount
importance. Obtaining maximal value from experimental data involves a team effort that includes biologists, chemists, pharmacologists,
statisticians, and software engineers. In this
unit, the authors describe data analysis techniques used in their center for analysis of
large volumes of homemade and commercial Affymetrix microarrays. An attempt has
been made to describe microarray data analysis methods in a language that most biologists
can understand. Drawing on the knowledge
and expertise of biologists, who ensure that the right
experiments are carried out, is essential, in our
view, for correct interpretation of microarray
data.
Literature Cited
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of
genome-wide expression patterns. Proc. Natl.
Acad. Sci. U.S.A. 95:14863-14868.
7.9.42
Supplement 7
Current Protocols in Bioinformatics
Jolliffe, I.T. 1986. Springer Series in Statistics,
1986: Principal Component Analysis. SpringerVerlag, New York.
Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183-201.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations
In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability,
Vol I. (L.M. Le Cam and J. Neyman, eds.)
pp. 281-297. University of California Press,
Berkeley, Calif.
Sankoff, D. and Kruskal, J.B. 1983. Time Warps,
String Edits, and Macromolecules: The Theory
and Practice of Sequence Comparison. AddisonWesley Publishing, Reading, Mass.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho,
R.J., Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet.
22:281-285.
Contributed by Deepak Kaushal and
Clayton W. Naeve
St. Jude Children’s Research Hospital
Memphis, Tennessee
Microarray Data Visualization and Analysis
with the Longhorn Array Database (LAD)
UNIT 7.10
One of the many hallmarks of DNA microarray research is the staggering quantity and
heterogeneity of raw data that are generated. In addition to these raw data, it is also important that the experiments themselves be fully annotated for the purpose of long-term
functional understanding and comparative analysis. Finally, the combination of experimental annotation and raw data must be linked to organism-specific genomic biological
annotations for the results to have immediate and long-term biological relevance and
meaning.
To handle these challenges, biologists in recent years have enthusiastically developed
and adopted software applications powered by the resilience of computational relational
databases. The Longhorn Array Database (LAD) is a microarray database that operates
on the open-source combination of the relational database PostgreSQL and the operating
system Linux (Brazma et al., 2001; Killion et al., 2003). LAD is a fully MIAME-compliant
database. The MIAME (Minimum Information About a Microarray Experiment) standard
for describing microarray experiments is being adopted by many journals as a requirement
for the submission of papers. LAD is a fully open-source version of the Stanford Microarray
Database (SMD), one of the largest and most functionally proven microarray databases
(Sherlock et al., 2001).
The protocols presented in this unit detail the steps required to upload experimental data,
visualize and analyze results, and perform comparative analysis of multiple experiment
datasets with LAD. Additionally, readers will learn how to effectively organize data for
the purpose of efficient analysis, long-term warehousing, and open-access publication of
primary microarray data.
Each of these protocols is based on the assumption that one has access to an unrestricted
user account on a fully deployed and configured LAD server. If one is a database curator
who is in the position of having to maintain a LAD installation for a laboratory, core
facility, or institution, Appendices A and B at the end of this unit, which deal with
configuring global resources and setting up user accounts, will be of additional interest.
Systems administrators who need to install LAD should consult Appendix C, which
outlines the steps needed to do this, along with hardware recommendations.
DATABASE LOCATION AND AUTHENTICATION
Nearly all interaction with LAD is performed using a Web browser. The following protocol
is a simple introduction to communication and authentication with LAD. Each of the
subsequent protocols in this unit will assume that these steps have already been performed.
BASIC PROTOCOL 1
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
Contributed by Patrick J. Killion and Vishwanath R. Iyer
Current Protocols in Bioinformatics (2004) 7.10.1-7.10.60
Copyright © 2004 by John Wiley & Sons, Inc.
Figure 7.10.1 LAD front page.
Table 7.10.1 Links Available on the LAD Front Page (also see Fig. 7.10.1)

Link            Function
Register        Provides a resource that allows prospective new users to apply for a
                user account on the LAD server. This process is fully covered in
                Appendix A at the end of this unit.
Login           Provides the user account login portal. This function is detailed in the
                following steps of this protocol.
Resume          Provides the ability for a preauthenticated session to return to the LAD
                main menu without passing through the login portal. This function is
                only available if a user authenticates with LAD, navigates the browser
                to another Web site, and then wishes to return to LAD without closing
                the browser window. If the browser window is closed, re-authentication
                with LAD will be required by the server.
Tutorials       Provides basic tutorials on how to interact with and navigate the
                various functions of the LAD server.
Publications    Provides a comprehensive list of publications that have been created
                on this LAD server. Publications are a very exciting feature of the
                LAD environment and are fully covered in Basic Protocol 8.
1. Using Internet Explorer or Mozilla, navigate to http://[your lad server]/ilat/. Note
that the phrase [your lad server] indicates the fully qualified network hostname of
the LAD server to be accessed. The LAD front page as pictured in Figure 7.10.1 will
appear. Functions of the links available on this page are described in Table 7.10.1.
2. The user is now ready to authenticate with the LAD authentication mechanism. This
protocol and all of the following protocols of this unit will assume the pkillion
username (those using their own LAD user account should modify accordingly).
Select the login link on the screen shown in Figure 7.10.1, then fill in the login
screen with the following information:
User Name: pkillion
Password: lad4me
The LAD main menu, as pictured in Figure 7.10.2, will appear. This will be the
starting place for many of the protocols in the remainder of this unit.
Figure 7.10.2 LAD main menu.
EXPERIMENT SUBMISSION
Experiment submission has several prerequisite actions that must be performed before a
user is able to load microarray data into LAD. Appendix B at the end of this unit details
these operations. They include the creation of SUID, plate, and print global resources to
describe the specific microarray slide. Additionally, in Appendix A at the end of the unit
there exist specific instructions for the creation of an unrestricted user account that will
be utilized by an authenticated user to submit the experimental data.
BASIC PROTOCOL 2
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
Files
The GenePix Pro results file (GPR): This file must be created with GenePix Pro
(from Molecular Devices), versions 3.x, 4.x, 5.x, or 6.x. It is highly
recommended that this file not be edited in any way (especially with Microsoft
Excel) once the GenePix Pro software has created it. This file is a text file,
however, and can easily be read by custom analysis software applications.
The GenePix Pro settings file (GPS): This file is the result of gridding operations
performed within GenePix Pro. It is binary, rather than text, and is not processed
in any way by LAD. LAD stores it for the purpose of complete warehousing of
all information important and relevant to the experimental data.
GenePix Pro Green Channel Scan File (TIFF One): The single-channel TIFF image
of the green (532 nm) wavelength. GenePix Pro has two options for the creation
of TIFF images. First, images may be saved as multi-channel, thus combining
both the quantitation of the green and red wavelengths into one single file.
Additionally, images may be saved as single-channel. This format separates the
green and red data to separate files. LAD is only compatible with the latter
option—images must be saved as single-channel TIFF images in order to be
loaded into LAD as part of an experiment. TIFF One represents one of these
files, most likely the green channel.
GenePix Pro Red Channel Scan File (TIFF Two): The single-channel TIFF image
of the red (635 nm) wavelength. This file is the second of two TIFF files that
represent the image as captured by a microarray scanner.
Sample files: LAD provides sample files that can be used to upload an experiment
for the included Saccharomyces cerevisiae print. These files will be referenced
for the remainder of this protocol. If one is using one’s own experiment files, one
will need to substitute file information appropriately. These files are located in
the directory /lad/install/yeast_example_print/.
The sample experiment file names are:
250.gpr
250.gps
250 ch1.tif
250 ch2.tif
Each of these four files should be ready for submission. It is recommended that
they be named with a consistent naming scheme that implies their functional
linkage as a group of files relevant to a single microarray hybridization. For
example, foo.gpr, goo.gps, twiddle_1.tif, and twaddle_b.tif
would be an inappropriate naming scheme while SC15-110.gpr,
SC15-110.gps, SC15-110-green.tif, and SC15-110-red.tif
would be an excellent reminder of experimental association. In this manner, the
example files are both appropriately and inadequately named. They have a
consistent naming scheme (250). This simple number, however, is probably
inadequate for describing the exact experiment that these files represent. It will
be seen in the coming steps how consistency becomes important for archiving
and querying many related experiments.
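A naming scheme of this kind can be checked mechanically before upload. The following sketch, which is not part of LAD, reports the stem shared by a set of file names; the function name and the set of trailing separator characters that are trimmed are assumptions made for illustration.

```python
import os
from pathlib import Path


def shared_stem(filenames):
    """Return the leading text common to all file names (extensions ignored)."""
    stems = [Path(name).stem for name in filenames]
    # commonprefix compares character by character, so trailing separators
    # left over from suffixes such as "-green" are trimmed afterward.
    return os.path.commonprefix(stems).rstrip("-_ .")


good = ["SC15-110.gpr", "SC15-110.gps", "SC15-110-green.tif", "SC15-110-red.tif"]
bad = ["foo.gpr", "goo.gps", "twiddle_1.tif", "twaddle_b.tif"]

print(repr(shared_stem(good)))  # 'SC15-110'
print(repr(shared_stem(bad)))   # ''
```

An empty result indicates that the four files share no common stem. A short result, such as the 250 shared by the example files, indicates a scheme that is consistent but may not be descriptive enough to identify the experiment. The check is only a heuristic: two different hybridizations with similar names would also share a prefix.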
Copy the experiment files to the server
1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to
access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).
2. Experiment creation is a two-step process. The first step involves the placement of
the experiment files into the user incoming directory. From the screen pictured in
Figure 7.10.2, select the Upload Experiment Files link. The screen shown in Figure
7.10.3 should appear.
3a. To upload files from a local drive: One at a time, use the Browse buttons to locate each
of the four files. Make sure to select the correct file for each label. When complete,
press the Upload Experiment Files button. This process may take from a few seconds
to a few minutes depending on the speed of the network connection between the
local computer and the LAD server. It is recommended that this operation not be
performed over dial-up or other low-bandwidth connections.
Once complete, confirmation will be received as illustrated in Figure 7.10.4. This indicates
that the files are now in the user incoming directory (on the LAD server). This is the
purpose of the system account created in Appendix A at the end of this unit—to provide
a user-specific locale in which experimental files may be placed for processing into the
database. It is recommended that the provided example files be used at this stage, as they
will be compatible with the print created in Appendix B.
Figure 7.10.3 Experiment file upload screen.
Figure 7.10.4 Experiment file upload results screen.
3b. Alternative file transfer methods: Alternatively, one could choose to use any other
method of transferring files from one’s client machine to the LAD server. Technologies such as FTP, SFTP, Samba, and NFS might provide a superior solution for the
secure transfer of larger file sets to the LAD server. If one is already working from
an X-Windows console on the LAD server itself, one would simply need to copy the
files to the appropriate incoming directory.
Experiment record submission
With the GPR, GPS, and TIFF files properly copied to the server it is now time to submit
the experiment to the database for processing.
4. From the screen pictured in Figure 7.10.2, select the Enter Experiments and Results
link. The screen shown in Figure 7.10.5 should appear. This screen allows one to
select the correct organism and decide whether one wishes to use batch submission
or single experiment submission.
Batch submission of experiments is useful for submitting a large number of experiments as a single transaction. Full coverage of batch
submission is outside the scope of this protocol; see the LAD Web site documentation for further details on this option.
5. From the Choose Organism pull-down menu, select “Saccharomyces cerevisiae.”
Choose the radio button labeled No under “Check here if you wish to submit
Experiments and Results by batch,” then press the button labeled Enter Experiments
into LAD.
6. The screen shown in Figure 7.10.6 should now appear. This is the screen that will
make it possible to fully describe this experiment in the database. It is important that
consistent and meaningful values be entered into the fields provided so that the long-term entry of experiments yields an environment that is navigable and thoughtfully
organized. The following options are for sample purposes and apply to the sample files
provided with the LAD distribution. Fill out the experiment submission screen with
the following values:
Figure 7.10.5 Experiment upload screen 1.
Figure 7.10.6 Experiment upload screen 2.
From the LAD Print Name pull-down menu, select SC1
Set Slide Name to a value of SC1-250
From the Data File Location pull-down menu, select 250.gpr
From the Grid File Location pull-down menu, select 250.gps
From the Green Scan File Location pull-down menu, select 250-ch1.tiff
From the Red Scan File Location pull-down menu, select 250-ch2.tiff
Leave MIAME Annotation File Location as “You have no MIAME file for
loading.”
The following fields and values are not depicted in the figure but should be set
accordingly:
Leave Experiment Date as the current date
Set LAD Experiment Name to a value of Sample Experiment
Set LAD Experiment Description to a value of Set of Experiment
Files that come with LAD
From the LAD Experiment Category pull-down menu, select Test
From the LAD Experiment SubCategory pull-down menu, select Test
Set Green Channel (CH1) Description to a value of Control Channel
Set Red Channel (CH2) Description to a value of Experimental Channel
From the Reverse Replicate pull-down menu, select N
From the Set Normalization Type pull-down menu, select Computed.
7. The Experiment Access section of this experiment submission screen, not depicted in
the figure, can be left with its default selections for now. Experiment access and issues
of experiment ownership and security will be fully discussed in Appendices A and
B at the end of this unit (also see Basic Protocol 3, step 7). Additionally, microarray
data normalization will be further discussed within Guidelines For Understanding
Results.
8. Press the Load Experiment into LAD button. A confirmation screen should now
appear. If errors appear, press the browser Back button, fix them, and resubmit
the screen by pressing the Load Experiment into LAD button.
Server-side experiment processing
9. Linux cron is a standard service, installed with every variety of Linux, that maintains
and executes scheduled system tasks. This is both useful and required for the server-side experiment processing step of experiment record submission. If Linux cron
has been previously configured to maintain and execute the loading script, then no
action is required at this time; proceed directly to the next step of this protocol.
Conversely, if this script is not executing on a regular schedule, it will be necessary
to execute it manually in order to process the submitted experiment into the database. Log into
the LAD server as the root user and execute:
/lad/www-data/cgi-bin/SMD/queue/LoadExpt2DB.pl
Appendix C of this unit describes how certain operations must be automatically executed
on the LAD server on some timed schedule. The authors suggest the utilization of the Linux
cron functionality for this purpose. One of the specific resources that is mentioned is the
script:
/lad/www-data/cgi-bin/SMD/queue/LoadExpt2DB.pl
When experiment record submission (steps 4 through 8 of this protocol) has been completed, a record
is placed in an operational loading queue of submitted experiments. This queue cannot
be processed in any way by the LAD Web interface. Rather, it must be monitored and
processed by the script listed above.
Experiment loading report
10. The experiment submission confirmation screen mentioned in step 8 (not shown in
figure) contains a link that will make it possible to watch the experiment being processed into the database. This page will refresh regularly and will update itself when
the experiment begins processing. The time of execution will depend on the action
(i.e., the script) executed in step 9 of this protocol. Additionally, other experiments
in the queue could delay the processing of your experiment, as they are sequentially
loaded into LAD.
Log examination
11. Carefully check the log for any error conditions the server may have detected. If the
experiment loads successfully, one can now proceed to the next protocol in this unit
(Basic Protocol 3).
BASIC PROTOCOL 3
EXPERIMENT SEARCHING
In order to analyze experimental data, it is necessary to successfully find and aggregate
experiments of interest. This protocol will detail the operations needed to accomplish
this goal.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
Software preparation
1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to
access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).
2. Upload the data as in Basic Protocol 2.
Experiment searching
3. From the LAD main menu (Fig. 7.10.2) select the Results Search link. The primary
screen from which one can set filters to locate a desired set of experiments will then
appear (Fig. 7.10.7). The search process is divided into three main methodologies:
Experimenter/Category/SubCategory search
Print search
Array List search
Figure 7.10.7 Experiment search screen.
Figure 7.10.8 Experiment search results screen.
All three of these methodologies, however, are globally filtered by the Select Organism filter. Each of these methods will now be demonstrated and described.
4. Perform organism filtering: From the Select Organism pull-down menu, select “Saccharomyces cerevisiae.” Press the Limit Lists By Organism button to limit all lists
present on the screen.
5a. To perform experimenter/category/subcategory search: The Experimenter, Category,
and SubCategory lists are a direct reflection of one’s group membership (see step 7,
below) combined with the previous Organism filter (step 4). It will only be possible
to see experiments to which one has group or user-defined access. By default, the
username that was entered at login, pkillion, is selected in the Experimenter list.
Select the radio button Use Method 1 to use the first method of searching. Press the
Display Data button to retrieve experiments that are compliant with the current filter
settings. A single experiment should now be seen, as only one experiment has been
uploaded at this point. The display will be similar to that in Figure 7.10.8. Press the
browser’s Back button to return to the search filter screen.
5b. To perform a Print search: Select the radio button Use Method 2 to use the second
method of searching. This makes it possible to search for experiments in a single
specific print batch. Note that any filter settings applied in the Use Method 1 section
are not applied to the search.
5c. To perform an Array List search: Select the radio button Use Method 3 to use the
third method of searching. This method utilizes array lists, which are stored sets of
ordered experiments that make it extremely easy to recall commonly grouped sets of
experiments. Array lists will be covered in much more detail in Basic Protocol 7.
Naming conventions
6. Figure 7.10.9 is from the original LAD server at the Iyer Lab, University of Texas at
Austin. This screen is presented for two primary reasons. First, it is intended to convey
what LAD search results will look like when a LAD database contains many more
experiments. Additionally and more importantly, these experiments are named in a
very consistent and meaningful manner. It is important that grouped sets of experiments be uploaded in this manner. Subsequent data analysis and storage will be greatly
enhanced through this operational habit. This and other best practices for the long-term storage of microarray data are more fully elaborated upon in the Commentary.
Figure 7.10.9 Experiment search results screen from the original LAD server at the Iyer Lab,
University of Texas, Austin.
Figure 7.10.10 Experiment editing screen.
Adding experiment group and user permissions
7. From the LAD main menu as depicted in Figure 7.10.2, select the Results Search
link. Do not change the search methodology or filter values. Press the Display Data
button to display the experiment uploaded in a previous protocol (format will be as in
Fig. 7.10.8). Select “Edit” to navigate to a complete experiment information–editing
screen as shown in Figure 7.10.10. From this screen, one has the ability to edit most
information that was previously associated with the experiment at the time it was
submitted to the database. The permissions options, not depicted in the figure but
which would be at the very bottom of the screen, should be noted. One can utilize
these Group and User entries to grant access to one’s experiments to users and groups
other than those in one’s default group.
SINGLE-EXPERIMENT DATA ANALYSIS
LAD-based microarray data analysis is generally performed in one of two differing pathways of interaction. Single-experiment data analysis is focused upon both the qualitative
and quantitative scrutiny of experimental data from a single microarray. This protocol
will focus upon the toolsets and options available in LAD. Basic Protocol 5 will explore
the other domain of LAD data investigation, multiple-experiment comparative analysis.
BASIC PROTOCOL 4
In this example, LAD is asked to show high-quality spot data, joined dynamically with
genomic annotations, sorted in order of chromosome, where the expression of the transcripts represented by each of the spots is significantly up-regulated.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
Software preparation
1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion
to access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1). From the
LAD main menu (Fig. 7.10.2), select the Results Search link and navigate to the
post-search experiment list (Fig. 7.10.8) as described in Basic Protocol 3.
Experimental data browsing
2. For each experiment, there will be a set of icons as pictured in Figure 7.10.11. Each
icon leads to a unique function with respect to single experiment data analysis.
3. Browsing, filtering, and analysis of the original uploaded data, in line with genomic
annotations depending on the organism being utilized, are available through the Data
icon. For the sample experiment uploaded previously (Basic Protocols 2 and 3), select
the Data icon. A screen will appear that is similar to the one pictured below in Figure
7.10.12.
Figure 7.10.11 Single experiment analysis icons.
Figure 7.10.12 Single experiment data filtering.
4. The screen depicted in Figure 7.10.12 provides access to several functions that make
it possible to investigate the experimental data.
Set the Sort By pull-down menu to a value of Chromosome.
Set the Sort By order pull-down menu (right of the above) to a value of Descending.
Using the Ctrl key (or the Cmd key on Macintosh), add the following values to the
Display list:
Log(base2) of R/G Normalized Ratio (Median)
Log(base2) of R/G Normalized Ratio (Mean).
Select the following values from the Annotation list (to the right of the Display):
Function
Gene.
Select the check box for Make downloadable files of ALL returned records.
Activate the default filters #1, #2, and #3 by selecting their corresponding check
boxes.
Activate an additional filter #4 by selecting its corresponding check box.
Change the filter #4 name to Log(base2) of R/G Normalized Ratio (Median).
Set the filter #4 operation to a value of >=.
Set the filter #4 value to a value of 3.
With regard to the data presented, it should be recognized that this data browsing functionality is being used to ask LAD to do more than simply display data. The options available
have been used to ask a real biological question. It is important to remember that the
end goal of any data analysis task is exactly that—to elucidate the biological phenomena
behind the data.
In this example LAD has been asked to show high-quality spot data, joined dynamically
with genomic annotations, sorted in order of chromosome, where the expression of the
transcripts represented by each of the spots is significantly up-regulated.
The sort is accomplished through use of the Sort By drop-down list. The in-line genomic
annotations are installed with LAD and are made viewable through the extra box of
annotation columns. Selecting any or all of them automatically causes their parallel
inclusion with the raw data columns that are selected for browsing. The filtering for high-quality spots is accomplished by the activation of default filters #1, #2, and #3. The first
default filter drops all spots that were flagged as bad during the GenePix Pro gridding
process. The second and third filters ensure that the per-channel signal intensity is above
some minimal threshold. The minimal threshold values provided are completely arbitrary
and may or may not be applicable to a particular set of experimental data. Finally, the
most up-regulated spots are selected through the inclusion of a custom filter #4. This filter
only allows spots whose Log(base2) of R/G Normalized Ratio (Median) is greater than
or equal to 3. Because this ratio is expressed in log2, this threshold
translates to a 2^3, or 8-fold, up-regulation of the expression level of the experimental sample
(red) relative to the control sample (green).
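For readers who export the data and wish to reproduce this query outside of LAD, the logic of the four filters and the sort can be sketched as follows. The dictionary keys, the function names, and the use of "greater than or equal to" for the intensity thresholds are assumptions made for this illustration; they do not correspond to LAD column names or to its internal implementation, and the intensity threshold is deliberately left for the caller to choose.

```python
def fold_change(log2_ratio):
    """Convert a log2(R/G) ratio into a linear fold change."""
    return 2.0 ** log2_ratio


def select_spots(spots, min_intensity, min_log2_ratio=3.0):
    """Apply the four filters, then sort by chromosome in descending order."""
    kept = [
        s for s in spots
        if not s["flagged_bad"]                   # filter #1: drop flagged spots
        and s["ch1_intensity"] >= min_intensity   # filter #2: green channel signal
        and s["ch2_intensity"] >= min_intensity   # filter #3: red channel signal
        and s["log2_ratio"] >= min_log2_ratio     # filter #4: strongly up-regulated
    ]
    return sorted(kept, key=lambda s: s["chromosome"], reverse=True)


print(fold_change(3))  # 8.0
```

A log2 ratio of 3 therefore corresponds to an 8-fold change, a ratio of 1 to a 2-fold change, and a ratio of 0 to no change.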
5. Press the Submit button to query the database based on all of these values. The
resulting screen will be similar to that pictured in Figure 7.10.13. Note the dynamic
links that are available in addition to the experimental data presented:
Zoom: Shows individual spot data and genomic annotations
Whole: Shows spot location on overall microarray
SGD: Dynamic link to gene record in Saccharomyces Genome Database
(Cherry et al., 1998)
One should now experiment with other queries that may perhaps be more biologically
meaningful for one’s particular field or biological questions of interest.
Figure 7.10.13
Single experiment data query results.
Figure 7.10.14
Single experiment data view screen.
Experiment details viewing
6. Return to the experiment list screen as pictured in Figure 7.10.8. For the sample
experiment uploaded previously (Basic Protocols 2 and 3), select the View icon. A
screen will appear that is similar to the one pictured in Figure 7.10.14. Here one has
complete access to all of the experimental annotation provided when this experiment
was submitted to the database.
Additionally, there are links to visualization tools; these will be explored and described
in step 7. Finally, in the Submitted Files section of this screen it is possible to access the
original GenePix Pro files that were uploaded with this experiment. These files may be
retrieved at any time from LAD for the purpose of analysis in other toolsets. It is important
to note that only the LAD user who submitted this experiment will have access to these
links. Even members of the default group, who by definition have access permission to
query one’s experimental data, cannot download these original data files.
Experiment visualizations: Data distribution
7. On the screen pictured in Figure 7.10.14, select the Data Distribution link. A browser
window similar to the one pictured in Figure 7.10.15 will be displayed. The plot
shown can be extremely useful with respect to understanding the distribution of your
data relative to the log-center of zero. Close the extra browser window that opened
with the selection of this tool.
The plot is a histogram where the vertical axis is an absolute count of spot frequency. The
horizontal axis is a spread of both positive (up-regulated) and negative (down-regulated)
normalized log-ratio bins. The computed normalization value shown is a direct function
of the data present. In essence, this computed normalization value is a coefficient that was
calculated when the experiment was uploaded to the database. Applied to every spot’s
red-channel value, the normalization coefficient serves to bring the overall normalized log-ratio distribution to an arithmetic mean of zero. Microarray data normalization is more
completely discussed in Guidelines for Understanding Results, below. Please refer to this
section for a more complete description of the concepts and caveats of data normalization.
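The role of the normalization coefficient can be illustrated with a short calculation. The sketch below assumes a single global coefficient chosen so that the mean log2 ratio becomes zero, which is one reading of the description above; the function name and the intensity values are invented for the example, and LAD's own computation may differ in detail.

```python
import math


def normalization_coefficient(red, green):
    """Coefficient c such that the mean of log2(c * R / G) is zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    mean_log = sum(log_ratios) / len(log_ratios)
    return 2.0 ** (-mean_log)


# Made-up net intensities for four spots.
red = [400.0, 1600.0, 800.0, 3200.0]
green = [200.0, 400.0, 800.0, 800.0]

coeff = normalization_coefficient(red, green)

# The coefficient is applied to the red channel only.
normalized = [math.log2(coeff * r / g) for r, g in zip(red, green)]
mean_after = sum(normalized) / len(normalized)

print(f"coefficient = {coeff:.4f}, mean log2 ratio after = {mean_after:.6f}")
```

Here the raw log2 ratios are 1, 2, 0, and 2 (mean 1.25), so the coefficient is 2^-1.25 and the normalized distribution is centered on zero.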
Figure 7.10.15
Single experiment data distribution.
Figure 7.10.16
Single experiment plot data.
Experiment visualizations: Plot data
8. On the screen pictured in Figure 7.10.14, select the Plot Data link. A browser window
similar to the one pictured in Figure 7.10.16 will be displayed. This visualization tool
is utilized in two steps. First, as shown in Figure 7.10.16, one is presented with options
to create a scatter plot of one data column versus another within the context of this
single microarray experiment. Second, these options are used to actually generate
the desired plot.
This scatter plot can have any GenePix Pro or LAD data value stored in the database on
either axis. Additionally, these values can be rendered in either log2 or linear space with
respect to the transformation of the data values. A common plot utilized in microarray data
analysis is the MA plot (Dudoit et al., 2003). The MA plot demonstrates the relationship
between the intensity of a spot and its log2 ratio. What relationship should these two
variables have in a typical microarray experiment? Ideally, none. The net intensity of a
spot should bear no relationship to the ratio of individual wavelength values. Nonetheless,
biases can and will occur and should be checked for through the use of visualization
toolsets.
A traditional MA plot consists of the scatter-plot rendering of:
log2(R/G) versus (1/2) log2(R × G).
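As a worked example of these two quantities, the following sketch computes M = log2(R/G) and A = (1/2) log2(R × G) for a single spot, together with the log2 of the summed intensities that stands in for A when only a sum column is available. The function names and intensity values are illustrative assumptions.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


def log_sum_intensity(red, green):
    """Approximate intensity axis: log2 of the summed channel intensities."""
    return math.log2(red + green)


m, a = ma_values(1024.0, 256.0)
x_approx = log_sum_intensity(1024.0, 256.0)
print(m, a, round(x_approx, 3))
```

For R = 1024 and G = 256, M is 2 and A is 9. The summed-intensity axis gives a different number (about 10.3), but it increases with A and so serves the same purpose when looking for intensity-dependent trends.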
9. This step will demonstrate LAD’s ability to closely approximate this scatter plot
through the Plot Data functionality. Set the following values on the screen depicted
in Figure 7.10.16:
For the X-Axis Scale select Log(2)
For the Values to Plot on X-Axis select the column SUM MEDIAN
For the Y-Axis Scale select Linear
For the Values to Plot on Y-Axis Scale select the column
LOG RAT2N MEDIAN.
10. Press the “Plot ‘em!” button to create the scatter plot. The scatter plot that is rendered
should be similar to the one pictured in Figure 7.10.17. Close the extra browser
window that opened with the selection of this tool.
The example plot shown in Figure 7.10.17 demonstrates the null relationship that one
expects to find between the log-ratio and the sum of median intensities of the spots. One
expects and observes a fairly flat distribution of log-ratio values, centered around zero.
There are, of course, fewer spots at the higher range of spot intensity, but this is to be
expected—there are always going to be more low-intensity spots than high-intensity spots.
If there were a true relationship between log-ratio and spot intensity, one would see either
an upswing or downswing in the log-ratio as the plot extends out towards higher intensity
values.
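One simple numerical counterpart to this visual check is to fit a straight line to the log-ratio as a function of log intensity and inspect the slope. This is a stand-in chosen for brevity, not a LAD function; intensity-dependent bias is more commonly assessed with a local regression. All values below are invented.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


log_intensity = [8.0, 9.0, 10.0, 11.0, 12.0]

# Log-ratios scattered around zero: no relationship with intensity.
flat = [0.1, -0.1, 0.0, 0.1, -0.1]
# Log-ratios that climb with intensity: an "upswing".
biased = [-0.4, -0.2, 0.0, 0.2, 0.4]

slope_flat = least_squares_slope(log_intensity, flat)
slope_biased = least_squares_slope(log_intensity, biased)
print(round(slope_flat, 3), round(slope_biased, 3))
```

A slope near zero is consistent with the null relationship described above, whereas a clearly nonzero slope suggests that the ratio depends on intensity.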
Figure 7.10.17
Plot produced via single experiment plot data screen (also see Fig. 7.10.16).
Figure 7.10.18
Single experiment ratios on array (sample experiment).
The Plot Data tool can, of course, be utilized to generate many other biologically meaningful combinations of scatter plots through the utilization of other column and numerical
transformation values in the plot generation. It can also be used to detect spatial biases
in ratios or intensities on the microarray.
Experiment visualizations: Ratios on array
11. Return to the screen pictured in Figure 7.10.14 and select the Ratios on Array link.
A browser window similar to the one pictured in Figure 7.10.18 will be displayed.
This visualization tool is intended to aid in the location of spatial bias with respect to
patterns of hybridization across the surface of the actual microarray chip. Aside from
bias introduced through the intentional design of the microarray chip, one should
not expect to see a correlation between the log-ratio of a spot and its geographic
location on the array. This tool will aid in the detection of such biases. The data
presented can be analyzed and manipulated in three ways (steps 12a, 12b, and 12c).
12a. First, through manipulation of thresholds required to render the blue, amber, and
dim spots, one can use simple visual analysis to identify spatial bias. An example
of significant spatial bias is depicted in Figure 7.10.19.
12b. Second, the ANOVA (analysis of variance) with respect to log-ratio as it relates to
microarray chip sector can be used to expose a print-based bias with respect to the
experimental data captured by this microarray experiment (Kerr et al., 2000).
Figure 7.10.19
Single experiment ratios on array with bias shown.
12c. Finally, the ANOVA of the log-ratio as it relates to the 384-well plate has the ability
to expose experimental bias introduced by specific print plates during the fabrication
of microarrays.
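The idea behind both ANOVA options can be sketched with a one-way F statistic, where spots are grouped by sector (or by print plate) and the variation between group means is compared with the variation within groups. The function below is a textbook one-way ANOVA written for illustration; it is not LAD's implementation, and the log-ratio values are invented.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Log-ratios grouped by sector; the third sector is shifted upward.
biased_sectors = [[0.1, -0.1, 0.0], [0.0, 0.1, -0.1], [1.0, 1.1, 0.9]]
# Same spread, but no sector stands out.
unbiased_sectors = [[0.1, -0.1, 0.0], [0.0, 0.1, -0.1], [-0.1, 0.0, 0.1]]

f_biased = one_way_f(biased_sectors)
f_unbiased = one_way_f(unbiased_sectors)
print(round(f_biased, 2), round(f_unbiased, 2))
```

A large F value, as in the first data set, indicates that the log-ratio depends on the grouping factor, which is the signature of a print-based or plate-based bias.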
Experiment visualizations: Channel intensities
13. Return to the screen pictured in Figure 7.10.14 and select the Channel Intensities link.
A browser window similar to the one pictured in Figure 7.10.20 will be displayed.
This tool is very similar to the Data Distribution tool that was utilized in step 7 of
this protocol.
There are two primary differences: (1) the histogram displays the distribution of net
intensities of each channel (green and red) rather than the distribution of the log-ratio
values; and (2) the data can be viewed as either normalized or prenormalized values. This
is useful for visualizing the effect that data normalization has had upon your raw data.
14. On the screen depicted in Figure 7.10.20, change the “Channel 2 normalization” to
a value of “Non-normalized.” Press the Submit button to view the result.
The change may be rendered too quickly for one to discern the difference in the plot.
The authors of this unit suggest that the Web browser’s Back and Forward buttons be used
to quickly move back and forth between the current screen and the previous one. One should
see a movement of the red channel histogram while the green channel remains steady. This
is simply because the normalization coefficient is only applied to the red channel’s raw
data, never the green. For more information on microarray data normalization please see
Guidelines for Understanding Results, below.
Figure 7.10.20
Single experiment channel intensities.
Figure 7.10.21
Select data grids (filters).
Data filters to gridded microarray image map
15. Return to the experiment list screen pictured in Figure 7.10.8. For the sample experiment uploaded in a previous protocol (Basic Protocols 2 and 3), select the View
Array Image and Grids icon. A screen will appear that is similar to the one pictured
in Figure 7.10.21. This tool allows one to again explore the relationship between
one’s raw data and its distribution across the surface of the actual microarray image.
Activate default filters #1, #2, and #3 (leaving the filters at their default values). Press
the Submit button.
16. An image map of the entire microarray will now appear, as depicted in Figure 7.10.22.
Spots that are surrounded by a box are the ones that passed the filter settings. Unboxed
spots are spots that did not pass the filters provided. Individual spots are clickable in
order to zoom to the spot-specific data for the experiment being browsed.
Figure 7.10.22
Data grids (results).
Experiment details editing
17. Return to the experiment list screen as pictured in Figure 7.10.8. For the sample
experiment uploaded in a previous protocol (Basic Protocol 2 or 3), select the Edit
icon. A screen will appear that is similar to the one pictured below in Figure 7.10.10.
This screen allows one to perform three functions.
a. First, it is possible to alter and resubmit any of the annotations provided when
the experiment was submitted. This is useful for maintaining a consistent and
informative naming scheme as one begins to load more and more experiments
into LAD. One might eventually adopt a consistent naming strategy, and therefore
it might become desirable to revisit older experiments to bring them in line with
that naming convention.
b. Second, it is possible to renormalize one’s data with a new normalization coefficient
if desired. This can often be quite useful when one has utilized some other analysis
program to determine a custom normalization coefficient.
c. Finally, the Remove Access and Add Access functions, not depicted in the figure
but near the bottom of the screen, make it possible to either add to or remove access
permissions to this specific experiment. Experiment access, the consequences of
group membership, and the global access model implemented by LAD are fully
described in Appendix B of this unit.
Experiment deletion
18. Return to the experiment list screen as pictured in Figure 7.10.8. The Delete icon
brings up a screen that makes it possible to permanently delete a particular experiment.
This function is to be used with extreme caution. Deletion of an experiment not only
expunges its data from the database but causes the deletion of the archived GPR,
GPS, and TIFF files. One should only delete an experiment if one truly intends to
permanently remove it from the database.
BASIC
PROTOCOL 5
MULTIEXPERIMENT DATA ANALYSIS
Thus far it has been shown that LAD, like many microarray databases, is an extremely
powerful software environment. It provides solutions to a vast number of problem domains that are encountered in microarray data analysis. How do I safely store all of
my experiments in a central repository? How do I keep the experiments functionally
organized and annotated over time? How do I provide secure yet flexible data access to
a group of possibly geographically disparate research scientists? How can I systematically analyze my experiments to detect false positives, experimental artifacts, microarray
fabrication biases, and meaningful biological relevance? Many of these questions have
been addressed in the preceding protocols.
Where microarray databases truly come into their own, however, is in the analysis of
multiple microarray experiments. The intrinsic ability of LAD, powered by its relational
database, to filter, annotate, and analyze thousands of genes across a large number of
related experiments makes it a powerful analysis tool for obtaining biological insights
from microarray data. In this protocol, what is often called the pipeline of microarray
analysis will be described. This pipeline is the process by which one takes more than one
experiment and works with the combined dataset as a unit.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to
access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).
2. Prepare multiple data sets (Support Protocol).
3. From the LAD main menu as depicted in Figure 7.10.2 select the Results Search
link. The familiar experiment search filters screen (Fig. 7.10.7) will appear. Make
sure the filter radio button is set to a value of Use Method 1. Change no filter settings.
At the bottom of the screen, press the Data Retrieval and Analysis button. A screen
will appear (not depicted) that makes it possible to select multiple experiments for
parallel analysis. In the selection box entitled “Select experiment names from the
following list” on that screen select a few experiments for analysis. Please note that
all experiments must be from the same organism in order to use them in the multiple
experiment analysis pipeline.
Gene selection and annotation
4. On the experiment selection screen (not depicted) mentioned in the previous step,
press the Data Retrieval and Analysis button to proceed into the multiple experiment
analysis pipeline. The first step of the process is a set of filters that specifically handle
the selection of genes and external genomic annotations. The gene and annotation
selection screen is as shown in Figure 7.10.23.
5. First, specify genes or clones for which to retrieve results. This substep of the analysis
pipeline allows one to provide either a text file or typed list containing a specific list
of gene names to which one wishes to constrain the analysis. Select the All radio
button. By selecting All it is indicated that one does not wish to perform any filtering
of genes in this manner. As shown, the specific list of genes can be provided in one
of two ways.
a. A file can be placed in one’s own user genelists directory on the LAD
server. This list contains the sequence names that were given to the genes when
the sequences were uploaded to the database. Please consult Appendix B for
further details on sequence names.
b. A list of gene names can be placed in the provided text-area, each delimited by
two colons.
6. Next, decide whether, and how, to collapse the data. For this example, leave the default
selection. As indicated, this default behavior will cause LAD to detect duplicate gene
names within the context of a single experiment and will average the value of the
spots accordingly.
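The default collapsing behavior amounts to grouping spots by gene name within one experiment and averaging their values. A minimal sketch follows; the function name and the example gene names and values are chosen purely for illustration.

```python
from collections import defaultdict


def collapse_duplicates(spots):
    """Average the values of spots that share a gene name.

    `spots` is a list of (gene_name, value) pairs from one experiment.
    """
    grouped = defaultdict(list)
    for name, value in spots:
        grouped[name].append(value)
    return {name: sum(vals) / len(vals) for name, vals in grouped.items()}


# Two spots for ACT1 on the same array are averaged into one value.
collapsed = collapse_duplicates([("ACT1", 0.5), ("CDC28", 2.0), ("ACT1", 1.5)])
print(collapsed)
```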
Figure 7.10.23
Gene annotation filters.
7. Next, choose the contents of the UID column of the output file. For this example,
leave this box unchecked. This option will simply include the database keys for each
of the spots in the output file of this process.
8. Next, choose the desired biological annotation. Select Function and Chromosome
by holding down either the Ctrl key (Windows) or the Cmd key (Macintosh) while
clicking the selections. This selection box makes it possible to pick and choose from
the biological annotations available for the organism currently being analyzed. Note
that if a custom genelist is used, one can choose to include a second column of custom
annotations that will be retained if the “Genelist annotation” radio button is selected.
9. Finally, choose a label for each array/hybridization by selecting one of the radio
buttons. This will simply select how the experiments are labeled in the final output
file. Press the Proceed to Data Filtering button to bring up the raw data filters screen
(Fig. 7.10.24).
Raw data filters
The raw data filters are similar to the ones seen several times elsewhere within the LAD
environment. In the following steps LAD is once again selecting for genes that meet
some specific criteria—this time the criteria are based upon their actual data values. In
this example, most of the default selections will be used.
10. First, choose the data column to retrieve; for this example, choose Log(base2) of R/G Normalized Ratio (Mean). In terms of downstream analysis, this is nearly always the
most interesting value to select for each gene and experiment combination. One may
often wish to use LAD, however, to aggregate a very different value across a set of
experiments. Please note that use of values other than ratios will disable many subsequent analysis and visualization options. An example includes hierarchical clustering
analysis, a tool that assumes it is receiving log-transformed ratio values.
Figure 7.10.24
Raw data filters.
11. Next, decide whether to filter by spot flag. For the example here, leave this option
selected. We do not wish to include spots that have previously been flagged as bad
in each of the experiments. This flagging occurred during the GenePix Pro gridding
process.
12. Next, select criteria for spots to be selected. Activate filters #1, #2, #3, #4, and #5 by
checking the corresponding check boxes.
Note that in this and many of the other features in LAD where raw data filters are present,
the ability to construct a unique Boolean combination of filters is provided. By default,
filters operate with an implicit Boolean AND logic—each filter is combined with the next
in a relationship that demands that resulting spots that pass the filters pass all of the
filters. Conversely, the Filter String text box below the filter definitions in Figure 7.10.24
can be utilized to design a unique Boolean combination of activated filters to allow for the
creation of more complex queries. For example, one could choose to create the following
for the filters we have activated:
(1 AND ((2 OR 3) AND (4 OR 5))) OR 6
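To make the logic of such a filter string concrete, the sketch below evaluates a balanced expression of this form against the pass/fail result of each numbered filter for one spot. It is a small recursive-descent evaluator written for illustration, not LAD's parser; giving AND a higher precedence than OR is an assumption, and the function name is hypothetical.

```python
import re


def evaluate_filter_string(expr, results):
    """Evaluate e.g. '(1 AND (2 OR 3)) OR 6' given {filter_number: bool}."""
    # Characters other than numbers, AND, OR, and parentheses are ignored.
    tokens = re.findall(r"\d+|AND|OR|\(|\)", expr.upper())
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_and()  # always parse the right side before combining
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of filter string")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return value
        if tok.isdigit():
            return results[int(tok)]
        raise ValueError(f"unexpected token {tok!r}")

    value = parse_or()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses or trailing tokens")
    return value


expr = "(1 AND ((2 OR 3) AND (4 OR 5))) OR 6"
spot_a = {1: True, 2: False, 3: True, 4: False, 5: False, 6: False}
spot_b = {1: False, 2: False, 3: False, 4: False, 5: False, 6: True}
print(evaluate_filter_string(expr, spot_a), evaluate_filter_string(expr, spot_b))
```

The first spot fails because neither filter 4 nor filter 5 passes and filter 6 fails; the second passes on filter 6 alone, regardless of the other filters.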
13. Last, decide on some image presentation options. For this example, leave “Retrieve
spot coordinates” selected and leave “Show all spots” not selected. The “Retrieve
spot coordinates” option will make it possible to see actual spot images in parallel
with synthetic hierarchical cluster spot coloring later in the analysis process. Proceed
to the next step by pressing the Proceed to Gene Filtering button.
Data transformation and gene filters
The remaining steps describe the final stage of the filtering process. In summary, specific
genes of interest were first selected, combining genomic annotations. Next, raw data
filters were used to again select for genes that meet some quantitative thresholds with
respect to dozens of data metrics on a per-array basis. Now, at the final step, filters and
transformations are applied to a single chosen data column (typically a log-ratio) across
the set of selected experiments. The key distinction between this step and the previous
filtering step is that the previous step applies filters to values on each array: if a spot
fails the filter on a given array, that becomes a missing value, but there still may be data
for that same gene in another array. In this step, filters are applied to a given spot across
all arrays: if a spot fails a filter, it is excluded from subsequent analysis. The final data
filter and transformation screen is as pictured in Figure 7.10.25.
Figure 7.10.25
Final filters and data transformations.
14. Under the set of options labeled, “First, choose one of these methods to filter genes
based on data distribution,” select the radio button for “Do not filter genes on the
basis of data distribution.”
Percentile ranking is often quite valuable for certain microarray applications, such as the
ChIP-chip technique for studying protein-DNA interactions (Ren et al., 2000; Iyer et al.,
2001). For the purposes of this example, a filter that is available further down this screen
will instead be utilized.
15. Next, choose whether to filter genes and arrays based on the amount of data passing
the spot filter criteria. For this example, leave both “Only use genes with greater than
80% good data” and “Only use arrays with greater than 80% good data” not selected.
These filters are typically utilized to ensure that a given gene has a sufficient number of
data values across the set of experiments and that a given microarray has a sufficient
number of data values across all genes. Including genes or microarrays with too few data
points may cause noisy data to cluster together in subsequent steps, even though there is
no underlying basis for the clustering. It is important to remember, however, that these
filters can inadvertently drop genes or arrays that have true biological value. Care must
be taken when applying quality controls of this nature.
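The "good data" filter for genes can be sketched as follows, with missing values (spots that failed the raw data filters on a given array) represented as None. The 80% threshold follows the option label; the function names and data are illustrative assumptions. The same calculation applied to columns instead of rows gives the corresponding filter for arrays.

```python
def fraction_present(values):
    """Fraction of entries that are not missing (None)."""
    return sum(v is not None for v in values) / len(values)


def filter_genes(matrix, min_fraction=0.8):
    """Keep genes whose fraction of good data is greater than min_fraction."""
    return {
        gene: values
        for gene, values in matrix.items()
        if fraction_present(values) > min_fraction
    }


# Rows are genes, columns are five arrays; None marks a filtered-out spot.
matrix = {
    "GENE_A": [0.2, -1.1, 3.4, 0.0, 1.2],
    "GENE_B": [None, 2.1, None, 0.4, 0.3],
}

kept = filter_genes(matrix)
print(list(kept))
```

GENE_B has data on only three of five arrays (60%) and is dropped, which illustrates how a gene with real signal on a few arrays can be lost to this kind of quality control.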
16. Next, decide whether to center the data. For this example, leave “Center data for
each gene by: Means,” “Center data for each array by: Means,” and “Don’t iterate”
unselected.
This option is only relevant if one is retrieving log-transformed data values. Centering
gene data by means or medians would simply change each ratio value such that the average
gene ratio will be zero after centering. Likewise, centering by arrays will transform the
data such that the average ratio, in the context of a single experiment, will become zero.
Stated differently, centering data is a generic transformation that will cause the mean or
the median of either the row of gene values or the column of array values to be subtracted
from each individual gene or array value. Gene centering is typically utilized when one
has a set of microarray experiments that have all used a common reference sample in the
hybridization (Eisen et al., 1998).
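Mean centering in each direction can be illustrated on a small matrix of log-ratios in which rows are genes and columns are arrays. The function names and values below are invented for the example, and missing values are left out for simplicity.

```python
def center_genes(matrix):
    """Subtract each row (gene) mean from the values in that row."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    return centered


def center_arrays(matrix):
    """Subtract each column (array) mean from the values in that column."""
    n_rows = len(matrix)
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, col_means)] for row in matrix]


# Two genes (rows) measured on three arrays (columns).
log_ratios = [
    [1.0, 2.0, 3.0],
    [0.0, 4.0, 2.0],
]

by_gene = center_genes(log_ratios)
by_array = center_arrays(log_ratios)
print(by_gene)
print(by_array)
```

After gene centering every row averages to zero; after array centering every column averages to zero. Centering by medians would follow the same pattern with the median in place of the mean.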
17. Next, select a method to filter genes based on data values. For this example, select
the radio button labeled “Cutoff: select genes whose Log(base2) of R/G Normalized
Ratio (Mean) is (absolute value >) 3 for at least 1 array(s).”
This filter makes it possible to eliminate all but the most significantly
up-regulated and down-regulated genes based on the extent of differential expression. The
first part of the filter implements what is often called a fold-change cutoff, i.e., the elimination of
genes that do not show a significant amount of either up- or down-regulation relative
to a reference. The second part of the filter, the “for at least X arrays” part,
makes it possible to require a consistent change in expression.
If ten microarray experiments were being analyzed in this example, the value could be
changed from 1 to 5. Genes would then be required not
only to be significantly up- or down-regulated, but to be so in at least half of the experiments
analyzed.
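The two parts of this filter can be expressed in a few lines. The sketch below counts, for each gene, the arrays on which the absolute log2 ratio exceeds the cutoff and compares that count with the required minimum; the function name and data are illustrative.

```python
def passes_cutoff(log_ratios, cutoff=3.0, min_arrays=1):
    """True if |log2 ratio| > cutoff on at least min_arrays arrays."""
    hits = sum(1 for v in log_ratios if v is not None and abs(v) > cutoff)
    return hits >= min_arrays


genes = {
    "GENE_A": [3.5, 0.1, -0.2, 0.0],   # strongly changed on one array only
    "GENE_B": [-3.2, 3.1, 0.4, -4.0],  # strongly changed on three arrays
    "GENE_C": [1.0, 2.9, -2.0, 3.0],   # never strictly above the cutoff
}

at_least_one = [g for g, v in genes.items() if passes_cutoff(v, min_arrays=1)]
at_least_two = [g for g, v in genes.items() if passes_cutoff(v, min_arrays=2)]
print(at_least_one, at_least_two)
```

Raising the array requirement from 1 to 2 removes GENE_A, which changes strongly on a single array, while GENE_B remains because it changes in either direction on several arrays.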
18. Finally, decide whether to timepoint-transform data. For this example, leave “Transform all data by the following experiment” unselected.
This option is useful when analyzing time-course experiments. Often, in this situation,
one is interested in the relative expression profiles of genes with respect to some time zero. When this option is selected, the gene values in the indicated experiment are either
subtracted from (log number space) or divided into (linear number space) the gene values in
each of the other experiments.
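The transformation for a single gene can be sketched as follows. The function name, the choice of the first experiment as the reference, and the values are assumptions made for the example.

```python
def timezero_transform(series, reference_index=0, log_space=True):
    """Express each value relative to the reference experiment.

    In log space the reference is subtracted; in linear space each
    value is divided by the reference.
    """
    ref = series[reference_index]
    if log_space:
        return [v - ref for v in series]
    return [v / ref for v in series]


# One gene across three time points, first as log2 ratios, then as linear ratios.
log_profile = timezero_transform([1.0, 2.0, 0.5], log_space=True)
linear_profile = timezero_transform([2.0, 4.0, 1.0], log_space=False)
print(log_profile, linear_profile)
```

In both cases the reference experiment itself becomes the neutral value (0 in log space, 1 in linear space), and the remaining time points are read as changes relative to it.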
19. Press the Retrieve Data button to apply all of these filters to the experiments selected.
SUPPORT
PROTOCOL
SYNTHETICALLY GENERATE MULTIPLE EXPERIMENTS FOR BASIC PROTOCOL 5
If genuine multiple experiments are available, they should be used; the process described
in Basic Protocol 5 will be much more interesting in this case. If this is a new LAD
installation and one has only the single experiment that was uploaded in Basic Protocol 2,
the following steps should be performed to synthetically create a set of experiments in
the database.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
1. Re-execute the Experiment Submission protocol (Basic Protocol 2) several times. It
is recommended that at least three additional experiments be submitted if supplementing the single experiment already submitted.
If using the single set of sample experiment files that are provided with LAD it is possible
to utilize them multiple times with one modification to the procedure. Rather than utilize
a Computed normalization factor (see Basic Protocol 2, step 6), provide a unique User-Defined normalization factor for each additional experiment submitted. Simply make up
a normalization factor between 0 and 10 and provide that unique number with each
additional experiment submitted.
2. Once the new submissions are fully loaded into the database, restart Basic Protocol 5
at the beginning to utilize the newly submitted experiments, ensuring that a few
experiments are selected for analysis.
BASIC
PROTOCOL 6
MANIPULATING STORED DATASETS AND EXPORTABLE FILES
The result of the extraction process from Basic Protocol 5 will look similar to Figure
7.10.26. This screen is broken into two primary zones of interest. First, as the data are
retrieved, filtered, transformed, and annotated, the results are printed to the browser
window as a log of system activity. The extraction of data from many experiments can
be slow, depending on the hardware used. Additionally, if the filters that are used do not
eliminate many genes, the annotation of the resulting gene set can also be slow.
Links on the “Retrieving Data” Screen
The second zone of interest is the array of links provided at the bottom of the screen
shown in Figure 7.10.26, which are discussed individually in the following paragraphs.
Download PreClustering File
This link is to a file that is often called the PCL file (for preclustering). In essence, it
is a table of tab-delimited text information. Gene records are present as rows within the
file while columns define the individual array experiments selected for analysis. This file
is the input to both LAD’s hierarchical clustering functionality and many other
stand-alone analysis applications.
Figure 7.10.26
Data extraction results.
Download Spot Location File
This file can be considered a sister file to the PCL. It is not terribly meaningful on its
own but can become very useful when combined with the PCL file. These files will be
utilized together in the Cluster Regeneration section of this protocol.
Download Tab Delimited File
This file is a version of the PCL file where genomic annotations that are present have
been separated into their own tab-delimited columns rather than compressed into one
column. This file is not a valid PCL file for this reason. This file is excellent for use in
spreadsheet-type applications such as Microsoft Excel.
Save This Dataset
This link navigates to a LAD function that allows one to save this filtered dataset to the
database for analysis at a later time. This function is described later.
Clustering and Image Generation
This link navigates to LAD’s intrinsic hierarchical clustering and visualization toolset.
This function is described later in this protocol.
Cluster Regeneration
As mentioned above, the PreClustering and Spot Location files are the input into the
hierarchical clustering process. Often, one may wish to right-click on these links to save
these files to one’s local computer. It is then possible to utilize the Cluster Regeneration
link on the LAD main menu (as seen in Fig. 7.10.2) to reupload these files for immediate
entry into the hierarchical clustering toolset.
Stored Datasets
The steps required to properly apply filters for the extraction of multiexperiment data can
be a time-consuming and error-prone process. Once one has settled on an appropriate set
of filters that work, one may want to save the dataset, as extracted, for later analysis by
hierarchical clustering. Stored datasets provide this functionality. Stored datasets can be
considered as a more sophisticated solution to the same problem described above under
Cluster Regeneration. Rather than having to save the PCL and SPOT files to the local
computer and annotate them appropriately, one can simply save the dataset to the LAD
database with a meaningful name and description. The remainder of this section will
consist of walking through this process, as a gateway to hierarchical cluster analysis.
In the data retrieval page (Fig. 7.10.26) select the Save This Dataset link. A screen will
appear that is similar to the one pictured in Figure 7.10.27. Note that a list of any datasets
previously saved to the database is now visible.
For the Name field, set a value; for this example, type
Multi-experiment Dataset.
For the Description field, set a value; for this
example, type: The first dataset I created.
Press the Create Dataset button to save the dataset to the LAD database. A screen confirming the creation will appear. This dataset can now be retrieved at any time through
the Stored Datasets item on the LAD main menu (as pictured in Fig. 7.10.2).
Figure 7.10.27
“Save this dataset” screen.
Figure 7.10.28
Clustering and image generation—setup.
Hierarchical clustering
Return to the data retrieval page (Fig. 7.10.26) and select the Clustering and Image
Generation link. A screen will appear that is similar to the one pictured in Figure 7.10.28.
This screen is the setup for hierarchical clustering in LAD. Hierarchical clustering is a
powerful analytical method for the detection of differential and similar gene expression
in DNA microarray data (Eisen et al., 1998).
There are several configuration options available for this operation. The For Gene and
For Experiment options near the top of Figure 7.10.28 select whether the data will or
will not be hierarchically clustered in each of these dimensions. Additionally, if one
allows for gene and/or experiment clustering, one has the option to either center or noncenter the data during the analytical process. This concept is similar to the concept of data
centering that was previously discussed during the data extraction step of Basic Protocol 5
(step 16).
The Use option determines the distance metric that will be used by the hierarchical clustering process. LAD offers two distance metrics to measure the similarity
between data vectors. First, the Pearson Correlation is insensitive to the amplitude of changes
between two data vectors and instead focuses on similarity in their directionality as seen in
vector space. Second, the Euclidean Distance measures the distance between two points in space
and is thus sensitive to the total amplitude of either of the data vectors under consideration.
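The difference between the two metrics can be illustrated with a minimal sketch. It uses the textbook definitions of Pearson correlation and Euclidean distance, not LAD's own code, and the two expression vectors are invented for the example.

```python
import math

def pearson_correlation(x, y):
    """Textbook (centered) Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Straight-line distance between two points in expression space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical log-ratio profiles: gene_b has the same shape as gene_a
# but three times the amplitude.
gene_a = [1.0, -1.0, 2.0, -2.0]
gene_b = [3.0 * v for v in gene_a]

r = pearson_correlation(gene_a, gene_b)   # close to 1: identical direction
d = euclidean_distance(gene_a, gene_b)    # clearly non-zero: amplitudes differ
```

Under Pearson correlation the two profiles are treated as essentially identical, whereas Euclidean distance separates them because of the amplitude difference.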
In addition to variables that will direct the hierarchical clustering process, the LAD
preclustering setup screen (Fig. 7.10.28) contains options that allow for the manipulation
of the final cluster diagram. The Contrast for Image option makes it possible to control
the sensitivity of the coloring scheme to the level of highly expressed and repressed gene
sets. Higher contrast values will lead to a less sensitive coloring scheme: only dramatic
differences in expression will yield intensely colored spots. Additionally, options are
available that allow one to specifically determine the color of missing data spots as well
as the overall color scheme (red/green or blue/yellow).
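The effect of the contrast setting can be sketched as follows. This sketch assumes that the contrast value is the absolute log2 ratio at which the color saturates, as in common red/green cluster displays; the function name, the grey value used for missing data, and the exact scaling are hypothetical and are not taken from LAD.

```python
def log_ratio_to_rgb(log_ratio, contrast):
    """Map a log2 ratio to a red/green RGB triple, saturating at +/- contrast.

    Assumption: 'contrast' is the log2 ratio that yields full intensity.
    """
    if log_ratio is None:
        return (64, 64, 64)  # hypothetical dark grey for missing data
    # Clamp the scaled value to the range [-1, 1].
    scaled = max(-1.0, min(1.0, log_ratio / contrast))
    level = int(round(255 * abs(scaled)))
    # Positive log ratios are drawn in red, negative ones in green.
    return (level, 0, 0) if scaled > 0 else (0, level, 0)

# The same two-fold induction (log2 ratio of 1.0) under two contrast settings.
dim = log_ratio_to_rgb(1.0, 2.5)
bright = log_ratio_to_rgb(1.0, 1.0)
```

With these assumptions, a contrast of 2.5 renders a two-fold change at less than half intensity, while a contrast of 1.0 renders it fully saturated.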
Select the following options to generate a hierarchical cluster diagram.
For Gene: select Non-centered metric
For Experiment: select No experiment clustering
For Use: select Pearson correlation
For Make: select Hierarchical cluster
Figure 7.10.29
Clustering and image generation—generated.
Figure 7.10.30
Clustering and image generation—generated.
For Contrast for Image: select 2.5
For RGB color for missing data: select 75% grey
Color Scheme: select red/green
Check the “Show spot images” check box
Check the “Break up images” check box
For “After how many genes”: type 200.
Press the Submit Query button to begin the process of cluster creation. A screen similar
to the one pictured in Figure 7.10.29 will appear. This screen is an intermediate in terms
of visualizing the cluster diagram. It includes a thumbnail image of the cluster generated
as well as links to the true diagram.
On the page pictured in Figure 7.10.29, select the View Cluster Diagram with Spot
Images link. A screen similar to the one pictured in Figure 7.10.30 will appear. On this
screen, one can see that the clustering algorithm has separated up-regulated from down-regulated genes. Additionally, one can see that it is now possible to visualize the spot
images adjacent to the synthetic colors of the cluster diagram. Spots that perhaps should
have been flagged “bad” may be selected and flagged in the single-spot screen that will
be shown. This will prevent their reappearance in subsequent cluster analysis.
BASIC PROTOCOL 7
CREATING ARRAY LISTS
It is quite common for a researcher to continually utilize the same experiments over and
over in the data-analysis process. Often, a microarray researcher will spend significant
time performing the actual array hybridizations in order to complete the full spectrum
of experiments needed to define a complete research project or paper. At this point they
then upload the experiments en masse and turn towards the task of data analysis.
Because researchers often want to work with experiments as a defined group and not
repeatedly search for and aggregate them together manually, LAD has a function called
array lists.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
Software preparation
1. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to
access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).
Array list creation
2. From the LAD main menu as depicted in Figure 7.10.2 select the Results Search link.
The familiar experiment search filters screen (similar to Fig. 7.10.7) will appear.
3. Make sure the filter radio button is set to a value of Use Method 1. Change no filter
settings. At the bottom of the screen, press the Data Retrieval and Analysis button.
4. In the selection box entitled “Select experiment names from the following list” select
a few experiments. Press the button Create Array List. A screen will appear that is
similar to the one pictured in Figure 7.10.31.
5. In the field shown in Figure 7.10.31 titled “Enter a name for your arraylist:” enter
example-list. From the list entitled Starting List of Experiments select each
of the previously selected experiments. Use the “> Add >” button to move these
experiments to the list on the right hand side of the page (Experiments Included
within Array List).
Figure 7.10.31
Array list creation.
6. At the top of the screen, press the Create Array List button. Close this extra browser
window. In the browser window remaining, select the Longhorn Array Database link
at the top of the screen. There should now be an array list consisting of the selected
experiments.
7. Use Method 3 from the previously discussed Experiment Search screen (Fig. 7.10.7)
to utilize this array list.
BASIC PROTOCOL 8
OPEN DATA PUBLICATION
One of the most useful features of LAD is the ability to host data publications to the
public domain. In addition to robust and intuitive analysis tools and a proven architecture
for the warehousing of thousands of microarray experiments, users can now utilize LAD
to provide open access to their primary experimental data.
In short, a publication is an aggregated group of experiments that is associated with a
research journal publication. Publications and their associated data can be accessed by
the public domain through a secure portion of LAD that provides limited but powerful
functionality.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
View sample publication
1. If curious as to what an open access publication in LAD looks like, one can simply
browse on the LAD server at the University of Texas at Austin by opening a new
browser window and navigating to the URL http://www.iyerlab.org/hsf.
This page is dedicated to supplementary data for a publication detailing the use of DNA
microarrays to characterize the binding patterns of the transcription factor HSF in Saccharomyces cerevisiae (Hahn et al., 2004).
2. From this page select the “raw data” link. This link navigates directly to the LAD publication that provides direct access to the experimental data discussed and analyzed
within the paper.
Software preparation
3. Navigate to the LAD URL with a Web browser and log into LAD as pkillion to
access the main menu depicted in Figure 7.10.2 (see Basic Protocol 1).
Publication creation
4. The first step in publication creation is modification of the access permissions for the
experiments that will be aggregated together to define the publication itself. More
specifically, one needs to grant access to one’s experiments to a special WORLD
user. This user is created during the LAD installation. From the LAD main menu
(Fig. 7.10.2) select the Add Experiment Access By Batch link. A screen similar to
the one in Figure 7.10.32 will appear.
Figure 7.10.32
Add experiment access by batch.
This screen takes advantage of array lists to make it possible to rapidly grant WORLD
access to a large number of experiments. In this example the array list example-list
that was created in Basic Protocol 7 will be used.
5. Set the “Choose an arraylist” pull-down menu to a value of “example-list.” In the
User pull-down menu, select WORLD.
6. Press the Check Access button to proceed to the next step. A confirmation screen will
appear. Press the Add Access button to grant WORLD access to the experiments in
this array list. A second confirmation screen will appear. Select the Longhorn Array
Database link at the top of this screen.
Experiment set creation
Now that the experiments to be bundled into a publication have been modified for public
access, it is time to aggregate them into a package. This package is known as an Experiment
Set.
7. From the LAD main menu as depicted in Figure 7.10.2 select the Results Search
link. The familiar experiment search filters screen (Fig. 7.10.7) will appear.
8. Change the filter radio-button to a value of Use Method 3. In the array list pull-down
menu to the right of this option, select the array list “example-list.” At the bottom of
the screen, press the Data Retrieval and Analysis button. In the selection box entitled
“Select experiment names from the following list,” select each of the experiments
displayed.
9. Press the Create Experiment Set button. A screen will appear similar to the one pictured in Figure 7.10.33. Use the “> Add >” button to move the selected experiments
to the Experiments Included Within the Set list.
10. At the bottom of the screen press the Save Experiment Set button. A screen will
appear similar to the one pictured in the following Figure 7.10.34.
11. On the page depicted in Figure 7.10.34:
Enter Test Experiment Set in the Experiment Set Name box
Set Cluster Weight to a value of 1.0
Figure 7.10.33
Experiment set creation.
Figure 7.10.34
Experiment set creation, part 2.
Enter Creation of a sample experiment set in the Experiment
Set Description box
Select YES for “Do you want to publish this experiment set?”.
Press the Create Experiment Set button to complete the process. A confirmation
screen will appear. Press the Close Window button on the confirmation screen.
12. In the browser window remaining, select the Longhorn Array Database link at the
top of the screen.
Figure 7.10.35
Publication creation.
Publication creation
13. With the experiments of interest given proper WORLD access and having created
an experiment set bundling them together into a discrete package, it is now possible
to actually create the publication. From the LAD main menu as depicted in Figure
7.10.2, select the Publication link (under Enter Data). A screen will appear similar
to the one pictured in Figure 7.10.35.
It is recommended that one create publications within LAD when one actually has a
PubMed ID and citation for the actual journal publication. For the purposes of this
protocol, the LAD publication PubMed ID will be used.
14. For the World-Viewable Experiment Sets list, select the experiment set that was just
created, Test Experiment Set. Set the PubMed ID field to a value of 12930545.
15. Press the “Fill the fields below” button to populate the screen with PubMed information. With the fields populated by the remotely queried information it is possible to
proceed with the creation of the publication. Press the Create Publication button to
create the publication in the database. A confirmation screen will appear informing
that the publication has been created. Select the Longhorn Array Database link at the
top of the screen.
Publication browsing
16. To view the publication just created, select the link Publications (under Search) from
the LAD main menu (Fig. 7.10.2). This will bring up a list of the publications created
in LAD.
17. Browse the links and functions provided to get an idea of the information exposed to the public domain through the creation of publications. The primary link
to a publication is of the form: http://your_lad_server/cgi-bin/SMD/publication/viewPublication.pl?pub_no=1.
It is highly recommended, however, that this link never be provided to any journal for
inclusion in the supplementary data section of a publication. This URL structure may
change in future releases of LAD or one’s server name may change due to network
reorganizations. For this reason and many others, one should follow the strategy that was
demonstrated in steps 1 and 2 of this protocol. In the example viewed there, the authors
set up a static location to host the supplementary data Web site. Part of this Web site is
a link to the publication in LAD. Because one will probably have access to the page that
links to the LAD publication, one will always have the freedom to modify it. If the direct
link to the LAD publication were included directly in the written publication, it would be
forever documented that way, unchangeable, and possibly become incorrect and out of
date.
GUIDELINES FOR UNDERSTANDING RESULTS
Nomenclature
A GenePix Pro GPR file has been described as the source of primary microarray data with
respect to a single microarray experiment. What values reside in this file, however, and
how are these files used to describe the qualitative and quantitative aspects of a microarray
experiment?
A GPR file is a collection of over fifty distinct values for each spot on the microarray
used in an experiment. Feature intensities, background intensities, sums of intensities,
spot regression metrics, spot flags, mean/median intensities, and intensity ratios are but a
few of the values provided. Some of these values are reflections of primary data acquired
during the scanning process. Other values are informative statistical measures of the
primary data. It is important to recognize that even many of the primary data values are
products of processed and aggregated data values. For example, during the microarray
scanning process, a spot is divided into a grid of dozens of individual pixels. These
pixels are numerically combined to produce spot intensities. These intensities are then
mathematically combined to form the ratio values that comprise many of the distinct
values present for each spot in the GPR file. Each of these values eventually becomes a
data column with respect to their final appearance in LAD.
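The chain from pixels to spot intensities to ratios can be sketched in a few lines. This is a simplified illustration in the spirit of a background-subtracted ratio of medians; the function name, the choice of which channel is the numerator, and the pixel values are assumptions made for the example and do not reproduce the exact GenePix Pro calculations.

```python
from statistics import median

def spot_ratio(ch1_pixels, ch2_pixels, ch1_background, ch2_background):
    """Background-subtracted ratio of median pixel intensities for one spot.

    Returns None when the denominator channel has no signal above background.
    """
    ch1_signal = median(ch1_pixels) - ch1_background
    ch2_signal = median(ch2_pixels) - ch2_background
    if ch2_signal <= 0:
        return None
    return ch1_signal / ch2_signal

# Hypothetical pixel intensities for a single spot in two channels.
ratio = spot_ratio(
    ch1_pixels=[1200, 1100, 1300],
    ch2_pixels=[700, 650, 600],
    ch1_background=200,
    ch2_background=150,
)
```

In this invented example the spot has twice as much background-corrected signal in the first channel as in the second.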
Additionally, LAD manipulates many data values to produce distinctly new data metrics
that are only available after an experiment has been processed into the LAD database.
Normalized intensities and normalized ratios are just two examples of these unique LAD-produced data metrics.
Within the context of microarray data analysis, ratio values are often transformed to
log-ratio values. The purpose of this operation is two-fold. First, in linear ratio space,
the ratio values are centered around a value of 1.0. All genes that are up-regulated will
have values above 1.0, but with no clear upper bound. All genes that are down-regulated
will have ratio values that are compressed between 0 and 1.0. This situation creates a
numerically asymmetric situation: the distribution of linear ratio values is clearly not
normal. By transforming all ratio values to log-space (base 2), a situation is created in which the
data are approximately symmetrically distributed around zero. Genes that are up-regulated by two-fold
are now at a value of +1, while two-fold repression is quantified by a value of –1. One
can now see that log transformation has created a second benefit: it is possible to quickly
discern up-regulated genes from repressed genes simply by their numerical sign.
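A short worked example makes the symmetry explicit. The ratio values below are invented; base 2 is used because it maps a two-fold change to one unit, as described above.

```python
import math

# Hypothetical linear ratios: 4-fold up, 2-fold up, unchanged,
# 2-fold down, 4-fold down.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

# In log2 space, induction and repression of equal magnitude become
# equal distances on either side of zero.
log_ratios = [math.log2(r) for r in ratios]

# The sign alone now separates up-regulated from repressed genes.
up_regulated = [r for r, lr in zip(ratios, log_ratios) if lr > 0]
repressed = [r for r, lr in zip(ratios, log_ratios) if lr < 0]
```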
Microarray Data Normalization
Data from a single microarray experiment can be affected by systematic variation. Printing
biases such as plate and print-tip influences, total starting sample material amounts,
sample storage conditions, experimenter biases, dye effects, and scanner effects are just
a few of the possible sources of variation. This variation can significantly affect the
ability of a researcher to meaningfully compare results from one microarray experiment
to the results of another. LAD uses the technique of data normalization as an attempt to
cleanse the data of this variation (Yang et al., 2002). The end goal of normalization is
to, hopefully, remove nonbiological influences on the data, leaving only true biological
phenomena behind for analysis.
During experiment submission (see Basic Protocol 2, step 6), a user has the ability to
choose a normalization scheme for the experiment to be processed. One of two options
is available. First, a user may provide a User-Defined normalization coefficient. This
coefficient will be applied to the ratio values for each individual spot such that Normalized
Ratio = Coefficient × Ratio. This normalization coefficient can come from many sources.
The use of positive controls can lead to the algorithmic computation of a normalization
coefficient from the primary microarray data itself. A user might decide to provide a value
of 1.0 as a User-Defined normalization coefficient. This value would ensure that LAD
does not apply normalization to the experiment in process, instead allowing the primary
data to stand as recorded by GenePix Pro.
The other option for LAD-based data normalization is for the user to select Computed as the
Normalization Type. This selection will cause LAD to determine a global normalization
coefficient for the experiment and numerically apply the coefficient as indicated above.
Global normalization is based upon the assumption that in the average whole-genome
microarray experiment, most of the genes on the microarray are not going to show
differential expression. Expressed mathematically, this means that the expected median
ratio value across all spots should be one. This is based upon the biological phenomenon
whereby a small percentage of the total genome is being differentially expressed in a
given cell at a given time.
When a user requests a Computed normalization coefficient, LAD will identify a subset
of spot values that are considered high-quality according to GenePix Pro quality metrics such as regression correlation and the amount of signal intensity above a standard
deviation of average background intensity. The ratio values of these spots will then be
averaged, providing a mathematical foundation for the calculation of a normalization
coefficient. The actual calculation is a bit complex because it effectively sets the median
log ratio to zero. This is described in the LAD online help. If the median spot ratio value is
>1, the normalization coefficient will be <1, and vice versa. This will ensure that the
final application of the normalization coefficient to the primary data will yield a data
distribution that, when log-transformed, will be normally distributed around an average
of zero.
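The idea behind a global normalization coefficient can be sketched as follows. This is a deliberately simplified illustration, not LAD's actual calculation (which, as noted, is more involved and is documented in the LAD online help): it assumes the coefficient is simply the reciprocal of the median ratio of the high-quality spots, and the function names and ratio values are hypothetical.

```python
import math
from statistics import median

def global_coefficient(ratios):
    """Coefficient that brings the median ratio to 1.0 (median log ratio to 0)."""
    return 1.0 / median(ratios)

def normalize(ratios, coefficient):
    """Apply Normalized Ratio = Coefficient x Ratio to every spot."""
    return [coefficient * r for r in ratios]

# Hypothetical ratios from spots that passed the quality filters.
good_spot_ratios = [2.0, 4.0, 1.0, 2.0, 0.5]

coefficient = global_coefficient(good_spot_ratios)      # median > 1, so coefficient < 1
normalized = normalize(good_spot_ratios, coefficient)
median_log_ratio = math.log2(median(normalized))
```

A User-Defined coefficient of 1.0 passed to the same `normalize` function leaves every ratio unchanged, which corresponds to the case of no normalization described above.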
Care must always be taken, when selecting Computed normalization, that the underlying
assumption behind the computation of a global normalization coefficient is valid for the
experiment being submitted.
COMMENTARY
Background Information
One of the most successful and proven of publicly utilized microarray databases is
the Stanford Microarray Database (SMD; http://genome-www.stanford.edu/microarray/;
Sherlock et al., 2001). SMD’s source code has been freely available for some time, which
theoretically allows any researcher to install an SMD server within his or her own research
environment. SMD in this form, however, is based on proprietary hardware and software
infrastructure that would require a significant capital expenditure from any laboratory
that wished to operate such a server. Additionally, SMD was designed and written to
utilize the Oracle relational database–management system. The cost of initial investment
and long-term ownership of these technologies is significantly higher, however, than alternative open-source technology choices. Additionally, not only is Oracle expensive, it
is a very demanding database in terms of the expertise required of professional database
administrators to maintain it as a piece of software infrastructure.
Given the numerous strengths and proven nature of SMD, the authors of this unit wanted
to adapt it to run on a free, open-source, widely available, and powerful operating system
and relational database. The combination of Linux and PostgreSQL was chosen to replace
Solaris and Oracle. The Longhorn Array Database is the product of this effort (Killion
et al., 2003); it is a completely open-source incarnation of SMD. Additionally, new
features have been developed to enhance its warehousing and analytical capabilities.
Critical Parameters and Troubleshooting
Best practices for warehouse organization
As previously discussed, one of LAD’s primary functions outside the domain of actual
microarray data analysis is the organized warehousing of experiments. LAD has the
capacity to store tens of thousands of experiments for a nearly unlimited number of users
and groups. The ability to achieve this scale of operation, however, requires that LAD
system curators and users adopt and maintain certain operational standards.
Curators have the capacity to encourage the organization of the database environment
through the proper utilization of user accounts, user groups, experimental categories and
subcategories, and microarray prints.
First, it is highly recommended that curators always assign each distinct LAD user an
individual user account. Additionally, it is suggested that users be divided into meaningful
groups as the research environment dictates. The implementation of these two practices
will ensure that a curator is always able to identify specific users who may be misusing or
abusing account privileges and that each user has no greater experiment access privileges
than their research affiliation should allow.
In addition to the organization of users and their organizational groups, the annotation
of experiments is important with respect to the long-term organization of the LAD environment. Experiments should always be assigned meaningful and consistent names,
descriptions, and channel descriptions. Assignment of experiments to proper categories
and subcategories greatly enhances the search and location facilities that are provided for
these experiment descriptors. During the submission of an experiment, slide names and
numbers should always be carefully recorded to allow LAD to act as an inventory control
mechanism for microarray utilization within a research environment.
Finally, print management is a key component of LAD best-practice implementation. The
specific 384-well print plates used, as well as the order in which they are used, are often
duplicated between microarray print runs. This can often make it tempting to utilize the
same print global resource for microarray slides that were not spotted at the same time.
Curators are discouraged from allowing this user action. Significant information can be
mined from LAD with respect to print-induced data bias if and only if the primary data
for a microarray experiment are correctly associated with slides that are truly members
of the same print run.
Suggestions for Further Analysis
Data export to other analysis platforms
The majority of the protocols in this unit are dedicated to LAD’s wide variety of intrinsic
analytical toolsets. These toolsets are expansive and cover a significant breadth of the
typical techniques used in microarray data analysis and quality control. The research
field of DNA microarray analysis, nonetheless, is constantly exploring new data-mining
techniques, visualization tools, and statistical-processing methodologies. How is LAD to
successfully interact with these applications if the experimental data are sequestered in
the relational database?
The export of microarray data is detailed within the steps of the protocols provided. Figure
7.10.12 demonstrates that single experiment data analysis can include the export of data
into flat tab-delimited text file and GFF formats. Additionally, Figure 7.10.14 details the
part of LAD single-experiment manipulation that allows for the retrieval of the original
GenePix Pro GPR file that contributed the primary experimental data.
The multiexperiment analysis protocol (Basic Protocol 5) showed (see Fig. 7.10.26) that
multiexperiment data are exportable from LAD through the use of the PreClustering and
Tab Delimited text files.
In these ways, LAD can be used in concert with new microarray analysis applications
by providing for the retrieval of primary, filtered, or even annotated datasets for offline
analysis and visualization.
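As an illustration of offline use, the sketch below reads a tab-delimited export into a nested dictionary using only the Python standard library. The column layout (an identifier column followed by one column per experiment, with empty cells for missing values) and the sample rows are assumptions made for the example; an actual LAD export may contain additional annotation columns that would need to be skipped.

```python
import csv
import io

# Hypothetical export: first column is the gene identifier, remaining
# columns are log ratios for each experiment; an empty cell is missing data.
sample_export = "NAME\tExp1\tExp2\nYAL001C\t1.25\t-0.40\nYAL002W\t\t0.85\n"

def read_expression_table(handle):
    """Return {gene: {experiment: value or None}} from a tab-delimited handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    experiments = header[1:]
    table = {}
    for row in reader:
        values = [float(v) if v != "" else None for v in row[1:]]
        table[row[0]] = dict(zip(experiments, values))
    return table

table = read_expression_table(io.StringIO(sample_export))
```

For a file saved to disk, the same function can be called on a handle opened with `open(path, newline="")`.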
LITERATURE CITED
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge,
W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V.,
Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo,
J., and Vingron, M. 2001. Minimum information about a microarray experiment (MIAME): Toward
standards for microarray data. Nat. Genet. 29:365-371.
Cherry, J.M., Adler, C., Ball, C., Chervitz, S.A., Dwight, S.S., Hester, E.T., Jia, Y., Juvik, G., Roe, T.,
Schroeder, M., Weng, S., and Botstein, D. 1998. SGD: Saccharomyces Genome Database. Nucl. Acids
Res. 26:73-79.
Dudoit, S., Gentleman, R.C., and Quackenbush, J. 2003. Open source software for the analysis of microarray
data. BioTechniques Suppl:45-51.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide
expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.
Hahn, J.S., Hu, Z., Thiele, D.J., and Iyer, V.R. 2004. Genome-wide analysis of the biology of stress responses
through heat shock transcription factor. Mol. Cell. Biol. 24:5249-5256.
Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. 2001. Genomic binding sites
of the yeast cell-cycle transcription factors SBF and MBF. Nature 409:533-538.
Kerr, M.K., Martin, M., and Churchill, G.A. 2000. Analysis of variance for gene expression microarray data.
J. Comput. Biol. 7:819-837.
Killion, P.J., Sherlock, G., and Iyer, V.R. 2003. The Longhorn Array Database (LAD): An open-source,
MIAME compliant implementation of the Stanford Microarray Database (SMD). BMC Bioinformatics
4:32.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett,
N., Kanin, E., Volkert, T.L., Wilson, C.J., Bell, S.P., and Young, R.A. 2000. Genome-wide location and
function of DNA binding proteins. Science 290:2306-2309.
Sherlock, G., Hernandez-Boussard, T., Kasarskis, A., Binkley, G., Matese, J.C., Dwight, S.S., Kaloper, M.,
Weng, S., Jin, H., Ball, C.A., Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., and Cherry, J.M.
2001. The Stanford Microarray Database. Nucl. Acids Res. 29:152-155.
Yang, Y.H., Dudoit, S., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. 2002. Normalization for cDNA
microarray data: A robust composite method addressing single and multiple slide systematic variation.
Nucl. Acids Res. 30:e15.
Contributed by Patrick J. Killion and Vishwanath R. Iyer
University of Texas at Austin
Austin, Texas
Appendices A, B, and C appear on following pages.
APPENDIX A: USER ACCOUNT CREATION
The default “curator” account is necessary to manage and configure LAD (see Appendix C). This section of the unit will detail the process by which other users can
request an account and the “curator” can approve their request thereby creating the requested account.
Necessary Resources
To perform the procedures described below, one needs a Web browser window at the
LAD front page (see Basic Protocol 1 and Fig. 7.10.2). A terminal window on the LAD
server, logged in as the Linux system user root, is also necessary.
User Creation
User creation within LAD generally happens as a three-step process. First, a prospective
new user navigates to the page accessed by the “register” link on the LAD front page (Fig.
7.10.1) to request an account. Secondly, a system curator creates the account by accessing
the saved information from within LAD. Finally, the server administrator creates a system
account so that the new user will have file system space to upload experimental files.
User self-registration
New users may request an account that a system curator can either approve or deny. By
selecting the register link from the LAD front page (see Fig. 7.10.1 and Basic Protocol 1,
step 1), one will obtain the screen shown in Figure 7.10.36.
Figure 7.10.36
LAD user registration page.
The following fields are required and must be provided by a prospective new user:
First Name
Last Name
Office/Lab Phone
Office/Lab FAX
Email Address
Institution
Project Description
Organism of Study
Laboratory.
Complete the required fields and press Submit to store the prospective user information
for consideration by the system curator.
Navigate to LAD login screen
Select the Login link on the LAD front page (Fig. 7.10.1) to proceed to the authentication
gateway.
Authenticate with LAD
Utilize the LAD “curator” account that was created during the installation phase
(Appendix C). Fill in the login screen (Fig. 7.10.37) with the following information:
User Name: curator
Password: the password assigned to “curator” during installation (Appendix C).
The LAD main menu as pictured in Figure 7.10.38 should now be visible. This view
contains many options that the average LAD user will not see (only accounts with curator
permissions will see full administrative options; compare Fig. 7.10.38 to Fig. 7.10.2).
Figure 7.10.37
Login screen.
Figure 7.10.38
LAD main menu (curator).
Approval of the prospective user
After having logged in as the user “curator,” from the screen pictured in Figure 7.10.38
select the menu item from the left column entitled Users. From the screen which then
appears, a system curator may either create a user or approve a prospective user who
has registered for account consideration. If there is a user record for consideration there
will be a message stating “The following users are still pending entry” with a link to
prospective user records.
Select the link for the user record created in the initial step (see “User self-registration”).
One of two actions may be performed. Pressing the Delete Record button enables the curator to
reject the application, thereby sending an e-mail of account denial to the prospective user.
Pressing the Submit button enables the curator to create the account for the prospective
user. This will result in the new user being sent an e-mail of account approval. Note that
in order to create the account one will need to provide a password for the new user, which
will be included in the e-mail. It is also recommended that one carefully proofread and
correct any other incorrect information the user has provided in the account application.
The approval screen also contains very important information regarding LAD system
permissions the new user will be granted:
Update Gene: will give the new user curator permissions.
Update Print: will allow a user to modify a print record.
Update User: will allow a user to update other user records.
Restricted User: will restrict a user to view-only permissions with respect to experimental
data. This option should be unselected for any user that will be allowed to upload experiments. A local user will typically be unrestricted, that is, they will be allowed to load
experiments, but they will not have permission to update genes, prints, or users. These
administrative functions are best left to a curator who can be responsible for maintaining
consistency.
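As an illustration only, the four permission options can be pictured as independent flags on a user record. The Python sketch below is not LAD code; the class and field names are hypothetical and simply restate the descriptions given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserPermissions:
    """Hypothetical model of the four options on the approval screen."""
    update_gene: bool = False   # gives the user curator permissions
    update_print: bool = False  # may modify print records
    update_user: bool = False   # may update other user records
    restricted: bool = False    # view-only access to experimental data

    def can_upload_experiments(self) -> bool:
        # Per the text, only unrestricted users may load experiments.
        return not self.restricted

    def is_typical_local_user(self) -> bool:
        # Unrestricted, but without gene, print, or user update rights.
        return not (self.restricted or self.update_gene
                    or self.update_print or self.update_user)


local_user = UserPermissions()
viewer = UserPermissions(restricted=True)
```

Here local_user corresponds to the typical unrestricted local user described above, while viewer is limited to view-only access.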
User System Accounts
As previously mentioned, every unrestricted user must be given a system account on the
LAD server.
Creation of system account
LAD includes a script to help with the creation of these accounts. In order to utilize
this script, one will need to have a terminal window open and must use the Linux su
command to become the root user.
The script has the following usage:
/lad/install/addSysAccount.pl [username] [password] [home dirs]
Hypothetically, assuming that one created a user name pkillion with a password
lad4me, one would now execute the command:
/lad/install/addSysAccount.pl pkillion lad4me /home
This example command assumes that user home directories are located under the directory
/home. A new user account and its respective system account have now been created.
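When several system accounts must be created, the same command can be assembled programmatically. The following Python sketch only builds the argument list for the script shown above; the helper name is hypothetical, and actually running the command (as root) is left to the reader.

```python
def add_sys_account_command(username, password, home_root="/home"):
    """Return the argument list for /lad/install/addSysAccount.pl.

    The argument order follows the usage line in the text:
    [username] [password] [home dirs].
    """
    for value in (username, password, home_root):
        if not value:
            raise ValueError("username, password, and home root must be non-empty")
    return ["/lad/install/addSysAccount.pl", username, password, home_root]


command = add_sys_account_command("pkillion", "lad4me")
print(" ".join(command))
```

The returned list could be passed to subprocess.run; building it as a list rather than a single string avoids shell quoting problems with unusual passwords.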
APPENDIX B: GLOBAL RESOURCE CREATION AND MANAGEMENT
LAD requires the creation and management of many global resources that are continually
accessed throughout the system. This unit has already described how users are registered
and approved. Users are just one type of record that defines the entire LAD system of
interacting objects. Other resources such as user groups, organisms, plates, prints, print-tip configurations, experimental categories, and sequence types are but a few of the other
objects that are critical to the organized storage and analysis of microarray data. This
section of the unit details the most important of these global resources and provides
examples of their creation and management.
Necessary Resources
Hardware
Computer capable of running an up-to-date Web browser
Software
Web Browser: Microsoft Internet Explorer version 6.0 or higher or Mozilla version
1.6 or higher
Terminal window to LAD server
Files
Sample files for the creation of yeast sequences, plates, and print. Each of these
files is in the directory: /lad/install/yeast_example_print/
Sample SUID creation file [yeast_pre_suids.txt]
Sample plate creation file [yeast_retrieve_suids_into_plate.txt]
Experiment Group Management
In addition to the numerical analysis and visualization of microarray data, LAD serves as
a very powerful environment for the long-term warehousing and organization of experimental results. This may not seem like a very important function when one first begins
using a microarray database. However, as the total number of database users begins to
grow, the frequency of experiment submission increases, experiment names and details
start to become uninformative and inconsistent, and subsequent disorganization of results
begins to set in.
This appendix presents guidelines on how to locate one’s experiments, organize them effectively, and set up the best practices for the long-term organization of
LAD as a functional warehouse of data.
Users and Groups
Users and Groups can be considered examples of these global resources. It has already
been described how users are created and utilized through the creation of the noncurator pkillion account. The following will demonstrate the mechanism by which new
groups are created and how users can be assigned to and removed from experimental groups.
Experiment group creation
From the LAD main menu as depicted in Figure 7.10.38, select the Experiment Groups
link. The screen depicted in Figure 7.10.39 will appear. On this screen enter Iyer Lab
in the Group Name field. Enter Members of the Iyer Lab, University of
Texas at Austin in the Description field. Press the Submit button to create this new
experimental group. Select the Longhorn Array Database link at the top of the screen
which then appears to return to the LAD main menu.
List users assigned to experiment group
From the LAD main menu as depicted in Figure 7.10.38, select the User Group link. It
will be seen that LAD has four experiment groups defined:
Figure 7.10.39
Experiment group creation entry form.
Default Group
HS_CURATOR
SC_CURATOR
MM_CURATOR
The first group is the default group that is created for noncurator users by the LAD
installation. The other groups were created by the LAD installation when their respective
organisms were created. The addition of organisms to the database will create subsequent
experiment groups for curators of those organisms. A noncurator user would normally
not be assigned to one of these experiment groups. From the LAD main menu as depicted
in Figure 7.10.38 select the Default Group link. One should see that this group has two
members, curator and pkillion. Select the Longhorn Array Database link at the
top of the screen to return to the LAD main menu.
Consequences of group membership
What consequence does group membership have throughout the LAD system? Basically,
group membership controls which submitted experiments can be seen by a user.
The basic rules are:
a. All users that are comembers of one’s default group will be able to see one’s
experiments unless one specifically removes that permission.
b. Membership in additional groups (above and beyond one’s default group) will
allow one to see the experiments that have been submitted by users whose default
group is that additional group.
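The two rules can be restated as a small visibility check. The Python sketch below is a simplified model and not LAD code; the function and argument names are hypothetical, and it assumes that a user always sees their own experiments and that withdrawing group access hides an experiment from all other group members.

```python
def can_see_experiments(viewer, owner, default_group, additional_groups,
                        group_access_removed=False):
    """Simplified model of rules (a) and (b).

    default_group:        dict mapping user -> default group name
    additional_groups:    dict mapping user -> set of extra group names
    group_access_removed: True if the owner has withdrawn group access
    """
    if viewer == owner:
        return True
    if group_access_removed:
        return False
    viewer_groups = {default_group[viewer]} | additional_groups.get(viewer, set())
    # In both rules, visibility flows through the owner's default group.
    return default_group[owner] in viewer_groups


default_group = {"curator": "Default Group", "pkillion": "Iyer Lab"}
additional_groups = {"pkillion": {"Default Group"}}
```

With these example assignments, pkillion can see experiments submitted by curator (rule b), but curator cannot see experiments submitted by pkillion, because curator is not a member of the Iyer Lab group.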
User assignment to a default group
A user’s default group can only be modified through their user profile. The following
procedures are used to modify the default group for the pkillion user.
From the LAD main menu as depicted in Figure 7.10.38 select the Users link. To the left
of the pkillion user record, select the Edit icon. A screen appears that will allow one
to edit the pkillion user. Change the Default Group to a value of Iyer Lab. Press the
Submit button to change the default group. Select the Longhorn Array Database link at
the top of the screen to return to the LAD main menu.
User assignment to additional experiment groups
As previously detailed, there may be times that one wishes to assign users to groups above
and beyond their default group. From the LAD main menu as depicted in Figure 7.10.38
select the Users into Experiment Groups link. Select the UserID “pkillion” and select the
Group Name Default Group. Press the Submit button to add the additional group. Select
the Longhorn Array Database link at the top of the screen to return to the LAD main
menu.
User deletion from additional experiment groups
Additionally, there may be times that one wishes to remove users from their additional
groups. From the LAD main menu as depicted in Figure 7.10.38 select the Users from
Experiment Groups link. Select the UserID “pkillion.” Press the Process to Group Selection button to see a list of groups of which this user is a member. Select the Group
Name Default Group. Press the Submit button to remove the additional group. Select the
Longhorn Array Database link at the top of the screen to return to the LAD main menu.
Figure 7.10.40
LAD global resources links from the LAD main menu (also see Fig. 7.10.38).
Complete user deletion from a specific group
Finally, there may be times that one wishes to remove all users from an additional group.
From the LAD main menu as depicted in Figure 7.10.38 select the Remove All
Users From An Experiment Group link. Select the Group Name Iyer Lab. Press the
Submit button to remove all users from the Iyer Lab group. Select the Longhorn
Array Database link at the top of the screen to return to the LAD main menu.
Preconfigured Global Resources
Many resources are preconfigured by the default relational database schema that is instantiated during LAD installation. Figure 7.10.40 shows a detail view of the links for
these resources (which are accessible through the LAD main menu in Figure 7.10.38). By
navigating to a few of these resources, the reader will become familiar with the elements
that define the overall LAD environment.
In order to complete the following steps, one will need a Web browser window logged
into LAD as “curator,” at the main menu (as depicted in Fig. 7.10.38) and a terminal
window on the LAD server, logged in as the Linux system user “curator.”
Organism List
Select the Organism List link from the LAD main menu (Fig. 7.10.38). Organisms
are generally associated with nearly everything in the LAD environment. Experiments,
plates, prints, and sequences are just a few examples. One can see that LAD comes
preloaded with Saccharomyces cerevisiae, Mus musculus, and Homo sapiens as default
organisms.
Sequence Types
Click the browser back button to return to the main menu. Select the Sequence Types
link. Sequences are a key element with respect to data storage and analysis. In order
to maintain annotation information for gene sequences on microarrays separately from
the experimental values for spots, sequences are abstracted within LAD. This will be
covered in much more detail under Sequence Creation, below. For now, it is important to
recognize that LAD comes predefined to support a variety of sequence types, this being
one of the variables that describes an individual sequence within the database.
Tip Configurations
Return to the main menu using the Back button and select the Tip Configurations link. A
tip configuration is one of the subcomponents that help define a print and its respective
geometry. Note that LAD comes predefined with support for 16-, 32-, and 48-pin-tip
configurations.
Categories
Return to the main menu and select the Categories link. Categories are utilized as a
descriptor to help organize experiments. They can be utilized as search keys in many of the
data-analysis pipelines. As with all of these global resources that are being demonstrated,
LAD system curators can freely add new categories to the database as desired.
Print Creation
Before experimental data can be uploaded into LAD a print must be defined for the
microarray chip that was used. The first question that one may ask is “What is a print?” A
print, in essence, is an abstract declaration of everything that went into constructing the
actual microarray chip that was utilized in an experiment. For most microarray chips this
means, first, that DNA of some type was arrayed into 384-well printing plates. Each well
of each of these plates most likely contains a unique sequence. These plates were serially
used, in some order, by the microarray robot to print some number, perhaps hundreds, of
duplicate slides in a unique geometry that depends upon the tip configuration utilized, as
well as other variables controlled by the robot software.
The following discussion will explore the LAD resources that are available to define each
step of the process described above. This will be done using sample files included with
LAD that aid in the creation of a sample print for a set of Saccharomyces cerevisiae
microarray chips.
SUID creation
The first step in defining a print is to define all of the sequences that will be utilized in
the print. Sequences within LAD are typically referred to as SUIDs. Each sequence that
is uploaded gets a unique name and, once uploaded, need never be defined or uploaded
again.
Uploading SUID file to server
Using the terminal window copy the file:
/lad/install/yeast_example_print/yeast_pre_suids.txt
to the location:
/home/curator/ORA-OUT/
This may be done with the following command:
cp -f /lad/install/yeast_example_print/yeast_pre_suids.txt /home/curator/ORA-OUT/
It is highly recommended that one inspect this file with a text editor at a later date in order
to learn about its overall structure and content.
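The same staging step can be scripted if files are uploaded regularly. The following is a minimal Python sketch, assuming the upload directory already exists; the helper name stage_for_upload is hypothetical and not part of LAD.

```python
import shutil
from pathlib import Path


def stage_for_upload(source, upload_dir):
    """Copy source into upload_dir, overwriting any existing copy."""
    source = Path(source)
    upload_dir = Path(upload_dir)
    if not upload_dir.is_dir():
        raise FileNotFoundError(f"upload directory not found: {upload_dir}")
    # shutil.copy returns the path of the newly written file.
    return Path(shutil.copy(source, upload_dir))


# Example call (paths taken from the text):
# stage_for_upload("/lad/install/yeast_example_print/yeast_pre_suids.txt",
#                  "/home/curator/ORA-OUT")
```

Checking for the directory first avoids silently creating a file named ORA-OUT when the directory is missing.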
Uploading SUIDs to the database
In the LAD main menu (Fig. 7.10.38) page select the “Sequence IDs (SUIDs)(by file)”
link. Select Saccharomyces cerevisiae from the Organism pull-down menu. Edit the text
field at the bottom of the screen to the value:
/home/curator/ORA-OUT/yeast_pre_suids.txt
Select the Assign SUIDs button to initiate the creation of sequences within the database.
This process may take a few minutes to complete.
SUID extraction to plate creation file
In this appendix, a list of sequences has been submitted for storage to the database. In
order to model a microarray print, it is now necessary to be able to store a set of plates containing these sequences. There is one intermediate step that must be performed, however.
When the sequences were uploaded to the database, they received a unique identification
that is independent of the name given to each of them. In order to upload plates, it is
first necessary to take the file that contains the plate information and have the database
associate the sequence names in it to the independent keys it now has for each of these
sequences.
Plate file preparation
Using the terminal window copy the file:
/lad/install/yeast_example_print/yeast_retrieve_suids_into_plate.txt
to the location:
/home/curator/ORA-OUT/
This may be done with the following command:
cp -f /lad/install/yeast_example_print/yeast_retrieve_suids_into_plate.txt /home/curator/ORA-OUT/
It is highly recommended that one inspect this file with a text editor at a later date in order
to learn about its overall structure and content.
Figure 7.10.41 displays the links that will now be visited in succession for the
purpose of creating a print; these links are selected from the LAD main menu
(Fig. 7.10.38).
Figure 7.10.41
LAD print creation links from the LAD main menu (also see Fig. 7.10.38).
SUID extraction into plate file
Select the Get SUIDs for Name in Plate File link on the LAD main menu page (Figure
7.10.38). Select Saccharomyces cerevisiae from the organism pull-down menu. Edit the
text field at the bottom of the screen to the value:
/home/curator/ORA-OUT/yeast_retrieve_suids_into_plate.txt
Select the Get SUIDs button to initiate the retrieval of the SUIDs for the sequence names in the plate file.
This process may take a few minutes to complete.
The following file will be created:
/home/curator/ORA-OUT/yeast_retrieve_suids_into_plate.txt.SUID.oral
This new file contains the same information as the file that was submitted to the process,
but with one new column of information—the database unique identifier that defines
each of the sequences. This file will now be utilized directly in the next step of the print
creation process.
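Conceptually, this step looks up each sequence name and appends its database identifier as a new last column. The Python sketch below illustrates that idea only. It assumes a tab-delimited file with the sequence name in the first column, which should be confirmed by inspecting the sample file; the rows and identifiers shown are made up.

```python
def append_suid_column(lines, suid_by_name, name_column=0):
    """Return tab-delimited rows with the SUID appended as a last column."""
    annotated = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        name = fields[name_column]
        if name not in suid_by_name:
            raise KeyError(f"no SUID stored for sequence name {name!r}")
        annotated.append("\t".join(fields + [str(suid_by_name[name])]))
    return annotated


rows = ["YAL001C\t1\tA1", "YAL002W\t1\tA2"]  # made-up example rows
suids = {"YAL001C": 101, "YAL002W": 102}      # made-up identifiers
annotated_rows = append_suid_column(rows, suids)
```

A name with no stored identifier raises an error in this sketch, mirroring the requirement that every sequence be uploaded before plates that reference it are created.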
Submission of plate creation file
A plate file is now available that is complete for database submission. Next, it is necessary
to create plates within the database. These plates are virtual representations of the 384-well plates utilized by the microarray robot. Once entered into the database they can be
continually reused to create future prints.
Plate file submission
Select the “Enter Plates (by file)” link on the LAD main menu page. Select Saccharomyces
cerevisiae from the Organism pull-down menu. Edit the text field at the top of the screen to
the value:
/home/curator/ORA-OUT/yeast_retrieve_suids_into_plate.txt.SUID.oral
Set Plate Name Prefix to the value SC. Set Project to the value Yeast and set Plate
Source to the value Refrigerator One. The rest of the fields can be left as their default values. Select the Enter Plates button to begin the process of plate creation. It should
be noted that this is a CPU-intensive process. The database and the LAD application code
exercise many operations to check for the integrity of the plate information that has been
provided. The overall process may take some time to complete.
Extraction of plates
Plates have now been submitted to and created within LAD. One can now extract these
plates from LAD in the form of a plates file that can be submitted for the creation of a
print. The question may be asked “Why is it necessary to extract plates from the database
to submit a print for creation?” The answer is simply that every print performed with a
microarray robot may utilize the same plate set. Through design, omission, or mistake,
however, the order in which plates are used by the microarray robot may change from
print to print. For this reason LAD provides the ability to extract plates from the database
to a text file that may or may not be reorganized before print creation.
Plate file creation
Go to the Select Plates link in the LAD main menu (Fig. 7.10.38). Select Saccharomyces
cerevisiae from the Organism pull-down menu. Select SC from the next column. Edit the
file name to:
/home/curator/ORA-OUT/sc_plates.txt
Press the Select Plates button in order to extract the plates into the named text file.
Creation of a print
It is now finally possible to create the print within LAD. If it were desirable to
change the order of the plates, one would simply edit the file /home/curator/ORA-OUT/sc_plates.txt that was produced under Plate File Creation, above. That will
not be done for this example, however.
Select the Enter Print and Spotlist link on the LAD main menu page
Select Saccharomyces cerevisiae from the Organism pull-down menu
Edit the File Name text field to a value of
/home/curator/ORA-OUT/sc_plates.txt
Set the Printer pull-down to a value of Default Printer
Set the Print Name text field to a value of SC1
Set the Tip Configuration drop-down to a value of 32 Tip
Set the Number of Slides text field to a value of 400
Set the Columns per Sector text field to a value of 24
Set the Rows per Sector text field to a value of 19
Set the Column Spacing text field to a value of 175
Set the Row Spacing text field to a value of 175
Set the Description text area to a value of Sample Print.
Also, be sure to select the option Rotate Plates, which is a flag that informs LAD that
the 384-well plate was parallel to the slides on the platter during printing. Select the
Enter Print button to begin the creation of the print. It should be noted that this is a CPU-intensive process. The database and the LAD application code exercise many operations
to check for the integrity of the print information that has been provided. The overall
process may take some time to complete.
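The geometry values entered above determine how many spot positions each slide offers. As a rough check, the Python sketch below multiplies them out, assuming that each tip prints exactly one sector and that every spot is drawn from its own well; both assumptions and the function names are illustrative, not taken from LAD.

```python
def spots_per_slide(tips, columns_per_sector, rows_per_sector):
    """Number of spot positions on one slide, assuming one sector per tip."""
    return tips * columns_per_sector * rows_per_sector


def plates_needed(spots, wells_per_plate=384):
    """Whole plates required if every spot is drawn from its own well."""
    return -(-spots // wells_per_plate)  # ceiling division


spots = spots_per_slide(tips=32, columns_per_sector=24, rows_per_sector=19)
print(spots, plates_needed(spots))  # prints: 14592 38
```

Under these assumptions, the SC1 settings describe 14,592 spot positions per slide, which corresponds to 38 full 384-well plates. Whether the sample plate set fills every position is not stated here.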
APPENDIX C: INSTALLATION OF THE LONGHORN ARRAY DATABASE
LAD is designed and intended to run on an Intel-compatible PC running a modern Linux
distribution. The authors of this unit have tested and developed LAD on Mandrake Linux
versions 8.0, 8.2, 9.0, 9.1, and 9.2. They have also utilized RedHat Linux versions 7.2
and 8.0 in previous test installations.
This section will include literal and detailed instructions based on the Mandrake 9.1
platform. Please adapt for the chosen Linux distribution. Please note that RPM version
numbers presented in this section will vary greatly between Linux distributions. Always
be sure to simply utilize the most recent version of any package available from either the
installation or update mechanism of a particular Linux distribution.
Installation Hardware Overview
LAD was conceptualized and designed to be a cost-efficient yet immensely scalable
microarray analysis platform and data warehouse. As with all server resources deployed
for the purpose of community operation, care must be taken to deploy LAD on appropriate
hardware for its intended and projected utilization. Dedication of insufficient hardware
relative to the average load will lead to insufficient performance and parallel user capacity.
In general there are five primary variables at play when designing a hardware infrastructure for LAD:
Performance level and number of CPUs
System RAM quantity
System hard-disk capacity
System hard-disk performance
Overall Linux compatibility.
First, the number of CPUs and their speed will primarily impact the multiuser experience
of the database system. The average single user will not be able to detect the difference
between a single low-end Intel Celeron processor and dual or quad high-end Pentium IV
processor system. Conversely, ten simultaneous users will dramatically experience the
difference between these two choices.
The total amount of system RAM follows this pattern as well. A single user will most likely
not benefit from high levels of RAM. Multiple users, launching multiple parallel database
and analytical processes, will definitely utilize increasing amounts of dedicated RAM.
There is a caveat to this generalization, however. Relational databases like PostgreSQL
typically attempt to cache substantial amounts of their data to system RAM in order
to increase performance for queries that are consistently accessing frequently utilized
tables and records. Therefore, the database system will operate with increasing levels of
performance relative to amount of system RAM it is able to utilize as dynamic cache.
The Linux system administrator will want to be sure to clearly profile the relational
database performance to determine if it could benefit from an increase in the available
RAM regardless of the size of the simultaneous user base.
Hard disk capacity is directly relevant to the number of experiments that will be loaded
into the system. Be sure to plan for future growth of the user base in this regard. Larger
disk capacity will not increase performance for small or large user bases. On the other
hand, the technology that is utilized in the hard disk subsystem can have a dramatic effect
upon overall system performance and throughput. For example, a single large-capacity
IDE drive combined with powerful dual processors and a relatively large amount of
system RAM will most likely create a system that is often described as I/O (input/output)
bound. The total system performance will be constrained by this weak link (the IDE disk)
in the overall chain of system interactivity. A higher-throughput I/O system, e.g., a RAID
(Redundant Array of Inexpensive Disks)–based system, which combines the power and
capacity of multiple drives in parallel, will ensure that the hard disk I/O is not the limiting
variable in overall system performance.
The final consideration in the design of a system to be dedicated to LAD operation is
Linux compatibility. LAD is documented to support many distributions of Linux (RedHat,
Mandrake, Gentoo). Care should be taken to be sure to match a distribution that will
run LAD with a distribution that includes proper hardware support for one’s chosen
technology infrastructure. A specific product, such as an internal RAID controller, may
require experimental drivers to operate within the Linux environment. Other vendors and
hardware providers, however, may provide a more tested and production-worthy version
of this same type of RAID controller that has been documented to interoperate with
standard kernels and nonexperimental drivers. One will want to be sure to choose options
that are verified in production level Linux deployments as opposed to newer, untested,
and experimental hardware.
Given these variables, the Necessary Resources section below will detail three options
for LAD deployment. Consider these options to be generic representations of the much
wider spectrum of choices in designing a system that is appropriate for one’s long term
needs.
Installation and Administration Skill Set
Installation and administration of LAD does not require a high degree of technical skill
or knowledge. It does, however, require a basic administrative knowledge of Linux,
PostgreSQL, and general system maintenance functions. Users should be sure that they
are comfortable with Apache, PostgreSQL, and general Linux file and system operations
before attempting to install LAD. If one is not confident that one has this skill set, one
will most likely not be comfortable with the operations that are required to maintain LAD
over a longer term.
Necessary Resources
Hardware
Low-Range Performance:
Intel Pentium III ≥500 MHz, or Linux-compatible equivalent thereof
≥20 GB disk space available
500 MB RAM
Mid-Range Performance:
Intel Pentium IV ≥1.0 GHz, or Linux-compatible equivalent thereof
≥300 GB disk space available
5 GB RAM
High-Range Performance:
Intel Pentium IV Multi-CPU ≥2.5 GHz, or Linux-compatible equivalent thereof
High-performance motherboard and system chipset
≥1.0 TB RAID-managed disk space available
20 GB RAM
Software
Mandrake Linux 9.1 Installation CDs
Web Browser:
Microsoft Internet Explorer version 6.0 or higher
Mozilla version 1.6 or higher
Files
This document is written specifically for the release of LAD version 1.4, which is
called for within this protocol and can be downloaded from the LAD Web site.
Fulfillment of Server Prerequisites
Follow the installation instructions provided by Mandrake to begin the deployment of
Mandrake Linux 9.1. Special care should be taken during initial partitioning of the file
system to create a structure that allows for the future growth of the relational database
PostgreSQL as well as the LAD experimental data file archive.
Installation initiation
Complete all steps of the Mandrake Linux installation up to the selection of Software
Groups. Mandrake Linux installation allows one to select groups of software to be installed, thus greatly simplifying the installation of commonly utilized package sets. Select
the following groups for full installation:
Internet Station
Network Computer (client)
Configuration
Console Tools
Development
Documentation
Mail
Database
Firewall/Router
Network Computer Server
KDE.
Group selection
In addition to these complete groups, one will need to select individual packages for
installation. For this reason, be sure to select the checkbox that calls for individual package
selection. With the groups selected, proceed to the next step.
Apache RPM packages
Note that the Web group was not installed during selection of group packages for installation. This is because, currently, LAD does not support Apache version 2.x for operation.
For this reason Apache 1.x must be installed manually from the appropriate RPM packages.
The Apache 1.x packages are:
apache-1.3.27-8mdk
apache-devel-1.3.27-8mdk
apache-modules-1.3.27-8mdk
apache-mod_perl-1.3.27_1.27-7mdk
apache-source-1.3.27-8mdk
perl-Apache-Session-1.54-4mdk.
Apache package selection
Find and select each of these packages. Do not yet begin installation of the selected
groups or packages.
PostgreSQL RPM packages
PostgreSQL is the open-source relational database system (RDBMS) that LAD utilizes
for its data persistence and organization. Most of the required PostgreSQL packages are
installed by selecting the Database group during installation. Be sure that the following RPM packages are installed before attempting to configure PostgreSQL for LAD
operation.
Required PostgreSQL packages:
postgresql-7.3.2-5mdk
postgresql-server-7.3.2-5mdk
perl-DBD-Pg-1.21-2mdk
pgaccess-0.92.8.20030117-2mdk
PostgreSQL package selection
Find and select each of these packages. Do not yet begin installation of the selected
groups or packages.
Graphics libraries: LAD utilizes several types of graphic file formats for its intrinsic
image manipulation functionalities. Additionally, several graphic libraries are utilized to
build LAD binary applications from source code during the installation process:
freetype-1.3.1-18mdk
freetype-devel-1.3.1-18mdk
freetype-tools-1.3.1-18mdk
freetype2-2.1.3-12mdk
freetype2-devel-2.1.3-12mdk
freetype-static-devel-2.1.3-12mdk
freetype2-tools-2.1.3-11mdk
libxpm4-3.4k-23mdk
libxpm4-devel-3.4k-23mdk
libpng3-1.2.5-2mdk
libpng3-devel-1.2.5-2mdk
libnetpbm9-9.24-4.1mdk
libnetpbm9-devel-9.24-4.1mdk
libnetpbm9-static-devel-9.24-4.1mdk
netpbm-9.24-4.1mdk
ImageMagick-5.5.4.4-7mdk.
Image library and application selection: Be sure to install all of the listed applications
and libraries.
Finish installation of Mandrake Linux
With all prerequisite groups and packages selected, installation of Mandrake Linux may
now be completed.
Server Configuration
It is assumed that at this stage of the protocol the server has been fully installed with
Mandrake Linux and cleanly rebooted into normal operation.
PostgreSQL configuration
For efficient and acceptable performance LAD requires a few PostgreSQL configuration
issues be attended to. By default, PostgreSQL is not configured to store or manage larger-than-average datasets. By their very nature, microarray data warehouses are larger than
the average dataset. The typical LAD deployment is no exception in this matter.
Here are a few simple changes that can be made to a PostgreSQL configuration file to
ensure proper performance. These values will differ on machines with differing amounts
of total system RAM. Values have been included that are appropriate for a machine with
∼1 GB RAM.
Configuration of [postgresql.conf]: Add/Edit the following values in
[/var/lib/pgsql/data/postgresql.conf]:
tcpip_socket = true
port = 5432
shared_buffers = 100000
sort_mem = 20000
vacuum_mem = 20000
These values will most likely not be accepted by the Linux kernel as valid, because most
systems have a low shmmax by default. For this reason, configure [postgresql]
by adding the following line (after the line that says something like PGVERSION=7.3) to
[/etc/init.d/postgresql]:
echo "1999999999" > /proc/sys/kernel/shmmax
Be sure to restart the PostgreSQL server by executing (as root):
/etc/init.d/postgresql stop
/etc/init.d/postgresql start
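To see why the kernel shared-memory limit matters, it helps to estimate how much memory the shared_buffers setting requests. The Python sketch below assumes the default PostgreSQL block size of 8192 bytes per buffer and ignores the additional overhead that the server adds on top of the buffers themselves; the function name is illustrative.

```python
BLOCK_SIZE = 8192  # bytes per shared buffer in a default PostgreSQL build


def shared_buffer_bytes(shared_buffers, block_size=BLOCK_SIZE):
    """Approximate memory claimed by shared_buffers (excludes overhead)."""
    return shared_buffers * block_size


needed = shared_buffer_bytes(100000)
shmmax = 1999999999
print(needed, needed <= shmmax)  # prints: 819200000 True
```

With shared_buffers = 100000, the buffers alone occupy 819,200,000 bytes (about 781 MB), which fits under a shmmax of 1999999999 bytes. On machines with different amounts of RAM, the same calculation can be used to pick a consistent pair of values.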
Configuration of [.pgpass]: LAD’s Web operations will operate in either a “trust” or
password-protected PostgreSQL environment. Scripts found in /lad/custom-bin/
and /lad/mad/bin/, however, may not operate in a password-protected PostgreSQL
environment. It is recommended that one create a .pgpass for any user who has
either manual or cron-related jobs that execute these programs. See the PostgreSQL
documentation on this issue for more details.
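A .pgpass file holds one entry per line in the form hostname:port:database:username:password, and should be accessible only by its owner. The following Python sketch writes such a file; the database name and credentials in the example are placeholders, and the PostgreSQL documentation for the installed version remains the authority on the exact rules.

```python
import os


def pgpass_line(host, port, database, user, password):
    """Format one .pgpass entry: hostname:port:database:username:password."""
    def escape(value):
        # Backslashes and colons inside a field are backslash-escaped.
        return str(value).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(escape(v) for v in (host, port, database, user, password))


def write_pgpass(path, lines):
    """Write the file so that only its owner can read or write it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(path, 0o600)  # also tighten a pre-existing file


# Placeholder values; substitute the real database name and password.
entry = pgpass_line("localhost", 5432, "lad", "curator", "s:ecret")
```

The file would normally be written to the home directory of the user whose manual or cron jobs run the scripts, for example /home/curator/.pgpass.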
Apache configuration
The default configuration of the Apache Web server needs to be altered slightly for
appropriate LAD operation.
Configuration of [common-httpd.conf] (options): One should also edit the following entries in one’s HTTP configuration file [/etc/httpd/conf/common-httpd.conf]:
<Directory /var/www/html>
Options -Indexes FollowSymLinks MultiViews Includes
</Directory>
<Directory />
Options -Indexes FollowSymLinks MultiViews Includes
</Directory>
Many operations in LAD can take a significant amount of time to complete. The Apache
Web server tends to come preconfigured to allow operations to run for a maximum time
of 5 min. For this reason, one will want to edit the Apache configuration to allow for the
successful completion of longer processes.
Configuration of [common-httpd.conf] (timeout): The overall browser timeout
should be adjusted with the HTTP configuration file [/etc/httpd/conf/commonhttpd.conf]. The value Timeout 300 should be edited to a value of Timeout 100000.
Be sure to restart the Apache Web server by executing (as root):
/etc/init.d/httpd stop
/etc/init.d/httpd start
Configuration of file system: The Apache Web server installation must be altered for
LAD operation. LAD must own the default html and cgi-bin directories in order
for it to function properly. Execute the following commands to allow LAD to own these
directories:
mv /var/www/html /var/www/html_pre_lad
mv /var/www/cgi-bin /var/www/cgi-bin_pre_lad
ln -s /lad/www-data/html /var/www/html
ln -s /lad/www-data/cgi-bin /var/www/cgi-bin
LAD Installation
With PostgreSQL and Apache properly configured one is now ready to begin the installation of LAD.
PostgreSQL connectivity diagnostic test
Be sure that the root user has the ability to create a PostgreSQL database. The following
commands can be used as a diagnostic for this goal:
createdb test_lad
psql test_lad
dropdb test_lad
If these tests fail, one will need to perform the following commands to allow root to
interact with the PostgreSQL server:
su postgres
createuser --adduser --createdb root
exit
Now you should again attempt the PostgreSQL test commands as listed above.
LAD installation file download
The installation file [lad_1.4.tar.gz] is available at the LAD distribution site
http://www.longhornarraydatabase.org. Using the Web browser, download the file to
the local computer and save it to some directory that will be accessible to the root
user.
Installation invocation
To start the installation, it is first necessary to open a command terminal. Now, through
the Linux su command, become the root user. Finally, switch to the directory in which
the LAD installation file was placed. The archived contents of this file will now be
decompressed and expanded.
Execute as the root user:
gzip -d lad_1.4.tar.gz
tar -xvf lad_1.4.tar
cd lad/install/
./install.pl
The menu in Figure 7.10.42 should now be seen within the terminal window.
Figure 7.10.42
Terminal display of LAD installation program.
The LAD installation program is directed by a simple terminal window program that
allows one to step through a list of sequential operations. If problems are encountered,
a step may be repeated, but only after its prerequisite steps have been re-executed. Each
step of the installation process will be detailed in the following paragraphs.
PostgreSQL configuration
Type 1 and press the Enter key to display the PostgreSQL configuration information.
This step is simply a reminder to perform the PostgreSQL custom configuration that was
previously discussed in this section. When complete, press the Enter key again to return
to the main installation menu.
Linux prerequisites
Type 2 and press the Enter key to execute a check for Linux system prerequisites. This
check is most appropriate for Mandrake Linux, version 9.1. When complete, press the
Enter key to return to the main installation menu.
Figure 7.10.43
Example /etc/crontab file with a suggested schedule.
Install LAD files
Type 3 and press the Enter key to install the LAD file tree to /lad/. Previous installations
will be removed if they are detected. Be sure to make backup copies of these files,
especially the LAD file archive /lad/mad/archive/, if performing an upgrade.
When complete, press the Enter key again to return to the main installation menu.
Configure LAD server
Type 4 and press the Enter key to configure the LAD server. A short list of questions will
be asked that are important with respect to system configuration. Default values are given
within brackets. New values may be simply typed. Press the Enter key after each question.
When complete, press the Enter key again to return to the main installation menu.
Configure LAD database
Type 5 and press the Enter key to create the PostgreSQL instance of the LAD relational
database. Previous installations will be removed if they are detected. When complete,
press the Enter key to return to the main installation menu.
Setup CURATOR account
LAD requires that each unrestricted user (users that have permission to upload experiments) have a system account with a user directory visible by the LAD server. Be sure
to save the password provided for this “curator” account. Type 6 and press the Enter key
to create a system account for the default LAD curator. When complete, press the Enter
key to return to the main installation menu.
Build LAD binaries
LAD has several applications that must be compiled from source code before it will
successfully operate. Type 7 and press the Enter key to build these applications. The screen
will show the applications being compiled. Keep in mind that warnings are expected but
error conditions should be addressed. The most common cause of error is unfulfilled
prerequisite libraries. When complete, press the Enter key to return to the main installation
menu.
Install GD.pm
LAD utilizes a specially modified version of the Perl module GD for many of its image
manipulations. Type 8 and press the Enter key to build and install this version of Perl
module GD. When complete, press the Enter key to return to the main installation menu.
HTTP configuration
The HTTP configuration screen is simply a reminder that the Apache Web server installation must be modified after LAD installation, using the custom configuration that was previously discussed in
this section. Type 9 and press the Enter key to read these instructions. When complete,
press the Enter key to return to the main installation menu.
Scheduled Tasks
LAD has several system scripts that should be automatically run on a regular schedule.
One will probably wish to utilize Linux cron to manage the execution of these scripts.
Figure 7.10.43 shows an example /etc/crontab file with a suggested schedule.
Gene Expression Analysis via
Multidimensional Scaling
UNIT 7.11
A first step in studying the gene expression profiles derived from a collection of experiments is to visualize the characteristics of each sample in order to gain an impression as
to the similarities and differences between samples. A standard measure of sample similarity is the Pearson correlation coefficient (see Commentary). While popular clustering
algorithms may imply the intrinsic relationship between genes and samples, a visual
appreciation of this relationship is not only essential for hypothesis generation (Khan
et al., 1998; Bittner et al., 2000; Hedenfalk et al., 2001), but is also a powerful tool for communicating the utility of a microarray experiment. Visualization of similarities between
samples provides an alternative to the hierarchical clustering algorithm. The multidimensional scaling (MDS) technique is one of the methods that convert the structure in the
similarity matrix to a simple geometrical picture: the larger the dissimilarity between
two samples (evaluated through gene expression profiling), the further apart the points
representing the experiments in the picture should be (Green and Rao, 1972; Schiffman
et al., 1981; Young, 1987; Green et al., 1989; Borg and Groenen, 1997; Cox and Cox,
2000).
The installation and implementation of the MDS method for gene expression analysis
is described in the Basic Protocol, which explains how to obtain a set of MATLAB
scripts and includes step-by-step procedures to enable users to quickly obtain results. To
demonstrate the functions of the MDS program, a publicly available data set is used to
generate a set of MDS plots; the interpretation of the plots is described in Guidelines for
Understanding Results. The mathematical fundamentals of the method are described in
the Commentary section. A diagram of the expression data flow described in this unit is
shown in Figure 7.11.1. MDS plots enable statisticians to communicate with biologists
with ease, which, in turn, helps biologists form new hypotheses.
Figure 7.11.1
Flow chart illustrating the data flow of the MDS program.
Contributed by Yidong Chen and Paul S. Meltzer
Current Protocols in Bioinformatics (2005) 7.11.1-7.11.9
Copyright © 2005 by John Wiley & Sons, Inc.
BASIC
PROTOCOL
USING THE MDS METHOD FOR GENE EXPRESSION ANALYSIS
The following protocol describes how to install and execute the MDS program. A publicly
available and downloadable data set is used to illustrate each step and the interpretation
of the output generated by the MDS program.
Necessary Resources
Hardware
The distributed program was tested under MATLAB v. 6.2 on a PowerPC Macintosh
computer running Mac OS X v. 10.2. Support for additional machines may be
included and will be specified in a README file provided with the distribution.
Software
MATLAB implementation of MDS, along with a data set from Bittner et al. (2000), provided
for distribution at http://research.nhgri.nih.gov/microarray/MDS Supplement/ (note that
the URL is case-sensitive).
MATLAB license, purchased from Mathworks Inc. (http://www.mathworks.com).
The Statistics Toolbox for MATLAB is required for this implementation.
The newer version of the Statistics Toolbox has its own implementation of classical
MDS:
[y, e] = cmdscale(D)
where D is the input n×n distance matrix, y contains the coordinates of the n points in
p-dimensional space (p<n), and e is the vector of eigenvalues from which one can
determine the best reduced dimension of the system. However, the result
of this function does not necessarily minimize the stress defined by
Equation 7.11.6 (see Background Information).
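How the eigenvalue vector e might guide the choice of reduced dimension can be sketched as below. The criterion is an assumption, not something this unit prescribes: keep the smallest number of leading positive eigenvalues whose sum reaches a chosen fraction of their total. The sketch is written in Python rather than MATLAB, and the function name and the 0.8 default are hypothetical.

```python
def choose_dimension(eigenvalues, fraction=0.8):
    """Smallest p whose leading positive eigenvalues reach `fraction` of their total.

    Heuristic only (an assumption, not the unit's rule); negative
    eigenvalues are ignored.
    """
    positive = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        return 0
    running = 0.0
    for p, value in enumerate(positive, start=1):
        running += value
        if running / total >= fraction:
            return p
    return len(positive)


# Illustrative eigenvalues, not taken from any data set.
print(choose_dimension([6.0, 3.0, 0.5, 0.5, -0.1]))
```

In this illustration the first two eigenvalues account for 90% of the positive total, so two dimensions would be retained.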
Files
Data matrix: The user can use either a ratio matrix or an intensity matrix. The
expression ratios (normalized ratio of mean intensity of sample 1 to sample 2,
for example) or intensities of each experiment are organized in columns, while
each row provides expression data for each gene. A typical data table is shown in
Figure 7.11.2. The data matrix may contain missing values, meaning that no
value was provided in an experiment for a given gene (most likely because the
measurement was not reliable). The data matrix might contain no values for an
entire experiment if no probe was printed at one particular location for a batch of
microarray slides. These kinds of missing entries will be discarded during the
computation of the pairwise Pearson’s correlation coefficient. Data matrices
(whether ratio, intensity, or quality) are stored as tab-delimited text.
Figure 7.11.2 Data matrix format (tab-delimited text file displayed as Microsoft Excel spreadsheet). The first row of the matrix contains the experiment names and the first column contains
the IDs for each gene. It is recommended that white-space characters be excluded from IDs and
experiment names.
Figure 7.11.3 Color assignment table format (tab-delimited text file displayed as Microsoft Excel
spreadsheet). The first row has the color assignments (r = red, b = blue, y = yellow) and the second
row has the gene ID.
Quality matrix (optional): This file is organized similarly to the data matrix, but
each element of the quality matrix takes a value from 0 to 1, indicating the
lowest measurement quality (0) to highest measurement quality (1). Many
image analysis software programs or data-filtering schemes do not provide
quantitative quality assessment; instead, an incomplete data matrix (containing
missing values) may be provided. Inside MATLAB, a default quality matrix of all 1s
is automatically generated when no quality matrix is supplied. Any detected
missing value will be assigned a quality of 0 at the same location in the quality matrix.
Color assignment file (optional): This is a simple text file and the format is shown
in Figure 7.11.3. Each sample is assigned to a color code to provide some visual
assistance. Currently, the program only takes the following colors: r, g, b, c, m,
y, k, and w, representing red, green, blue, cyan, magenta, yellow, black, and
white, respectively.
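The file conventions above can be illustrated with a short sketch. It is written in Python rather than MATLAB and is not the distributed program; the function names are hypothetical, and treating any empty or non-numeric cell as a missing value is an assumption about how missing entries appear in the file.

```python
import math


def parse_matrix(text):
    """Parse a tab-delimited matrix: first row = experiment names, first column = gene IDs.

    Empty or non-numeric cells are treated as missing and stored as NaN.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    experiments = lines[0].split("\t")[1:]
    gene_ids, rows = [], []
    for line in lines[1:]:
        cells = line.split("\t")
        gene_ids.append(cells[0])
        values = []
        for cell in cells[1:len(experiments) + 1]:
            try:
                values.append(float(cell))
            except ValueError:
                values.append(math.nan)
        # Pad short rows (trailing missing cells) with NaN.
        values += [math.nan] * (len(experiments) - len(values))
        rows.append(values)
    return experiments, gene_ids, rows


def default_quality(rows):
    """Quality 1 for every present value, 0 wherever the value is missing."""
    return [[0.0 if math.isnan(v) else 1.0 for v in row] for row in rows]


# Tiny made-up matrix: gene G1 has no value in experiment Exp2.
example = "ID\tExp1\tExp2\tExp3\nG1\t1.2\t\t0.8\nG2\t0.5\t2.0\t1.1\n"
experiments, gene_ids, data = parse_matrix(example)
quality = default_quality(data)
print(experiments, gene_ids, quality)
```

The default quality matrix mirrors the behavior described for the optional quality file: every measured value gets quality 1 and every missing value gets quality 0 at the same position.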
Installing the MDS program
It is assumed that the base MATLAB program and Statistics Toolbox were installed prior
to this step. Users are encouraged to read the further installation instructions provided
in the README file that can be downloaded from the Web site mentioned in Necessary
Resources.
1. Download the software from http://research.nhgri.nih.gov/microarray/MDS Supplement/.
2. To install the package, drag the MDS folder to any location on the local hard drive, then
set the appropriate MATLAB default path to include the MDS folder (see README
file for details).
Executing the MDS program
Upon installing the MATLAB MDS scripts, one is ready to execute the program without
further compilation. To demonstrate the application of the MDS algorithm in the steps below, the
authors have provided a set of gene expression data derived from a melanoma study
(Bittner et al., 2000), downloadable from the aforementioned Web site. Briefly, a set of
38 microarray gene expression profiles is included, comprising 31 melanoma cell lines and tissue
samples and seven controls, each measured against a common reference sample. There are
a total of 3614 gene expression data points per experiment after careful quality control.
The original data can also be found at the following Web sites, along with a detailed description: the NHGRI microarray Web site (http://research.nhgri.nih.gov/microarray/)
for selected publications; and the NCBI GEO site (http://www.ncbi.nih.gov/geo/)
with the GEO accession number GSE1. Two files are available: (1) the gene expression ratio file Melanoma3614ratio.data; and (2) the color assignment
file Melanoma3614color.txt. These two files will be used in the following
steps.
3. Launch the MATLAB environment and type:
> MDSDialog
A dialog window should appear as shown in Figure 7.11.4.
Figure 7.11.4
MDS graphical user interface showing various options.
4. For Input File Types, choose the “Ratio-based” radio button.
5. For Input Files:
a. Check Data Matrix File, and a standard file input dialog box will be displayed.
Navigate to where the file Melanoma3614ratio.data is located, highlight
the file, and click the Open button.
b. Since there is no quality matrix provided in the example data set, leave the Quality
Matrix File check box unchecked.
c. Check Color Assignment File, navigate to where the file Melanoma3614color.txt
is located, highlight the file, and click the Open button.
6. For Preprocessing:
a. Check “Choose to round =” and use the default value 50 to automatically round
down every ratio greater than 50 to 50, or round up any ratio less than 1/50 to 1/50.
b. Check “Log-transform.”
7. For the Similarity/Distance choice, choose the default, “Pearson correlation.”
8. For Output choice:
a. Check “With sample name,” so that the final display will have a name attached to
each object.
b. Check “AVI Movie” to request an AVI movie output.
9. For Display mode, select the “3-D” radio button.
10. Click the OK button, and the MDS program will execute.
11. After the program ends, the user may rotate the coordinate system by holding the
mouse button down while moving the cursor on the screen.
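The two preprocessing options chosen in step 6 can be sketched as follows. This is an illustration in Python, not the code of the MDS program; the function name is hypothetical, and the use of base 2 for the log-transform is an assumption based on the remark under Critical Parameters that log2 transforms are typical.

```python
import math


def preprocess_ratio(ratio, bound=50.0):
    """Clamp a ratio to [1/bound, bound], then log2-transform it."""
    clamped = min(max(ratio, 1.0 / bound), bound)
    return math.log2(clamped)


# Ratios above 50 are rounded down to 50; ratios below 1/50 are rounded up to 1/50.
for r in (0.001, 0.5, 1.0, 2.0, 400.0):
    print(r, round(preprocess_ratio(r), 3))
```

After this step a two-fold increase maps to 1, a two-fold decrease maps to -1, and no transformed value lies outside the range set by the rounding bound.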
GUIDELINES FOR UNDERSTANDING RESULTS
In many expression profiling studies, one of the common questions concerns the relationship between samples. Sometimes, the class designations of samples are known (e.g.,
cancer types, stage of cancer progression, treatment, etc.), while in other cases there
are merely array-quality-related classifications (e.g., fabrication batch or hybridization
date). If it is assumed that similar molecular changes or similar microarray measurements can result from samples having similar phenotypes or from a particular fabrication batch, it is desirable to visualize the relationships between samples before committing to a particular statistical analysis. To visually observe the similarity between
test samples, one can simply use an expression ratio matrix that was deemed to be
of good measurement quality. To illustrate the result, the authors have employed the
data files from Bittner et al. (2000) supplied with this distribution. The expression
ratio matrix file Melanoma3614ratio.data contains 3614 genes that passed the
measurement quality criterion. The authors have also used the color assignment file
Melanoma3614color.txt, where samples with red color represent less aggressive
melanoma and those with blue color represent more aggressive; yellow signifies control
samples (see Bittner et al., 2000). Following the default setting (log-transform, rounding
at 50, Pearson correlation), a 3-D MDS plot is generated as shown in Figure 7.11.5A.
One immediate question is which gene’s (or set of genes’) expression levels
correspond to this clustering effect. It should be easy to see that when one assigns
red color to samples whose expression ratio of WNT5A (CloneID: 324901) is
less than 0.5, and otherwise blue color (color assignment file is provided as
Melanoma3614WNT5A.txt), then the MDS plot, shown in Figure 7.11.5B, resembles
Figure 7.11.5A. Using a subset of genes (276 total) derived from Bittner et al. (2000),
MelanomaSigGene276ratio.txt (notice that order of columns in this file is different from Melanoma3614ratio.data), the authors applied the same algorithm
along with the color assignment file MelanomaSigGene276color.txt. The MDS
result is shown in Figure 7.11.5C. Clearly, Figure 7.11.5C shows a remarkable clustering
effect in the less aggressive class (19 samples), but the same effect is not observed with
the other two groups (aggressive class and control class). Another excellent example of
the application of MDS techniques was by Hedenfalk et al. (2001). In their study of breast
cancer samples, cancer subtypes BRCA1, BRCA2, and sporadic were predetermined by
genotyping; a statistical analysis approach similar to that discussed before was employed,
and MDS plots with a set of discriminative genes demonstrated the power of the MDS visualization technique. Specific methods of selecting significantly expressed genes are beyond
the scope of this unit; users should refer to publications by Bittner et al. (2000), Hedenfalk
et al. (2001), Tusher et al. (2001), and Smyth (2004).
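The WNT5A overlay described above amounts to a threshold rule on a single gene's expression ratio. A minimal sketch of that rule is given below in Python. The sample names and ratios are invented for illustration and are not values from the melanoma data set; the function name is hypothetical, and the two-row output layout (colors, then identifiers) is an assumption based on the description of Figure 7.11.3.

```python
def colors_by_threshold(sample_names, ratios, threshold=0.5):
    """Map each sample to 'r' if its ratio is below the threshold, else 'b'."""
    return {name: ("r" if ratio < threshold else "b")
            for name, ratio in zip(sample_names, ratios)}


# Hypothetical ratios for illustration only; not values from the data set.
names = ["S1", "S2", "S3"]
wnt5a_ratios = [0.3, 0.5, 1.8]
assignment = colors_by_threshold(names, wnt5a_ratios)

# Two tab-delimited rows: color codes, then the matching identifiers.
print("\t".join(assignment[n] for n in names))
print("\t".join(names))
```

A ratio exactly equal to the threshold is colored blue here, because the rule in the text is "less than 0.5" for red.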
Figure 7.11.5 Three-dimensional MDS plots. (A) Generated from 3614 genes that passed the measurement quality criterion; (B) Color overlay with the WNT5A gene’s expression ratio (red = expression
ratio <0.5, blue = all others); (C) MDS plot with 276 discriminative genes (derived from Bittner
et al., 2000). This black and white facsimile of the figure is intended only as a placeholder; for
the full-color version of the figure go to http://www.interscience.wiley.com/c p/colorfigures.htm.
COMMENTARY
Background Information
Similarity and distance between samples
A gene expression data set is typically presented in a matrix form, $X = \{x_1, x_2, \ldots, x_M\}$,
where vector $x_i$ represents the ith microarray
experiment, with a total of M experiments in
the data set. Each experiment consists of n
measurements, either gene expression intensity or expression ratio, denoted by $x_i = \{x_{i1}, x_{i2}, \ldots, x_{in}\}$.
The matrix X can also be viewed
in the row direction, which provides profiles
of genes across all samples. Samples may be
ordered according to time or dosage, or may be
unordered collections. Without loss of generality, the following discussion will present
MDS concepts and algorithms by exploring
the relationship of experimental samples. In
addition, it will be assumed that the microarray data measurements (for expression ratio
or intensity) are log-transformed. Some other
data-transformation methods have also been
proposed in order to preserve some expected
statistical characteristics for higher-level analysis. For detailed discussion of data transformations, readers are referred to the literature
(Durbin et al., 2002; Huber et al., 2002).
Pearson correlation coefficient
One of the most commonly used similarity
measures is the Pearson correlation coefficient.
Given a matrix X, a similarity matrix of all
columns, $S = \{s_{ij}\}$, is defined as
Equation 7.11.1
where $\mu_{x_i}$, $\mu_{x_j}$ (and $\sigma_{x_i}$, $\sigma_{x_j}$) are the means (and
standard deviations) over all genes in arrays i and
j, respectively. The value of the correlation coefficient is between −1 and 1, with 1 signifying perfectly correlated, 0 not correlated at all,
and −1 inversely correlated. The distance between two samples i and j is simply defined as
$d_{ij} = 1 - s_{ij}$. Commonly, a normalization procedure is applied to microarray data before it
is entered into the similarity calculation, and when
gene expression ratios are under consideration,
the mean expression ratio, $\mu_{x_i}$, should be zero.
An uncentered correlation coefficient is defined as:
Equation 7.11.2
Equation 7.11.2 assumes that the mean is
0, even when it is not. The difference between
Equations 7.11.1 and 7.11.2 is that if two samples’ expression profiles have identical shape
but are offset relative to each other by a
fixed value, the standard Pearson correlation
of these two profiles will be 1, but their
uncentered correlation will not be 1.
Spearman correlation coefficient
On many occasions when it is not convenient, accurate, or even possible to give actual
values to variables, the rank order of instances
of each variable is commonly employed. The
rank order may also be a better measure when
the relationship is nonlinear. Rather than using the raw data as in the Pearson correlation
coefficient, the Spearman correlation coefficient uses the rank of each data point to perform the
calculation as given in Equation 7.11.1. The
Spearman correlation coefficient is defined as:
Equation 7.11.3
where $r_{ik}$ is the rank of the kth measurement in the ith
microarray. The range of the Spearman correlation coefficient is also from −1 to 1.
Euclidean distance
Euclidean distance is one of the most commonly
used dissimilarity measures, defined as:
Equation 7.11.4
Notice that the Euclidean distance is not
bounded; therefore, normalization and dynamic range rescaling for each microarray are
very important steps before the distance calculation. Otherwise, the distance measure will
reflect the bias due to inadequate normalization. There are many other distance or similarity measures (see Everitt and Dunn, 1992
and Duda et al., 2001), which are not covered
exhaustively here.
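The similarity and distance measures discussed in this section can be sketched in a few lines of Python. The functions use the standard textbook forms of these measures, which are assumed (not confirmed) to correspond to Equations 7.11.1 through 7.11.4. All function names are illustrative, and handling missing values by dropping incomplete pairs is an assumption consistent with the statement that missing entries are discarded during the pairwise computation.

```python
import math


def complete_pairs(x, y):
    """Keep only positions where both profiles have a value (NaN = missing)."""
    kept = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    return [a for a, _ in kept], [b for _, b in kept]


def pearson(x, y):
    """Centered Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def uncentered(x, y):
    """Correlation computed as if both means were zero."""
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def ranks(x):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    result = [0.0] * len(x)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and x[order[end + 1]] == x[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(x, y):
    """Pearson correlation applied to the ranks of the two profiles."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Distance d = 1 - s derived from the Pearson similarity."""
    return 1.0 - pearson(x, y)


a = [-1.0, 0.0, 2.0, 0.5]
b = [v + 3.0 for v in a]   # same shape, constant offset
print(pearson(a, b), uncentered(a, b), euclidean(a, b))
```

The example pair illustrates the offset remark made above: the two profiles have identical shape, so their Pearson correlation is 1, while the uncentered correlation is clearly below 1 and the Euclidean distance is driven entirely by the offset.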
Multidimensional scaling
As mentioned before, the similarity matrix
can be converted to a distance matrix, $D = \{d_{ij}\}$,
by D = 1 − S. A typical distance matrix is
shown in Table 7.11.1, where the diagonal elements are all zeros, the upper triangle
is symmetrical to the lower triangle, and
each element takes a value in [0, 2]. After this step, the original data dimension has
been effectively reduced to an M × M matrix,
and, clearly, the distance scale derived from
the similarity matrix preserves the defining properties of a
distance (origin at zero, ordered scale,
and ordered difference). The objective of multidimensional scaling is to further reduce the
dimension of the matrix such that an observable graphic model is obtained with a minimum error. Typically, a Cartesian coordinate
system represents an observable graphic model
with the Euclidean distance measuring the
between-point distance.
Table 7.11.1 Partial Distance Matrix, D, Generated from the Provided Data Sample (Bittner et al.,
2000)
            UACC383  UACC457  UACC3093  UACC2534  M92-001  A-375  UACC502  M91-054  UACC1256
UACC383     0
UACC457     0.4      0
UACC3093    0.37     0.32     0
UACC2534    0.44     0.49     0.41      0
M92-001     0.47     0.54     0.46      0.41      0
A-375       0.46     0.46     0.45      0.47      0.36     0
UACC502     0.57     0.57     0.52      0.54      0.42     0.5    0
M91-054     0.49     0.52     0.47      0.54      0.42     0.5    0.43     0
UACC1256    0.51     0.49     0.49      0.48      0.41     0.47   0.4      0.46     0
UACC091     0.48     0.56     0.5       0.53      0.47     0.48   0.49     0.41     0.42
To achieve the objective of mapping the distance matrix to a graphical model, one would
like to find a target matrix $\hat{X} = \{\hat{x}_{ij}\}$ in 1-,
2-, or 3-dimensional space, and its accompanying distance matrix $\hat{D} = \{\hat{d}_{ij}\}$ derived from
Euclidean distance in p-dimensional space,
Equation 7.11.5
such that the stress T, a measure of the
goodness-of-fit, is minimized. Here, the stress
T is defined by:
Equation 7.11.6
If an acceptably accurate representation, ranging from T = 20% (poor), to 5% (good),
to 2.5% (excellent), as suggested by Everitt
and Dunn (1992), can be found in three-dimensional (or lower) space, visualization via
MDS will be an extremely valuable way to gain
insight into the structure of data. However, due
to the high dimensionality of gene expression
profiling experiments, it is normally expected
that MDS plots will have a stress parameter T
in the range of 10% to 20%.
Implementation and applications
Many statistical software packages provide
their own implementations of classic MDS algorithms. Without going into a lengthy discussion
of the various MDS definitions and their
mathematics, the authors of this unit provide a
concise version of a possible implementation
of classic MDS (Everitt and Dunn, 1992). In
the classical MDS setting, given the pairwise
distance $d_{ij}$ between all M experiments, the
coordinates of M points in M-dimensional
Euclidean space, $X = (x_{ij})$, can be obtained by
finding the eigenvectors of $B = XX^{T}$, while the
matrix elements of B, $\{b_{ij}\}$, can be estimated
by:
Equation 7.11.7
where $d_{i\cdot}^2$, $d_{\cdot j}^2$, and $d_{\cdot\cdot}^2$ are the results of
summation of $d_{ij}^2$ over j, i, and both i and j,
respectively. By selecting the first k principal
components of X to form a target matrix $\hat{X} = \{\hat{x}_{ij}\}$,
for i = 1, . . . , M and j = 1, . . . , k, the
classic MDS implementation is achieved. It is
noted that some implementations of MDS provide a second step to optimize against Equation
7.11.5 via an optimization procedure such
as the steepest-descent method. On many
occasions, the second step is quite significant,
in that the stress T of MDS can be further reduced.
Readers who are interested in actual implementation are strongly advised to consult the
MDS literature listed at the end of this section.
Public domain and commercial
software packages
A large number of software packages provide MDS functionality. Some popular applications are listed in Table 7.11.2.
Critical Parameters
Some important clarifications are provided
as follows.
1. If the data matrix contains raw expression-ratio or intensity data, a logarithmic transform is highly recommended. Typically, log2
transforms are applied to the ratio variable in
order to obtain a symmetric distribution around
zero in which two-fold increases and decreases
in the ratio (2 and 0.5, respectively) will be
conveniently converted into 1 and −1, respectively. While the purpose of transforming ratio data is mainly to yield a symmetrical distribution, log-transformation of the intensity
mainly provides for the stabilization of variance across all possible intensity ranges. If not
satisfied with the log-transforms automatically
provided by the program, the user can supply a matrix with the transformation of choice,
in which case the “transformation” check box
should be unchecked.
2. Rounding is necessary to fix some extreme ratio or intensity problems. However,
when a pretransformed data matrix is supplied,
one should not use any data rounding.
3. The preferable distance metric is the
Pearson correlation coefficient. However, other
distance measures (e.g., Spearman, Euclidean;
see Background Information) are provided for
comparison purposes.
4. Generation of an AVI movie requires
a large amount of RAM, and the program typically produces an AVI movie of ∼300 MB. Other
compression techniques (QuickTime, MPEG,
etc.) may be required in order to produce a
movie of manageable size.
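The classical MDS procedure outlined under Implementation and applications can be sketched in plain Python as follows. This is an illustrative sketch, not the distributed MATLAB code. It assumes the usual double-centering form of the B matrix (written here with row, column, and grand means of the squared distances), uses a simple Jacobi rotation scheme for the eigen-decomposition, and evaluates one common normalized form of the stress; whether these match Equations 7.11.6 and 7.11.7 exactly is an assumption. All function names are hypothetical.

```python
import math


def double_center(dist):
    """b_ij = -1/2 (d_ij^2 - row mean - column mean + grand mean) of squared distances."""
    m = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(m)] for i in range(m)]
    row = [sum(sq[i]) / m for i in range(m)]
    col = [sum(sq[i][j] for i in range(m)) / m for j in range(m)]
    grand = sum(row) / m
    return [[-0.5 * (sq[i][j] - row[i] - col[j] + grand) for j in range(m)]
            for i in range(m)]


def jacobi_eigen(a, sweeps=100, tol=1e-20):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric matrix."""
    m = len(a)
    a = [r[:] for r in a]
    v = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(m) for j in range(m) if i != j)
        if off < tol:
            break
        for p in range(m - 1):
            for q in range(p + 1, m):
                if abs(a[p][q]) < 1e-300:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(m):      # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(m):      # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(m):      # accumulate eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [a[i][i] for i in range(m)], v


def classical_mds(dist, k=3):
    """Coordinates of the M points in k dimensions from an M x M distance matrix."""
    values, vectors = jacobi_eigen(double_center(dist))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    return [[vectors[i][j] * math.sqrt(max(values[j], 0.0)) for j in order]
            for i in range(len(dist))]


def stress(dist, coords):
    """Assumed stress form: sqrt(sum (d - d_hat)^2 / sum d^2) over pairs i < j."""
    num = den = 0.0
    for i in range(len(dist)):
        for j in range(i + 1, len(dist)):
            d_hat = math.sqrt(sum((u - w) ** 2 for u, w in zip(coords[i], coords[j])))
            num += (dist[i][j] - d_hat) ** 2
            den += dist[i][j] ** 2
    return math.sqrt(num / den)


# Four corners of a unit square: a planar configuration, so k = 2 should fit.
r2 = math.sqrt(2.0)
square = [[0, 1, r2, 1], [1, 0, 1, r2], [r2, 1, 0, 1], [1, r2, 1, 0]]
coords = classical_mds(square, k=2)
print(stress(square, coords))
```

For distances that come from a genuinely planar configuration, such as the square in the example, the two-dimensional solution reproduces the distances and the stress is essentially zero. For correlation-based distances between expression profiles a nonzero stress is to be expected, in line with the 10% to 20% range quoted above.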
Table 7.11.2 Software Packages Providing MDS Functionality
Application | Function | Availability | Platform
R | Classic multidimensional scaling, cmdscale(d, k, eig, add, x.ret) | Free (http://www.r-project.org/) | Windows, Unix, Mac OS X
S-plus | Classic multidimensional scaling, cmdscale(d, k, eig, add, x.ret) | Commercial, Insightful Corp. (http://www.insightful.com) | Windows, Unix
BRB ArrayTools | Menu-driven microarray analysis tools | Free (see licensing agreement), Biometric Research Branch, DCTD, NCI, NIH (http://linus.nci.nih.gov/BRBArrayTools.html) | Windows
Partek Pro/Partek Discover | Menu-driven data analysis and visualization | Commercial, Partek, Inc. (http://www.partek.com) | Windows, Unix
xGobi/xGvis | Multivariate data visualization and multidimensional scaling | Free (http://www.research.att.com/areas/stat/xgobi/) | Windows, Unix, Mac OS X
Literature Cited
Bittner, M., Meltzer, P., Chen, Y., Jiang, Y.,
Seftor, E., Hendrix, M., Radmacher, M.,
Simon, R., Yakhini, Z., Ben-Dor, A., Sampas,
N., Dougherty, E., Wang, E., Marincola, F.,
Gooden, C., Lueders, J., Glatfelter, A., Pollock,
P., Carpten, J., Gillanders, E., Leja, D., Dietrich,
K., Beaudry, C., Berens, M., Alberts, D., and
Sondak, V. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406:536-540.
Borg, I. and Groenen, P. 1997. Modern Multidimensional Scaling: Theory and Applications.
Springer, New York.
Cox, T.F. and Cox, M.A.A. 2000. Multidimensional
Scaling, 2nd ed. CRC Press, Boca Raton, Fla.
Durbin, B., Hardin, J., Hawkins, D., and Rocke, D.
2002. A variance-stabilizing transformation for
gene-expression microarray data. Bioinformatics 18:105-110.
Duda, R.O., Hart, E., and Stork, D.G. 2001. Pattern
Classification, 2nd ed. John Wiley & Sons,
New York.
Everitt, B.S. and Dunn, G. 1992. Applied Multivariate Data Analysis. Oxford University Press,
New York.
Green, P.E. and Rao, V.R. 1972. Applied Multidimensional Scaling. Dryden Press, Hinsdale, Ill.
Green, P.E., Carmone, F.J., and Smith, S.M. 1989.
Multidimensional Scaling: Concepts and Applications. Allyn and Bacon, Needham Heights,
Mass.
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher,
M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O.P., Wilfond,
B., Borg, A., and Trent, J. 2001. Gene expression
profiles in hereditary breast cancer. N. Engl. J.
Med. 344:539-548.
Huber, W., von Heydebreck, A., Sultmann, H.,
Poustka, A., and Vingron, M. 2002. Variance stabilization applied to microarray data calibration
and to the quantification of differential expression. Bioinformatics 18:S96-S104.
Khan, J., Simon, R., Bittner, M., Chen, Y., Leighton,
S.B., Pohida, T., Smith, P.D., Jiang, Y., Gooden,
G.C., Trent, J.M., and Meltzer, P.S. 1998. Gene
expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays. Cancer Res.
58:5009-5013.
Schiffman, S.S., Reynolds, M.L., and Young, F.W.
1981. Introduction to Multidimensional Scaling:
Theory, Method and Applications. Academic
Press, Inc., New York.
Smyth, G. 2004. Linear models and empirical
Bayes methods for assessing differential expression in microarray experiments. Statistical
Applications in Genetics and Molecular Biology
Vol. 3: No. 1, Article 3. http://www.bepress.com/sagmb/vol3/iss1/art3.
Tusher, V.G., Tibshirani, R., and Chu, G. 2001. Significance analysis of microarrays applied to the
ionizing radiation response. Proc. Natl. Acad.
Sci. U.S.A. 98:5116-5121.
Young, F.W. 1987. Multidimensional Scaling: History, Theory and Applications (R.M. Hamer,
ed.). Lawrence Erlbaum Associates, Hillsdale,
N.J.
Contributed by Yidong Chen and
Paul S. Meltzer
National Human Genome
Research Institute
National Institutes of Health
Bethesda, Maryland
Analyzing
Expression
Patterns
7.11.9
Current Protocols in Bioinformatics
Supplement 10
Using GenePattern for Gene Expression
Analysis
UNIT 7.12
Heidi Kuehn,1 Arthur Liberzon,1 Michael Reich,1 and Jill P. Mesirov1
1Broad Institute of MIT and Harvard, Cambridge, Massachusetts
ABSTRACT
The abundance of genomic data now available in biomedical research has stimulated
the development of sophisticated statistical methods for interpreting the data, and of
special visualization tools for displaying the results in a concise and meaningful manner.
However, biologists often find these methods and tools difficult to understand and use
correctly. GenePattern is a freely available software package that addresses this issue by
providing more than 100 analysis and visualization tools for genomic research in a comprehensive user-friendly environment for users at all levels of computational experience
and sophistication. This unit demonstrates how to prepare and analyze microarray data
in GenePattern. Curr. Protoc. Bioinform. 22:7.12.1-7.12.39. © 2008 by John Wiley & Sons, Inc.
Keywords: GenePattern · microarray data analysis · workflow · clustering · classification · differential expression analysis · pipelines
INTRODUCTION
GenePattern is a freely available software package that provides access to a wide range
of computational methods used to analyze genomic data. It allows researchers to analyze
the data and examine the results without writing programs or requesting help from computational colleagues. Most importantly, GenePattern ensures reproducibility of analysis
methods and results by capturing the provenance of the data and analytic methods, the
order in which methods were applied, and all parameter settings.
At the heart of GenePattern are the analysis and visualization tools (referred to as
“modules”) in the GenePattern module repository. This growing repository currently
contains more than 100 modules for analysis and visualization of microarray, SNP,
proteomic, and sequence data. In addition, GenePattern provides a form-based interface
that allows researchers to incorporate external tools as GenePattern modules.
Typically, the analysis of genomic data consists of multiple steps. In GenePattern, this corresponds to the sequential execution of multiple modules. With GenePattern, researchers
can easily share and reproduce analysis strategies by capturing the entire set of steps
(along with data and parameter settings) in a form-based interface or from an analysis
result file. The resulting “pipeline” makes all the necessary calls to the required modules.
A pipeline allows repetition of the analysis methodology using the same or different data
with the same or modified parameters. It can also be exported to a file and shared with
colleagues interested in reproducing the analysis.
GenePattern is a client-server application. Application components can all be run on a
single machine with requirements as modest as that of a laptop, or they can be run on
separate machines allowing the server to take advantage of more powerful hardware. The
server is the GenePattern engine: it runs analysis modules and stores analysis results.
Two point-and-click graphical user interfaces, the Web Client and the Desktop Client,
provide easy access to the server and its modules. The Web Client is installed with the
server and runs in a Web browser. The Desktop Client is installed separately and runs
as a desktop application. In addition, GenePattern libraries for the Java, MATLAB, and
R programming environments provide access to the server and its modules via function
calls. The basic protocols in this unit use the Web Client; however, they could also be
run from the Desktop Client or a programming environment.
Current Protocols in Bioinformatics 7.12.1-7.12.39, June 2008
Published online June 2008 in Wiley Interscience (www.interscience.wiley.com).
DOI: 10.1002/0471250953.bi0712s22
Copyright © 2008 John Wiley & Sons, Inc.
This unit demonstrates the use of GenePattern for microarray analysis. Many transcription
profiling experiments have at least one of the three following goals: differential expression
analysis, class discovery, or class prediction. The objective of differential expression
analysis is to find genes (if any) that are differentially expressed between distinct classes
or phenotypes of samples. The differentially expressed genes are referred to as marker
genes and the analysis that identifies them is referred to as marker selection. Class
discovery allows a high-level overview of microarray data by grouping genes or samples
by similar expression profiles into a smaller number of patterns or classes. Grouping genes
by similar expression profiles helps to detect common biological processes, whereas
grouping samples by similar gene expression profiles can reveal common biological
states or disease subtypes. A variety of clustering methods address class discovery from
gene expression data. In class prediction studies, the aim is to identify key marker genes
whose expression profiles will correctly classify unlabeled samples into known classes.
For illustration purposes, the protocols use expression data from Golub et al. (1999),
which is referred to as the ALL/AML dataset in the text. This study
was chosen because it addresses all three of the analysis objectives mentioned above.
Briefly, the study built predictive models using marker genes that were significantly
differentially expressed between two subtypes of leukemia, acute lymphoblastic (ALL)
and acute myelogenous (AML). It also showed how to rediscover the leukemia subtypes
ALL and AML, as well as the B and T cell subtypes of ALL, using sample-based
clustering. The sample data files are available for download on the GenePattern Web site
at http://www.genepattern.org/datasets/.
PREPARING THE DATASET
Analyzing gene expression data with GenePattern typically begins with three critical
steps.
Step 1 entails converting gene expression data from any source (e.g., Affymetrix or
cDNA microarrays) into a tab-delimited text file that contains a column for each sample,
a row for each gene, and an expression value for each gene in each sample. GenePattern
defines two file formats for gene expression data: GCT and RES. The primary difference
between the formats is that the RES file format contains the absent (A) versus present
(P) calls as generated for each gene by Affymetrix GeneChip software. The protocols
in this unit use the GCT file format. However, the protocols could also use the RES
file format. All GenePattern file formats are fully described in GenePattern File Formats
(http://genepattern.org/tutorial/gp_fileformats.html).
Step 2 entails creating a tab-delimited text file that specifies the class or phenotype of
each sample in the expression dataset, if available. GenePattern uses the CLS file format
for this purpose.
Step 3 entails preprocessing the expression data as needed, for example, to remove
platform noise and genes that have little variation across samples. GenePattern provides
the PreprocessDataset module for this purpose.
BASIC
PROTOCOL 1
Creating a GCT File
Four strategies can be used to create an expression data file (GCT file format; Fig. 7.12.1)
depending on how the data was acquired:
1. Create a GCT file based on expression data extracted from the Gene Expression
Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) or the National Cancer Institute’s caArray microarray expression data repository (http://caarray.nci.nih.gov).
GenePattern provides two modules for this purpose: GEOImporter and caArrayImportViewer.
2. Convert MAGE-ML format data to a GCT file. MAGE-ML is the standard format for storing both Affymetrix and cDNA microarray data at the ArrayExpress
repository (http://www.ebi.ac.uk/arrayexpress). GenePattern provides the MAGEMLImportViewer module to convert MAGE-ML format data.
3. Convert raw expression data from Affymetrix CEL files to a GCT file. GenePattern
provides the ExpressionFileCreator module for this purpose.
4. Expression data stored in any other format (such as cDNA microarray data)
must be converted into a tab-delimited text file that contains expression measurements with genes as rows and samples as columns. Expression data can be
intensity values or ratios. Use Excel or a text editor to manually modify the
text file to comply with the GCT file format requirements. Excel is a popular choice for editing gene expression data files. However, be aware that (1) its
auto-formatting can introduce errors in gene names (Zeeberg et al., 2004) and
(2) its default file extension for tab-delimited text is .txt. GenePattern requires
a .gct file extension for GCT files. In Excel, choose Save As and save the file in
text (tab delimited) format with a .gct extension.
Table 7.12.1 lists commonly used gene expression data formats and the recommended
method for converting each into a GenePattern GCT file. For the protocols in this unit,
download the expression data files all_aml_train.gct and all_aml_test.gct
from the GenePattern Web site, at http://www.genepattern.org/datasets/.
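As a rough illustration of strategy 4, the following Python sketch writes a small expression matrix as tab-delimited text with genes as rows and samples as columns. The header layout used here (a #1.2 version line, a line with the row and column counts, then Name and Description columns before the sample columns) reflects our understanding of the GCT format and should be checked against GenePattern File Formats. The function name write_gct and the toy gene and sample names are hypothetical.

```python
import io


def write_gct(handle, genes, descriptions, samples, values):
    """Write a genes-by-samples matrix in a GCT-like, tab-delimited layout.

    Assumed layout (verify against the GenePattern File Formats page):
    line 1 is the version string, line 2 holds the row and column counts,
    line 3 is the column header, and each later line is one gene.
    """
    if len(genes) != len(values) or len(genes) != len(descriptions):
        raise ValueError("need one description and one row of values per gene")
    handle.write("#1.2\n")
    handle.write(f"{len(genes)}\t{len(samples)}\n")
    handle.write("\t".join(["Name", "Description", *samples]) + "\n")
    for gene, desc, row in zip(genes, descriptions, values):
        if len(row) != len(samples):
            raise ValueError(f"row {gene!r} does not have one value per sample")
        handle.write("\t".join([gene, desc, *(str(v) for v in row)]) + "\n")


# Toy data (hypothetical): two genes measured in three samples.
buffer = io.StringIO()
write_gct(
    buffer,
    genes=["geneA", "geneB"],
    descriptions=["toy gene A", "toy gene B"],
    samples=["sample1", "sample2", "sample3"],
    values=[[120.0, 85.5, 300.0], [15.0, 22.0, 18.5]],
)
gct_text = buffer.getvalue()
```

To produce a file that GenePattern accepts, the same function could be called with a handle opened on a path ending in .gct. Writing the file from a script rather than from Excel also avoids the gene-name auto-formatting problem noted above.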
Figure 7.12.1 all_aml_train.gct as it appears in Excel. GenePattern File Formats
(http://genepattern.org/tutorial/gp_fileformats.html) fully describes the GCT file format.
Table 7.12.1 GenePattern Modules for Translating Expression Data into GCT or RES File Formats
Source data                                  GenePattern module^a     Output file^a
CEL files from Affymetrix                    ExpressionFileCreator    GCT or RES
Gene Expression Omnibus (GEO) data           GEOImporter              GCT
MAGE-ML expression data from ArrayExpress    MAGEMLImportViewer       GCT
caArray expression data                      caArrayImportViewer      GCT
Two-color ratio data^b                       N/A                      N/A
a N/A, not applicable.
b Two-color ratio data in text format files, such as PCL and CDT, can be opened in Excel or a text editor and modified to
match the GCT or RES file format.
BASIC
PROTOCOL 2
Creating a CLS File
Many of the GenePattern modules for gene expression analysis require both an expression
data file and a class file (CLS format). A CLS file (Fig. 7.12.2) identifies the class or
phenotype of each sample in the expression data file. It is a space-delimited text file that
can be created with any text editor.
The first line of the CLS file contains three values: the number of samples, the number of
classes, and the version number of file format (always 1). The second line begins with a
pound sign (#) followed by a name for each class. The last line contains a class label for
each sample. The number and order of the labels must match the number and order of
the samples in the expression dataset. The class labels are sequential numbers (0, 1, . . .)
assigned to each class listed in the second line.
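The three-line layout just described can be sketched in Python as follows. This is a minimal illustration rather than a GenePattern utility: the function name make_cls is hypothetical, and the sketch assumes that classes are numbered in the order in which they first appear among the samples.

```python
def make_cls(sample_classes):
    """Return CLS text for a list of per-sample class names (in sample order)."""
    class_names = []
    for name in sample_classes:
        if name not in class_names:
            class_names.append(name)  # classes numbered by first appearance
    labels = [str(class_names.index(name)) for name in sample_classes]
    header = f"{len(sample_classes)} {len(class_names)} 1"  # samples, classes, version
    names_line = "# " + " ".join(class_names)
    labels_line = " ".join(labels)
    return "\n".join([header, names_line, labels_line]) + "\n"


# Five hypothetical samples, listed in the same order as the expression columns.
cls_text = make_cls(["ALL", "ALL", "AML", "ALL", "AML"])
```

For the five toy samples, the first line reads "5 2 1", the second "# ALL AML", and the third "0 0 1 0 1". The label order must match the sample order in the expression data file.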
For the protocols in this unit, download the class files all_aml_train.cls and
all_aml_test.cls from the GenePattern Web site at http://www.genepattern.org/datasets/.
Figure 7.12.2 all_aml_train.cls as it appears in Notepad. GenePattern File Formats
(http://genepattern.org/tutorial/gp_fileformats.html) fully describes the CLS file format.
BASIC
PROTOCOL 3
Preprocessing Gene Expression Data
Most analyses require preprocessing of the expression data. Preprocessing removes
platform noise and genes that have little variation so the analysis can identify interesting variations, such as the differential expression between tumor and normal tissue.
GenePattern provides the PreprocessDataset module for this purpose. This module can
perform one or more of the following operations (in order):
1. Set threshold and ceiling values. Any expression value lower than the threshold value
is set to the threshold. Any value higher than the ceiling value is set to the ceiling
value.
2. Convert each expression value to the log base 2 of the value. When using ratios
to compare gene expression between samples, this transformation brings up- and
down-regulated genes to the same scale. For example, ratios of 2 and 0.5, indicating
two-fold changes for up- and down-regulated expression, respectively, become +1
and −1 (Quackenbush, 2002).
3. Remove genes (rows) for which a given number of sample values are less than a given
threshold. Such rows may indicate poor-quality data.
4. Remove genes (rows) that do not have a minimum fold change or expression
variation. Genes with little variation across samples are unlikely to be biologically
relevant to a comparative analysis.
5. Discretize or normalize the data. Discretization converts continuous data into a small
number of finite values. Normalization adjusts gene expression values to remove
systematic variation between microarray experiments. Both methods may be used to
make sample data more comparable.
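To make operations 1, 2, and 4 concrete, the sketch below applies a threshold and ceiling to each value, then drops rows that fail a minimum fold change or minimum variation test, and finally shows the log base 2 example from operation 2. It is an illustration of the ideas only, not the PreprocessDataset implementation. The function names are hypothetical, and the threshold, fold change, and delta values are arbitrary choices made for the toy data (only the ceiling of 20,000 is taken from the module's documented default).

```python
import math


def clamp(value, threshold, ceiling):
    """Operation 1: raise low values to the threshold, cap high values at the ceiling."""
    return max(threshold, min(ceiling, value))


def preprocess(rows, threshold, ceiling, min_fold_change, min_delta):
    """Apply threshold/ceiling, then the variation filter (operation 4).

    rows maps a gene name to its list of expression values across samples.
    A row is dropped if max/min < min_fold_change or max - min < min_delta.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive so that max/min is defined")
    kept = {}
    for gene, values in rows.items():
        clamped = [clamp(v, threshold, ceiling) for v in values]
        highest, lowest = max(clamped), min(clamped)
        if highest / lowest < min_fold_change or highest - lowest < min_delta:
            continue  # too little variation across samples
        kept[gene] = clamped
    return kept


# Toy data (hypothetical genes and values).
rows = {
    "flat": [100, 110, 105],
    "noisy_low": [5, 12, 8],
    "marker": [50, 900, 30000],
}
filtered = preprocess(rows, threshold=20, ceiling=20000,
                      min_fold_change=3, min_delta=100)

# Operation 2: ratios of 2 and 0.5 become +1 and -1 on the log2 scale.
log_ratios = [math.log2(r) for r in (2, 0.5)]
```

In this toy run, only the "marker" row survives: the "flat" row varies by a factor of 1.1, and every value in the "noisy_low" row is raised to the threshold, which leaves it with no variation at all.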
For illustration purposes, this protocol applies thresholds and variation filters (operations
1, 3, and 4 in the list above) to expression data, and Basic Protocols 4, 5, and 6 analyze
the preprocessed data. In practice, the decision of whether to preprocess expression data
depends on the data and the analyses being run. For example, a researcher should not
preprocess the data if doing so removes genes of interest from the result set. Similarly,
while researchers generally preprocess expression data before clustering, if doing so removes relevant biological information, the data should not be preprocessed. For example,
if clusters based on minimal differential gene expression are of biological interest, do
not filter genes based on differential expression.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or
browser to access GenePattern on line (the Support Protocol describes how to
start GenePattern)
Modules used in this protocol: PreprocessDataset (version 3)
Files
The PreprocessDataset module requires gene expression data in a tab-delimited
text file (GCT file format, Fig. 7.12.1) that contains a column for each sample
and a row for each gene. Basic Protocol 1 describes how to convert various gene
expression data into this file format.
As an example, this protocol uses the ALL/AML leukemia training dataset
(Golub et al., 1999) consisting of 38 bone marrow samples (27 ALL, 11 AML).
Download the data file (all_aml_train.gct) from the GenePattern Web site
at http://www.genepattern.org/datasets/.
1. Start PreprocessDataset: select it from the Modules & Pipelines list on the GenePattern start page (Fig. 7.12.3). The PreprocessDataset module is in the Preprocess &
Utilities category.
GenePattern displays the parameters for the PreprocessDataset module (Fig. 7.12.4). For
information about the module and its parameters, click the Help link at the top of the
form.
Figure 7.12.3 GenePattern Web Client start page. The Modules & Pipelines pane lists all modules installed on the GenePattern server. For illustration purposes, we installed only the modules
used in this protocol. Typically, more modules are listed.
Figure 7.12.4 PreprocessDataset parameters. Table 7.12.2 describes the PreprocessDataset
parameters.
Table 7.12.2 Parameters for PreprocessDataset
Parameter
    Description
input filename
    Gene expression data (GCT or RES file format)
output file
    Output file name (do not include file extension)
output file format
    Select a file format for the output file
filter flag
    Whether to apply thresholding (threshold and ceiling parameters) and variation filters (minchange,
    mindelta, num excl, and prob thres parameters) to the dataset
preprocessing flag
    Whether to discretize (max sigma binning parameter) the data, normalize the data, or both (by
    default, the module does neither)
minchange
    Exclude rows that do not meet this minimum fold change: maximum-value/minimum-value <
    minchange
mindelta
    Exclude rows that do not meet this minimum variation filter: maximum-value - minimum-value
    < mindelta
threshold
    Reset values less than this to this value: threshold if < threshold
ceiling
    Reset values greater than this to this value: ceiling if > ceiling (by default, the ceiling is 20,000)
max sigma binning
    Used for discretization (preprocessing flag parameter), which converts expression values to
    discrete values based on standard deviations from the mean. Values less than one standard
    deviation from the mean are set to 1 (or -1), values one to two standard deviations from the mean
    are set to 2 (or -2), and so on. This parameter sets the upper (and lower) bound for the discrete
    values. By default, max sigma binning = 1, which sets expression values above the mean to 1 and
    expression values below the mean to -1.
prob thres
    Use this probability threshold to apply variation filters (filter flag parameter) to a subset of the
    data. Specify a value between 0 and 1, where 1 (the default) applies variation filters to 100% of
    the dataset. We recommend that only advanced users modify this option.
num excl
    Exclude this number of maximum (and minimum) values before selecting the
    maximum-value (and minimum-value) for minchange and mindelta. This prevents a gene that has
    "spikes" in its data from passing the variation filter.
log base two
    Converts each expression value to the log base 2 of the value; any negative or 0 value is marked
    "NaN", indicating an invalid value
number of columns above threshold
    Removes underexpressed genes by removing rows that do not have at least a given number of
    entries (this parameter) above a given value (column threshold parameter).
column threshold
    Removes underexpressed genes by removing rows that do not have at least a given number of
    entries (number of columns above threshold parameter) above a given value (this parameter).
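The discretization controlled by the max sigma binning parameter can be sketched as follows. This is our reading of the description in Table 7.12.2, not the module's code. The sketch assumes that the mean and standard deviation are computed over the values of a single row, that the sample (n - 1) standard deviation is used, and that a value exactly equal to the mean is assigned +1. The function name sigma_bin is hypothetical.

```python
import math
import statistics


def sigma_bin(values, max_sigma_binning=1):
    """Convert values to signed integer bins by distance from the mean in SD units.

    Within one SD of the mean -> 1 (or -1), one to two SDs -> 2 (or -2), and
    so on, with the magnitude capped at max_sigma_binning.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    if sd == 0:
        raise ValueError("all values are identical; bins are undefined")
    binned = []
    for v in values:
        z = (v - mean) / sd
        level = min(max_sigma_binning, max(1, math.ceil(abs(z))))
        binned.append(level if z >= 0 else -level)  # assumption: mean itself -> +1
    return binned


row = [10, 20, 30, 40, 100]                      # hypothetical expression values
default_bins = sigma_bin(row)                    # max sigma binning = 1
wider_bins = sigma_bin(row, max_sigma_binning=2)
```

With the default setting, every value collapses to +1 or -1 according to its side of the mean. Raising the bound to 2 lets the outlying value of 100, which lies between one and two standard deviations above the mean, be reported as 2.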
2. For the “input filename” parameter, select gene expression data in the GCT file
format.
For example, use the Browse button to select all_aml_train.gct.
3. Review the remaining parameters to determine which values, if any, should be
modified (see Table 7.12.2).
For this example, use the default values.
4. Click Run to start the analysis.
GenePattern displays a status page. When the analysis completes, the status page lists
the analysis result files: the all_aml_train.preprocessed.gct file contains the
preprocessed gene expression data; the gp_task_execution_log.txt file lists the
parameters used for the analysis.
5. Click the Return to Modules & Pipelines Start link at the bottom of the status page
to return to the GenePattern start page.
BASIC
PROTOCOL 4
DIFFERENTIAL ANALYSIS: IDENTIFYING DIFFERENTIALLY
EXPRESSED GENES
This protocol focuses on differential expression analysis, where the aim is to identify
genes (if any) that are differentially expressed between distinct classes or phenotypes.
GenePattern uses the ComparativeMarkerSelection module for this purpose (Gould et al.,
2006).
For each gene, the ComparativeMarkerSelection module uses a test statistic to calculate
the difference in gene expression between the two classes and then estimates the significance (p-value) of the test statistic score. Because testing tens of thousands of genes
simultaneously increases the possibility of mistakenly identifying a non-marker gene
as a marker gene (a false positive), ComparativeMarkerSelection corrects for multiple
hypothesis testing by computing both the false discovery rate (FDR) and the family-wise
error rate (FWER). The FDR represents the expected proportion of non-marker genes
(false positives) within the set of genes declared to be differentially expressed. The FWER
represents the probability of having any false positives. It is in general stricter or more
conservative than the FDR. Thus, the FWER may frequently fail to find marker genes
due to the noisy nature of microarray data and the large number of hypotheses being
tested. Researchers generally identify marker genes based on the FDR rather than the
more conservative FWER.
Measures such as FDR and FWER control for multiple hypothesis testing by “inflating” the nominal p-values of the single hypotheses (genes). This allows for controlling
the number of false positives but at the cost of potentially increasing the number of
false negatives (markers that are not identified as differentially expressed). We therefore
recommend fully preprocessing the gene expression dataset as described in Basic Protocol 3 before running ComparativeMarkerSelection, to reduce the number of hypotheses
(genes) to be tested.
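The difference between FWER and FDR control can be illustrated with two textbook adjustments: the Bonferroni correction (FWER) and the Benjamini-Hochberg procedure (FDR). The sketch below is a generic implementation of those two procedures, not the code used by ComparativeMarkerSelection, and the five nominal p-values are invented for illustration.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR control: BH-adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


nominal = [0.001, 0.008, 0.02, 0.03, 0.60]   # hypothetical nominal p-values
fwer = bonferroni(nominal)
fdr = benjamini_hochberg(nominal)

n_fwer = sum(1 for p in fwer if p <= 0.05)   # genes passing the FWER cutoff
n_fdr = sum(1 for p in fdr if p <= 0.05)     # genes passing the FDR cutoff
```

At a cutoff of 0.05, the Bonferroni-adjusted values retain two of the five toy genes, whereas the BH-adjusted values retain four, which is the sense in which the FWER is the more conservative measure. Both adjustments scale with the number of tests, so removing uninformative genes beforehand makes each one less severe.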
ComparativeMarkerSelection generates a structured text output file that includes the test
statistic score, its p-value, two FDR statistics, and three FWER statistics for each gene.
The ComparativeMarkerSelectionViewer module accepts this output file and displays the
results interactively. Use the viewer to sort and filter the results, retrieve gene annotations
from various public databases, and create new gene expression data files from the original
data. Optionally, use the HeatMapViewer module to generate a publication quality heat
map of the differentially expressed genes. Heat maps represent numeric values, such as
intensity, as colors making it easier to see patterns in the data.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or
browser to access GenePattern on line (the Support Protocol describes how to
start GenePattern)
Modules used in this protocol: ComparativeMarkerSelection (version 4),
ComparativeMarkerSelectionViewer (version 4), and HeatMapViewer
(version 8)
Files
The ComparativeMarkerSelection module requires two files as input: one for gene
expression data and another that specifies the class of each sample. The classes
usually represent phenotypes, such as tumor or normal. The expression data file
is a tab-delimited text file (GCT file format, Fig. 7.12.1) that contains a column
for each sample and a row for each gene. Classes are defined in a separate
space-delimited text file (CLS file format, Fig. 7.12.2). Basic Protocols 1 and 2
describe how to convert various gene expression data into these file formats.
As an example, this protocol uses the ALL/AML leukemia training dataset (Golub
et al., 1999) consisting of 38 bone marrow samples (27 ALL, 11 AML).
Download the data files (all_aml_train.gct and all_aml_train.cls)
from the GenePattern Web site at http://www.genepattern.org/datasets/. This
protocol assumes that the expression data file, all_aml_train.gct, has been
preprocessed according to Basic Protocol 3. The preprocessed expression data
file, all_aml_train.preprocessed.gct, is used in this protocol.
Run ComparativeMarkerSelection analysis
1. Start ComparativeMarkerSelection by selecting it from the Modules & Pipelines list
on the GenePattern start page (this can be found in the Gene List Selection category).
GenePattern displays the parameters for the ComparativeMarkerSelection (Fig. 7.12.5).
For information about the module and its parameters, click the Help link at the top of the
form.
2. For the “input filename” parameter, select gene expression data in GCT file format.
For example, select the preprocessed data file, all_aml_train.preprocessed.gct:
in the Recent Jobs list, locate the PreprocessDataset module and its
all_aml_train.preprocessed.gct result file, click the icon next
to the result file, and, from the menu that appears, select the Send to input filename
command.
3. For the “cls filename” parameter, select a class descriptions file. This file should be
in CLS format (see Basic Protocol 2).
For example, use the Browse button to select the all_aml_train.cls file.
4. Review the remaining parameters to determine which values, if any, should be
modified (see Table 7.12.3).
For this example, use the default values.
Figure 7.12.5 ComparativeMarkerSelection parameters. Table 7.12.3 describes the ComparativeMarkerSelection parameters.
Table 7.12.3 Parameters for the ComparativeMarkerSelection Analysis
Parameter
    Description
input file
    Gene expression data (GCT or RES file format)
cls file
    Class file (CLS file format) that specifies the phenotype of each sample in the
    expression data
confounding variable cls filename
    Class file (CLS file format) that specifies a second class (the confounding
    variable) for each sample in the expression data. Specify a confounding
    variable class file to have permutations shuffle the phenotype labels only
    within the subsets defined by that class file. For example, in Lu et al. (2005), to
    select features that best distinguish tumors from normal samples on all tissue
    types, tissue type is treated as the confounding variable. In this case, the CLS
    file that defines the confounding variable lists each tissue type as a phenotype
    and associates each sample with its tissue type. Consequently, when
    ComparativeMarkerSelection performs permutations, it shuffles the
    tumor/normal labels only among samples with the same tissue type.
test direction
    Determine how to measure differential expression. By default,
    ComparativeMarkerSelection performs a two-sided test: a differentially
    expressed gene might be up-regulated for either class. Alternatively, have
    ComparativeMarkerSelection perform a one-sided test: a differentially
    expressed gene is up-regulated for class 0 or up-regulated for class 1. A
    one-sided test is less reliable; therefore, if performing a one-sided test, also
    perform the two-sided test and consider both sets of results.
test statistic
    Statistic to use for computing differential expression.
    t-test (the default) is the standardized mean difference in gene expression
    between the two classes:
    $$\frac{\mu_a - \mu_b}{\sqrt{\dfrac{\sigma_a^2}{n_a} + \dfrac{\sigma_b^2}{n_b}}}$$
    where $\mu$ is the mean of the sample, $\sigma^2$ is the variance of the population, and $n$
    is the number of samples.
    Signal-to-noise ratio is the ratio of mean difference in gene expression and
    standard deviation:
    $$\frac{\mu_a - \mu_b}{\sigma_a + \sigma_b}$$
    where $\mu$ is the mean of the sample and $\sigma$ is the population standard deviation.
    Either statistic can be modified by using median gene expression rather than
    mean, enforcing a minimum standard deviation, or both.
min std
    When the selected test statistic computes differential expression using a
    minimum standard deviation, specify that minimum standard deviation.
number of permutations
    Number of permutations used to estimate the p-value, which indicates the
    significance of the test statistic score for a gene. If the dataset includes at least
    eight samples per phenotype, use the default value of 1000 permutations to
    estimate a p-value accurate to four significant digits. If the dataset includes
    fewer than eight samples in any class a permutation test should not be used.
complete
    Whether to perform all possible permutations. By default, complete is set to
    "no" and number of permutations determines the number of permutations
    performed. Because of the statistical considerations surrounding permutation
    tests on small numbers of samples, we recommend that only advanced users
    select this option.
balanced
    Whether to perform balanced permutations. By default, balanced is set to "no"
    and phenotype labels are permuted without regard to the number of samples
    per phenotype (e.g., if the dataset has twenty samples in class 0 and ten
    samples in class 1, for each permutation the thirty labels are randomly assigned
    to the thirty samples). Set balanced to "yes" to permute phenotype labels after
    balancing the number of samples per phenotype (e.g., if the dataset has twenty
    samples in class 0 and ten in class 1, for each permutation ten samples are
    randomly selected from class 0 to balance the ten samples in class 1, and then
    the twenty labels are randomly assigned to the twenty samples). Balancing
    samples is important if samples are very unevenly distributed across classes.
random seed
    The seed for the random number generator
smooth p values
    Whether to smooth p-values by using Laplace's Rule of Succession. By
    default, smooth p-values are set to "yes", which means p-values are always
    <1.0 and >0.0
phenotype test
    Tests to perform when the class file (CLS file format) has more than two
    classes: "one versus all" or "all pairs". The p-values obtained from the
    one-versus-all comparison are not fully corrected for multiple hypothesis
    testing.
output filename
    Output filename
5. Click Run to start the analysis.
GenePattern displays a status page. When the analysis completes, the status page
lists the analysis result files: the .odf file (all_aml_train.preprocessed.comp.marker.odf
in this example) is a structured text file that contains the analysis
results; the gp_task_execution_log.txt file lists the parameters used for the
analysis.
6. Click the Return to Modules & Pipelines Start link at the bottom of the status page
to return to the GenePattern start page.
The Recent Jobs list includes the ComparativeMarkerSelection module and its result files.
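For readers who want to see what the analysis computes for a single gene, the sketch below implements the two test statistics from Table 7.12.3 and a simple two-sided permutation p-value with optional smoothing. It is a simplified illustration under stated assumptions, not the ComparativeMarkerSelection code: it uses sample (n - 1) variance estimates, permutes labels without balancing or confounding variables, and applies Laplace's Rule of Succession as (k + 1)/(n + 2). All function names and the toy expression values are hypothetical.

```python
import math
import random
import statistics


def t_score(a, b):
    """Standardized mean difference between two classes (t-test statistic)."""
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b))


def snr_score(a, b):
    """Signal-to-noise ratio: mean difference over the sum of standard deviations."""
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b))


def permutation_p_value(values, labels, score, n_permutations=1000,
                        seed=0, smooth=True):
    """Two-sided p-value for one gene, estimated by shuffling class labels."""
    def split(current_labels):
        a = [v for v, lab in zip(values, current_labels) if lab == 0]
        b = [v for v, lab in zip(values, current_labels) if lab == 1]
        return a, b

    observed = abs(score(*split(labels)))
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(score(*split(shuffled))) >= observed:
            hits += 1
    if smooth:
        # Laplace's Rule of Succession keeps the estimate strictly inside (0, 1).
        return (hits + 1) / (n_permutations + 2)
    return hits / n_permutations


# One hypothetical gene measured in eight class-0 and eight class-1 samples.
values = [10, 11, 12, 13, 14, 15, 16, 17, 1, 2, 3, 4, 5, 6, 7, 8]
labels = [0] * 8 + [1] * 8
p_value = permutation_p_value(values, labels, t_score)
```

Passing snr_score instead of t_score switches the statistic without changing the permutation logic. A positive score means the gene is more highly expressed in class 0, the first class.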
View analysis results using the ComparativeMarkerSelectionViewer
The analysis result file from ComparativeMarkerSelection includes the test statistic
score, p-value, FDR, and FWER statistics for each gene. The ComparativeMarkerSelectionViewer module accepts this output file and displays the results in an interactive,
graphical viewer to simplify review and interpretation of the data.
7. Start the ComparativeMarkerSelectionViewer by clicking the icon next
to the ComparativeMarkerSelection analysis result file (in this example,
all_aml_train.preprocessed.comp.marker.odf); from the menu that
appears, select ComparativeMarkerSelectionViewer.
GenePattern displays the parameters for the ComparativeMarkerSelectionViewer module.
Because the module was selected from the file menu, GenePattern automatically uses the
analysis result file as the value for the first input file parameter.
8. For the “dataset filename” parameter, select the gene expression data file used for
the ComparativeMarkerSelection analysis.
For this example, select all_aml_train.preprocessed.gct. In the Recent Jobs
list, locate the PreprocessDataset module and its analysis result files; click the icon next
to the all_aml_train.preprocessed.gct result file, and, from the menu that
appears, select the Send to dataset filename command.
Figure 7.12.6 ComparativeMarkerSelection Viewer.
9. Click the Help link at the top of the form to display documentation for the ComparativeMarkerSelectionViewer.
10. Click Run to start the viewer.
GenePattern displays the ComparativeMarkerSelectionViewer (Fig. 7.12.6).
In the upper pane of the visualizer, the Upregulated Features graph plots the genes in the
dataset according to score—the value of the test statistic used to calculate differential
expression. Genes with a positive score are more highly expressed in the first class. Genes
with a negative score are more highly expressed in the second class. Genes with a score
close to zero are not significantly differentially expressed.
In the lower pane, a table lists the ComparativeMarkerSelection analysis results for each
gene including the name, description, test statistic score, p-value, and the FDR and FWER
statistics. The FDR controls the fraction of false positives that one can tolerate, while
the more conservative FWER controls the probability of having any false positives. As
discussed in Gould et al. (2006), the ComparativeMarkerSelection module computes the
FWER using three methods: the Bonferroni correction (the most conservative method),
the maxT method of Westfall and Young (1993), and the empirical FWER. It computes
the FDR using two methods: the BH procedure developed by Benjamini and Hochberg
(1995) and the less conservative q-value method of Storey and Tibshirani (2003).
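To make the difference between FWER and FDR control concrete, the following minimal sketch computes Bonferroni-adjusted and Benjamini-Hochberg-adjusted p-values in plain Python. The function names and the four p-values are hypothetical, and the sketch is not the module's implementation; the maxT, empirical FWER, and q-value methods are not shown.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """BH-adjusted p-values (step-up procedure), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]   # hypothetical p-values for four genes
fwer = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
markers = [i for i, q in enumerate(fdr) if q <= 0.05]
```

With these toy values, only two genes pass a 0.05 cutoff on the Bonferroni-adjusted values, whereas all four pass on the BH-adjusted values, which reflects the more conservative nature of FWER control.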
Apply a filter to view the differentially expressed genes
Due to the noisy nature of microarray data and the large number of hypotheses tested, the
FWER often fails to identify any genes as significantly differentially expressed; therefore,
researchers generally identify marker genes based on the false discovery rate (FDR). For
this example, marker genes are identified based on an FDR cutoff value of 0.05. At an FDR
cutoff of 0.05, about 1 in 20 (5%) of the genes identified as marker genes are expected to
be false positives.
In the ComparativeMarkerSelectionViewer, apply a filter with the criterion FDR <= 0.05
to view the marker genes. To further analyze those genes, create a new derived dataset
that contains only the marker genes.
11. Select Edit>Filter Features>Custom Filter.
The Filter Features dialog window appears.
Specify a filter criterion by selecting a column from the drop-down list and entering the
allowed values for that column. To add a second filter criterion, click Add Filter. After
entering all of the criteria, click OK to apply the filter.
12. Enter the filter criterion FDR(BH) >= 0 <= 0.05 and click OK to apply the
filter.
This example identifies marker genes based on the FDR values computed using the more
conservative BH procedure developed by Benjamini and Hochberg (1995). When the filter
is applied, the ComparativeMarkerSelectionViewer updates the display to show only those
genes that have an FDR(BH) value ≤0.05. Notice that the Upregulated Features graph
now shows only genes identified as marker genes.
13. Review the filtered results.
In the ALL/AML leukemia dataset, >500 genes are identified as marker genes based on
the FDR cutoff value of 0.05. Depending on the question being addressed, it might be
helpful to explore only a subset of those genes. For example, one way to select a subset
would be to choose the most highly differentially expressed genes, as discussed below.
Create a derived dataset of the top 100 genes
By default, the ComparativeMarkerSelectionViewer sorts genes by differential expression based on the value of their test statistic scores. Genes in the first rows have the
highest scores and are more highly expressed in the first class, ALL; genes in the last
rows have the lowest scores and are more highly expressed in the second class, AML.
To create a derived dataset of the top 100 genes, select the first 50 genes (rows 1 through
50) and the last 50 genes (rows 536 through 585).
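The row selection described above amounts to taking the two ends of a list that is already sorted by score. A minimal sketch, assuming a hypothetical list of 585 gene identifiers in viewer order (the function name and identifiers are illustrative only):

```python
def top_and_bottom(ranked_genes, n=50):
    """Return the first n and last n entries of a list sorted by descending score."""
    if len(ranked_genes) < 2 * n:
        raise ValueError("need at least 2 * n genes to avoid overlap")
    return ranked_genes[:n] + ranked_genes[-n:]


# Hypothetical stand-in for the 585 filtered rows, already sorted by score.
ranked = [f"gene_{row}" for row in range(1, 586)]
selected = top_and_bottom(ranked, n=50)
```

Here the selection covers rows 1 through 50 and rows 536 through 585, matching the row numbers used in the steps that follow.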
14. Select the top 50 genes: Shift-click a value in row 1 and Shift-click a value in row 50.
15. Select the bottom 50 genes: Ctrl-click a value in row 585 and Ctrl-Shift-click a value
in row 536.
On the Macintosh, use the Command (cloverleaf) key instead of Ctrl.
16. Select File>Save Derived Dataset.
The Save Derived Dataset window appears.
17. Select the Use Selected Features radio button.
Selecting Use Selected Features creates a dataset that contains only the selected genes.
Selecting the Use Current Features radio button would create a dataset that contains the
genes that meet the filter criteria. Selecting Use All Features would create a dataset that
contains all of the genes in the dataset; essentially a copy of the existing dataset.
18. Click the Browse button to select a directory and specify the name of the file to hold
the new dataset.
A Save dialog window appears. Navigate to the directory that will hold the new expression
dataset file, enter a name for the file, and click Save. The Save dialog window closes and
the name for the new dataset appears in the Save Derived Dataset window.
For this example, use the file name all_aml_train_top100.gct. Note that the viewer
uses the file extension of the specified file name to determine the format of the new file.
Thus, to create a GCT file, the file name must include the .gct file extension.
19. Click Create to create the dataset file and close the Save Derived Dataset window.
20. Select File>Exit to close the ComparativeMarkerSelectionViewer.
21. In the GenePattern Web Client, click Modules & Pipelines to return to the
GenePattern start page.
View the new dataset in the HeatMapViewer
Use the HeatMapViewer (Fig. 7.12.7) to create a heat map of the differentially expressed
genes. The heat map displays the highest expression values as red cells, the lowest
expression values as blue cells, and intermediate values in shades of pink and blue.
22. Start the HeatMapViewer by selecting it from the Modules & Pipelines list on the
GenePattern start page (it is in the Visualizer category).
GenePattern displays the parameters for the HeatMapViewer.
23. For the “input filename” parameter, use the Browse button to select the gene expression dataset file created in steps 16 through 19.
24. Click Run to open the HeatMapViewer.
In the HeatMapViewer, the columns are samples and the rows are genes. Each cell
represents the expression level of a gene in a sample. Visual inspection of the heat map
(Fig. 7.12.7) shows how well these top-ranked genes differentiate between the classes.
Figure 7.12.7 Heat map for the top 100 differentially expressed genes.
To save the heat map image for use in a publication, select File>Save Image. The
HeatMapViewer supports several image formats, including bmp, eps, jpeg, png, and tiff.
25. Select File>Exit to close the HeatMapViewer.
26. Click the Return to Modules & Pipelines start link at the bottom of the status page
to return to the GenePattern start page.
CLASS DISCOVERY: CLUSTERING METHODS
One of the challenges in analyzing microarray expression data is the sheer volume of
information: the expression levels of tens of thousands of genes for tens or hundreds
of samples. Class discovery aims to produce a high-level overview of data by creating
groups based on shared patterns. Clustering, one method of class discovery, reduces the
complexity of microarray data by grouping genes or samples based on their expression
profiles (Slonim, 2002). GenePattern provides several clustering methods (described in
Table 7.12.4).
BASIC
PROTOCOL 5
In this protocol, the HierarchicalClustering module is first used to cluster the
samples and genes in the ALL/AML training dataset. Then the HierarchicalClusteringViewer module is used to examine the results and identify two large clusters (groups) of samples, which correspond to the ALL and AML phenotypes.
Table 7.12.4 Clustering Methods
Module | Description
HierarchicalClustering | Hierarchical clustering recursively merges items with other items or with the result of previous merges. Items are merged according to their pair-wise distance, with the closest pairs being merged first. The result is a tree structure, referred to as a dendrogram. To view clustering results, use the HierarchicalClusteringViewer.
KMeansClustering | K-means clustering (MacQueen, 1967) groups elements into a specified number (k) of clusters. A center data point for each cluster is randomly selected and each data point is assigned to the nearest cluster center. Each cluster center is then recalculated to be the mean value of its members and all data points are re-assigned to the cluster with the closest cluster center. This process is repeated until the distance between consecutive cluster centers converges. The result is k stable clusters. Each cluster is a subset of the original gene expression data (GCT file format) and can be viewed using the HeatMapViewer.
SOMClustering | Self-organizing maps (SOM; Tamayo et al., 1999) create and iteratively adjust a two-dimensional grid to reflect the global structure in the expression dataset. The result is a set of clusters organized in a two-dimensional grid where similar clusters lie near each other and provide an “executive summary” of the dataset. To view clustering results, use the SOMClusterViewer.
NMFConsensus | Non-negative matrix factorization (NMF; Brunet et al., 2004) is an alternative method for class discovery that factors the expression data matrix. NMF extracts features that may more accurately correspond to biological processes.
ConsensusClustering | Consensus clustering (Monti et al., 2003) is a means of determining an optimal number of clusters. It runs a selected clustering algorithm and assesses the stability of discovered clusters. The resulting consensus matrix is formatted as a GCT file (with the content being the matrix rather than gene expression data) and can be viewed using the HeatMapViewer.
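The k-means steps summarized in Table 7.12.4 (random initial centers, assignment to the nearest center, recalculation of each center as the mean of its members, repetition until the centers stop moving) can be sketched as follows. This is a minimal illustration under stated assumptions, not the KMeansClustering module: the function name, the Euclidean distance, the seed and tolerance parameters, and the toy points are all hypothetical.

```python
import random
from math import dist


def kmeans(points, k, seed=0, max_iter=100, tol=1e-9):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep the center of an empty cluster
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                          # centers stopped moving
            break
    return centers, clusters


# Two well-separated toy groups (hypothetical two-dimensional profiles).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
```

For well-separated groups such as these, the two clusters recover the two groups; the order in which the clusters are returned depends on the random initial centers.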
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or
browser to access GenePattern on line (the Support Protocol describes how to
start GenePattern)
Modules used in this protocol: HierarchicalClustering (version 3) and
HierarchicalClusteringViewer (version 8)
Files
The HierarchicalClustering module requires gene expression data in a
tab-delimited text file (GCT file format, Fig. 7.12.1) that contains a column for
each sample and a row for each gene. Basic Protocol 1 describes how to convert
various gene expression data into this file format.
As an example, this protocol uses the ALL/AML leukemia training dataset (Golub
et al., 1999) consisting of 38 bone marrow samples (27 ALL, 11 AML).
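For readers who want to inspect or generate such files outside GenePattern, the sketch below writes and reads a small tab-delimited GCT file. It assumes the GCT version 1.2 layout (a "#1.2" version line, a line with the row and column counts, a header line beginning with Name and Description, then one row per gene); Fig. 7.12.1 remains the reference for the format. The function names, gene names, and sample names are hypothetical.

```python
import io


def write_gct(handle, samples, rows):
    """Write rows of (name, description, values) in the assumed GCT 1.2 layout."""
    handle.write("#1.2\n")
    handle.write(f"{len(rows)}\t{len(samples)}\n")
    handle.write("\t".join(["Name", "Description", *samples]) + "\n")
    for name, description, values in rows:
        handle.write("\t".join([name, description, *map(str, values)]) + "\n")


def read_gct(handle):
    """Parse GCT text into (samples, rows)."""
    lines = handle.read().splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("not a GCT 1.2 file")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    samples = lines[2].split("\t")[2:]
    rows = []
    for line in lines[3:3 + n_rows]:
        name, description, *values = line.split("\t")
        rows.append((name, description, [float(v) for v in values]))
    if len(samples) != n_cols or any(len(r[2]) != n_cols for r in rows):
        raise ValueError("column count does not match the header")
    return samples, rows


# Round trip through an in-memory buffer with hypothetical gene and sample names.
buffer = io.StringIO()
write_gct(buffer, ["ALL_1", "AML_1"], [("geneA", "example gene", [1.5, 2.0]),
                                        ("geneB", "another gene", [3.0, 0.5])])
buffer.seek(0)
samples, rows = read_gct(buffer)
```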
Table 7.12.5 Parameters for the HierarchicalClustering Analysis
Parameter | Setting | Description
input filename | all_aml_train.preprocessed.gct | Gene expression data (GCT or RES file format)
column distance measure | Pearson Correlation (the default) | Method for computing the distance (similarity measure) between values when clustering samples. Pearson Correlation, the default, determines similarity/dissimilarity between the shape of genes’ expression profiles. For discussion of the different distance measures, see Wit and McClure (2004).
row distance measure | Pearson Correlation (the default) | Method for computing the distance (similarity measure) between values when clustering genes.
clustering method | Pairwise-complete linkage (the default) | Method for measuring the distance between clusters. Pairwise-complete linkage, the default, measures the distance between clusters as the maximum of all pairwise distances. For a discussion of the different clustering methods, see Wit and McClure (2004).
log transform | No (the default) | Transforms each expression value by taking the log base 2 of its value. If the dataset contains absolute intensity values, using the log transform helps to ensure that differences between expressions (fold change) have the same meaning across the full range of expression values (Wit and McClure, 2004).
row center | Subtract the mean of each row | Method for centering row data. When clustering genes, Getz et al. (2006) recommend centering the data by subtracting the mean of each row.
row normalize | Yes | Whether to normalize row data. When clustering genes, Getz et al. (2006) recommend normalizing the row data.
column center | Subtract the mean of each column | Method for centering column data. When clustering samples, Getz et al. (2006) recommend centering the data by subtracting the mean of each column.
column normalize | Yes | Whether to normalize column data. When clustering samples, Getz et al. (2006) recommend normalizing the column data.
output base name | <input.filename_basename> (the default) | Output file name
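The interplay of the distance measure and clustering method parameters in Table 7.12.5 can be illustrated with a small sketch: a Pearson-correlation distance (taken here as 1 minus the correlation, which is an assumption about the exact form) combined with complete linkage, where the distance between two clusters is the maximum of all pairwise distances. The function names and toy profiles are hypothetical, profiles are assumed to be non-constant, and the brute-force search is meant for illustration rather than for datasets of realistic size.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def complete_linkage(profiles):
    """Agglomerative clustering; returns the merge history as nested tuples."""
    clusters = {i: [i] for i in range(len(profiles))}   # cluster id -> member indices
    trees = {i: i for i in range(len(profiles))}        # cluster id -> dendrogram node
    next_id = len(profiles)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for a_pos, a in enumerate(ids):
            for b in ids[a_pos + 1:]:
                # Complete linkage: cluster distance is the maximum pairwise distance.
                d = max(pearson_distance(profiles[i], profiles[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return trees[next_id - 1]


# Two rising and two falling toy profiles (hypothetical values).
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 1]]
tree = complete_linkage(profiles)
```

The nested tuples play the role of a dendrogram: the two rising profiles merge first, then the two falling profiles, and the two groups join last.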
Download the data file (all_aml_train.gct) from the GenePattern Web site
at http://genepattern.org/datasets/. This protocol assumes the expression data
file, all_aml_train.gct, has been preprocessed according to Basic Protocol
3. The preprocessed expression data file, all_aml_train.preprocessed.gct, is used in this protocol.
Run the HierarchicalClustering analysis
1. Start HierarchicalClustering by looking in the Recent Jobs list and locating the
PreprocessDataset module and its all_aml_train.preprocessed.gct result
file; click the icon next to the result file; and from the menu that appears, select
HierarchicalClustering.
GenePattern displays the parameters for the HierarchicalClustering analysis. Because
the module was selected from the file menu, GenePattern automatically uses the analysis
result file as the value for the “input filename” parameter. For information about the
module and its parameters, click the Help link at the top of the form.
Note that a module can be started from the Modules & Pipelines list, as shown in the
previous protocol, or from the Recent Jobs list, as shown in this protocol.
2. Use the remaining parameters to define the desired clustering analysis (see
Table 7.12.5).
Clustering genes groups genes with similar expression patterns, which may indicate coregulation or membership in a biological process. Clustering samples groups samples with
similar gene expression patterns, which may indicate a similar biological or phenotype
subtype among the clustered samples. Clustering both genes and samples may be useful
for identifying genes that are coexpressed in a phenotypic context or alternative sample
classifications.
For this example, use the parameter settings shown in Table 7.12.5 to cluster both genes
(rows) and samples (columns). Figure 7.12.8 shows the HierarchicalClustering parameters set to these values.
Figure 7.12.8 HierarchicalClustering parameters. Table 7.12.5 describes the HierarchicalClustering parameters.
3. Click Run to start the analysis.
GenePattern displays a status page. When the analysis is complete (3 to 4 min), the status
page lists the analysis result files: the Clustered Data Table (.cdt) file contains the
original data ordered to reflect the clustering, the Array Tree Rows (.atr) file contains the
dendrogram for the clustered columns (samples), the Gene Tree Rows (.gtr) file contains
the dendrogram for the clustered rows (genes), and the gp_task_execution_log.txt
file lists the parameters used for the analysis.
4. Click the Return to Modules & Pipelines Start link at the bottom of the status page
to return to the GenePattern start page.
The Recent Jobs list includes the HierarchicalClustering module and its result files.
View analysis results using the HierarchicalClusteringViewer
The HierarchicalClusteringViewer provides an interactive, graphical viewer for displaying the analysis results. For a graphical summary of the results, save the content of the
viewer to an image file.
Figure 7.12.9 HierarchicalClusteringViewer.
5. Start the HierarchicalClusteringViewer by looking in the Recent Jobs list and
clicking the icon next to a HierarchicalClustering result file
(all_aml_train.preprocessed.atr, .cdt, or .gtr); from the menu that appears, select
HierarchicalClusteringViewer.
GenePattern displays the parameters for the HierarchicalClusteringViewer. Because the
module was selected from the file menu, GenePattern automatically uses the analysis
result files as the values for the input file parameters.
6. Click Run to start the viewer.
GenePattern displays the HierarchicalClusteringViewer (Fig. 7.12.9). Visual inspection
of the dendrogram shows the hierarchical clustering of the AML and ALL samples.
7. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start
link at the bottom of the status page to return to the GenePattern start page.
CLASS PREDICTION: CLASSIFICATION METHODS
This protocol focuses on the class prediction analysis of a microarray experiment, where
the aim is to build a class predictor—a subset of key marker genes whose transcription
profiles will correctly classify samples. A typical class prediction method “learns” how to
distinguish between members of different classes by “training” itself on samples whose
classes are already known. Using known data, the method creates a model (also known as
a classifier or class predictor), which can then be used to predict the class of a previously
unknown sample. GenePattern provides several class prediction methods (described in
Table 7.12.6).
BASIC
PROTOCOL 6
For most class prediction methods, GenePattern provides two approaches for training
and testing class predictors: train/test and cross-validation. Both approaches begin with
an expression dataset that has known classes. In the train/test approach, the predictor
is first trained on one dataset (the training set) and then tested on another independent
dataset (the test set). Cross-validation is often used for setting the parameters of a model
predictor or to evaluate a predictor when there is no independent test set. It repeatedly
leaves one sample out, builds the predictor using the remaining samples, and then tests
it on the sample left out. In the cross-validation approach, the accuracy of the predictor
is determined by averaging the results over all iterations. GenePattern provides pairs of
modules for most class prediction methods: one for train/test and one for cross-validation.
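The leave-one-out form of cross-validation described above can be sketched independently of any particular prediction method. In the minimal example below, the function names, the toy nearest-centroid predictor, and the single-value samples are all assumptions made for illustration; any pair of build and predict functions could be substituted.

```python
def leave_one_out_accuracy(samples, labels, build, predict):
    """Fraction of held-out samples whose class is predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        # Hold out sample i and train on everything else.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = build(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def build_centroids(train_x, train_y):
    """Toy predictor (an assumption for this sketch): per-class mean of one value."""
    groups = {}
    for x, y in zip(train_x, train_y):
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}


def predict_centroid(model, x):
    """Assign the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


# Hypothetical one-gene expression values with known classes.
values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 2.9]
classes = ["ALL", "ALL", "ALL", "AML", "AML", "AML", "AML"]
accuracy = leave_one_out_accuracy(values, classes, build_centroids, predict_centroid)
```

In this toy case six of the seven held-out samples are classified correctly; the ambiguous sample with value 2.9 is the one that is missed.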
This protocol applies the k-nearest neighbors (KNN) class prediction method to the
ALL/AML data. First introduced by Fix and Hodges in 1951, KNN is one of the simplest
classification methods and is often recommended for a classification study when there
is little or no prior knowledge about the distribution of the data (Cover and Hart, 1967).
The KNN method stores the training instances and uses a distance function to determine
which k members of the training set are closest to an unknown test instance. Once the
k-nearest training instances have been found, their class assignments are used to predict
the class for the test instance by a majority vote.
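A minimal sketch of the KNN prediction step just described, assuming Euclidean distance and an unweighted majority vote (both are assumptions here; the GenePattern modules expose distance and weighting as parameters). The function name and toy samples are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Predict the class of query by unweighted majority vote of its k nearest neighbors."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles with known classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_y = ["ALL", "ALL", "ALL", "AML", "AML"]
near_all = knn_predict(train_x, train_y, (1.1, 1.0), k=3)
near_aml = knn_predict(train_x, train_y, (4.5, 4.5), k=3)
```

Because the training instances are simply stored and searched at prediction time, there is no separate model-fitting step in this sketch.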
GenePattern provides a pair of modules for the KNN class prediction method: one for the
train/test approach and one for the cross-validation approach. Both modules use the same
input parameters (Table 7.12.7). This protocol first uses the cross-validation approach
(KNNXValidation module) and a training dataset to determine the best parameter settings
for the KNN prediction method. It then uses the train/test KNN module with the best
parameters identified by the KNNXValidation module to build a classifier on the training
dataset and to test that classifier on a test dataset.
Table 7.12.6 Class Prediction Methods
Prediction method | Algorithm
CART | CART (Breiman et al., 1984) builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). It works by recursively splitting the feature space into a set of non-overlapping regions and then predicting the most likely value of the dependent variable within each region. A classification tree represents a set of nested logical if-then conditions on the values of the feature variables that allows for the prediction of the value of the dependent categorical variable based on the observed values of the feature variables. A regression tree is similar but allows for the prediction of the value of a continuous dependent variable instead.
KNN | k-nearest-neighbors (KNN) classifies an unknown sample by assigning it the phenotype label most frequently represented among the k nearest known samples (Cover and Hart, 1967). In GenePattern, the user selects a weighting factor for the “votes” of the nearest neighbors (unweighted: all votes are equal; weighted by the reciprocal of the rank of the neighbor’s distance: the closest neighbor is given weight 1/1, next closest neighbor is given weight 1/2, and so on; or weighted by the reciprocal of the distance).
PNN | Probabilistic Neural Network (PNN) calculates the probability that an unknown sample belongs to a given set of known phenotype classes (Specht, 1990; Lu et al., 2005). The contribution of each known sample to the phenotype class of the unknown sample follows a Gaussian distribution. PNN can be viewed as a Gaussian-weighted KNN classifier: known samples close to the unknown sample have a greater influence on the predicted class of the unknown sample.
SVM | Support Vector Machines (SVM) is designed for multiple class classification (Vapnik, 1998). The algorithm creates a binary SVM classifier for each class by computing a maximal margin hyperplane that separates the given class from all other classes; that is, the hyperplane with maximal distance to the nearest data point. The binary classifiers are then combined into a multiclass classifier. For an unknown sample, the assigned class is the one with the largest margin.
Weighted Voting | Weighted Voting (Slonim et al., 2000) classifies an unknown sample using a simple weighted voting scheme. Each gene in the classifier “votes” for the phenotype class of the unknown sample. A gene’s vote is weighted by how closely its expression correlates with the differentiation between phenotype classes in the training dataset.
Basic Protocol 3 describes how to preprocess the training dataset to remove platform
noise and genes that have little variation. Preprocessing the test dataset may result in a
test dataset that contains a different set of genes than the training dataset. Therefore, do
not preprocess the test dataset.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or
browser to access GenePattern on line (the Support Protocol describes how to
start GenePattern)
Table 7.12.7 Parameters for k-Nearest Neighbors Prediction Modules
Parameter | Description
num features | Number of features (genes or probes) to use in the classifier. For KNN, choose the number of features or use the Feature List Filename parameter to specify which features to use. For KNNXValidation, the algorithm chooses the feature list for each leave-one-out cycle.
feature selection statistic | Statistic to use for computing differential expression. The genes most differentially expressed between the classes will be used in the classifier to predict the phenotype of unknown samples. For a description of the statistics, see the test statistic parameter in Table 7.12.3.
min std | When the selected feature selection statistic computes differential expression using a minimum standard deviation, specify that minimum standard deviation.
num neighbors | Number (k) of nearest neighbors to consult when predicting the class of a sample.
weighting type | Weight to give the “votes” of the k neighbors. None: gives each vote the same weight. One-over-k: weighs each vote by the reciprocal of the rank of the neighbor’s distance; that is, the closest neighbor is given weight 1/1, the next closest neighbor is given weight 1/2, and so on. Distance: weighs each vote by the reciprocal of the neighbor’s distance.
distance measure | Method for computing the distance (dissimilarity measure) between neighbors (Wit and McClure, 2004).
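The three weighting types in Table 7.12.7 can change the outcome of a vote. The sketch below applies each of them to the same set of neighbors; the function name, the lowercase option strings, and the toy distances are hypothetical, and nonzero distances are assumed for the distance weighting.

```python
from collections import defaultdict


def weighted_vote(neighbors, weighting="none"):
    """Combine (distance, label) pairs for the k nearest neighbors into one prediction."""
    totals = defaultdict(float)
    for rank, (distance, label) in enumerate(sorted(neighbors), start=1):
        if weighting == "none":
            weight = 1.0
        elif weighting == "one-over-k":
            weight = 1.0 / rank              # closest neighbor gets 1/1, next 1/2, ...
        elif weighting == "distance":
            weight = 1.0 / distance          # assumes distance > 0
        else:
            raise ValueError(f"unknown weighting: {weighting}")
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical neighbors: one very close ALL sample and two more distant AML samples.
neighbors = [(0.1, "ALL"), (2.0, "AML"), (2.5, "AML")]
unweighted = weighted_vote(neighbors, "none")
by_rank = weighted_vote(neighbors, "one-over-k")
by_distance = weighted_vote(neighbors, "distance")
```

With equal votes the two AML neighbors win; with either rank or distance weighting the single close ALL neighbor outweighs them.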
Modules used in this protocol: KNNXValidation (version 5),
PredictionResultsViewer (version 4), FeatureSummaryViewer (version 3), and
KNN (version 3)
Files
Class prediction requires two files as input: one for gene expression data and
another that specifies the class of each sample. The classes usually represent
phenotypes, such as tumor or normal. The expression data file is a tab-delimited
text file (GCT file format, Fig. 7.12.1) that contains a column for each sample
and a row for each gene. Classes are defined in another tab-delimited text file
(CLS file format, Fig. 7.12.2). Basic Protocols 1 and 2 describe how to convert
various gene expression data into these file formats.
As an example, this protocol uses two ALL/AML leukemia datasets (Golub et al.,
1999): a training set consisting of 38 bone marrow samples
(all_aml_train.gct, all_aml_train.cls) and a test set consisting of 35
bone marrow and peripheral blood samples (all_aml_test.gct,
all_aml_test.cls). Download the data files from the GenePattern Web site
at http://genepattern.org/datasets/. This protocol assumes the training set
all_aml_train.gct has been preprocessed according to Basic Protocol 3.
The preprocessed expression data file, all_aml_train.preprocessed.gct, is used in this protocol.
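As a rough sketch of how a class file pairs with the expression data, the example below parses CLS text into one class name per sample. It assumes the categorical CLS layout: a first line with the number of samples, the number of classes, and the value 1; a second line that starts with # and names the classes; and a third line with one label per sample, where labels map to class names in order of first appearance. Fig. 7.12.2 remains the reference for the format; the function name and the five-sample example are hypothetical.

```python
def parse_cls(text):
    """Parse categorical CLS text into one class name per sample."""
    header, names_line, labels_line = [ln.strip() for ln in text.strip().splitlines()[:3]]
    n_samples, n_classes = (int(x) for x in header.split()[:2])
    if not names_line.startswith("#"):
        raise ValueError("second line must start with '#'")
    class_names = names_line[1:].split()
    labels = labels_line.split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts in the header do not match the file content")
    # Labels map to class names in order of first appearance (an assumption).
    seen = {}
    for label in labels:
        if label not in seen:
            seen[label] = class_names[len(seen)]
    return [seen[label] for label in labels]


# Hypothetical five-sample class file.
example = "5 2 1\n# ALL AML\n0 0 0 1 1\n"
classes = parse_cls(example)
```

The resulting list has the same order as the sample columns of the matching GCT file, so the two can be zipped together when building a training set.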
Run the KNNXValidation analysis
The KNNXValidation module builds and tests multiple classifiers, one for each iteration
of the leave-one-out, train, and test cycle. The module generates two result files. The
feature result file (*.feat.odf) lists all genes used in any classifier and the number of
times that gene was used in a classifier. The prediction result file (*.pred.odf) averages
the accuracy of and error rates for all classifiers. Use the FeatureSummaryViewer module
to display the feature result file and the PredictionResultsViewer to display the prediction
result file.
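The content of the feature result file can be thought of as a tally across the leave-one-out cycles. A minimal sketch of that tally, using hypothetical gene names and feature lists (the function name is also illustrative, and the actual file format is not reproduced here):

```python
from collections import Counter


def feature_usage(feature_lists):
    """Count how many classifiers used each feature across the leave-one-out cycles."""
    counts = Counter()
    for features in feature_lists:
        counts.update(set(features))     # count a feature once per classifier
    return counts


# Hypothetical feature lists chosen in three leave-one-out iterations.
per_iteration = [["geneA", "geneB", "geneC"],
                 ["geneA", "geneC", "geneD"],
                 ["geneA", "geneB", "geneD"]]
usage = feature_usage(per_iteration)
in_every_classifier = sorted(g for g, n in usage.items() if n == len(per_iteration))
```

Genes that appear in every classifier, such as geneA in this toy case, are the ones that were selected regardless of which sample was left out.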
Figure 7.12.10 KNNXValidation parameters. Table 7.12.7 describes the parameters for the
k-nearest neighbors (KNN) class prediction method.
1. Start KNNXValidation by selecting it from the Modules & Pipelines list on the
GenePattern start page (it is in the Prediction category).
GenePattern displays the parameters for the KNNXValidation analysis (Fig. 7.12.10). For
information about the module and its parameters, click the Help link at the top of the
form.
2. For the “data filename” parameter, select gene expression data in the GCT file format.
For example, select the preprocessed data file, all_aml_train.preprocessed.gct:
in the Recent Jobs list, locate the PreprocessDataset module and its
all_aml_train.preprocessed.gct result file; click the icon next to the result file; and from
the menu that appears, select the Send to data filename command.
3. For the “class filename” parameter, select the class data (CLS file format) file.
For this example, use the Browse button to select the all_aml_train.cls file.
4. Review the remaining parameters to determine which values, if any, should be
modified (see Table 7.12.7).
For this example, use the default values.
5. Click Run to start the analysis.
GenePattern displays a status page. When the analysis is complete, the status page lists
the analysis result files: the feature result file (*.feat.odf) lists the genes used in the
classifiers and the prediction result file (*.pred.odf) averages the accuracy of and
error rates for all of the classifiers. Both result files are structured text files.
View KNNXValidation analysis results
GenePattern provides interactive, graphical viewers to simplify review and interpretation of the
result files. To view the prediction results (*.pred.odf file), use the PredictionResultsViewer. To view the feature result file (*.feat.odf file), use the FeatureSummaryViewer.
6. Start the PredictionResultsViewer by looking in the Recent Jobs list, then clicking
the icon next to the prediction result file, all_aml_train.preprocessed.pred.odf;
from the menu that appears, select PredictionResultsViewer.
GenePattern displays the parameters for the PredictionResultsViewer. Because the module
was selected from the file menu, GenePattern automatically uses the analysis result file
as the value for the input file parameter.
7. Click Run to start the viewer.
GenePattern displays the PredictionResultsViewer (Fig. 7.12.11). In this example, all
samples in the dataset were correctly classified.
8. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start
link at the bottom of the status page to return to the GenePattern start page.
Figure 7.12.11 PredictionResults Viewer. Each point represents a sample, with color indicating
the predicted class. Absolute confidence value indicates the probability that the sample belongs
to the predicted class.
9. Start the FeatureSummaryViewer by looking in the Recent Jobs list, and then clicking the icon next to the feature result file, all_aml_train.preprocessed.feat.odf;
from the menu that appears, select FeatureSummaryViewer.
GenePattern displays the parameters for the FeatureSummaryViewer. Because the module
was selected from the file menu, GenePattern automatically uses the analysis result file
as the value for the input file parameter.
10. Click Run to start the viewer.
GenePattern displays the FeatureSummaryViewer (Fig. 7.12.12). The viewer lists each
gene used in any classifier created by any iteration and shows how many of the classifiers
included this gene. Generally, the most interesting genes are those used by all (or most)
of the classifiers.
11. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start
link at the bottom of the status page to return to the GenePattern start page.
Figure 7.12.12 FeatureSummaryViewer.
In this example, the default parameter values for the k-nearest neighbors (KNN) class
prediction method create class predictors that successfully predict the class of unknown
samples. However, in practice, the researcher runs the KNNXValidation module several
times with different parameter values (e.g., using the “num features” parameter values
of 10, 20, and 30) to find the most effective parameter values for the KNN method.
Run the KNN analysis
After using the cross-validation approach (KNNXValidation module) to determine which
parameter settings provide the best results, use the KNN module with those parameters
to build a model using the training dataset and test it using an independent test dataset.
The KNN module generates two result files: the model file (*.model.odf) describes
the predictor and the prediction result file (*.pred.odf) shows the accuracy of and
error rate for the predictor. Use a text editor to display the model file and the PredictionResultsViewer to display the prediction result file.
12. Start KNN by selecting it from the Modules & Pipelines list on the GenePattern start
page (it is in the Prediction category).
GenePattern displays the parameters for the KNN analysis (Fig. 7.12.13). For information
about the module and its parameters, click the Help link at the top of the form.
13. For the “train filename” and “test filename” parameters, select gene expression data
in the GCT file format.
For this example, select all_aml_train.preprocessed.gct as the input file for
the “train filename” parameter. In the Recent Jobs list, locate the PreprocessDataset
module and its all_aml_train.preprocessed.gct result file; click the icon next
to the result file; and from the menu that appears, select the Send to train filename
command.
Next, use the Browse button to select all_aml_test.gct as the input file for the “test
filename” parameter.
Figure 7.12.13 KNN parameters. Table 7.12.7 describes the parameters for the k-nearest neighbors (KNN) class prediction method.
Analyzing
Expression
Patterns
7.12.25
Current Protocols in Bioinformatics
Supplement 22
14. For the “train class filename” and “test class filename” parameters, select the class
data (CLS file format) for each expression data file.
For this example, use the Browse button to select all_aml_train.cls as the input file
for the “train class filename” parameter. Similarly, select all_aml_test.cls as the
input file for the “test class filename” parameter.
15. Review the remaining parameters to determine which values, if any, should be
modified (see Table 7.12.7).
For this example, use the default values.
16. Click Run to start the analysis.
GenePattern displays a status page. When the analysis is complete, the status page lists
the analysis result files: the model file (*.model.odf) contains the classifier (or model)
created from the training dataset and the prediction result file (*.pred.odf) shows the
accuracy of and error rate for the classifier when it was run against the test data. Both
result files are structured text files.
17. Click the Return to Modules & Pipelines Start link at the bottom of the status page
to return to the GenePattern start page.
The Recent Jobs list includes the KNN module and its result files.
View KNN analysis results
GenePattern provides interactive, graphical viewers to simplify review and interpretation
of the result files. To view the prediction results (*.pred.odf file), use the PredictionResultsViewer. To view the model file (*.model.odf), simply use a text editor.
18. Display the model file (all_aml_train.preprocessed.model.odf): in the
Recent Jobs list, click the model file.
GenePattern displays the model file in the browser. The classifier uses the genes in this
model to predict the class of unknown samples. Retrieving annotations for these genes
might provide insight into the underlying biology of the phenotype classes.
19. Click the Back button in the Web browser to return to the GenePattern start page.
20. Start the PredictionResultsViewer: in the Recent Jobs list, click the icon next
to the prediction result file, all_aml_test.pred.odf, and from the menu that
appears, select PredictionResultsViewer.
GenePattern displays the parameters for the PredictionResultsViewer. Because the module
was selected from the file menu, GenePattern automatically uses the analysis result file
as the value for the input file parameter.
21. Click Run to start the viewer.
GenePattern displays the PredictionResultsViewer (similar to the one shown in
Fig. 7.12.11). The classifier created by the KNN algorithm correctly predicts the class
of 32 of the 35 samples in the test dataset. The classifier created by the Weighted Voting
algorithm (Golub et al., 1999) correctly predicts the class of all samples in the test
dataset. The error rate (number of cases incorrectly classified divided by the total number
of cases) is useful for comparing results when experimenting with different prediction
methods.
22. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start
link at the bottom of the status page to return to the GenePattern start page.
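As a quick check of the comparison above, the error rate can be computed directly from the counts reported by the viewer. This is a minimal sketch; the function name is hypothetical, and the error rate is taken to be the fraction of misclassified test samples.

```python
def error_rate(n_correct, n_total):
    """Fraction of test samples that were misclassified."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return (n_total - n_correct) / n_total

# Counts from this example: 35 test samples.
knn_error = error_rate(32, 35)   # KNN: 3 of 35 misclassified
wv_error = error_rate(35, 35)    # Weighted Voting: none misclassified
print(round(knn_error, 3), wv_error)
```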
Using
GenePattern for
Gene Expression
Analysis
PIPELINES: REPRODUCIBLE ANALYSIS METHODS
Gene expression analysis is an iterative process. The researcher runs multiple analysis
methods to explore the underlying biology of the gene expression data. Often, there is
a need to repeat an analysis several times with different parameters to gain a deeper
understanding of the analysis and the results. Without careful attention to detail, analyses
and their results can be difficult to reproduce. Consequently, it becomes difficult to share
the analysis methodology and its results.
BASIC PROTOCOL 7
GenePattern records every analysis it runs, including the input files and parameter values
that were used and the output files that were generated. This ensures that analysis results
are always reproducible. GenePattern also makes it possible for the user to click on an
analysis result file to build a pipeline that contains the modules and parameter settings
used to generate the file. Running the pipeline reproduces the analysis result file. In
addition, one can easily modify the pipeline to run variations of the analysis protocol,
share the pipeline with colleagues, or use the pipeline to describe an analysis methodology
in a publication.
This protocol describes how to create a pipeline from an analysis result file, edit the
pipeline, and run it. As an example, a pipeline is created based on the class prediction
results from Basic Protocol 6.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or
browser to access GenePattern on line (the Support Protocol describes how to
start GenePattern)
Modules used in this protocol: PreprocessDataset (version 3), KNN (version 3),
and PredictionResultsViewer (version 4)
Files
Input files for a pipeline depend on the modules called; for example, the input file
for the PreprocessDataset module is a gene expression data file
Create a pipeline from a result file
Creating a pipeline from a result file captures the analysis strategy used to generate
the analysis results. To create the pipeline, GenePattern records the modules used to
generate the result file, including their input files and parameter values. Tracking the
chain of modules back to the initial input files, GenePattern builds a pipeline that records
the sequence of events used to generate the result file. For this example, create a pipeline
from the prediction result file, all_aml_test.pred.odf, generated by the KNN
module in Basic Protocol 6.
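The back-tracking idea can be illustrated with a short sketch. This is not GenePattern's implementation: the job-record structure (dictionaries with "id", "module", "inputs", and "outputs" keys) and the function name are assumptions made for illustration. Starting from a result file, the sketch follows each input file back to the job that produced it and emits the modules in the order they must run.

```python
def build_pipeline(result_file, jobs):
    """Return the ordered list of modules needed to regenerate result_file.

    jobs is a list of job records (hypothetical structure); a file that no
    job produced is treated as an original input file."""
    produced_by = {out: job for job in jobs for out in job["outputs"]}
    steps, seen = [], set()

    def visit(filename):
        job = produced_by.get(filename)
        if job is None or job["id"] in seen:
            return
        seen.add(job["id"])
        for inp in job["inputs"]:      # upstream jobs run first
            visit(inp)
        steps.append(job["module"])

    visit(result_file)
    return steps

# Job history corresponding to this example.
jobs = [
    {"id": 1, "module": "PreprocessDataset",
     "inputs": ["all_aml_train.gct"],
     "outputs": ["all_aml_train.preprocessed.gct"]},
    {"id": 2, "module": "KNN",
     "inputs": ["all_aml_train.preprocessed.gct", "all_aml_train.cls",
                "all_aml_test.gct", "all_aml_test.cls"],
     "outputs": ["all_aml_train.preprocessed.model.odf",
                 "all_aml_test.pred.odf"]},
]

pipeline = build_pipeline("all_aml_test.pred.odf", jobs)
print(pipeline)
```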
1. Create the pipeline: in the Recent Jobs list, locate the KNN module and
its all_aml_test.pred.odf result file, click the icon next to the
result file, and from the menu that appears, select Create Pipeline.
GenePattern creates the pipeline that reproduces the result file and displays it in
a form-based editor (Fig. 7.12.14). The pipeline includes the KNN analysis, its input files, and parameter settings. The input file for the “train filename” parameter,
all_aml_train.preprocessed.gct, is a result file from a previous PreprocessDataset analysis; therefore, the pipeline includes a PreprocessDataset analysis to generate the all_aml_train.preprocessed.gct file.
Figure 7.12.14 Create Pipeline for KNN classification analysis. The Pipeline Designer form
defines the steps that will replicate the KNN classification analysis. Click the arrow icon next to a
step to collapse or expand that step. When the form opens, all steps are expanded. This figure
shows the first step collapsed.
2. Scroll to the top of the form and edit the pipeline name.
Because the pipeline was created from an analysis result file, the default name of the
pipeline is the job number of that analysis. Change the pipeline name to make it easier to
find. For this example, change the pipeline name to KNNClassificationPipeline. (Pipeline
names cannot include spaces or special characters.)
Add the PredictionResultsViewer to the pipeline
The PredictionResultsViewer module displays the KNN prediction results. Use the
following steps to add this visualization module to the pipeline.
3. Scroll to the bottom of the form.
4. In the last step of the pipeline, click the Add Another Module button.
5. From the Category drop-down list, select Visualizer.
6. From the Modules list, select PredictionResultsViewer.
7. Rather than selecting a prediction result filename, use the prediction result file generated by the KNN analysis. Notice that GenePattern has selected this automatically:
next to Use Output From, GenePattern has selected 2. KNN and Prediction
Results.
8. Click Save to save the pipeline.
GenePattern displays a status page confirming pipeline creation.
9. Click the Continue to Modules & Pipelines Start link at the bottom of the status page
to return to the GenePattern start page.
The pipeline appears in the Modules & Pipelines list in the Pipeline category.
Run the pipeline
GenePattern automatically selects the new pipeline as the next module to be run.
10. Click Run to run the pipeline.
GenePattern runs each module in the pipeline, preprocessing the all_aml_train.gct
file, running the KNN class prediction analysis, and then displaying the prediction results.
11. Close the viewer and, in GenePattern, click the Return to Modules & Pipelines Start
link at the bottom of the status page to return to the GenePattern start page.
USING THE GenePattern DESKTOP CLIENT
GenePattern provides two point-and-click graphical user interfaces (clients) to access the
GenePattern server: the Web Client and the Desktop Client. The Web Client is automatically installed with the GenePattern server; the Desktop Client is installed separately.
Most GenePattern features are available from both clients; however, only the Desktop
Client provides access to the following ease-of-use features: adding project directories
for easy access to dataset files, running an analysis on every file in a directory by specifying that directory as an input parameter, and filtering the lists of modules and pipelines
displayed in the interface.
ALTERNATE PROTOCOL 1
This protocol introduces the Desktop Client by running the PreprocessDataset and
HeatMapViewer modules. The aim is not to discuss the analyses, but simply to demonstrate the Desktop Client interface.
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org.
Installing the Desktop Client is optional. If it is not installed with the
GenePattern software, the Desktop Client can be installed at any time from the
GenePattern Web Client. To install the Desktop Client from the Web Client,
click Downloads>Install Desktop Client and follow the on-screen instructions.
Modules used in this protocol: PreprocessDataset (version 3) and HeatMapViewer
(version 8)
Files
The PreprocessDataset module requires gene expression data in a tab-delimited
text file (GCT file format, Fig. 7.12.1) that contains a column for each sample
and a row for each gene. Basic Protocol 1 describes how to convert various gene
expression data into this file format.
As an example, this protocol uses an ALL/AML leukemia dataset (Golub et al.,
1999) consisting of 38 bone marrow samples (27 ALL, 11 AML). Download the
data file (all_aml_train.gct) from the GenePattern Web site at
http://genepattern.org/datasets/.
Start the GenePattern server
The GenePattern server must be started before the Desktop Client. Use the following steps
to start a local GenePattern server. Alternatively, use the public GenePattern server hosted
at http://genepattern.broad.mit.edu/gp/. For more information, refer to the GenePattern
Tutorial (http://www.genepattern.org/tutorial/gp_tutorial.html) or GenePattern Desktop
Client Guide (http://www.genepattern.org/tutorial/gp_java_client.html).
1. Double-click the Start GenePattern Server icon (the GenePattern installation places
this icon on the desktop).
On Windows, while the server is starting, the cursor displays an hourglass. On Mac OS
X, while the server is starting, the server icon bounces in the Dock.
Start the Desktop Client
2. Double-click the GenePattern Desktop Client icon (the GenePattern installation places
this icon on the desktop).
The Desktop Client connects to the GenePattern server, retrieves the list of available
modules, builds its menus, and displays a welcome message.
The Projects pane provides access to selected project directories (directories that hold
the genomic data to be analyzed). The Results pane lists analysis jobs run by the current
GenePattern user.
Open a project directory
3. To open a project directory, select File>Open Project Directory.
GenePattern displays the Choose a Project Directory window.
4. Navigate to the directory that contains the data files and click Select Directory.
For example, select the directory that contains the example data file,
all_aml_train.gct. GenePattern adds the directory to the Projects pane.
5. In the Projects pane, double-click the directory name to display the files in the
directory.
Run an analysis
6. To start an analysis, select it from the Analysis menu.
For example, select Analysis>Preprocess & Utilities>PreprocessDataset. GenePattern
displays the parameters for the PreprocessDataset module.
7. For the “input filename” parameter, select gene expression data in the GCT file
format.
For example, drag and drop the all_aml_train.gct file from the Projects pane to the
“input filename” parameter box.
8. Review the remaining parameters to determine which values, if any, should be
modified (see Table 7.12.2).
For this example, use the default values.
9. Click Run to start the analysis.
GenePattern displays the analysis in the Results pane with a status of Processing. When
the analysis is complete, the output files are added to the Results pane and a dialog box
appears showing the completed job. Close the dialog box. In the Results pane, double-click the name of the analysis to display the result files. This example generates two result
files: all_aml_train.preprocessed.gct, which is the new, preprocessed gene
expression data file, and gp_task_execution_log.txt, which lists the parameters
used for the analysis.
Run an analysis from a result file
Research is an iterative process, and the input file for an analysis is often the output file of a previous analysis. GenePattern makes this easy. As an example, the following steps use the gene expression file created by the PreprocessDataset module
(all_aml_train.preprocessed.gct) as the input file for the HeatMapViewer
module, which displays the expression data graphically.
10. To start the analysis, in the Results pane, right-click the result file and, from the
menu that appears, select the Modules submenu and then the name of the module to
run.
For example, in the Results pane, right-click the result file from the PreprocessDataset
analysis, all_aml_train.preprocessed.gct. From the menu that appears, select
Modules>HeatMapViewer.
GenePattern displays the parameters for the HeatMapViewer. Because the module was
selected from the file menu, GenePattern automatically uses the analysis result file as the
value of the first input filename parameter.
11. Click Run to start the viewer.
The first time a viewer runs on the desktop, a security warning message may appear. Click
Run to continue.
GenePattern opens the HeatMapViewer.
12. Close the HeatMapViewer by selecting File>Exit.
Notice that the HeatMapViewer does not appear in the Results pane. The Results pane
lists the analyses run on the GenePattern server. Visualizers, unlike analysis modules, run
on the client rather than the server; therefore, they do not appear in the Results pane.
USING THE GenePattern PROGRAMMING ENVIRONMENT
GenePattern libraries for the Java, MATLAB, and R programming environments allow applications to run GenePattern modules and retrieve analysis results. Each library
supports arbitrary scripting and access to GenePattern modules via function calls, as
well as development of new methodologies that combine modules in arbitrarily complex combinations. Download the libraries from the GenePattern Web Client by clicking
Downloads>Programming Libraries.
ALTERNATE PROTOCOL 2
For more information about accessing GenePattern from a programming environment,
see the GenePattern Programmer’s Guide at http://www.genepattern.org/tutorial/gp_programmer.html.
SETTING USER PREFERENCES FOR THE GenePattern WEB CLIENT
GenePattern provides two point-and-click graphical user interfaces (clients) to access the
GenePattern server: the Web Client and the Desktop Client. The Web Client is automatically installed with the GenePattern server. Most GenePattern features are available from
both clients; however, only the Web Client provides access to GenePattern administrative features, such as configuring the GenePattern server and installing modules from the
GenePattern repository.
SUPPORT PROTOCOL
Necessary Resources
Hardware
Computer running MS Windows, Mac OS X, or Linux
Software
GenePattern software, which is freely available at http://www.genepattern.org/, or
browser to access GenePattern on line
Files
Input files for the Web Client depend on the module called
Table 7.12.8 GenePattern Account Settings
Setting | Description
Change Email | Change the e-mail address for your GenePattern account on this server
Change Password | Change the password for your GenePattern account on this server; by default, GenePattern servers are installed without password protection
History | Specify the number of recent analyses listed in the Recent Jobs pane on the Web Client start page
Visualizer Memory | Specify the Java virtual machine configuration parameters (such as VM memory settings) to be used when running visualization modules; by default, this option is used to specify the amount of memory to allocate when running visualization modules (-Xmx512M)
Start the GenePattern server
The GenePattern server must be started before the Web Client. Use the following steps to
start a local GenePattern server. Alternatively, use the public GenePattern server hosted at
http://genepattern.broad.mit.edu/gp/. For more information, refer to the GenePattern Tutorial (http://www.genepattern.org/tutorial/gp_tutorial.html) or GenePattern Web Client
Guide (http://www.genepattern.org/tutorial/gp_web_client.html).
1. Double-click the Start GenePattern Server icon (the GenePattern installation places
this icon on the desktop).
On Windows, while the server is starting, the cursor displays an hourglass. On Mac OS
X, while the server is starting, the server icon bounces in the Dock.
Start the Web Client
2. Double-click the GenePattern Web Client icon (the GenePattern installation places
this icon on the desktop).
GenePattern displays the Web Client start page (Fig. 7.12.3). Modules & Pipelines, at
the left of the start page, lists all available analyses. By default, analyses are organized
by category. Use the radio buttons at the top of the Modules & Pipelines list to organize
analyses by suite or list them alphabetically. A suite is a user-defined collection of pipelines
and/or modules. Suites can be used to organize pipelines and modules in GenePattern in
much the same way “play lists” can be used to organize an online music collection.
Recent Jobs, at the right of the start page, lists analysis jobs recently run by the current
GenePattern user.
Set personal preferences
3. Click My Settings (top right corner) to display your GenePattern account settings.
Table 7.12.8 lists the available settings.
4. Click History to modify the number of jobs displayed in the Recent Jobs list.
The Recent Jobs list provides easy access to analysis result files. Increasing the number
of jobs simplifies access to the files used in the basic protocols.
5. Increase the value (e.g., enter 10) and click Save.
6. Click the GenePattern icon in the title bar to return to the start page.
GUIDELINES FOR UNDERSTANDING RESULTS
This unit describes how to use GenePattern to analyze the results of a transcription
profiling experiment done with DNA microarrays. Typically, such results are represented
as a gene-by-sample table, with a measurement of intensity for each gene element on
the array for each biological sample assayed in the microarray experiment. Analysis of
microarray data relies on the fundamental assumption that “the measured intensities for
each arrayed gene represent its relative expression level” (Quackenbush, 2002).
Depending on the specific objectives of a microarray experiment, analysis can include
some or all of the following steps: data preprocessing and normalization, differential
expression analysis, class discovery, and class prediction.
Preprocessing and normalization form the first critical step of microarray data analysis.
Their purpose is to eliminate missing and low-quality measurements and to adjust the
intensities to facilitate comparisons.
Differential expression analysis is the next standard step and refers to the process of
identifying marker genes—genes that are expressed differently between distinct classes
of samples. GenePattern identifies marker genes using the following procedure. For
each gene, it first calculates a test statistic to measure the difference in gene expression
between two classes of samples, and then estimates the significance (p-value) of this
statistic. With thousands of genes assayed in a typical microarray experiment, the standard
confidence intervals can lead to a substantial number of false positives. This is referred
to as the multiple hypothesis testing problem and is addressed by adjusting the p-values
accordingly. GenePattern provides several methods for such adjustments as discussed in
Basic Protocol 4.
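To make the adjustment step concrete, the sketch below applies the Benjamini-Hochberg false discovery rate procedure, one widely used correction for multiple hypothesis testing. It is offered only as an illustration; the function name and the example p-values are hypothetical, and the code is not taken from any GenePattern module.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), and a
    running minimum taken from the largest rank downward keeps the adjusted
    values monotone."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```

With thousands of genes, a raw p-value that looks significant on its own can have a much larger adjusted value, which is the effect the adjustment is designed to expose.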
The objective of class discovery is to reduce the complexity of microarray data by grouping genes or samples based on similarity of their expression profiles. The general assumptions are that genes with similar expression profiles correspond to a common biological
process and that samples with similar expression profiles suggest a similar cellular state.
For class discovery, GenePattern provides a variety of clustering methods (Table 7.12.4),
as well as principal component analysis (PCA). The method of choice depends on the data,
personal preference, and the specific question being addressed (D’haeseleer, 2005). Typically, researchers use a variety of class discovery techniques and then compare the results.
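Clustering methods need a numeric notion of "similar expression profiles." One common choice is the Pearson correlation between two profiles, converted to a distance as 1 - r. The sketch below is illustrative only; the function names and the example profiles are hypothetical, and the distance measure used by a given GenePattern clustering module may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("profile has zero variance")
    return cov / (sx * sy)

def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)

# Hypothetical expression profiles across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend

print(correlation_distance(gene_a, gene_b), correlation_distance(gene_a, gene_c))
```

Because correlation ignores scale, gene_a and gene_b are treated as near-identical even though their absolute intensities differ.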
The aim of class prediction is to determine membership of unlabeled samples in
known classes based on their expression profiles. The assumption is that the expression
profile of a reasonable number of differentially expressed marker genes represents
a molecular “signature” that captures the essential features of a particular class or
phenotype. As discussed in Golub et al. (1999), such a signature could form the basis
of a valuable diagnostic or prognostic tool in a clinical setting. For gene expression
analysis, determining whether such a gene expression signature exists can help refine
or validate putative classes defined during class discovery. In addition, a deeper
understanding of the genes included in the signature may provide new insights into the
biology of the phenotype classes. GenePattern provides several class prediction methods
(Table 7.12.6). As with class discovery, it is generally a good idea to try several different
class prediction methods and to compare the results.
COMMENTARY
Background Information
Analysis of microarray data is an iterative process that starts with data preprocessing
and then cycles between computational analysis, hypothesis generation, and further analysis to validate and/or refine hypotheses. The
GenePattern software package and its repository of analysis and visualization modules support this iterative workflow.
Two graphical user interfaces, the Web
Client and the Desktop Client, and a programming environment provide users at any
level of computational skill easy access to
the diverse collection of analysis and visualization methods in the GenePattern module repository. By packaging methods as individual modules, GenePattern facilitates the
rapid integration of new techniques and the
growth of the module repository. In addition,
researchers can easily integrate external tools
into GenePattern by using a simple form-based
interface to create modules from any computational tool that can be run from the command line. Modules are easily combined into
workflows by creating GenePattern pipelines
through a form-based interface or automatically from a result file. Using pipelines, researchers can reproduce and share analysis
strategies.
By providing a simple user interface and a
diverse collection of computational methods,
GenePattern encourages researchers to run
multiple analyses, compare results, generate
hypotheses, and validate/revise those hypotheses in a naturally iterative process. Running
multiple analyses often provides a richer understanding of the data; however, without careful attention to detail, critical results can be
difficult to reproduce or to share with colleagues. To address this issue, GenePattern
provides extensive support for reproducible research. It preserves each version of each module and pipeline; records each analysis that is
run, including its input files and parameter values; provides a method of building a pipeline
from an analysis result file, which captures the
steps required to generate that file; and allows
pipelines to be exported to files and shared
with colleagues.
Critical Parameters
Gene expression data files
GenePattern accepts expression data in tab-delimited text files (GCT file format) that
contain a column for each sample, a row for
each gene, and an expression measurement
for each gene in each sample. As discussed
in Basic Protocol 1, how the expression data is
acquired determines the best way to translate
it into the GCT file format. GenePattern provides modules to convert expression data from
Affymetrix CEL files, to convert MAGE-ML format data, and to extract data from the GEO or
caArray microarray expression data repositories. Expression data stored in other formats
can be converted into a tab-delimited text file
that contains expression measurements with
genes as rows and samples as columns and
formatted to comply with the GCT file format.
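For expression data held in another format, the conversion can be scripted. The sketch below writes a gene-by-sample table in the GCT layout as commonly documented: a "#1.2" version line, a line giving the number of rows and columns, a header line beginning with Name and Description, and then one tab-delimited line per gene. The function name and the example values are hypothetical; check the output against Fig. 7.12.1 before relying on it.

```python
def to_gct(sample_names, rows):
    """Build GCT-formatted text.

    rows is a list of (name, description, values) tuples, with one value
    per sample."""
    lines = [
        "#1.2",
        f"{len(rows)}\t{len(sample_names)}",
        "\t".join(["Name", "Description"] + list(sample_names)),
    ]
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"

# Hypothetical two-gene, three-sample dataset.
gct_text = to_gct(
    ["s1", "s2", "s3"],
    [("geneA", "desc A", [1.0, 2.0, 3.0]),
     ("geneB", "desc B", [4.0, 5.0, 6.0])],
)
print(gct_text)
```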
When working with cDNA microarray
data, do not blindly accept the default values
provided for the GenePattern modules. Most
default values are optimized for Affymetrix
data. Many GenePattern analysis modules do
not allow missing values, which are common
in cDNA two-color ratio data. One way to address this issue is to remove the genes with
missing values. An alternative approach is to
use the ImputeMissingValues.KNN module to
impute missing values by assigning gene expression values based on the nearest neighbors
of the gene.
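Both approaches to missing values can be sketched in a few lines. The code below is a simplified illustration and not the algorithm used by the ImputeMissingValues.KNN module: the function names are hypothetical, missing values are represented as None, neighbors are drawn only from genes with no missing values, and distance is Euclidean over the columns observed for the incomplete gene.

```python
import math

def drop_incomplete(rows):
    """Keep only genes with no missing (None) values."""
    return {g: v for g, v in rows.items() if all(x is not None for x in v)}

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that column over the k
    complete genes closest to the incomplete gene."""
    complete = drop_incomplete(rows)
    if not complete:
        raise ValueError("no complete genes to use as neighbors")
    filled = {}
    for gene, values in rows.items():
        if gene in complete:
            filled[gene] = list(values)
            continue
        observed = [j for j, x in enumerate(values) if x is not None]
        neighbors = sorted(
            complete.values(),
            key=lambda c: math.dist([values[j] for j in observed],
                                    [c[j] for j in observed]),
        )[:k]
        filled[gene] = [
            x if x is not None else sum(n[j] for n in neighbors) / len(neighbors)
            for j, x in enumerate(values)
        ]
    return filled

# Hypothetical data: g4 is missing its second measurement.
rows = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 3.3],
    "g3": [9.0, 8.0, 7.0],
    "g4": [1.0, None, 3.1],
}
kept = drop_incomplete(rows)
imputed = knn_impute(rows, k=2)
print(sorted(kept), imputed["g4"])
```

Removing genes is simpler but discards data; imputation keeps every gene at the cost of introducing estimated values.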
Class files
A class file is a tab-delimited text file (the
CLS format) that provides class information
for each sample. Typically, classes represent
phenotypes, such as tumor or normal. Basic
Protocol 2 describes how to create class files.
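A class file can also be generated from a list of per-sample labels. The sketch below assumes the CLS layout as commonly documented: a first line with the number of samples, the number of classes, and the constant 1; a second line starting with "#" followed by the class names; and a third line with one zero-based class index per sample. The function name is hypothetical, and the sketch separates fields with spaces, on the understanding that GenePattern accepts either spaces or tabs; verify against Basic Protocol 2.

```python
def to_cls(sample_classes):
    """Build CLS-formatted text from per-sample class labels.

    Classes are numbered in order of first appearance."""
    names = []
    for label in sample_classes:
        if label not in names:
            names.append(label)
    index = {name: i for i, name in enumerate(names)}
    lines = [
        f"{len(sample_classes)} {len(names)} 1",
        "# " + " ".join(names),
        " ".join(str(index[label]) for label in sample_classes),
    ]
    return "\n".join(lines) + "\n"

cls_text = to_cls(["ALL", "ALL", "AML"])
print(cls_text)
```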
Microarray experiments often include technical replicates. Analyze the replicates as separate samples or remove them by averaging or
another data reduction technique. For example,
if an experiment includes five tumor samples
and five control samples each run three times
(three replicate columns) for a total of 30 data
columns, one might combine the three replicate columns for each sample (by averaging or
some other data reduction technique) to create
a dataset containing 10 data columns (five tumor and five control).
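The averaging step in this example can be written as follows. The sketch is illustrative: the function name, the column naming scheme, and the expression values are hypothetical, and each column is represented as a list with one value per gene.

```python
def average_replicates(columns, replicate_groups):
    """Collapse replicate columns into one averaged column per sample.

    columns maps a column name to its per-gene values; replicate_groups
    maps a sample name to the names of its replicate columns."""
    averaged = {}
    for sample, cols in replicate_groups.items():
        n_genes = len(columns[cols[0]])
        averaged[sample] = [
            sum(columns[c][g] for c in cols) / len(cols) for g in range(n_genes)
        ]
    return averaged

# Five tumor and five control samples, each run three times (30 columns).
sample_names = [f"tumor{i}" for i in range(1, 6)] + [f"control{i}" for i in range(1, 6)]
columns = {
    f"{s}_rep{r}": [float(r), 10.0 * r]   # two genes, made-up values
    for s in sample_names for r in (1, 2, 3)
}
replicate_groups = {s: [f"{s}_rep{r}" for r in (1, 2, 3)] for s in sample_names}

averaged = average_replicates(columns, replicate_groups)
print(len(columns), len(averaged), averaged["tumor1"])
```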
Analysis methods
Table 7.12.9 lists the GenePattern modules
as of this writing; new modules are continuously released. For a current list of modules and their documentation, see the Modules page on the GenePattern Web site at
http://www.genepattern.org. Categories group
the modules by function and are a convenient
way of finding or reviewing available modules.
To ensure reproducibility of analysis results, each module is given a version number.
When modules are updated, both the old and
new versions are in the module repository. If
a protocol in this unit does not work as documented, compare the version number in the
protocol with the version number installed on
the GenePattern server used to execute the protocol. If the server has a different version of
a module, click Modules & Pipelines>Install
from Repository to install the desired version
of the module from the module repository.
Analysis result files
GenePattern is a client-server application.
All modules are stored on the GenePattern
server. A user interacts with the server through
the GenePattern Web Client, Desktop Client,
or a programming environment. When the
user runs an analysis module, the GenePattern
client sends a message to the server, which runs
Table 7.12.9 GenePattern Modules^a
Module | Description
Annotation
GeneCruiser | Retrieve gene annotations for Affy probe IDs
Clustering
ConsensusClustering | Resampling-based clustering method
HierarchicalClustering | Hierarchical clustering
KMeansClustering | k-means clustering
NMFConsensus | Non-negative matrix factorization (NMF) consensus clustering
SOMClustering | Self-organizing maps algorithm
SubMap | Maps subclasses between two datasets
Gene list selection
ClassNeighbors | Select genes that most closely resemble a profile
ComparativeMarkerSelection | Computes significance values for features using several metrics
ExtractComparativeMarkerResults | Creates a dataset and feature list from ComparativeMarkerSelection output
GSEA | Gene set enrichment analysis
GeneNeighbors | Select the neighbors of a given gene according to similarity of their profiles
SelectFeaturesColumns | Takes a “column slice” from a .res, .gct, .odf, or .cls file
SelectFeaturesRows | Takes a “row slice” from a .res, .gct, or .odf file
Image creators
HeatMapImage | Creates a heat map graphic from a dataset
HierarchicalClusteringImage | Creates a dendrogram graphic from a dataset
Missing value imputation
ImputeMissingValues.KNN | Impute missing values using a k-nearest neighbor algorithm
Pathway analysis
ARACNE | Runs the ARACNE algorithm
MINDY | Runs the MINDY algorithm for inferring genes that modulate the activity of a transcription factor at post-transcriptional levels
Pipeline
Golub.Slonim.1999.Science.all.aml | ALL/AML methodology, from Golub et al. (1999)
Lu.Getz.Miska.Nature.June.2005.PDT.mRNA | Probabilistic Neural Network Prediction using mRNA, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.PDT.miRNA | Probabilistic Neural Network Prediction using miRNA, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.ALL | Hierarchical clustering of ALL samples with genetic alterations, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.ep.mRNA | Hierarchical clustering of 89 epithelial samples in mRNA space, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.ep.miRNA | Hierarchical clustering of 89 epithelial samples in miRNA space, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.clustering.miGCM218 | Hierarchical clustering of 218 samples from various tissue types, from Lu et al. (2005)
Lu.Getz.Miska.Nature.June.2005.mouse.lung | Normal/tumor classifier and KNN prediction of mouse lung samples, from Lu et al. (2005)
Table 7.12.9 GenePattern Modulesa , continued
Module
Description
Prediction
CART
Classification and regression tree classification
CARTXValidation
Classification and regression tree classification with leave-one-out
cross-validation
KNN
k-nearest neighbors classification
KNNXValidation
k-nearest neighbors classification with leave-one-out cross-validation
PNN
Probabilistic Neural Network (PNN)
PNNXValidationOptimization
PNN leave-one-out cross-validation optimization
SVM
Classifies samples using the support vector machines (SVM) algorithm
WeightedVoting
Weighted voting classification
WeightedVotingXValidation
Weighted voting classification with leave-one-out cross-validation
Preprocess and utilities
ConvertLineEndings
Converts line endings to the host operating system’s format
ConvertToMAGEML
Converts a gct, res, or odf dataset file to a MAGE-ML file
DownloadURL
Downloads a file from a URL
ExpressionFileCreator
Creates a res or gct file from a set of Affymetrix CEL files
ExtractColumnNames
Lists the sample descriptors from a .res file
ExtractRowNames
Extracts the row names from a .res, .gct, or .odf file
GEOImporter
Imports data from the Gene Expression Omnibus (GEO);
http://www.ncbi.nlm.nih.gov/geo
MapChipFeaturesGeneral
Map the features of a dataset to user-specified values
MergeColumns
Merge datasets by column
MergeRows
Merge datasets by row
MultiplotPreprocess
Creates derived data from an expression dataset for use in the Multiplot
and Multiplot Extractor visualizer modules
PreprocessDataset
Preprocessing options on a res, gct, or Dataset input file
ReorderByClass
Reorder the samples in an expression dataset and class file by class
SplitDatasetTrainTest
Splits a dataset (and cls files) into train and test subsets
TransposeDataset
Transposes a dataset (.gct, .odf)
UniquifyLabels
Makes row and column labels unique
Projection
NMF
Non-negative matrix factorization
PCA
Principal component analysis
Proteomics
AreaChange
Calculates fraction of area under the spectrum that is attributable to signal
CompareSpectra
Compares two spectra to determine similarity
LandmarkMatch
A proteomics method to propagate identified peptides across multiple MS
runs
LocatePeaks
Locates detected peaks in a spectrum
mzXMLToCSV
Converts a mzXML file to a zip of csv files
PeakMatch
Perform peak matching on LC-MS data
Peaks
Determine peaks in the spectrum using a series of digital filters.
PlotPeaks
Plot peaks identified by PeakMatch
ProteoArray
LC-MS proteomic data processing module
ProteomicsAnalysis
Runs the proteomics analysis on the set of input spectra
Sequence analysis
GlobalAlignment
Smith-Waterman sequence alignment
SNP analysis
CopyNumberDivideByNormals
Divides tumor samples by normal samples to create a raw copy number value
GLAD
Runs the GLAD R package
LOHPaired
Computes LOH for paired samples
SNPFileCreator
Process Affymetrix SNP probe-level data into an expression value
SNPFileSorter
Sorts a .snp file by chromosome and location
SNPMultipleSampleAnalysis
Determine regions of concordant copy number aberrations
XChromosomeCorrect
Corrects X chromosome SNPs for male samples
Statistical methods
KSscore
Kolmogorov-Smirnov score for a set of genes within an ordered list
Survival analysis
SurvivalCurve
Draws a survival curve based on a phenotype or class (.cls) file
SurvivalDifference
Tests for survival difference based on phenotype or (.cls) file
Visualizer
caArrayImportViewer
A visualizer to import data from caArray into GenePattern
ComparativeMarkerSelectionViewer
View the results from ComparativeMarkerSelection
CytoscapeViewer
View a gene network using Cytoscape (http://cytoscape.org)
FeatureSummaryViewer
View a summary of features from prediction
GeneListSignificanceViewer
Views the results of marker analysis
GSEALeadingEdgeViewer
Leading edge viewer for GSEA results
HeatMapViewer
Display a heat map view of a dataset
HierarchicalClusteringViewer
View results of hierarchical clustering
JavaTreeView
Hierarchical clustering viewer that reads in Eisen’s cdt, atr, and gtr files
MAGEMLImportViewer
A visualizer to import data in MAGE-ML format into GenePattern
Multiplot
Creates two-parameter scatter plots from the output file of the
MultiplotPreprocess module
MultiplotExtractor
Provides a user interface for saving the data created by the
MultiplotPreprocess module
PCAViewer
Visualize principal component analysis results
PredictionResultsViewer
Visualize prediction results
SnpViewer
Displays a heat map of SNP data
SOMClusterViewer
Visualize clusters created with the SOM algorithm
VennDiagram
Displays a Venn diagram
a As of April 18, 2008.
the analysis. When the analysis is complete,
the user can review the analysis result files,
which are stored on the GenePattern server.
The term “job” refers to an analysis run on
the server. The term “job results” refers to the
analysis result files.
Analysis result files are typically formatted
text files. GenePattern provides corresponding
visualization modules to display the analysis
results in a concise and meaningful way. Visualization tools provide support for exploring
the underlying biology. Visualization modules
run on the GenePattern client, not the server,
and do not generate analysis result files.
Most GenePattern modules include an output file parameter, which provides a default name for the analysis result file. On
the GenePattern server, the output files for
an analysis are placed in a directory associated with its job number. The default file
name can be reused because the server creates a new directory for each job. However,
changing the file name to distinguish between different iterations of the same analysis
is recommended. For example, HierarchicalClustering can be run using several different
clustering methods (complete-linkage, single-linkage, centroid-linkage, or average-linkage).
Including the method name in the output
file name makes it easier to compare the results of the different methods. By default,
the output file name for HierarchicalClustering is <input.filename basename>, which
indicates that the module will use the
input file name as the output file name.
Alternative output file names might be
<input.filename basename>.complete,
<input.filename basename>.centroid,
<input.filename basename>.average, or
<input.filename basename>.single.
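The naming scheme suggested above can be sketched in a few lines of Python. This is an illustration only: the function name output_names and the example file name are hypothetical, and the sketch assumes that <input.filename basename> means the input file name with its extension removed.

```python
from pathlib import Path

# Linkage methods named in the text above.
METHODS = ["complete", "single", "centroid", "average"]


def output_names(input_filename, methods=METHODS):
    """Return one output base name per clustering method.

    Assumption: the basename is the input file name without its
    extension; the method name is appended as a suffix.
    """
    basename = Path(input_filename).stem
    return {method: f"{basename}.{method}" for method in methods}


# Hypothetical input file name, used only for illustration.
names = output_names("all_aml_train.gct")
for method, name in names.items():
    print(method, "->", name)
```

Keeping the method name in the file name means that result files from several runs can be saved into one local directory without overwriting each other.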
By default, the GenePattern server stores
analysis result files for 7 days. After that time,
they are automatically deleted from the server.
To save an analysis result file, download the
file from the GenePattern server to a local directory. In the Web Client, to save an analysis
result file, click the icon next to the file and
select Save. To save all result files for an analysis, click the icon next to the analysis and
select Download. In the Desktop Client, in the
Result pane, click the analysis result file and
select Results>Save To.
Suggestions for Further Analysis
Table 7.12.9 lists the modules available in
GenePattern as of this writing; new modules
are continuously being released. The GenePattern Web site, http://www.genepattern.org,
provides a current list of modules. To install the latest versions of all modules,
from the GenePattern Web Client, select
Modules>Install from Repository. When using GenePattern regularly, check the repository each month for new and updated modules.
Literature Cited
Benjamini, Y. and Hochberg, Y. 1995. Controlling
the false discovery rate: A practical and powerful
approach to multiple testing. J. R. Stat. Soc. Ser.
B 57:289-300.
Breiman, L., Friedman, J.H., Olshen, R.A., and
Stone, C.J. 1984. Classification and regression trees. Wadsworth & Brooks/Cole Advanced
Books & Software, Monterey, Calif.
Brunet, J., Tamayo, P., Golub, T.R., and Mesirov,
J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl.
Acad. Sci. U.S.A. 101:4164-4169.
Cover, T.M. and Hart, P.E. 1967. Nearest neighbor
pattern classification. IEEE Trans. Info. Theory
13:21-27.
D’haeseleer, P. 2005. How does gene expression clustering work? Nat. Biotechnol. 23:1499-1501.
Getz, G., Monti, S., and Reich, M. 2006. Workshop:
Analysis Methods for Microarray Data. October
18-20, 2006. Cambridge, MA.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C.,
Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh,
M., Downing, J.R., Caligiuri, M.A., Bloomfield,
C.D., and Lander, E.S. 1999. Molecular classification of cancer: Class discovery and class
prediction by gene expression. Science 286:531-537.
Gould, J., Getz, G., Monti, S., Reich, M., and
Mesirov, J.P. 2006. Comparative gene marker
selection suite. Bioinformatics 22:1924-1925.
Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E.,
Lamb, J., Peck, D., Sweet-Cordero, A., Ebert,
B.L., Mak, R.H., Ferrando, A.A., Downing, J.R.,
Jacks, T., Horvitz, H.R., and Golub, T.R. 2005.
MicroRNA expression profiles classify human
cancers. Nature 435:834-838.
MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations.
In Proceedings of Fifth Berkeley Symposium
on Mathematical Statistics and Probability, Vol.
1 (L. Le Cam and J. Neyman, eds.) pp. 281-297. University of California Press, Berkeley,
California.
Monti, S., Tamayo, P., Mesirov, J.P., and Golub,
T. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data.
Functional Genomics Special Issue. Machine
Learning Journal 52:91-118.
Quackenbush, J. 2002. Microarray data normalization and transformation. Nat. Genet. 32:496-501.
Slonim, D.K. 2002. From patterns to pathways:
Gene expression data analysis comes of age.
Nat. Genet. 32:502-508.
Slonim, D.K., Tamayo, P., Mesirov, J.P., Golub,
T.R., and Lander, E.S. 2000. Class prediction
and discovery using gene expression data. In
Proceedings of the Fourth Annual International
Conference on Computational Molecular Biology (RECOMB). (R. Shamir, S. Miyano, S.
Istrail, P. Pevzner, and M. Waterman, eds.)
pp. 263-272. ACM Press, New York.
Specht, D.F. 1990. Probabilistic neural networks.
Neural Netw. 3:109-118.
Storey, J.D. and Tibshirani, R. 2003. Statistical significance for genomewide studies. Proc. Natl.
Acad. Sci. U.S.A. 100:9440-9445.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q.,
Dmitrovsky, E., Lander, E.S., and Golub, T.R.
1999. Interpreting gene expression with self-organizing maps: Methods and application to
hematopoietic differentiation. Proc. Natl. Acad.
Sci. U.S.A. 96:2907-2912.
Vapnik, V. 1998. Statistical Learning Theory. John
Wiley & Sons, New York.
Westfall, P.H. and Young, S.S. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment (Wiley Series in
Probability and Statistics). John Wiley & Sons,
New York.
Wit, E. and McClure, J. 2004. Statistics for Microarrays. John Wiley & Sons, West Sussex,
England.
Zeeberg, B.R., Riss, J., Kane, D.W., Bussey, K.J.,
Uchio, E., Linehan, W.M., Barrett, J.C., and
Weinstein, J.N. 2004. Mistaken identifiers: Gene
name errors can be introduced inadvertently
when using Excel in bioinformatics. BMC Bioinformatics 5:80.
Key References
Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo,
P., and Mesirov, J.P. 2006. GenePattern 2.0.
Nature Genetics 38:500-501.
Overview of GenePattern 2.0, including comparison with other tools.
Wit and McClure, 2004. See above.
Describes setting up a microarray experiment and
analyzing the results.
Internet Resources
http://www.genepattern.org
Download GenePattern software and view GenePattern documentation.
http://www.genepattern.org/tutorial/gp_concepts.html
GenePattern concepts guide.
http://www.genepattern.org/tutorial/gp_web_client.html
GenePattern Web Client guide.
http://www.genepattern.org/tutorial/gp_java_client.html
GenePattern Desktop Client guide.
http://www.genepattern.org/tutorial/gp_programmer.html
GenePattern Programmer’s guide.
http://www.genepattern.org/tutorial/gp_fileformats.html
GenePattern file formats.
Data Storage and Analysis in
ArrayExpress and Expression Profiler
UNIT 7.13
Gabriella Rustici,1 Misha Kapushesky,1 Nikolay Kolesnikov,1 Helen Parkinson,1 Ugis Sarkans,1 and Alvis Brazma1
1 European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
ABSTRACT
ArrayExpress at the European Bioinformatics Institute is a public database for MIAME-compliant microarray and transcriptomics data. It consists of two parts: the ArrayExpress
Repository, which is a public archive of microarray data, and the ArrayExpress Warehouse of Gene Expression Profiles, which contains additionally curated subsets of data
from the Repository. Archived experiments can be queried by experimental attributes,
such as keywords, species, array platform, publication details, or accession numbers.
Gene expression profiles can be queried by gene names and properties, such as Gene
Ontology terms, and the resulting expression profiles can be visualized. The data can be exported and
analyzed using the online data analysis tool named Expression Profiler. Data analysis
components, such as data preprocessing, filtering, differentially expressed gene finding,
clustering methods, and ordination-based techniques, as well as other statistical tools, are
all available in Expression Profiler, via integration with the statistical package R. Curr.
Protoc. Bioinform. 23:7.13.1-7.13.27. © 2008 by John Wiley & Sons, Inc.
Keywords: gene expression • microarrays • transcriptomics • public repository • data analysis
INTRODUCTION
The ArrayExpress (AE) resource consists of two databases: (1) the AE Repository of Microarray and Transcriptomics Data, which archives well-annotated microarray data typically
supporting journal publications, and (2) the AE Warehouse of Gene Expression Profiles,
which contains additionally curated subsets of data from the Repository and enables the
user to query gene expression profiles by gene names, properties, and profile similarity
(Brazma et al., 2003). In addition to the two databases, the resource includes an online
data analysis tool named Expression Profiler (EP; Kapushesky et al., 2004), which allows
the exploration, mining, analysis, and visualization of data exported from AE, as well
as datasets uploaded from any other sources, such as user-generated data. Further, the
AE resource includes MIAMExpress and Tab2MAGE tools for data submission to the
Repository.
AE supports standards and recommendations developed by the Microarray Gene Expression Data (MGED) society, including the Minimum Information About a Microarray
Experiment (MIAME; Brazma et al., 2001) and a spreadsheet-based data exchange
format, MAGE-TAB (Rayner et al., 2006). AE is one of three international databases
recommended by the MGED society (Ball et al., 2004) for storing MIAME-compliant
microarray data related to publications (the other two being Gene Expression Omnibus
and CIBEX; Edgar et al., 2002; Ikeo et al., 2003). As of January 2008, the AE Repository
holds data from ∼100,000 microarray hybridizations, from over 3300 separate studies
(experiments) related to over 200 different species. The data in the Repository tends to
double every 14 months. Most of the data relates to transcription profiling experiments,
although the proportion of array-based comparative genomic hybridization (CGH)
and chromatin immunoprecipitation (ChIP-on-chip) experiments is growing. The Repository also holds prepublication, password-protected data, mostly for reviewing purposes.
Current Protocols in Bioinformatics 7.13.1-7.13.27, September 2008
Published online September 2008 in Wiley Interscience (www.interscience.wiley.com).
DOI: 10.1002/0471250953.bi0713s23
Copyright © 2008 John Wiley & Sons, Inc.
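As a rough illustration of the growth rate quoted above: if the 14-month doubling time were to hold steady (an extrapolation the text itself does not make), the expected number of hybridizations after t months would be N(t) = N0 × 2^(t/14). A minimal Python sketch follows; the function name is hypothetical and the starting value is the approximate January 2008 figure given above.

```python
def projected_hybridizations(n0, months, doubling_months=14.0):
    """Project a count forward assuming steady exponential doubling.

    n0: starting count; months: elapsed time in months;
    doubling_months: doubling time (14 months, as quoted in the text).
    """
    return n0 * 2 ** (months / doubling_months)


# Two doubling periods (28 months) after a starting point of ~100,000.
estimate = projected_hybridizations(100_000, 28)
print(round(estimate))
```

The projection is only as good as the assumption of a constant doubling time, so it should be read as an order-of-magnitude guide.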
The Repository allows the user to browse or query the experiments via free text search
(e.g., experiment accession numbers, authors, laboratory, publication, and keywords),
and filter the data by species or array design. Once the desired experiment is identified, the user can find more information about the samples, protocols used, experimental
design, etc., and most importantly can export either all, or parts, of the data from the
experiment.
The AE Warehouse holds additionally curated gene expression data that can be queried
and retrieved by gene names, identifiers (e.g., database accession numbers), or properties, such as Gene Ontology terms. The main source of data for the Warehouse is the
Repository, although some in situ gene expression and protein expression data from
external sources are also loaded in the Warehouse. The use of the AE Warehouse is
straightforward: enter the name, ID, or a property of a gene or several genes, retrieve the
list of experiments where the given gene has been studied, and zoom into its expression
profile.
EP is a Web-based gene expression data analysis tool; several of its components are implemented via integration with the statistical package R (Ihaka and Gentleman, 1996). Users
can upload their own data to EP or use data retrieved from AE. Users need only a Web
browser to use EP from their local PCs. Data analysis components for gene expression
data preprocessing, missing value imputation, filtering, clustering methods, visualization, significant gene finding, between-group analysis, and other statistical components
are available in EP. The Web-based design of EP supports data sharing and collaborative
analysis in a secure environment. Developed tools are integrated with the microarray
gene expression database AE and form the exploratory analytical front-end to those data.
In this unit we present six basic protocols: (1) how to query, retrieve, and interpret data
from the AE Warehouse of Gene Expression Profiles; (2) how to query, retrieve, and
interpret data and metadata from the AE Repository of Microarray and Transcriptomics
Data; (3) how to upload, normalize, analyze, and visualize data in EP; (4) how to perform
clustering analysis in EP; (5) how to calculate Gene Ontology term enrichment in EP;
and (6) how to calculate chromosome co-localization probability in EP.
BASIC PROTOCOL 1
QUERYING GENE EXPRESSION PROFILES
This protocol describes how to query and analyze data from the AE Warehouse of Gene
Expression Profiles.
Necessary Resources
Hardware
Suggested minimum requirements for a PC system: fast Internet connection
(64K+), graphics card supporting at least 1024 × 768, optimal resolution
1280 × 1024 (65K+ colors)
Software
Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000,
XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+
(Mac OS X)
Query for expression profiles of a particular gene
1. Open the AE homepage at http://www.ebi.ac.uk/arrayexpress.
2. In the Expression Profiles box, on the right-hand side of the page, type a gene name
(e.g., nfkbia) into the Gene(s) box, and type leukemia as a keyword in the
experiment or sample annotation box (Fig. 7.13.1).
The Experiments box, on the left-hand side of the page, allows querying the AE Repository
of Microarray Experiments. This is the focus of Basic Protocol 2.
3. Select species, e.g., Homo sapiens, in the drop-down menu and click the “query”
button.
The interface returns the list of all experiments (studies) in the AE Warehouse where the
selected gene has been studied (Fig. 7.13.2). Experiments are ordered by “relevance,”
with the most relevant experiment on top. The “relevance rank” is based on the correlation
between experimental factor values and gene expression values and is calculated using
several methods, including a linear model in the Bioconductor package limma (Smyth,
2004). For each experiment, a short description, a list of experimental factors (see step 2),
and the experimental setup (type) are provided. In addition, a thumbnail image shows
the behavior of the selected gene in each experiment retrieved. At a glance the user can
now decide which experiment might be interesting for further viewing.
Figure 7.13.1 The ArrayExpress query windows (http://www.ebi.ac.uk/arrayexpress).
Figure 7.13.2 Output window after querying the AE Warehouse for the expression profiles of a
particular gene (e.g., nfkbia).
Figure 7.13.3 Zoomed-in view of a particular experiment. The main graph shows the expression
profile of the selected gene (e.g., nfkbia), for all experimental samples, based on the selected
experimental factor.
Choose the experiment of interest and explore the expression profile of the chosen
gene
4. Click on the thumbnail image of the expression profile in one of the experiments,
e.g., E-AFMX-5.
In the graph now showing (Fig. 7.13.3), the X axis represents all samples in this study,
grouped by experimental factor, while the Y axis represents the expression levels for
nfkbia in each sample. Explore the dependency of the expression levels on different experimental factors. Experimental factors are the main experimental variables of interest
in a particular study. For instance, experiment E-AFMX-5 has three experimental factors: cell type, disease state, and organism part. Select “cell type.” Observe that nfkbia
has notably higher expression values for the cell type CD33+ myeloid, than, for instance, CD4+ T cells (Fig. 7.13.3). The black line represents the expression value for the
Affymetrix probe 201502_s_at, for nfkbia. The dotted lines represent the mean expression
values.
Scroll down the page for more information about the sample properties. In the table
provided, the sample number in the first column corresponds to the sample number on the
X axis of the graph.
The expression values are measured in abstract units as supplied by the submitter. For
instance, E-AFMX-5 uses the Affymetrix platform and the MAS5 normalization method. For
more information about the particular normalization protocols used in each individual
experiment, click on the experiment accession number (i.e., E-AFMX-5). This will open
the link to the respective dataset entry in the AE Repository, which contains all the
information related to the selected study (described in Basic Protocol 2).
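The per-group mean lines described in step 4 can be reproduced offline once the expression values have been downloaded. The sketch below is a minimal Python illustration; the sample values are invented for demonstration and are not taken from E-AFMX-5, and the function name group_means is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented (factor value, expression value) pairs for one gene.
# These numbers are placeholders, not data from any real experiment.
samples = [
    ("CD33+ myeloid", 820.0),
    ("CD33+ myeloid", 900.0),
    ("CD4+ T cells", 310.0),
    ("CD4+ T cells", 290.0),
]


def group_means(samples):
    """Average expression values within each experimental factor value."""
    groups = defaultdict(list)
    for factor_value, expression in samples:
        groups[factor_value].append(expression)
    return {value: mean(values) for value, values in groups.items()}


means = group_means(samples)
for factor_value, average in means.items():
    print(factor_value, average)
```

Switching the experimental factor (for example, from cell type to disease state) only changes the first element of each pair; the grouping logic stays the same.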
Select other genes with expression profiles most similar to the chosen one
5. On the top right-hand side of the same page (Fig. 7.13.3), from the “similarity search”
drop-down menu, select the “find 3 closest genes” option.
This will select the three most similarly expressed genes and add their expression profiles
to the current selection, next to the nfkbia profiles (Fig. 7.13.4). You will find that the
expression patterns for genes IER2, FOS, and JUN closely resemble the behavior of
nfkbia. Click on “expand” next to Gene Properties (upper right-hand side of the page),
and follow the links from the expanded view to retrieve additional information about these
genes in ENSEMBL, Uniprot, and 4DXpress databases.
Figure 7.13.4 Similarity search output window. The expression profile of the selected gene (e.g.,
nfkbia) is plotted together with the ones of the 3 genes showing the closest similarity in expression
pattern, within the same experiment. The corresponding gene symbols are listed on the right (Ier2,
Fos, and Jun). For color version of this figure see http://www.currentprotocols.com.
Finally, by clicking on the “download the gene expression data matrix” link located
below the graph (Fig. 7.13.4), the user can obtain the numerical expression values of the
selected genes for further analysis.
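A similarity search of the kind used in step 5 can be approximated on a downloaded data matrix. The text does not state which similarity measure the AE Warehouse uses, so the sketch below assumes Pearson correlation purely for illustration; the gene labels, profiles, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def closest_genes(profiles, query, k=3):
    """Return the k genes whose profiles correlate best with the query."""
    target = profiles[query]
    scored = [
        (pearson(target, profile), gene)
        for gene, profile in profiles.items()
        if gene != query
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [gene for _, gene in scored[:k]]


# Invented profiles over four samples (placeholders, not real data).
profiles = {
    "query": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.0],
    "g2": [1.0, 2.0, 3.0, 5.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
    "g5": [2.0, 1.0, 4.0, 3.0],
}
print(closest_genes(profiles, "query", k=3))
```

Correlation compares the shape of two profiles rather than their absolute levels, so a gene expressed at twice the level of the query can still rank first. A constant profile has zero variance and would need to be filtered out before calling pearson.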
Query for expression profiles of several genes
6. Repeat the search as in step 1, but instead of a single gene, enter two or more comma-separated
gene names, e.g., Ephb3, Nfkbia, select species Mus musculus, and
click the “query” button.
If more than one gene is selected, the query tries to match the gene names exactly
(Fig. 7.13.5). The user will be taken to an intermediate window where a list of
matching genes found is provided, together with a list of matching experiments. Toggle
the genes of interest (in this case both of them) and then click display at the top of the
page.
On the thumbnail plot page that opens up, click to zoom into the expression profile of
experiment E-MEXP-774. Note how the response of these two genes to the dexamethasone
hormone treatment is opposite.
Query for expression of genes of a particular Gene Ontology category
7. Repeat the search as in step 1 but instead of entering a gene name, enter a Gene
Ontology category term or keyword, such as cell cycle, and select the species
Schizosaccharomyces pombe.
As this is a wide category, ∼260 genes are returned. The user can select any number for
further exploration by ticking the respective boxes. For instance, one can select the rum1
gene and click on “display” at the top of the page. The familiar thumbnail plots will be
returned. The user can then zoom in as described above.
Figure 7.13.5 Gene selection page. When more than one gene matches the query, this window
allows refining the search, querying for multiple genes or restricting the search to perfect matches
only.
BASIC PROTOCOL 2
QUERY THE AE REPOSITORY OF MICROARRAY AND
TRANSCRIPTOMICS DATA
This protocol describes how to browse, query, and retrieve information from the AE Repository of Microarray and Transcriptomics Data.
Necessary Resources
Hardware
Suggested minimum requirements for a PC system: fast Internet connection
(64K+), graphics card supporting at least 1024 × 768, optimal resolution
1280 × 1024 (65K+ colors)
Software
Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000,
XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+
(Mac OS X)
Query the Repository
1. Open the AE homepage at http://www.ebi.ac.uk/arrayexpress (Fig. 7.13.1).
2. In the Experiments box on the left-hand side of the page, type in a word or a phrase by
which you want to retrieve the experiments, e.g., cell cycle. Click the “query”
button.
The user can query the Repository by experimental attributes, such as keywords, species,
array platform, publication details, or experiment accession numbers. Alternatively, the
user can first click on the Browse experiments link (in the Experiments box) and browse
the entire Repository content and subsequently apply additional filtering such as “filter
on species” and “filter on array,” which allows narrowing down the search.
Note that when using Internet Explorer the drop-down menu for the “filter on array”
option is not displayed properly so we strongly advise the use of Firefox 1.5+ to avoid
this problem.
3. In the output window, filter on the species Schizosaccharomyces pombe using the
“filter on species” drop-down menu at the top of the page.
This will bring up a window with a list of experiments in the reverse order of their
publication dates in the AE Repository (Fig. 7.13.6). One can increase the number of
experiments per page by changing the default in the top-right corner up to 500 per page.
The information displayed for each experiment is described in Table 7.13.1.
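The logic of steps 2 and 3 (keyword query, species filter, newest-first listing, page size) can be mimicked on a local list of experiment records. This Python sketch is not the AE query interface: the records, accession numbers (E-DEMO-n), field names, and the default page size of 25 are all assumptions made for illustration.

```python
from datetime import date

# Invented experiment records; the accessions are placeholders.
experiments = [
    {"id": "E-DEMO-1", "title": "Cell cycle time course",
     "species": "Schizosaccharomyces pombe", "date": date(2005, 3, 1)},
    {"id": "E-DEMO-2", "title": "Heat shock response",
     "species": "Schizosaccharomyces pombe", "date": date(2006, 7, 15)},
    {"id": "E-DEMO-3", "title": "Cell cycle arrest in cultured cells",
     "species": "Homo sapiens", "date": date(2007, 1, 9)},
    {"id": "E-DEMO-4", "title": "Meiotic cell cycle",
     "species": "Schizosaccharomyces pombe", "date": date(2007, 11, 2)},
]


def query_experiments(records, keyword, species=None, per_page=25):
    """Case-insensitive keyword match on the title, optional species
    filter, newest first, truncated to one page of results."""
    keyword = keyword.lower()
    hits = [
        record for record in records
        if keyword in record["title"].lower()
        and (species is None or record["species"] == species)
    ]
    hits.sort(key=lambda record: record["date"], reverse=True)
    return hits[:per_page]


page = query_experiments(
    experiments, "cell cycle", species="Schizosaccharomyces pombe"
)
ids = [record["id"] for record in page]
print(ids)
```

Applying the species filter after the keyword match mirrors the order of the protocol, where the query is run first and the “filter on species” menu narrows the result list.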
4. Expand an experiment by clicking on the experiment title line. For instance, expand
the experiment E-MEXP-54.
Additional information is provided in the new window together with extremely useful links
to experiment annotation and data retrieval (Fig. 7.13.7). Excel spreadsheets for Sample
annotation and Detailed sample annotation can be viewed or downloaded by clicking on
the Tab-delimited spreadsheet links. A graphical representation of the experimental setup
is also available by clicking the PNG or SVG links, under the Experiment design menu.
Figure 7.13.6 Output window after querying the AE Repository for a particular set of experiments,
using a word or phrase (e.g., cell cycle) and selecting a species (e.g., Schizosaccharomyces
pombe). The total number of experiments and corresponding samples retrieved appears at the
bottom of the page.
Table 7.13.1 Information Displayed for Individual Experiments in the AE Repository
Column header
Description and comments
ID
This is the experiment accession number, a unique identifier assigned to each experiment by the AE curation staff; this ID can be used directly to query the Repository
Title
A brief description of the experiment
Hybs
Number of hybridizations associated with the experiment
Species
List of the species studied in the selected experiment
Date
The date when the experiment was loaded in the Repository
Processed/Raw
The data available is shown as processed or raw data. A yellow icon represents data available for download; a gray icon represents data which is unavailable. Affymetrix raw data has a dedicated Affymetrix icon. Data can be easily downloaded by clicking on the icons.
More
Link to the Advanced User Interface for experiment annotation and data retrieval
Figure 7.13.7 Expanded view of a single experiment with links to several experiment annotation
files and data retrieval page.
Clicking on Experimental protocols takes the user to a detailed description of all
protocols used, including array manufacturing, RNA extraction and labeling, hybridization, scanning, and data analysis.
Clicking on the array accession number (in this case A-SNGR-8) aligned with the Array
menu, a new page with additional links to the array design used is available. The array
annotation file can be downloaded in Excel or tab-delimited format. These files are used
to define the annotation and the layout of reporters on the array.
Clicking on the FTP server direct link takes the user to the FTP directory from which
all the annotation and data files available for the selected experiment can be downloaded.
5. Go back to the expanded experiment view (Fig. 7.13.7) and click on the View detailed
data retrieval page (under the Downloads menu) link.
The new page header provides information on data availability for the selected experiment. Two data formats are available in this case: Processed Data Group, which is the
normalized data, and Measured Data Group, which is the raw data.
Take a look at the Processed Data Group 1 (Fig. 7.13.8, top). A list of all hybridizations (or
experimental conditions) is displayed together with the corresponding experimental factors, time intervals after cell synchronization, in this case. Each hybridization corresponds
to a data file which contains expression levels for all genes in that experimental condition.
The Detailed data retrieval page allows the user to generate a gene expression data matrix
from these data files. This matrix is a single .TXT file which contains expression levels
for all genes in all experimental conditions within an experiment. See Commentary for a
definition of a data matrix. To generate such a matrix, the user needs to select the experimental
conditions to be included (all or a selection, as needed), the “quantitation type” which
represents the expression levels, and additional gene annotation columns, which can later
be useful in the interpretation of the results.
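The way per-hybridization data files combine into a single gene expression data matrix can be sketched as follows. This is a minimal Python illustration of the idea, not the AE export code: the condition labels, identifiers, annotation values, and function names are hypothetical, and one quantitation type per hybridization is assumed.

```python
def build_matrix(hybridizations, annotation=None):
    """Combine per-hybridization values into rows of one matrix.

    hybridizations: {condition: {identifier: value}}, one inner dict
    per data file. annotation: optional {identifier: text} column.
    Returns a list of rows; the first row is the header.
    """
    conditions = list(hybridizations)
    identifiers = sorted(
        set().union(*(values.keys() for values in hybridizations.values()))
    )
    header = ["identifier"]
    if annotation is not None:
        header.append("annotation")
    header.extend(conditions)
    rows = [header]
    for identifier in identifiers:
        row = [identifier]
        if annotation is not None:
            row.append(annotation.get(identifier, ""))
        # Missing measurements are written as empty cells.
        row.extend(
            str(hybridizations[c].get(identifier, "")) for c in conditions
        )
        rows.append(row)
    return rows


def to_tab_delimited(rows):
    """Serialize rows as tab-delimited text."""
    return "\n".join("\t".join(row) for row in rows)


# Invented example: two time points, one with a missing measurement.
hybridizations = {
    "0 min": {"geneA": 1.2, "geneB": 0.8},
    "15 min": {"geneA": 1.5},
}
annotation = {"geneA": "DB:0001"}
rows = build_matrix(hybridizations, annotation)
print(to_tab_delimited(rows))
```

Each row corresponds to one identifier on the array and each column to one selected experimental condition, which is the layout assumed by most downstream analysis tools.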
6. Select all experimental conditions.
Now the user needs to select the “quantitation type” and the gene annotation.
7. Scroll down to the Quantitation type and the Array Annotation sections (Fig. 7.13.8,
bottom). Select Software (Unknown):Sanger Rustici:normalized as quantitation type
and select two array annotations: Database DB:genedb and Reporter name.
Figure 7.13.8 Top: Data retrieval page, Processed data group detail—Experimental conditions.
This section of the page allows the user to select the experimental conditions to be included in
the data matrix for further analysis. Bottom: Data retrieval page, Processed data group detail—
Quantitation Types and Design Element Properties. This section of the page allows the user to
select the format of normalized data and the type of annotation to be included in the data matrix
for further analysis.
The Quantitation type section lists all data formats available. For this experiment, only
one quantitation type is given (the normalized signal provided by the submitter) but for
other array platforms (e.g., Affymetrix) more types are available.
The Array Annotation session lists the annotation information available for the array
platform used.
8. Scroll down and take a look at the Raw Data Group 1. Skip the Experimental
conditions and go to Quantitation types.
This experiment used two-channel microarrays so the data extracted from each individual
feature is provided for both Cy3 and Cy5, including foreground and background intensities
(mean, median, and standard deviation), as well as ratio values and background corrected
intensities. Any combination of these parameters can be included in the final data matrix,
whenever raw data is needed.
9. Go back to Processed Data Group 1 and click on Export data.
A data matrix will be computed using all selected experimental conditions, the normalized
signal from each condition and the selected annotation for each identifier present on the
array.
10. On the new page, click on See data matrix to view the generated file and on Download
data matrix to save it onto your computer as a .TXT file.
Once the data has been retrieved, it can be analyzed using the online data analysis tool
Expression Profiler. This will be the focus of Basic Protocol 3.
11. Open a new window in your browser and query the repository for experiment
E-AFMX-5 and go to the data retrieval page to view an example of data associated with an Affymetrix experiment.
For Affymetrix arrays, the .CHP file contains the processed/normalized expression levels
of each gene on the array and the .CEL file contains the raw data for every feature on
the chip.
BASIC PROTOCOL 3
HOW TO UPLOAD, NORMALIZE, ANALYZE, AND VISUALIZE DATA IN EXPRESSION PROFILER
This protocol describes how one can upload, normalize, analyze, and visualize data in
EP.
Necessary Resources
Hardware
Suggested minimum requirements for a PC system: fast Internet connection
(64K+), graphics card supporting at least 1024 × 768, optimal resolution
1280 × 1024 (65K+ colors)
Software
Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000,
XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+
(Mac OS X)
Browse the repository and upload data in EP
1. Go to the AE homepage at http://www.ebi.ac.uk/arrayexpress.
The user will now retrieve an experiment from AE and save the data, which will then be
loaded and analyzed with EP.
2. In the Experiments box, on the left-hand side of the page, type in the experiment
accession number E-MEXP-886 and click on the “query” button.
3. Expand the experiment view for E-MEXP-886 by clicking on the experiment title
line and explore its properties.
The experiment used transcription profiling of ataxin-null versus wild-type mice to investigate spinocerebellar ataxia type 1. A total of ten Affymetrix MOE430A arrays were
used, five hybridized with wild-type and five with knock-out samples.
4. Download the raw data by clicking on the raw data icon and saving the
E-MEXP-886.raw.zip file to your PC.
This represents a quick way to export the dataset. Instead of generating a data matrix,
the entire raw dataset can be saved as a compressed archive for direct upload.
5. Go to the EP main page at http://www.ebi.ac.uk/expressionprofiler (Fig. 7.13.9).
6. Create a login by clicking on the Register new user link.
Fill in the “user registration” page with all required details, and choose a personal user
name and password, which you will use each time you log in. All the data loaded and the
analysis history will be saved and stored under this user login. With a “guest
login,” all the data and analyses will be lost at the end of each session.
Figure 7.13.9 The Expression Profiler main page (http://www.ebi.ac.uk/expressionprofiler/).
Figure 7.13.10 Upload/Expression data windows in EP. The user can directly upload data in a
variety of tabular formats (top) or in Affymetrix format (bottom).
7. Once registered, click on the EP:NG Login Page link, on the EP main page
(Fig. 7.13.9), enter your username and password, and click Login.
The user will then be taken to the data upload page (Fig. 7.13.10).
8. In the Upload/Expression data page click on the Affymetrix tab (Fig. 7.13.10,
bottom).
The Data Upload component can accept data in a number of formats including basic
tab-delimited files, such as those exported by Microsoft Excel (Tabular data option), and
Affymetrix .CEL files (Affymetrix option). The .CEL files can be uploaded by placing
them into an archive (e.g., a .ZIP file) and then uploading the archive. The .ZIP file
should contain only .CEL files from the same type of Affymetrix arrays. Users can also
select a published dataset from the AE database through the EP interface (ArrayExpress
option). A particular dataset can also be directly uploaded from a specific URL, for both
Affymetrix and tabular data.
Except for .CEL files, uploaded expression datasets must be represented as data matrices,
with rows and columns corresponding to genes and experimental conditions, respectively.
9. Browse and select the location of the E-MEXP-886.raw.zip file. Select the
data species, e.g., Mus musculus, enter a name for the experiment, and click on the
Execute button.
When uploading tabular data, the user will need to select a more specific type of data
format among tab-delimited, single-space delimited, any length white-space delimited,
Microsoft Excel spreadsheet, or custom delimiter. According to the number of annotation
columns included in the data matrix (as shown in Basic Protocol 2, when describing how
to compute a data matrix from the Detailed data retrieval page), the user will also need
to specify the position of the first data column and data row in the matrix (Fig. 7.13.10,
top).
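The role of the "first data column" and "first data row" settings can be sketched as follows. This is an illustration only, not EP's parser: the function name is hypothetical, positions are taken as 0-based, the row just above the first data row is assumed to hold the condition names, and "NA" or an empty cell is assumed to mark a missing value.

```python
def parse_data_matrix(text, first_data_row=1, first_data_col=1, delimiter="\t"):
    """Split a delimited data matrix into conditions, genes, annotations, values.

    Column 0 is taken as the gene identifier; columns between it and
    first_data_col are treated as annotation columns.
    """
    rows = [line.split(delimiter) for line in text.splitlines() if line.strip()]
    header = rows[first_data_row - 1]
    conditions = header[first_data_col:]
    genes, annotations, values = [], [], []
    for row in rows[first_data_row:]:
        genes.append(row[0])
        annotations.append(row[1:first_data_col])
        values.append([None if cell in ("", "NA") else float(cell)
                       for cell in row[first_data_col:]])
    return conditions, genes, annotations, values


# Hypothetical matrix with one annotation column, so data start in column 2.
example = ("Identifier\tReporter name\tt0\tt15\n"
           "geneA\trepA\t1.0\t1.5\n"
           "geneB\trepB\t2.0\tNA\n")
conditions, genes, annotations, values = parse_data_matrix(
    example, first_data_row=1, first_data_col=2)
```

Each additional annotation column exported from the data retrieval page shifts the first data column one position to the right, which is why the setting has to match the way the matrix was generated.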
After a successful microarray data import, the EP Data Selection view is displayed
(Fig. 7.13.11). This view has three sections as described in Table 7.13.2.
The Subselection component provides several basic mechanisms to select genes and
conditions that have particular expression values. A way to sub-select a slice of the
gene expression matrix by row or column names (partial word matching can be used
for this filter) is provided by the Select rows and Select columns tabs. Other selecting
options are: Missing values, Value ranges, Select by similarity, and “eBayes (limma).”
See Table 7.13.3 for more information on how to use these options.
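The Value ranges criterion (genes lying more than a given number of standard deviations from the mean in a minimum percentage of conditions, or simply the top N genes by standard deviation) can be sketched as below. The sketch assumes that the mean and standard deviation are computed per gene across conditions and that deviations in either direction count; EP's exact definition may differ, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def select_by_value_range(matrix, n_sd=1.5, min_fraction=0.2):
    """Keep genes deviating by more than n_sd SDs in >= min_fraction of conditions."""
    selected = []
    for gene, values in matrix.items():
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # a constant profile can never exceed the threshold
        outlying = sum(1 for v in values if abs(v - mu) > n_sd * sd)
        if outlying / len(values) >= min_fraction:
            selected.append(gene)
    return selected


def top_n_by_sd(matrix, n):
    """Alternative criterion: the n genes with the greatest standard deviation."""
    return sorted(matrix, key=lambda g: pstdev(matrix[g]), reverse=True)[:n]


# Hypothetical profiles over five conditions.
profiles = {
    "flat": [5, 5, 5, 5, 5],
    "spike": [0, 0, 0, 0, 10],
    "gradual": [1, 2, 3, 4, 5],
}
hits = select_by_value_range(profiles, n_sd=1.5, min_fraction=0.2)
```

In this toy example only the gene with a single strong response passes the filter, whereas a steadily drifting profile does not, which is the behavior that distinguishes this criterion from a plain fold-change cutoff.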
Figure 7.13.11 Data selection view in EP. This window is divided into three main sections: current
dataset (top), descriptive statistics (middle), and subselection menu (bottom). For color version of
this figure see http://www.currentprotocols.com.
Table 7.13.2 Sections in the EP Data Selection View

| Section | Description | Comments |
| --- | --- | --- |
| Current dataset | Displays the user’s folder structure, current dataset selection, and the ongoing analysis history (see Fig. 7.13.11, top) | EP stores all parameters, results, and graphics files for every performed analysis step. These can be retrieved at any stage in the analysis by clicking the View action output icon next to the respective analysis step; this is the button with the yellow arrow/magnifying glass combination. Additional icons allow the user to view the original data, as well as row and column headers, or delete a dataset previously loaded |
| Descriptive statistics | Provides some basic data visualization graphics | Graphics may include a plot of perfect match (PM) probe intensities (log-scale) for Affymetrix arrays (see Fig. 7.13.11, middle) or distribution density histograms (one- and two-channel experiments, absolute, and log-ratio data) |
| Subselection^a | A number of subsections are available with various criteria for selecting data subsets (see Fig. 7.13.11, bottom). | This portion of the page changes according to the EP component selected by the user from the top left-hand side menu |

^a See Table 7.13.3 for more information.
The menu on the top left-hand side (Fig. 7.13.11) provides links to all the EP components,
which can be used for data transformation, analysis, and visualization. The following
sections will give an overview on how to use them.
Perform data normalization in EP
10. Click on Data Normalization, under the Transformations menu on the left-hand side
of the page (Fig. 7.13.11).
EP provides a graphical interface to four commonly used Bioconductor data normalization routines, for Affymetrix and other microarray data: GCRMA, RMA, Li and Wong,
and VSN (Li and Wong, 2001; Huber et al., 2002; Irizarry et al., 2003; Wu et al., 2004).
When dealing with raw scanner output data, as in the case of Affymetrix CEL files, it is
important to normalize these data to minimize the noise levels and to make the expression
values comparable across arrays (Quackenbush, 2002). Of the four methods available in
EP, GCRMA, RMA, and Li and Wong can only be applied to Affymetrix CEL file imports,
while VSN can be applied to all types of data. An important difference between GCRMA,
RMA, and VSN that influences subsequent analysis is that as their final step the former
two algorithms take a base 2 logarithm of the data, while VSN takes the natural (base e)
logarithm.
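The practical consequence of the different logarithm bases can be shown with a short calculation. The helper names are hypothetical; the only facts used are the change-of-base rule and the meaning of a unit difference on each scale.

```python
import math


def natural_log_to_log2(value):
    """Convert a natural-log value to the base 2 scale: log2(x) = ln(x) / ln(2)."""
    return value / math.log(2)


def fold_change(difference, base):
    """Fold change implied by a difference between two log-scale values."""
    return base ** difference


# A difference of 1 between two arrays means a 2-fold change on the
# base 2 scale, but an e-fold (about 2.72-fold) change on the natural-log scale.
two_fold = fold_change(1.0, 2.0)
e_fold = fold_change(1.0, math.e)

# The same intensity (here 8 units) expressed on both scales.
as_natural_log = math.log(8.0)
as_log2 = natural_log_to_log2(as_natural_log)
```

When results from the two families of methods are to be compared with a common fold-change threshold, the values therefore need to be brought onto the same scale first.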
11. Click on the RMA tab and then click Execute.
The results of data normalization will be displayed in a new window as a Dataset heatmap
and a Post-normalization box plot of PM log intensities distribution (Fig. 7.13.12, top).
Explore the expression value distribution plots after the normalization by going back to
the previous window (Fig. 7.13.12, bottom).
12. Apply different normalization methods to the same dataset and compare outputs.
Perform data transformation in EP
13. Click on Data Transformation, under the Transformations menu on the left-hand side
of the page (Fig. 7.13.11).
The Data Transformation component is useful when the data needs to be transformed to
make it suitable for some specific analysis. For instance, if the starting data import was
a set of Affymetrix CEL files, it may be desired to look for genes whose expression varies
relative to a reference sample, i.e., to one of the imported CEL files.
The transformation options listed in Table 7.13.4 are available.

Table 7.13.3 Subselection Menu Components

| Selection option | Description | Comments |
| --- | --- | --- |
| Value ranges | Allows selecting for genes above a specified number of standard deviations of the mean in a minimum percentage of experiments. Alternatively, a slightly easier to use option is to sub-select the top N genes with greatest standard deviations; an input box is provided to specify the value N. | The value ranges option is fairly similar to the commonly applied fold change criterion (for example filtering those genes, whose expression is more than twice in a given condition than in another one), with the following main difference: it takes into account the variability of each gene across multiple conditions. Moreover, the standard deviation criterion can be easily applied to single-channel data. Both the number of standard deviations and the percentage of conditions used can be adjusted arbitrarily to obtain a sufficiently reasonable number of candidate genes for follow-up analysis. We have found that using 1.5 standard deviations in 20% of the conditions is a good starting point for this type of filtering. |
| Missing values | Filters out rows of the matrix with more than a specified percentage of the values marked as NA (Not Available). | |
| Select by similarity | Provides the functionality to supply a list of genes and, for each of those, select a specified number of most similarly expressed ones in the same dataset, merging the results in one list. | The performance of this method depends both on the initial gene selection, and on the choice of distance measure for computing the similarities (see Critical Parameters and Troubleshooting section for more on distance measures). |
| eBayes (limma) | This filter provides a simple interface to the eBayes function from the limma Bioconductor package (Smyth, 2004). | It allows specifying groups of samples and searching for differentially expressed genes between the defined groups. To specify the sample groups (factors), one can click on the Define factors button and use the dialog window that opens up to define one or several factor groups. Applying the eBayes data selection method is then as simple as selecting which factor group to use and specifying how many genes to return. |
14. Click on the Absolute-to-Relative tab and use “gene’s average value” as a reference.
15. Select No log: log 2 data (post RMA) transformation for this data since this has
already been calculated by the RMA normalization algorithm and click Execute.
Once again, the result of data transformation will be displayed in a new window as a
Dataset heatmap.
16. Explore the expression value distribution plots after the transformation by going
back to the previous window (Fig. 7.13.13).
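On data that are already on a logarithmic scale, such as RMA output, an absolute-to-relative conversion amounts to a subtraction, because a difference of logarithms is the logarithm of a ratio. The sketch below assumes this reading of the transformation; the function name and the example values are hypothetical.

```python
def absolute_to_relative(matrix, reference_column=None):
    """Express log-scale values relative to a reference.

    With reference_column=None each gene is compared with its own average
    across all arrays; otherwise the given column (0-based) is the reference.
    """
    relative = {}
    for gene, values in matrix.items():
        if reference_column is None:
            reference = sum(values) / len(values)
        else:
            reference = values[reference_column]
        relative[gene] = [v - reference for v in values]
    return relative


# Hypothetical log2 expression values for one gene on three arrays.
log2_data = {"geneA": [8.0, 10.0, 12.0]}
versus_mean = absolute_to_relative(log2_data)
versus_first = absolute_to_relative(log2_data, reference_column=0)
```

Using the gene's average as the reference centers every profile on zero, which is convenient when no array in the dataset is a natural control.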
Statistical analysis of microarray data can be significantly affected by the presence of
missing values. Therefore, it is important to estimate these values as accurately as possible
before performing any analysis. For this purpose, three methods are available under the
Transformations menu, in the Missing Value Imputation section: “replace with zeros,”
“replace with row averages,” and KNN imputation (Troyanskaya et al., 2001; Johansson
and Hakkinen, 2006).
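The difference between row-average and nearest-neighbor imputation can be sketched as follows. This is a simplified illustration, not the implementation used by EP: missing values are represented as None, neighbors are restricted to genes with complete profiles, and the Euclidean distance over the observed columns is an assumed choice.

```python
import math


def impute_row_average(matrix):
    """Replace each missing value with the mean of the observed values in its row."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        imputed.append([fill if v is None else v for v in row])
    return imputed


def impute_knn(matrix, k=2):
    """Replace each missing value with the mean of that column in the k most
    similar complete rows (similarity measured on the observed columns only)."""
    complete = [row for row in matrix if None not in row]
    imputed = []
    for row in matrix:
        present = [j for j, v in enumerate(row) if v is not None]
        if None not in row:
            imputed.append(list(row))
            continue
        if not present or not complete:
            # Nothing to compare against: fall back to the row average.
            imputed.append(impute_row_average([row])[0])
            continue

        def distance(other):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in present) / len(present))

        neighbours = sorted(complete, key=distance)[:k]
        imputed.append([
            sum(n[j] for n in neighbours) / len(neighbours) if v is None else v
            for j, v in enumerate(row)
        ])
    return imputed


# Hypothetical matrix: the second gene lacks its third measurement.
data = [[1.0, 2.0, 3.0], [1.0, 2.0, None], [10.0, 20.0, 30.0], [1.2, 2.2, 3.4]]
by_average = impute_row_average(data)
by_knn = impute_knn(data, k=2)
```

In this example the row average ignores the upward trend of the profile, whereas the neighbor-based estimate borrows the trend from the two genes that behave most similarly.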
Figure 7.13.12 Data normalization output graphs. The results of data normalization can be
viewed as a box plot of Perfect Match (PM) log intensities distribution (top) or in the descriptive
statistic view (bottom). Above the line graph, the post-normalization mean and standard deviation
values are displayed. For color version of this figure see http://www.currentprotocols.com.
Identification of differentially expressed genes
Two statistical approaches are available for the identification of differentially expressed
genes: t-test analysis and standard multivariate analysis methods, such as Principal Component Analysis and Correspondence Analysis.
Via the t-test component
17. Go to Statistics and click on “t-test Analysis” on the left-hand side of the page
(Fig. 7.13.11).
The t-test component under the Statistics menu provides a way to apply this basic statistical test for comparing the means of two distributions in the following differentially
expressed gene identification situations: looking for genes expressed significantly above
background/control, or looking for genes expressed differentially between 2 sets of conditions. In the first case (“one class” option), the user specifies either the background level
to compare against, or selects the genes in the dataset that are to be used as controls. In
the second case (“two classes in one dataset” option), the user specifies which columns in
the dataset represent the first group of conditions and which represent the second group.
The user will now try an example of the latter case.
Table 7.13.4 Transformation Menu Components

| Transformation option | Description |
| --- | --- |
| Intensity → Log-Ratio | Takes a set of two-channel arrays, divides every channel 1 column by the respective channel 2 column, and then, optionally, takes a logarithm of the ratio |
| Ratio → Log-Ratio | Log-transforms the selected dataset |
| Average Row Identifiers | Replaces multiple rows containing the same identifier with a single row, containing the column-wise averages |
| K-Nearest Neighbor Imputation | Fills in the missing values in the data matrix, as in Troyanskaya et al. (2001) |
| Transpose Data | Switches the rows and columns of the matrix |
| Absolute-to-Relative^a | Converts from absolute expression values to relative ones, either relative to a specified column of the dataset, or relative to the gene’s mean. |
| Mean-center^b | Rescales the rows and/or columns of the matrix to zero-mean. |

^a This transformation is useful if there is no reference sample in the dataset, but relative values are still desired for some specific type of analysis.
^b It can be used for running ordination-based methods in order to standardize the data and avoid superfluous scale effects in Principal Components Analysis, for instance.
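Three of the transformations in Table 7.13.4 are simple enough to sketch directly. The function names are hypothetical, the logarithm is fixed to base 2 for the example, and the code illustrates the operations described in the table rather than EP's own implementation.

```python
import math
from collections import defaultdict


def intensity_to_log2_ratio(channel1, channel2):
    """Divide channel 1 by channel 2 feature by feature and take log2 of the ratio."""
    return [math.log2(a / b) for a, b in zip(channel1, channel2)]


def average_row_identifiers(rows):
    """Collapse rows that share an identifier into one row of column-wise averages.

    rows is a list of (identifier, values) pairs.
    """
    grouped = defaultdict(list)
    for identifier, values in rows:
        grouped[identifier].append(values)
    return {
        identifier: [sum(column) / len(column) for column in zip(*profiles)]
        for identifier, profiles in grouped.items()
    }


def mean_center(values):
    """Shift a row (or column) so that its mean is zero."""
    mu = sum(values) / len(values)
    return [v - mu for v in values]


# Hypothetical two-channel intensities for two features.
log_ratios = intensity_to_log2_ratio([8.0, 1.0], [2.0, 2.0])

# Hypothetical duplicate probes for the same identifier.
collapsed = average_row_identifiers(
    [("geneA", [1.0, 3.0]), ("geneA", [3.0, 5.0]), ("geneB", [2.0, 2.0])])
```

Taking the logarithm of the ratio makes up- and down-regulation symmetric around zero: a four-fold increase becomes +2 and a two-fold decrease becomes -1.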
Figure 7.13.13 Data transformation output graph. The transformed data is now shown in the
descriptive statistic view. At the top of the graph, the post-transformation mean and standard
deviation values are displayed. For color version of this figure see http://www.currentprotocols.com.
18. Click on Two classes in one dataset tab, type in 1-5 for Class 1 (wild-type mice)
and 6-10 for Class 2 (knock-out mice), and click on Execute.
Upon execution, the t-test calculates, for each gene, the mean in both
groups being tested (when testing against controls, the mean over all control genes is
taken as the second group mean) and compares the difference between the two means,
scaled by its standard error, to the theoretical t-distribution (Manly et al., 2004). The
reliability of the test depends on the number of samples in each group (this is the number
of biological replicates) and is reflected in the confidence intervals that are produced
along with the p-values. Here, the p-value is the probability of observing a difference
between the two means at least as large as the one measured if the gene were not
differentially expressed, so a small p-value points to differential expression. A
table of p-values, confidence intervals, and gene names is output (Fig. 7.13.14, left-hand
side), as well as a plot of the top 15 genes found (Fig. 7.13.14, right-hand side), subject to
the user-specified p-value cutoff, which defaults to 0.01.
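The per-gene calculation for the "two classes in one dataset" case can be sketched as below. The sketch uses Welch's unequal-variance form of the t-statistic, which is an assumption; the text does not state which variant EP applies. It stops at the statistic and its degrees of freedom, since converting these to a p-value requires the t-distribution, which the Python standard library does not provide. Function names and column indices (0-based) are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group1, group2):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = variance(group1), variance(group2)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (mean(group1) - mean(group2)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def rank_genes(matrix, class1, class2):
    """Rank genes by the absolute t-statistic between two sets of columns."""
    results = []
    for gene, values in matrix.items():
        g1 = [values[i] for i in class1]
        g2 = [values[i] for i in class2]
        try:
            t, df = welch_t(g1, g2)
        except ZeroDivisionError:
            continue  # constant profile: no variance to test against
        results.append((gene, t, df))
    return sorted(results, key=lambda r: abs(r[1]), reverse=True)


# Hypothetical data: columns 0-4 are wild type, columns 5-9 are knock-out.
expression = {
    "changed": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "noisy": [1, 5, 2, 6, 3, 2, 6, 1, 5, 3],
    "flat": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
}
ranking = rank_genes(expression, class1=range(0, 5), class2=range(5, 10))
```

With five replicates per class the degrees of freedom are small, which is one reason the confidence intervals mentioned above are worth inspecting alongside the p-values.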
Figure 7.13.14 t-test analysis output graphs. The t-test analysis results are summarized in a
table, where the genes are ranked according to the p-value, with the most significant genes at the
top (left). The top 15 genes are also plotted in a graph (right).
An issue that arises when running the t-test on datasets with large numbers of genes is
the multiple testing problem (Pounds, 2006). For example, performing 10,000 t-tests on
a dataset of 10,000 rows, with low p-values occurring by chance at a rate of, e.g., 5%,
is expected to yield about 500 genes (10,000 × 0.05) that are falsely identified as
differentially expressed. A number of standard corrections are implemented, including
the Bonferroni, Holm, and Hochberg corrections (Holm, 1979; Hochberg, 1988; Benjamini
and Hochberg, 1995), which adjust the p-values upward in order to account for the
possibly high numbers of false positives. The user
can select any of them from the Multiple testing correction drop-down menu.
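How such corrections change a list of p-values can be sketched with three standard procedures: Bonferroni, Holm's step-down method, and the Benjamini-Hochberg false discovery rate adjustment. The sketch follows the textbook definitions and is not taken from EP; the function names and the example p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjustment: the i-th smallest p-value is multiplied by
    (m - i + 1), and adjusted values are forced to be non-decreasing."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjustment controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = {
    "bonferroni": bonferroni(raw),
    "holm": holm(raw),
    "benjamini_hochberg": benjamini_hochberg(raw),
}
```

Bonferroni is the most conservative of the three, Holm is never more conservative than Bonferroni, and the Benjamini-Hochberg adjustment is the most permissive because it controls the expected proportion of false positives among the selected genes rather than the chance of any false positive.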
19. Go to Data selection and click on “select by row.”
The user can now find out which gene corresponds to any of the top 15 Affymetrix probe
IDs just identified running the t-test analysis. Click on the small top table icon, type in
the text box an Affymetrix ID, and click search. The corresponding gene symbol, gene
description, and chromosome location will be returned in the result window.
Via the Ordination between group analysis
20. Go to Ordination-based menu and click on Between Group Analysis.
The Between Group Analysis component under the Ordination menu provides a statistically rigorous framework for a more comprehensive multigroup analysis of microarray
data. Between Group Analysis (BGA) is a multiple discriminant approach that is carried
out by ordinating the specified groups of samples and projecting individual sample locations on the resulting axes (Culhane et al., 2002). The ordination step involved in BGA,
as implemented in EP, can be either Principal Components Analysis (PCA), or Correspondence Analysis (COA), both standard statistical tools for reducing the dimensionality
of the dataset being analyzed by calculating an ordered set of values that correspond to
greatest sources of variation in the data and using these values to “reorder” the genes
and samples of the matrix. BGA combined with COA is especially powerful, because it
provides a simultaneous view of the grouped samples and the genes that most facilitate
the discrimination between them. The BGA component’s algorithms are provided through
an interface to the Bioconductor package made4 (Culhane et al., 2005), which, in turn,
refers to the R multivariate data analysis package ade4.
21. Click on the Define new factors icon.
In the new window, click on the Add factor button. In this example, we want to identify the
genes which are differentially expressed between 2 conditions: wild-type and knock-out
mice. The top 5 data files are wild type and the bottom 5 are knock-out. Select a name
for the new experimental factor (e.g., WT/KO), fill in the table as shown in Figure 7.13.15,
and click Save factor. The newly created experimental factor will now appear in the
Factors box, in the BGA window, and can be selected as a parameter for the analysis
(Fig. 7.13.16).
Figure 7.13.15 Define new factor window. When running an ordination-based technique, the
user might need to create a new experimental factor in order to identify the genes differentially
expressed between 2 conditions. In this example, the genotype is the discriminating factor (wild
type versus knock-out) and the new factor can be created filling the table as shown.
Figure 7.13.16 Between Group Analysis window in EP. The user can select which factor determines the group for the analysis, the type of transformation to use, and the output options.
22. Select the WT/KO factor. From the top drop-down menu select either COA or PCA
(Fig. 7.13.16). The user can also decide to replace the missing values with row
averages or leave them in place. Different output graphics can also be added, if
needed. For this example, we will leave the default parameters. Click Execute.
The “overall plot” provides a graphical representation of the most discriminating arrays
and/or genes. In addition to the plot, BGA produces two numerical tables, the table of
gene coordinates and the table of array coordinates. The gene coordinates table is of
special interest, because it provides, for each gene, a measure of how variable that gene
is in each of the identified strong sources of variation. The sources of variation (principal
axes/components) are ordered from left to right. In this example we only have one main
source of variation (component 1). Thus, genes that have the highest or lowest values
in the first column of the gene coordinates table make up the likeliest candidates for
differential expression.
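Why a two-group analysis yields a single axis can be seen from a simplified calculation. When the ordination is a PCA of the group means and there are only two groups, the centered group means are collinear, so the one discriminating axis is the direction joining the two group centroids and each gene's coordinate is proportional to its difference in group means. The sketch below relies on that simplification; it is not the made4 implementation, it does not cover COA, and the function names are hypothetical.

```python
import math


def two_group_axis(matrix, group1, group2):
    """Gene coordinates on the single between-group axis (unit length)."""
    difference = {}
    for gene, values in matrix.items():
        m1 = sum(values[i] for i in group1) / len(group1)
        m2 = sum(values[i] for i in group2) / len(group2)
        difference[gene] = m1 - m2
    norm = math.sqrt(sum(d * d for d in difference.values()))
    if norm == 0:
        raise ValueError("the two group centroids coincide")
    return {gene: d / norm for gene, d in difference.items()}


def array_coordinates(matrix, axis):
    """Project each array, after centering every gene, onto the axis."""
    n_arrays = len(next(iter(matrix.values())))
    gene_means = {g: sum(v) / n_arrays for g, v in matrix.items()}
    return [
        sum(axis[g] * (matrix[g][j] - gene_means[g]) for g in matrix)
        for j in range(n_arrays)
    ]


# Hypothetical data: arrays 0-1 form one group, arrays 2-3 the other.
expression = {"up": [1, 1, 3, 3], "down": [3, 3, 1, 1], "flat": [2, 2, 2, 2]}
axis = two_group_axis(expression, group1=[0, 1], group2=[2, 3])
arrays = array_coordinates(expression, axis)
candidates = sorted(axis, key=lambda g: abs(axis[g]), reverse=True)
```

Genes at either extreme of the axis are the ones that separate the groups, which is the reasoning behind reading the first column of the gene coordinates table from both ends. The sign of the axis is arbitrary.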
The user can also try using the “column ID” as a discriminating factor (Fig. 7.13.16)
and run BGA as just described. The results page will now include some additional graphs,
including the Eigenvalues histogram and a scatter plot showing how the data is separated
in three-dimensional space.
PCA and COA can also be run independently of BGA, selecting the Ordination option
under the Ordination-based menu.
BASIC PROTOCOL 4
HOW TO PERFORM CLUSTERING ANALYSIS IN EXPRESSION PROFILER
This protocol describes all the clustering options available in EP.
Necessary Resources
Hardware
Suggested minimum requirements for a PC system: fast Internet connection
(64K+), graphics card supporting at least 1024 × 768, optimal resolution
1280 × 1024 (65K+ colors)
Software
Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000,
XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+
(Mac OS X)
1. Explore the options available under the Clustering menu on the left-hand side of the
main EP page (Fig. 7.13.11).
Clustering analysis is an unsupervised/semi-supervised approach to looking for trends
and structures in large, multi-dimensional datasets, such as microarray gene expression
data. EP provides fast implementations of two classes of clustering algorithms: hierarchical clustering and flat partitioning, in the Hierarchical and K-means/K-medoids
clustering components, respectively, as well as a novel method for comparing the results
of such clustering algorithms in the Clustering Comparison component. The Signature
Algorithm component is an alternative approach to clustering-like analysis, based on the
method by Ihmels et al. (2002).
All clustering algorithms are essentially aimed at grouping objects, such as genes, together, according to some measure of similarity, so that objects within one group or cluster
are more similar to each other than to objects in other groups. Clustering analysis involves
one essential elementary concept: the definition of similarity between objects, also known
as a distance measure. EP implements a wide variety of distance measures for clustering
analysis (all distance measures can be found in the Distance measure drop-down menu).
The Euclidean distance and the Correlation-based distance represent the 2 most commonly applied measures of similarity. The Euclidean metric measures absolute differences
in expression levels, while the Correlation-based captures similar relative trends in expression profiles. In time series data, for instance, where one is interested in finding clusters
of genes that follow a similar pattern of expression over a period of time, the correlation
distance often produces the most informative clusters, while in treatment comparison
experiments, where one seeks genes that changed significantly between treated samples, Euclidean distance may perform better. Note that the data normalization method
used may also influence the outcome. The user can practice combining different clustering
methods and distance measures to find the optimal combination for each dataset.
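The contrast between the two measures can be made concrete with three short profiles. The correlation-based distance is taken here as one minus the Pearson correlation coefficient, which is a common definition but an assumption as far as EP's exact formula is concerned; the profile values are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Absolute difference between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """One minus the Pearson correlation; undefined for constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - covariance / (sx * sy)


# Three hypothetical time courses.
low_rising = [1, 2, 3, 4]
high_rising = [10, 20, 30, 40]   # same trend, much higher level
low_falling = [4, 3, 2, 1]       # similar level, opposite trend

same_trend = (euclidean_distance(low_rising, high_rising),
              correlation_distance(low_rising, high_rising))
opposite_trend = (euclidean_distance(low_rising, low_falling),
                  correlation_distance(low_rising, low_falling))
```

By Euclidean distance the rising profile is closest to the falling one, because their levels are similar; by correlation distance it is closest to the other rising profile, because their shapes agree. This is why the choice of measure depends on whether level or pattern is of interest.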
2a. Click Hierarchical Clustering.
Hierarchical clustering is an agglomerative approach in which single expression profiles
are joined to form groups, which are further joined until the process has been completed,
forming a single hierarchical tree (Fig. 7.13.17, left-hand side). The user can perform
hierarchical clustering by choosing a data set of interest from the project tree and then
choosing the Hierarchical option from the Clustering menu. The user then needs to specify
which distance measure and clustering algorithm to use to calculate the tree. Different
algorithms can be used to calculate the distance between clusters: single, complete,
average, or average group linkage (Quackenbush, 2001). Additionally, the user can
choose whether to cluster only rows (genes), only columns (experimental conditions),
or both. The output provides a visual display of the generated hierarchy in the form
of a dendrogram or tree, attached to a heatmap representation of the clustered matrix
(Fig. 7.13.17, left-hand side). Individual branches of the tree, corresponding to clusters
of genes or conditions, can be sub-selected (by clicking on a node) and saved for further
analysis. However, the hierarchical clustering tree produced for large datasets can be
difficult to interpret.
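The agglomerative procedure can be sketched in a few lines for the average linkage case. This is a deliberately naive illustration (it recomputes all pairwise cluster distances at every step) and is not the fast implementation provided by EP; the function names and profiles are hypothetical, and Euclidean distance is used only as a default.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(profiles, distance=euclidean):
    """Return the merge history as (cluster_a, cluster_b, height) tuples.

    Each cluster is a tuple of gene names; the last merge is the tree root.
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def linkage(a, b):
        # Average linkage: mean distance over all pairs drawn from a and b.
        pairs = [(x, y) for x in a for y in b]
        return sum(distance(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        height, i, j = min(
            (linkage(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Four hypothetical genes forming two obvious pairs.
genes = {"g1": [0, 0], "g2": [0, 1], "g3": [5, 5], "g4": [5, 6]}
tree = average_linkage(genes)
```

The heights recorded at each merge are what a dendrogram displays as branch lengths. Single or complete linkage would be obtained by replacing the mean in the linkage function with the minimum or maximum pairwise distance.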
2b. Click Flat Partitioning.
The K-means/K-medoids clustering component provides two flat partitioning methods,
similar in their design. Both approaches are based on the idea that, for a specified
number K, K initial objects are chosen as cluster centers, the remaining objects in the
dataset are iteratively reshuffled around these centers, and new centers are chosen to
maximize the similarity within each cluster, at the same time maximizing the dissimilarity
between clusters. The main practical difference between the two methods implemented
in this component is that K-medoids can efficiently use any distance
measure available in EP, while K-means is limited to the Euclidean and Correlation-based measures. Once again, the user needs to select a distance measure to be used, the
number of clusters K, and the initializing method, choosing between initializing by most
distant (average) genes, by most distant (minimum) genes, or by random genes. In the
output, each cluster is visualized by a heatmap and a multi-gene lineplot (Fig. 7.13.17,
right-hand side).
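The iterative reassignment described above can be sketched for the K-medoids case, where cluster centers are always actual genes and any distance function can therefore be plugged in. The sketch uses a simple alternating scheme with a farthest-first initialization; both are assumptions made for illustration and may differ from EP's initialization options. It expects K to be no larger than the number of genes, and the names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_medoids(profiles, k, distance=euclidean, max_iter=100):
    """Partition genes into k clusters; returns {medoid: [member names]}."""
    names = list(profiles)
    d = {(a, b): distance(profiles[a], profiles[b]) for a in names for b in names}

    # Farthest-first initialization: start from the first gene, then keep
    # adding the gene that is farthest from all medoids chosen so far.
    medoids = [names[0]]
    while len(medoids) < k:
        candidates = [n for n in names if n not in medoids]
        medoids.append(max(candidates, key=lambda n: min(d[n, m] for m in medoids)))

    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each gene joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for n in names:
            clusters[min(medoids, key=lambda m: d[n, m])].append(n)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(d[c, o] for o in members)) if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


# Four hypothetical genes forming two obvious groups.
genes = {"g1": [0, 0], "g2": [0, 1], "g3": [5, 5], "g4": [5, 6]}
partition = k_medoids(genes, k=2)
```

Because only pairwise distances are needed, the same code works unchanged with a correlation-based or any other distance function, which mirrors the flexibility attributed to K-medoids in the text.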
2c. Click Clustering Comparison.
A commonly encountered problem with hierarchical clustering is that it is difficult to
identify branches within the hierarchy that in some way form optimally tight clusters. Indeed, it is rare that one can clearly identify a definite number of distinct clusters from the
dendrogram in real-world data. Similarly, in the case of flat partitioning, the determination of the number of desired clusters is often arbitrary and unguided. The Clustering
Comparison component aims to alleviate these difficulties by providing an algorithm and
a visual depiction of a mapping between a dendrogram and a set of flat clusters (Torrente
et al., 2005; Fig. 7.13.17). The clustering comparison component not only provides an
informative insight into the structure of the tree by highlighting the branches that best
correspond to one or more flat clusters from the partitioning, but also can be useful when
comparing the hierarchical clustering to a predefined functionally meaningful grouping of
the genes. For comparison between a pair of flat partitioning clusterings, it can be used as
a process to establish the optimal parameter K by starting with a high number of clusters
and letting the comparison algorithm identify the appropriate number of superclusters.
Adjustable clustering comparison parameters are also provided. The user can specify the
“number of steps to look ahead in the tree search,” the type of “scoring function” to use,
and select an “overlapping index computation method” (Torrente et al., 2005).
Figure 7.13.17 A comparison between hierarchical clustering (correlation-based distance, average linkage) and k-means clustering (correlation-based distance, k = 5) in the S. pombe stress response dataset
E-MEXP-29. The normalized data was retrieved from ArrayExpress as described in Basic Protocol 2, step 2
and loaded into EP as a tab-delimited file. Data were log transformed and the 140 most varying genes (>0.9 SD
in 60% of the hybridizations) were selected for clustering comparison. For additional information refer to Torrente et al.
(2005). Line thickness is proportional to the number of elements common to both sets. By placing the mouse
cursor over a line, a Venn diagram is displayed showing the number of elements in the 2 clusters and the overlap.
For color version of this figure see http://www.currentprotocols.com.
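The quantities shown in the comparison view (the number of elements shared by two clusters and the number unique to each) can be computed with simple set operations. The sketch below covers only this counting step; it does not implement the tree search or scoring of Torrente et al. (2005), and the function name, cluster labels, and gene names are hypothetical.

```python
def overlap_table(clustering_a, clustering_b):
    """Count shared and unique members for every pair of overlapping clusters.

    Each clustering maps a cluster label to a list of gene names.
    """
    table = {}
    for label_a, members_a in clustering_a.items():
        for label_b, members_b in clustering_b.items():
            set_a, set_b = set(members_a), set(members_b)
            shared = set_a & set_b
            if shared:
                table[label_a, label_b] = {
                    "only_a": len(set_a - shared),
                    "shared": len(shared),
                    "only_b": len(set_b - shared),
                }
    return table


# Hypothetical tree branches compared with hypothetical flat clusters.
branches = {"branch1": ["g1", "g2", "g3"], "branch2": ["g4", "g5"]}
flat = {"k1": ["g1", "g2"], "k2": ["g3", "g4", "g5"]}
overlaps = overlap_table(branches, flat)
```

Each entry corresponds to one connecting line in the comparison view: the shared count determines how strongly two clusters are linked, and the three counts together are the contents of the associated Venn diagram.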
2d. Click Signature algorithm.
Signature Algorithm is the R implementation of an algorithm previously described (Ihmels
et al., 2002). It identifies a co-expressed subset in a user-submitted set of genes, removes
unrelated genes from the input, and identifies additional genes in the same dataset that
follow a similar pattern of expression. Co-expression is identified with respect to a subset
of conditions, which is also provided as the output of the algorithm. It is a fast algorithm
useful for exploring the modular structure of expression data matrices.
BASIC PROTOCOL 5
HOW TO CALCULATE GENE ONTOLOGY TERM ENRICHMENT IN EXPRESSION PROFILER
This protocol describes how to calculate and visualize Gene Ontology term enrichment
in EP.
Necessary Resources
Hardware
Suggested minimum requirements for a PC system: fast Internet connection
(64K+), graphics card supporting at least 1024 × 768, optimal resolution
1280 × 1024 (65K+ colors)
Software
Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000,
XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+
(Mac OS X)
1. Click on Gene Ontology, under the Annotation menu on the left-hand side of the
main EP page (Fig. 7.13.11).
Gene Ontology is a controlled vocabulary used to describe the biology of a gene product
in any organism (Ashburner et al., 2000). There are 3 independent sets of vocabularies,
or ontologies, which describe the molecular function of a gene product, the biological
process in which the gene product participates, and the cellular component where the
gene product can be found.
Once a subset of genes of interest has been identified, through one or several of the
approaches described so far, the user can look for GO terms enriched in the given gene
list (e.g., a particular cluster obtained with flat partitioning).
2. Enter a list of gene identifiers in the Gene IDs box, select a GO category of interest
(such as biological process) and enter a p-value cutoff (default is 0.05). A multiple
testing correction can be selected from the drop-down menu to reduce the number
of false positives. Click Execute.
Results will be displayed as a tree view of GO terms and genes associated with each term.
In addition, a table will summarize the results, showing for each GO term its observed
frequency in the gene list and its genomic frequency, the p-value associated with the enrichment, and
the genes related to each category (Fig. 7.13.18).
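The enrichment calculation behind step 2 can be sketched as follows. This is a minimal illustration, assuming a one-sided hypergeometric test; this unit does not state which test EP applies to GO terms, and the function name and the example counts are hypothetical.

```python
from math import comb

def enrichment_p_value(hits_in_list, list_size, hits_in_genome, genome_size):
    """Upper-tail hypergeometric p-value: the probability of drawing at least
    `hits_in_list` annotated genes in a random list of `list_size` genes."""
    upper = min(list_size, hits_in_genome)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(hits_in_genome, k) * comb(genome_size - hits_in_genome, list_size - k)
        for k in range(hits_in_list, upper + 1)
    )
    return tail / total

# Hypothetical counts: 8 of the 40 genes in the list carry the GO term,
# against 100 of 6000 genes in the genome.
observed_freq = 8 / 40
genomic_freq = 100 / 6000
p = enrichment_p_value(8, 40, 100, 6000)
print(f"observed {observed_freq:.3f}, genomic {genomic_freq:.3f}, p = {p:.3g}")
```

A term would be reported when its p-value, after any multiple testing correction, falls below the chosen cutoff.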
Figure 7.13.18 Gene ontology annotation output. The results of GO terms enrichment for a
given gene list are summarized in this table.
BASIC
PROTOCOL 6
HOW TO CALCULATE CHROMOSOME CO-LOCALIZATION
PROBABILITY IN EXPRESSION PROFILER
This protocol describes how to use the ChroCoLoc application in EP for calculating the
probability of chromosome co-localization of a set of co-expressed genes.
Necessary Resources
Hardware
Suggested minimum requirements for a PC system: fast Internet connection
(64K+), graphics card supporting at least 1024 × 768, optimal resolution
1280 × 1024 (65K+ colors)
Software
Internet browser supported: Internet Explorer 6 and 7 (platforms: Windows 2000,
XP, Vista), Firefox 1.5+ (all platforms), Opera 9 (all platforms), and Safari 2.0+
(Mac OS X)
1. Click on ChroCoLoc, under the Annotation menu on the left-hand side of the main
EP page (Fig. 7.13.11).
The ChroCoLoc feature allows the user to calculate the probability of groups of co-expressed genes co-localizing on chromosomes (Blake et al., 2006). This application uses
a hypergeometric distribution to describe the probability of x co-expressed genes being
located in the same region, when there are y genes in that region, from the total population
of genes in the study. Karyotypes for human, mouse, rat, and Drosophila are currently
provided.
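The hypergeometric calculation described above can be written out as follows. This is a sketch under stated assumptions: x and y are used as in the text, while N (the total number of genes in the study) and n (the total number of co-expressed genes) are symbols introduced here for illustration. Whether ChroCoLoc reports the point probability or the upper tail is not stated in this unit, so both are shown.

```latex
P(X = x) = \frac{\binom{y}{x}\,\binom{N - y}{n - x}}{\binom{N}{n}},
\qquad
P(X \ge x) = \sum_{k = x}^{\min(n,\,y)} \frac{\binom{y}{k}\,\binom{N - y}{n - k}}{\binom{N}{n}}
```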
2. Select a group of co-expressed genes previously identified and use any of the additional filters to remove chromosome regions which might not contain statistically
significant features.
Filters can be set for the minimum number of features per region or the minimum percentage of features regulated in the region. The calculated probabilities can be adjusted
for multiple testing using the Bonferroni correction.
Regions that pass the filter criteria are plotted on the selected karyotype, colored according to the probability of the observed co-localization occurring (Fig. 7.13.19).
Figure 7.13.19 Chromosome co-localization output. Probabilities of co-localization of regulated
genes are plotted onto a human karyogram. Chromosomal regions are colored according to
decreasing probability of co-localization occurring by chance, with red ≤ 0.01, orange ≤ 0.02,
yellow ≤ 0.03, light blue ≤ 0.04 and green ≤ 0.05. For color version of this figure see
http://www.currentprotocols.com.
A table shows the calculated probabilities, which can also be plotted as percentage of
co-expressed genes in a particular region or as absolute number of genes co-expressed
per region.
GUIDELINES FOR UNDERSTANDING RESULTS
The basic protocols in this unit describe: (1) how to query and interpret data from the AE
Warehouse of Gene Expression Profiles (Basic Protocol 1); (2) how to query, interpret,
and retrieve data from the AE Repository of Microarray and Transcriptomics data (Basic
Protocol 2); and (3) how to upload and analyze gene expression data in EP (Basic
Protocols 3, 4, 5, and 6).
The aim of these protocols is to familiarize the user with the AE database content,
showing that the database can be queried in different ways (e.g., at the experiment or at
the gene level) and how to interpret the results of different queries.
In addition, we have shown how microarray data can be exported from AE and subsequently imported into an external data analysis tool, such as EP. Different normalization,
transformation, hypothesis testing, and clustering methods have been illustrated with the
aim of giving the reader an overview of some of the most important steps in the analysis
of gene expression data. It is extremely important for the user to have at least a basic
understanding of these methods before drawing conclusions regarding the results.
A few additional concepts have been left out but are worth mentioning in the context of
microarray research and data analysis.
1. The most crucial aspect of any microarray experiment is the experimental design. If
the experiment is not designed properly, no analysis method will be able to obtain
valid conclusions from the data. When designing an experiment, one should always
take into account: (i) which samples are compared; (ii) how many experimental factors are involved and what they are; (iii) how many biological/technical replicates
will be performed; and (iv) which reference would be most appropriate.
2. Array quality assessment is an aspect that should always be included among the
goals of data analysis. Such quality measures allow data coming
from below-standard arrays to be discarded and possible causes of failure
in the microarray process to be identified.
3. Any hypothesis generated from a microarray experiment needs to be validated independently in order to establish the “robustness” or “reliability” of a microarray
finding. Experimental validation is essential especially when considering the inherently noisy nature of microarray data.
COMMENTARY
Background Information
Central concepts: Experiment and dataset
The highest level of organization in the AE
Repository is the experiment, which consists
of one or more hybridizations, usually linked
to a publication. The query interface provides
the ability to query the experiments via free
text search (e.g., experiment accession numbers, authors, and publication details) and filter the experiments by species or array design. Once an experiment has been selected,
the user can examine the description of the
samples and protocols by navigating through
the experiment, or can download the dataset
associated with the experiment for analyzing
it locally (see Basic Protocol 2). Once downloaded, the dataset can be visualized and analyzed online using Expression Profiler (see
Basic Protocols 3, 4, 5, and 6) or other analysis tools.
The dataset is the central object that the user
provides as input for analyses in EP. A dataset
can be one of two types: raw or normalized.
The raw dataset is the spot-intensity data and
can be input to a normalization procedure, giving, as a result, a gene expression data matrix
(see Basic Protocol 3). The normalized dataset
is already in the format of a gene expression
data matrix. This can be input to a selection
of analysis components, such as t-test and
clustering.
Gene expression data matrix
In a gene expression data matrix, each row
represents a gene and each column represents
an experimental condition or array. An entry
in the data matrix usually represents the expression level or expression ratio of a gene
under a given experimental condition. In addition to numerical values, the matrix can also
contain additional columns for gene annotation or additional rows for sample annotation,
as textual information. Gene annotation may
include gene names, sequence information,
chromosome location, description of the functional role for known genes, and links to the
respective entries in sequence databases. Sample annotation may provide information about
the organism part from which the sample was
taken, or cell type used, or whether the sample
was treated, and, if so, what the treatment was
(e.g., compound used and concentration). An
example of how to generate a gene-expression
data matrix from an experiment deposited in
the AE Repository is given in Basic Protocol 2.
Gene expression profile
Another important concept is that of the
gene expression profile. An expression profile
describes the (relative) expression levels of a
gene across a set of experimental conditions
(e.g., a row in the data matrix) or the expression
levels of a set of genes in one experimental
condition (e.g., a column in the data matrix).
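The data matrix and the two kinds of expression profile described above can be sketched as a small data structure. This is a minimal illustration only: the gene identifiers, condition names, annotations, and values are hypothetical, and the layout is not the file format used by EP.

```python
# Hypothetical condition names (columns) and log-ratio values (rows = genes).
conditions = ["control_1", "control_2", "treated_1", "treated_2"]
matrix = {
    "geneA": [0.1, -0.2, 1.9, 2.1],
    "geneB": [0.0, 0.3, -1.1, -0.9],
    "geneC": [0.2, 0.1, 0.0, -0.1],
}

# Optional textual annotation: extra columns per gene, extra rows per sample.
gene_annotation = {"geneA": "kinase", "geneB": "transporter", "geneC": "unknown"}
sample_annotation = {
    "control_1": "untreated", "control_2": "untreated",
    "treated_1": "compound X, 10 uM", "treated_2": "compound X, 10 uM",
}

def gene_profile(gene):
    """A row of the matrix: one gene across all conditions."""
    return dict(zip(conditions, matrix[gene]))

def condition_profile(condition):
    """A column of the matrix: all genes in one condition."""
    j = conditions.index(condition)
    return {gene: values[j] for gene, values in matrix.items()}

print(gene_profile("geneA"))
print(condition_profile("treated_2"))
```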
The AE Data Warehouse supports queries
on gene expression profiles using (1) gene
names, identifiers, or properties such as GO
terms; (2) information on which family a gene
belongs to or the motifs and domains it contains (InterPro terms); and (3) sample properties. The user can retrieve and visualize the
gene expression values for multiple experiments. The expression profiles are visualized
using line plots and a similarity search can
be run to find genes with similar expression
level within the same experiment (see Basic
Protocol 1).
Critical Parameters and
Troubleshooting
Many issues are related to microarray data
analysis and it is beyond the scope of this unit
to address them all. In this unit we only want
to draw the attention of the reader to some of
these issues and how they can be addressed
within Expression Profiler. The reader should
also refer to specific discussion provided in
individual protocol steps.
Data normalization and transformation
Microarrays, like any other biological experiment, are characterized by systematic variation between experimental conditions unrelated to biological differences. For example,
when dealing with gene expression data, true
differential expression between two conditions
(e.g., normal versus disease) might be masked
by several biases introduced throughout the
experiment (e.g., different amount of starting
RNA, unequal labeling efficiency, and uneven
hybridization).
Normalization aims to compensate for systematic technical differences between arrays,
to see more clearly the systematic biological
differences between samples. Without normalization, data from different arrays cannot be
compared.
Although the aim of normalization stays
the same, the algorithms used for normalizing
2-color arrays differ from those used for normalizing 1-color arrays (e.g., Affymetrix).
Four normalization algorithms are implemented in EP, three specific for Affymetrix
data (i.e., RMA, GCRMA, and Li and Wong)
and one (VSN) that can be applied to all data
types. It is advised to apply more than one normalization method to each dataset and compare the outputs in order to identify which
method might be more appropriate for a given
dataset (see Basic Protocol 3, step 2).
In the context of microarray data analysis,
expression ratio values are transformed to logratio values. This transformation is sometimes
carried out by the normalization algorithm or
might need to be carried out separately.
Log transforming the data is of great importance. In the linear ratio space, the ratio values
are centered around a value of 1. All genes that
are up-regulated have values >1, with no upper bound. All genes that are down-regulated
have ratio values compressed between 0 and
1. In this situation the distribution of linear
ratio values is clearly not normal. By transforming all ratio values to log-space (base 2), up- and down-regulation
are treated symmetrically around 0, and the distribution is typically much closer to normal. Genes
up-regulated by 2-fold are now at a value of
+1, while 2-fold repression is quantified by a
value of –1.
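The effect of the transformation can be checked with a few lines of code. This is a minimal sketch assuming base-2 logarithms, which is what the +1 and –1 values for 2-fold changes imply; the ratio values are hypothetical.

```python
import math

# Hypothetical expression ratios: 4-fold and 2-fold up, unchanged,
# 2-fold and 4-fold down.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

# In linear space the down-regulated values are squeezed between 0 and 1;
# in log2 space they mirror the up-regulated values around 0.
log_ratios = [math.log2(r) for r in ratios]
print(log_ratios)  # [2.0, 1.0, 0.0, -1.0, -2.0]
```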
Several data transformation options are
given in EP and can be used to convert the
data to the desired format. Following normalization, the user should always verify that the
data are approximately normally distributed and apply the appropriate transformation, if needed (see Basic
Protocol 3, step 3).
Identification of differentially expressed
genes
In many cases, the purpose of a microarray experiment is to compare gene expression
levels in two different conditions (e.g., normal versus disease). A wide range of methods
is available for the selection of differentially
expressed genes but some (e.g., a fixed fold-change
threshold) are arbitrary and inadequate because they ignore the variability of the measurements.
A more robust method for selecting differentially regulated genes is a classical hypothesis testing approach, such as a t-test. A
t-test assesses whether the means of two
groups are statistically different from each
other, relative to the variability of the distributions. It essentially measures the signal-to-noise ratio and calculates a p-value for each
gene. Consequently, those genes that differ significantly between the two conditions,
i.e., whose p-values fall below a chosen cut-off, are selected
and further analyzed.
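The signal-to-noise idea behind the t-test can be sketched for a single gene as follows. This is a minimal illustration, assuming Welch's unequal-variance form of the statistic (the unit does not say which variant EP uses); the function name and the expression values are hypothetical. Turning the statistic into a p-value additionally requires the t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`, which is not used here so that the sketch needs only the standard library.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic: difference of the group means (signal)
    divided by the standard error of that difference (noise)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical log2 expression values for one gene, four replicates each.
normal = [7.9, 8.1, 8.0, 8.2]
disease = [9.0, 9.3, 8.9, 9.2]

t = welch_t(disease, normal)
print(f"t = {t:.2f}")
```

A large absolute value of t means that the difference between the group means is large relative to the replicate-to-replicate variability.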
As explained in Basic Protocol 3, an issue
that occurs with running the t-test on datasets
with large numbers of genes is the multiple
testing problem (Pounds, 2006). For example, performing 10,000 t-tests on a dataset
of 10,000 rows at a p-value cut-off of 0.05
is expected to yield about 500 genes (5% of 10,000) that are falsely identified
as differentially expressed by chance alone, even if no gene truly differs. Therefore, the user
should always apply a multiple testing correction method to account for the possibly high
numbers of false positives.
A number of multiple testing corrections
are implemented in EP, including Bonferroni,
Holm, and Hochberg (Holm, 1979; Hochberg,
1988; Benjamini and Hochberg, 1995). Once
again, the user should apply different correction methods and observe changes in the final
result.
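The effect of different corrections can be compared on a small set of p-values. The following is a minimal sketch of three widely used adjustments: Bonferroni, Holm (1979), and the Benjamini and Hochberg (1995) false discovery rate procedure. It is not EP's implementation, the Hochberg (1988) step-up procedure is omitted, and the function names and p-values are hypothetical.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down: the i-th smallest p-value is multiplied by (m - i + 1),
    and adjusted values are forced to be non-decreasing."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up: the i-th smallest p-value is multiplied
    by m / i, and adjusted values are forced to be non-decreasing."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    print(name, [round(p, 4) for p in method(pvals)])
```

Bonferroni is the most conservative of the three, so the number of genes passing a given cut-off can only stay the same or grow when moving from Bonferroni to Holm to Benjamini-Hochberg.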
In addition, this approach requires that each
experimental condition be represented by
multiple biological replicates, so that statistical
significance can be assessed. There is no ideal number of biological replicates per experiment but the user
should bear in mind that biological replication substantially increases the reliability of
microarray results.
Another problem is the choice of the most
appropriate p-value cut-off to be used in the
t-test analysis. Once again, the most appropriate cut-off must be decided by inspecting the
data. The user should apply different cut-off
values and observe how changes in the cut-off affect the final number of differentially expressed genes. Only by moving the cut-off up
and down will the user be able to decide on
the most appropriate value to use.
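Moving the cut-off up and down amounts to counting the selected genes at each candidate value. A minimal sketch, with hypothetical gene names, p-values, and cut-offs:

```python
# Hypothetical per-gene p-values from a t-test.
p_values = {"geneA": 0.0004, "geneB": 0.008, "geneC": 0.03,
            "geneD": 0.2, "geneE": 0.6}

def count_selected(p_values, cutoff):
    """Number of genes whose p-value is at or below the cut-off."""
    return sum(1 for p in p_values.values() if p <= cutoff)

for cutoff in (0.001, 0.01, 0.05, 0.25):
    print(f"cut-off {cutoff}: {count_selected(p_values, cutoff)} genes selected")
```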
Clustering analysis
The aim of clustering is to discover patterns of gene expression in the data by grouping genes together, according to a similarity
measure, so that objects within one group are
more similar to each other than to the objects in
other groups. For this, the user needs to quantify to what degree two expression profiles are
similar. Such a measure of (dis)similarity is called a
distance; the more distant two expression profiles are, in the multidimensional space, the
more dissimilar they are.
One can measure this distance in a number of different ways. The simplest measure is
Euclidean distance, which is simply the length
of the straight line connecting two points in
multidimensional space. Another measure is
the Manhattan distance. This measure represents the distance that one needs to travel in
an environment in which one can move only
along directions parallel to the coordinate axes. In
comparison to Euclidean distance, Manhattan
distance yields an equal or larger numerical value
for the same relative position of the points.
Other measures quantify the similarity in expression profile shape (e.g., if the genes go
up and down in a coordinated fashion across
the experimental conditions), and are based
on measures of correlation (e.g., Pearson correlation). With so many distance measures, a
natural question is when to use what? It is a
good idea for the user to explore the different similarity measures separately to become
familiar with the properties of each measure.
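The distance measures discussed above can be compared directly on a pair of profiles. This is a minimal sketch: the function names and the two profiles are hypothetical, and converting the Pearson correlation r into a distance as 1 - r is one common convention assumed here, not necessarily the one used in EP.

```python
from math import sqrt

def euclidean(a, b):
    """Length of the straight line between two profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite ones."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]  # same shape as g1, larger amplitude

print(f"Euclidean: {euclidean(g1, g2):.3f}")
print(f"Manhattan: {manhattan(g1, g2):.3f}")
print(f"Pearson distance: {pearson_distance(g1, g2):.3f}")
```

The two profiles rise in step, so the correlation-based distance is essentially zero even though the Euclidean and Manhattan distances are large; which behavior is preferable depends on whether amplitude or shape matters for the question at hand.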
Also, when clustering genes, the user needs
to choose which clustering algorithm to use.
Two algorithms are available in EP: an agglomerative method (hierarchical clustering) and a partitioning method (K-means
clustering).
Hierarchical clustering starts by assigning
each gene to its own cluster, so that if you have
N genes, you now have N clusters, each containing just one gene. It then finds the closest
(most similar) pair of genes, using the similarity measure chosen by the user, and merges
them into a single cluster, so that now you have
one less cluster. The distances between the new
cluster and each of the old clusters are then
computed and the process repeated until all
genes are clustered. Four different algorithms
can be used to calculate the inter-cluster distances in EP (i.e., single, complete, centroid,
or average linkage—see Basic Protocol 4). The
result is a hierarchy of clusters in the form of a
tree or dendrogram (see Fig. 7.13.17, left-hand
side).
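The merging procedure described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and average linkage; it is not EP's implementation, it recomputes all inter-cluster distances at every step for clarity rather than speed, and the gene names and profiles are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles):
    """Return the merges in the order they happen, until one cluster remains."""
    clusters = [(name,) for name in profiles]  # start: one gene per cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        pairs = [(average_linkage(a, b, profiles), a, b)
                 for k, a in enumerate(clusters) for b in clusters[k + 1:]]
        dist, a, b = min(pairs)
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

profiles = {
    "geneA": [0.0, 0.0],
    "geneB": [0.0, 1.0],
    "geneC": [5.0, 5.0],
    "geneD": [5.0, 6.5],
}
for a, b, dist in agglomerate(profiles):
    print(a, "+", b, f"at distance {dist:.2f}")
```

The sequence of merges, together with the distance at which each occurs, is the information a dendrogram displays.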
When using K-means clustering, the user
needs to specify the number of clusters, K, into
which the data will be split. Once K
has been specified, the method assigns each
gene to the one of the K clusters whose centroid is closest. The centroids' positions are
recalculated every time a gene is added to a
cluster and this continues until all the genes
are grouped into the final required K
clusters.
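A K-means run can be sketched as follows. This is a minimal illustration of the standard batch variant (Lloyd's algorithm), in which all genes are assigned first and the centroids are then recalculated, repeating until the assignment stops changing; it differs from the incremental centroid update described above and is not EP's implementation. The initial centroids are supplied by the caller, and all names and values are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until the assignment is stable."""
    centroids = [list(c) for c in initial_centroids]  # do not modify the input
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
assignment, centroids = k_means(points, [[0.0, 0.0], [1.0, 1.0]])
print(assignment)
print(centroids)
```

Because the result depends on K and on the starting centroids, rerunning with different values is a simple way to see how stable the clusters are.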
The initial choice of the number of K clusters is an issue that needs careful consideration. If it is known in advance that the patterns
of gene expression to be clustered belong to
several classes (e.g., normal and disease), the
user should cluster using the known number of
classes as K. If the analysis has an exploratory
character, the user should repeat the clustering
for several values of K and compare the results.
Once again, the user is encouraged to compare the output of different clustering methods
and apply several combinations of distance measure and clustering
algorithm to see how the clustering results change. This will help with finding the best visual representation for a dataset
(Quackenbush, 2001). The Clustering comparison component in EP is particularly useful
for this purpose since it provides side-by-side
visualization of two clustering outputs. This
might help the user decide on a particular
clustering method or on the optimal number
of clusters, K, into which to subdivide the dataset.
Literature Cited
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S., Eppig, J., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., and Sherlock, G. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25-29.
Ball, C., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J.C., Icahn, C., Parkinson, H., Quackenbush, J., Ringwald, M., Sansone, S.A., Sherlock, G., Spellman, P., Stoeckert, C., Tateno, Y., Taylor, R., White, J., and Winegarden, N. 2004. An open letter on microarray data from the MGED Society. Microbiology 150:3522-3524.
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57:289-300.
Blake, J., Schwager, C., Kapushesky, M., and Brazma, A. 2006. ChroCoLoc: An application for calculating the probability of co-localization of microarray gene expression. Bioinformatics 22:765-767.
Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. 2001. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 29:365-371.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G.G., Oezcimen, A., Rocca-Serra, P., and Sansone, S.A. 2003. ArrayExpress-a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31:68-71.
Culhane, A.C., Perriere, G., Considine, E.C., Cotter, T.G., and Higgins, D.G. 2002. Between-group analysis of microarray data. Bioinformatics 18:1600-1608.
Culhane, A.C., Thioulouse, J., Perriere, G., and Higgins, D.G. 2005. MADE4: An R package for multivariate analysis of gene expression data. Bioinformatics 21:2789-2790.
Edgar, R., Domrachev, M., and Lash, A.E. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30:207-210.
Hochberg, Y. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800-803.
Holm, S. 1979. A simple sequentially rejective Bonferroni test procedure. Scand. J. Stat. 6:65-70.
Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. 2002. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18:S96-S104.
Ihaka, R. and Gentleman, R. 1996. R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5:299-314.
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y., and Barkai, N. 2002. Revealing modular organization in the yeast transcriptional network. Nat. Genet. 31:370-377.
Ikeo, K., Ishi-i, J., Tamura, T., Gojobori, T., and Tateno, Y. 2003. CIBEX: Center for information biology gene expression database. C. R. Biol. 326:1079-1082.
Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., and Speed, T.P. 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 31:e15.
Johansson, P. and Hakkinen, J. 2006. Improving missing value imputation of microarray data by using spot quality weights. BMC Bioinformatics 7:306.
Kapushesky, M., Kemmeren, P., Culhane, A.C., Durinck, S., Ihmels, J., Korner, C., Kull, M., Torrente, A., Sarkans, U., Vilo, J., and Brazma, A. 2004. Expression Profiler: Next generation-an online platform for analysis of microarray data. Nucleic Acids Res. 32:W465-W470.
Li, C. and Wong, W.H. 2001. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. U.S.A. 98:31-36.
Manly, K.F., Nettleton, D., and Hwang, J.T. 2004. Genomics, prior probability, and statistical tests of multiple hypotheses. Genome Res. 14:997-1001.
Pounds, S. 2006. Estimation and control of multiple testing error rates for microarray studies. Brief. Bioinform. 7:25-36.
Quackenbush, J. 2001. Computational analysis of microarray data. Nat. Rev. Genet. 2:418-427.
Quackenbush, J. 2002. Microarray data normalization and transformation. Nat. Genet. 32:496-501.
Rayner, T.F., Rocca-Serra, P., Spellman, P.T., Causton, H.C., Farne, A., Holloway, E., Irizarry, R.A., Liu, J., Maier, D.S., Miller, M., Petersen, K., Quackenbush, J., Sherlock, G., Stoeckert, C.J., White, J., Whetzel, P.L., Wymore, F., Parkinson, H., Sarkans, U., Ball, C.A., and Brazma, A. 2006. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7:489.
Smyth, G.K. 2004. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3:Article3.
Torrente, A., Kapushesky, M., and Brazma, A. 2005. A new algorithm for comparing and visualizing relationships between hierarchical and flat gene expression data clusterings. Bioinformatics 21:3993-3999.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17:520-525.
Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F., and Spencer, F. 2004. A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. 99:909-917.
Analyzing Gene Expression Data from
Microarray and Next-Generation DNA
Sequencing Transcriptome Profiling
Assays Using GeneSifter Analysis Edition
UNIT 7.14
Sandra Porter,1,2 N. Eric Olson,2 and Todd Smith2
1 Digital World Biology, Seattle, Washington
2 Geospiza, Inc., Seattle, Washington
ABSTRACT
Transcription profiling with microarrays has become a standard procedure for comparing
the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation
DNA sequencing methods, are also starting to be used for transcriptome analysis. These
technologies, with their low background, large capacity for data collection, and dynamic
range, provide a powerful and complementary tool to the assays that formerly relied on
microarrays. In this chapter, we describe two protocols for working with microarray data
from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing
data from two different instrument platforms (Illumina GA and Applied Biosystems
SOLiD). Curr. Protoc. Bioinform. 27:7.14.1-7.14.35. © 2009 by John Wiley & Sons, Inc.
Keywords: gene expression · microarray · RNA-Seq · transcriptome ·
GeneSifter Analysis Edition · next-generation DNA sequencing
INTRODUCTION
Transcriptome profiling is a widely used technique that allows researchers to view the
response of an organism or cell to a new situation or treatment. Insights into the transcriptome have uncovered new genes, helped clarify mechanisms of gene regulation,
and implicated new pathways in the response to different drugs or environmental conditions. Often, these kinds of analyses are carried out using microarrays. Microarray assays
quantify gene expression indirectly by measuring the intensity of fluorescent signals from
tagged RNA after it has been allowed to hybridize to thousands of probes on a single
chip. Recently, next-generation DNA sequencing technologies (also known as NGS or
Next Gen) have emerged as an alternative method for sampling the transcriptome. Unlike microarrays, which identify transcripts by hybridization and quantify transcripts by
fluorescence intensity, NGS technologies identify transcripts by sequencing DNA and
quantify transcription by counting the number of sequences that align to a given transcript. Although the final output from an NGS experiment is a digital measure of gene
expression, with the units expressed as the numbers of aligned reads instead of intensity,
the data and goals are similar enough that we can apply many of the statistical methods
developed for working with microarrays to the analysis of NGS data.
There are many benefits to using microarray assays, the greatest being low cost and
long experience. Over the years, the laboratory methods for sample preparation and the
statistical methods for analyzing data have become more standardized. As NGS becomes
more commonplace, these new methods are increasingly likely to serve as a complement
or alternative to microarrays. Since these assays are based on DNA sequencing rather than
hybridization, the background is low, the results are digital, the dynamic range is greater,
and transcripts can be detected even in the absence of a pre-existing probe (Marioni
et al., 2008; Wang et al., 2009). Furthermore, once the sequence data are available,
they can be aligned to new reference data sets, making NGS data valuable for future
experiments. Still, until NGS assays are better characterized and understood, it is likely
that microarrays and NGS will serve as complementary technologies for some years to
come.
Current Protocols in Bioinformatics 7.14.1-7.14.35, September 2009
Published online September 2009 in Wiley Interscience (www.interscience.wiley.com).
DOI: 10.1002/0471250953.bi0714s27
Copyright © 2009 John Wiley & Sons, Inc.
In this chapter, we describe using a common platform, GeneSifter Analysis Edition
(GSAE; a registered trademark of Geospiza, Inc.), for analyzing data from both microarray and NGS experiments. GSAE is a versatile Web-based system that can already be
used to analyze data from a wide variety of microarray platforms. We have added features for uploading large data sets, aligning data to reference sequences, and presenting
results, which make GSAE useful for NGS as well. Both kinds of data analyses share
several similar features. Data must be entered into the system and normalized. Statistical
methods must be applied to identify significant differences in gene expression. Once
significantly different expression patterns have been identified, there must be a way to
uncover the biological meaning for those results. GSAE provides methods for working with ontologies and KEGG pathways, clustering options to help identify genes that
share similar patterns of expression, and links to access information in public databases.
Data-management capabilities and quality control measures are also part of the GSAE
system.
In both of the two basic protocols, we will present general methods for analyzing microarray data, follow those procedures with alternative procedures that can be used to
analyze NGS data, and discuss the differences between the microarray protocol and the
NGS alternative. Basic Protocol 1 presents a pairwise analysis of microarray data from
mice that were fed different kinds of food (Kozul et al., 2008). The protocol uses data
from the public Gene Expression Omnibus (GEO) database at the NCBI (Barrett et al.,
2009), and demonstrates normalizing the data and the analyses. Alternate Protocol 1,
for a pairwise comparison, also uses data from GEO; however, these are NGS data from
the Applied Biosystems SOLiD instrument. In Alternate Protocol 1, we use a pairwise
analysis to compare gene expression from single wild-type mouse oocytes with gene expression in mouse oocytes containing a knockout mutation for DICER, a gene involved
in processing microRNAs (Tang et al., 2009).
Basic Protocol 2 presents a general method for analyzing microarray data from samples
that were obtained after multiple conditions were applied. In this study, mice were fed two
kinds of food and exposed to increasing concentrations of arsenic in their water (Kozul
et al., 2008). This protocol includes ANOVA and demonstrates options for Principal
Component Analysis, clustering data by samples or genes, and identifying expression
patterns from specific gene families. Alternate Protocol 2, a variation on Basic Protocol
2, describes an analysis of NGS data from the Illumina GA analyzer, comparing samples
from three different tissues (Mortazavi et al., 2008). Cluster analysis is included in this
procedure as a means of identifying genes that are expressed in a tissue-specific manner.
As with Basic Protocol 1, these studies use data from public repositories, in this case,
GEO and the NCBI Short Read Archive (SRA; Wheeler et al., 2008). It should be noted
for both protocols that GSAE contains alternatives to the statistical tools used in these
procedures and that other tools may be more appropriate, depending on the individual
study.
Analyzing Gene
Expression Data
Using GeneSifter
7.14.2
Supplement 27
Current Protocols in Bioinformatics
COMPARING GENE EXPRESSION FROM PAIRED SAMPLE DATA
OBTAINED FROM MICROARRAY EXPERIMENTS
BASIC
PROTOCOL 1
One of the most common types of transcriptome profiling experiments involves comparing gene expression from two different kinds of samples. The two conditions might
be an untreated control and a treated sample, or a wild-type strain and a mutant. Since there are
two conditions, we call this process a pairwise analysis. Often, the two conditions involve replicates as well. For example, we might have four mice as untreated controls
and four mice that were subjected to some kind of experimental treatment. Comparing these two sets of samples requires normalizing the data so that we can compare
expression within and between arrays. Next, the normalized results are compared and
subjected to statistical tests to determine if any differences are likely to be significant. Procedures can also be applied at this stage to correct for multiple testing. Last,
we use z scores, ontologies, and pathway information to explore the biology and determine if some pathways are significantly over-represented, and elucidate what this
information is telling us about our samples. Figure 7.14.1 provides an overview of this
process.
In this analysis, we compare the expression profiles from the livers of five mice that were
fed for 5 weeks with a purified diet, AIN-76A, with the expression profiles from the
livers of five mice that were fed for the same period of time with LRD-5001, a standard
laboratory mouse food.
Figure 7.14.1
Overview of the process for a pairwise comparison. Mice are divided into a treatment group and a no-treatment group; RNA is isolated from each mouse and hybridized to microarrays; the data are uploaded and normalized; differential expression is identified (fold change, quality, statistics such as the t test, and multiple testing correction by Bonferroni, Benjamini and Hochberg, or other methods); and the biology is explored (ontology, KEGG, scatter plot).
Analyzing
Expression
Patterns
7.14.3
Current Protocols in Bioinformatics
Supplement 27
Necessary Resources
Software
GeneSifter Analysis Edition (GSAE): a trial account must be established in order
to upload data files to GSAE; a trial account or license for GeneSifter Analysis
Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com)
GSAE is accessed through the Web; therefore Internet access is required along with
an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or
Apple Safari.
Files
Data files from a variety of microarray platforms may be uploaded and analyzed in
GSAE, including Affymetrix, Illumina, Codelink, or Agilent arrays, and custom
chips.
The example data used in this procedure were CEL files from an Affymetrix array
and were obtained from the GEO database at the NCBI (Accession code GSE
9630).
CEL files are the best file type for use in GSAE. To obtain CEL files, go to the
GEO database at the NCBI (www.ncbi.nih.gov/geo/).
Enter the accession number (in this case GSE 9630) in the section labeled Query
and click the Go button.
In this example, all the files in the data set are downloaded as a single tar file by
selecting (ftp) from the Download column at the bottom of the page.
After downloading to a local computer, the files are extracted, unzipped, then
uploaded to GSAE as described in the instructions.
Files used for the AIN-76A group: GSM243398, GSM243405, GSM243391,
GSM243358, and GSM243376.
Files used for the LRD-5001 group: GSM243394, GSM243397, GSM243378,
GSM243382, and GSM243355.
A demonstration site with the steps performed below and the same data files can be
accessed from the data center at http://www.geospiza.com.
Uploading data
1. Create a zip archive from your microarray data files.
a. If using a computer with a Microsoft Windows operating system, a commonly
used program is WinZip.
b. If using Mac OS X, select your data files, click the right mouse button, and choose
Compress # Items to create a zip archive.
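Step 1 can also be scripted instead of using a desktop tool. The following is a minimal sketch using Python's standard zipfile module. The helper name zip_cel_files and any directory or archive names passed to it are hypothetical, and the sketch assumes the extracted array files end in .CEL (in either case).

```python
import zipfile
from pathlib import Path


def zip_cel_files(data_dir, archive_path):
    """Bundle every .CEL file in data_dir into one zip archive.

    Returns the sorted list of file names written to the archive.
    """
    data_dir = Path(data_dir)
    # Match the extension case-insensitively (.CEL or .cel).
    cel_files = sorted(
        p for p in data_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".cel"
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in cel_files:
            # Store only the file name so the archive has no folder levels.
            archive.write(path, arcname=path.name)
    return [p.name for p in cel_files]
```

For example, zip_cel_files("GSE9630_cel", "GSE9630_cel.zip") would archive the CEL files from a (hypothetical) folder named GSE9630_cel. Storing flat file names is a design choice made here for simplicity; it is not a documented GSAE requirement.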
2. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). A username and password are provided when a trial account is established.
3. Locate the Import Data heading in the Control Panel on the left-hand side of the
screen and click Upload Tools.
Several types of microarray data can be uploaded and analyzed in GSAE. Since different
microarray platforms produce data in a variety of formats, each type of microarray data
has its own upload wizard. In this protocol, we will be working with Affymetrix CEL
data from the NCBI GEO database, and so we choose the option for “Advanced upload
methods.” This option also allows you to normalize data during the upload process using
standard techniques for Affymetrix data such as RMA, GC-RMA, or MAS5. Instructions
for using other GSAE upload wizards are straightforward and are available in the GSAE
user manual.
RMA and GC-RMA are commonly used normalization procedures (Millenaar et al., 2006).
Both of these processes involve three distinct operations: global background normalization, across-array normalization, and log2 transformations of the intensity values. One
point to note here is that if you plan to use RMA or GC-RMA, the across-array normalization step requires that all the data be uploaded at the same time. If you wish to compare
data to another experiment at a later time, you will need to upload the data again, together
with those data from the new experiment.
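To make the across-array step concrete, here is a simplified sketch of quantile normalization followed by a log2 transformation, written with the Python standard library only. It is not the RMA or GC-RMA implementation used by GSAE: it omits background correction and probe-set summarization, and it breaks ties by order of appearance. The function names and the toy intensities are illustrative assumptions. The sketch also shows why all arrays must be processed together: the reference distribution is the mean across every array in the batch.

```python
import math


def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Returns new lists in the original probe order.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Sort each array, then average across arrays at each rank to build
    # the reference distribution shared by the whole batch.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        result = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            result[probe_index] = reference[rank]
        normalized.append(result)
    return normalized


def log2_transform(arrays):
    """Convert positive intensities to the log2 scale."""
    return [[math.log2(v) for v in a] for a in arrays]


# Toy data: three arrays, four probes each (made-up intensities).
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(raw)
log_values = log2_transform(normalized)
```

After normalization, every array contains the same set of values; only their assignment to probes differs. Adding a new array changes the reference distribution, which is why the whole batch must be normalized again.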
4. Click the Run Advanced Upload Methods button.
5. Next, select the normalization method and the array type from pull-down menus.
Click the Next button (at the bottom of the screen).
Choose GC-RMA normalization and the 430 2.0 Mouse array for this example.
6. In the screen which now appears, browse to locate the data file created in step 1.
7. Choose an option (radio button): Create Groups, Create New Targets, or Same as File
Name.
Since a pairwise analysis involves comparing two groups of samples, choose Create
Groups and set 2 as the value. If the experiment were to involve comparing more than two
conditions, other options would be chosen. These are described in Basic Protocol 2.
8. Click the Next button.
The screen for the next step will appear after the data are uploaded.
9. On the screen displayed in “step 3 of 4,” you will be asked to enter a title for your
data set, assign a condition to each group, add labels to your samples if desired, and
identify which sample(s) belong to which group.
In this case, decide that the AIN-76A mice should be condition 1 and the LRD-5001 mice
should be condition 2. Then, use the buttons to assign all the AIN-76A samples to group 1 and
the LRD-5001 samples to group 2.
Comparing paired groups of samples and finding differentially expressed genes
10. Begin by selecting Pairwise from the Analysis section of the control panel
(Fig. 7.14.2).
11. Find the array or gene set that corresponds to your experiment. In this case, our array
is named “Mouse food and arsenic.”
12. Select the spyglass to set up the analysis.
A new page will appear with a list of all the samples in the array as well as the analysis
options.
13. Use the checkboxes in the group 1 column to select the samples for group 1, and the
checkboxes in the group 2 column to select the samples for group 2.
Usually, the control, wild-type, or untreated samples are assigned to group 1. Here, assign
the AIN-76A samples to group 1 and the LRD-5001 samples to group 2.
14. Choose the advanced analysis settings. Since the data were normalized during the
uploading process by the GC-RMA algorithm, we can use some of the default settings
for the analysis. If you choose a setting that is not valid for RMA or GC-RMA
normalized data, warnings will appear to let you know that the data are already
normalized or already log transformed.
a. Normalization: Use None with RMA or GC-RMA normalized data. This step
has already been performed, since RMA and GC-RMA both perform quantile
normalization during the upload process.
Figure 7.14.2
Setting up a pairwise comparison. Callouts in the figure: (1) select Pairwise; (2) select the gene set; (3) assign samples to the two groups; (4) choose analysis settings; (5) click Analyze.
b. Statistics: The statistical tests available from the pull-down menu are used to
assess whether the differences between the mean intensity values for each gene
(or probe), across a set of replicate samples, are statistically
significant. The significance level for each gene is reported as a p value. When
multiple replicates of a sample are used, GSAE users can choose between the t test,
Welch’s t test, a Wilcoxon test, and no statistical test. The t test is commonly
used for this step when samples from a controlled experiment are being compared.
The t test assumes a normal distribution with equal variance. Other options that
may be used are Welch’s t test, which does not assume equal variance, and the
Wilcoxon test, a nonparametric rank-sum test.
Since all of these tests look at the variation between replicates, you must have at
least two replicates for each group to apply these tests. For the Wilcoxon test, you
must have at least four replicates. Use the t test for this example.
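As an illustration of how the two t tests differ, the sketch below computes the Student (pooled-variance) and Welch t statistics and their degrees of freedom for one gene, using only the Python standard library. The replicate values are made up, the function names are hypothetical, and this is not GSAE's code. Turning a t statistic into a p value additionally requires the t distribution, which the standard library does not provide, so that step is left out.

```python
import math
from statistics import mean, variance


def student_t(group1, group2):
    """Two-sample t statistic assuming equal variance in both groups."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: weighted average of the two sample variances.
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (mean(group2) - mean(group1)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(group1, group2):
    """Two-sample t statistic that does not assume equal variance."""
    n1, n2 = len(group1), len(group2)
    se1, se2 = variance(group1) / n1, variance(group2) / n2
    t = (mean(group2) - mean(group1)) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Made-up log2 intensities for one probe, five replicates per group.
control = [10.1, 9.8, 10.3, 9.9, 10.0]
treated = [12.6, 13.1, 12.5, 13.0, 12.8]

t_student, df_student = student_t(control, treated)
t_welch, df_welch = welch_t(control, treated)
print(f"Student t = {t_student:.2f} (df = {df_student})")
print(f"Welch t = {t_welch:.2f} (df = {df_welch:.2f})")
```

The sample variance needs at least two values per group, which mirrors the two-replicate minimum stated above. With equal group sizes the two statistics coincide and only the degrees of freedom differ.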
c. Quality (Calls): The quality options in this menu are N/A, A (absent), M
(marginal), or P (present). However, neither RMA nor GC-RMA produces quality
values, so N/A is the appropriate choice when these normalization methods are
used.
d. Exclude Control Probes: Selecting this check box excludes positive and negative
control probes from the analysis. This step can be helpful because it cuts down on
the number of tests and minimizes the penalty from the multiple testing correction.
For our example, check this box.
e. Show genes that are up-regulated or down-regulated: Use the checkboxes to
choose both sets of genes or one set. Check both boxes for this example.
f. Threshold: The Lower threshold menu allows you to filter the results by the change
in expression levels. For example, picking 1.5 as the lower threshold means that
genes will only appear in the list if there is at least a 1.5-fold difference in
expression between the two groups of samples.
Use a setting of 1.5 as the Lower limit and None as the Upper limit.
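Because RMA and GC-RMA data are on the log2 scale, a fold change is obtained by exponentiating the difference of the group means rather than by dividing them. The sketch below illustrates this with the two group means reported for cytochrome P450, family 2, subfamily a, polypeptide 4 in Figure 7.14.4. The function names are hypothetical, and the assumption that GSAE computes its ratio this way rests only on the agreement with the fold change shown in that figure.

```python
def fold_change_from_log2(mean_group1, mean_group2):
    """Signed fold change of group 2 relative to group 1 for log2-scale means.

    Positive values mean up-regulation in group 2, negative values mean
    down-regulation; the magnitude is always >= 1.
    """
    ratio = 2 ** (mean_group2 - mean_group1)
    return ratio if ratio >= 1 else -1 / ratio


def passes_threshold(fold_change, lower=1.5, upper=None):
    """Keep a gene if its fold change magnitude is within [lower, upper]."""
    magnitude = abs(fold_change)
    if magnitude < lower:
        return False
    return upper is None or magnitude <= upper


# Group means (log2 scale) shown in Figure 7.14.4.
fc = fold_change_from_log2(10.0036, 12.7881)
print(f"{fc:.1f}-fold, reported: {passes_threshold(fc)}")
```

With a lower limit of 1.5 and no upper limit, this gene is kept, whereas a gene whose log2 means differ by only 0.3 (about 1.2-fold) would be filtered out.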
g. Correction: Every time gene expression is measured, in a microarray or Next
Gen experiment, there is a certain probability that the results will be identified
as significantly different, even though they are not. These kinds of results can
be described as false positives or as type I errors. As we increase the number
of the genes tested, we also increase the probability of seeing false positives.
For example, if we use a p-value cutoff of 0.05, each gene whose expression does not
really differ between the two groups still has a 5% chance of being called significant. When a large
data set such as one generated by a microarray experiment is analyzed, with a list
of 10,000 genes (an average-sized microarray), about 500 of those genes could
be incorrectly identified as significant. The correction methods in this menu are
designed to compensate for this kind of result.
Four different options are available in GSAE to adjust the p values for multiple
testing and minimize the false-discovery rate. Since these methods are used to
correct the p values obtained from statistical tests, these corrections are only
applied if a statistical test, such as a t test, has been applied.
GSAE offers the following correction methods: Bonferroni, Holm, Benjamini and
Hochberg, and Westfall and Young maxT corrections. The Bonferroni, Holm, and Westfall
and Young corrections control the family-wise error rate. This is a very conservative requirement: at a level of 0.05, there is a 5% chance that you will have at least one false positive. The
Benjamini and Hochberg correction controls the false discovery rate. With this
method, when the false discovery rate is set to 0.05, about 5% of the genes considered statistically
significant are expected to be false positives. Benjamini and Hochberg is the least stringent of
the four choices, allowing for a greater number of false positives, and fewer false
negatives.
The choice of correction method depends
on our experimental goal. If our goal is discovery, we can tolerate
a greater number of false-positive results in order to minimize the number of
false negatives. If we choose a method to minimize false positives, we have to
realize that some of the real positives may be missed. Genes with real differences
in expression may appear to show an insignificant change after multiple testing
corrections are applied. One of the most important reasons for using these tests is
that they allow the user to rank genes in order of the significance of change, and
make choices about which changes to investigate.
For this example, choose Benjamini and Hochberg.
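The arithmetic behind the 10,000-gene example, and three of the four corrections, can be sketched with the Python standard library. These are the textbook definitions of the Bonferroni, Holm, and Benjamini and Hochberg adjustments, not GSAE's own code. The function names and the raw p values are illustrative assumptions, and the permutation-based Westfall and Young maxT procedure is omitted.

```python
def bonferroni(p_values):
    """Multiply each p value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down Holm adjustment (controls the family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, ...
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Step-up Benjamini and Hochberg adjustment (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected false positives with no correction: 10,000 tests at a 0.05 cutoff.
expected_false_positives = 10_000 * 0.05

# Made-up raw p values for five genes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.2]
for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini and Hochberg", benjamini_hochberg)]:
    print(name, [round(p, 4) for p in method(raw_p)])
```

For any set of p values, the Benjamini and Hochberg adjusted value is never larger than the Holm value, which in turn is never larger than the Bonferroni value. This ordering is what makes Benjamini and Hochberg the least stringent of the three.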
h. Data Transformation: Our data are already log2 transformed, since RMA and GC-RMA both carry out this step during the upload process. Choose Data Already
Log Transformed for this example.
i. Click the Analyze button.
A page with results appears when the processing step is complete.
Investigating the biology
Figure 7.14.3 shows the results from our pairwise analysis of the microarray data, that is, the
differentially expressed genes. Pull-down menus in the middle of the page contain options
for sorting and changing the views. You may increase the number of genes in the list,
sort by the ratio, p value, or adjusted p value, choose a p-value cutoff so that genes are
only shown if the p values are below a certain number, and change the presentation from
the raw p value to the adjusted p value. After choosing selections from the menus, click
the Search button to show the results.
Figure 7.14.3
Analyzing the results from a pairwise comparison. Callouts in the figure: change to adjusted p; click Search to apply changes; select a name to view the gene summary; see the scatter plot; view changes in KEGG pathways; view changes by ontology or the z score; save your results. In the example shown, 539 genes show a significant change, and the highlighted gene is expressed about 7-fold more highly in mice fed with LRD-5001.
When this page first appears, our results show a list of 764 genes that are differentially
expressed. Arrows on the left side of each gene ratio point up if a gene shows an increase
in expression relative to the first group or down if a gene shows decreased expression. The
ratio shows the extent of up- or down-regulation. When this page first appears, the list is
filtered by the raw p value.
15. Filter based on the corrections for multiple testing by selecting “adjusted p” from the
raw p value menu and clicking the Search button.
Choosing “adjusted p” from the left pull-down menu and clicking the Search button filters
the list by the p values adjusted with the Benjamini and Hochberg correction for the false
discovery rate. The number of genes in the list changes from 764
to 539.
16. Next, it can be helpful to sort the data. Initially, the data are shown sorted by ratio so
that genes with a larger-fold change appear earlier in the list. It can also be helpful to
sort the data by the p value or the adjusted p value to see which genes show the most
significant change. Choose “Adj. p” from the Sort By menu to sort by the adjusted
p value.
Sorting by the adjusted p value shows that the genes with the most significant changes
are those for cytochrome P450, family 2, subfamily a, polypeptide 4 and glutathione S-transferase,
alpha 2.
17. We can learn more about any gene in the list by clicking its name. Clicking the top
gene in the list brings us to a page where we can view summarized information for
this gene and obtain links to more information in public databases.
18. Click Scatter Plot to view the differences in gene expression another way. A new
window will open with the data presented as a scatter plot (Fig. 7.14.4).
Figure 7.14.4
Scatter plot. Each spot in the graph represents the expression measurements for one gene. The expression level for group 2 (LRD-5001, water 0) is plotted on the y axis and the value for group 1 (AIN-76A, water 0) is plotted on the x axis; both axes are logarithmic, from 1 to 100000. Up-regulated and down-regulated genes appear on opposite sides of the diagonal line. Callouts in the figure: (1) drag the box over the genes you wish to view in detail; (2) click the zoom button; (3) click a spot to see more details. Gene summary information appears in the lower corner. For cytochrome P450, family 2, subfamily a, polypeptide 4, the summary reads:
Group               N   Mean      SEM      SEM/Mean   Quality
AIN-76A, water 0    5   10.0036   0.1127   1.13%      0.0000
LRD-5001, water 0   5   12.7881   0.1339   1.05%      0.0000
LRD-5001, water 0 up-regulated 6.9 fold compared to AIN-76A, water 0.
a. The scatter plot.
The scatter plot gives us a visual picture of gene expression in the different samples. The
levels of gene expression in group 1 (mice fed with AIN-76A) are plotted on the x axis and
group 2 (mice fed with LRD-5001) on the y axis. Genes that are equally expressed in both
samples fall on the diagonal line. Genes that are expressed more highly in group 2 appear
above the line, and genes that are expressed more highly in group 1 appear
below it.
If we used a method to correct for the false-discovery rate, then the points for genes
showing nonsignificant changes would be colored gray, up-regulated genes showing a
significant change would be colored red, and down-regulated genes showing a significant
change would be colored blue or green.
b. The zoom window and gene summary.
To learn more about any gene in the graph, we drag the box on top of a spot and click the
“zoom” button. After a short time (up to 30 sec), the highlighted spot and surrounding
spots will appear in the top right window. If spots overlap, you may separate them by
dragging them with the mouse. The name of each gene will appear when the mouse is
moved over a spot, and clicking a spot will produce the gene summary information in the
lower right corner.
In our experimental example, clicking some of the spots will find genes that were seen earlier in the list, such as genes for members of the cytochrome P450 family and glutathione S-transferase.
19. Return to the results window and click the KEGG link.
a. The KEGG report.
The KEGG report, as shown in Figure 7.14.5, presents a list of biochemical and regulatory
pathways that contain members from the list of differentially expressed genes on the results
page. Each row shows the name of the pathway, a link to a list of gene-list members that
belong to that pathway, with arrowheads to show if a member is up- or down-regulated, a
link to the KEGG pathway database, the number of genes from the list that belong to that
pathway, the number of genes that are up-regulated, the number down-regulated, the total
number from that pathway that were present in the array (or reference data set for Next
Gen data), and the z scores for up- and down-regulated genes.
b. z scores.
z scores are used to evaluate whether genes from a specific pathway are enriched in your
list of differentially expressed genes. If genes from a specific pathway are represented in
your gene list more often than they would be expected to be seen by chance, the z-scores
reflect that occurrence. A z score greater than 2 indicates that a pathway is significantly
enriched in the list of differentially expressed genes, while a z-score below −2 indicates
that a pathway is significantly under-represented in the list. The direction and color of
the arrowheads show whether those genes are up- or down-regulated in the second group
relative to the first group of samples. Clicking the arrows above a z score column will
allow you to sort by z scores for up-regulated or down-regulated genes.
Click the arrowhead that is pointed up in the z score column to sort by up-regulated genes.
We can see at least 20 pathways are up-regulated when mice are fed LRD-5001.
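One common way to compute such an enrichment z score is a formula based on the hypergeometric distribution (the form used, for example, by MAPPFinder). Whether GSAE uses exactly this formula is an assumption here. In the sketch, N is the number of genes measured on the array, R the number of genes in the differentially expressed list, n the number of genes on the array that belong to the pathway, and r the number of pathway genes in the list. For simplicity, the example uses the whole 539-gene list rather than only the up-regulated genes, and the array size of 45,101 is an illustrative value, not a figure taken from this protocol.

```python
import math


def enrichment_z_score(N, R, n, r):
    """z score comparing observed pathway hits (r) with the number expected by chance.

    N: genes measured on the array
    R: genes in the differentially expressed list
    n: genes on the array that belong to the pathway
    r: genes in the list that belong to the pathway
    """
    fraction = R / N          # chance that any one gene is in the list
    expected = n * fraction   # pathway hits expected by chance
    # Hypergeometric variance (sampling without replacement).
    var = n * fraction * (1 - fraction) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(var)


# Pathway counts from this protocol: 19 of 53 pathway genes in the 539-gene list.
# N = 45101 is an assumed array size used only for illustration.
z = enrichment_z_score(N=45101, R=539, n=53, r=19)
print(f"z = {z:.1f}")
```

Fewer than one pathway gene would be expected in the list by chance under these assumptions, so observing 19 gives a z score far above the threshold of 2.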
c. Genes.
Pick one of the top listed pathways and click the corresponding icon in the Genes column.
A new section will appear underneath the name of the pathway. Before proceeding, look
at the values in the List, Totals, and Array columns. We can see in our analysis that the
cytochrome P450 pathway for metabolizing xenobiotics is significantly up-regulated and
contains 19 members from our 539-member gene list. We also see that those members are
all up-regulated and that there are 53 genes on the array that belong to this pathway.
Now, look at the list of genes in the newly opened section. Where we had 19 genes shown
as the value in the list column, there are 26 genes listed below the name of the pathway.
Figure 7.14.5
KEGG pathway results. Click the gene icon to see a list of the genes that belong to a pathway, and click the KEGG icon to see a diagram of the pathway. In the pathway diagram, genes from the list are in green boxes, and genes that are up-regulated are in boxes with red numbers and a red border. The numbered columns in the report are:
(1) The number of genes from the list on the analysis page that belong to this pathway.
(2) The number of genes in this pathway that are up-regulated in group 2 relative to group 1.
(3) The number of genes in this pathway that are down-regulated in group 2 relative to group 1.
(4) The number of genes on the array that belong to this pathway.
(5) The z score for the number of genes that belong to this pathway and are up-regulated. Clicking the red arrow will sort the list by z scores.
(6) The z score for the number of genes that belong to this pathway and are down-regulated. Clicking the green arrow will sort the list by z scores.
Most of the genes have different names, but some appear to be identical. For example,
there are three listings for glutathione-S-transferase, mu 1. Are they really the same gene?
Clicking the gene names shows us that two entries have the same accession number. One
possible explanation for their duplication in the list could be that they are represented
multiple times on the array. It could also be that the probes were originally thought to
belong to different genes and now, with a better map, are placed in the same gene. We
also see that one of the three genes has a different accession number. This entry might
represent a different isoform that is transcribed from the same gene. Many arrays do not
distinguish between alternative transcripts and count them all together. Affymetrix arrays
can also have multiple probe sets for a single gene; in these cases, the gene will appear
multiple times since intensity measurements will be obtained from each probe.
It should also be noted that some genes may belong to multiple KEGG pathways (see
below).
d. KEGG pathways.
Click the KEGG icon to access the KEGG database and view more details for a KEGG
pathway. Once we have identified KEGG pathways with significant changes, we can
investigate further by selecting the links to the individual genes in that pathway or we
can select the KEGG icon to view the encoded enzymes in the context of a biochemical
pathway. Clicking the boxes in the KEGG database takes us to additional information
about each enzyme.
In our experiment, we find that 19 of the 53 genes on the array that belong
to the cytochrome P450 pathway for metabolizing xenobiotics are up-regulated. The KEGG pathway shows
some of the possible substrates for these enzymes. It would be interesting to look more
closely at LRD-5001 and see if it contains naphthalene or benzopyrene, or one of the other
compounds shown in the KEGG pathway. Other pathways that are up-regulated, when mice
are fed LRD-5001 instead of AIN-76A, are pathways for biosynthesis of steroids, fatty acid
metabolism, arachidonic acid metabolism, etc. Down-regulated pathways include those
for pyruvate metabolism and glycolysis.
20. Return to the results window and click Ontology (options described below).
a. Ontology reports.
An overview of the ontology reports and their features is shown in Figure 7.14.6. Three
kinds of ontology reports are available from Ontology: a set organized by biological
process, another by cellular component, and a third by molecular function. Each report
shows a list of ontologies that contain up or down-regulated genes from the list of 539
genes.
i. Ontology. Selecting the name of an ontology allows you to drill down and view subontologies.
ii. Genes. Clicking the icon in the genes column shows the genes from the gene list that
belong to that ontology.
iii. GO. Clicking the GO icon opens the record for the ontology in the AmiGO database.
iv. List. The list column shows the total number of genes, from the gene list, both up- and
down-regulated, that have that ontology as part of their annotation.
v. Totals (up or down). One column contains the number of up-regulated genes in
the list that belong to an ontology. The other column shows the number of down-regulated
genes that belong to that ontology.
vi. Array. This value shows the number of probes on a microarray chip that could correspond to genes in an individual ontology.
vii. z-score. As with the KEGG report, the z-score provides a way to determine whether a
specific ontology is over- or under-represented in the list of differentially expressed genes.
Significant z scores are above 2 or below negative 2. We cannot sort by z scores on the
ontology report pages, but we can sort by z scores from the z score report.
viii. Pie graph. The pie graph depicts the ontologies in the list and the numbers of members.
b. z-score reports.
Each ontology report page contains a link to a z-score report. Where the ontology reports
show ontologies through a hierarchical organization, the z-score report shows all the
ontologies with significant z-scores, without the need to drill down into the hierarchy.
This is helpful both because significant z scores can be hidden inside of a hierarchy, and
because this report allows you to sort by z scores. It should also be noted that some genes
may belong to multiple ontologies.
When we look at the ontology information for our experiment, we can see that the most
significant ontologies in biological processes are metabolism, cellular processes, and
regulation; for cellular components, we see that cells and cell parts are significant, and
for the molecular function ontology, catalytic activity and electron carrier activity are
significant. When we look at the z score report for molecular function and sort our results
by up-regulated genes, we see that many genes show oxidoreductase and glutathione-S-transferase activity, which is consistent with our findings from the KEGG report. Selecting
the Genes icon shows us that those genes are cytochrome P450s. Taking all of our data
together, we can conclude that genes for breaking down substances like xenobiotics are
expressed more highly when mice are fed LRD-5001 than when they are fed AIN-76A.
Figure 7.14.6
Gene ontology reports.
COMPARE GENE EXPRESSION FROM PAIRED SAMPLES OBTAINED
FROM TRANSCRIPTOME PROFILING ASSAYS BY NEXT-GENERATION
DNA SEQUENCING
ALTERNATE
PROTOCOL 1
Several experiments have been published recently where NGS or “Next Gen” technologies were used for transcriptome profiling. NGS experiments have three phases for data
analysis. First, there is a data-collection phase where the instrument captures information,
performs base-calling, and creates the short DNA sequences that we refer to as “reads.”
Next, there is an alignment phase, where reads are aligned to a reference data set and
counted. Last, there is a comparison phase where the numbers of read counts can be used
to gain insights into gene expression. Many of the steps in the last phase are similar to
those used in the analysis of microarray data.
In this protocol, we will describe analyzing data from two NGS data sets and their
replicates. These data were obtained from an experiment to assess the transcriptome
from single cells (mouse oocytes) with different genotypes (Tang et al., 2009). In one
case, wild-type mouse oocytes were used. In the other case, the mouse oocytes had a
knock-out mutation for DICER, a gene required for processing microRNAs.
We will discuss uploading data and aligning the data, view the types of information
obtained from the alignment, and compare the two samples to each other, mentioning
where the NGS data analysis process differs from a pairwise comparison of samples from
microarrays.
Necessary Resources
Software
GeneSifter Analysis Edition (GSAE): a trial account must be established in order
to upload data files to GSAE; a license for the GeneSifter Analysis Edition may
be obtained from Geospiza, Inc. (http://www.geospiza.com)
GSAE is accessed over the Web; therefore, Internet access is required along with
an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or
Apple Safari
Files
Data files may be uploaded from a variety of sequencing instruments. For the
Illumina GA analyzer, the data are text files, containing FASTA-formatted
sequences. Data from the ABI SOLiD instrument are uploaded as csfasta files.
The example NGS data used in this procedure were generated by the ABI SOLiD
instrument and obtained as csfasta files from the GEO database at the NCBI
(Accession number GSE14605).
The csfasta files are obtained as follows. The accession number GSE14605 is
entered in the data set search box at the NCBI GEO database
(http://www.ncbi.nlm.nih.gov/geo/) and the Go button is clicked. The csfasta
files are downloaded for both wild-type mouse oocytes and DICER knockout
mouse oocytes by clicking the links to the file names and clicking (ftp) for the
gzipped csfasta files: GSM365013.filtered.csfasta.txt.gz,
GSM365014.filtered.csfasta.txt.gz,
GSM365015.filtered.csfasta.txt.gz,
GSM365016.filtered.csfasta.txt.gz.
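Before uploading, it can be useful to confirm that each download is intact by counting its reads. The following sketch uses Python's gzip module and assumes the usual csfasta layout, in which comment lines begin with "#", each read has a header line beginning with ">", and the color-space sequence follows on the next line. The function name is hypothetical.

```python
import gzip


def count_csfasta_reads(path):
    """Count reads in a gzipped csfasta file by counting '>' header lines."""
    reads = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # comment or metadata line
            if line.startswith(">"):
                reads += 1
    return reads
```

For example, count_csfasta_reads("GSM365013.filtered.csfasta.txt.gz") would report the number of reads in the first of the four files listed above. A truncated download typically raises an error when the gzip stream ends early.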
1. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net).
Uploading data
2. Locate the Import Data heading in the Control Panel and click Upload Tools.
The uploading and processing steps described for Next Gen data sets require a license from
Geospiza. However, you may access data that have already been uploaded and processed
from a demonstration site. The demonstration site can be accessed from the data center at
http://www.geospiza.com.
3. Click the Next Gen File Upload button to begin uploading Next Gen data.
4. Enter a name for a folder.
Folders are used to organize Next Gen data sets.
5. Click the Next button.
6. Two windows will appear for managing the upload process. Use the controls in the
left window to locate your data files. Once you have found your data files, select them
with your mouse and click the blue arrowhead to move those files into the Transfer
Queue.
7. Once the files you wish to transfer are in the Transfer Queue, highlight those files and
click the blue arrow beneath the Transfer Queue window to begin transferring data.
Transferring data will take a variable amount of time depending on your network, the
volume of network traffic, and the amount of data you are transferring. A 2-GB Next Gen
data set will take at least 40 min to upload.
Aligning Next Gen data to reference data
Once the data have been uploaded to GSAE, the reads in each data set are aligned
to a reference data source. During this process, the number of reads mapping to each
transcript is counted and normalized to reads per million reads (RPM)
so that data may be compared between experiments.
8. Access uploaded Next Gen data sets by clicking Next Gen in the Inventories section
of the control panel.
9. Use the checkboxes to select data sets for analysis, then click the Analyze button on
the bottom right side of the table.
A new page will appear.
10. Choose the Analysis Type, Reference species, and a Reference Type from the corresponding pull-down menus.
a. Analysis Type: The Analysis Type is determined by the kind of data that were
uploaded and the kind of experiment that was performed. For example, if you
uploaded SOLiD data, analysis options specific to that data type would appear as
choices in the menu. For SOLiD data, the alignment algorithm is specific for data
in a csfasta format. Choose RNA-Seq (SOLiD, 3 passes).
b. Reference Species: The Reference Species is determined by the source of your
data. If your data came from human tissues, for example, you would select “Homo
sapiens” as the reference species. Since our data came from mouse, choose “Mus
musculus.”
c. Reference Type: The choices for Reference Type are made available in the Reference Type menu after you have selected the analysis type and reference species.
The Reference Type refers to the kind of reference data that will be used in the
alignment. Since we are measuring gene expression, choose “mRNA” as the reference type. This reference data set contains the RNA sequences from the mouse
RefSeq database at the NCBI.
11. Click the checkbox for “Create Experiment(s) upon completion.”
This selection organizes your data as an experiment, allowing you to compare expression
between samples after the analysis step is complete. In order to set up experiments,
GeneSifter must already contain an appropriate Gene Set. A Gene Set is derived from the
annotations that accompany the reference data source.
12. Click the Analyze button to queue the Next Gen data set for analysis.
13. The analysis step may take a few hours depending on the size of your data file and
the number of samples that need to be processed.
Figure 7.14.7 Analysis results from NGS data, obtained from an ABI SOLiD instrument. The figure shows the Analysis Job Details page, the Read Mapping Statistics page (including a chart of uniquely mapped reads by chromosome, 1 to 19 and X), and the Gene List page.
Job Info: Job ID, 51; Analysis Type, RNA-Seq (SOLID, 3 passes); Input File, GSM365013.filtered.csfasta.txt; State, Complete; Initiated, 2009-05-15 13:33:10; Completed, 2009-05-15 14:26:43; Remote ID, 2032; Experiment, GSM365013 -1-; Reference Species, Mus musculus; Reference Type, mRNA.
Analysis Results links: Read alignment statistics (HTML), Gene list (Text), Gene list (HTML), Job log file, Standard error, Standard output.
Read Mapping Statistics:
Total number of reads: 12537930
Number of unmapped reads: 1993626 (15.90% of all reads)
Number of mapped reads: 10544304 (84.10% of all reads)
Non-uniquely mapped reads: 1759919 (14.04% of all reads)
Uniquely mapped reads: 8784385 (70.06% of all reads)
Uniquely mapped reads with 0 mismatches: 6163136 (70.16% of uniquely mapped reads)
Uniquely mapped reads with 1 mismatches: 1746203 (19.88% of uniquely mapped reads)
Uniquely mapped reads with 2 mismatches: 875046 (9.96% of uniquely mapped reads)
Uniquely mapped to ribosomal, 2 mismatches: 0 (0.00% of uniquely mapped reads)
Gene List for genes.txt (first five rows):
Reads | RPKM | RPM | Entrez | RefSeq ID | Title | Gene ID | Chrom. | Type
137275 | 12756.87 | 15627.16 | 12049 | NM_013479 | Bcl2-like 10 | Bcl2l10 | 9 | mRNA
115536 | 11806.49 | 13152.43 | 171506 | NM_138311 | H1 histone family, member O, oocyte-specific | H1foo | 6 | mRNA
103361 | 8452.91 | 11766.45 | 21432 | NM_009337 | T-cell lymphoma breakpoint | Tcl1 | 12 | mRNA
87655 | 2412.60 | 9978.50 | 20729 | NM_011462, NM_146043 | spindlin 1 | Spin1 | 13 | mRNA
80718 | 4851.53 | 9188.80 | 72114 | NM_028106 | Zinc finger, BED domain containing 3 | Zbed3 | 13 | mRNA
Viewing the Next Gen alignment results
14. When the alignment step is complete, you will be able to view different types of
information about your samples. Click the file name to get to the analysis details
page for your file, then click the Job ID to get the information from the analysis.
15. The exact kinds of information will depend on the data type and the algorithms that
were used to align the reads to the reference data source (Fig. 7.14.7). The types
of information seen from Illumina data will be described in the next protocol. For
SOLiD data, you will see information that includes:
a. Read alignment statistics: These include the total number of reads and the numbers
that were mapped, unmapped, or mapped to multiple positions. Sets of reads can
also be downloaded from the links on this page.
b. Gene list (text): A gene list can be downloaded as a text file after the alignment is
complete.
c. Gene list (html): The gene list (html) shows a table with information for all the
transcripts identified in this experiment.
i. Reads: A read is a DNA sequence obtained, together with several other reads, from a
single sample. Typical reads from Next Gen instruments such as the ABI SOLiD and the
Illumina GA are between 25 and 50 bases long. The number of reads in the first column
equals the number of reads from a single sample that were aligned to the reference data
set, in this case, RefSeq RNAs.
ii. RPKM: Reads per thousand bases, per million reads. This column shows the number
of reads for a given transcript divided by the length of the transcript and normalized to
1 million reads. Dividing the number of reads by the transcript length corrects for the
greater number of reads that would be expected to align to a longer molecule of RNA.
iii. RPM: Reads per million reads.
iv. Entrez: This column contains links to the corresponding entries in the Entrez Gene
database.
v. Image maps: Image maps are used to show where reads align to each transcript. The
transcripts in these images are all different lengths.
vi. RefSeq ID: The RefSeq accession number for a given transcript.
vii. Title: The name of the gene from RefSeq.
viii. Gene ID: The symbol for that gene.
ix. Chrom: The chromosomal location for a gene.
x. Type: The type of RNA molecule.
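The RPM and RPKM columns can be reproduced from the read counts with a few lines of arithmetic. The sketch below follows the definitions given above. The choice of the uniquely mapped read total as the denominator is an assumption that happens to match the first row of Figure 7.14.7, and the 2000-base transcript in the second example is hypothetical.

```python
def rpm(read_count, total_reads):
    """Reads per million reads."""
    return read_count / total_reads * 1_000_000


def rpkm(read_count, transcript_length, total_reads):
    """Reads per thousand bases of transcript, per million reads."""
    return rpm(read_count, total_reads) / (transcript_length / 1000)


# Values read from Figure 7.14.7: 137275 reads for NM_013479 and
# 8784385 uniquely mapped reads. Using the uniquely mapped total as the
# denominator is an assumption that is consistent with the figure.
print(round(rpm(137275, 8784385), 2))

# Hypothetical 2000-base transcript, to show the length correction.
print(round(rpkm(500, 2000, 10_000_000), 2))
```

Dividing by the transcript length in kilobases is what makes a long and a short transcript comparable when they are expressed at the same level.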
Comparing paired samples and finding differentially expressed genes
In the next step, the numbers of reads mapping to each transcript are compared in order to
quantify differential gene expression between the samples. This process is similar to the
process that we used in Basic Protocol 1; we will set up our analysis, apply statistics to
correct for multiple testing, then view the results from the scatter plot, KEGG pathways,
and ontology results to explore the biology.
16. Locate the Analysis section in the GSAE Control Panel and select Pairwise.
A list of potential array/gene sets will appear. The gene sets correspond to the results
from analyzing Next Gen data. Clicking the name of a gene set will allow you to view the
samples that belong to that set.
17. To set up the analysis, either click the spyglass on the left of a gene set, or click the
name of the gene set and choose “Analyze experiments from this array” from the
middle of the window.
A page will appear where you can assign samples to a group and pick the analysis settings.
18. Use the checkboxes to assign one sample (or set of samples) to group 1 (these are
often the control samples) and the other sample (or set of samples) to group 2.
Assign the two sets from wild-type mouse oocytes to group 1 and the two sets from the
DICER knock-out oocytes to group 2.
19. Use the pull-down menus to select the advanced analysis settings.
a. Normalization.
This step involves normalizing data for differences in signal intensity within and between
arrays. This type of normalization process does not apply to Next Gen data since Next
Gen measurements are derived from the number of reads that map to a transcript instead
of the intensity of light.
Next Gen sequence data are normalized by GSAE but this happens during the alignment phase. During the alignment process, the number of reads from each experiment is
normalized to the number of mapped reads per million reads (RPM). This allows data
from different experiments to be compared.
For this example, choose “None” from the menu.
b. Statistics.
The statistical tests available from this menu are used to determine if the differences
between the mean read counts (or intensity measurements, in the case of
microarrays), from a set of replicate samples, are significant. The significance levels are
reported as p values, i.e., the probability of seeing a difference at least this large by
chance if there were no true difference between the groups.
For this example, choose “t test” for the statistics.
c. Quality.
For Next Gen data, the quality values correspond to the number of reads per million
reads (RPM) for a transcript and range from 0.5 to 100.
For this example, set the quality at “1”, meaning that we will only look at transcripts
where the RPM value is at least 1 in at least one of the samples being compared.
d. Show genes that are up-regulated and/or down-regulated.
Selecting the checkboxes allows you to choose whether to limit the view to up-regulated
or down-regulated genes, or to show both types.
For this example, check both boxes.
e. Threshold, Lower.
The threshold corresponds to the fold-change. For this example, choose 1.5 as the lower
threshold.
f. Threshold, Upper.
This option is usually set to “none”; however, if you wish to filter out genes with
very large changes in expression, you might set an upper threshold. For this example,
leave the upper threshold at “none.”
g. Correction.
For this example, choose the Benjamini and Hochberg correction.
h. Data Transformation.
Use these buttons to choose whether the data will be log transformed or not. Log transformations are often used with microarray data to make the data more normally distributed.
For this example, apply a log transformation to the data.
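The quality and threshold settings chosen in step 19 act together as a filter on each transcript. The sketch below illustrates that logic under stated assumptions: fold change is taken as the ratio of the group means of RPM values, and the quality rule is applied to individual samples. GSAE's internal calculation is not described here and may differ, so the function names and defaults are illustrative only.

```python
import math
from statistics import mean


def fold_change(group1, group2):
    """Return the magnitude of the change between two group means (>= 1)."""
    m1, m2 = mean(group1), mean(group2)
    if m1 == m2:
        return 1.0
    if m1 == 0 or m2 == 0:
        # One group has no reads at all: treat the change as unbounded.
        return math.inf
    ratio = m2 / m1
    return ratio if ratio >= 1 else 1 / ratio


def passes_filters(group1, group2, quality=1.0, lower=1.5):
    """Apply the quality (RPM) and lower fold-change cut-offs to one transcript."""
    # Quality: at least one sample must reach the RPM cut-off.
    if max(list(group1) + list(group2)) < quality:
        return False
    # Threshold: keep up- and down-regulated transcripts alike.
    return fold_change(group1, group2) >= lower


# Made-up RPM values for two wild-type and two knockout samples.
print(passes_filters([10, 12], [20, 22]))
print(passes_filters([0.2, 0.3], [0.6, 0.9]))
```

The second call fails the quality rule even though its fold change is large, which is the situation the quality setting is meant to screen out.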
20. Click the Analyze button. When the analysis is complete, the results page will
appear.
Viewing the results
The results page shows the two groups of samples that were compared and the conditions
that were used for the comparison. All the genes that varied in expression by at least
1.5-fold are listed in a table on this page.
21. Choose “adjusted p” from the last menu and click the Search button.
Adjusted p values are the p values obtained after the multi-test correction (in our case,
Benjamini and Hochberg) has been applied.
In this analysis, choosing the adjusted p value decreases the number of differentially
expressed genes from 1449 to 28. As noted earlier, although the multiple testing correction
provides a way to sort genes by the significance, genes that truly change may be missed
when these corrections are applied. To view additional genes that may be candidates for
study, you can raise the cut-off limit for the adjusted p values, using the pull-down menu,
or skip the multiple test correction altogether.
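To see why the adjusted p values are larger than the raw ones, it helps to work through the Benjamini and Hochberg procedure on a short list. The sketch below is a standard textbook formulation of the step-up adjustment and is offered only as an illustration; it is not taken from GSAE, and the p values in the example are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    # Indices of the p values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values never
    # decrease as the raw p value increases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p values for four genes.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p value is multiplied by the number of tests and divided by its rank, so genes with the smallest p values are penalized most heavily relative to their starting value.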
Interpreting the results
After adjusting the p value, only 28 genes in our set show significant changes. It is
helpful at this point to save our results before proceeding on to further analyses. Since
the reports that we would use next (the scatter plot, KEGG pathway information, and
ontology reports) are the same as in Basic Protocol 1, we will leave it to the reader to
refer to the earlier protocol for instruction. The one point we would like to discuss here
is interpreting the gene summary and the differences between the gene summaries for
microarray and Next Gen data.
Each gene in the list is accompanied by a summary that can be accessed by clicking
the gene name. The summary page presents information about expression levels at the
top and links to external databases in the bottom half. Summaries from both microarray
data and Next Gen data (Fig. 7.14.8) show the number of samples (N), along with the
values for each sample and the standard error of the mean. Where the two kinds of
summaries differ is in intensity and quality values. For microarray data, the columns
labeled “intensity values” do show the intensity data. If the data were log transformed
during the upload process or the analysis, then the log-transformed values are reported.
For Next Gen data, however, the values in the “intensity values” column are not intensity
values. When Next Gen data are used, these values refer to the normalized number of
reads that were mapped to a gene (RPM). If the data were log transformed during the
analysis, then these values are the log-transformed values.
Figure 7.14.8 Gene summaries for microarray and NGS data. A gene summary from a microarray sample is shown in the top half of the image and a summary for a sample analyzed by NGS
is shown in the bottom half. Note the difference between the intensity and quality values.
The other difference between these data for the two systems is in the quality column.
For Next Gen data, the quality column shows the RPM value for that gene. In the quality
column for the Next Gen data, two of the samples show quality values of zero. This
means that no reads mapped to the gene in those samples. The other two samples show
values around 6, indicating that approximately 6 reads per million reads mapped to the
Drebrin-like gene.
BASIC
PROTOCOL 2
COMPARING GENE EXPRESSION FROM MICROARRAY EXPERIMENTS
WITH MULTIPLE CONDITIONS
GSAE has two modes for analyzing data, depending on the number of conditions that are
compared. If two conditions are compared, such as treated and untreated, or wild-type and
mutant samples, then a pairwise analysis, as described in Basic Protocol 1, is used to
compare the results. If an experiment involves multiple conditions, such as a time course,
different drug dosing regimes, and perhaps even different genotypes, then the analysis
is considered a project. GSAE projects offer additional analysis capabilities as well as
different statistical procedures for identifying significant changes in
expression. Tests that can be performed with GSAE include a one-way ANOVA,
a two-way balanced ANOVA, and a non-parametric Kruskal-Wallis test. Corrections for
multiple testing such as those from Bonferroni, Holm, and Benjamini and Hochberg can
also be applied. Additional analyses include clustering and using the Pearson coefficient to
look for patterns of expression. Specific searches for genes by name, characteristic, or
function can also be performed.
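The Bonferroni and Holm corrections mentioned above control the chance of any false positive and are therefore stricter than Benjamini and Hochberg. The sketch below shows the standard textbook form of both adjustments so that their effect on a list of p values can be compared. It is an illustration with made-up numbers, not a description of how GSAE implements them.

```python
def bonferroni(p_values):
    """Multiply every p value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Return Holm step-down adjusted p values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, p_values[i] * (m - rank))
        # Adjusted values may not decrease as the raw p value increases.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Made-up raw p values for three genes.
raw = [0.01, 0.04, 0.03]
print(bonferroni(raw))
print(holm(raw))
```

Holm never gives a larger adjusted value than Bonferroni for the same gene, which is why it is often preferred when strict control of false positives is wanted.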
In Basic Protocol 2, we describe a general procedure (shown in Fig. 7.14.9) for analyzing microarray data from specimens that were obtained from different treatments.
An alternative procedure (Alternate Protocol 2) follows in which we will demonstrate
a multiple-condition analysis with Next Gen data from the Illumina GA analyzer. The
samples used in Basic Protocol 2 were obtained from the GEO database. These samples
came from the same study described in Basic Protocol 1. RNA was isolated from mouse
livers where two factors were examined: diet and arsenic in the drinking water. Over
a 5-week period, the mice were fed two kinds of food, AIN-76A, a purified diet, or
LRD-5001, a standard laboratory mouse food, and given arsenic in their water at three
different concentrations (0, 10 ppb, or 100 ppb). There were four to five biological replicates (mice) for each treatment. We will demonstrate setting up the analysis, applying
statistics and multiple testing corrections, and using some of the clustering tools. Some
of the clustering methods, PAM and CLARA, will be discussed in Alternate Protocol 2
rather than Basic Protocol 2.
Necessary Resources
Software
GeneSifter Analysis Edition (GSAE): a trial account must be established in order
to upload data files to GSAE; a license for the GeneSifter Analysis Edition may
be obtained from Geospiza, Inc. (http://www.geospiza.com)
GSAE is accessed over the Web; therefore, Internet access is required along with
an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or
Apple Safari.
Files
Data files from a variety of microarray platforms may be uploaded and analyzed in
GSAE, including Affymetrix, Illumina, Codelink, or Agilent arrays, and custom
chips
Figure 7.14.9 Overview of an experiment comparing multiple conditions. The flowchart, labeled “multiple sample comparison,” shows twelve mice divided among four conditions (labeled no treatment; treatment, 10 ppb As; treatment, 100 ppb As; and treatment, 1000 ppb As), followed by these stages: isolate RNA; microarrays; upload data, normalize; identify differential expression (fold change, quality, ANOVA, and multiple testing correction by Bonferroni, Benjamini and Hochberg, or others); and visualize results and explore biology (hierarchical clustering, PCA, partitioning with PAM and silhouettes, ontology, KEGG, and scatter plot).
The example data used in this procedure were CEL files from an Affymetrix 430
2.0 Mouse array and were obtained from the GEO database at the NCBI
(Accession code GSE 9630).
CEL files are the best file type for use in GSAE. To obtain CEL files, go to the
GEO database at the NCBI (www.ncbi.nih.gov/geo/).
Enter the accession number (in this case GSE 9630) in the section labeled “Query”
and click the Go button.
In this example, all the files in the data set are downloaded as a single tar file by
selecting (ftp) from the Download column at the bottom of the page.
After downloading to a local computer, the files are extracted, unzipped, and
uploaded to GSAE as described in the instructions.
Files used, by group:
AIN-76A with 0 ppb arsenic: GSM243398, GSM243405, GSM243391, GSM243358, and GSM243376
AIN-76A with 10 ppb arsenic: GSM243359, GSM243400, GSM243403, GSM243406, and GSM243410
AIN-76A with 100 ppb arsenic: GSM243353, GSM243365, GSM243369, and GSM243377
LRD-5001 with 0 ppb arsenic: GSM243394, GSM243397, GSM243378, GSM243382, and GSM243355
LRD-5001 with 10 ppb arsenic: GSM243374, GSM243380, GSM243381, GSM243385, and GSM243387
LRD-5001 with 100 ppb arsenic: GSM243354, GSM243356, GSM243383, GSM243390, and GSM243392
A demonstration site with the same files and analysis procedures can be viewed
from the data center at http://www.geospiza.com.
Uploading data
1. Create a zip archive from your microarray data files.
a. If using a computer with a Microsoft Windows operating system, a commonly
used program is WinZip.
b. If using Mac OS X, select your data files, click the right mouse button, and choose
“Compress # Items” to create a zip archive.
The resulting archive file will be called Archive.zip.
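When many CEL files are involved, the archive can also be built with a short script instead of a desktop tool. The sketch below uses Python's standard `zipfile` module. The function name and the folder and archive names in the usage note are hypothetical, and nothing here is specific to GSAE beyond producing a single zip file.

```python
import zipfile
from pathlib import Path


def zip_cel_files(folder, archive_path):
    """Write every .CEL file found in `folder` into one zip archive."""
    folder = Path(folder)
    # Collect the CEL files first, matching the extension case-insensitively.
    cel_files = sorted(
        p for p in folder.iterdir() if p.suffix.lower() == ".cel"
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in cel_files:
            # Store only the file name so the archive has no nested folders.
            archive.write(path, arcname=path.name)
    return [p.name for p in cel_files]
```

For example, `zip_cel_files("GSE9630_files", "Archive.zip")` would return the list of CEL file names that were added, which can be checked against the sample list before uploading.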
2. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net).
3. Locate the Import Data heading in the Control Panel on the left-hand side of the
screen and click Upload Tools.
Several types of microarray data can be uploaded and analyzed in GSAE. See Basic Protocol 1 for detailed descriptions. Our data were uploaded using the option for “Advanced
upload methods” and normalized with GC-RMA.
4. On the page that appears, click the Run Advanced Upload Methods button.
5. Select the normalization method and the array type from pull-down menus. For this
example, use GC-RMA and the 430 2.0 Mouse array. Click the Next button.
6. In the screen which now appears, browse to locate the data file on your computer.
7. Choose the option (radio button) for Create Groups, Create New Targets, or Same as
File Name.
Our data came from mice that were given two different kinds of food and drinking water
with three concentrations of arsenic, so six groups were created. Therefore, set 6 as the
value next to Create Groups.
8. Click the Next button.
The screen for the next step will appear after the data are uploaded.
9. On the screen displayed in “step 3 of 4,” you will be asked to enter a title for your
data set, assign a condition to each group, add labels to your samples if desired, and
identify which sample(s) belong to which group.
In our case, we have six conditions (see Table 7.14.1), with four to five biological replicates
(targets) for each condition. We kept the original file names as the target or sample names.
Setting up a project for analysis
10. We begin the analysis process by creating a project. Select New Project from the
Create New section, add a title for the project, click the checkbox next to the array
that contains the samples, and click the Continue button.
Table 7.14.1 Conditions Used for the Example in Basic Protocol 2
Condition   Mouse food   Arsenic in water (ppb)
1           AIN-76A      0
2           AIN-76A      10
3           AIN-76A      100
4           LRD-5001     0
5           LRD-5001     10
6           LRD-5001     100
11. Enter the name for the control group as the group name and any descriptive information in the Description field.
12. Choose a Normalization option.
Leave the setting at None because our data were log transformed and normalized when
we used GC-RMA during the upload process.
13. Choose a Data Transformation option.
Leave the setting at “Data already log transformed” because our data were log transformed and normalized when we used GC-RMA during the upload process.
14. Select a group for a control sample and use the arrow button to move that group to
the box on the right side.
Choose AIN-76A with 0 ppb arsenic as the control group.
15. Select the other groups that will be part of the analysis and move them to the right
side by clicking the arrow button.
16. Click the Create Group button.
17. Next, select the samples for each condition. Select all the experiments and click the
Create Group button.
A new page will appear with a list of all the conditions and all the samples for each
condition.
18. Choose the samples that will be used in the analysis. You may choose the samples
one by one, or if all the samples will be used, click Select All Experiments.
Click Select All Experiments.
19. Click the Create Group button.
A small window will appear while data are processing. When the processing step is
complete, a new page will appear stating that your project has been created. From this
point, you can continue the analysis by selecting Analyze This Project or you can analyze
the project at a later time.
Identifying differential gene expression
20. Select Projects from the Analysis section of the Control Panel.
21. Choose the project name to review the box plots for the samples and replicates in the
project.
When we analyze multiple samples, GSAE creates box plots that allow us to evaluate the
variation between experimental groups and the replicate samples within each group. The
box plot, also known as a “box and whiskers plot,” shows the averaged data either from a
group of replicate samples or from the intensity values for a single sample. The line within
the box represents the median value for the data set. The ends of the whiskers show the
highest and lowest values. If a box and whiskers graph is made from data with a normal
distribution, the graph would look like the box plot in Figure 7.14.10.
Box plots are helpful for quality control. If we find a box plot with a different median value
from the other samples, it could indicate a problem with that sample or array.
a. Locate the Project Info section in the Project Details page and click Boxplot.
A box plot will appear, as shown in Figure 7.14.11A, with plots representing all six of the
different conditions. Notice that all six of the plots have similar shapes and similar values.
b. Return to the Project Details page. Locate the bottom section, entitled Group Info.
Each of the conditions in this section has four or five replicates and a
box plot (Fig. 7.14.11B). The box plot link for this section opens a window with
a box for each replicate. Click the box plot link for some of the replicates to see
if the replicates are similar or if any of the replicates appear to be different from
other members of the group.
Figure 7.14.10 Box plot. A box-and-whiskers plot showing a normal distribution of data. Labels: highest value, 3rd quartile, median, 1st quartile, lowest value, and range.
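The values that a box plot displays, and the quality-control check described in step 21, can be sketched in a few lines. The code below computes the lowest value, quartiles, median, and highest value for one sample and then flags samples whose median is shifted from the rest. The tolerance of 1 log2 unit and the sample names are hypothetical choices made for illustration; GSAE does not expose such a function.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (lowest, 1st quartile, median, 3rd quartile, highest)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def flag_shifted_medians(samples, tolerance=1.0):
    """Name the samples whose median differs from the typical median.

    samples: dict mapping a sample name to its list of log2 intensities.
    tolerance: hypothetical cut-off, in log2 units, chosen for illustration.
    """
    medians = {name: median(values) for name, values in samples.items()}
    typical = median(medians.values())
    return [name for name, m in medians.items() if abs(m - typical) > tolerance]


# Made-up log2 intensities for three replicates.
replicates = {
    "rep1": [6, 7, 8],
    "rep2": [6.1, 7.1, 8.1],
    "rep3": [9, 10, 11],
}
print(five_number_summary(replicates["rep1"]))
print(flag_shifted_medians(replicates))
```

A flagged replicate is only a prompt to look more closely at that sample or array; it does not by itself show that the data are unusable.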
22. Click Analyze This Project to begin the analysis.
The Pattern Navigation page appears. From the Pattern Navigation page, you may view all
the genes, or limit the genes to those that meet certain criteria for fold change, statistics,
or a certain pattern of expression. Additional options from the Gene Navigation link allow
genes to be located by name, chromosome, or accession number, and options from the
Gene Function link allow them to be located by ontology or KEGG pathway. Statistics
can also be applied to limit the results.
23. Locate the Search by Threshold section and set the threshold choices. Choose 1.5 for
the Threshold, ANOVA for the statistics, and Benjamini and Hochberg to correct for
the false discovery rate. Click the Exclude Control Probes checkbox, then click the
Search button.
Clicking Show All Genes gives 45,101 results. Returning to this page and making the
choices listed here cuts the number of genes to 921.
The results page appears after the threshold filtering is complete. At this point, you can
either save these results and return or continue the analysis.
There are a variety of paths we can follow from this point, as shown in Figure 7.14.12. We
can view the ontology or KEGG reports, as discussed in Basic Protocol 1. We can also use
clustering to group related genes, or we can change the p cutoff value to limit the number
of genes even further.
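The ANOVA selected in step 23 asks, for each gene, whether the variation between the condition means is large compared with the variation among replicates within a condition. The sketch below computes the one-way ANOVA F statistic for a single gene using the standard sums-of-squares formulation. Converting F to a p value requires the F distribution, which is left out here, and the example values are made up.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA across `groups`.

    groups: list of lists, one list of replicate values per condition.
    Assumes at least two groups and some variation within the groups.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the replicates around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up log2 expression values for one gene in three conditions.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

A large F value means the conditions differ more than the replicates do, which is what allows a gene to pass the statistical filter once the multiple testing correction has been applied.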
Using clustering to identify patterns of differential gene expression
24. Choose PCA from the Cluster options.
The PCA option performs Principal Component Analysis (PCA), a method that is often used
alongside clustering. PCA allows you to evaluate the similarities between samples by
identifying the directions where variation is maximal. The idea behind PCA is that much
of the variation in a data set can be explained by a small number of derived variables
(the principal components).
In Figure 7.14.12, we can see that principal component analysis breaks our conditions up
into three groups. One group contains all of the LRD-5001 samples, one group contains
the AIN-76A samples, and another group contains the AIN-76A sample that was treated
with 100 ppb arsenic. These results tell us that the greatest difference between the groups
resulted from the food.
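The idea of finding the direction of maximal variation can be illustrated with a small calculation. The sketch below finds only the first principal component, using power iteration on mean-centered data in plain Python. This is a teaching simplification with made-up numbers and is not the algorithm GSAE uses, which is not described in this unit; a numerical library would normally be used for real data.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: one list of expression values per sample (samples x genes).
    """
    n, g = len(rows), len(rows[0])
    # Center each gene (column) on its mean across samples.
    col_means = [sum(r[j] for r in rows) / n for j in range(g)]
    centered = [[r[j] - col_means[j] for j in range(g)] for r in rows]

    # Start from an arbitrary non-uniform vector. In rare cases a start
    # vector can be orthogonal to the answer, which this sketch ignores.
    v = [float(j + 1) for j in range(g)]
    norm = math.sqrt(sum(w * w for w in v))
    v = [w / norm for w in v]

    # Power iteration on X^T X, computed as X^T (X v).
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [
            sum(centered[i][j] * scores[i] for i in range(n)) for j in range(g)
        ]
        norm = math.sqrt(sum(w * w for w in new_v))
        if norm == 0:
            break
        v = [w / norm for w in new_v]

    # Project each sample onto the component to get its score.
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Made-up values: two samples from one diet, two from another, three genes.
samples = [
    [10.0, 2.0, 5.0],
    [11.0, 1.0, 5.2],
    [2.0, 9.0, 5.1],
    [1.0, 10.0, 4.9],
]
direction, scores = first_principal_component(samples)
print(scores)
```

In this toy data set the first two samples receive scores of one sign and the last two the opposite sign, which mirrors the way the diet separates the groups in the PCA plot. The sign of a principal component is arbitrary, so only the grouping matters.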
Figure 7.14.11 Box plots from a multiple-condition experiment. (A) Box plots from the six conditions that were compared in Basic Protocol 2 (AIN-76A or LRD-5001 food, with 0, 10, or 100 ppb arsenic in the water). Each plot represents the averaged data from the
four to five replicates from each treatment. (B) Box plots from biological replicates. Replicates
from the AIN-76A, 0 ppb arsenic samples are shown (AIN-76A -1-, -5-, -7-, -8-, and -11-). The vertical axis in both panels runs from 0 to 18.
25. Return to the results page and choose Samples from the Cluster options.
These results also show us that the groups of samples are divided by the kinds of food they
received. The mice that ate the LRD-5001 show patterns of gene expression more similar
to each other than to the patterns from the mice that ate AIN-76A.
We also see that the AIN-76A samples that had 100 ppb of arsenic were more different
from the AIN-76A samples without arsenic than the LRD-5001 samples were from each
other.
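Clustering samples rests on a measure of how similar two expression profiles are. The sketch below uses one common choice, a distance of 1 minus the Pearson correlation, and finds the closest pair of samples, which is the first merge an agglomerative clustering would make. The distance measure is an assumption, since the one used by GSAE is not stated here, and the sample names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def closest_pair(profiles):
    """Return the two sample names with the smallest correlation distance."""
    names = sorted(profiles)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1 - pearson(profiles[a], profiles[b])
            if best is None or distance < best[0]:
                best = (distance, a, b)
    return best[1], best[2]


# Made-up expression profiles over four genes.
profiles = {
    "LRD_0": [1, 2, 3, 4],
    "LRD_10": [1.1, 2.1, 2.9, 4.2],
    "AIN_0": [4, 3, 2, 1],
}
print(closest_pair(profiles))
```

Repeating this merge step, each time treating the merged samples as one cluster, is what produces the tree shown beside a clustered view.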
Figure 7.14.12 Analyzing the results from comparing multiple samples. Panels show the summary, ontology, and KEGG reports, the PCA plot, and the views clustered by genes and clustered by sample.
26. Return to the results page and select Genes from the Cluster options.
Clustering by genes produces an image consistent with our earlier results. On the right
half, where the mice were fed LRD-5001, the three conditions show similar patterns of
expression. The samples in the left half are also similar to each other, although it appears
that some genes have changed in the sample with 100 ppb arsenic.
If we look more closely at the genes that appear to be up-regulated in the LRD-5001 mice,
we can see that many of the genes belong to the cytochrome P450 family.
Examining differential gene expression in a specific gene family
The user may decide to look further at the cytochrome P450s that were induced by
LRD-5001 to see if patterns of expression can be discerned.
27. Click Pattern Navigation, located on the right top corner of the page.
28. Locate the Project Analysis section in the bottom half of the page and click Gene
Navigation.
Figure 7.14.13 Gene-specific navigation. Callouts in the figure read “enter part of a gene name” and “cluster by gene.”
a. Enter the gene symbol in the Name textbox as shown in Figure 7.14.13.
b. Choose Match Any Word from the Option pull-down menu.
c. Choose ANOVA from the Statistics menu.
d. Click the Search button.
A page will appear when the filtering is complete. It will indicate that 20 genes matched
this query. At this point, clustering with the Gene option lets us see which of the cytochrome
P450 genes are up-regulated in the presence of LRD-5001.
To understand this phenomenon further, we could use the ontology reports and KEGG
pathways to learn about the specific roles that these cytochrome P450s play in metabolism
and why they might be up-regulated when mice are fed LRD-5001. We could also use a
2-way ANOVA.
ALTERNATE
PROTOCOL 2
COMPARE GENE EXPRESSION FROM NEXT-GENERATION DNA
SEQUENCING DATA OBTAINED FROM MULTIPLE CONDITIONS
This protocol discusses a general method for analyzing samples from Next Generation
DNA sequencing experiments that represent different conditions. In this example, we
will compare replicate samples (n = 3) from three different tissues: brain, liver, and
muscle. We will also discuss using partitioning to cluster data by the pattern of gene
expression.
Necessary Resources
Software
GeneSifter Analysis Edition (GSAE): a trial account must be established in order
to upload data files to GSAE; a license for the GeneSifter Analysis Edition may
be obtained from Geospiza, Inc. (http://www.geospiza.com)
GSAE is accessed over the Web; therefore, Internet access is required along with
an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or
Apple Safari.
Files
Data files may be uploaded from a variety of sequencing instruments. Illumina Genome
Analyzer (GA) data are text files containing FASTA-formatted sequences. Data from
the ABI SOLiD instrument are uploaded as csfasta files.
The example data used in this procedure were generated by the Illumina Genome
Analyzer and obtained from the SRA database at the NCBI (accession code
SRA001030).
The data files are obtained as follows. The accession number SRA001030 is
entered in the data set search box at the NCBI Short Read Archive
(http://www.ncbi.nih.gov/sra), and the Go button is clicked. The files for each
tissue type are downloaded by clicking the “Download data for this
experiment” link. After downloading, the text files containing the
FASTA sequences are uploaded to GSAE and processed as described in the
steps that follow.
1. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net).
Uploading data
2. Locate the Import Data heading in the Control Panel and click Upload Tools.
3. Click the Next Gen File Upload button to begin uploading Next Gen data.
4. Enter a name for a folder.
Folders are used to organize Next Gen data sets.
5. Click the Next button.
6. Two windows will appear for managing the upload process. Use the controls in the
left window to locate your data files. Once you have found your data files, select them
with your mouse and click the blue arrowhead to move those files into the Transfer
Queue.
7. Once the files you wish to transfer are in the Transfer Queue, highlight those files and
click the blue arrow beneath the Transfer Queue window to begin transferring data.
Transferring data will take a variable amount of time depending on your network, the
volume of network traffic, and the amount of data you are transferring. Illumina GA data
sets are approximately 250 MB and take at least 10 min to transfer.
Aligning Next Gen data to reference data
Once the data have been uploaded to GSAE, the expression levels for each gene are
measured by aligning the read sequences from the data set to a reference data source and
counting the number of reads that map to each transcript.
8. Access uploaded Next Gen data sets by clicking Next Gen in the Inventories section
of the control panel.
9. Use the checkboxes to select the data sets then click the Analyze button on the bottom
right side of the table.
10. A new page will appear where you can choose analysis settings from pull-down
menus. These settings are the File Type, Analysis Type, Reference Species,
and Reference Type. Choose the appropriate value for each setting, as described
below.
a. File Type: The file type is determined by the instrument that was used to collect
the data.
Since our read data were generated by an Illumina Genome Analyzer, choose Genome
Analyzer.
b. Analysis Type: The Analysis Type is determined by the kind of data that were
uploaded and the kind of experiment that was performed. This setting also allows
you to choose which algorithm to use for the alignment.
Choose RNA-Seq (BWA, 2 MM). This setting uses the Burrows-Wheeler algorithm (Li
and Durbin, 2009) to align the reads with a tolerance setting of 2 mismatches.
c. Reference Species: The Reference Species is determined by the source of your
data.
Since our data came from mouse, choose “Mus musculus.”
d. Reference Type: The choices for Reference Type are made available in the Reference Type menu after you have selected the analysis type and reference species.
The Reference Type refers to the reference data that will be used in the alignment.
Since we are measuring gene expression, pick “mRNA” as the reference type. This reference
data set corresponds to the current build for mouse RefSeq RNA.
11. Click the checkbox for “Create Experiment(s) upon completion.”
This selection organizes your data as an experiment, allowing you to compare expression
between samples after the analysis step is complete.
12. Click the Analyze button to queue the Next Gen data set for analysis.
The analysis step may take a few hours depending on the size of your data file and the
number of samples waiting to be processed.
When the analysis has finished, the information on the right side of the table, in the
Analysis State column, will change to Complete. When the analysis step is complete, you
will be able to view different types of information about your samples.
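The 2-mismatch tolerance chosen in step 10b can be illustrated with a short Python sketch. This is not how BWA works internally (BWA searches an index built with the Burrows-Wheeler transform); the sketch only shows what it means for a read to match a reference position with no more than two mismatches. The sequences and function names are hypothetical.

```python
def count_mismatches(read: str, reference: str, offset: int) -> int:
    """Count positions where the read differs from the reference window at offset."""
    window = reference[offset:offset + len(read)]
    if len(window) < len(read):
        raise ValueError("read extends past the end of the reference")
    return sum(1 for r, w in zip(read, window) if r != w)


def passes_tolerance(read: str, reference: str, offset: int,
                     max_mismatches: int = 2) -> bool:
    """Return True when the read matches with at most max_mismatches differences."""
    return count_mismatches(read, reference, offset) <= max_mismatches


# Hypothetical reference fragment and two hypothetical 10-base reads.
reference = "ACGTACGTTAGCCGATAGGCTA"
read_ok = "TAGCCGTTAG"    # differs from reference[8:18] at one position
read_bad = "TTGCAGTTAG"   # differs from reference[8:18] at three positions

print(passes_tolerance(read_ok, reference, 8))
print(passes_tolerance(read_bad, reference, 8))
```

Under this tolerance the first read would be counted toward the transcript and the second would be rejected at this position.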
Viewing the Next Gen alignment results
13. Click the file name to get to the analysis details page for your file, then click the Job
ID to get the information from the analysis. The kinds of analysis results obtained
depend on the alignment algorithms. The results from processing data from the ABI
SOLiD instrument are described in Alternate Protocol 1. For Illumina data processed
with BWA, we obtain the following kinds of results: gene lists (text and html), a
base composition plot, a list of genes formatted for GSAE, a transcript coverage plot,
and an analysis log (Fig. 7.14.14).
a. Gene lists (text and html): The gene lists show the number of reads that map to
each transcript, along with that count normalized per million reads.
The html version of the gene list includes a graph showing where the reads map,
which is linked to a more detailed map with each base position. Links are provided
to the NCBI RefSeq record.
b. Base composition plot: This graph shows the numbers of each base at each position
and can be helpful for quality control. If sequencing DNA, we would expect the
ratios to be fairly similar. If sequencing single-stranded RNA, we would expect to
see more differences.
c. Transcript coverage plot: The transcript coverage plot shows the number of
transcripts that have a given number of mapped reads. For example, in each case, you can
see there are a large number of transcripts that have only one mapping read.
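The read counts and per-million values reported in the gene lists can be sketched in a few lines of Python. This is a minimal illustration, not GSAE's implementation: the input format (one pair of read and transcript identifiers per mapped read) and the identifiers are hypothetical, and the sketch assumes that normalization divides by the total number of mapped reads, which the protocol does not state.

```python
from collections import Counter


def reads_per_million(alignments):
    """Return {transcript: (raw_count, reads_per_million)}.

    alignments: iterable of (read_id, transcript_id) pairs, one per mapped read.
    Assumption: the denominator is the total number of mapped reads.
    """
    counts = Counter(transcript for _, transcript in alignments)
    total = sum(counts.values())
    return {t: (n, n * 1_000_000 / total) for t, n in counts.items()}


# Hypothetical alignment results for four reads and two transcripts.
alignments = [
    ("read1", "transcript_A"),
    ("read2", "transcript_A"),
    ("read3", "transcript_B"),
    ("read4", "transcript_A"),
]

for transcript, (raw, rpm) in sorted(reads_per_million(alignments).items()):
    print(transcript, raw, rpm)
```

Normalizing per million reads makes counts comparable between samples that were sequenced to different depths.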
Figure 7.14.14 Illumina data.
Setting up a project
14. To compare multiple samples, begin by setting up a project. Find the Create New
section in the Control Panel and click Project.
15. Give the project a title and add a description.
Use “Mouse tissues” for this project.
16. Use the checkboxes to select the arrays that contain your data. These names correspond to the Array/Gene Set names that you assigned to the data sets during the
upload process. If you checked the correct box, you will see the sample names appear in the Common Conditions box. The conditions that appear should match your
experimental treatments.
17. Click the Continue button.
18. Assign a name to this group.
19. Select a normalization method.
Choose None for this example.
20. Use the “Data transformation” menu to select a method for data transformation. Data
transformation options are “no transformation,” “log transformation,” or “already
transformed.” Log transformation compresses the wide range of expression values and
produces a more nearly Gaussian distribution.
For this example, choose “Log transform data” from the menu.
21. In this next step, you will set the condition order. The first group selected acts as a
reference or control group. Changes in gene expression in the other groups are all
measured relative to the first group that is chosen.
a. Decide which group is group 1 and enter the name of that group in the Group
Name box. To do this, select that group in the Conditions box on the left side, and
use the arrow key to move it to the right-hand box.
b. Select the other conditions that you wish to analyze and use the arrow key to move
those to the right-hand box as well.
c. Click the Create Group button.
A new page will appear with a list of all the groups and samples.
22. Select the samples for each condition. We will use all the samples, so click Select All
Experiments, then click the Create Group button.
The processing window will appear while the data are being processed.
23. Once a project has been created, you may analyze the project or create a new project
or new group. These steps can also be completed at a later time.
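The effect of the log transformation and of choosing a reference group can be sketched as follows. This is an illustration under stated assumptions, not GSAE's code: the protocol does not give the base of the logarithm or say how zero counts are handled, so the sketch assumes log base 2 and a pseudocount of 1, and the expression values are hypothetical.

```python
import math
from statistics import mean


def log2_transform(values, pseudocount=1.0):
    """Log-transform expression values; the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(v + pseudocount) for v in values]


def log2_ratio_vs_reference(groups, reference):
    """Mean log2 expression of each group minus that of the reference group."""
    ref_mean = mean(log2_transform(groups[reference]))
    return {
        name: mean(log2_transform(values)) - ref_mean
        for name, values in groups.items()
        if name != reference
    }


# Hypothetical normalized values for one gene, three replicates per tissue.
gene = {
    "brain": [7.0, 7.0, 7.0],
    "liver": [15.0, 15.0, 15.0],
    "muscle": [3.0, 3.0, 3.0],
}

# Brain is listed first, so it acts as the reference group.
print(log2_ratio_vs_reference(gene, reference="brain"))
```

On a log2 scale, a difference of 1 corresponds to a two-fold change relative to the reference group, so the sign of each value shows the direction of the change.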
Comparing samples
24. Locate the Analysis section in the Control Panel, select Projects, and find your project
in the list.
Once you have found your project in the list, you may wish to select the project name to
view some of the project details. You may also wish to view the box plots for these data as
discussed in Basic Protocol 2.
Identifying differential expression
25. To begin the analysis, select Projects from the Analysis section, locate your project
in the list, and either select the spyglass or click the name of your project and then
click Analyze this Project.
The Project summary page appears. From this page, we can choose to view all the
genes or apply filters to locate specific genes by name, chromosome, function, or other
distinguishing features.
26. Choose Show All Genes.
It will take a few moments for the results to appear, especially with large data sets.
The Project results appear. At this point, we see there are 40,009 results. We will need
to apply a threshold and some statistics to select genes that are differentially expressed.
The threshold filter allows us to choose the genes that show at least a minimum change in
expression. Use a threshold of 1.5 for this project.
27. GSAE offers three types of statistical tests (described below) that can be applied at
this point. At least three replicates per group are recommended. A balanced ANOVA
can also be carried out when only one factor is varied (such as time or dose) and there
are equal numbers of replicates for each sample.
a. A standard 1-way ANOVA: This method is used when there is a normal distribution,
the samples show equal variance, and the samples are independent.
b. A 1-way ANOVA for samples with unequal variance: Like the standard 1-way
ANOVA, this method assumes a normal distribution and independent, random
samples.
c. The Kruskal-Wallis test (nonparametric): This method assumes independent random samples but does not make assumptions about the distribution or variance.
Choose the standard 1-way ANOVA for this analysis.
28. After choosing a statistical method, click the Search button. At this point, there are
still over 17,000 results. The advanced analysis methods in GSAE work best with
gene numbers under 5000; consequently, we will use some additional filters to reduce
the number of genes in the list.
a. Apply a correction to limit false discoveries. The options are Bonferroni,
Holm, and Benjamini and Hochberg. Bonferroni is the most stringent, followed
by Holm, with Benjamini and Hochberg allowing more false positives in order to
minimize false negatives. Multiple testing corrections are discussed in detail in
Basic Protocol 1.
Use Benjamini and Hochberg in this example.
b. Apply a p Cutoff. This sets the maximum p value that a gene may have and still be retained.
Set the p-value cutoff at 0.01.
c. Set the quality. For NGS data, the quality corresponds to the number of reads per
million reads.
Set the quality level at 100 to view highly expressed genes that differ between these three
tissues. A quality value of 100 for NGS data corresponds to 100 reads per million sampled.
29. Click the Search button.
30. Now, we have limited the number of genes to 3293. At this point, it is helpful to save
the results so that we can easily return to this point. To do this, click Save and enter
a name and description for this subset of our project.
When saving your project, it is helpful to enter information about the data transformations
or statistical tests that were used during the analysis. For example, if your data were
log transformed, or statistical tests or corrections were applied, it helps to enter this
information in the description field.
31. A page will appear asking if you wish to continue the analysis or analyze the newly
created project. Select “Analyze newly created project” and select Show All Genes
from the Project Summary page.
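The relative stringency of the three corrections offered in step 28a can be seen in a small Python sketch. The functions below follow the standard textbook definitions of the Bonferroni, Holm, and Benjamini and Hochberg adjustments; they are not taken from GSAE, and the p values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply every p value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Step-down adjustment: the i-th smallest p value is multiplied by (m - i)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p value
        running_max = max(running_max, min(1.0, p_values[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(p_values):
    """Step-up adjustment controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_passing(adjusted, cutoff=0.01):
    """Number of genes whose adjusted p value is at or below the cutoff."""
    return sum(1 for p in adjusted if p <= cutoff)


# Hypothetical raw p values for five genes.
raw = [0.0002, 0.0024, 0.004, 0.006, 0.6]

for name, method in [("Bonferroni", bonferroni), ("Holm", holm),
                     ("Benjamini and Hochberg", benjamini_hochberg)]:
    print(name, count_passing(method(raw)))
```

With these example values and the 0.01 cutoff used in this protocol, Bonferroni retains the fewest genes and Benjamini and Hochberg the most, which matches the ordering described in step 28a.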
Visualizing the results
Now, we can begin to use some of the other analysis features in GSAE. The ontology
and KEGG reports were discussed earlier in Basic Protocol 1, and some of the clustering options such as PCA and clustering by samples or genes were described earlier in this protocol. We will use clustering by genes here as well, in order to gain
insights into the possible numbers of genes with related expression patterns. In this
case, clustering by genes suggests that there may be three to four different expression
patterns.
Partition clustering
Two of the advanced clustering methods provided in GeneSifter are PAM (Partitioning
Around Medoids) and CLARA (Clustering Large Applications). Both of these
options are variations of K-means clustering, which is used to break a set
of objects (in this case, genes) into a set of k groups. The clusters are formed by choosing
representative genes, the medoids (the most centrally located members), to act as the
seeds and clustering the other genes around the medoids.
To use the advanced clustering methods such as PAM or CLARA, filters must be
applied to limit the number of genes to below 5000. Two ways to limit the
gene number are to set a lower p value as a cutoff and to raise the threshold. These filters
can be used separately or in combination.
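The idea of clustering genes around medoids can be sketched in Python. This is a simplified illustration of the swap step of partitioning around medoids, not the implementation used by GSAE: it starts from a naive choice of medoids (the first k genes, an assumption made for brevity) and repeatedly swaps a medoid for a non-medoid whenever that lowers the total distance of genes to their nearest medoid. The expression profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_cost(profiles, medoids, dist):
    """Sum of distances from every gene to its nearest medoid."""
    return sum(min(dist(p, profiles[m]) for m in medoids) for p in profiles)


def pam(profiles, k, dist=euclidean):
    """Return (medoid indices, cluster label per gene) after greedy swapping."""
    medoids = list(range(k))  # naive starting medoids (assumption)
    best = total_cost(profiles, medoids, dist)
    improved = True
    while improved:
        improved = False
        for slot in range(k):
            for candidate in range(len(profiles)):
                if candidate in medoids:
                    continue
                trial = medoids[:slot] + [candidate] + medoids[slot + 1:]
                cost = total_cost(profiles, trial, dist)
                if cost < best - 1e-12:  # accept only real improvements
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda c: dist(p, profiles[medoids[c]]))
              for p in profiles]
    return medoids, labels


# Hypothetical mean expression per gene in brain, liver, and muscle.
profiles = [
    [9.0, 1.0, 1.0], [8.5, 1.2, 0.8], [9.2, 0.9, 1.1],  # high in brain
    [1.0, 9.0, 1.0], [1.1, 8.7, 0.9],                   # high in liver
    [1.0, 1.0, 9.0], [0.8, 1.2, 9.3],                   # high in muscle
]

medoids, labels = pam(profiles, k=3)
print(medoids, labels)
```

Because every candidate swap requires recomputing distances for all genes, the cost grows quickly with the number of genes, which is one reason a limit on gene number is useful for this kind of method.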
To use the advanced clustering methods
32. Choose Pattern Navigation from the analysis path.
33. Choose Cluster.
34. Choose a method for clustering and set the options (as described below).
The two options for advanced cluster analysis are PAM (Partitioning Around Medoids) and
CLARA (Clustering Large Applications). The difference between the two methods
is that PAM will try to group the samples into the number of clusters that you assign, while
CLARA will try to find the optimum number of groups. PAM is recommended for data sets
smaller than 3500 genes, while CLARA is more suited to larger data sets. PAM is also
more robust; it tries all possible combinations of genes for k and picks the best clusters.
CLARA does a sampling (100) and picks the best from that sample.
a. Clusters: The number chosen here determines the number of gene groups. Often
people try different values to see which gives the best results.
b. Row Center: The values in this set (Row Mean, None, or Control) are used to
determine the center of each row.
c. Distance: The Distance choices are Euclidean, which corresponds to a straight
line distance, Manhattan, which is a sum of linear distances, and Correlation.
As a starting point for this example, choose PAM with 4 clusters based on our Gene cluster
pattern, a Euclidean distance, and the Row Center at the Row Mean.
35. Click the Search button to begin.
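The Row Center and Distance options in step 34 can be made concrete with a short Python sketch. The function names are hypothetical, and two points are assumptions rather than documented GSAE behavior: that Control centering subtracts the mean of the control samples, and that the Correlation distance is one minus the Pearson correlation coefficient.

```python
import math


def center_row(row, mode="mean", control_indices=None):
    """Center one gene's values: 'mean', 'control' (mean of control samples), or 'none'."""
    if mode == "none":
        return list(row)
    if mode == "mean":
        center = sum(row) / len(row)
    elif mode == "control":
        if not control_indices:
            raise ValueError("control centering needs at least one control index")
        center = sum(row[i] for i in control_indices) / len(control_indices)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [value - center for value in row]


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation_distance(a, b):
    """One minus the Pearson correlation (assumed definition)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / math.sqrt(var_a * var_b)


# Two hypothetical genes with the same pattern but different magnitudes.
gene_a = [2.0, 4.0, 9.0]
gene_b = [4.0, 8.0, 18.0]

print(center_row(gene_a))
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b))
print(correlation_distance(gene_a, gene_b))
```

The two example genes are far apart by the Euclidean and Manhattan measures but essentially identical by the correlation measure, which shows how the distance choice determines whether genes are grouped by magnitude or by the shape of their expression pattern.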
Silhouettes
When the clustering process is complete, a page appears with multiple graphs, one
for each cluster group. At the top of the page and under each graph are values called
“silhouettes.” Silhouette widths are scores that indicate how well the expression of the
genes within a cluster matches that graph. Values between 0.26 and 0.50 indicate a weak
structure, between 0.50 and 0.70 a reasonable structure, and above 0.70 a strong structure.
The mean silhouette value for all the silhouettes appears at the top of the page, with the
individual values appearing below each graph along with the number of genes that show
that pattern (Kaufman and Rousseeuw, 1990).
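A silhouette width can be computed by hand for a small example. The sketch below uses the standard definition from Kaufman and Rousseeuw (1990): for each object, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the width is (b - a) / max(a, b). The data, the function names, and the convention of scoring a single-member cluster as 0 are illustrative choices; the label for values of 0.25 or below follows Kaufman and Rousseeuw rather than this protocol.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_widths(points, labels, dist=euclidean):
    """Return one silhouette width per point (requires at least two clusters)."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # single-member cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return widths


def interpret(mean_width):
    """Map a mean silhouette width to the structure categories."""
    if mean_width > 0.70:
        return "strong"
    if mean_width > 0.50:
        return "reasonable"
    if mean_width > 0.25:
        return "weak"
    return "no substantial structure"


# Hypothetical one-dimensional data forming two well-separated clusters.
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]

widths = silhouette_widths(points, labels)
mean_width = sum(widths) / len(widths)
print(widths, mean_width, interpret(mean_width))
```

Values near 1 mean that an object is much closer to its own cluster than to any other, while values near 0 mean that it lies between clusters.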
The graphs showing the average expression pattern within each cluster and the silhouette
values for our clusters are shown in Figure 7.14.15. When a graph in GSAE is clicked, the
heat map containing the genes represented by the graph will appear. The first graph shows
a pattern that seems a bit different from the results we might expect. Instead of showing
the brain samples with a higher level of expression and liver and muscle lower, our first
20 liver and muscle samples appear instead to be up-regulated. This result is puzzling until
we look more closely at the results and see that the first silhouette contains 1920 genes,
and that the variations in expression levels are small. It is likely that looking at more
genes would show us that they do follow the pattern of expression seen in the graph.
The other three graphs, with 397, 717, and 277 genes, respectively, match the results that
we see in their respective heat maps. These groups also make biological sense. If we look
at the genes and read about their function in the ontology and KEGG reports, we can see
that, as expected, brain genes are expressed in brain, liver genes in liver, muscle genes in
muscle, and some genes in two or more of the tissues examined.
It should be noted that clustering is not a definitive analytical tool. Clustering is used to
try to group genes by the expression patterns that we see, and we will often try multiple
values for k and different ways of making the clusters. Although the silhouette scores
are helpful for evaluating the strength of the group, ultimately, we want to see if the
cluster makes biological sense, with genes in a common pathway showing a pattern of
coordinate control.
Figure 7.14.15 Partitioning and silhouette data from a Next Gen experiment.
LITERATURE CITED
Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A.,
Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., and Edgar, R. 2009.
NCBI GEO: Archive for high-throughput functional genomic data. Nucleic Acids Res. 37:D885-D890.
Kaufman, L. and Rousseeuw, P. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley
Series in Probability and Statistics. John Wiley & Sons, Inc., New York.
Kozul, C.D., Nomikos, A.P., Hampton, T.H., Warnke, L.A., Gosse, J.A., Davey, J.C., Thorpe, J.E., Jackson,
B.P., Ihnat, M.A., and Hamilton, J.W. 2008. Laboratory diet profoundly alters gene expression and
confounds genomic analysis in mouse liver and lung. Chem. Biol. Interact. 173:129-140.
Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics E-pub May 18.
Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. 2008. RNA-seq: An assessment of
technical reproducibility and comparison with gene expression arrays. Genome Res. 18:1509-1517.
Millenaar, F.F., Okyere, J., May, S.T., van Zanten, M., Voesenek, L.A., and Peeters, A.J. 2006. How to
decide? Different methods of calculating gene expression from short oligonucleotide array data will give
different results. BMC Bioinformatics 7:137.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. 2008. Mapping and quantifying
mammalian transcriptomes by RNA-Seq. Nat. Methods 5:621-628.
Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B.B.,
Siddiqui, A., Lao, K., and Surani, M.A. 2009. mRNA-Seq whole-transcriptome analysis of a single cell.
Nat. Methods 6:377-382.
Wang, Z., Gerstein, M., and Snyder, M. 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev.
Genet. 10:57-63.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., Dicuccio,
M., Edgar, R., Federhen, S., Feolo, M., Geer, L.Y., Helmberg, W., Kapustin, Y., Khovayko, O., Landsman,
D., Lipman, D.J., Madden, T.L., Maglott, D.R., Miller, V., Ostell, J., Pruitt, K.D., Schuler, G.D.,
Shumway, M., Sequeira, E., Sherry, S.T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R.L.,
Tatusova, T.A., Wagner, L., and Yaschenko, E. 2008. Database resources of the National Center for
Biotechnology Information. Nucleic Acids Res. 36:D13-D21.
INTERNET RESOURCES
http://www.geospiza.com/Support/datacenter.shtml
The microarray data center at Geospiza, Inc. A diverse set of microarray data sets and tutorials on using
GSAE are available from this page.
http://www.ncbi.nlm.nih.gov/geo/
The NCBI GEO (Gene Expression Omnibus) database. GEO is a convenient place to find both microarray
and Next Gen transcriptome datasets.
http://www.ebi.ac.uk/microarray/
The ArrayExpress database from the European Bioinformatics Institute. Both microarray and Next Gen
transcriptome data can be obtained here.
http://www.ncbi.nlm.nih.gov/sra/
The NCBI SRA (Short Read Archive) database. Some Next Gen transcriptome data can be obtained here.