GENETICAL GENOMICS APPLIED TO HAEMOPOIETIC
DEVELOPMENT USING MOUSE MODELS
BIOINFORMATICS RESEARCH INTO ILLUMINA MICROARRAY DATA
Name: Danny Arends
Student number: S1276891
Email: D.Arends@student.rug.nl
Project: Bachelor Thesis: Life science & Technology -GPB2007
Date:
Table of Contents
Genetical genomics applied to haemopoietic development using mouse models
Content
Acknowledgment
Abstract
Dutch abstract
Introduction
Materials & Method
    Materials
    Method
Results
Discussion
Future perspectives
Literature
Web resources
Acknowledgment
During the writing of this thesis I received a lot of support from a group of wonderful people who helped and guided me. Without them this work would not have been possible. I learned a lot during the last few months in this group, and I feel challenged and stimulated to continue in this field and to contribute to bioinformatics as a developing field of research.
I would like to thank the following people from the Groningen Bioinformatics Centre:
Ritsert Jansen - For welcoming me into the Bioinformatics group, and giving me the opportunity to do this bachelor thesis.
Yang Li - For supervising me during this project. Also for the open door, and the help on writing this thesis.
Bruno Tesson - For your help in finding errors, pointing out the obvious and useful input.
Morris Swertz - For a listening ear, feedback and enjoyable lunch breaks.
Gonzalo Vera - For help on building R-packages and many laughs in the sun.
I also thank all the other staff and students at the GBIC department who enabled me to do this work, helped me with my questions, or asked questions that made me think about answers and possible problems.
Also I would like to extend my thanks to the Department of Cell Biology, Stem Cell Biology:
Gerald de Haan - For trusting me with the Illumina datasets, and insight into the experiment.
Leonid v. Bystrykh - For the useful feedback and biological focus, and questions.
Abstract
Illumina microarrays are a novel technology for whole-genome analysis, and they are manufactured in a completely new fashion. Oligonucleotide probes are attached to glass beads, pooled together and then distributed across an etched surface, where the beads settle randomly into the etched wells. As a result of this manufacturing process every bead array is unique. Because no two arrays are the same, decoding is needed to identify the bead located at each position. This decoding is done by sequential hybridization. Because Illumina arrays differ fundamentally from conventional (Affymetrix or spotted oligo) microarrays, bioinformatics tools developed for those platforms are not directly suited to Illumina arrays. In this thesis a method is proposed and implemented that automatically generates data quality control plots. The method also provides pre-processing support and generates the files used in the follow-up analysis of the data.
To develop this method, Illumina array data for 62 samples (Mus musculus) were provided by the department of stem cell biology at the UMCG. These 62 samples were drawn from 4 different cell types (stem cell, progenitor, erythrocyte and granulocyte) so that differential expression of genes could be observed. The biological question behind this research, “How do haemopoietic cells differentiate into other cell types?”, is studied by many groups all over the world. The new approach taken here to uncover regulatory elements of differentiation is to sample Recombinant Inbred Lines (RILs). These inbred mice have the special property that their genome is a mosaic of the genomes of the parental strains. The animals were genotyped at specific genome locations using molecular markers. Because there are only 2 sources of genetic diversity (the parents), a genetic map can be made for each RIL. On these RIL populations genetical genomics can be performed. Genetical genomics is the mapping of a trait (in our case gene expression) onto the most likely molecular marker, which yields an expression QTL (eQTL). This enables researchers to find regulators of gene expression and genetic hotspots governing differentiation. This kind of analysis, together with differential expression analysis, is also provided by the proposed method.
The method has been implemented in the R programming language to give novice users easy access, and at the same time to give more experienced users a tool that helps them analyze Illumina data in less time and with greater accuracy.
Dutch abstract (translated into English)
Illumina microarrays offer a new way of doing whole-genome analysis. Oligonucleotide probes are attached to glass beads, which are then mixed together and spread over a carrier surface that contains very small wells made by etching techniques. The beads from the bead pool fall into the wells and distribute themselves randomly over this surface. Producing microarrays in this way means that every microarray is unique. To solve the problem that no two microarrays are the same, a decoding step has to be applied to establish the identity of each bead. Decoding is done by repeatedly hybridizing the microarray with dye-labeled oligonucleotide decoders. Because Illumina arrays are completely different from ordinary (e.g. Affymetrix or printed oligonucleotide) arrays, bioinformatics tools made for those technologies are not (directly) usable for the analysis of Illumina arrays. In this bachelor thesis a way is proposed (and provided) to automatically perform several data processing steps on Illumina microarrays. These steps comprise data quality control plots, pre-processing of the data, and output to the different programs used in further analysis of the data.
To develop this method, 62 samples (Mus musculus) were analyzed using Illumina sentrix microarrays. The samples were made available by the department of stem cell biology at the UMCG. These 62 samples came from 4 different cell types (stem cell, progenitor, erythrocytes and granulocytes) in order to observe differential expression of genes. The biological question behind this research, “How do haemopoietic cells differentiate into other cell types?”, is a research question on which groups from all over the world are working. The novelty of this research in finding genetic regulators is the use of recombinant inbred lines (RILs). These inbred mouse strains have the unique property that they are genetically stable and a mosaic of the genomes of their parents. The genetic mosaic of these mouse strains was mapped using molecular markers. Because only two parents contribute to the genetic diversity, this map is extremely well suited to genetical genomics. Genetical genomics is the field in which a trait (in our case expression) is associated with this mosaic of molecular markers. This enables researchers to find regulators of gene expression and/or genetic locations that have a regulating function in development and differentiation. This kind of analysis method (together with differential expression) is also available in the proposed method.
This method has been implemented in the programming language R so that it is easily accessible to the ‘biological’ user, but also so that experts have a tool for doing Illumina microarray data analysis in less time and with more accuracy.
Introduction
Expression profiling provides researchers with a valuable tool for understanding the life sciences. With the complete sequence of the Mus musculus genome available, and variation in that genome still under active investigation, researchers can find and map individual genetic variation. This information can be used to associate genetic differences with diseases, or to unravel the genetics underlying the development of the diverse tissues seen in eukaryotes. This field of research is currently called genetical genomics[1-3]. Such complex traits are mapped using a specific kind of animal model, recombinant inbred lines[4] (RILs). RILs are homozygous inbred mouse strains that are genetically stable and whose genomes are a mosaic of the parental genomes. This mosaic can be unraveled by using molecular markers to determine the parental origin of each marker location. When a quantitative trait is compared with these marker maps, a correlation between a map location and the trait can indicate that the location is ‘responsible’ for the trait. This is called QTL mapping, and it gives researchers a valuable tool for unraveling the genetic variation responsible for numerous traits seen in nature. Now that improvements in microarray technology make it possible to measure the expression of thousands of genes at the same time, another kind of trait mapping can take place: expression QTL (eQTL) mapping. Mapping gene expression data onto marker data from a RIL can then be used to draw biologically important conclusions[2]. Because the quality of the data influences the results of statistical analysis, pre-processing and data handling should have the highest priority for every researcher in the field of theoretical biology.
To uncover the genetics behind the developmental stages of haemopoietic stem cells in Mus musculus, the department of stem cell biology at the UMCG in Groningen set up a genetical genomics experiment. Four different types of cells were extracted from different RILs. These cells represent four critical developmental stages in haemopoietic differentiation (see Fig 1): a stem cell differentiates into a progenitor cell, which then differentiates into either a granulocyte or an erythrocyte. Cells were extracted from mice of the BXD recombinant inbred lines. The BXD lines were derived by crossing C57BL/6J (B6) and DBA/2J (D2) and then inbreeding the progeny for many years. Around 100 different BXD RIL strains have been made from the late 1970s until now. One of the advantages of the BXD lines is that both parental strains have been sequenced. Cells extracted from several of these RIL strains were separated using a cell sorter based on expressed surface proteins. Next, mRNA was extracted from these cells and prepared for analysis on Illumina[5] microarrays. After preparation and hybridization the arrays were scanned, and the images were analyzed to obtain raw intensity values (bead level). The datasets from this experiment are the focus of this thesis.
Fig 1. Differentiation model.
From stem cell to either erythroid or granulocyte.
The goal of this experiment was to find regulatory switches that govern differentiation into either granulocyte or erythrocyte. Such a switch can be a master gene controlling differentiation (e.g. HOX genes), but it could also be a complex trait governed by many genes and their interactions. For a complex trait, genetical genomics research will focus not only on the genes involved in differentiation but also on the pathways in which they participate. Data quality control will be done with standard bioinformatics tools, adapted to suit the needs of Illumina microarrays. The results from this analysis will be the basis for further research into the development of a protocol for processing Illumina array data. A genetical genomics approach will be used to find regulatory switches governing differentiation.
Other research in this field has been done by the UMCG at the level of stem cells using Affymetrix technology[1, 6]. Groups around the world are working on haemopoietic stem cells, studying not only their differentiation into other cell types but also their ability to proliferate without exhaustion[7]. Studying differentiation by combining all four cell types with genetical genomics is a novel approach. The genetical genomics approach has been used successfully before on other cell types[2] to find regulatory locations. The decoding of bead arrays and the subsequent data processing are still an active field of research[8-12], because Illumina is a new technology of which not every detail is fully known and understood. The design of microarray experiments for this kind of research is also actively studied[13], because much analytical power can be gained from better experimental design.
Materials & Method
Materials
AriseSoft Winsyntax 2.0 - Freeware text editor
RGui 2.5.0 - R programming language interpreter
Illumina sentrix beadarray, 13 pieces
Illumina summary data on 62 samples:
    20 Stem cell (BXD SCA+ KIT+)
    14 Progenitor (BXD SCA- KIT+)
    14 Erythrocyte (BXD TER119+)
    14 Granulocyte (BXD GR1+)
Annotation file (provided by G. de Haan)
Array/Sample description files
The following analysis tools / R-packages were used during analysis of the data:
Illumina GUI (Graphic user interface for creating Bead Summary Files)
Bioconductor affy (Standard bioinformatics toolkit)
GO (Gene ontology annotation)
goCluster (Gene ontology clustering)
gplots (Advanced plotting options for R)
IlluminaMousev1p1 (Annotation provided by Illumina as an R-package)
matrix2png (Used for some graphics)
Cytoscape (Visualization of networks)
BINGO (Plug-in for Cytoscape: goAnnotation)
BioNetBuilder (Plug-in for Cytoscape: gene interaction networks)
Method
Different techniques for manufacturing a microarray exist. A new and emerging technology is the sentrix beadchip[14] developed by Illumina. One of the main advantages of Illumina sentrix beadarrays is their level of miniaturization, which enables researchers to monitor more genes/probes. Manufacturing of Illumina sentrix arrays starts with the synthesis of long oligonucleotides. These oligonucleotides (probes) are 50 bases long and serve as targets for cDNA synthesized from mRNA extracted from the biological samples. A 13-base ‘barcode’ sequence is also attached to each oligonucleotide. The oligonucleotides are then attached to glass beads[15]. With this technique each bead is loaded with hundreds of thousands of covalently bound oligonucleotide probes, of which ~90 % are available for hybridization[16]. The beads with the probes attached are pooled together in bead pools and then distributed across an etched microwell substrate. This is a carrier surface with several thousand (up to millions of) etched wells, made using the latest technologies from different industries (fiber optics, computer industry and silicon manufacturing). The beads assemble themselves randomly in the wells, creating a unique microarray.
Because it is not known which probe is at a certain coordinate [x,y] on the microarray, decoding has to take place. This is done using dye-labeled oligonucleotide decoders. Sequential hybridization of these decoders with the probes on the microarray reveals the ‘bead type’. When the decoder DNA binds to the barcode on the Illumina probe it gives a signal. Illumina decoding uses 2 color signals (Cy3 and Cy5) and a not-hybridized signal. By using multiple rounds of hybridization, scanning and de-hybridization, a signal library can be made for each probe. The order of ON and OFF signals (and the colors of the ON signals) determines which type of bead is positioned at a certain coordinate [x,y]. A more detailed introduction to decoding randomly ordered beads is given by Gunderson et al. 2004[12], where possible decoding problems and solutions are also discussed. After decoding, a file with the location and identity of all the probes on that array is delivered with the array.
The array can now be used for hybridization with the sample of interest. After hybridization, the array is scanned and the image obtained is decoded into data files that contain spot intensity data, spot probe type (annotated from the data file obtained during decoding) and detection probability values. On these files statistical analysis can be done to uncover genetic regulatory networks.
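The decoding idea described above can be illustrated with a minimal Python sketch. It is not Illumina's decoding software: the codebook, the bead type names and the choice of three rounds are hypothetical, and serve only to show how a sequence of OFF/Cy3/Cy5 signals can identify a bead type.

```python
from itertools import product

# Three possible observations per decoding round (OFF = no hybridization signal).
STATES = ("OFF", "Cy3", "Cy5")


def build_codebook(bead_types, rounds):
    """Assign every bead type a unique sequence of signals (hypothetical scheme)."""
    codes = list(product(STATES, repeat=rounds))
    if len(bead_types) > len(codes):
        raise ValueError("not enough rounds to give each bead type a unique code")
    return dict(zip(codes, bead_types))


def decode(observed, codebook):
    """Return the bead type whose code matches the observed signals, or None."""
    return codebook.get(tuple(observed))


bead_types = [f"probe_{i}" for i in range(20)]   # made-up bead type names
codebook = build_codebook(bead_types, rounds=3)  # 3 states ** 3 rounds = 27 codes

# Signals read at one [x, y] position over three hybridization rounds.
print(decode(["OFF", "Cy5", "Cy3"], codebook))
```

With three possible signals per round, k rounds can distinguish at most 3^k bead types, which is why several rounds of hybridization are needed.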
STATISTICAL ANALYSIS: DIFFERENTIALLY EXPRESSED GENES
From the normalized dataset a list of differentially expressed probes was made by applying a one-way analysis of variance (ANOVA) to the Illumina intensity data. The outcome $y_i$ in this model is the log2 transformed intensity, and the model has four levels (each cell type is an ANOVA level). The ANOVA model for detecting differential expression:
$$y_i = \mu + C_i + \varepsilon_i$$
where $\mu$ is the mean, $C_i$ is the cell type effect and $\varepsilon_i$ is the residual error.
The test statistic resulting from the ANOVA model follows an F-distribution, so differences between group means can be tested. The ANOVA analysis shows which probes are expressed significantly differently between the groups (Stem, Progenitor, Erythrocyte or Granulocyte). The assumptions under which this analysis is valid are independent cases, normally distributed data and approximately equal variance in the groups. Because of the preprocessing steps (which include normalization of the data) the variance is about equal for all groups. The probes suspected of being differentially expressed were selected, and a T-test was used to determine their specificity for a certain group. So for each probe found in the ANOVA analysis four T-tests are performed to find group specificity. The T-test used:
$$T = \frac{\bar{X} - \bar{Y}}{S \sqrt{\frac{1}{n} + \frac{1}{m}}}$$
In this model $T$ is the calculated test statistic, $\bar{X}$ and $\bar{Y}$ are the means of the groups, and $n$ and $m$ are the numbers of observations in those groups. $S$ is the weighted (pooled) standard deviation; the data are assumed to be normally distributed with approximately equal variance in both groups.
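The two statistics described above can be computed by hand, as the following Python sketch shows for a single probe. It is an illustration only, not the code of the R package: the intensities are made up, and comparing one cell type against all remaining samples is an assumption, since the text does not state what each of the four T-tests compares.

```python
from math import sqrt
from statistics import mean


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation explained by group membership (between groups).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation around each group mean (within groups).
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def pooled_t(x, y):
    """Two-sample t statistic with a pooled (weighted) standard deviation."""
    n, m = len(x), len(y)
    ss_x = sum((v - mean(x)) ** 2 for v in x)
    ss_y = sum((v - mean(y)) ** 2 for v in y)
    s = sqrt((ss_x + ss_y) / (n + m - 2))
    return (mean(x) - mean(y)) / (s * sqrt(1 / n + 1 / m))


# Made-up log2 intensities of one probe in the four cell types.
probe = {
    "stem": [8.1, 8.3, 8.2],
    "progenitor": [8.2, 8.4, 8.0],
    "erythrocyte": [11.0, 11.3, 11.1],
    "granulocyte": [8.0, 8.2, 8.3],
}
print("F =", one_way_anova_f(list(probe.values())))
# Assumed comparison: one cell type against all remaining samples.
others = [v for name, g in probe.items() if name != "erythrocyte" for v in g]
print("T =", pooled_t(probe["erythrocyte"], others))
```

Turning these statistics into p-values requires the F and t distributions, which the Python standard library does not provide; a statistics package would be used for that step.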
STATISTICAL ANALYSIS: EXPRESSION QTL MAPPING
Genetical genomics is the study of gene expression combined with the mapping of this expression onto positions on the chromosomes. The basic model of genetical genomics is simple: P = G + E. A trait (P) is composed of 2 parts, a genetic part (G) and an environmental part (E). The genetic part is the part of interest to researchers. Genetic markers were used to genotype the RILs; these markers divide the chromosomes into intervals, and QTLs can be associated with them. This is done by fitting a linear model to the data using ANOVA. The ANOVA model used for mapping eQTLs:
$$y_i = \mu + Q_i + \varepsilon_i$$
In this model $y_i$ is again the log2 transformed value of the gene expression, $\mu$ is the mean, $Q_i$ is the genotype effect (-1, 0, +1) and $\varepsilon_i$ is the residual error. This model is used to find expression QTLs from the genetic map and the expression levels. The genetic sequence of the BXD strain was obtained from GeneNetwork.
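A marker-by-marker scan with this model can be sketched in Python as follows. This is a simplified illustration, not the implementation in the R package: the marker names, genotypes and expression values are made up, genotypes are coded -1 and +1 for the two parental alleles, and the F statistic of a simple linear regression stands in for whatever test statistic the package reports.

```python
from statistics import mean


def marker_f_statistic(expression, genotypes):
    """F statistic for the linear model y = mu + b * genotype + error."""
    n = len(expression)
    g_mean, y_mean = mean(genotypes), mean(expression)
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    ss_total = sum((y - y_mean) ** 2 for y in expression)
    ss_model = slope * sxy          # variation explained by the genotype
    ss_error = ss_total - ss_model  # residual variation
    return ss_model / (ss_error / (n - 2))


def scan_markers(expression, genotype_map):
    """Return (marker, F) pairs sorted from strongest to weakest association."""
    scores = {m: marker_f_statistic(expression, g) for m, g in genotype_map.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up log2 expression of one probe in six RIL strains.
expression = [7.9, 8.1, 8.0, 10.1, 9.9, 10.0]
# Made-up genotypes per marker: -1 = B6 allele, +1 = D2 allele.
genotype_map = {
    "marker_A": [-1, -1, -1, 1, 1, 1],
    "marker_B": [-1, 1, -1, 1, -1, 1],
    "marker_C": [1, 1, -1, -1, 1, -1],
}
for marker, f_value in scan_markers(expression, genotype_map):
    print(marker, round(f_value, 2))
```

The marker with the highest statistic is the most likely location of an eQTL for this probe; repeating the scan for every probe gives the full eQTL map.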
CREATING A R-PACKAGE
The R programming language is a valuable tool for the statistical analysis of large data sets. When publishing a method that pre-processes Illumina sentrix array data files and automatically generates quality control files and input files for goCluster, BINGO and BioNetBuilder, the usability of the method depends on how easily other researchers can access it. Because R is widely used to process and analyze biological data, an R package containing the developed functions and the pipeline was created. This package can easily be downloaded and installed into the R programming environment. After installation the pipeline can be executed (with default settings) by a simple text command. An R package enables novice users to do the pre-processing themselves, while more experienced users can fully utilize the package and customize it to their own needs. Packaging has further advantages. The need to document code and functions is a major one, because it helps end users understand what the functions do and how they should be used. It also forces the programmer to think through every step made when creating the package.
Results
An R package was made that makes the methods used and created available to other researchers.
DATA PREPARATION
Files were made from the summarized genetic profiles (SGP) data file provided by Illumina. This SGP file contained intensities, p-values and average numbers of beads for all samples used in the experiment. Each category (intensity, p-value and number of beads) was used to build a new file named after the category. These files are smaller than the original dataset and are easier to manipulate in terms of processing power and memory usage. The created files are saved as CSV files for further use with R. Probe annotation was provided by the Department of stem cell biology at the UMCG. The annotation files contained Illumina probe IDs as well as the corresponding Affymetrix probe IDs, so the modified probe annotation file also served as an Illumina to Affymetrix probe ID conversion. Annotation of the probes could thus be done by merging two files, the annotation file and the intensity file.
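The splitting and merging steps can be sketched in Python as follows. The sketch is not the R package code: the column names (`Intensity.S1` and so on), the probe IDs and the gene names are hypothetical, because the layout of the SGP file is not given in this text.

```python
import csv
import io

# Hypothetical miniature of a combined summary file: one row per probe, and for
# every sample one column per category (the column names are assumptions).
summary_csv = """ProbeID,Intensity.S1,pValue.S1,nBeads.S1,Intensity.S2,pValue.S2,nBeads.S2
p1,120.5,0.99,31,130.1,0.98,28
p2,15.2,0.40,25,14.8,0.35,30
"""

annotation = {"p1": "GeneA", "p2": "GeneB"}  # made-up probe annotation


def split_by_category(text, categories=("Intensity", "pValue", "nBeads")):
    """Return one table (list of rows) per category, keyed by category name."""
    rows = list(csv.DictReader(io.StringIO(text)))
    tables = {}
    for category in categories:
        prefix = category + "."
        columns = [c for c in rows[0] if c.startswith(prefix)]
        header = ["ProbeID"] + [c[len(prefix):] for c in columns]
        body = [[r["ProbeID"]] + [r[c] for c in columns] for r in rows]
        tables[category] = [header] + body
    return tables


def annotate(table, annotation):
    """Merge annotation into a table by probe ID, adding a Gene column."""
    header, body = table[0], table[1:]
    merged = [header + ["Gene"]]
    merged += [row + [annotation.get(row[0], "NA")] for row in body]
    return merged


tables = split_by_category(summary_csv)
for row in annotate(tables["Intensity"], annotation):
    print(",".join(row))
```

Each per-category table holds only a third of the columns of the combined file, which is what makes the smaller files cheaper to load and process.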
QUALITY CONTROL: HISTOGRAMS AND BOXPLOTS
Analysis started with the generation of histograms to estimate the threshold for signal-to-noise ratios. Histograms also give information about the quality of hybridization of the arrays. The summarized histograms (Fig 2a) before and after normalization show the approximately normal distribution expected for log2 transformed microarray data. To detect faulty arrays (in terms of over- or under-hybridization), data distributions were plotted for each array, grouped by cell type (Fig 2b). The data distributions of the different arrays showed no clear difference between samples or cell types. All these histogram plotting options have been added to the created R package.
Boxplots were made of the log2 transformed intensities. The transformation reduces the range over which the intensities are spread and gives a better picture of the state of the data. Boxplots were made for the raw data as well as the normalized data. Boxplots for each sentrix array can reveal a batch effect, which can be detected by eye. Boxplots were also made of the biological replicates to see how much variation these contained before normalization. The boxplots of the raw data (Fig 3) show only a difference in distribution; the means of the intensity values seem to be fairly equal across sentrix arrays. After quantile normalization the boxplots show less variation between arrays.
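The numbers behind such a box plot can be computed directly, as in the Python sketch below. It only illustrates the log2 transformation and the quartiles that define each box; the intensities are made up, and the quartile rule used here (the "inclusive" method of Python's `statistics.quantiles`) is an assumption that may differ slightly from the hinges drawn by R.

```python
from math import log2
from statistics import quantiles


def boxplot_summary(intensities):
    """Quartiles of log2 transformed intensities (the box of a box plot)."""
    logged = [log2(v) for v in intensities]
    q1, median, q3 = quantiles(logged, n=4, method="inclusive")
    return {"min": min(logged), "q1": q1, "median": median,
            "q3": q3, "max": max(logged)}


# Made-up raw intensities for two arrays; real arrays hold thousands of probes.
arrays = {
    "array_1": [64, 128, 256, 512, 1024],
    "array_2": [128, 256, 512, 1024, 2048],
}
for name, values in arrays.items():
    print(name, boxplot_summary(values))
```

Comparing these summaries between arrays is the numerical counterpart of inspecting the box plots by eye: a shifted median or a wider box points to an array that behaves differently.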
Fig 2a. Histograms
Summarized over all samples: left, log2 transformed before normalization; right, after normalization. No over- or under-hybridization is seen and the data look like a normal distribution (with more high values than low ones).
Fig 2b. Histograms
Bar graphs of all samples grouped by cell type. The x-axis of these 4 plots ranges from 0 to 20 (step size 2). No over-hybridization is seen in any of the arrays, and the distributions are what would be expected of microarray data. The bar graphs show no clear outlying arrays, so all samples were used in the follow-up analysis.
Fig 3: Box plots
Raw data and normalized data for the biological replicates. Numbers 1 and 2: BXD28 1 SKA-Kit+. Numbers 3 and 4: BXD28 2 SKA-Kit+. Numbers 5 and 6: BXD33 2 SKA-Kit+. Between-array variation of the samples is less after quantile normalization.
DATA PROCESSING: NORMALIZATION
Numerous normalization methods are available, each with its own merits and drawbacks. For Affymetrix arrays normalization is usually done with the RMA method. This method uses the perfect match probe intensities (it ignores the mismatch probes) to background-correct, normalize and summarize the data. The data received from Illumina had already been corrected for background, so the normalization method of choice was to do only a quantile normalization. This normalization method is implemented in the normalize.quantiles function from the Bioconductor affy[17] package. It was chosen because it best suited the data distributions seen when boxplotting the data (Fig 3: Raw Data). As stated earlier, many normalization schemes are available, and there is ongoing debate about the different normalization methods[18-21] and which to use in which cases. This thesis does not discuss normalization methods in detail, as that is beyond its scope.
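The principle of quantile normalization can be shown in a few lines of Python. This is a minimal sketch and not the normalize.quantiles implementation used in the thesis: the values are made up, and ties are simply ranked in order of appearance.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    `arrays` is a list of equally long lists, one list of intensities per array.
    Ties are ranked in order of appearance, a simplification.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [mean(a[rank] for a in sorted_arrays) for rank in range(n_probes)]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        result = [0.0] * n_probes
        for rank, index in enumerate(order):
            result[index] = reference[rank]
        normalized.append(result)
    return normalized


# Made-up log2 intensities of three probes on two arrays.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
for array in quantile_normalize(raw):
    print(array)
```

After this step every array contains exactly the same set of values, only assigned to different probes, which is why the box plots of the normalized arrays look alike.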
DIFFERENTIALLY EXPRESSED GENES
After normalization, differentially expressed probes were identified as described in the method section of this thesis. This function has also been implemented in the created R package. Because all probes show some variation, a threshold was set by choosing an arbitrary p-value cut-off for calling a probe differentially expressed. A summary of thresholds versus hits is shown in Table 1.
Cell type     1.0E-02   1.0E-08   1.0E-10   1.0E-14
Stemcells       15176      3340      2088       815
Progenitor      10416      2234      1383       501
Erythrocyte     18523      5769      4200      2128
Granulocyte     15028      5030      3460      1655
Table 1. Threshold and hits
This table contains the number of differentially expressed genes found at a certain threshold (probability value). A threshold of 10^-14 was chosen for follow-up analysis.
With a p-value threshold of 10^-14 there are enough hits to do the follow-up analysis, but not so many that it becomes computationally infeasible. Lists of differentially expressed genes were therefore constructed using a p-value threshold of 10^-14.
HEATMAP ANALYSIS
These differentially expressed probes were then plotted as heat maps using the heatmap.2 function from the gplots package. With a heat map of the selected differentially expressed probes it is possible to check by eye whether the intensities of the selected probes differ between groups. This indicates whether probes that are suspected to be differentially expressed really are differentially expressed. Clustering analysis can also be done on heat maps. It uses distance (differences between the intensities) to generate a dendrogram, which can then be inspected by eye to see whether clustering on the differentially expressed genes recovers the four cell types used in the experiment. The expectation is that clustering on the differentially expressed genes groups the samples by cell type. Heat map clustering analysis shows that clustering on the differentially expressed genes does recover the cell types (1 = Stem cell, 2 = Progenitor, 3 = Erythrocyte & 4 = Granulocyte). To confirm these results, clustering on all genes was also performed. This too resulted in the groups defined by the four cell types.
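The distance-based clustering behind the dendrogram can be sketched in Python as follows. This is an illustration and not the heatmap.2 code: the sample names and intensities are made up so that stem cells and progenitors lie closest together, and average linkage is a choice made here for brevity, since the text does not state which linkage was used.

```python
from itertools import combinations
from math import dist


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean Euclidean distance between all pairs of samples from two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(dist(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def cluster_samples(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], profiles),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Made-up log2 intensities of three probes in eight samples.
profiles = {
    "stem_1": [12.0, 8.0, 8.0], "stem_2": [11.8, 8.2, 8.1],
    "prog_1": [10.0, 10.0, 8.0], "prog_2": [10.2, 9.8, 8.1],
    "ery_1": [8.0, 8.0, 13.0], "ery_2": [8.1, 8.2, 12.8],
    "gran_1": [8.0, 13.0, 8.0], "gran_2": [8.2, 12.9, 8.1],
}
for cluster in cluster_samples(profiles, n_clusters=4):
    print(cluster)
```

Stopping at four clusters corresponds to cutting the dendrogram at the level where the four cell types should appear; stopping one merge later shows which two cell types are most alike.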
Fig 4a: Heatmap of stem cell specific genes.
Distance clustering shows 4 distinct groups (corresponding to each cell type). Specificity for cell type 1 (stem cells) can also be seen, because the distance between this cell type and the other three is largest.
Fig 4b: Heatmap of progenitor specific genes.
Distance clustering shows distinct groups (corresponding to each cell type), with 1 exception: stem cells aren’t clustered back into their original group. Specificity of the genes for the progenitor cell type seems to overlap with that of cell type 1 (stem cells). This is observed because of the small(er) distance between cell types 1 and 2.
Fig 4c: Heatmap of erythroid specific genes.
Distance clustering shows the four distinct groups (corresponding to each cell type). Specificity for only the erythroid cell type is also observed.
Fig 4d: Heatmap of granulocyte specific genes.
Distance clustering shows the four distinct groups (corresponding to each cell type). The distance between cell type 4 and the other three cell types indicates specificity for only the granulocyte cell type.
Fig 5: Heat map of all genes.
Distance clustering shows the four distinct groups (corresponding to each cell type) when clustered. The stem cells (cell type 1) seem to consist of 2 groups that are somewhat different from each other, but still the variance within a group is higher than between groups. The progenitor group (cell type 2) has an outlier (left side); this outlier has a relatively large distance to the progenitor cell group. Cell types 1 and 2 seem more related (less distance between them), which is logical because progenitor cells are the descendants of stem cells.
GENE ONTOLOGY[22]: ANNOTATION AND CLUSTERING
Gene Ontology (GO) annotation was retrieved using the GO R package. A function was made to convert the lists of differentially expressed genes into lists of Entrez GenBank[23, 24] identifiers. Each entrezID is translated into a GO object using the function GOENTREZID2GO() from the GO package. The output of this function contains 3 vectors: GOID, the Gene Ontology IDs associated with the entrezID; GO ontology, the parent ontology group; and a GO Evidence vector. Together these 3 vectors describe in which ontologies the gene is found and what evidence led researchers to annotate the gene with that ontology term. Creating input for goCluster[25] is also handled by the created package. goCluster uses the lists of differentially expressed genes, modified to suit its input requirements. After loading the input, goCluster retrieves the matching ontology from the online GO database. The ontologies found are clustered and compared to the known ontology groups. If one or more clusters are overrepresented in the supplied data, they are reported as enriched GO pathways. goCluster needs more information after the input has been submitted: other parameters have to be set (e.g. clustering algorithm, false discovery rate and taxonomy), which gives users many options to tune. It is, for example, possible to choose between four clustering algorithms and six distance measures. The goCluster output shown was made using the HClust clustering algorithm, Euclidean distance and a false discovery rate, set to control the number of false positives, of P = 0.05.
Fig 6a and 6b: Typical output of goCluster
Fig 6a: Molecular function in progenitor cells (left)
Fig 6b: Biological processes in granulocytes (right)
For each cell type 2 plots were made: biological process and molecular function. In each of the 2 plots shown here
3 gene ontology groups were found to be enriched. The ontology codes were then translated into group names
using the Gene Ontology website. The overrepresented ontology groups give insight into which molecular
functions or biological processes could be involved in differentiation. The genes that make up these groups can be
selected for further analysis.
GO clustering is not an exact science: it is based on statistical analysis of ‘associated groups’
that are defined by the Gene Ontology Consortium. Results from goCluster were therefore verified using
BINGO[26], a plug-in for Cytoscape[27]. The input for the BINGO plug-in differs somewhat from
that for goCluster: BINGO does not take the expression levels of the genes into account but finds
enrichment in lists of entrezIDs. The BINGO plug-in enables users of Cytoscape to do GO clustering.
BINGO has fewer clustering options than the goCluster package, but as an advantage it generates a gene
ontology network. This network can then be viewed and analyzed in Cytoscape, enabling
users to visually inspect the overrepresentation of certain GO groups and their relationships.
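Enrichment in a list of identifiers is commonly assessed with a hypergeometric test. The sketch below, in Python, shows that general idea under the assumption that a one-sided hypergeometric test is used; it is not a reproduction of the BINGO implementation, and the function name and parameters are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the GO term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 10 genes on the array, 3 carry the term,
# and all 3 genes in the selected list carry it as well.
p_value = hypergeom_enrichment_p(population=10, annotated=3, sample=3, hits=3)
```

A small p-value means that the term occurs more often in the selected list than expected by chance; because one such test is done per GO term, the resulting p-values still need a multiple testing correction.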
Stem cell
Progenitor
Biological process
Ubiquitin Cycle
cellular macromolecule metabolic process
Molecular Function
None Found
immune system process
unfolded protein binding
monovalent inorganic cation transporter activity
phosphoric monoester hydrolase activity
Table 2. Summary of goCluster
In this table a summary is given from the goCluster analysis. (continued on next page)
Biological process
Granulocyte
Erythrocyte
nucleobase, nucleoside, nucleotide
cellular carbohydrate metabolic process
nucleic acid metabolic process
methylation-dependent chromatin silencing
Macromolecule metabolic process
M phase
cell communication
cell cycle
biogenic amine catabolic process
Autophagy
Molecular Function
nucleic acid binding
cobalt ion transporter activity
RNA binding
cytoskeletal protein binding
signal transducer activity
urea transporter activity
phospholipid-translocating ATPase activity
microtubule motor activity
Table 2 (continued). Summary of goCluster
In this table a summary is given of the goCluster analysis. For each of the 2 main categories in the GO
database the pathways found to be enriched are given. The GO annotation confirms that the cell types
extracted and analyzed are indeed the correct ones. In addition, pathways can be identified that could have an
impact on the development of haemopoietic cells.
In these BINGO-generated networks the node color denotes the significance of enrichment (darker is
more significant). The results from BINGO are shown in Fig 7a, 7b and 7c. Only the parts of interest
are shown here; they are outlined and explained in the caption of each figure. All of
the GO pathways found in the goCluster analysis, except the ubiquitin cycle, were also seen to be
overrepresented in the BINGO analysis. In BINGO the ubiquitin cycle is instead seen to be
overrepresented in progenitor cells, so perhaps the transition from stem cell to progenitor cell is associated with
changes in this ontology group.
Fig 7a: Cytoscape visualization of BINGO generated Stem cell ontology network.
The outlined groups are development and intracellular signaling. These gene ontology groups are
expected to be overrepresented in stem cells. No trace of the ubiquitin cycle can be found in the
BINGO analysis.
Fig 7b: Cytoscape visualization of BINGO generated Granulocyte ontology network.
The outlined groups are lytic vacuole and lysosome. These gene ontology localization groups are
expected to be overrepresented in granulocytes, which are known to have organelles with a very low pH.
There is also high concordance between goCluster and BINGO in the biological functions found to be
enriched (data not shown).
Fig 7c: Cytoscape visualization of BINGO generated Erythroid ontology network.
The outlined groups are heme biosynthesis and protein modification. These gene ontology groups are
expected to be overrepresented in erythroid cells, because of their function as oxygen carriers. There
is also high concordance between goCluster and BINGO in the biological processes found to be enriched
(data not shown).
INTERACTION NETWORKS
Gene interactions are important when further analyzing the data: they give
insight into which genes in the differentially expressed groups are known to be associated with each
other. Known interactions are available from a number of databases, from which
they can be retrieved (see Web resources & Public databases). For building the network
itself the BioNetBuilder[28] plug-in for Cytoscape can be used. This plug-in can search several
databases for known interactions between genes. Before a network could be generated, the differentially
expressed probes were annotated using the provided annotation file. From this file a list of GI (GenInfo
Identifier) numbers was made as input for BioNetBuilder, which uses them to find known
interactions. Different databases provide different information, which leads to different types of interaction that can
all be included in the network, e.g. protein-protein or protein-gene interactions. These
interactions are then visualized by Cytoscape and give insight into what is going on biologically
in the observed clusters of genes. Various manipulations can be applied to the networks, such as
subtracting networks, merging them and finding differences. Fig 8 is a typical output from the
BioNetBuilder plug-in; different line colors denote different types of evidence or source database for an
interaction. The package creates input for the plug-in and thus gives researchers an easy way to
visualize the interactions in selected gene groups. Striking results from the interactions between
differentially expressed genes were a proteasome group in stem cells (perhaps related to the ubiquitin
cycle) and a ubiquinone group in progenitor cells; in both cell types protein kinases can also be seen. In
granulocytes ubiquitin and ubiquinone genes were seen among the differentially expressed genes.
Fig 8: Cytoscape view of known biological interactions
In this picture the interactions between the differentially expressed genes of the stem cell group are shown (only
sub-networks with >2 edges are shown). This kind of interaction plot supplies information about which genes
are known to interact, and new interactions can be searched for. It also gives insight into which kinds of
interactions are present between differentially expressed genes.
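The selection of sub-networks with more than 2 edges can be sketched as follows. This is an illustrative Python sketch, not the Cytoscape or BioNetBuilder implementation; the gene names and the function names are hypothetical. The edge list is split into connected sub-networks, and only those with at least 3 edges are kept.

```python
from collections import defaultdict


def subnetworks(edges):
    """Split an undirected edge list into connected sub-networks.
    Returns a list of (nodes, edges) pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first search collects every node reachable from `start`.
        stack, nodes = [start], set()
        while stack:
            node = stack.pop()
            if node in nodes:
                continue
            nodes.add(node)
            stack.extend(neighbours[node] - nodes)
        seen |= nodes
        component_edges = [(a, b) for a, b in edges if a in nodes]
        components.append((nodes, component_edges))
    return components


def keep_large(edges, min_edges=3):
    """Keep only sub-networks with at least `min_edges` edges (i.e. >2)."""
    return [c for c in subnetworks(edges) if len(c[1]) >= min_edges]


# Hypothetical interactions between differentially expressed genes.
edges = [("GeneA", "GeneB"), ("GeneB", "GeneC"), ("GeneC", "GeneA"),
         ("GeneD", "GeneE")]
large = keep_large(edges)
```

In this example the pair GeneD/GeneE is dropped because it forms a sub-network with a single edge.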
EXPRESSION QTL MAPPING
The expression of each gene was tested for association with molecular markers. These molecular markers
cover the entire genome of the Mus musculus BXD strains. Information about the positions of
markers and their origin was taken from the online GeneNetwork database. When the location
of the eQTL of a gene is plotted against the location of that gene in a 2-dimensional plot, 2 phenomena are seen. The first
is cis-acting genes: genes for which the highest likelihood of regulation falls in the same region as where
the gene is located on the genome. These form the diagonal of the plot, normally the most striking feature
of an eQTL plot. The second is trans-acting genes: genes whose regulation maps to a location elsewhere
on the genome. Trans-acting genes usually fall inside transbands, seen as horizontal (or vertical)
lines in the eQTL plot: a single location on the genome that regulates the expression of many
genes at other locations. The cis-acting genes in these transbands could be biologically important in
the differentiation from stem cell to erythrocyte or granulocyte. Expression QTL plots (Fig 9) were
made; the algorithm for eQTL analysis can be found in the created R-package. The eQTL plots made
for each cell type show transbands that can be analyzed for possible candidate genes
regulating differentiation. Normally the cis-acting genes show up on the diagonal of the plot, and
all cis-acting genes combined give the clearest line in the plot. The eQTL plots for
stem cells, erythrocytes and granulocytes show mostly cis-acting genes and only some faint transbands. For
the progenitor cell type there is an unusually pronounced transband, which could have biological importance,
e.g. a master switch controlling gene expression in that cell type. It could also be an artifact of the
experiment; this transband (and the other transbands) can be analyzed in further research.
Fig 9: Expression QTL plots for all cell types.
For each cell type transbands are seen; these locations on the genome have a statistical
influence on the expression of genes genome-wide. The very clear transband in the progenitor
cell type is still under active investigation, not because of biological interest but because it is
more visible than the cis-acting genes. Other transbands identify regions that are biologically
interesting; the genes at those locations can be listed and analyzed with GO or through interaction
networks. This will decrease the number of possible candidate genes and help researchers
focus resources.
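The distinction between cis- and trans-acting genes and the detection of transbands can be sketched in a few lines. The sketch is written in Python and is not the algorithm from the created R-package; the 10 Mb cis window, the minimum of 3 genes per transband and all records are assumptions chosen for illustration.

```python
from collections import Counter

CIS_WINDOW_MB = 10.0  # assumed cut-off, not a value taken from the thesis


def classify(gene_chr, gene_mb, marker_chr, marker_mb, window_mb=CIS_WINDOW_MB):
    """Label an eQTL 'cis' when the marker lies near the gene, else 'trans'."""
    if gene_chr == marker_chr and abs(gene_mb - marker_mb) <= window_mb:
        return "cis"
    return "trans"


def find_transbands(eqtls, min_genes=3):
    """Count trans-regulated genes per marker; keep markers with many targets."""
    counts = Counter(
        e["marker"] for e in eqtls
        if classify(e["gene_chr"], e["gene_mb"],
                    e["marker_chr"], e["marker_mb"]) == "trans"
    )
    return {marker: n for marker, n in counts.items() if n >= min_genes}


# Hypothetical records: one row per gene with its best-scoring marker.
eqtls = [
    {"gene": "g1", "gene_chr": "1", "gene_mb": 50.0,
     "marker": "mA", "marker_chr": "1", "marker_mb": 52.0},
    {"gene": "g2", "gene_chr": "2", "gene_mb": 10.0,
     "marker": "mB", "marker_chr": "7", "marker_mb": 30.0},
    {"gene": "g3", "gene_chr": "5", "gene_mb": 80.0,
     "marker": "mB", "marker_chr": "7", "marker_mb": 30.0},
    {"gene": "g4", "gene_chr": "7", "gene_mb": 90.0,
     "marker": "mB", "marker_chr": "7", "marker_mb": 30.0},
]
bands = find_transbands(eqtls)
```

Here gene g1 is cis-acting, while marker mB is linked to three genes elsewhere on the genome and is therefore reported as a candidate transband. Note that g4 lies on the same chromosome as mB but outside the assumed window, so it counts as trans.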
Discussion
Pathway information can be reconstructed from a genetical genomics experiment using bioinformatics.
This can help researchers find novel drug targets or uncover new metabolic routes or
differentiation pathways. Microarray-driven research will continue to be used in medicine and
fundamental research. Novel array designs and analysis methods will continue to speed up experiments and increase
their size, thus increasing our knowledge of the genetics underlying complex traits.
With the increase in complexity that comes with larger experiments, new methods also have to be
developed to cope with these amounts of data. The data used during this bachelor thesis was supplied by
Illumina on a hard drive; the raw signal data was 17.7 gigabytes in size, and a summary of this data still takes around
1 gigabyte. Compressing this much data into a thesis or other comprehensible format (a Word document of
~1.5 megabytes) is only possible thanks to past advances in bioinformatics. Because of these huge amounts
of data, quality control should have the utmost priority in any genetical genomics experiment. In this thesis the obtained data was
relatively good; however, we cannot verify which processing steps were undertaken by Illumina before
the data was supplied. Because the data supplied by Illumina was not the raw bead-level data but a
summary of all the beads on 1 array (per bead type), no conclusions can be drawn about the decoding
steps performed by Illumina.
Normalization of microarray data always presents a challenge: the normalization method chosen for
an experiment should suit the data. We would expect the data to follow (in general terms) a
roughly normal distribution, and when more information is available in the dataset (e.g.
Affymetrix technology, which contains match and mismatch probes) this should be used in the
normalization step during pre-processing. Because there are numerous normalization methods, and each
method has at least several variants, it is perhaps better to let the user of the created package choose
which method should be used. With an R package it is relatively easy to add other normalization
methods and let the user choose which one is applied. This does not mean that
normalization of data should be taken too lightly, because choosing an unsuitable normalization approach
will corrupt the data and inevitably lead to wrong conclusions.
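Letting the user choose the normalization method can be sketched with a small lookup table of methods. The sketch is in Python rather than R, and the two methods shown (median scaling and a simplified quantile normalization) are examples chosen for illustration; they are not a statement about which methods the created package implements. All names are hypothetical.

```python
from statistics import median


def median_scale(arrays):
    """Scale every array so that its median equals the mean of all medians."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.
    Ties are broken by original order, which is a simplification."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        normalized = [0.0] * n
        for rank, index in enumerate(order):
            normalized[index] = reference[rank]
        result.append(normalized)
    return result


# New methods can be added by registering one more function here.
METHODS = {"median": median_scale, "quantile": quantile_normalize}


def normalize(arrays, method="quantile"):
    """Apply the normalization method selected by the user."""
    if method not in METHODS:
        raise ValueError(f"unknown normalization method: {method!r}")
    return METHODS[method](arrays)


# Two hypothetical arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = normalize(arrays, method="quantile")
```

Rejecting unknown method names explicitly, instead of silently falling back to a default, fits the warning above that an unsuitable normalization corrupts the data.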
The detection of differentially expressed genes uses a relatively simple statistical test (t-test). Several other
statistical tests can be used to detect significant changes in intensity between conditions and, as with
normalization, the method used usually depends on the type of experiment and the size of the
groups/conditions involved. Detection of differentially expressed genes requires a threshold;
currently the package uses a static threshold set by the user. It would be better to calculate
this threshold from the data itself, thus simplifying the selection of differentially expressed probes.
Another problem is multiple testing and the resulting false positives:
genes that are not significantly changed but are nevertheless picked up by the statistical test. This problem is hard to
circumvent because every statistical test has a certain false positive rate. Setting a false discovery rate (FDR) to
correct for multiple comparisons will decrease the number of false positives; however, with an FDR of
0.05 and 46000 probes analyzed there is still a lot of room for errors. A lack of discriminating
power also means that not every differentially expressed gene will be picked up, and when FDR control is used power will
suffer even more. All of these problems affect gene ontology clustering (which is also a
statistical process), which could lead to a pathway appearing overrepresented while it is not, and
vice versa.
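The size of the multiple testing problem can be made concrete with a short sketch. It is written in Python and uses the Benjamini-Hochberg procedure as an assumed example of FDR control; the thesis does not state which FDR procedure was applied, and the p-values below are illustrative. Without any correction, testing 46000 probes at a per-test threshold of 0.05 would be expected to yield about 2300 false positives if no gene were truly changed. Under FDR control at 0.05, roughly 5% of the genes that are called significant are expected to be false positives.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true
    and each test is done at per-test threshold `alpha`."""
    return n_tests * alpha


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests declared significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its own threshold.
        if p_values[index] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


uncorrected = expected_false_positives(46000, 0.05)

# Illustrative p-values for five probes.
p_values = [0.001, 0.045, 0.035, 0.2, 0.012]
significant = benjamini_hochberg(p_values, fdr=0.05)
```

In this example three probes have p < 0.05, but only two (indices 0 and 4) remain significant after the correction, which shows how FDR control trades power for fewer false positives.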
Expression QTL mapping is a relatively new technique; few problems are known to be
associated with it other than false positive/negative QTLs due to the statistical tests done. These can lead
to false transbands and thus, instead of helping researchers find locations of interest, steer
them toward uninteresting regions because of falsely significant QTLs. The transbands found in this
thesis should be compared to the transbands found in the earlier research into stem cells done by the
UMCG. When these match, more will be known about the interactions within each cell and how
these interactions change between cell types. In this way eQTL mapping will help in understanding the
basics of haemopoietic development.
Future perspectives
To continue this research and find out which genes are involved in the complex trait of differentiation,
several steps should be taken. These include scaling up the experiment to increase sensitivity, but
also analysis of the ‘suspected’ genes using recombinant techniques. These suspected genes can be
found at the locations where transbands are seen. Combined analysis of gene ontology, interaction
networks and expression QTL maps can narrow down the suspected genes in those regions. The
analysis of the progenitor cells also has to be checked, and an explanation has to be found for why the progenitor cell type has
such a clear transband. Identifying whether this is a source of error or another effect is important for further
analysis. Implementing a view that makes it possible to see QTL transband shifts is also
necessary for biological users to actively use this software. The analysis protocol and the created
package can be extended to include more normalization methods or other annotation schemes; this will
depend on future requirements and questions from biologists and life science researchers.
Further analysis should reveal which genes are responsible for differentiation and proliferation; this
knowledge can then be used in many fields associated with stem cell biology. Parallels can also be
drawn to humans, because Mus musculus has long been used as a model animal for human
diseases and development. Identifying which pathways, genes or transbands are associated with erythrocyte
and granulocyte development will help future researchers develop novel applications in several
fields. Another step that has to be taken in the future is the use of bead-level data to check probe results:
some probes will not always function as expected or will show cross-hybridization from multiple
sources, interfering with the results of the analysis. By studying those effects in more detail, other
sources of variation can be excluded, increasing the power of detection and yielding a cleaner
dataset. This approach also includes checking the barcodes used by Illumina to see whether they
could interfere with the experiment.
Comparison with Affymetrix GeneChips to validate the cross-platform compatibility of experiments
and results will also have to be done in the future. This can be done by comparing the QTLs obtained from
the Illumina experiment to those from the Affymetrix experiment. If similar genes are found and they show similar
behavior (cis/trans and up/down regulation) on both platforms, this will validate the results obtained
from the Illumina experiment.
In the future, a protocol for handling Illumina data that is more general than the method
proposed here should perhaps be implemented, or the created package should be redesigned to offer more advanced
analyses to the user. End users of the created R-package should be able to do more with it if
it is ever going to be used for pre-processing and analyzing Illumina data. Options include adding a graphical user
interface (GUI) for more biologically oriented users, or adding a web interface to ease use of the package.
The package and the functions it provides should also be checked and submitted to CRAN
(Comprehensive R Archive Network). Acceptance by CRAN would provide easy access to
the package from all over the world.
Combining eQTL analysis with gene ontology or interaction networks would help biological
users understand more quickly what is going on in the analyzed cell. This combination allows gene
ontology analysis to use the information generated during QTL analysis to increase knowledge about
pathways and their regulation and regulators. The same holds for protein interaction networks: information
from the QTL analysis could be integrated into the interaction network to find new interaction
partners of genes or to identify regions controlling changes in interaction during haemopoietic
development.
Literature
1. Jansen RC, Nap JP: Genetical genomics: the added value from segregation. Trends Genet 2001, 17(7):388-391.
2. de Koning DJ, Haley CS: Genetical genomics in humans and model organisms. Trends Genet 2005, 21(7):377-381.
3. Shen R, Fan JB, Campbell D, Chang W, Chen J, Doucet D, Yeakley J, Bibikova M, Wickham Garcia E, McBride C et al: High-throughput SNP genotyping on universal bead arrays. Mutat Res 2005, 573(1-2):70-82.
4. Jansen RC: Studying complex biological systems using multifactorial perturbation. Nat Rev Genet 2003, 4(2):145-151.
5. Steemers FJ, Gunderson KL: Illumina, Inc. Pharmacogenomics 2005, 6(7):777-782.
6. Bystrykh L, Weersing E, Dontje B, Sutton S, Pletcher MT, Wiltshire T, Su AI, Vellenga E, Wang J, Manly KF et al: Uncovering regulatory pathways that affect haemopoietic stem cell function using 'genetical genomics'. Nat Genet 2005, 37(3):225-232.
7. Kamminga LM, Bystrykh LV, de Boer A, Houwer S, Douma J, Weersing E, Dontje B, de Haan G: The Polycomb group gene Ezh2 prevents haemopoietic stem cell exhaustion. Blood 2006, 107(5):2170-2179.
8. Steemers FJ, Gunderson KL: Whole genome genotyping technologies on the BeadArray platform. Biotechnol J 2007, 2(1):41-49.
9. Eggle D, Schultze J: IlluminaGUI: Graphical User Interface for analyzing gene expression data generated on the Illumina platform. Bioinformatics 2007.
10. Fan JB, Gunderson KL, Bibikova M, Yeakley JM, Chen J, Wickham Garcia E, Lebruska LL, Laurent M, Shen R, Barker D: Illumina universal bead arrays. Methods Enzymol 2006, 410:57-73.
11. Kuhn K, Baker SC, Chudin E, Lieu MH, Oeser S, Bennett H, Rigault P, Barker D, McDaniel TK, Chee MS: A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res 2004, 14(11):2347-2356.
12. Gunderson KL, Kruglyak S, Graige MS, Garcia F, Kermani BG, Zhao C, Che D, Dickinson T, Wickham E, Bierle J et al: Decoding randomly ordered DNA arrays. Genome Res 2004, 14(5):870-877.
13. Bueno Filho JS, Gilmour SG, Rosa GJ: Design of microarray experiments for genetical genomics studies. Genetics 2006, 174(2):945-957.
14. Verdugo RA, Medrano JF: Comparison of gene coverage of mouse oligonucleotide microarray platforms. BMC Genomics 2006, 7:58.
15. Steinberg G, Stromsborg K, Thomas L, Barker D, Zhao C: Strategies for covalent attachment of DNA to beads. Biopolymers 2004, 73(5):597-605.
16. Joos B, Kuster H, Cone R: Covalent attachment of hybridizable oligonucleotides to glass supports. Anal Biochem 1997, 247(1):96-101.
17. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307-315.
18. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185-193.
19. Boes T, Neuhauser M: Normalization for Affymetrix GeneChips. Methods Inf Med 2005, 44(3):414-417.
20. Stoyanova R, Querec TD, Brown TR, Patriotis C: Normalization of single-channel DNA array data by principal component analysis. Bioinformatics 2004, 20(11):1772-1784.
21. Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H: Normalization strategies for cDNA microarrays. Nucleic Acids Res 2000, 28(10):E47.
22. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25-29.
23. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2007, 35(Database issue):D21-25.
24. McEntyre J: Linking up with Entrez. Trends Genet 1998, 14(1):39-40.
25. Wrobel G, Chalmel F, Primig M: goCluster integrates statistical analysis and functional interpretation of microarray expression data. Bioinformatics 2005, 21(17):3575-3577.
26. Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21(16):3448-3449.
27. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498-2504.
28. Avila-Campillo I, Drew K, Lin J, Reiss DJ, Bonneau R: BioNetBuilder: automatic integration of biological networks. Bioinformatics 2007, 23(3):392-393.
Web resources & Public databases
WEB RESOURCES
http://www.arisesoft.com
http://www.Illumina.com
http://www.bioconductor.org
http://www.ncbi.nlm.nih.gov/entrez
http://www.geneontology.org
http://www.genenetwork.org
http://r-project.org
http://cran.r-project.org
http://www.cytoscape.org
http://err.bio.nyu.edu/cytoscape/bionetbuilder
PUBLIC DATABASES
http://www.genome.jp/kegg - KEGG - Kyoto encyclopedia of genes and genomes.
http://bond.unleashedinformatics.com - BIND - Biomolecular interaction network database.
http://www.thebiogrid.org - BioGRID - General repository for interaction datasets.
http://dip.doe-mbi.ucla.edu - DIP - Database of interacting proteins.
http://128.97.39.94/cgi-bin/functionator/pronav - ProLINK - Protein interactions.