mogsa: gene set analysis on multiple omics data

Chen Meng
Modified: March 17, 2015. Compiled: June 13, 2015.
Contents
1 MOGSA overview
2 Run mogsa
2.1 Quick start
2.2 Result analysis and interpretation
2.3 Plot gene sets in projected space
2.4 Perform MOGSA in two steps
3 Preparation of gene set data
4 Session info
1 MOGSA overview
Modern "omics" technologies enable quantitative monitoring of the abundance of various biological molecules in a
high-throughput manner, accumulating an unprecedented amount of quantitative information on a genomic scale.
Gene set analysis is particularly useful in high-throughput data analysis because it summarizes information at the
single-gene level into biologically informative gene-set-level information. The mogsa package provides a method for
gene set analysis based on multiple omics datasets that describe the same set of observations/samples.
The MOGSA algorithm consists of three steps. In the first step, multiple omics datasets are integrated using a
multi-table multivariate analysis, such as multiple factor analysis (MFA) [1]. MFA projects the observations and
variables (genes) from each dataset onto a lower-dimensional space, resulting in sample scores (or PCs) and variable
loadings, respectively. Next, gene set annotations are projected as additional information onto the same space,
generating a set of scores for each gene set across samples [2]. In the final step, MOGSA generates a gene set
score (GSS) matrix by reconstructing the sample scores and gene set scores. A high GSS indicates that the
variables in that gene set have measurements in one or more datasets that explain a large proportion of
the correlated information across data tables. Variables (genes) unique to individual datasets or common among
matrices may contribute to a high GSS. For example, in a gene set, a few genes may have high levels of gene
expression, others may have increased protein levels, and a few may have amplifications in copy number.
In this document, we show with an example how to use MOGSA to integrate and annotate multiple omics data.
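The three steps described above can be illustrated with a minimal base-R sketch on toy data. This is not the mogsa implementation: the toy matrices, the helper prep_table, and the use of a plain SVD on the stacked, weighted tables are assumptions chosen only to mirror the MFA idea.

```r
set.seed(1)

# Toy data (assumption): two "omics" tables measured on the same 6 samples
x1 <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)  # 20 features
x2 <- matrix(rnorm(15 * 6), nrow = 15, ncol = 6)  # 15 features

# Hypothetical helper: center each row, then divide the table by its first
# singular value so that no single table dominates (MFA-style weighting)
prep_table <- function(x) {
  x <- x - rowMeans(x)
  x / svd(x)$d[1]
}
X <- rbind(prep_table(x1), prep_table(x2))  # 35 features x 6 samples

# Step 1: integrate the tables and keep the first nf components
nf <- 2
dec <- svd(X, nu = nf, nv = nf)
loadings <- dec$u                                        # 35 x nf
sample_scores <- dec$v %*% diag(dec$d[1:nf], nrow = nf)  # 6 x nf

# Step 2: project a binary annotation matrix (features x gene sets)
G <- matrix(rbinom(35 * 3, 1, 0.3), nrow = 35, ncol = 3)
gs_scores <- t(G) %*% loadings                           # 3 x nf

# Step 3: reconstruct the gene set score matrix (gene sets x samples)
gss <- gs_scores %*% t(sample_scores)
dim(gss)
```

Under these assumptions, each entry of gss is the sum, over the genes of one gene set, of the rank-nf approximation of the preprocessed data for one sample.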
2 Run mogsa
2.1 Quick start
In this working example, we will analyze the NCI-60 transcriptomic data from 4 different microarray platforms. The
goal is to explore which functions (gene sets) are associated with which tumor types, through high or low expression.
First, load the library and data:
library(mogsa)
library(gplots) # used for visualizing heatmaps
# loading gene expression data and supplementary data
data(NCI60_4array_supdata)
data(NCI60_4arrays)
NCI60_4arrays is a list of data.frames. The list consists of microarray data for the NCI-60 cell lines from different
platforms. In each data.frame, columns are the 60 cell lines and rows are genes. The data were downloaded
from [3], but only a small subset of genes was selected. Therefore, the results in this vignette are not intended for
biological interpretation.
NCI60_4array_supdata is a list of matrices representing gene set annotation data. For each microarray
dataset, there is a corresponding annotation matrix. In the annotation data, the rows are genes (in the same order
as in their original dataset) and the columns are gene sets. An annotation matrix is a binary matrix, where 1 indicates
that a gene is present in a gene set and 0 otherwise. See the "Preparation of gene set data" section for how to create
the gene set annotation matrices required by mogsa. To get an overview of the two datasets:
sapply(NCI60_4arrays, dim) # check dimensions of expression data
##      agilent hgu133 hgu133p2 hgu95
## [1,]     300    298      268   288
## [2,]      60     60       60    60
sapply(NCI60_4array_supdata, dim) # check dimensions of supplementary data
##      agilent hgu133 hgu133p2 hgu95
## [1,]     300    298      268   288
## [2,]     150    150      150   150
# check if the gene expression data and annotation data are matched in the same order
identical(names(NCI60_4arrays), names(NCI60_4array_supdata))
## [1] TRUE
head(rownames(NCI60_4arrays$agilent)) # the type of gene IDs
## [1] "ST8SIA1" "YWHAQ"   "EPHA4"   "GTPBP5"  "PVR"     "ATP6V1H"
Also, we need to confirm that the columns of the expression data and annotation data are mapped in the same
order. To verify this, we do
dataColNames <- lapply(NCI60_4arrays, colnames)
# note: supColNames is also built from NCI60_4arrays, so the comparison below is TRUE by construction
supColNames <- lapply(NCI60_4arrays, colnames)
identical(dataColNames, supColNames)
## [1] TRUE
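Because each annotation matrix must follow the gene order of its dataset, a structural check between the two lists can be useful. The sketch below is one possible way to write such a check in base R; the helper check_matched and the toy objects are hypothetical and only compare list names and row counts.

```r
# Hypothetical helper: verify that each expression table and its annotation
# matrix share the same list name and the same number of rows (genes)
check_matched <- function(data_list, sup_list) {
  same_names <- identical(names(data_list), names(sup_list))
  same_rows <- identical(
    vapply(data_list, nrow, integer(1)),
    vapply(sup_list, nrow, integer(1))
  )
  same_names && same_rows
}

# Toy objects (assumption): two platforms, 4 samples, 3 gene sets
toy_data <- list(
  a = data.frame(matrix(rnorm(5 * 4), 5, 4)),
  b = data.frame(matrix(rnorm(6 * 4), 6, 4))
)
toy_sup <- list(
  a = matrix(0, nrow = 5, ncol = 3),
  b = matrix(0, nrow = 6, ncol = 3)
)
check_matched(toy_data, toy_sup)
```

The same helper could be called on the two lists loaded earlier; it does not compare gene identifiers, so it is only a first sanity check.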
Before applying MOGSA, we first define a factor describing the tissue of origin of the cell lines and a color code,
which will be used later.
# define cancer type
cancerType <- as.factor(substr(colnames(NCI60_4arrays$agilent), 1, 2))
# define color code to distinguish cancer types
colcode <- cancerType
levels(colcode) <- c("black", "red", "green", "blue",
                     "cyan", "brown", "pink", "gray", "orange")
colcode <- as.character(colcode)
Figure 1: The variance of each principal component (PC). The contributions of the different datasets (agilent, hgu133, hgu133p2, hgu95) are distinguished by different colors.
Then, we call the function mogsa to run MOGSA:
mgsa1 <- mogsa(x = NCI60_4arrays, sup=NCI60_4array_supdata, nf=3,
               proc.row = "center_ssq1", w.data = "inertia", statis = TRUE)
In this function, the argument proc.row specifies the preprocessing of rows and the argument w.data specifies the
weighting of datasets. The last argument, statis, selects the multiple-table analysis method. Two multivariate
methods are available at present: "STATIS" (statis=TRUE) [4] and multiple factor analysis (MFA; statis=FALSE,
the default setting) [1].
In this analysis, we arbitrarily selected the top three PCs (nf=3). In practice, however, the number of PCs needs to be
determined before running MOGSA. Therefore, it is also possible to run the multivariate analysis and the projection
of annotation data separately. After running the multivariate analysis, a scree plot of the eigenvalues of the PCs can
be used to determine the proper number of PCs to include in the annotation projection step (see the "Perform
MOGSA in two steps" section).
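To make the two preprocessing arguments (proc.row and w.data) more concrete, the sketch below shows one plausible reading of the settings used here. It assumes that "center_ssq1" centers each row and scales it to a sum of squares of 1, and that "inertia" rescales each dataset by its total inertia (sum of squared values). Both readings are assumptions inferred from the option names, and the helper functions are hypothetical; the help page of moa gives the authoritative definitions.

```r
set.seed(2)

# Hypothetical helper: center each row and scale it to a sum of squares of 1
center_ssq1 <- function(x) {
  x <- x - rowMeans(x)
  x / sqrt(rowSums(x^2))
}

# Hypothetical helper: rescale a table so that its total inertia
# (sum of squared values) equals 1
weight_inertia <- function(x) x / sqrt(sum(x^2))

# Toy tables (assumption): 8 and 5 features measured on the same 6 samples
tabs <- list(a = matrix(rnorm(8 * 6), 8, 6), b = matrix(rnorm(5 * 6), 5, 6))
tabs_prep <- lapply(tabs, function(x) weight_inertia(center_ssq1(x)))

# total inertia per table after weighting
sapply(tabs_prep, function(x) sum(x^2))
```

With this kind of weighting, tables with different numbers of features contribute the same total variance to the joint analysis.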
2.2 Result analysis and interpretation
The function mogsa returns an object of class mgsa. The information stored in this object can be extracted with the
function getmgsa. First, we want to know the variance explained by each PC in the different datasets (figure 1).
eigs <- getmgsa(mgsa1, "partial.eig") # get partial "eigenvalue" for separate data
barplot(as.matrix(eigs), legend.text = rownames(eigs))
Figure 2: Heatmap showing the gene set score (GSS) matrix (rows: gene sets; columns: the 60 cell lines; color key: row Z-score).
The main result returned by mogsa is the gene set score (GSS) matrix. Each value in the matrix indicates the overall
activity level of a gene set in a sample. The matrix can be extracted and visualized by
# get the score matrix
scores <- getmgsa(mgsa1, "score")
heatmap.2(scores, trace = "n", scale = "r", Colv = NULL, dendrogram = "row",
          margins = c(6, 10), ColSideColors=colcode)
Figure 2 shows the gene set score matrix returned by mogsa. The rows of the matrix are all the gene sets used
to annotate the data. However, we are mostly interested in the gene sets with a large number of significant gene set
scores, because these gene sets describe the differences across cell lines. The corresponding p-value for each gene
set score can be extracted with getmgsa. The most significant gene sets can then be defined as the gene sets that
contain the highest number of significant p-values. For example, to select the top 20 most significant gene
sets and plot them in a heatmap, we do:
p.mat <- getmgsa(mgsa1, "p.val") # get p value matrix
# select gene sets with most significant GSS scores.
top.gs <- sort(rowSums(p.mat < 0.01), decreasing = TRUE)[1:20]
top.gs.name <- names(top.gs)
top.gs.name
##  [1] "PASINI_SUZ12_TARGETS_DN"
##  [2] "CHARAFE_BREAST_CANCER_LUMINAL_VS_BASAL_DN"
##  [3] "KOINUMA_TARGETS_OF_SMAD2_OR_SMAD3"
##  [4] "CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_DN"
##  [5] "DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN"
##  [6] "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN"
##  [7] "LIM_MAMMARY_STEM_CELL_UP"
##  [8] "LIU_PROSTATE_CANCER_DN"
##  [9] "CHICAS_RB1_TARGETS_CONFLUENT"
## [10] "NUYTTEN_EZH2_TARGETS_UP"
## [11] "PUJANA_ATM_PCC_NETWORK"
## [12] "DACOSTA_UV_RESPONSE_VIA_ERCC3_DN"
## [13] "KRIGE_RESPONSE_TO_TOSEDOSTAT_24HR_DN"
## [14] "WONG_ADULT_TISSUE_STEM_MODULE"
## [15] "KRIEG_HYPOXIA_NOT_VIA_KDM3A"
## [16] "MULTICELLULAR_ORGANISMAL_DEVELOPMENT"
## [17] "ANATOMICAL_STRUCTURE_DEVELOPMENT"
## [18] "ZWANG_CLASS_1_TRANSIENTLY_INDUCED_BY_EGF"
## [19] "PLASMA_MEMBRANE_PART"
## [20] "KRIGE_RESPONSE_TO_TOSEDOSTAT_6HR_DN"
heatmap.2(scores[top.gs.name, ], trace = "n", scale = "r", Colv = NULL, dendrogram = "row",
          margins = c(6, 10), ColSideColors=colcode)
Figure 3: Heatmap showing the gene set score (GSS) matrix for the top 20 significant gene sets (columns: the 60 cell lines; color key: row Z-score).
The result is shown in figure 3. We can see that these gene sets reflect the difference between leukemia and other
tumors.
So far, we have obtained an integrative overview of gene set activity levels across the 60 cell lines. It is also
interesting to look at more detailed information for a specific gene set. For example, which dataset(s) contribute
most to the high or low gene set score of a gene set? And which genes are most important in defining the gene set
score of a gene set? The former question can be answered by the gene set score decomposition; the latter question
can be answered by the gene influential score. These analyses can be done with decompose.gs.group and GIS.
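The idea of a data-wise decomposition can be sketched in base R on toy data. The sketch assumes that a gene set score is the sum of reconstructed values over the genes of a gene set, so that the features of each dataset contribute an additive part. It is an illustration under that assumption, not the code behind decompose.gs.group, and all object names are hypothetical.

```r
set.seed(3)

# Toy data (assumption): preprocessed tables from two platforms, 5 samples
X_list <- list(p1 = matrix(rnorm(10 * 5), 10, 5),
               p2 = matrix(rnorm(7 * 5), 7, 5))
X <- do.call(rbind, X_list)                      # 17 features x 5 samples
src <- rep(names(X_list), sapply(X_list, nrow))  # dataset of origin per feature

# Rank-nf reconstruction of the stacked data
nf <- 2
dec <- svd(X, nu = nf, nv = nf)
X_hat <- dec$u %*% diag(dec$d[1:nf], nrow = nf) %*% t(dec$v)

# One toy gene set, encoded as a binary membership vector over all features
g <- rbinom(nrow(X), 1, 0.4)

# Total score per sample, and the part contributed by each dataset
gss_total <- colSums(g * X_hat)
gss_by_data <- sapply(names(X_list), function(k) {
  colSums((g * X_hat)[src == k, , drop = FALSE])
})
gss_by_data  # samples x datasets; each row sums to the total score
```

In this toy setting, a dataset with a long bar in a decomposition plot would correspond to a column of gss_by_data with large absolute values.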
In the first example, we explore the gene set that has the most significant gene set scores. The gene set is
# gene set score decomposition
# we explore two gene sets, the first one
gs1 <- top.gs.name[1] # select the most significant gene set
gs1
## [1] "PASINI_SUZ12_TARGETS_DN"
The data-wise decomposition of this gene set over cancer types is
# decompose the gene set score over datasets
decompose.gs.group(mgsa1, gs1, group = cancerType)
Figure 4: Gene set score (GSS) decomposition (data-wise decomposed gene set scores). The decomposed scores are grouped according to the tissue of origin of the cell lines (BR, CN, CO, LC, LE, ME, OV, PR, RE). The vertical bars show the 95% confidence interval of the means.
Figure 4 shows that leukemia cell lines have the lowest GSS for this gene set. The contributions of each dataset to
the overall gene set score are separated in this plot. In general, there is good concordance between the different
datasets. However, the HGU133 platform contributes most and the Agilent platform contributes least compared with
the other datasets, as represented by the longest and shortest bars.
Next, to identify the most influential genes in this gene set, we call the function GIS:
gis1 <- GIS(mgsa1, gs1) # gene influential score
head(gis1) # print top 6 influencers
##     feature       GIS     data
## 1 TNFRSF12A 1.0000000    hgu95
## 2 TNFRSF12A 0.9783816 hgu133p2
## 3     CD151 0.9601622    hgu95
## 4     ITGB1 0.9449297   hgu133
## 5     CAPN2 0.8967664   hgu133
## 6      LHFP 0.8771236  agilent
Figure 5: The gene influential score (GIS) plot. The GIS are represented as bars, and the dataset from which each gene originates (agilent, hgu133, hgu133p2, hgu95) is distinguished by different colors.
In figure 5, the bars represent the gene influential scores of genes. Genes from different platforms are shown in
different colors. The expression of genes with a high positive GIS is more likely to be positively correlated with
the gene set score. In this example, the most important genes in the gene set "PASINI_SUZ12_TARGETS_DN" are
TNFRSF12A (identified on two different platforms), CD151, ITGB1, etc.
In the next example, we use the same methods to explore the "PUJANA_ATM_PCC_NETWORK" gene set.
# the second gene set
gs2 <- "PUJANA_ATM_PCC_NETWORK"
decompose.gs.group(mgsa1, gs2, group = cancerType, x.legend = "topright")
gis2 <- GIS(mgsa1, "PUJANA_ATM_PCC_NETWORK", topN = 6)
gis2
##   feature       GIS     data
## 1  PIK3CG 1.0000000 hgu133p2
## 2    GMFG 0.9229333   hgu133
## 3  ADRBK1 0.9145966 hgu133p2
## 4    RHOH 0.8979954 hgu133p2
## 5  CENPC1 0.8553077 hgu133p2
## 6    VAV1 0.8290366   hgu133
Figure 6 shows that the leukemia cell lines have the highest GSSs for this gene set, and that the HGU133 and HGU95
platforms contribute relatively strongly to the overall gene set score. The GIS analysis (figure 7) indicates that
PIK3CG and GMFG are the most important genes in this gene set.
Figure 6: Data-wise decomposed GSS for gene set 'PUJANA_ATM_PCC_NETWORK'
Figure 7: GIS plot for gene set 'PUJANA_ATM_PCC_NETWORK'
Figure 8: Cell lines and gene sets projected on PC1 and PC2; the gene sets PUJANA_ATM_PCC_NETWORK and PASINI_SUZ12_TARGETS_DN are labeled.
2.3 Plot gene sets in projected space
We can also see how the gene sets are represented in the lower-dimensional space. Here we show the projection of
gene set annotations on the first two dimensions and label the two gene sets we analyzed before.
fs <- getmgsa(mgsa1, "fac.scr") # extract the factor scores for cell lines (cell line space)
layout(matrix(1:2, 1, 2))
plot(fs[, 1:2], pch=20, col=colcode, axes = FALSE)
abline(v=0, h=0)
legend("topright", col=unique(colcode), pch=20, legend=unique(cancerType), bty = "n")
plotGS(mgsa1, label.cex = 0.8, center.only = TRUE, topN = 0, label = c(gs1, gs2))
2.4 Perform MOGSA in two steps
The function mogsa performs MOGSA in one step. In practice, however, one needs to determine how many PCs should
be retained in the step that reconstructs the gene set score matrix. A scree plot of the eigenvalues resulting from
the multivariate analysis can be used for this purpose. Therefore, we can perform the multivariate data analysis and
the gene set annotation projection in two steps. To do the multivariate analysis, we call moa:
# perform multivariate analysis
ana <- moa(NCI60_4arrays, proc.row = "center_ssq1", w.data = "inertia", statis = TRUE)
slot(ana, "partial.eig")[, 1:6] # extract the eigenvalue
##                   PC1          PC2          PC3          PC4          PC5          PC6
## agilent  0.0005406833 0.0004119778 0.0002410063 0.0004038087 0.0001317894 0.0001783712
## hgu133   0.0007410830 0.0005850680 0.0003507538 0.0001448788 0.0001685482 0.0001042850
## hgu133p2 0.0007716595 0.0005146566 0.0003742008 0.0001281515 0.0001487516 0.0001203610
## hgu95    0.0008042677 0.0006210049 0.0003942394 0.0001506287 0.0001752495 0.0001102364
# show the eigenvalues in scree plot:
layout(matrix(1:2, 1, 2))
plot(ana, value="eig", type = 2, n=20, main="variance of PCs") # use '?"moa-class"' to check
plot(ana, value="tau", type = 2, n=20, main="Scaled variance of PCs")
Figure 9: Scree plots with panels "variance of PCs" and "Scaled variance of PCs"; the contributions of the datasets (agilent, hgu133, hgu133p2, hgu95) are distinguished by different colors.
The multivariate analysis (moa) returns an object of class moa-class. The scree plot shows that the top 3 PCs are the
most significant, since they explain much more variance than the others. Several other methods, such as the informal
"elbow test" or more formal tests, could be used to determine the number of retained PCs [5]. To be consistent with
the previous example, we use the top 3 PCs in the analysis:
mgsa2 <- mogsa(x = ana, sup=NCI60_4array_supdata, nf=3)
## Warning in mogsa(x = ana, sup = NCI60_4array_supdata, nf = 3): x is an object of "moa", statis is not used
identical(mgsa1, mgsa2) # check if the two methods give the same results
## [1] FALSE
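The choice of the number of retained PCs discussed in this section can be sketched numerically in base R. The toy matrix and the 70% cumulative variance threshold below are illustrative assumptions only; they are neither part of mogsa nor a recommended cut-off.

```r
set.seed(4)

# Toy data (assumption): 40 features x 12 samples with two strong components
signal <- matrix(rnorm(40 * 2), 40, 2) %*% matrix(rnorm(2 * 12), 2, 12)
X <- 3 * signal + matrix(rnorm(40 * 12), 40, 12)
X <- X - rowMeans(X)  # center each row

# Eigenvalues are the squared singular values
eig <- svd(X)$d^2
prop_var <- eig / sum(eig)   # proportion of variance per PC
cum_var <- cumsum(prop_var)  # cumulative proportion of variance

# Hypothetical rule of thumb: keep the fewest PCs reaching 70% of the variance
n_keep <- which(cum_var >= 0.7)[1]

round(cbind(prop_var, cum_var), 3)
n_keep
```

Passing prop_var to barplot would draw the corresponding scree plot, in which an "elbow" marks the point after which additional PCs add little variance.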
3 Preparation of gene set data
The package GSEABase provides several methods to create a gene set list [6]. In mogsa, there are two methods to
create a gene set list. The first one generates a gene set list from the package graphite [7] using the function
prepGraphite.
library(graphite)
keggdb <- prepGraphite(db = pathways("hsapiens", "kegg")[1:50], id = "symbol")
## converting identifiers!
## converting identifiers done!
keggdb[1:2]
## $`Acute myeloid leukemia`
##  [1] "PIK3CB"   "PIK3R5"   "PIK3R1"   "PIK3CA"   "PIK3CD"   "PIK3R2"   "PIK3R3"
##  [8] "PIK3CG"   "FLT3"     "RUNX1T1"  "RUNX1"    "STAT3"    "STAT5A"   "STAT5B"
## [15] "AKT2"     "AKT1"     "AKT3"     "MTOR"     "NRAS"     "KRAS"     "HRAS"
## [22] "KIT"      "SOS1"     "SOS2"     "ZBTB16"   "RARA"     "PML"      "JUP"
## [29] "IKBKG"    "CHUK"     "IKBKB"    "MAP2K1"   "MAP2K2"   "ARAF"     "BRAF"
## [36] "RAF1"     "GRB2"     "CEBPA"    "PIM2"     "EIF4EBP1" "MYC"      "NFKB1"
## [43] "LEF1"     "PIM1"     "PPARD"    "MAPK1"    "MAPK3"    "BAD"      "CCND1"
## [50] "RELA"     "RPS6KB1"  "RPS6KB2"  "SPI1"     "TCF7"     "TCF7L2"   "TCF7L1"
## [57] "CCNA1"
##
## $`Adherens junction`
##  [1] "RAC1"    "RAC2"    "RAC3"    "WASF2"   "VCL"     "BAIAP2"  "ACTN4"   "ACTN1"
##  [9] "ACTN2"   "ACTN3"   "ACTB"    "ACTG1"   "PTPRB"   "CTNNA3"  "CTNNA1"  "FER"
## [17] "CTNNA2"  "IGF1R"   "FYN"     "CSNK2A1" "PTPRF"   "CSNK2B"  "MET"     "TCF7"
## [25] "EGFR"    "PTPN1"   "IQGAP1"  "SRC"     "TCF7L2"  "CSNK2A2" "PTPRM"   "LEF1"
## [33] "TCF7L1"  "ACP1"    "ERBB2"   "CDH1"    "PTPRJ"   "PTPN6"   "YES1"    "MLLT4"
## [41] "TGFBR2"  "TGFBR1"  "SMAD2"   "SMAD3"   "SSX2IP"  "SORBS1"  "LMO7"    "MAP3K7"
## [49] "PVRL1"   "PVRL3"   "PVRL4"   "PVRL2"   "CDC42"   "CTNND1"  "WASL"    "RHOA"
## [57] "WAS"     "WASF3"   "WASF1"   "MAPK3"   "MAPK1"   "FGFR1"   "FARP2"   "CTNNB1"
## [65] "SMAD4"   "NLK"     "PARD3"   "SNAI2"   "SNAI1"   "TJP1"
The second method is to create a gene set list from "gmt" files, which can be downloaded from MSigDB [8].
dir <- system.file(package = "mogsa")
preGS <- prepMsigDB(file=paste(dir, "/extdata/example_msigdb_data.gmt.gz", sep = ""))
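For readers unfamiliar with the "gmt" format, the sketch below parses a small toy file in base R. In this format, each line holds a gene set name, a description, and the member genes, separated by tabs. The helper read_gmt_sketch and the toy file are hypothetical and are not part of mogsa; prepMsigDB remains the function to use with the package.

```r
# Hypothetical helper: parse a GMT file into a named list of gene sets.
# Each line holds: set name <TAB> description <TAB> gene1 <TAB> gene2 ...
read_gmt_sketch <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  sets <- lapply(fields, function(f) unique(f[-(1:2)]))
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

# Toy GMT file (assumption) with two illustrative gene sets
tmp <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\tdescription A\tTP53\tMYC\tEGFR",
             "SET_B\tdescription B\tKRAS\tTP53"), tmp)
toy_sets <- read_gmt_sketch(tmp)
str(toy_sets)
```

The result is a named list of character vectors, the same general shape as the gene set list printed for keggdb.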
In order to use the gene set information in mogsa, we have to convert the list of gene sets to a list of annotation
matrices. This can be done with prepSupMoa. This function requires two obligatory inputs: the first is the multiple
omics datasets and the second is a gene set list, a GeneSet or a GeneSetCollection. The output of
prepSupMoa can be passed directly to mogsa.
# prepare the annotation matrices
sup_data1 <- prepSupMoa(NCI60_4arrays, geneSets=keggdb)
mgsa3 <- mogsa(x = NCI60_4arrays, sup=sup_data1, nf=3,
proc.row = "center_ssq1", w.data = "inertia", statis = TRUE)
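To illustrate the kind of conversion described above, the following base-R sketch builds one binary annotation matrix per dataset from a gene set list. The helper make_annot and the toy inputs are hypothetical; the sketch only mirrors the documented shape of the annotation data (rows are genes in dataset order, columns are gene sets) and is not the implementation of prepSupMoa.

```r
# Hypothetical helper: one binary matrix per dataset, rows in the same order
# as the dataset's features, columns are gene sets (1 = gene is in the set)
make_annot <- function(data_list, gene_sets) {
  lapply(data_list, function(x) {
    ids <- rownames(x)
    m <- vapply(gene_sets, function(gs) as.numeric(ids %in% gs),
                numeric(length(ids)))
    matrix(m, nrow = length(ids), dimnames = list(ids, names(gene_sets)))
  })
}

# Toy inputs (assumption): two platforms with partly different genes
toy_data <- list(
  p1 = matrix(0, 3, 2, dimnames = list(c("TP53", "MYC", "EGFR"), NULL)),
  p2 = matrix(0, 2, 2, dimnames = list(c("KRAS", "TP53"), NULL))
)
toy_sets <- list(SET_A = c("TP53", "MYC"), SET_B = c("KRAS", "TP53"))
toy_annot <- make_annot(toy_data, toy_sets)
toy_annot$p1
```

Each dataset gets its own matrix because the platforms do not measure exactly the same genes, while the gene set columns are shared across datasets.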
4 Session info
toLatex(sessionInfo())
• R version 3.2.1 beta (2015-06-08 r68489), x86_64-unknown-linux-gnu
• Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C,
LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C,
LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C
• Base packages: base, datasets, grDevices, graphics, methods, parallel, stats, stats4, utils
• Other packages: AnnotationDbi 1.30.1, Biobase 2.28.0, BiocGenerics 0.14.0, DBI 0.3.1,
GenomeInfoDb 1.4.0, IRanges 2.2.4, RSQLite 1.0.0, S4Vectors 0.6.0, gplots 2.17.0, graphite 1.14.0,
knitr 1.10.5, mogsa 1.0.1, org.Hs.eg.db 3.1.2
• Loaded via a namespace (and not attached): BiocStyle 1.6.0, GSEABase 1.30.2, KernSmooth 2.23-14,
XML 3.98-1.2, annotate 1.46.0, bitops 1.0-6, caTools 1.17.1, codetools 0.2-11, digest 0.6.8, evaluate 0.7,
formatR 1.2, gdata 2.16.1, genefilter 1.50.0, graph 1.46.0, gtools 3.5.0, highr 0.5, magrittr 1.5, splines 3.2.1,
stringi 0.4-1, stringr 1.0.0, survival 2.38-2, tools 3.2.1, xtable 1.7-4
References
[1] Herve Abdi, Lynne J. Williams, and Dominique Valentin. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics, 5:149–179, 2013.
[2] M. de Tayrac, S. Le, M. Aubry, J. Mosser, and F. Husson. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: Multiple factor analysis approach. BMC Genomics, 10:32, 2009.
[3] Reinhold WC, Sunshine M, Liu H, Varma S, Kohn KW, Morris J, Doroshow J, and Pommier Y. CellMiner: A web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set. Cancer Research, 72(14):3499–511, 2012.
[4] Herve Abdi, Lynne J. Williams, Dominique Valentin, and Mohammed Bennani-Dosse. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics, 4:124–167, 2012.
[5] Herve Abdi and Lynne J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2:433–459, 2010.
[6] Morgan M, Falcon S, and Gentleman R. GSEABase: Gene set enrichment data structures and methods. R package version 1.28.0.
[7] Gabriele Sales, Enrica Calura, Duccio Cavalieri, and Chiara Romualdi. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics, 13:20, 2012.
[8] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102:15545–15550, 2005.