mogsa: gene set analysis on multiple omics data
Chen Meng
Modified: March 17, 2015. Compiled: June 13, 2015.

Contents
1 MOGSA overview
2 Run mogsa
2.1 Quick start
2.2 Result analysis and interpretation
2.3 Plot gene sets in projected space
2.4 Perform MOGSA in two steps
3 Preparation of gene set data
4 Session info

1 MOGSA overview

Modern "omics" technologies enable quantitative monitoring of the abundance of various biological molecules in a high-throughput manner, accumulating an unprecedented amount of quantitative information on a genomic scale. Gene set analysis is particularly useful in high-throughput data analysis because it summarizes single-gene-level information at the level of biologically informative gene sets. The mogsa package provides a method for gene set analysis based on multiple omics data that describe the same set of observations/samples.

The MOGSA algorithm consists of three steps. In the first step, multiple omics data are integrated using multi-table multivariate analysis, such as multiple factorial analysis (MFA) [1]. MFA projects the observations and variables (genes) from each dataset onto a lower dimensional space, resulting in sample scores (or PCs) and variable loadings, respectively. Next, gene set annotations are projected as additional information onto the same space, generating a set of scores for each gene set across samples [2]. In the final step, MOGSA generates a gene set score (GSS) matrix by reconstructing the sample scores and gene set scores.
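The three steps above can be sketched with base R on toy data. This is a simplified illustration and not the package's implementation: it assumes a plain singular value decomposition of the row-centered, concatenated tables, with each table divided by its first singular value as is done in MFA. All object names (x1, x2, annot, gss) and the toy gene sets are hypothetical.

```r
set.seed(1)

# Toy data: two "omics" tables measured on the same 6 samples (columns)
x1 <- matrix(rnorm(10 * 6), nrow = 10,
             dimnames = list(paste0("g", 1:10), paste0("s", 1:6)))
x2 <- matrix(rnorm(8 * 6), nrow = 8,
             dimnames = list(paste0("p", 1:8), paste0("s", 1:6)))

# Step 1 (simplified): center rows, scale each table by its first
# singular value, concatenate the tables and decompose them jointly
center_rows <- function(m) m - rowMeans(m)
x1c <- center_rows(x1)
x2c <- center_rows(x2)
x <- rbind(x1c / svd(x1c)$d[1], x2c / svd(x2c)$d[1])
dec <- svd(x)
nf <- 2                                            # number of PCs kept
loadings <- dec$u[, 1:nf, drop = FALSE]            # variables x nf
sample_scores <- dec$v[, 1:nf, drop = FALSE] %*% diag(dec$d[1:nf], nf)  # samples x nf

# Step 2: project a binary gene set annotation onto the same space
annot <- matrix(0, nrow = nrow(x), ncol = 2,
                dimnames = list(rownames(x), c("setA", "setB")))
annot[c("g1", "g2", "p1"), "setA"] <- 1
annot[c("g5", "p7", "p8"), "setB"] <- 1
gs_space <- t(annot) %*% loadings                  # gene sets x nf

# Step 3: reconstruct the gene set score matrix (gene sets x samples)
gss <- gs_space %*% t(sample_scores)
dimnames(gss) <- list(colnames(annot), colnames(x))
round(gss, 3)
```

Each row of gss is a gene set and each column a sample. In this sketch the same matrix is obtained by summing, within each gene set, the rows of the rank-nf reconstruction of the concatenated data.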
A high GSS indicates that the gene set, and the variables in that gene set, have measurements in one or more datasets that explain a large proportion of the correlated information across data tables. Variables (genes) unique to individual datasets or common among matrices may contribute to a high GSS. For example, in a gene set, a few genes may have high levels of gene expression, others may have increased protein levels, and a few may have amplifications in copy number. In this document, we show with an example how to use MOGSA to integrate and annotate multiple omics data.

2 Run mogsa

2.1 Quick start

In this working example, we will analyze the NCI-60 transcriptomic data from 4 different microarray platforms. The goal is to explore which functions (gene sets) are associated (through high or low expression) with which type of tumor. First, load the library and data:

    # load the packages
    library(mogsa)
    library(gplots) # used for visualizing heatmaps
    # load gene expression data and supplementary data
    data(NCI60_4array_supdata)
    data(NCI60_4arrays)

NCI60_4arrays is a list of data frames. The list consists of microarray data for the NCI-60 cell lines from different platforms. In each data frame, columns are the 60 cell lines and rows are genes. The data was downloaded from [3], but only a small subset of genes was selected. Therefore, the results in this vignette are not intended for biological interpretation.

NCI60_4array_supdata is a list of matrices representing gene set annotation data. For each microarray dataset, there is a corresponding annotation matrix. In the annotation data, the rows are genes (in the same order as in their original dataset) and the columns are gene sets. An annotation matrix is a binary matrix, where 1 indicates that a gene is present in a gene set and 0 otherwise.
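To make the structure of an annotation matrix concrete, the following base R sketch builds one by hand. The gene identifiers are the first six row names of the agilent data shown in this vignette, while the two gene sets (SET_A, SET_B) are hypothetical and exist only for illustration. In practice the package's own helper functions would be used instead.

```r
# Genes measured in one dataset, in the row order of that dataset
gene_ids <- c("ST8SIA1", "YWHAQ", "EPHA4", "GTPBP5", "PVR", "ATP6V1H")

# Hypothetical gene sets; TP53 is listed in SET_A but is not among the
# six measured genes, so it does not appear in the matrix
gene_sets <- list(
  SET_A = c("YWHAQ", "PVR", "TP53"),
  SET_B = c("EPHA4", "ST8SIA1")
)

# One column per gene set: 1 if the gene belongs to the set, 0 otherwise
annot <- sapply(gene_sets, function(gs) as.integer(gene_ids %in% gs))
rownames(annot) <- gene_ids
annot
```

Because the rows are generated from gene_ids, they are guaranteed to be in the same order as the expression data they annotate.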
See the "Preparation of gene set data" section for how to create the gene set annotation matrices required by mogsa. To get an overview of the two datasets:

    sapply(NCI60_4arrays, dim) # check dimensions of expression data
    ##      agilent hgu133 hgu133p2 hgu95
    ## [1,]     300    298      268   288
    ## [2,]      60     60       60    60

    sapply(NCI60_4array_supdata, dim) # check dimensions of supplementary data
    ##      agilent hgu133 hgu133p2 hgu95
    ## [1,]     300    298      268   288
    ## [2,]     150    150      150   150

    # check if the gene expression data and annotation data are matched in the same order
    identical(names(NCI60_4arrays), names(NCI60_4array_supdata))
    ## [1] TRUE

    head(rownames(NCI60_4arrays$agilent)) # the type of gene IDs
    ## [1] "ST8SIA1" "YWHAQ"   "EPHA4"   "GTPBP5"  "PVR"     "ATP6V1H"

We also need to confirm that the rows (genes) of the expression data and of the annotation data are in the same order. Assuming the annotation matrices carry the gene identifiers as row names, the following comparison should return TRUE:

    # compare gene order between expression data and annotation data
    dataRowNames <- lapply(NCI60_4arrays, rownames)
    supRowNames <- lapply(NCI60_4array_supdata, rownames)
    identical(dataRowNames, supRowNames)

Before applying MOGSA, we first define a factor describing the tissue of origin of the cell lines and a color code, which will be used later.
    # define cancer type
    cancerType <- as.factor(substr(colnames(NCI60_4arrays$agilent), 1, 2))
    # define color code to distinguish cancer types
    colcode <- cancerType
    levels(colcode) <- c("black", "red", "green", "blue", "cyan", "brown", "pink", "gray", "orange")
    colcode <- as.character(colcode)

Figure 1: The variance of each principal component (PC). The contributions of the different datasets (agilent, hgu133, hgu133p2, hgu95) are distinguished by different colors.

Then, we call the function mogsa to run MOGSA:

    mgsa1 <- mogsa(x = NCI60_4arrays, sup = NCI60_4array_supdata, nf = 3,
                   proc.row = "center_ssq1", w.data = "inertia", statis = TRUE)

In this function, the input argument proc.row specifies the preprocessing of rows and the argument w.data specifies the weighting of datasets. The last argument, statis, determines which multiple-table analysis method is used. Two multivariate methods are available at present: one is "STATIS" (statis=TRUE) [4], the other is multiple factorial analysis (MFA; statis=FALSE, the default setting) [1].

In this analysis, we arbitrarily selected the top three PCs (nf=3). In practice, the number of PCs needs to be determined before running MOGSA. For this reason, it is also possible to run the multivariate analysis and project the annotation data separately. After running the multivariate analysis, a scree plot of the eigenvalues of each PC can be used to determine the proper number of PCs to include in the annotation projection step (see the "Perform MOGSA in two steps" section).

2.2 Result analysis and interpretation

The function mogsa returns an object of class mgsa. The information in this object can be extracted with the function getmgsa. First, we want to know the variance explained by each PC on the different datasets (figure 1).
    eigs <- getmgsa(mgsa1, "partial.eig") # get partial "eigenvalue" for separate data
    barplot(as.matrix(eigs), legend.text = rownames(eigs))

Figure 2: heatmap showing the gene set score (GSS) matrix. Rows are gene sets, columns are the 60 cell lines, and the color key shows row Z-scores.

The main result returned by mogsa is the gene set score (GSS) matrix. Each value in the matrix indicates the overall activity level of a gene set in a sample. The matrix can be extracted and visualized by

    # get the score matrix
    scores <- getmgsa(mgsa1, "score")
    heatmap.2(scores, trace = "n", scale = "r", Colv = NULL, dendrogram = "row",
              margins = c(6, 10), ColSideColors = colcode)

Figure 2 shows the gene set score matrix returned by mogsa. The rows of the matrix are all the gene sets used to annotate the data. However, we are mostly interested in the gene sets with a large number of significant gene set scores, because these gene sets describe the differences across cell lines. The corresponding p-value for each gene set score can be extracted with getmgsa.
The most significant gene sets can then be defined as those with the highest number of significant p-values. For example, to select the top 20 most significant gene sets and plot them in a heatmap, we do:

    p.mat <- getmgsa(mgsa1, "p.val") # get p value matrix
    # select gene sets with most significant GSS scores
    top.gs <- sort(rowSums(p.mat < 0.01), decreasing = TRUE)[1:20]
    top.gs.name <- names(top.gs)
    top.gs.name
    ##  [1] "PASINI_SUZ12_TARGETS_DN"
    ##  [2] "CHARAFE_BREAST_CANCER_LUMINAL_VS_BASAL_DN"
    ##  [3] "KOINUMA_TARGETS_OF_SMAD2_OR_SMAD3"
    ##  [4] "CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_DN"
    ##  [5] "DUTERTRE_ESTRADIOL_RESPONSE_24HR_DN"
    ##  [6] "REN_ALVEOLAR_RHABDOMYOSARCOMA_DN"
    ##  [7] "LIM_MAMMARY_STEM_CELL_UP"
    ##  [8] "LIU_PROSTATE_CANCER_DN"
    ##  [9] "CHICAS_RB1_TARGETS_CONFLUENT"
    ## [10] "NUYTTEN_EZH2_TARGETS_UP"
    ## [11] "PUJANA_ATM_PCC_NETWORK"
    ## [12] "DACOSTA_UV_RESPONSE_VIA_ERCC3_DN"
    ## [13] "KRIGE_RESPONSE_TO_TOSEDOSTAT_24HR_DN"
    ## [14] "WONG_ADULT_TISSUE_STEM_MODULE"
    ## [15] "KRIEG_HYPOXIA_NOT_VIA_KDM3A"
    ## [16] "MULTICELLULAR_ORGANISMAL_DEVELOPMENT"
    ## [17] "ANATOMICAL_STRUCTURE_DEVELOPMENT"
    ## [18] "ZWANG_CLASS_1_TRANSIENTLY_INDUCED_BY_EGF"
    ## [19] "PLASMA_MEMBRANE_PART"
    ## [20] "KRIGE_RESPONSE_TO_TOSEDOSTAT_6HR_DN"

    heatmap.2(scores[top.gs.name, ], trace = "n", scale = "r", Colv = NULL, dendrogram = "row",
              margins = c(6, 10), ColSideColors = colcode)

Figure 3: heatmap showing the gene set score (GSS) matrix for the top 20 significant gene sets. Rows are gene sets, columns are the 60 cell lines, and the color key shows row Z-scores.

The result is shown in figure 3. We can see that these gene sets reflect the difference between leukemia and other tumors.

So far, we have obtained an integrative overview of gene set activity levels across the 60 cell lines. It is also interesting to look at more detailed information for a specific gene set. For example, which dataset(s) contribute most to the high or low gene set score of a gene set? And which genes are most important in defining the gene set score of a gene set? The former question can be answered by the gene set score decomposition; the latter can be addressed by the gene influential score (GIS). These analyses can be done with the functions decompose.gs.group and GIS.

In the first example, we explore the gene set that has the most significant gene set scores. The gene set is

    # gene set score decomposition
    # we explore two gene sets, the first one
    gs1 <- top.gs.name[1] # select the most significant gene set
    gs1
    ## [1] "PASINI_SUZ12_TARGETS_DN"

The data-wise decomposition of this gene set over cancer types is

    # decompose the gene set score over datasets
    decompose.gs.group(mgsa1, gs1, group = cancerType)

Figure 4: gene set score (GSS) decomposition (data-wise decomposed gene set scores for agilent, hgu133, hgu133p2 and hgu95). The decomposed scores are grouped according to the tissue of origin of the cell lines (BR, CN, CO, LC, LE, ME, OV, PR, RE). The vertical bars show the 95% confidence intervals of the means.

Figure 4 shows that the leukemia cell lines have the lowest GSS for this gene set. The contributions of each dataset to the overall gene set score are shown separately in this plot. In general, there is good concordance between the different datasets. However, the HGU133 platform contributes most and the Agilent platform contributes least compared with the other datasets, as shown by the longest and shortest bars, respectively.

Next, to identify the most influential genes in this gene set, we call the function GIS:

    gis1 <- GIS(mgsa1, gs1) # gene influential score
    head(gis1) # print top 6 influencers
    ##     feature       GIS     data
    ## 1 TNFRSF12A 1.0000000    hgu95
    ## 2 TNFRSF12A 0.9783816 hgu133p2
    ## 3     CD151 0.9601622    hgu95
    ## 4     ITGB1 0.9449297   hgu133
    ## 5     CAPN2 0.8967664   hgu133
    ## 6      LHFP 0.8771236  agilent

Figure 5: The gene influential score (GIS) plot. The GIS values are represented as bars, and the dataset each gene comes from (agilent, hgu133, hgu133p2, hgu95) is distinguished by different colors.

In figure 5, the bars represent the gene influential scores of the genes. Genes from different platforms are shown in different colors. The expression of genes with a high positive GIS is more likely to be positively correlated with the gene set score. In this example, the most important genes in the gene set "PASINI_SUZ12_TARGETS_DN" are TNFRSF12A (identified in two different platforms), CD151, ITGB1, etc.

In the next example, we use the same methods to explore the "PUJANA_ATM_PCC_NETWORK" gene set.
    # the second gene set
    gs2 <- "PUJANA_ATM_PCC_NETWORK"
    decompose.gs.group(mgsa1, gs2, group = cancerType, x.legend = "topright")

    gis2 <- GIS(mgsa1, "PUJANA_ATM_PCC_NETWORK", topN = 6)
    gis2
    ##   feature       GIS     data
    ## 1  PIK3CG 1.0000000 hgu133p2
    ## 2    GMFG 0.9229333   hgu133
    ## 3  ADRBK1 0.9145966 hgu133p2
    ## 4    RHOH 0.8979954 hgu133p2
    ## 5  CENPC1 0.8553077 hgu133p2
    ## 6    VAV1 0.8290366   hgu133

Figure 6: Data-wise decomposed GSS for gene set 'PUJANA_ATM_PCC_NETWORK', grouped by tissue of origin (BR, CN, CO, LC, LE, ME, OV, PR, RE).

Figure 7: GIS plot for gene set 'PUJANA_ATM_PCC_NETWORK'.

Figure 6 shows that the leukemia cell lines have the highest GSSs for this gene set, and that the HGU133 and HGU95 platforms make relatively high contributions to the overall gene set score. The GIS analysis (figure 7) indicates that PIK3CG and GMFG are the most important genes in this gene set.

Figure 8: cell lines and gene sets projected on PC1 and PC2. The two labeled gene sets are PUJANA_ATM_PCC_NETWORK and PASINI_SUZ12_TARGETS_DN.

2.3 Plot gene sets in projected space

We can also see how the gene sets are represented in the lower dimensional space. Here we show the projection of the gene set annotations on the first two dimensions. Then, we label the two gene sets analyzed above.
    fs <- getmgsa(mgsa1, "fac.scr") # extract the factor scores for cell lines (cell line space)
    layout(matrix(1:2, 1, 2))
    plot(fs[, 1:2], pch = 20, col = colcode, axes = FALSE)
    abline(v = 0, h = 0)
    legend("topright", col = unique(colcode), pch = 20, legend = unique(cancerType), bty = "n")
    plotGS(mgsa1, label.cex = 0.8, center.only = TRUE, topN = 0, label = c(gs1, gs2))

2.4 Perform MOGSA in two steps

The function mogsa performs MOGSA in one step. In practice, however, one needs to determine how many PCs should be retained when reconstructing the gene set score matrix. A scree plot of the eigenvalues that result from the multivariate analysis can be used for this purpose. Therefore, we can perform the multivariate data analysis and the gene set annotation projection in two steps. To do the multivariate analysis, we call moa:

    # perform multivariate analysis
    ana <- moa(NCI60_4arrays, proc.row = "center_ssq1", w.data = "inertia", statis = TRUE)
    slot(ana, "partial.eig")[, 1:6] # extract the eigenvalues
    ##                   PC1          PC2          PC3          PC4          PC5          PC6
    ## agilent  0.0005406833 0.0004119778 0.0002410063 0.0004038087 0.0001317894 0.0001783712
    ## hgu133   0.0007410830 0.0005850680 0.0003507538 0.0001448788 0.0001685482 0.0001042850
    ## hgu133p2 0.0007716595 0.0005146566 0.0003742008 0.0001281515 0.0001487516 0.0001203610
    ## hgu95    0.0008042677 0.0006210049 0.0003942394 0.0001506287 0.0001752495 0.0001102364

    # show the eigenvalues in scree plot:
    layout(matrix(1:2, 1, 2))
    plot(ana, value = "eig", type = 2, n = 20, main = "variance of PCs") # use '?"moa-class"' to check
    plot(ana, value = "tau", type = 2, n = 20, main = "Scaled variance of PCs")

Figure 9: scree plots showing the variance of PCs (left) and the scaled variance of PCs (right). The contributions of the different datasets (agilent, hgu133, hgu133p2, hgu95) are distinguished by different colors.

The multivariate analysis (moa) returns an object of class moa-class.
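A scree plot is usually read by eye, but the same information can also be summarized numerically. The sketch below uses hypothetical eigenvalues (not those of the NCI-60 data) and an arbitrary 85% cutoff, both assumptions made only for illustration, to show how the proportion of variance per PC and a cumulative cutoff could suggest how many PCs to retain.

```r
# Hypothetical total eigenvalues of the first seven PCs
eig <- c(PC1 = 4.0, PC2 = 2.5, PC3 = 1.5, PC4 = 0.4,
         PC5 = 0.3, PC6 = 0.2, PC7 = 0.1)

# Proportion of variance explained by each PC and its running total
prop <- eig / sum(eig)
cum_prop <- cumsum(prop)

# Keep the smallest number of PCs whose cumulative proportion reaches
# the (arbitrary) cutoff
cutoff <- 0.85
n_keep <- unname(which(cum_prop >= cutoff)[1])

round(rbind(proportion = prop, cumulative = cum_prop), 3)
n_keep
```

A fixed cutoff is only one heuristic. It does not replace inspecting the scree plot, and a different cutoff may select a different number of PCs.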
The scree plot shows that the top 3 PCs are the most significant, since they explain much more variance than the others. Several other methods, such as the informal "elbow test" or more formal tests, can be used to determine the number of retained PCs [5]. To be consistent with the previous example, we use the top 3 PCs in the analysis:

    mgsa2 <- mogsa(x = ana, sup = NCI60_4array_supdata, nf = 3)
    ## Warning in mogsa(x = ana, sup = NCI60_4array_supdata, nf = 3): x is an object of "moa",
    ## statis is not used

    identical(mgsa1, mgsa2) # check if the two methods give the same results
    ## [1] FALSE

3 Preparation of gene set data

The package GSEABase provides several methods to create a gene set list [6]. In mogsa, there are two methods to create a gene set list. The first is to generate a gene set list from the package graphite [7] using the function prepGraphite.

    library(graphite)
    keggdb <- prepGraphite(db = pathways("hsapiens", "kegg")[1:50], id = "symbol")
    ## converting identifiers!
    ## converting identifiers done!

    keggdb[1:2]
    ## $`Acute myeloid leukemia`
    ##  [1] "PIK3CB"   "PIK3R5"   "PIK3R1"   "PIK3CA"   "PIK3CD"   "PIK3R2"   "PIK3R3"
    ##  [8] "PIK3CG"   "FLT3"     "RUNX1T1"  "RUNX1"    "STAT3"    "STAT5A"   "STAT5B"
    ## [15] "AKT2"     "AKT1"     "AKT3"     "MTOR"     "NRAS"     "KRAS"     "HRAS"
    ## [22] "KIT"      "SOS1"     "SOS2"     "ZBTB16"   "RARA"     "PML"      "JUP"
    ## [29] "IKBKG"    "CHUK"     "IKBKB"    "MAP2K1"   "MAP2K2"   "ARAF"     "BRAF"
    ## [36] "RAF1"     "GRB2"     "CEBPA"    "PIM2"     "EIF4EBP1" "MYC"      "NFKB1"
    ## [43] "LEF1"     "PIM1"     "PPARD"    "MAPK1"    "MAPK3"    "BAD"      "CCND1"
    ## [50] "RELA"     "RPS6KB1"  "RPS6KB2"  "SPI1"     "TCF7"     "TCF7L2"   "TCF7L1"
    ## [57] "CCNA1"
    ##
    ## $`Adherens junction`
    ##  [1] "RAC1"    "RAC2"    "RAC3"    "WASF2"   "VCL"     "BAIAP2"  "ACTN4"   "ACTN1"
    ##  [9] "ACTN2"   "ACTN3"   "ACTB"    "ACTG1"   "PTPRB"   "CTNNA3"  "CTNNA1"  "FER"
    ## [17] "CTNNA2"  "IGF1R"   "FYN"     "CSNK2A1" "PTPRF"   "CSNK2B"  "MET"     "TCF7"
    ## [25] "EGFR"    "PTPN1"   "IQGAP1"  "SRC"     "TCF7L2"  "CSNK2A2" "PTPRM"   "LEF1"
    ## [33] "TCF7L1"  "ACP1"    "ERBB2"   "CDH1"    "PTPRJ"   "PTPN6"   "YES1"    "MLLT4"
    ## [41] "TGFBR2"  "TGFBR1"  "SMAD2"   "SMAD3"   "SSX2IP"  "SORBS1"  "LMO7"    "MAP3K7"
    ## [49] "PVRL1"   "PVRL3"   "PVRL4"   "PVRL2"   "CDC42"   "CTNND1"  "WASL"    "RHOA"
    ## [57] "WAS"     "WASF3"   "WASF1"   "MAPK3"   "MAPK1"   "FGFR1"   "FARP2"   "CTNNB1"
    ## [65] "SMAD4"   "NLK"     "PARD3"   "SNAI2"   "SNAI1"   "TJP1"

The second method is to create a gene set list from "gmt" files, which can be downloaded from MSigDB [8].

    dir <- system.file(package = "mogsa")
    preGS <- prepMsigDB(file = paste(dir, "/extdata/example_msigdb_data.gmt.gz", sep = ""))

In order to use the gene set information in mogsa, we have to convert the list of gene sets to a list of annotation matrices. This can be done with prepSupMoa. This function requires two obligatory inputs: the first is the multiple omics datasets and the second is a gene set list, a GeneSet or a GeneSetCollection. The output of prepSupMoa can be passed directly to mogsa.

    # prepare the annotation (supplementary) data
    sup_data1 <- prepSupMoa(NCI60_4arrays, geneSets = keggdb)
    mgsa3 <- mogsa(x = NCI60_4arrays, sup = sup_data1, nf = 3,
                   proc.row = "center_ssq1", w.data = "inertia", statis = TRUE)

4 Session info

    toLatex(sessionInfo())

• R version 3.2.1 beta (2015-06-08 r68489), x86_64-unknown-linux-gnu
• Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=C, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8, LC_IDENTIFICATION=C
• Base packages: base, datasets, grDevices, graphics, methods, parallel, stats, stats4, utils
• Other packages: AnnotationDbi 1.30.1, Biobase 2.28.0, BiocGenerics 0.14.0, DBI 0.3.1, GenomeInfoDb 1.4.0, IRanges 2.2.4, RSQLite 1.0.0, S4Vectors 0.6.0, gplots 2.17.0, graphite 1.14.0, knitr 1.10.5, mogsa 1.0.1, org.Hs.eg.db 3.1.2
• Loaded via a namespace (and not attached): BiocStyle 1.6.0, GSEABase 1.30.2, KernSmooth 2.23-14, XML 3.98-1.2, annotate 1.46.0, bitops 1.0-6, caTools 1.17.1, codetools 0.2-11, digest 0.6.8, evaluate 0.7, formatR 1.2, gdata 2.16.1, genefilter 1.50.0, graph 1.46.0, gtools 3.5.0, highr 0.5, magrittr 1.5, splines 3.2.1, stringi 0.4-1, stringr 1.0.0, survival 2.38-2, tools 3.2.1, xtable 1.7-4

References

[1] Hervé Abdi, Lynne J. Williams, and Dominique Valentin. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics, 5:149–179, 2013.

[2] M. de Tayrac, S. Le, M. Aubry, J. Mosser, and F. Husson. Simultaneous analysis of distinct omics data sets with integration of biological knowledge: Multiple factor analysis approach. BMC Genomics, 10:32, 2009.

[3] Reinhold WC, Sunshine M, Liu H, Varma S, Kohn KW, Morris J, Doroshow J, and Pommier Y. CellMiner: A web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set. Cancer Research, 72(14):3499–511, 2012.

[4] Hervé Abdi, Lynne J. Williams, Dominique Valentin, and Mohammed Bennani-Dosse. STATIS and DISTATIS: optimum multitable principal component analysis and three way metric multidimensional scaling. Wiley Interdisciplinary Reviews: Computational Statistics, 4:124–167, 2012.

[5] Hervé Abdi and Lynne J. Williams. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2:433–459, 2010.

[6] Morgan M, Falcon S, and Gentleman R. GSEABase: Gene set enrichment data structures and methods. R package version 1.28.0.

[7] Gabriele Sales, Enrica Calura, Duccio Cavalieri, and Chiara Romualdi. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics, 13:20, 2012.

[8] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences, 102:15545–15550, 2005.