MAGMA manual (v0.2)
The program is composed of three modules: annotation, gene analysis and gene-set analysis. Gene analysis feeds into the gene-set analysis through the [OUT].genes.raw file. All three steps can be performed at once, or as individual steps. When performing multiple steps in one go, the intermediate files are stored as well, so later steps can be run again without rerunning the earlier steps.

Input data is in binary PLINK format. It is assumed that the data has undergone quality control. It is strongly advised that A) ancestry checks are performed during data QC and that population outliers are removed (or split into subsamples by population); and B) that principal components (computed using eg. Eigenstrat) are used to correct for population stratification.

MAGMA can use either raw genotype data or SNP p-values as input, though in the latter case a reference data set (eg. 1000 Genomes European panel) must also be provided. It is recommended that raw genotype data is used if possible, as the p-value only analysis is less powerful due to the loss of information.

Using MAGMA

Input arguments for MAGMA take the form of flags (prefixed by --) followed by the values needed for that flag (if any). Many flags accept optional modifiers: keywords specified after the values for a flag that modify the behaviour of that flag. Some modifiers consist of only the keyword itself (eg. --flag [VALUE] modifier); other modifiers take further parameters, specified by the = sign and a comma-separated list of parameter values (eg. --flag [VALUE] modifier=param1,param2).

Annotation

./magma --annotate --snp-loc [SNPLOC_FILE] --gene-loc [GENELOC_FILE] --out [PREFIX]

Annotates SNPs to genes based on the SNP location file (no header, three columns: SNP id, chromosome, base-pair position) and the gene location file (no header, four columns: gene name, chromosome, start position, end position).
A PLINK .bim file, for example from the data to be analysed, can also be used as [SNPLOC_FILE]; files ending in .bim are automatically recognised as such, and the appropriate columns are selected. This has the advantage that all SNPs in the data can be annotated; when using an external SNP location file, only SNPs with an rs-ID present in both the data and the SNP location file can be used in the analysis.

WARNING: when using your data .bim file as SNP location file, make sure that the SNP locations in the .bim file refer to the same human reference genome build version as the gene location file!

The --annotate flag accepts three modifiers.

The chr modifier specifies a subset of chromosomes to annotate, either a single value or a range (eg. --annotate chr=3 or --annotate chr=20-X).

The window modifier specifies a window (in kilobases) around each gene; SNPs within the window are included for that gene (the default window is 0). The window can be either symmetrical (a single value, eg. window=5) or specified separately for the regions before and after the gene (a pair of values, eg. window=5,1.5).

The filter modifier specifies a file with no header (eg. --annotate filter=data.bim), and causes --annotate to retain only SNPs that are listed in the first column of this file (or the second column, if it is a .bim file). This can be useful for example when using a very large SNP location file, which would otherwise produce a very large gene annotation file as well.

Gene analysis

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --out [OUT]

This runs the default (PC regression) gene analysis on the PLINK data specified by [PREFIX], using a gene-annotation (.genes.annot) file previously produced with the --annotate function. This will also produce an [OUT].genes.raw file for subsequent gene-set analysis, unless the --genes-only flag is added.

Gene analysis with on-the-fly annotation

./magma --annotate --bfile [PREFIX] --gene-loc [GENELOC_FILE]
./magma --annotate --bfile [PREFIX] --snp-loc [SNPLOC_FILE] --gene-loc [GENELOC_FILE]

Performs SNP-to-gene annotation and then immediately runs the gene analysis. The annotation file is also saved, and is automatically filtered for SNPs in [PREFIX].bim. If the --snp-loc flag is not set, [PREFIX].bim is also used as the SNP location file.

Gene analysis with covariates and/or alternate phenotype

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --pheno file=[PHENO_FILE] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --covar file=[COVAR_FILE] --out [OUT]

Performs the analysis on an alternate phenotype or with covariates (or both). Files can contain an optional header, but the first two columns must be the family ID and individual ID, corresponding to those in the .fam file (only individuals present in both the .fam file and the phenotype/covariate files are used in the analysis).

For --pheno, the default is to use the first variable (after the two ID columns) in the file; use the use modifier to change this. The variable can be specified by name if a header is present (eg. --pheno file=[PHENO_FILE] use=variableX) or by variable index (eg. use=3; the index does not count the two ID columns, so this would be the *fifth* column in the file).

For --covar, the default is to use all the variables. Use the include or exclude modifiers to use only a subset. The subset can be specified as a comma-separated list of names (if a header is present) and/or variable indices. This can also include ranges of variables (eg. --covar file=[COVAR_FILE] include=1-5,7,varX-varZ will use the first five variables, the 7th variable and the variables varX, varZ and all variables in between). In addition, the use-sex modifier includes the sex variable in the .fam file as a covariate (if no other covariates are used, the file modifier can be omitted: --covar use-sex).
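
The index and range conventions described above can be illustrated with a small Python sketch. This is not MAGMA's own parsing code: the function names resolve_variables and file_column and the example header names are hypothetical, and the sketch assumes that a range always runs from its first endpoint to its last in file order.

```python
def resolve_variables(spec, names):
    """Expand a spec such as '1-5,7,varX-varZ' into 1-based variable indices.

    `names` holds the header names of the variables, NOT counting the two
    ID columns (family ID, individual ID).
    """
    index_of = {name: i + 1 for i, name in enumerate(names)}

    def to_index(token):
        # A token is either a variable index or a header name.
        if token.isdigit():
            idx = int(token)
        elif token in index_of:
            idx = index_of[token]
        else:
            raise ValueError(f"unknown variable: {token!r}")
        if not 1 <= idx <= len(names):
            raise ValueError(f"variable index out of range: {idx}")
        return idx

    selected = []
    for part in spec.split(","):
        part = part.strip()
        start, sep, end = part.partition("-")
        if part in index_of or not sep:
            first = last = to_index(part)  # single name or index
        else:
            first, last = to_index(start), to_index(end)  # range
        if first > last:
            raise ValueError(f"empty range: {part!r}")
        for idx in range(first, last + 1):
            if idx not in selected:
                selected.append(idx)
    return selected


def file_column(variable_index):
    """1-based column in the file for a variable index (skips the two ID columns)."""
    return variable_index + 2


# Hypothetical header of a covariate file, without the two ID columns.
names = ["pc1", "pc2", "pc3", "pc4", "pc5", "age", "bmi", "varX", "varY", "varZ"]
print(resolve_variables("1-5,7,varX-varZ", names))  # [1, 2, 3, 4, 5, 7, 8, 9, 10]
print(file_column(3))  # 5
```

The second call mirrors the use=3 example: variable index 3 corresponds to the fifth column of the file.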

Gene analysis on summary statistics

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --pval [SNPPVAL_FILE] N=[SAMPLE_SIZE] --out [OUT]

Performs gene analysis on SNP p-values, using an appropriate reference data set to obtain estimates of the LD (a typical choice would be the 1000 Genomes European panel). The p-value file needs to contain a column of SNP ids and a column of SNP p-values. If the file has a header, the program looks for columns named SNP and P (not case-sensitive; this should work automatically with PLINK SNP analysis output files); otherwise it uses the first (ids) and second (p-values) columns. Use the use modifier (use=[SNP_COL],[PVAL_COL]) to change this. Use the --snp-wise flag to change the model used for the gene analysis (see below).

Gene analysis in batch mode

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --batch [INDEX] [TOTAL] --out [BATCH_OUT]
./magma --merge [BATCH_OUT] --out [MERGE_OUT]

To facilitate parallel computation of the gene analysis, the --batch flag can be used. [TOTAL] specifies the total number of parts to split the computation into, and [INDEX] the particular part to compute the gene analysis for. Thus, one could for example run MAGMA in 8 batches with --batch [INDEX] 8, running the program 8 times with [INDEX] = 1, …, [INDEX] = 8. The --merge function then combines the parts back into a single set of output files.

Gene-set analysis

./magma --gene-results [NAME].genes.raw --set-annot [SETANNOT_FILE] --out [OUT]

Runs gene-set analysis using the .genes.raw file generated by an earlier MAGMA gene analysis and the provided set annotation file. Sets in the set annotation file must be specified one per line (whitespace-separated), with the first value on each line being the name or ID of the set and the values that follow it being the gene IDs.
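
To make the row-based layout concrete, the following Python sketch reads such lines into a dictionary. It is only an illustration, not MAGMA's own parser: the function name parse_set_annotation and the set names and gene IDs in the example are hypothetical, and the choice to merge repeated set names is an assumption of this sketch.

```python
def parse_set_annotation(lines):
    """Parse row-based set definitions: set name first, gene IDs after it."""
    gene_sets = {}
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if not fields:
            continue  # skip blank lines
        name, genes = fields[0], fields[1:]
        # Assumption: a set name that appears twice is merged into one set.
        gene_sets.setdefault(name, []).extend(genes)
    return gene_sets


# Hypothetical file contents: one set per line.
example = [
    "SET_A 101 102 103",
    "SET_B\t7 8",
]
for name, genes in parse_set_annotation(example).items():
    print(name, len(genes), genes)
```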
Alternatively, --set-annot [SETANNOT_FILE] col=[GENE_COL],[SET_COL] can be used if the set annotation file is in column-based format, where the col modifier specifies which column contains the gene IDs and which column contains the gene-set names.

The modifier no-size-correct can be added to turn off the correction for gene size and gene density in the competitive gene-set analysis. This is not recommended. Note that this will also turn off the correction for any concurrent gene property analyses.

Gene-set analysis can also be run in conjunction with a gene analysis, or a merge, eg.:

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --set-annot [SETANNOT_FILE] --out [OUT]
./magma --merge [BATCH_OUT] --set-annot [SETANNOT_FILE] --out [OUT]

Gene property analysis

./magma --gene-results [NAME].genes.raw --gene-covar [GCOV_FILE] --out [OUT]

Runs gene property analysis, the generalization of the competitive gene-set analysis to continuous gene-level variables. The gene covariate file has the same kind of structure as the regular covariate files used with --covar; columns correspond to gene properties, rows to genes. The first column should contain gene IDs. As with --covar, all variables in the file are used by default, unless a subset is specified using the include or exclude modifier. Note that at present, mean imputation is used for genes for which values are missing (ie. the value is set to NA or the gene is missing from the file entirely).

The gene property analysis can be run simultaneously with the gene-set analysis. Like the gene-set analysis, it has a no-size-correct modifier and can be run in conjunction with a gene analysis or a merge.
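
The mean imputation of missing gene property values mentioned above can be sketched in Python as follows. This is a minimal illustration, not MAGMA's implementation: the function name mean_impute is hypothetical, and it is an assumption of this sketch that the mean is taken over the analysed genes that have an observed value.

```python
def mean_impute(genes, values):
    """Return one value per gene, filling missing entries with the observed mean.

    `values` maps gene ID to a float, or to None where the file has NA.
    Genes absent from `values` are treated as missing as well.
    """
    observed = [values[g] for g in genes if values.get(g) is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mean = sum(observed) / len(observed)
    return [values[g] if values.get(g) is not None else mean for g in genes]


# g2 is NA in the file, g4 is not in the file at all (hypothetical gene IDs).
print(mean_impute(["g1", "g2", "g3", "g4"], {"g1": 1.0, "g2": None, "g3": 3.0}))
# [1.0, 2.0, 3.0, 2.0]
```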

Conditional gene-set / gene property analysis

./magma --gene-results [NAME].genes.raw --gene-covar [GCOV_FILE] condition=[VARIABLES] --out [OUT]
./magma --gene-results [NAME].genes.raw --set-annot [SETANNOT_FILE] --gene-covar [GCOV_FILE] condition=[VARIABLES] --out [OUT]
./magma --gene-results [NAME].genes.raw --set-annot [SETANNOT_FILE] --gene-covar [GCOV_FILE] condition-only=[VARIABLES] --out [OUT]

Gene-set and gene property analyses can be conditioned on variables in the gene covariate file (they are always also conditioned on gene size, gene density and the log value of each, unless the no-size-correct modifier is added). This is done by specifying which variables to condition on using the condition modifier. These variables are not analysed themselves. To perform conditional gene-set analysis only, use the condition-only modifier instead.

Additional options

SNP-wise gene analysis

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise model=[MODEL] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise stat=[STAT] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --snp-wise model=[MODEL] stat=[STAT] --out [OUT]

The --snp-wise flag can be used to perform a SNP-wise analysis rather than using the PC regression model or, when used in conjunction with the --pval flag, to change the settings of the SNP-wise model used. The stat modifier can be ‘chi’ or ‘chisq’ to use chi-square SNP-statistics, or ‘z’ or ‘Z’ to use standard normal SNP-statistics (default is chi-square). The model modifier specifies how SNP-statistics are aggregated: it can be ‘unweighted’ for the unweighted mean SNP-statistic, ‘weighted’ for the weighted (based on the SNP LD matrix) mean SNP-statistic, or ‘top’ for the highest SNP-statistic (default is the unweighted mean).
For model=top, a second value can be specified to use the mean of several of the highest SNP-statistics; this second value specifies either an absolute number (eg. model=top,3 to use the mean of the top 3 SNP-statistics in the gene) or a fraction (eg. model=top,0.1 to use the top 10% of SNP-statistics in the gene). Note that model=top requires permutation, and as such will take considerably longer to compute than other analyses.

Rare variant analysis

./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --rare [MAF_CUTOFF] --out [OUT]
./magma --bfile [PREFIX] --gene-annot [GENEANNOT_FILE] --rare [MAF_CUTOFF] max=[MAX] --out [OUT]

The --rare flag can be used to aggregate rare variants in a gene into a compound variable (the minor allele burden score), where rare variants are defined by the MAF_CUTOFF specified. The rare variants themselves are removed from the gene, and are replaced by the burden score. If the max modifier is specified, no more than MAX rare variants are aggregated into a single burden score variable. If more than MAX rare variants are present in a gene, multiple burden score variables are created.

Fixed-effects meta-analysis

./magma --meta genes=[FILENAMES] --out [OUT]
./magma --meta sets=[FILENAMES] --out [OUT]

Merges the provided comma-separated list of either .genes.out or .sets.out files using Stouffer’s weighted Z method, with the square root of the sample size as weights.

Outlier removal in gene-set and gene property analysis

Use --set-settings truncate=[LOWER],[UPPER] or --set-settings truncate=[MAX] to truncate outlier gene Z-values during the gene-set and gene property analysis (if truncate=[MAX], LOWER and UPPER are both equal to MAX). Bounds are specified as mean(Z) - LOWER × sd(Z) and mean(Z) + UPPER × sd(Z), and Z-values outside those bounds are set to those bounds instead.
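
The truncation rule above can be written out as a short Python sketch. It is an illustration under stated assumptions rather than MAGMA's own code: the function name truncate_z is hypothetical, and the sketch uses the sample standard deviation (statistics.stdev), whereas the manual does not say which estimator MAGMA uses.

```python
from statistics import mean, stdev


def truncate_z(z_values, lower, upper=None):
    """Clamp Z-values to [mean - lower * sd, mean + upper * sd].

    If `upper` is omitted, the same multiplier is used on both sides,
    matching the single-value form truncate=[MAX].
    """
    if upper is None:
        upper = lower
    centre = mean(z_values)
    spread = stdev(z_values)  # assumption: sample standard deviation
    low_bound = centre - lower * spread
    high_bound = centre + upper * spread
    # Values outside the bounds are set to the nearest bound.
    return [min(max(z, low_bound), high_bound) for z in z_values]


# Made-up gene Z-values with one large outlier.
z = [-1.0, 0.0, 1.0, 0.0, 10.0]
print(truncate_z(z, 1))        # symmetric bounds, as with truncate=1
print(truncate_z(z, 3, 1.5))   # separate bounds, as with truncate=3,1.5
```

In the first call only the outlier is changed: it is pulled down to the upper bound, while the other four values already lie inside the bounds and are left as they are.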