The aroma.affymetrix package - on a notebook

Transcription

The aroma.affymetrix package - on a notebook
The aroma.affymetrix package
How to analyze huge Affymetrix data sets in R
on a notebook
Henrik Bengtsson - hb@stat.berkeley.edu
University of California, Berkeley
Sept 28, 2006
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Affymetrix chips
Hybridization of target sequences to probes
Target sequence:
Perfect-match (PM) probe:
...GGTTACCATCGGTAAGTACTCAATGATTA...
ATCATGCGCCATTCATGAGTTACTA
Hybridization of target sequences to probes
Target sequence:
Perfect-match (PM) probe:
Mismatch (MM) probe:
...GGTTACCATCGGTAAGTACTCAATGATTA...
ATCATGCGCCATTCATGAGTTACTA
ATCATGCGCCATACATGAGTTACTA
Scanning & Image analysis
Example array: 1600x1600 cells; 65536 intensity levels (16 bits).
Scanning & Image analysis
Example array: 1600x1600 cells; 65536 intensity levels (16 bits).
Scanned image: 9x9 (+ cell margins) pixels/cell.
Scanning & Image analysis
Example array: 1600x1600 cells; 65536 intensity levels (16 bits).
Scanned image: 9x9 (+ cell margins) pixels/cell.
Analyzed image: (mean pixel, stddev pixel, #pixels).
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Amount of data per array
Affymetrix chip data is stored in “CEL” files.
Per cell [10 bytes]:
◮
Average pixel intensity [float = 4 bytes = 40%]
◮
Std dev pixel intensity [float = 4 bytes = 40%]
◮
# pixels [integer = 2 bytes = 20%]
With an array of 1600x1600 cells this sums up to 25.6 · 106 bytes
= 24.4 MB/array1 .
1
1 kB = 1024 bytes, 1Mb = 1024 kB = 1048576 bytes.
Example of different Affymetrix chips
Chip type
Hu6800
HG U95Av2
Mapping10K Xba142
HG-U133A
HG-U133B
Mapping10K Xba131
Mouse 430 v2
Mapping50K Hind240
Mapping50K Xba240
Mapping250K Nsp
Mapping250K Sty
HuEx-1 0-st-v2
Dimension
536x536
640x640
658x658
712x712
712x712
712x712
1002x1002
1600x1600
1600x1600
2560x2560
2560x2560
2560x2560
# cells
0.29 · 106
0.41 · 106
0.43 · 106
0.51 · 106
0.51 · 106
0.51 · 106
1.00 · 106
2.56 · 106
2.56 · 106
6.55 · 106
6.55 · 106
6.55 · 106
# Units
7129
12625
10208
22283
22645
11564
45101
57299
59015
262338
238378
1432154
*) Sizes of binary CEL files; ASCII CEL files are much larger.
File size
2.9MB
3.9MB
4.1MB
4.8MB
4.8MB
4.8MB
9.6MB
24.4MB
24.4MB
62.5MB
62.5MB
62.5MB
Example of data sets
Some public data sets:
Name
Affymetrix CEPH 100K
Affymetrix CEPH 500K
# samples
90x2 chips
48x2 chips
Chip type
Mapping 100K
Mapping 500K
Size
4.5GB
6.0GB
Signals
1.8GB
2.4GB
Some data sets we’ve been working on:
Name
Slater 100K
Affymetrix CEPH 500K
Broad Institute 500K
Affymetrix Services Laboratory
Sinclair 500K
WEHI Exon
# samples
22+21 chips
270x2 chips
96x2 chips
190+154 chips
19+16 chips
35x2 chips
Chip type
Mapping 100K
Mapping 500K
Mapping 500K
Mapping 500K
Mapping 500K
Human Exon
Size
1.0GB
33.8GB
11.7GB
21.1GB
2.4GB
2.2GB
Signals
0.4GB
13.5GB
4.7GB
8.4GB
1.0GB
0.9GB
Some large data sets we know of:
Name
Sanger’s 500K
# samples
15,000x2 chips
Chip type
Mapping 500K
Size
1746GB
Signals
698GB
“Signals” = amount of RAM required for probe intensities only.
Example of data sets
Some public data sets:
Name
Affymetrix CEPH 100K
Affymetrix CEPH 500K
# samples
90x2 chips
48x2 chips
Chip type
Mapping 100K
Mapping 500K
Size
4.5GB
6.0GB
Signals in R
3.6GB
4.8GB
Some data sets we’ve been working on:
Name
Slater 100K
Affymetrix CEPH 500K
Broad Institute 500K
Affymetrix Services Laboratory
Sinclair 500K
WEHI Exon
# samples
22+21 chips
270x2 chips
96x2 chips
190+154 chips
19+16 chips
35x2 chips
Chip type
Mapping 100K
Mapping 500K
Mapping 500K
Mapping 500K
Mapping 500K
Human Exon
Size
1.0GB
33.8GB
11.7GB
21.1GB
2.4GB
2.2GB
Signals in R
0.8GB
27.0GB
9.4GB
16.8GB
2.0GB
1.8GB
Some large data sets we know of:
Name
Sanger’s 500K
# samples
15,000x2 chips
Chip type
Mapping 500K
Size
1746GB
Signals in R
1396GB
“Signals” = amount of RAM required for probe intensities only.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Computer systems
Operating systems:
Operating system
Max address space
Windows XP
4GB*
Windows XP 64-bit
17 billion GB
Linux 32-bit
4GB
Linux 64-bit
17 billion GB
*) A single application can only use 2GB.
Hardware:
Hardware limits the amount of memory to about 32-64 GB.
Department of Statistics, UC Berkeley:
The main computational Linux (64-bit) server has 16 GB RAM.
The R software
Overview of R:
◮
A free open source application (GPL).
◮
Great community forums.
◮
Widely used. Dominant in Bioinformatics applications.
Overview of the language:
◮
All floating-point values are stored as double:s.
⇒ float (4 bytes) to double (8 bytes); 200% more RAM.
◮
Functional language (no pointers/reference variables)
⇒ there often 2-3 copies of a data object at any time.
◮
A workaround is to use environments (or the R.oo package)
⇒ one copy of each data object.
Main issue is memory! (not only R)
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Probeset design
In gene-expression analysis, the “activity” (amount of RNA
transcripts) of a single gene is measured by a about 10-50 probes
each quering a small fraction of the genes DNA.
A probeset (aka unit):
1
2
3
PM
MM
30 31 32
Probeset design
In gene-expression analysis, the “activity” (amount of RNA
transcripts) of a single gene is measured by a about 10-50 probes
each quering a small fraction of the genes DNA.
A probeset (aka unit):
1
2
3
cell
PM
MM
30 31 32
Probeset design
In gene-expression analysis, the “activity” (amount of RNA
transcripts) of a single gene is measured by a about 10-50 probes
each quering a small fraction of the genes DNA.
A probeset (aka unit):
1
2
3
cell
PM
MM
probe
pair
30 31 32
Probeset design
In gene-expression analysis, the “activity” (amount of RNA
transcripts) of a single gene is measured by a about 10-50 probes
each quering a small fraction of the genes DNA.
A probeset (aka unit):
1
2
3
cell
PM
MM
probe
pair
30 31 32
Most of the modelling is done probeset by probeset.
Thus this the #1 way we access data.
There are 10,000s probesets on each array.
Probe-level modelling (PLM)
Ignoring the mismatch probes, model the PMs only:
1
2
3
PM
14 15 16
Consider a given SNP with PM probes k = 1, . . . , K and samples
i = 1, . . . , I . The PLM used in RMA is:
log yik = αi + βk + εik
with PM signal yik , chip effect αi for sample i, probe affinity
(sensitivity) βk for probe k, and random error ξik .
Probe-level modelling (PLM)
Ignoring the mismatch probes, model the PMs only:
1
2
3
PM
14 15 16
Consider a given SNP with PM probes k = 1, . . . , K and samples
i = 1, . . . , I . The PLM used in RMA is:
log yik = αi + βk + εik
with PM signal yik , chip effect αi for sample i, probe affinity
(sensitivity) βk for probe k, and random error ξik .
The PLM used by dChip (MBEI) is:
yik = θi · φk · ξik .
Chip-definition files (CDFs)
Probesets are defined in CDF files (one per chip type), e.g.
Mapping250K Nsp.CDF. A fraction of this CDF:
$ SNP_A-1782949:List of 3
..$ type
: int 2
..$ direction: int 1
..$ groups
:List of 2
.. ..$ A:List of 5
.. .. ..$ x
: int [1:12]
.. .. ..$ y
: int [1:12]
.. .. ..$ pbase: chr [1:12]
.. .. ..$ tbase: chr [1:12]
.. .. ..$ expos: int [1:12]
.. ..$ G:List of 5
.. .. ..$ x
: int [1:12]
.. .. ..$ y
: int [1:12]
.. .. ..$ pbase: chr [1:12]
.. .. ..$ tbase: chr [1:12]
.. .. ..$ expos: int [1:12]
651 652 458 457 940 939 ...
1772 1772 1388 1388 221 ...
"c" "g" "a" "t" ...
"g" "g" "a" "a" ...
13 13 15 15 16 16 17 17 ...
651 652 458 457 940 939 ...
1771 1771 1389 1389 220 ...
"c" "g" "a" "t" ...
"g" "g" "a" "a" ...
13 13 15 15 16 16 17 17 ...
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Allelic cross-talk calibration (genotyping chips)
PMA probe:
PMB probe:
ATCATGCGCCATCCATGAGTTACTA
ATCATGCGCCATACATGAGTTACTA
Allelic cross-talk calibration (genotyping chips)
ATCATGCGCCATCCATGAGTTACTA
ATCATGCGCCATACATGAGTTACTA
15000
5000
0
yA
25000
PMA probe:
PMB probe:
0
5000
15000
yC
25000
Allelic cross-talk calibration
An affine model for cross-talk between allele A and allele B is
(ignoring sample index i) is:
yA,j,k
a
WAA WAB xA,j,k
ε
= A,j +
+ A,j,k
yB,j,k
aB,j
WBA WBB xB,j,k
εB,j,k
and in vector format
yj = a + Wxj + εj
We estimate this robustly using unpublished work by Wirapati &
Speed (2002).
Allelic cross-talk calibration
> path <- findCelSet("SinclairA_etal_2006")
> ds <- AffymetrixCelSet$fromFiles(path)
> dsC <- calibrateAllelicCrosstalk(ds)
Benchmarking:
# arrays Chip type
90x2 Mapping 100K
270x2 Mapping 500K
15,000x2 Mapping 500K
Total time
1:26h
5:30h
13.0 days*
Time/array
28s
75s
75s*
Overheads (approx.): Reading: 15%, Fitting: 50%, Writing: 30%.
Allelic cross-talk calibration
25000
15000
5000
yA
15000
0
5000
0
yA
25000
(699,771)
1.000 0.035
0.121 0.959
0
5000
15000
yC
before
25000
0
5000
15000
yC
after
25000
Quantile normalization
> path <- findCelSet("SinclairA_etal_2006")
> ds <- AffymetrixCelSet$fromFiles(path)
> dsN <- normalizeQuantile(ds)
Calculating target distribution (averaging):
# arrays Chip type
Total time Time/array
90x2 Mapping 100K
0:07h
2.1s
270x2 Mapping 500K
1:45h
11.8s
15,000x2 Mapping 500K
4.1 days*
11.8s*
Normalizing
# arrays
90x2
270x2
15,000x2
arrays to target distribution:
Chip type
Total time
Mapping 100K
0:55h
Mapping 500K
9:20h
Mapping 500K 21.5 days*
Time/array
18s
62s
62s*
Overheads (approx.): Reading: 10%, Fitting: 65%, Writing: 25%.
Fitting RMA PLM for total copy numbers
> path <- findCelSet("SinclairA_etal_2006")
> ds <- AffymetrixCelSet$fromFiles(path)
> model <- RmaCnPlm(ds, mergeStrands=TRUE,
combineAlleles=TRUE)
> fit(model)
Benchmarking:
# arrays Chip type
22+21 Mapping 100K
19+16 Mapping 500K
90x2 Mapping 100K
270x2 Mapping 500K
15,000x2 Mapping 500K
Total time
1:00h
3:20h
7:15h
2.0 days*
119 days*
Time/array & unit
1.4ms
1.3ms
1.3ms
1.3ms*
1.3ms*
Overheads (approx.): Reading: 20%, Fitting: 15%, Writing: 65%.
Fitting multiple copy-number PLMs at once
path <- findCelSet("SinclairA_etal_2006")
ds <- AffymetrixCelSet$fromFiles(path)
models <- list(
rma = RmaCnPlm(ds, mergeStrands=TRUE),
mbei = MbeiCnPlm(ds, mergeStrands=TRUE),
affine = AffineCnPlm(ds, mergeStrands=TRUE)
}
lapply(models, fit, units=1:5000)
Note: Read data is cached ⇒ average reading time scales down.
Displaying results
Graphical output is still under development, but...
User feedback
- fit()[.ProbeLevelModel] worked perfectly. Ran rma on 18
mouse 430 2 chips [1002x1002 cells, 45101 units] in 14 minutes.
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1063975 28.5*
2403845 64.2 2403845 64.2*
Vcells 957616 7.4*
4826040 36.9 15966139 121.9*
Compare to memory usage for fitPLM():
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2088619 55.8*
4953636 132.3 3950498 105.5*
Vcells 42847107 326.9* 90835631 693.1 90538060 690.8*
:)
Cheers / Ken
Software robustness
All transformed data and parameter estimates are stored to file
immediately (in chunks). This means:
◮
If/when R crashes (it happens!), or when there is a power
failure, algorithms can pick from where they were interrupted.
This is automagically taken care of by aroma.affymetrix.
◮
The algorithms may be interrupted in order to temporarily
release computer resources for other needs.
◮
The algorithms can be restarted on a different host.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Interfacing to Bioconductor
◮
Pre-processed data is already stored as CEL files, which can
be imported to Bioconductor (and other software).
◮
Ongoing: Porting algorithms to aroma.affymetrix. Most of
this can be done by simple wrappers calling existing
implementations, cf. fitPLM(), crlmm(), gcrma() etc.
To do: Provide an eSet interface to the data classes;
◮
1. Extract data in memory.
2. Virtual eSet class to still work with data on file (more tricky).
Parallelization
Since all data and parameter estimates are kept in a shared
persistent memory (the file system), multiple processes/hosts can
access the data and estimates at any time.
Speed up:
With N parallel hosts, total time T shrinks to ≈T /N, e.g.
N = 20, T = 2.0 days ⇒ T /N = 2.4 hours.
One process writing, multiple reading:
With this setup there are no conflicts. All readers can access the
estimates as soon as they are available. Examples:
◮
Visualizers, e.g. CN plots, SNP scatter plots.
◮
Progress bars.
Multiple writing processes:
A file-lock mechanism for writing is required (todo). Examples:
◮
Single-array calibration and normalization methods.
◮
Modelling of data subsets, e.g. chromsome by chromsome.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Conclusions
◮
We are now capable of analyzing very large data sets.
◮
Almost all models and algorithm we work with can be
performed with bounded memory constraints.
Advantages of storing “intermediate” data and estimates on
the file system are:
◮
◮
◮
◮
◮
Standard CEL files: data is ready to be imported in other tools.
Persistent memory: can be restarted after a software failure.
Parallelization: Data and estimates can be shared and process
by multiple hosts simultaneously.
The package can easily be extended by other developers.
Outline
Introduction to the Affymetrix platform
Description of data
Computer systems & software
More on the design of Affymetrix arrays
Examples
Ongoing work & Future directions
Conclusions
General discussion on computation on large data sets
Suggestions
◮
ASCII (tab-delimited) data files are orders of magnitude
slower to parse than binary files.
◮
Know how to use read.table(..., colClasses=...).
◮
Understand how data is kept in memory.
◮
Understand that data in matrices in R are stored as stacked
columns, that is, it is more efficient (caching) to work column
by column, rather than row by row.
◮
Understand how I/O of data can be optimized: contiguous
data is much faster to access than scattered data.
HDF5
National Center for Supercomputing Applications,
University of Illinois at Urbana
◮
File format for storing scientific data:
Primary objects: data sets and groups. A dataset is essentially
a multidimensional array of data elements, and a group is a
structure for organizing objects in an HDF5 file.
◮
Efficient storage and I/O:
Meets data management needs of scientists and engineers
working in high performance, data intensive computing
environments. Compressed or chunked data. Read and write
data efficiently on parallel computing systems.
◮
Large user community:
Engineering, scientific, and other fields, ranging from
computational fluid dynamics to film making.
◮
R package hdf5:
Interface to the NCSA HDF5 library. Experimental.
Package R.huge
◮
Provides in memory like access to extremely large-size data
living on the file system, e.g. x[1:30,56:60] and
x[939220+1:20,2] <- NA.
◮
Supported dimensions: vectors, matrices, and
multi-dimensional arrays.
◮
Supported data types: byte (1 byte), single (2 bytes),
integer (4 bytes), float (4 bytes), and double (8 bytes).
◮
Written using plain R; easy to extend.
◮
Experimental.
◮
Most of it’s usage in aroma.affymetrix have been replaced by
I/O support of CEL/CDF files in order to ease migration of
data and analysis.
Package R.huge - Example
> x <- FileByteMatrix("x.Rmatrix", nrow=1e6, ncol=1e4)
> x
[1] "FileByteMatrix: Pathname: ./x.Rmatrix. Opened: TRUE.
File size: 10000004268 bytes (9.3 GB). Dimensions: 1e+06x
1e+04. Number of elements: 1e+10. Bytes per cell: 1."
> x[939220+1:20,2] <- 1:40
> x[939220+5:8,1:3]
[,1] [,2] [,3]
[1,]
0
5
0
[2,]
0
6
0
[3,]
0
7
0
[4,]
0
8
0
Acknowledments
In no specific order:
◮
James Bullard, UC Berkeley.
◮
Pratyaksha Wirapati, Swiss Cancer Research Center.
◮
Ben Bolstad, UC Berkeley.
◮
Rafael Irizarry, John Hopkins University.
◮
Ken Simpson, WEHI, Melbourne, Australia.
◮
Benilton Carvalho, John Hopkins University.
◮
Terry Speed, UC Berkeley/WEHI.
◮
Jan Holst, Lund University, Sweden.
◮
Kasper D. Hansen, UC Berkeley.
◮
Jane Fridlyand, UCSF.
◮
Ola H¨
ossjer, Stockholm University, Sweden.
aroma.affymetrix is available at http://www.braju.com/R/.
Converting an ASCII CDF to a binary CDF
Example: Human Exon array with > 1.4·106 units.
> cdf <- AffymetrixCdfFile$fromChipType("HuEx-1_0-st-v2")
> cdf
AffymetrixCdfFile:
Filename: HuEx-1_0-st-v2.text.cdf
Chip type: HuEx-1_0-st-v2
Number of units: 1432154
File size: 933.84 MB
> cdf2 <- convert(cdf)
> cdf2
AffymetrixCdfFile:
Filename: HuEx-1_0-st-v2.cdf
Chip type: HuEx-1_0-st-v2
Number of units: 1432154
File size: 376.78 MB