COPO - TGAC Documentation


COPO: Collaborative Open Plant Omics
Rob Davey
Data Infrastructure and Algorithms Group Leader
Toni Etuk
Oxford eResearch Centre
Susanna Sansone
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Alfie Abdul-Rahman
Felix Shaw
Jim Beynon
Katherine Denby
Ruth Bastow
Paul Kersey
Vicky Schneider
Tanya Dickie
Emily Angiolini
Matt Drew
Recently awarded BBSRC BBR grant
TGAC, Univ. Oxford, Univ. Warwick, EMBL-EBI
Supported by GARNet, iPlant, Eagle Genomics
Empower bioscience plant researchers to:
1. Enable standards-compliant data collection, curation and
2. Enhance access to data analysis and visualisation pipelines
3. Facilitate data sharing and publication to promote reuse
Train plant researchers in best practice for data sharing and
producing citable Research Objects
(Good) Science is founded on reproducibility
Reproducibility depends on:
reducing reinvention (“friction”)*
describing methods and data
maximising benefit to the researcher
Describing methods well established through “traditional” publishing
Data description is sorely under-represented and under-used
Benefits are often opaque
Fear of being scooped, loss of control, reputation, etc.
What prevents plant scientists from openly depositing their data
and metadata?
Lack of interoperability between:
metadata annotation services
data repository services
data analysis services
data publishing services
Researchers might not:
be aware that the services exist
have the expertise to use them
see the value in properly describing their data
figshare, Scientific Data, Dryad, F1000, PeerJ, Gigascience
Beyond the PDF:
Galaxy, iPlant, Bioconductor, Taverna, local code/services
GitHub, BitBucket, Zenodo
Sample, Sequence, Genome, Proteome, Metabolome, Imaging
Utopia, GitHub
Materials, examples, workshops, bootcamps
It's not because these services don't exist!
Clearly, barriers exist between the scientist and the service
Infrastructure can help by:
wiring existing services together
improving access to services
facilitating collaboration
raising profile of the benefits of open science
How do we collaborate successfully to make this happen?
Mapping services with Application Programming Interfaces
Grace signs into COPO with her ORCID iD
This signs her into all other services as required
She starts a new COPO Profile
She uploads to the COPO platform:
Three FASTQs (two Illumina HiSeq2500, one PacBio P6-C4)
representing her velociraptor sequencing reads
She tells COPO to push her data to a Galaxy server and run a workflow, producing:
An assembly of the reads from ALLPATHS-LG v51551
A draft automated annotation from RAST v33-1
The interface prompts her to add metadata to her data in order to deposit
them in the public repositories
Metadata fields will be shown based on data, and redundant fields will be merged automatically
Sample name, sample organism, data type, sequencer used, software name, software version, ...
She clicks “Upload”, and everything is submitted
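The merging of redundant metadata fields in Grace's scenario can be illustrated with a minimal Python sketch. The target names and their field lists below are hypothetical, chosen only to mirror the example fields on the slide; they are not real repository schemas and this is not COPO's actual implementation.

```python
from collections import OrderedDict

# Hypothetical field requirements per submission target (assumed for illustration).
REQUIRED_FIELDS = {
    "sequence_archive": ["sample_name", "sample_organism", "data_type", "sequencer_used"],
    "workflow_archive": ["sample_name", "software_name", "software_version"],
}


def merged_fields(targets):
    """Return each required field once, in first-seen order,
    mapped to the list of targets that ask for it."""
    merged = OrderedDict()
    for target in targets:
        for field in REQUIRED_FIELDS[target]:
            merged.setdefault(field, []).append(target)
    return merged


def missing_fields(metadata, targets):
    """List the merged fields the user has not filled in yet."""
    return [field for field in merged_fields(targets) if not metadata.get(field)]


targets = ["sequence_archive", "workflow_archive"]

# Grace's partially completed form; values taken from the scenario above.
grace = {
    "sample_name": "raptor-1",
    "sample_organism": "Velociraptor",
    "data_type": "FASTQ",
    "software_name": "ALLPATHS-LG",
    "software_version": "51551",
}

for field, wanted_by in merged_fields(targets).items():
    print(field, "needed by", ", ".join(wanted_by))
print("Still to fill in:", missing_fields(grace, targets))
```

Because "sample_name" is required by both hypothetical targets, it appears only once in the merged form, which is the behaviour the scenario describes.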
Single sign-on (SSO), e.g. ORCID
Deposit multi-omics data in one go
No context-switching between services
Run and deposit analytical workflows
Describe software used, versions
Pull into platforms, e.g. Galaxy, iPlant
Support virtualisation, e.g. iPlant Atmosphere, Docker, Amazon AWS
Data is well-described, open, and everything has DOIs
Finding and integrating data is improved greatly
Make suggestions to users based on their data/workflows
Programmatic access to all layers
Not just raw/processed data is valuable
COPO supports submission of supplementary data to Figshare
PDFs (posters, papers)
movies/images (size permitting)
Zenodo/GitHub releases for code DOIs
Marked up with ENCODE Digital Curation Center’s software
metadata descriptors, for example
What have we achieved so far?
TGAC infrastructure to support brokering of data
iRODS and web server virtual machines
High speed transfer Aspera links to EBI
Prototype user interface for multi-omics data submissions
Developing JSON specification for COPO objects
OAuth2 support (“sign in with” ORCID, Google, Twitter)
Easily stored in document-based databases, e.g. MongoDB
Interconversion between ISA formats
ISA-Tab (tab-delimited text) to JSON, and vice versa
Linked Data specifications
Community interactions
Metabolights group at EBI
Setting up this workshop!
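The interconversion between a tabular ISA format and JSON mentioned above can be sketched in Python using only the standard library. This is a simplified illustration, not the COPO or ISA tooling: the column names and sample row are made up, and real ISA-Tab files can repeat column headers, which this dictionary-based approach would not preserve.

```python
import csv
import io
import json


def table_to_json(text):
    """Convert a tab-delimited table (header row first) to a JSON array of records."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return json.dumps(list(reader), indent=2)


def json_to_table(doc):
    """Convert a JSON array of flat records back to a tab-delimited table."""
    records = json.loads(doc)
    if not records:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(records[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical sample table; the column names are assumptions for illustration.
SAMPLE = (
    "Sample Name\tOrganism\tSequencer\n"
    "raptor-1\tVelociraptor\tIllumina HiSeq 2500\n"
)

as_json = table_to_json(SAMPLE)
print(as_json)
print(json_to_table(as_json))
```

Records in this JSON form are flat documents, so they could be stored directly in a document-based database such as MongoDB.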
COPO will:
Facilitate easy, relevant data description to:
Submit data and metadata to multiple public repositories
The reasons most of you are here…
What are the barriers for you and your data?
Facilitate access to workflows used to analyse the data, e.g. to
GigaDB, Scientific Data
This will form part of another COPO workshop