neXtProt: recent developments in the context of biocuration
Amos Bairoch, Beijing, April 24, 2015
www.nextprot.org
A ‘one stop shop for human proteins’
To visualize all the integrated information
To extract/export relevant annotations
To perform complex and precise queries
neXtProt contents [Jan 2015]
• 20’061 entries representing 41’902 protein sequences (isoforms created by alternative splicing/initiation)
• All Swiss-Prot human annotations
PLUS
• All GOA human protein annotations
• Chromosomal location and exon mapping from Ensembl
• Many additional identifiers (Affymetrix, Antibodypedia, IPI, etc.)
• Variants: over 1.4 million SAPs from COSMIC and dbSNP
• Subcellular localization data from different sources, incl. HPA
• Tissue expression data at mRNA level (microarray/EST) from Bgee (meta-analysis of ArrayExpress/UniGene data)
• Tissue expression data at protein level (IHC) from HPA
• 34’000 PTMs from manually curated large-scale proteomic studies
• Over 1 million MS/MS peptides from PeptideAtlas (“Human-all” build at 1% FDR at protein level)
• All synthetic peptides from SRMAtlas
• Many “terminology” resources: GO, OMIM, NCI Thesaurus, MeSH, etc., including two that are “home-grown”: CALOHA and the Cellosaurus
Quality filtering: bronze, silver, gold
• A three-tiered approach in terms of data quality:
– Gold: estimated error rate <1%
– Silver: estimated error rate 1-5%
– Bronze: noisy or low quality data, not imported in
neXtProt
• Quality assessment is a dynamic process that
involves data providers whenever possible
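A minimal Python sketch of the three tiers, for illustration only. The function names are hypothetical, and treating every estimated error rate above 5% as bronze is an assumption: the slide only describes bronze as noisy or low-quality data.

```python
def quality_tier(error_rate: float) -> str:
    """Map an estimated error rate (a fraction, e.g. 0.03 for 3%) to a tier."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a fraction between 0 and 1")
    if error_rate < 0.01:
        return "gold"      # estimated error rate below 1%
    if error_rate <= 0.05:
        return "silver"    # estimated error rate between 1% and 5%
    # Assumption: anything above 5% is treated as bronze.
    return "bronze"


def is_imported(error_rate: float) -> bool:
    """Bronze data is not imported; gold and silver are."""
    return quality_tier(error_rate) != "bronze"
```

For example, `quality_tier(0.03)` returns `"silver"` and `is_imported(0.2)` returns `False`.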
The HUPO Human Proteome Project
www.thehpp.org
HPP
Chromosome-HPP: to characterize the human proteins encoded by genes on each chromosome in a country-by-country manner (e.g. China: Chr. 1, 8 and 20; Switzerland: Chr. 2)
Biology/Disease-HPP: to characterize human proteins in organs/biofluids in healthy or disease state (e.g. China with the liver proteome)
[Diagram labels: Abs proteomics, MS proteomics, Integrated KB]
neXtProt in the context of HPP
• HUPO has selected neXtProt to be the
knowledge resource for the HPP project.
• That means:
– To integrate the results of HPP experimental studies
(mass spectrometry and antibodies)
– To validate protein “existence”
– To provide metrics to assess the project progress
– To represent the “functional” knowledge on human
proteins as best as we can
neXtProt proteomics view
https://search.nextprot.org
NEW!!
• a new search engine with two components:
– A simple “Google-like” full-text search
– An advanced search capability based on SPARQL/RDF technology
• a new API that retrieves any annotation in neXtProt precisely, in XML or JSON, and thus also allows applications to be built on top of neXtProt
• a new version of the XML export format
• and later an RDF version of neXtProt
neXtProt 2015 software architecture
[Architecture diagram. Components: custom client apps and the neXtProt website (JSON, XML); API service; user management with authentication service; DB (users, lists, queries); DB (data); SPARQL for advanced search; full text for simple search]
Simple search
Examples of advanced searches
List management tool
The new search interface is linked to a protein list management
tool, so that your own lists of experimental results can be imported
and seamlessly combined with lists of search results
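Combining an imported list with a search result list amounts to set operations on protein identifiers. The sketch below illustrates the idea only and is not the neXtProt tool itself; `combine_lists` and the `ACC_*` identifiers are hypothetical placeholders.

```python
def combine_lists(my_list, search_results):
    """Compare a user's protein list with a search result list."""
    mine = set(my_list)
    found = set(search_results)
    return {
        "in_both": sorted(mine & found),       # intersection
        "only_mine": sorted(mine - found),     # in the user's list only
        "only_search": sorted(found - mine),   # in the search results only
    }


# Placeholder identifiers, not real accessions.
experiment = ["ACC_1", "ACC_2", "ACC_3"]
search = ["ACC_2", "ACC_3", "ACC_4"]
combined = combine_lists(experiment, search)
print(combined["in_both"])  # ['ACC_2', 'ACC_3']
```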
Kinase Knowledge Platform (KKP)
• The KKP project was a collaboration between Merck Serono and
SIB/GeneBio aimed at providing annotations for human protein
kinases to support the Merck Serono drug-screening platform
• Detailed structured annotations were produced by the
CALIPHO group from Jan 2012 to Dec 2013:
– 300 protein kinases
– 30’000 annotations
– Abstracted from 13’640 papers
• A web-based annotation platform named 'BioEditor'
was developed to support the project
• The KKP annotations will be provided to the
community through neXtProt in 2016
BioEditor
Structured, flexible, accurate, traceable annotation tool

Web-based biocuration platform

Allows biological knowledge to be captured in a highly
structured manner

Annotations consist of triplets or extended triplets

Data model is flexible and can be adapted to represent
multiple data types

Sources of annotations are completely described to ensure
accuracy and traceability

Annotations can be exported as XML and could also be
tailored to produce RDF or OpenBEL statements
BioEditor annotation model
Annotation consists of a ‘triplet’ or extended triplet
subject + relation + object + relation + precision
Subject: BioObject (Protein, Protein isoform, Variant, Complex, or Protein Group)
Relations: in-house developed controlled vocabulary
- Examples: binds, phosphorylates, causes phenotype, causes disease
- Currently 82 valid relations
Object: BioObject, Chemical (ChEBI), Gene Ontology (GO), Disease (NCI)
Precision: “producing” and a BioObject
For example: LYN phosphorylates BTK producing BTK-P-Tyr551
Annotation note: Free text
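One way to picture the (extended) triplet is as a small record. This is a sketch under assumptions, not the BioEditor data model: the class name, field names and plain-string types are hypothetical, whereas a real platform would use identifiers from the controlled vocabularies listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    subject: str                     # BioObject, e.g. a protein
    relation: str                    # term from the relation vocabulary
    obj: str                         # BioObject, chemical, GO term or disease
    precision: Optional[str] = None  # BioObject produced, if any
    note: str = ""                   # free-text annotation note

    def as_statement(self) -> str:
        """Render the triplet, adding the 'producing' part when present."""
        statement = f"{self.subject} {self.relation} {self.obj}"
        if self.precision is not None:
            statement += f" producing {self.precision}"
        return statement


example = Annotation("LYN", "phosphorylates", "BTK", precision="BTK-P-Tyr551")
print(example.as_statement())  # LYN phosphorylates BTK producing BTK-P-Tyr551
```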
BioEditor annotation examples
BioEditor annotation evidences
• Evidence code (ECO+)
- Describing the type of experiment
• Experimental details (Free text)
- Explain how the evidence supports the annotation
• Biological Model (NCBI Taxonomy, CALOHA, Cellosaurus)
- Organism, anatomy or cell line in which the experiment was done
• Protein Origin (NCBI Taxonomy)
- Species from which the protein was obtained for the experiment
• Reference (PubMed, DOI, URI)
- Publication, website, database, etc.
Human protein variation annotation
• Currently we have two grants to perform deep
annotation using the BioEditor on the phenotypic
effects of protein variation in genetic diseases:
– One to annotate variants in HBOC (BRCA1, BRCA2)
and Lynch syndrome (MLH1, MSH2, MSH6, PMS2,
EPCAM)
– One to annotate variants in ‘Nav’ sodium channels
(SCN1A to SCN5A and SCN8A to SCN11A)
• These annotations will start appearing in
neXtProt at the end of this year
The Cellosaurus: a cell line thesaurus
Cell lines
• Cell lines are used by labs all over the world
• Some are acquired from cell collections such as
ATCC, DSMZ, ECACC, Riken, etc. while many others
are transferred between labs
• There are a number of varieties of cell lines: cancer,
transformed, embryonic stem cells, iPSC,
hybridoma, etc.
• In terms of resources, there are cell line catalogs,
lists, ontologies such as CLO and BTO, and many
specialized databases
• But until now there was no single resource where all this
information is available
Another important thing about cell lines
• A huge mess in terms of standardization: names,
origin (due to frequent contaminations)
Cellosaurus
• What:
A thesaurus of cell lines
• Scope:
– Immortalized cell lines
– Naturally immortal cell lines (i.e. stem cell lines and iPSCs)
– Finite-life cell lines when those are distributed and used widely
– Vertebrate cell lines with an emphasis on human, mouse and rat cell lines
– Invertebrate (insects and ticks) cell lines
• Does not include:
– Primary cells
– Plant cell lines
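What a thesaurus does with names can be sketched as a lookup from variant spellings to one primary name. The entries below are illustrative and not copied from the Cellosaurus; the function names and the normalisation rule (ignore case and punctuation) are assumptions made for this sketch.

```python
def normalise(name: str) -> str:
    """Lower-case a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def build_index(entries):
    """Map every normalised primary name and variant to its primary name."""
    index = {}
    for primary, variants in entries.items():
        for name in [primary, *variants]:
            index[normalise(name)] = primary
    return index


def lookup(index, name):
    """Return the primary name for a spelling, or None if it is unknown."""
    return index.get(normalise(name))


# Illustrative entries: primary name -> variant spellings.
index = build_index({
    "HeLa": ["He La", "He-La"],
    "HEK293": ["HEK-293", "Hek 293"],
})
print(lookup(index, "he-la"))    # HeLa
print(lookup(index, "HEK 293"))  # HEK293
```

Stripping punctuation can merge names that belong to different cell lines, so a rule like this can only complement curated synonym lists, not replace them.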
Cellosaurus: current depth of information
• The Cellosaurus contains information on:
– The cell “category” (cancer, transformed, stem cell, etc.)
– Sex of individual from which a line was derived
– Species of origin (using NCBI TaxID)
– Breed/subspecies of original animal
– Known contaminated cell lines
– For “diseased” cell lines, annotation of the disease using the NCI Thesaurus
– The site of origin for metastatic cancer cell lines
– The genes that have been transfected (with links to HGNC, MGI, RGD, FlyBase, UniProtKB)
– If a cell line has been discontinued from a catalog
– If a cell line is registered in an official list (example: NIH registry of approved hESCs)
– Common misspellings
Cellosaurus statistics
• 36’400 cell lines from 421 species (73% from
human, 13% from mouse, 3% from rat)
• 25’200 synonyms
• 29’600 references to 7’200 publications
• 64’200 cross-references to cell line catalogs,
ontologies, databases and other resources
• 6’800 web links
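The species percentages can be turned into approximate cell line counts. Both the total and the percentages on the slide are rounded, so the results below are estimates; the names used are hypothetical.

```python
TOTAL_CELL_LINES = 36_400
SHARES_PERCENT = {"human": 73, "mouse": 13, "rat": 3}


def approximate_counts(total, shares_percent):
    """Convert percentage shares into approximate counts per species."""
    counts = {
        species: total * percent // 100
        for species, percent in shares_percent.items()
    }
    # Everything not covered by the listed species.
    counts["other"] = total - sum(counts.values())
    return counts


counts = approximate_counts(TOTAL_CELL_LINES, SHARES_PERCENT)
print(counts)
# {'human': 26572, 'mouse': 4732, 'rat': 1092, 'other': 4004}
```

So roughly 26’600 human, 4’700 mouse and 1’100 rat cell lines, with about 4’000 lines spread over the remaining species.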
Cellosaurus availability
• Available in flat file and OBO format
• By FTP from ftp.nextprot.org and
ftp.expasy.org
• Integrated in neXtProt terminologies and now available as
a searchable resource on ExPASy: web.expasy.org/cellosaurus
What could be added?
• Scope:
– Complete coverage of Coriell repository cell lines
• Depth:
– Tissue of origin (using Uberon)
– Info regarding integrated viruses in cell lines
– Info on transformation method (SV40, EBV, etc.)
– Info on translocations in cancer cell lines
The neXtProt team
Content: Pascale Gaudet, Aurore Britan, Jonas Cicenas, Isabelle
Cusin, Paula Duek, Valérie Hinard
Software: Pierre-André Michel, Alain Gateau, Anne Gleizes, Frédéric
Nikitin, Valentine Rech de Laval, Daniel Teixeira
QA: Monique Zahn
Directed by: Amos Bairoch, Lydie Lane