Cell lines
neXtProt: recent developments in the context of biocuration
Amos Bairoch, Beijing, April 24, 2015
www.nextprot.org

A ‘one stop shop for human proteins’
• To visualize all the integrated information
• To extract/export relevant annotations
• To perform complex and precise queries

neXtProt contents [Jan 2015]
• 20’061 entries representing 41’902 protein sequences (isoforms created by alternative splicing/initiation)
• All of the Swiss-Prot human annotations, PLUS:
– All GOA human protein annotations
– Chromosomal location and exon mapping from Ensembl
– Many additional identifiers (Affymetrix, Antibodypedia, IPI, etc.)
– Variants: over 1.4 million SAPs from COSMIC and dbSNP
– Subcellular localization data from different sources, incl. HPA
– Tissue expression data at mRNA level (microarray/EST) from Bgee (meta-analysis of ArrayExpress/UniGene data)
– Tissue expression data at protein level (IHC) from HPA
– 34’000 PTMs from manually curated large-scale proteomic studies
– Over 1 million MS/MS peptides from PeptideAtlas (“Human-all” build at 1% FDR at protein level)
– All synthetic peptides from SRMAtlas
– Many “terminology” resources: GO, OMIM, NCI Thesaurus, MeSH, etc., including two that are “home-grown”: CALOHA and the Cellosaurus

Quality filtering: bronze, silver, gold
• A three-tiered approach in terms of data quality:
– Gold: estimated error rate <1%
– Silver: estimated error rate 1-5%
– Bronze: noisy or low-quality data, not imported into neXtProt
• Quality assessment is a dynamic process that involves data providers whenever possible

The HUPO Human Proteome Project
www.thehpp.org
• Chromosome-HPP: to characterize the human proteins encoded by genes on each chromosome in a country-by-country manner (e.g. China: Chr. 1, 8 and 20; Switzerland: Chr. 2)
• Biology/Disease-HPP: to characterize human proteins in organs/biofluids in healthy or disease state (e.g. China with the liver proteome)
[Diagram labels: HPP; Abs proteomics; MS proteomics; Integrated KB]

neXtProt in the context of HPP
• HUPO has selected neXtProt to be the knowledge resource for the HPP project.
• That means:
– To integrate the results of HPP experimental studies (mass spectrometry and antibodies)
– To validate protein “existence”
– To provide metrics to assess the project's progress
– To represent the “functional” knowledge on human proteins as best as we can

neXtProt proteomics view

NEW!! https://search.nextprot.org
• A new search engine with two components:
– A simple “Google-like” full-text search
– An advanced search capability based on SPARQL/RDF technology
• A new API that retrieves precisely any annotation in neXtProt in XML or JSON, and thus also allows applications to be built on top of neXtProt
• A new version of the XML export format
• And later, an RDF version of neXtProt

neXtProt 2015 software architecture
[Architecture diagram; component labels: Custom client apps; neXtProt website; JSON/XML; User management with authentication service; API service; DB (users, lists, queries); DB (data); SPARQL (advanced search); Full text (simple search)]

Examples of advanced searches

List management tool
The new search interface is linked to a protein list management tool, so that your own lists of experimental results can be imported and seamlessly combined with lists of search results.

Kinase Knowledge Platform (KKP)
• The KKP project was a collaboration between Merck-Serono and SIB/GeneBio aimed at providing annotations for human protein kinases to support the Merck-Serono drug-screening platform
• Detailed structured annotations were produced by the CALIPHO group from Jan 2012 to Dec 2013:
– 300 protein kinases
– 30’000 annotations
– Abstracted from 13’640 papers
• A web-based annotation platform named 'BioEditor' was developed to support the project
• The KKP annotations will be provided to the community through neXtProt in 2016
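
Combining an imported experimental list with a search result list, as the list management tool slide describes, comes down to set operations on entry accessions. The sketch below is only an illustration of that idea and not the neXtProt implementation: the function name `combine_lists` and the sample accessions are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def combine_lists(search_hits: Iterable[str],
                  experimental_hits: Iterable[str]) -> Dict[str, List[str]]:
    """Compare two lists of entry accessions using set operations.

    Hypothetical helper for illustration; it does not call any neXtProt service.
    """
    search_set = set(search_hits)
    exp_set = set(experimental_hits)
    return {
        "both": sorted(search_set & exp_set),             # intersection
        "search_only": sorted(search_set - exp_set),      # difference
        "experiment_only": sorted(exp_set - search_set),  # difference
        "either": sorted(search_set | exp_set),           # union
    }


# Sample accessions, used here purely as placeholders.
search_results = ["NX_P07948", "NX_Q06187", "NX_P06239"]
my_experiment = ["NX_Q06187", "NX_P12931"]

combined = combine_lists(search_results, my_experiment)
print(combined["both"])
```

Sorting the output keeps the combined lists stable between runs, which makes them easier to compare or export.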

BioEditor
Structured, flexible, accurate, traceable annotation tool
• Web-based biocuration platform
• Allows biological knowledge to be captured in a highly structured manner
• Annotations consist of triplets or extended triplets
• The data model is flexible and can be adapted to represent multiple data types
• Sources of annotations are completely described to ensure accuracy and traceability
• Annotations can be exported as XML and could also be tailored to produce RDF or OpenBEL statements

BioEditor annotation model
• An annotation consists of a ‘triplet’ or extended triplet: subject + relation + object + relation + precision
• Subject: BioObject (Protein, Protein isoform, Variant, Complex, or Protein Group)
• Relations: in-house developed controlled vocabulary
– Examples: binds, phosphorylates, causes phenotype, causes disease
– Currently 82 valid relations
• Object: BioObject, Chemical (ChEBI), Gene Ontology (GO), Disease (NCI)
• Precision: “producing” and a BioObject
– For example: LYN phosphorylates BTK producing BTK-P-Tyr551
• Annotation note: free text

BioEditor annotation examples

BioEditor annotation evidences
• Evidence code (ECO+)
– Describes the type of experiment
• Experimental details (free text)
– Explain how the evidence supports the annotation
• Biological model (NCBI Taxonomy, CALOHA, Cellosaurus)
– Organism, anatomy or cell line in which the experiment was done
• Protein origin (NCBI Taxonomy)
– Species from which the protein was obtained for the experiment
• Reference (PubMed, DOI, URI)
– Publication, website, database, etc.
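
The annotation model and evidence fields listed above can be pictured as a small data structure. This is a minimal sketch under stated assumptions, not the BioEditor schema: the class names, field names and the `as_statement` method are hypothetical, and only the LYN/BTK example comes from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    """One supporting evidence; field names are assumptions for illustration."""
    evidence_code: str         # ECO+ term describing the type of experiment
    experimental_details: str  # free text: how the evidence supports the annotation
    biological_model: str      # organism, anatomy or cell line used
    protein_origin: str        # species the protein was obtained from
    reference: str             # PubMed ID, DOI or URI


@dataclass
class Annotation:
    """A triplet (subject, relation, object), optionally extended by a precision."""
    subject: str                     # a BioObject, e.g. a protein
    relation: str                    # term from the controlled vocabulary
    obj: str                         # BioObject, chemical, GO term or disease
    precision: Optional[str] = None  # BioObject introduced by "producing"
    note: str = ""                   # free-text annotation note
    evidences: List[Evidence] = field(default_factory=list)

    def as_statement(self) -> str:
        """Render the annotation as a readable statement."""
        parts = [self.subject, self.relation, self.obj]
        if self.precision is not None:
            parts += ["producing", self.precision]
        return " ".join(parts)


# The extended-triplet example given on the slide.
example = Annotation("LYN", "phosphorylates", "BTK", precision="BTK-P-Tyr551")
print(example.as_statement())  # LYN phosphorylates BTK producing BTK-P-Tyr551
```

Keeping the precision optional lets the same class hold both plain triplets and extended triplets.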

Human protein variation annotation
• We currently have two grants to perform deep annotation, using the BioEditor, of the phenotypic effects of protein variation in genetic diseases:
– One to annotate variants in HBOC (BRCA1, BRCA2) and Lynch syndrome (MLH1, MSH2, MSH6, PMS2, EPCAM)
– One to annotate variants in ‘Nav’ sodium channels (SCNA1 to SCNA9)
• These annotations will start appearing in neXtProt at the end of this year

The Cellosaurus: a cell line thesaurus

Cell lines
• Cell lines are used by labs all over the world
• Some are acquired from cell collections such as ATCC, DSMZ, ECACC, Riken, etc., while many others are transferred between labs
• There are a number of varieties of cell lines: cancer, transformed, embryonic stem cells, iPSC, hybridoma, etc.
• In terms of resources, there are cell line catalogs, lists, ontologies such as CLO and BTO, and many specialized databases
• But up to now, there has been no single resource where all this information is available

Another important thing about cell lines
• A huge mess in terms of standardization: names, origin (due to frequent contaminations)

Cellosaurus
• What: a thesaurus of cell lines
• Scope:
– Immortalized cell lines
– Naturally immortal cell lines (i.e. stem cell lines and iPSC)
– Finite-life cell lines when those are distributed and used widely
– Vertebrate cell lines, with an emphasis on human, mouse and rat cell lines
– Invertebrate (insects and ticks) cell lines
• Does not include:
– Primary cells
– Plant cell lines

Cellosaurus: current depth of information
• The Cellosaurus contains information on:
– The cell “category” (cancer, transformed, stem cell, etc.)
– Sex of the individual from which a line was derived
– Species of origin (using NCBI TaxID)
– Breed/subspecies of the original animal
– Known contaminated cell lines
– For “diseased” cell lines, annotation of the disease using the NCI Thesaurus
– The site of origin for metastatic cancer cell lines
– The genes that have been transfected (with links to HGNC, MGI, RGD, FlyBase, UniProtKB)
– Whether a cell line has been discontinued from a catalog
– Whether a cell line is registered in an official list (example: NIH registry of approved hESCs)
– Common misspellings

Cellosaurus statistics
• 36’400 cell lines from 421 species (73% from human, 13% from mouse, 3% from rat)
• 25’200 synonyms
• 29’600 references to 7’200 publications
• 64’200 cross-references to cell line catalogs, ontologies, databases and other resources
• 6’800 web links

Cellosaurus availability
• Available in flat file and OBO format
• By FTP from ftp.nextprot.org and ftp.expasy.org
• Integrated in neXtProt terminologies and now available as a searchable resource on ExPASy: web.expasy.org/cellosaurus

What could be added?
• Scope:
– Complete coverage of Coriell repository cell lines
• Depth:
– Tissue of origin (using Uberon)
– Info regarding integrated viruses in cell lines
– Info on transformation method (SV40, EBV, etc.)
– Info on translocations in cancer cell lines

The neXtProt team
• Content: Pascale Gaudet, Aurore Britan, Jonas Cicenas, Isabelle Cusin, Paula Duek, Valérie Hinard
• Software: Pierre-André Michel, Alain Gateau, Anne Gleizes, Frédéric Nikitin, Valentine Rech de Laval, Daniel Teixeira
• QA: Monique Zahn
• Directed by: Amos Bairoch, Lydie Lane