QSTAR
An integrated analysis of biological, chemical and gene expression data in drug discovery
9 October 2014

IWT grant QSTAR (2011-2013)
Joint analysis of three high dimensional data types
[Image: Andreas Bender]

The team
Janssen scientists from Oncology, Neurology, Infectious Diseases
Academic collaborations: chemoinformatics, statistics, machine learning, platform-specific data preprocessing
Further Janssen team members: Medicinal chemistry, chemoinformatics, biology, molecular profiling, statistics, IT, HTS, exploratory toxicology

The team
[Images: Jörg Kurt Wegner]

Motivation: Life science becomes ever more complex
Craig Federighi, Apple's Senior VP of Software Engineering
[Image: http://lifehacker.com/new-is-easy-right-is-hard-1371919541]

Motivation: Drug Discovery: Innovation <> Approval of new drugs
Scannell et al., Diagnosing the decline in pharmaceutical R&D efficiency
[Image: Nature Reviews Drug Discovery 11, 191-200 (March 2012)]

Motivation: Consequences of increasing complexity and technological innovations
1 Specific expertise not present within a company
2 Difficulties in reproducing research
Need to establish collaborations with people
who have the missing expertise
who generate reproducible (= reliable) data
[Graphic: http://www.nature.com/scitable/ebooks/englishcommunication-for-scientists-14053993/writingscientific-papers-14239285]

Increasing complexity in life sciences: Lack of expertise
Pharmaceutical companies are trying to expand the number of collaborations to be successful.
Since 2011, six of Janssen's nine newly approved drugs have come from outside.
About half of Janssen's drug-development pipeline is from outside, versus about 20% in 2002. [WSJ, 9 March 2014]
[Graphic: Jonathan D. Rockoff, The Wall Street Journal]

Increasing complexity in life sciences: Difficulties in reproducing research
A pharmaceutical company must rely upon high quality scientific findings
However, difficulties in reproducing results
statistical hypothesis inference testing
significance is no indicator of practical relevance
[R. Nuzzo, Scientific method: Statistical errors, Nature, 12 February 2014]
[Graphic: Regina Nuzzo, Nature, 12 February 2014]
Possible cause: researchers tackling questions beyond their level of expertise
Possible solution: broader and more interdisciplinary teams

Increasing complexity in life sciences: Difficulties in reproducing research
[Graphic: Janssen DDI Sensibilisation Campaign]

IWT grant QSTAR (2011-2013)
Joint analysis of three high dimensional data types
[Image: Andreas Bender]

Drug discovery
Target-based assays
Phenotypic assays
[Image: http://www.idigitalmotion.com/drugdisc.htm]

Phenotypic assays
Unknown target
[Image: Yeom et al. Journal of Translational Medicine 2009 7:70]
[Image: Science Vol. 307, no. 5707, 14 January 2005]

Scientific challenges
Hit classification
Bioactivity and promiscuity
Grouping by functional similarity
Number, magnitude of changes
Annotation or screening
Target deconvolution
Similarity to external perturbagens
Concentration, analogs, cells
Target / Scaffold hopping
Early detection of liabilities
Based on functional similarity
Functional similarity to known liabilities

Context: IWT R&D project QSTAR (2011-2013)
Can we detect and utilize Quantitative Relations between three high dimensional data sets?
chemical data on compound Structures and substructures
biological data from Transcriptional profiles as a global measure of bioactivity / polypharmacology
biological Assays (IC50s)

Data: Chemical structures
Need to be represented in a computer-understandable way, but also need to be interpretable by a medicinal chemist!
Binary data for presence / absence of specific compound substructures
Numerical data for global physicochemical properties
Fingerprinting algorithm

Data: Bioassays (ABCD)
Level of activity in specific biological assay (e.g., IC50)
Lots of missing values for a compound
Can we fill the matrix via target prediction?
Compounds x Bioassays

Data: Gene activity profiles
Need for holistic bioactivity measures of a compound (vs. specific)
Complex molecular characterization of a cellular phenotype
Transcriptional profiles as a global measure of bioactivity / polypharmacology
Compounds x Genes

Toolbox
Data analysis: Standardized data objects (chemical structures / bioassays / gene activity profiles)
Same ECFP6-fingerprint ID across all projects
Same compound-specific IUPAC InChIKey ID across all
Same Entrez Gene ID across bioassay and gene expression data
Medicinal chemistry: Translatable structure encoding
ECFP6 fingerprints
Global molecular characteristics

Toolbox
Complete data sets
Internally (business relevant)
Project specific
All of J&J
Public dataset (academic partners)
ChEMBL
Connectivity map

Toolbox
Version control for data objects and data preprocessing
R packages for sharing documented analysis functions
Sweave documents explaining data analysis processes
reproducible results
OpenAtrium communication platform
Accessible data server
White paper

QSTAR Pipeline
Stable, reproducible data generation
Stable ECFP6-fingerprint IDs across data sets
Visualization of ECFP6-defined substructures on molecule (Arcadia)

Scientific challenges: Focus of the project
1. Chemotype / compound prioritization (hit to lead)
2. Compound optimization (lead optimization)
3.
Bioactivity and promiscuity

Link transcriptional profile <> bioassay: ROS1 example
Number of differentially expressed genes as an indicator for compound promiscuity
Level of promiscuity largely stable during optimization for potency in primary assay
Transcriptional profile for chemotype selection

Ongoing: Link transcriptional profile <> bioassay
L1000 platform
Increase number of compound profiles (>10x)
Connect data to LINCS data (http://www.lincsproject.org/)
Started pre-competitive L1000 user/interest group (AZ, Boehringer, BMS, Eli Lilly, EMD Serono, GSK, Janssen, Novartis, Pfizer)

Ongoing: Link compound data <> bioassay
Target Prediction
Using Machine Learning approaches we can currently
predict the activity on these targets correctly for 4 out of 5 for 300 protein targets ("Virtual assay")
get an enrichment with active compounds for 600 protein targets ("Enrichment")
have no predictive power for 200 targets
Bioassays linked to a specific protein target
ChEMBL: 734
Janssen: >1000

Acknowledgements
Janssen University of Hasselt Durham University Ilse Van den Wyngaert Ziv Shkedy Adetayo Kasim An De Bondt Nolen Joy P. Pieter Peeters Martin Otavain Miroslav Cik Philippe Haldermans Dirk Wuyts Marc Mercken Tarig Bashir Karine Smans Berthold Wroblowski Joerg Wegner University of Michigan Andreas Mayr University of Rochester Matthew McCall Lieven Clement Willem Talloen Pushpike Thilakarathne Dhammika Amaratunga Nandini Raghavan Harrie Gijsen Ulrich Bodenhofer Günter Klambauer KU Leuven Herman Van Vlijmen Djork-Arne Clevert Fan Meng Geert Verheyen Patrick Marichal Sepp Hochreiter Manhong Dai Karin Verstraeten Luc Bijnens University of Linz Cambridge University Andreas Bender Aakash Ravindranath University of Ghent Olivier Thas Bie Verbist Andreas Mitterecker Martin Heusel OpenAnalytics Tobias Verbeke Aditya Bhagwat Sogeti Steven Osselaer Randstad Liesbet Vervoort
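
Illustration: joining the three data types
The Data and Toolbox slides describe three data objects that share one compound identifier (fingerprint substructures, a sparse compounds x bioassays matrix, and a compounds x genes matrix), and the ROS1 example uses the number of differentially expressed genes as a promiscuity indicator. The following is a minimal sketch of that idea, not the project's actual code. The compound keys, gene IDs, values, the function names, and the fold-change threshold are all hypothetical placeholders chosen for illustration.

```python
# Toy data only: keys stand in for InChIKeys, integers for fingerprint-bit
# and Entrez Gene IDs. None of these values come from the QSTAR project.

# compound key -> set of substructure (fingerprint bit) IDs that are present
fingerprints = {
    "KEY_A": {101, 205, 318},
    "KEY_B": {101, 412},
}

# compound key -> {target gene ID: activity value}; a missing entry means
# "not measured", mirroring the sparse compounds x bioassays matrix
bioassays = {
    "KEY_A": {9001: 7.2},
    "KEY_B": {},
}

# compound key -> {gene ID: log2 fold change versus control}
expression = {
    "KEY_A": {9001: 0.2, 9002: -1.4, 9003: 2.1},
    "KEY_B": {9001: 1.3, 9002: -0.1, 9003: 0.4},
}


def count_de_genes(profile, threshold=1.0):
    """Count genes whose absolute log2 fold change reaches the threshold.

    The threshold of 1.0 is an assumption for this sketch, not a value
    taken from the slides.
    """
    return sum(1 for lfc in profile.values() if abs(lfc) >= threshold)


def joint_view(key):
    """Summarize one compound across the three data types."""
    return {
        "n_substructures": len(fingerprints[key]),
        "n_assays_measured": len(bioassays[key]),
        "n_de_genes": count_de_genes(expression[key]),
    }


# Only compounds present in all three data objects can be analysed jointly.
shared = sorted(set(fingerprints) & set(bioassays) & set(expression))
summary = {key: joint_view(key) for key in shared}

for key, row in summary.items():
    print(key, row)
```

Keying every object by the same compound identifier is what makes the intersection step trivial; with differing identifiers per data type, a mapping table would be needed first.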