INPA's Biological Collection Data Quality Improvement

Laurindo Campos
lcampos@inpa.gov.br
MCTI/INPA, The National Institute for Amazonian Research
Information Technology Coordination
BioGeo Informatics Unit
Semantic Interoperability Laboratory

4th SIBBR Workshop - Petrópolis-RJ, 25-29 of August 2014

Outline
• Biodiversity Scenarios: Global and National
• INPA and its presence in Amazonia
• Data Quality Issues
• Disseminating Biological Data
• INPA's IT Evolution
• Concluding Remarks

Biodiversity Scenarios: Global and National

Global Diversity
• Of ~1.7 million known species:
  - 56% are insects
  - 14% are plants
  - 2.7% are mammals and birds
• An estimated 4-20 million species have not been described yet

Megadiverse countries
Latin America and the Caribbean is the region with the greatest biological diversity on the planet:
• 50% of the world's tropical forests
• 33% of its total mammals
• 35% of its reptilian species
• 41% of its birds
• 50% of its amphibians
Six of the megadiverse countries are in Latin America.

National Scenario
• Brazil holds the highest share of biodiversity in the world (~20%) (Assunção, 2011)
• Six biomes (disruption and degradation are the main threats)
• Combined pressures are forcing the loss of habitat and species
• Planning and decisions depend on data/metadata management

National Scenario (cont.)
• Data and metadata are collected as a preliminary process in scientific experiments
• Management is mandatory
• Sharing data, analysis and synthesis is crucial
• Data governance: data and information as commodities (Jason Kolb, 2011)

Barriers to overcome
1. Data policy for organizations;
2. Improvements in infrastructure;
3. Improvements in data quality;
4. Effective management and use of data/metadata and information.

INPA and its presence in Amazonia

Mission: "To generate and disseminate knowledge and technologies, and to train human resources for the development of the Amazon"

THEMES
• Biodiversity: knowledge of the biological diversity of the Amazon region.
• Society, Environment and Health: dynamics of the human populations of the Amazon and their social and environmental implications.
• Environmental Dynamics: understanding the Amazon ecosystem.
• Technology & Innovation: application of the knowledge produced on natural resources to the development of techniques, processes and products that meet socioeconomic demands.

10 Research Areas / MSc and PhD Programs
Research areas: ecology; botany; aquaculture; entomology; Aquatic Biology; Health Science; Natural Products; Forest Products; tropical forest; agronomy; Food Technology; Climate and Water Resources; Humanities and Social Sciences
MSc and PhD programs: ecology; botany; entomology; agriculture; tropical forest; Aquatic Biology and Fisheries; Genetics, Conservation & Evolutionary Biology; Biological Reserves Management; Biotechnology (UFAM); Regional Products and Biotechnology (UEA); Science of Food (UFAM)

Brazilian Amazonia - INPA Central and State Centers
[Map. Labels: São Gabriel da Cachoeira, Santarém, Tefé. Legend: Consolidated Partnership.]

INPA's New Geographic Approach
[Map: Amazonia sensu latissimo. Source: Eva & Huber (2004).]
Expertise - Projects, Partnerships & Training

Program of Scientific Collections and Archives
Zoological Collections
• Invertebrates: ~3,500,000 insects
• Reptiles and Amphibians: ~17,000
• Fishes: ~120,000 specimens from various rivers
• Birds: ~800 specimens
• Mammals: ~5,242 specimens
Herbarium: 217,462 records
Carpoteca (fruit collection): 2,500 samples
Wood Collection: ~10,445 samples
Microbiological Collections: medical and agroforestry

Guiding principles
• The way to fulfill INPA's mission is to treat data as a long-term asset and to manage it within a coordinated framework.
• Principles of data quality need to be applied at all stages of the data management process (capture, digitization, storage, analysis, presentation and use).
• Focus on two keys to the improvement of data quality: prevention and correction.

Data Quality
POOR QUALITY
• Jeopardizes the decision-making process, the credibility of data and the satisfaction of users;
• Raises the cost of data management and lowers the effective use and value of data (Redman, 1996).

POOR DATA QUALITY: IMPACTS
• Pervasiveness of poor data;
• Troublesome data and collection management;
• Difficult data integration and database merging;
• Damage to scientific and institutional reputation.
(Dalcin, 2005)

APPROACH
• Since data users have a wide range of needs, and data are collected from different sources, INPA must enable data of known (good) quality to be shared.
• For each specific dataset, INPA must document the way the data have been compiled and verified, and use that documentation to enrich the metadata description.
• Implement data curation activities.

Data and Computational Resources

Data Curation Structure
[Organization chart. Labels: INPA Directorate; SDIN; COAE; CPAF; Research; CTIN; Collections Program; LIS; Scientific Data Curation; NBGI; Large Projects.]

DATA CURATION @ INPA
• Ongoing management activities that maintain scientific data over the long term so that it remains available for reuse and preservation.
• Examples: LBA, GEOMA, PELD, PPBIO, TEAM, GO AMAZON, ATTO, etc.
• Institutional Data Committee (researchers and IT professionals): implementing the data policy and its enforcement.

DATA CURATION @ INPA (cont.)
Best practices and the development/adoption of specific tools. Focus on:
• Accuracy of taxonomic identification
• Precision of the location and associated information in the record
• Clarity of the recording approach and methodology
• Accuracy in producing and documenting the record
• Quality of data transmission

Generic Error Pattern (English, 1999)
• Data cleansing
• Error patterns
  - Domain value redundancy
  - Missing data values
  - Incorrect data values
  - Nonatomic data values
  - Domain schizophrenia
  - Duplicate occurrences
  - Inconsistent data values
  - Information quality contamination

Disseminating Biological Data

Information Network: traditional
[Diagram. Labels: Data Digitization; Data Providers; Information Infrastructure; Tools for Analysis; Tools for Synthesis; Tools for Presentation; Information; Infrastructure; Science; Application; Policy. Adapted from Erick Mata, 2008.]

Applications / data access today...
[Diagram, built up over successive slides. Components in the final state:
  - Servers: colecoes.inpa.gov.br (INPA Biological Collections); Biodiversidadeamazonica.net.br (holdings of other collections in Western Amazonia)
  - DiGIR portals: INPA; Biodiversidade amazônica
  - Web sites: INPA; Biodiversidade amazonica (PPBio Amazônia); speciesLink
  - Networks: speciesLink; Paraná Taxon-line; Espírito Santo; São Paulo; Rio de Janeiro
  - Tools: maps, modeling, data cleaning, automatic georeferencing
  - Also shown: GBIF, IABIN, SIBBr]

SIBBr: A National Initiative
(Slide from SIBBR/LNCC, 2013)

Towards a better cyberinfrastructure

Conceptual Map
(Slide from D. Pennington)

Conceptual Landscape of Technology Enabling Science - CLTES
• Mental Model
• Research Design
• Collect Data
• Conduct Analyses
• Dissemination/Publishing
• Cyberinfrastructure Systems

CLTES: Technology & Research Cycle
(Slide from D. Pennington)

CLTES at INPA, MPEG, SIBBr, GBIF, Probio II, Biota, Cria, etc.
(Slide adapted from D. Pennington)

Infrastructure for improving data management

From Data on the Web to a Web of Data

EVOLUTION OF DATA/METADATA
(Tim Berners-Lee's Open Data Classification, 2010)
★ On the web, open license
★★ Machine-readable data
★★★ Non-proprietary format
★★★★ RDF standards
★★★★★ Linked RDF
★★★★★ Linked to rich descriptions capable of supporting interoperability
Linked Open (Biological) Data - LOD

EVOLUTION OF THE WEB

Concluding Remarks
• Data quality refers to the understanding and description of the processes involved in data acquisition, treatment and management; in information production, usage and delivery; and in data modeling and implementation.
• A significant aspect of data quality is tied to the Internet and to robust cyberinfrastructure, which improve the way information is delivered.
• Researchers (biologists) must follow and trust the new way data is managed.

Thank you!
Laurindo Campos
lcampos@inpa.gov.br

Partners and Collaborators
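
Supplementary sketch: checking for generic error patterns

Several of the error patterns listed under Generic Error Pattern (English, 1999) can be checked automatically. The Python sketch below is only an illustration and is not INPA's tooling. It covers three of the patterns (missing data values, incorrect data values, duplicate occurrences). The field names (`catalog_number`, `scientific_name`, `latitude`, `longitude`), the function `find_issues` and the sample records are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical required fields; a real collection database defines its own.
REQUIRED_FIELDS = ("catalog_number", "scientific_name", "latitude", "longitude")


def find_issues(records):
    """Return a list of (record_index, pattern, field) tuples.

    Patterns checked: "missing" (absent or blank value),
    "incorrect" (coordinate outside its valid range) and
    "duplicate" (catalog number used by more than one record).
    """
    issues = []
    counts = Counter(r.get("catalog_number") for r in records)
    for i, rec in enumerate(records):
        # Missing data values: field absent, None, or blank.
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                issues.append((i, "missing", field))
        # Incorrect data values: coordinates outside the valid range.
        lat, lon = rec.get("latitude"), rec.get("longitude")
        if isinstance(lat, (int, float)) and not -90 <= lat <= 90:
            issues.append((i, "incorrect", "latitude"))
        if isinstance(lon, (int, float)) and not -180 <= lon <= 180:
            issues.append((i, "incorrect", "longitude"))
        # Duplicate occurrences: same catalog number seen more than once.
        cat = rec.get("catalog_number")
        if cat and counts[cat] > 1:
            issues.append((i, "duplicate", "catalog_number"))
    return issues


# Made-up records for illustration only; not taken from any INPA collection.
records = [
    {"catalog_number": "A1", "scientific_name": "Inia geoffrensis",
     "latitude": -3.1, "longitude": -60.0},
    {"catalog_number": "A1", "scientific_name": "",
     "latitude": -3.1, "longitude": -60.0},
    {"catalog_number": "A2", "scientific_name": "Arapaima gigas",
     "latitude": 123.0, "longitude": -60.0},
]

for issue in find_issues(records):
    print(issue)
```

Checks of this kind correspond to the "correction" side of data quality. The "prevention" side would apply the same rules at capture or digitization time, before a record is stored.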