Cover Sheet for Proposals JISC Grant Funding Call
Name of Programme & Strand: Information Environment 2011 Programme (2/10): Deposit of research outputs and Exposing digital content for education and research
Programme Tags: "INF11" and "JISCexpo"
Name of Call Area Bidding For: Strand B - Expose
Name of Lead Institution: University of Cambridge
Name of Department where project would be based: Chemistry
Full Name of Proposed Project: Open Bibliography
Name(s) of Partner HE/FE Institutions Involved:
Name(s) of Partner Company/Consultants Involved: Open Knowledge Foundation, International Union of Crystallography, Rufus Pollock, Ben O'Steen
Full Contact Details for Primary Lead and/or Contact for the Project:
Name: Dr Peter Murray-Rust
Position: Reader
Email: pm286@cam.ac.uk
Tel: 01223 763069
Skype/VoIP: peter.murray.rust
Address: University Chemical Laboratory, Lensfield Rd, Cambridge, CB2 1EW
Length of Project: 9 months
Project Start Date: 14 June 2010
Project End Date: 13 March 2011
Total Funding Requested from JISC: £77,050
Funding Broken Down over Financial Years (April - March): 2010: £77,050
Project Description / Abstract: This project will deliver a substantial corpus of bibliographic metadata as Linked Open Data (LOD), using existing semantic web tools, standards (RDF, SPARQL), linked data patterns and accepted Open ontologies (FOAF, BIBO, DC, FRBR, etc.). The data will come from two sources: traditional library catalogues (Cambridge University Library and the British Library) and tables of contents (ToCs) from a scientific publisher, the IUCr. None of the material is currently available as LOD. Key strategies are (i) transformation of the current publishers' model to create Open Bibliography as part of their future business, and (ii) the immediate and continuing engagement of the scholarly community. Deliverables include a maintained and growing bibliography on the IUCr site and engagements with other like-minded publishers such as PLoS.
Keywords describing project: Interoperability, Open Technologies, Research & Innovation, Resource Discovery, Tools & Techniques
I have looked at the example FOI form at Appendix B and included an FOI form in the attached bid: YES
I have read the Call, Briefing Paper and associated Terms and Conditions of Grant at Appendix D: YES

Open Bibliography (OB)

1 Background

1. Bibliographic data is useful: A number of organisations such as CERN [1] and the Library of Congress [2] have recognised that providing open access to bibliographic records and controlled vocabularies is a natural and necessary step to begin to identify errors and to avoid erroneous or divergent duplication, thereby improving the accuracy of the metadata. A key point from Karen Coyle [3] is "The change that libraries will need to make in response [to user demand] must include the transformation of the library’s public catalogue from a stand-alone database of bibliographic records to a highly hyperlinked data set that can interact with information resources on the World Wide Web."

2. Bibliographic data is generally not open or linked: this limits its usefulness to the academic community [4,5,6]. The project will deliver bibliographic material that is truly open (conformant with the OKF's Open Knowledge Definition (OKD), http://opendefinition.org). Many attempts to create LOD suffer because there are no useful resources to link to. OB will expose Author names, Institutions and Geographical Locations with semantic targets in the LOD ecosystem (e.g. Geonames, Wikipedia); the project will put significant effort into disambiguation so that OB can become an important node in the LOD graph.

3. Processes for producing Linked Open Data are not familiar to libraries and publishers: Much modern bibliographic data is created implicitly or explicitly by the scholarly publication process but is exposed poorly or not at all. By working with cooperating publishers, the project can rapidly transform their output into complete, open, semantic bibliography.
By providing a clear working model for bibliographic metadata as semantic, referenceable links, together with a reusable workflow to gather, add provenance to, refine and disambiguate existing metadata, the project will enable members of the JISC community to apply the same model and techniques, using the open-source code and services we will provide, both to use data from and to contribute to the aforementioned 'highly hyperlinked data set'.

2 Appropriateness and Fit to Programme Objectives

2.1 Collections

4. The project will focus on the exposure of two major bibliographic collections that are being made available by the project's collaborators, but are not currently Linked or Open.

5. Cambridge University Library and British Library catalogue. As two of the official UK legal deposit libraries, their catalogues are comprehensive and substantial. The CUL will provide an initial set of at least 100,000 bibliographic records in MARC 21 format and has agreed that these records be made available under an appropriate open license (i.e. one compliant with http://opendefinition.org). Depending on progress on clearing relevant rights, it may be possible for CUL to make available substantially more material as the project progresses. The BL has also committed to provide open bibliographic metadata (exact amount to be confirmed).

6. International Union of Crystallography publications. The IUCr is the International Scientific Union for crystallography and has a distinguished record of enhancing the quality of practice in crystallography. Most relevantly, it has run a metadata (CIF dictionary) project for 30 years which defines the syntax and semantics of crystallographic scholarly communication. It is a major publisher with 8 journals (Acta Crystallographica) and many reference works. IUCr partnered with JISC to prototype Open Access in crystallography and in 2005 Peter Strickland, Managing Editor of IUCr, said: "As a result of the JISC funding so far, I am strongly convinced that providing authors the opportunity to make their papers open access works well, provides authors with extra choice and improves access to published content. The funding of IUCr journals has allowed UK authors to publish over 570 open access articles; 255 in 2004 and 322 to date in 2005." In 2008 IUCr converted Acta Cryst E to fully Open Access and since then it has published nearly 10,000 peer-reviewed research articles; this forms our second collection.

[1] http://library.web.cern.ch/library/Library/announcement.html - "The CERN Library publishes its book catalogue as Open Data"
[2] http://id.loc.gov/authorities - "Library of Congress Subject Headings, published as Linked Open Data"
[3] Karen Coyle for ALA - "Understanding the Semantic Web: Bibliographic Data and Metadata" http://www.alatechsource.org/library-technologyreports/understanding-the-semantic-web-bibliographic-data-and-metadata
[4] http://blog.okfn.org/2008/03/06/open-bibliographic-data-the-state-of-play/ - "Open Bibliographic Data - The state of play"
[5] http://data.gov.uk/ - All data is open
[6] http://blog.okfn.org/2010/03/15/libraries-in-cologne-open-up-bibliographic-data/ - "Libraries in Cologne open up bibliographic data"

2.2 Technical Issues

Figure 1 – Open Bibliography Curational Archive (OAIS-derived [7], [8])

7. The archival model and storage form are informed by preservation/archival meetings such as PASIG [9]. Based on a risk-management approach to archive design, the model utilises an OAIS-like workflow combined with incremental refinement of the canonical data. It uses open specifications (e.g. the California Digital Library's microservices (pairtree, checkm, dflat) [10]) and semantic web ontologies and markup.
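As an illustration of one of these specifications, Pairtree maps an object identifier onto a directory path by cleaning the identifier and splitting it into two-character segments. The following is a minimal sketch of that mapping in Python; the function names are hypothetical, the identifier is purely illustrative, and nothing here is taken from the project's own code.

```python
# Sketch of the Pairtree identifier-to-path mapping (CDL specification).
# Function names and the sample identifier are illustrative assumptions.

# Characters that Pairtree hex-encodes as ^hh, in addition to any byte
# outside the visible ASCII range 0x21-0x7e.
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied to the remaining characters.
_SUBSTITUTE = {"/": "=", ":": "+", ".": ","}


def clean_identifier(identifier: str) -> str:
    """Make an identifier safe for use as file-system path components."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append("^%02x" % byte)
        else:
            out.append(_SUBSTITUTE.get(ch, ch))
    return "".join(out)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs) + "/"


if __name__ == "__main__":
    print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
```

Because the path is derived from the identifier alone, an object can be located on disk without consulting a database, which suits the risk-managed archive design described above.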
A prominent service using CDL's microservices is the HathiTrust [11], with 5,643,119 volumes in 210 terabytes, whose code has been used in BoS projects.

8. Dissemination will use an RDF triplestore (4Store [12]) with a web interface and a SPARQL endpoint with 'canned' community-based queries. Mashups such as our example of the explosion of Asian science (geography-timeline-crystallography; http://wwmm.ch.cam.ac.uk/blogs/walkingshaw/, 1 minute video) show how Linked Open Bibliography can have immediate impact. We and our collaborators already have large communities, and we shall develop tools and resources to engage them throughout the process in order to guide and collaborate on the development of tools and services in Work Package 3.

9. Design of bibliographic metadata. We will assess current and planned services that provide bibliographic metadata and/or authority metadata consistent with Linked Open Data, such as the National Library of Sweden [13], the National Library of Hungary [14], VIAF [15], the JISC-funded Names Project [16], the Library of Congress's LCSH name authority service [17] and the bibliographic data curated by Talis and hosted on their Platform [18,19]. The project will be publicised to the bibliographic community and consultation regarding best practices will be sought.

[7] "Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1, Blue Book, January 2002" http://public.ccsds.org/publications/archive/650x0b1.pdf - [ISO Standard 14721:2003]
[8] The Ingest methodology is informed by the "Producer-Archive Interface Methodology Abstract Standard" http://public.ccsds.org/publications/archive/651x0b1.pdf
[9] http://lib.stanford.edu/pasig - Preservation and Archiving Special Interest Group (PASIG) - the Oct 2009 meeting had 175 attendees, including numerous national libraries and large archival libraries and organisations.
[10] http://www.cdlib.org/gateways/technology/ - California Digital Library: List of specifications
[11] http://www.hathitrust.org/ - HathiTrust: ... collaborative digital collections of 13 universities.
[12] http://4store.org - 4Store open source semantic web 'quadstore' - used in work related to http://data.gov.uk
[13] http://libris.kb.se
[14] http://nektar.oszk.hu
[15] http://www.viaf.org/
[16] http://www.jisc.ac.uk/whatwedo/programmes/reppres/sharedservices/names.aspx
[17] http://id.loc.gov/authorities
[18] http://www.talis.com/platform/
[19] http://vocab.org/

10. Metadata Extraction. A major initial aspect of the project is to systematise the bibliographic metadata into a form that can be converted to RDF Linked Open Data (LOD). Project members Dr Pollock (of the University of Cambridge and the Open Knowledge Foundation, OKF) and Mr Ben O'Steen have already created several bibliographic resources (open and otherwise). The project (OB) will examine the technical issues in converting MARC and CIF to RDF-based standards suitable for LOD. It will also identify potential RDF endpoints to which we can link, such as DBpedia, PubChem and Geonames.

11. Initial conversion of metadata. Records will initially be converted (from MARC/RDA) to RDF, including entity extraction and assertion using relevant authority files or by algorithmic determination. Agile methodology will be used, iteratively improving the accuracy and completeness of the metadata conversion. The conversion of IUCr XHTML to an RDF-expressed form will occur in parallel, using similar processes and methodology. For conversion of metadata into an RDF-expressible form we will engage with both the W3C semantic web community and the library community (through the existing OKFN Working Group on Open Bibliographic Data [20] and members of Code4Lib [21] such as Ed Summers, developer behind the LCSH Linked Data set and 'Chronicling America' [22], and Mike Giarlo, Digital Library Architect at The Pennsylvania State University) to help refine and inform our efforts.

12. Entity disambiguation, 'truth management' and refinement.
A major challenge in the use of bibliography in LOD is the assignment of identifier URIs. Lexical variants in names are common ("Darwin, Charles", "C Darwin"; "Cambridge Univ.", "University of Cambridge") and there is much polysemy ("George Bush"). OB will use additional relations in the complete RDF to resolve ambiguity. These can include co-authors, dates (including biography), institution names, and syntactic similarities and differences. The IUCr set is particularly exciting as it was created as part of the author submission process, and details such as phone numbers and emails are common. These are effectively unique identifier systems and have great power in resolving names and institutions.

13. Disambiguation and co-reference will be achieved using a variety of techniques, including heuristic and probabilistic methods, as well as known and trusted authority lists such as the IUCr's dictionary of crystallographers and the bibliographic authority files.

14. The model used for the truth management of this disambiguation will borrow heavily from the work done by Hugh Glaser, Iain Millard et al. at the University of Southampton [23], as displayed notably in the JISC-funded Rapid Innovation project 'dotAC' [24] and the service http://sameas.org.

15. Semantification and exposing links. All team members are highly experienced in the creation of RDF, including through funding from JISC projects such as ICE-TheOREm, while the OKF currently runs triplestores for CKAN.net, data.gov.uk, and a prototype bibliographic store. We have written CIF-RDF converters for crystallographic data and metadata, as shown in Crystaleye, which is the UCMSI's derived resource for crystallography. Here we trawl the Open web for data on publisher sites and transform it to Chemical Markup Language (CML) and RDF. Crystaleye currently holds ~150,000 retrieved data sets, each with full bibliography.
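A CIF-to-RDF conversion of the bibliographic items can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's converter: it reads only one-line `_name value` items (no `loop_` blocks or multi-line text fields), the mapping from CIF data names to Dublin Core and BIBO properties is our own choice for the example, the subject URI is built from the DOI, and the sample record (including its DOI) is invented.

```python
# Minimal sketch: simple CIF bibliographic items -> N-Triples.
# The property mapping, URI pattern and sample record are assumptions
# made for illustration only.

CIF_TO_RDF = {
    "_publ_section_title": "http://purl.org/dc/terms/title",
    "_journal_year": "http://purl.org/dc/terms/issued",
    "_journal_volume": "http://purl.org/ontology/bibo/volume",
    "_journal_paper_doi": "http://purl.org/ontology/bibo/doi",
}


def parse_simple_cif(text):
    """Read one-line `_name value` items; everything else is skipped."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a data name with no value on the same line
        name, value = parts[0], parts[1].strip()
        # Strip one pair of matching CIF quote characters, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        items[name.lower()] = value
    return items


def ntriples_literal(value):
    """Quote a string as a plain N-Triples literal."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return '"%s"' % escaped


def cif_to_ntriples(text):
    """Return one N-Triples line per mapped CIF item."""
    items = parse_simple_cif(text)
    doi = items.get("_journal_paper_doi")
    if not doi:
        raise ValueError("no _journal_paper_doi item to build a subject URI")
    # Assumes the DOI contains only characters that are legal in an IRI.
    subject = "<https://doi.org/%s>" % doi
    return ["%s <%s> %s ." % (subject, prop, ntriples_literal(items[name]))
            for name, prop in CIF_TO_RDF.items() if name in items]


SAMPLE = """data_example
_publ_section_title 'An invented title for illustration'
_journal_year 2009
_journal_volume 65
_journal_paper_doi 10.0000/example-0001
"""

if __name__ == "__main__":
    for triple in cif_to_ntriples(SAMPLE):
        print(triple)
```

Author names, which CIF normally records in `loop_` blocks, are deliberately left out of this sketch; they are exactly the items that need the disambiguation step described in paragraphs 12 to 14 before they can be given URIs.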
This technology is now being used in the JISC CLARION project to populate our departmental repository with experimental data and metadata. OB will seek LOD resources which either already expose endpoints or which may wish to use ours.

16. Working with the community to create user mashups / tools. There is a real need in the community for bibliography (see Impact) that can be harnessed by software and repurposed. The Blue Obelisk Open source chemistry community (http://www.blueobelisk.org; co-founder PMR) has already created a bibliographic mashup (http://blueobelisk.sourceforge.net/wiki/index.php/Using_Javascript_and_Greasemonkey_for_Chemistry) which decorates publisher articles with links to the blogosphere and crystallography. This community will react enthusiastically to incentives (including prize(s) from the project) to create mashups, and we expect similar activity in other domains.

[20] http://wiki.okfn.org/wg/bibliography
[21] http://code4lib.org
[22] http://chroniclingamerica.loc.gov/
[23] http://eprints.ecs.soton.ac.uk/15245/
[24] http://www.dotac.info/about/

3 Quality of Proposal and Robustness of Workplan

3.1 Work Packages

17. WP 1: Project Management. Project management will be provided by RP (see Experience section) and will strongly emphasise iterative development using Agile methodologies with intensive involvement of the community (scientists, librarians). Software and content will be available to the community through nightly Open builds. There will be quarterly meetings with IUCr, and meetings with Cambridge University Library as necessary. An advisory group will be drawn from Ed Summers, Mike Giarlo, members of our current advisory boards for CheTA and CLARION, Southampton and UKOLN (I2S2), Simon Coles and the National Crystallographic Service. The project website and code repository will be set up at the beginning of the project, and all code will be made available under an Open Source License recommended for use by OSSWatch.
Participants: RP, UCMSI, BoS
Deliverables: Project plan. Progress reports.
Final report.

18. WP 2: Infrastructure development. Initial in-depth meeting with IUCr editorial staff to define valuable outcomes from the exposure of Open crystallographic bibliography. Iterative planning on transfer of bibliography to IUCr and feedback on quality. Meeting with CUL and BL and acquisition of the initial set of bibliographic metadata from them. Planning for any subsequent provision of metadata.
Participants: PMR, IUCr, RP, CUL, OKF, BoS
Deliverables: Protocol for Open crystallographic bibliography. Project website, code repository, blog and wiki (OKF).

19. WP 3: User tools and services. All created RDF data will be loaded into the triplestore. Exploration of useful SPARQL queries informed by community use. Customisation of production access to the SPARQL endpoint to make it user-friendly, using direct and continuous communication with user groups including local chemistry researchers, the IUCr and other stakeholders. The SPARQL endpoint will also be publicised to encourage mashups.
Participants: RP, OKF, BoS, PMR, IUCr
Deliverables: SPARQL endpoint, web interface to the created RDF utilising standard Linked Data publishing patterns. Potential mashups with other data sources. Bounty (prize) for community-based mashups.

20. WP 4: Evaluation and dissemination. Needs-driven evaluation of the SPARQL endpoint by IUCr editorial staff. Iterative refinement of metadata in the triplestore and of third-party links. Publication in Acta Crystallographica and at IUCr congresses, and exposure on the Acta Crystallographica website. (See Evaluation in Impact.)
Participants: RP, OKF, BoS, PMR, IUCr
Deliverables: Evaluation report. Publications and exposure/promotion by IUCr. Engagement with Blue Obelisk (chemistry) and Bio2RDF communities for blogging and viral dissemination.

21. WP 5: Sustainability and service transfer. All organisations currently run servers which are guaranteed to run for up to 2 years beyond the project end.
They agree to maintain these and collect usage and maintenance metrics in a public fashion. OB will transfer all software to both partnering institutions (OKF, IUCr) and the appropriate LOD bibliography to each.
Participants: RP, OKF, BoS, PMR, IUCr
Deliverables: (i) Transfer of live and growing OB to IUCr editorial staff for alpha release. (ii) Identification of OB project team within OKF for handover of prototype.

3.2 Project Timetable

22. The following shows the timings for each work package. Degree of effort is proportional to shading intensity.

[Chart: work packages WP1 to WP5 plotted against months 1 to 9; shading not reproduced.]

3.3 Risk Assessment

Category | Risk | Probability /5 | Severity /5 | Score (P x S) | Action to Prevent/Manage Risk
Staffing | Staff retention | 3 | 5 | 15 | Ensure staff are satisfied and challenged and have the chance to give feedback by means of regular one-to-ones. Ensure sharing of expertise, thus enabling cover.
Staffing | Key academic staff leave | 1 | 5 | 5 | There is sufficient overlap in expertise to allow a reduced delivery. Rescoping would be necessary.
Technical | Unexpected insuperable technical problems | 1 | 4 | 4 | Similar problems already solved by project members. Iterative development allows for graceful degradation of deliverables.
Technical | Hardware failure resulting in loss of data | 1 | 4 | 4 | Preventative approaches to data and service backup, including automated backup, off-site replication and server redundancy.
External suppliers | Collections are unavailable or intractable | 2 | 4 | 8 | Engage with data acquisition as early as possible to allow alternatives to be arranged (if necessary) in good time.
External suppliers | OKF / IUCr services not supplied | 1 | 5 | 5 | Ongoing engagement with these key stakeholders to maintain level of commitment.
Legal | Copyright infringement | 1 | 5 | 5 | Close consultation with University legal services; establish clear project staff guidelines w.r.t. commercial partners.

3.4 Intellectual Property Position

23. The OKF uses CC-BY for authored content and PDDL for Open data.
The University of Cambridge asserts its rights to IP created by employees in the course of their employment. All software is distributed under the Artistic Licence (BSD style). The IUCr uses CC-BY for its Open Access material and will use the services of the OKF to advise on the best ways of Opening services.

4 Engagement with the Community

24. The OKF and UCMSI have created large communities who enthusiastically adopt their outputs (see Appendix C and letters of support from IUCr, PLoS, and Carleton). Demand for OB is high (see Impact section) and a granted project will attract interest and requests for participation from publishers, researchers and readers. Both partners have extensive experience of voluntary contributions in Open Source and Open Content projects.

4.1 Dissemination

25. The main targets for dissemination are the primary stakeholders, focusing on the library community and scientific publishing. OB will disseminate through a number of channels: (i) presentations, posters and demonstration stands at JISC Information Environment, Repositories, Infrastructure programme events and All Hands Meetings, and presentation of papers/posters at IUCr and Brit. Cryst. Assoc. meetings; (ii) formal publication(s) in Open Access journals; (iii) the OREChem project (app. C); (iv) OKF meetings (attended by academia, government, public and private research, and publishers); and (v) blog postings.

26. The project will use part of its dissemination budget to offer prize(s) for the best mashup(s) using OB outputs.

4.2 Evaluation

27. Bibliography is by default usually not Open (i.e. it may require permission to view, redistribute or repurpose) and this causes major impedance in creating the web of Linked Open Data. Measured from this baseline, the project's impact on the LOD community can be judged simply: can the project deliver two examples of bibliography which fulfil the following criteria?
1. The resource is fully Open according to the OKD, and there is agreement from the providers of the bibliography that no IPR appears to have been violated.
2. The bibliography is linked to at least one resource in the LOD collection (http://en.wikipedia.org/wiki/Linked_Data) such as names, places, chemical entities, etc.

28. This forms a clear evaluation criterion. The Cambridge UL catalogue is not currently semantically linked, and it will be possible to measure use of the new resource by online accesses or downloads. The IUCr bibliographic data is not currently disambiguated. The project will analyse the confusion of identities through manual sampling and repeat the analysis at project end. The value of a new linked bibliography can be measured by IUCr server stats.

4.3 Stakeholder Analysis

29. Key Stakeholders
UCMSI [High interest, high power]: The crystallographic bibliography will immediately enhance the value of the Crystaleye knowledge base. Bibliographic data will allow the JISC-CLARION project to link data to publications. The links will be of value to UCMSI and partners using ORE in OREChem (Microsoft-funded).
OKF [High interest, high power]: Immediate enhancement of bibliographic software for re-use in community. High visibility and prospect of future engagement in bibliographic projects.
IUCr [High interest, high power]: The RDF resource will be of great value in managing the publication process (e.g. author disambiguation), and supporting online searching. High visibility in showing quality and commitment of IUCr.
JISC [High interest, high power]: Reaping the benefit of investment in Open Access funding (IUCr) and proving the value of RDF for managing collections and metadata.

30. Primary Stakeholders
Academic libraries [High interest, high power]: The ability to connect libraries to the Open web will greatly enhance their visibility (especially among non-academics).
Crystallography Researchers [Indirect medium-high interest, medium interest]: Enhanced Open searching will be used. Author disambiguation may help citation management.

31. Secondary Stakeholders
General Library technology [Medium interest, medium power]: Technology and standards developed in this program will support interoperability in libraries.
Researchers [high interest, medium power]: The discovery of difficult resources by whatever means may be critical to (non)academic research.
Semantic Web community [high interest, medium power]: Any successful exemplar of Linked Open Data is of great value in proving the value of the approach.

5 Impact

32. The collaboration of IUCr (see letter) is a very high-profile exemplar to the Open Access community and to re-distributors of OA material such as the British Library and UKPubMedCentral. IUCr also maintain and develop the CIF ontology for crystallography (including bibliographic concepts) and have a communal exposed World Directory of Crystallographers (an invaluable aid in disambiguation of names, places and institutions). Another Open Access publisher (PLoS, see letter) will eagerly follow what we do.

33. Researchers are desperate for good metadata; a typical one writes: "...publishers seem reluctant to maintain quality feeds. I have, in the past, written my own 'screen scrapers' that digest publishers' websites and present them as XML, but the layouts of the websites often changes frequently and it became too time consuming to keep up. I would eagerly use any open-source software package that automated this process and allowed me to aggregate the tables of contents of various journals, customizable feeds of various databases, etc..."

5.1 Sustainability

34.
The project is committed to Open sustainable outcomes for both collections at the end of the project. The key strategy is for the IUCr to embed the ongoing creation of bibliography into their publication process; this will be at worst a marginal cost and will certainly enhance quality. IUCr and OKF commit to sustaining the resources on the web for a further 2 years, with active maintenance and ingest and cost-effective enhancements to software.

5.2 Evaluation

35. (See also the Evaluation section in Engagement.) The primary evaluation is through the OKF and IUCr participating staff. Success is measured by increased use and interest above the baseline; success in making resources Open; and success in linking to existing LOD targets. The project will collect community feedback through forms on exposed resources and responses to blog posts. Bio2RDF (see letters) will be asked for an assessment of the value of the exposed metadata and the tools used to create and maintain it.

6 Budget

Directly Incurred Staff | Apr10–Mar11 | TOTAL £
Peter Murray-Rust, PI, Grade 11, 30% | £16,089 | £16,089
Total Directly Incurred Staff (A) | £16,089 | £16,089

Non-Staff | Apr10–Mar11 | TOTAL £
Consultant, Technical Lead, 100 days @£400 | £40,000 | £40,000
Consultant, Project manager and developer, 50 days @£400 | £20,000 | £20,000
IUCr, integration work, 10 days @£500 | £5,000 | £5,000
OKF, support and service maintenance for project + 2 years | £3,000 | £3,000
IUCr maintenance and hosting for Crystaleye bibliography | £7,000 | £7,000
Travel and expenses | £1,150 | £1,150
Hardware/software | £500 | £500
Dissemination | £250 | £250
Total Directly Incurred Non-Staff (B) | £76,900 | £76,900
Directly Incurred Total (C = A + B) | £92,989 | £92,989

Directly Allocated | Apr10–Mar11 | TOTAL £
Estates | £3,965 | £3,965
Directly Allocated Total (D) | £3,965 | £3,965
Indirect Costs (E) | £8,733 | £8,733
Total Project Cost (C+D+E) | £105,687 | £105,687
Amount Requested from JISC | £77,050 | £77,050
Institutional and Partner Contributions† | £28,637 | £28,637

Percentage Contributions over the life of the project: JISC 72.9%; Uni of Cambridge 12.9%; OKF 2.8%; IUCr 11.4%; Total 100%
#FTEs used to calculate indirect & estates charges (staff included): 0.3 (PM-R)

† Institutional Contributions include specific directly incurred contributions from partners OKF and IUCr, 20% of Cambridge's fEC and discounts on the rates from both of the consultants involved. IUCr are also contributing the use of the Acta E corpus. It is difficult to put a figure to this contribution, but the journal itself represents many years of effort. CUL are similarly contributing a unique catalogue with many unique records.

6.1 Value for Money

36. Costs have been kept to a minimum, representing an inexpensive laptop and a minimum of travel for dissemination and project F2F meetings. The benefits from the project are primarily in favour of the community rather than the University of Cambridge specifically; accordingly, we feel an institutional commitment of 20% of the fEC is appropriate. Taking the impact (especially the sustainability of the outputs) into consideration together with the contributions by the OKF and IUCr and the day-rate discounts from the individual partners, we believe this project represents excellent value for money.

7 Experience of the Project Team

37.
The Unilever Centre for Molecular Science Informatics (UCMSI) at the University of Cambridge is a world-leading centre for chemical informatics with over 300 publications, many grants and substantial continued funding from Unilever.

38. The Open Knowledge Foundation (OKF) is a not-for-profit organization founded in 2004 and dedicated to promoting open knowledge in all its forms. The Foundation has pioneered work in developing robust legal mechanisms for sharing data, and has taken a central role in helping to develop standards for openness in knowledge and services. It also works on the infrastructure for open data users and producers. This includes CKAN, a community-driven resource for open data, which is currently being used by the UK Government for its Data.gov.uk project. It hosts a number of open data and content projects, from a web application to represent UK public finance using best-of-breed visualization technologies to a database of artistic works that have fallen into the public domain. The Foundation has a working group for bibliographic information (http://wiki.okfn.org/wg/bibliography) with many key individuals and institutions working on open bibliographic information in the EU and elsewhere. Its Bibliographica project is funded by the University of Edinburgh's IDEALab; it is collaborating with Wikimedia Germany.

39. The International Union of Crystallography (IUCr) have been longstanding partners of the UCMSI and have funded summer studentships and contributed in-kind (to the EPSRC/SciBorg project). OB will concentrate on bibliographic metadata and identify the most useful areas for automation. IUCr staff are experienced in document searching and manual annotation of metadata in documents and will be invaluable in project evaluation. specific dictionaries, as in project CheTA.

40. Principal Investigator: Dr Peter Murray-Rust (PMR, UCMSI) is an Emeritus Reader in Molecular Informatics with nearly 200 publications. He develops the Chemical Semantic Web with support from DTI/EPSRC ("Molecular Standards for the Grid"), Unilever Research ("polymer informatics"), Microsoft Research (Chem4Word, OREChem), EPSRC (SciBorg), JISC (SPECTRa, SPECTRa-T (with Cambridge UL), eCrystals, TheOREm, ICE-Theorem, CLARION, CheTA, AMI), Royal Soc. Chem. (OSCAR), IBM/Accelrys ("Materials Grid", with Earth Sci), IUCr, Nature, Merck, MDL and COST. He is a member of the OKF Advisory Board and a co-author of their IsItOpen resource and Panton Principles; a founder member of the Blue Obelisk Open Source group for chemistry; a member of the Advisory Board of UKPubMedCentral (British Library); and a member of the IUCr's COMCIFS (dictionary/metadata) group. Peter will focus on the project direction and the co-ordination of the project partners, and will contribute major software enhancements to his JUMBO library for converting CIF to RDF. http://en.wikipedia.org/wiki/Peter_Murray-Rust, http://wwmm.ch.cam.ac.uk/

41. Dr Rufus Pollock (RP; Mead Fellow in Economics, Emmanuel College, University of Cambridge; Chairman, OKF) will contribute to most areas of OB's work, focusing on the project direction, management and dissemination. He will also develop the metadata design, store architecture and disambiguation work. RP has managed several similar projects, including a 2-yr EU-funded project and CKAN development for data.gov.uk. He is a co-founder of the Open Knowledge Foundation and a member of its Board, with extensive experience of the legal, social and technical aspects of open information, and of bibliographic data in particular. He was lead economist on a 2-yr EU-funded project (size and value of the public domain) [http://www.rightscom.com/Default.aspx?tabid=20397]. He has worked extensively with bibliographic metadata, including the full CUL catalogue, and on developing databases, processing bibliographic formats (including MARC) and matching entities from different datasets.

42.
Ben O'Steen (BoS) has 13 years of IT development experience and has most recently worked at the Oxford University Library Service as the software architect for the Bodleian Library's DAMS (Digital Asset Management System). He has extensive experience of working with RDF, bibliographic and related metadata standards, and distributed system design. He was lead developer for Oxford on many projects, including the JISC-funded BRII [http://brii.bodleian.ox.ac.uk/] project and the Preserv2 project (partnered with the British Library and The National Archives) [http://preserv.eprints.org/]. The infrastructure he has designed and built enables the storage and preservation of complex semantic metadata models, a system that underpins projects and services such as the joint JISC and Mellon Foundation funded 'Cultures of Knowledge' [http://www.history.ox.ac.uk/cofk/infrastructure] project and 'Medieval libraries of Great Britain' [http://www.history.ox.ac.uk/research/projects/MLGB3.htm]. He was part of the winning team of Repository Challenge 08 with an entry that provided an RDF Linked Data view on two of the leading open source repository systems [http://bit.ly/dgtZvR]. O'Steen will contribute to most areas of OB’s infrastructure, including metadata design and realisation, triple storage and SPARQL endpoints, and user-facing interfaces and query systems.
Appendix A: Letters of Support
43. Eight Letters of Support are attached as .DOCs or .PDFs in the .ZIP archive:
1 = Prof. 
Jeremy Sanders, Cambridge University, Head of School of Physical Sciences “Letter of support - Jeremy Sanders Apr10 - JISC funding.doc”
2 = Dr Peter Strickland, International Union of Crystallography “Letter of support - Peter Strickland - IUCr Apr10.pdf”
3 = Dr Anne Jarvis, Cambridge University Librarian “Letter of support - Anne Jarvis - Cambridge Uni Library Apr10 - Jisc Bid 14-04-2010.pdf”
4 = Dr Mark Patterson, Public Library of Science (PLoS) “Letter of support - Mark Patterson - PLoS Apr10.docx”
5 = Dr Rufus Pollock, Open Knowledge Foundation “Letter of support - Rufus Pollack - Open Knowledge Foundation 17Apr10 cambridge_support.pdf”
6 = Dr Benjamin White, British Library “Letter of support - Benjamin White - British Library Apr10 - jisc bid letter.pdf”
7 = Dr Michel Dumontier, Bio2RDF, Carleton University “Letter of support - Bio2RDF Apr10 - peter_murray-rust_LOS-open_bibliograph.pdf”
8 = Dr David Shotton, Oxford University, “Open Citation” project leader “Letter of Support - David Shotton Oxford for Peter Murray-Rust JISC application.pdf”
Appendix B: Freedom of Information
FOI Withheld Information Form
We would like JISC to consider withholding the following sections or paragraphs from disclosure, should the contents of this proposal be requested under the Freedom of Information Act, or if we are successful in our bid for funding and our project proposal is made available on JISC’s website. We acknowledge that the FOI Withheld Information Form is of indicative value only and that JISC may nevertheless be obliged to disclose this information in accordance with the requirements of the Act. We acknowledge that the final decision on disclosure rests with JISC.
Section / Paragraph No. | Relevant exemption from disclosure | Justification under FOI
Appendix C: Additional JISC Projects
44. 
UCMSI has had an ongoing programme of developing semantic resources in science, including repositories (SPECTRa, SPECTRa-T, and CLARION), semantic infrastructure (TheOREm, ICE-Theorem) and software (OMII-Engage for OSCAR3). Currently (see below) Dr Murray-Rust leads 3 projects (CLARION, CheTA, AMI) and participates in I2S2. The common theme is to develop an Open infrastructure and applications with the primary aim of encouraging community use and participation. The success of this is shown in projects such as OREChem, funded by Microsoft (PI Carl Lagoze, Cornell), which uses software developed in SPECTRa and other projects to create RDF resources. As evidence of community engagement, our recent release of one item of software (Chem4Word) has had over 80,000 downloads within one month of release. In this we have worked with Microsoft to create one of their first fully Open Source scientific offerings. We have had in-depth discussion with David Shotton on his "Open Citation" project in this strand. Our projects are complementary; Open Citations will extract citations from OA publications, and Open Bibliography will extract bibliographic information and enhance it with Open metadata. This is an excellent example of two groups which have communicated regularly, shared advisory functions and designed projects deliberately to be complementary.
CLARION. Chemical Laboratory Repository in In/Organic Notebooks. http://www.jisc.ac.uk/whatwedo/programmes/inf11/sue2/clarion.aspx
The Cambridge Chemistry Department has a basic repository which stores crystallographic data. Project CLARION (Cambridge Laboratory Repository In/Organic Notebooks) will create an enhanced repository that captures core types of chemistry data and ensures their access and preservation. 
The Chemistry Department is implementing a commercial Electronic Laboratory Notebook (ELN) system; CLARION will work closely with the ELN team to create a system for ingesting chemistry data directly into the repository with minimum effort by the researcher.
CheTA. Chemistry Using Text Annotations: http://www.jisc.ac.uk/whatwedo/programmes/inf11/resdis/cheta.aspx
This project (CheTA) will integrate Cambridge's chemical text mining tool OSCAR with the U-Compare workflow infrastructure developed by NaCTeM and others. This integration adds chemistry to the world's largest public collection of interoperable text mining tools and will be highly valued by influential stakeholders both in the JISC community and the wider chemistry community.
AMI. The chemist's amanuensis: http://www.jisc.ac.uk/publications/briefingpapers/2010/bpvrev3.aspx
45. AMI is developing a speech interface to enable chemists to access files and archives when their hands are occupied in the lab.
I2S2: Infrastructure for Integration in Structural Sciences http://www.ukoln.ac.uk/projects/I2S2/
I2S2 will identify requirements for a data-driven research infrastructure in "Structural Science", focusing on the domain of Chemistry, but with a view towards inter-disciplinary application. I2S2 will develop use cases that explore perspectives of scale and complexity and research discipline throughout the data lifecycle.
46. 
Current JISC-funded projects involving Ben O'Steen
ADMIRAL: A Data Management Infrastructure for Research
JISC PIMS Record: https://pims.jisc.ac.uk/projects/view/1523
Project site: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL
Ben O'Steen as collaborator on behalf of Bodleian Libraries (as providers of 'Oxford University Data Store')
The purpose of the ADMIRAL Project is to create a two-tier federated data management infrastructure for use by life science researchers that will provide services (a) to meet their local data management needs for the collection, digital organization, metadata annotation and controlled sharing of biological datasets; and (b) to provide an easy and secure route for archiving annotated datasets to an institutional repository, the Oxford University Data Store, for long-term preservation and access, complete with assigned Digital Object Identifiers and Creative Commons open access licences.
Sudamih: Supporting Data Management Infrastructure for the Humanities
JISC PIMS Record: https://pims.jisc.ac.uk/projects/view/1524
Project site: http://sudamih.oucs.ox.ac.uk/
Ben O'Steen as 'Metadata Consultant' on behalf of Bodleian Libraries
The Supporting Data Management Infrastructure for the Humanities (Sudamih) project aims to address a coherent range of requirements for the more effective management of data (broadly defined) within the Humanities at an institutional level. The project is an intra-institutional collaboration led by Computing Services (OUCS) and bringing together the Library (OULS) and the Research Service Office (RSO) with researchers from the Institute of Archaeology and the Faculty of History.
Recently completed JISC projects - Ben O'Steen
Building the Research Information Infrastructure
JISC PIMS record: https://pims.jisc.ac.uk/projects/view/1205
Project site: http://brii.ouls.ox.ac.uk/
Ben O'Steen as lead developer/architect. 
BRII will enable efficient sharing of research management information using semantic web technologies. Ontologies and taxonomies will define and describe data objects (e.g. people, research groups, funding agencies, publications, research ‘themes’) to forge connections between them and provide web-based services to disseminate and reuse this information in new contexts. It will create efficiencies, greater accuracy of data, and better discovery of research activities at Oxford. University data sources will include academic departments and central services. Half the project will be devoted to stakeholder input, collaboration and ‘buy-in’ aimed at evolving current work practices and processes. The project started in September 2008 and will end in March 2010.
Open Access Repository System for Forced Migration Online
JISC PIMS Record: https://pims.jisc.ac.uk/projects/view/417
Service site: http://www.forcedmigration.org/
Ben O'Steen as infrastructure developer/collaborator on behalf of Bodleian Libraries
This project will migrate a fragmented digital repository of scholarly resources, currently managed by two proprietary software systems, to a single open source platform. This repository, based at the Refugee Studies Centre, University of Oxford, is the largest in the world in its subject area of forced migration. It is a unique, widely used and constantly expanding collection of resources. The enhancement of this repository will make it more manageable for those maintaining it, and also make it globally interoperable with other open systems, as well as with the University of Oxford’s institutional repository. 
DISC-UK DataShare
https://pims.jisc.ac.uk/projects/view/398
Ben O'Steen as collaborator on behalf of Bodleian Libraries at the University of Oxford
DISC-UK DataShare, led by Edina, arises from an existing UK consortium of data support professionals working in departments and academic libraries in universities (Data Information Specialists Committee-UK), and builds on an international network with a tradition of data sharing and data archiving dating back to the 1960s in the social sciences. By working together across four universities and internally with colleagues already engaged in managing open access repositories for e-prints, this partnership will introduce and test a new model of data sharing and archiving to UK research institutions. By supporting academics within the four partner institutions who wish to share datasets on which written research outputs are based, this network of institution-based data repositories develops a niche model for deposit of ‘orphaned datasets’ currently filled neither by centralised subject-domain data archives/centres/grids nor by e-print based institutional repositories (IRs).