Cover Sheet for Proposals JISC Grant Funding Call

Name of Programme & Strand:
Information Environment 2011 Programme (2/10):
Deposit of research outputs and Exposing digital
content for education and research
Programme Tags:
"INF11" and "JISCexpo"
Name of Call Area Bidding For:
Strand B - Expose
Name of Lead Institution:
University of Cambridge
Name of Department where project
would be based:
Chemistry
Full Name of Proposed Project:
Open Bibliography
Name(s) of Partner HE/FE
Institutions Involved:
Name(s) of Partner
Company/Consultants Involved:
Open Knowledge Foundation, International Union of
Crystallography, Rufus Pollock, Ben O’Steen
Full Contact Details for Primary Lead
and/or Contact for the Project:
Name: Dr Peter Murray-Rust
Position: Reader
Email: pm286@cam.ac.uk
Tel: 01223 763069
Skype/VoIP: peter.murray.rust
Address: University Chemical Laboratory, Lensfield
Rd, Cambridge, CB2 1EW
Length of Project:
9 months
Project Start Date:
14th June, 2010
Project End Date:
13 March, 2011
Total Funding Requested from JISC:
£77,050
Funding Broken Down over
Financial Years (April - March)
2010: £77,050
Project Description / Abstract:
This project will deliver a substantial corpus of
bibliographic metadata as Linked Open Data, using
existing semantic web tools, standards (RDF,
SPARQL), linked data patterns and accepted Open
ontologies (FoaF, Bibo, DC, FRBR, etc). The data will
be from two sources: traditional library catalogues
(Cambridge University Library and the British Library)
and ToCs from a scientific publisher, the IUCr. None
of the material is currently available as LOD. Key
strategies are (i) transformation of current publishers'
model to create Open Bibliography as part of their
future business, and (ii) the immediate and continuing
engagement of the scholarly community. Deliverables
include a maintained and growing bibliography on the
IUCr site and engagements with other like-minded
publishers such as PLoS.
Keywords describing project:
Interoperability, Open Technologies, Research &
Innovation, Resource Discovery, Tools & Techniques
I have looked at the example FOI
form at Appendix B and included an
FOI form in the attached bid
YES
I have read the Call, Briefing Paper
and associated Terms and
Conditions of Grant at Appendix D
YES
Open Bibliography (OB)
1 Background
1. Bibliographic data is useful: A number of organisations such as CERN [1] and the Library
of Congress [2] have recognised that providing open access to bibliographic records and
controlled vocabularies is a natural and necessary step towards identifying errors and
avoiding erroneous or divergent duplication, thereby improving metadata accuracy. A key
point from Karen Coyle [3] is: "The change that libraries will need to make in response [to user
demand] must include the transformation of the library’s public catalogue from a stand-alone
database of bibliographic records to a highly hyperlinked data set that can interact with
information resources on the World Wide Web."
2. Bibliographic data is generally not open or linked: this limits its usefulness to the
academic community [4,5,6]. The project will deliver bibliographic material that is truly open
(conformant with the OKF’s Open Knowledge Definition (OKD), http://opendefinition.org).
Many attempts to create LOD suffer because there are no useful resources to link to. OB will
expose Author names, Institutions and Geographical Locations with semantic targets in the
LOD ecosystem (e.g. Geonames, Wikipedia); the project will put significant effort into
disambiguation so that OB can become an important node in the LOD graph.
3. Processes for producing Linked Open Data are not familiar to libraries and publishers:
Much modern bibliographic data is created implicitly or explicitly by the scholarly publication
process but exposed poorly or not at all. Working with cooperating publishers, the project can
rapidly transform their output into complete open semantic bibliography. OB will provide a
clear working model for bibliographic metadata as semantic, referenceable links, with a
reusable workflow to gather, add provenance to, refine and disambiguate existing metadata.
Members of the JISC community can then apply the same model and techniques, with the
open-source code and services we will provide, both to use data from and to contribute to the
aforementioned 'highly hyperlinked data set'.
2 Appropriateness and Fit to Programme Objectives
2.1 Collections
4. The project will focus on the exposure of two major bibliographic collections that
are being made available by the project's collaborators but are not currently Linked or
Open.
5. Cambridge University Library and British Library catalogues. As two of the official
UK legal deposit libraries, their catalogues are comprehensive and substantial. The CUL will
provide an initial set of at least 100,000 bibliographic records in MARC 21 format and has
agreed for these records to be made available under an appropriate open license (i.e. one
compliant with http://opendefinition.org). Depending on progress in clearing relevant rights, it
may be possible for CUL to make available substantially more material as the project
progresses. The BL has also committed to providing open bibliographic metadata (exact
amount to be confirmed).
6. International Union of Crystallography publications. The IUCr is the International
Scientific Union for crystallography and has a distinguished record of enhancing the quality
of practice in crystallography. Most relevantly, it has run a metadata (CIF dictionary) project
for 30 years, which defines the syntax and semantics of crystallographic scholarly
communication. It is a major publisher with 8 journals (Acta Crystallographica) and many
reference works. IUCr partnered with JISC to prototype Open Access in crystallography and
in 2005 Peter Strickland, Managing Editor of IUCr, said: "As a result of the JISC funding so
far, I am strongly convinced that providing authors the opportunity to make their papers open
access works well, provides authors with extra choice and improves access to published
content. The funding of IUCr journals has allowed UK authors to publish over 570 open
access articles; 255 in 2004 and 322 to date in 2005." In 2008 IUCr converted Acta Cryst E
to fully Open Access and since then it has published nearly 10,000 peer-reviewed research
articles; this forms our second collection.
[1] http://library.web.cern.ch/library/Library/announcement.html - "The CERN Library publishes its book catalogue as Open Data"
[2] http://id.loc.gov/authorities - "Library of Congress Subject Headings, published as Linked Open Data"
[3] Karen Coyle for ALA - "Understanding the Semantic Web: Bibliographic Data and Metadata" http://www.alatechsource.org/library-technologyreports/understanding-the-semantic-web-bibliographic-data-and-metadata
[4] http://blog.okfn.org/2008/03/06/open-bibliographic-data-the-state-of-play/ - "Open Bibliographic Data - The state of play"
[5] http://data.gov.uk/ - All data is open
[6] http://blog.okfn.org/2010/03/15/libraries-in-cologne-open-up-bibliographic-data/ - "Libraries in Cologne open up bibliographic data"
2.2 Technical Issues
Figure 1 – Open Bibliography Curational Archive (OAIS-derived [7], [8])
7. The archival model and storage form is informed by preservation/archival meetings
such as PASIG [9]. Based on a risk-management approach to archive design, it utilises an
OAIS-like workflow combined with incremental refinement of the canonical data. It uses
open specifications (e.g. the California Digital Library's microservices: pairtree,
checkm, dflat [10]) together with semantic web ontologies and markup. A prominent service
using CDL's microservices is the HathiTrust [11], with 5,643,119 volumes in 210 terabytes, whose
code has been used in BoS projects.
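As an illustration of the pairtree microservice named above, the following is a minimal sketch of how an identifier can be mapped to a directory path. It follows the identifier-cleaning and two-character splitting rules as we understand them from the CDL Pairtree draft; the function names and the default root directory argument are our own, and the draft's handling of object encapsulation is deliberately left out.

```python
# Visible ASCII characters that are hex-encoded as ^hh, per our reading of the
# Pairtree draft (non-visible and non-ASCII bytes are encoded as well).
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied after hex-encoding.
_SUBSTITUTE = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier: str) -> str:
    """Make an identifier safe to use as file-system path segments."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append("^%02x" % byte)
        else:
            out.append(ch)
    return "".join(out).translate(_SUBSTITUTE)


def pairtree_path(identifier: str, root: str = "pairtree_root") -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([root] + pairs)


print(pairtree_path("ark:/13030/xt12t3"))
```

The appeal of this layout for an archive is that the location of a record can be computed from its identifier alone, without a database lookup.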
8. Dissemination will use an RDF triplestore (4Store [12]) with a web interface and a
SPARQL endpoint with 'canned' community-based queries. Mashups such as our example
of the explosion of Asian science (geography-timeline-crystallography;
http://wwmm.ch.cam.ac.uk/blogs/walkingshaw/, 1 minute video) show how Linked Open
Bibliography can have immediate impact. We and our collaborators already have large
communities, and we shall develop tools and resources to engage them throughout the
process in order to guide and collaborate on the development of tools and services (Work
Package 3).
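To make the idea of a 'canned' query concrete, here is a hedged sketch of how a named query template could be turned into a request URL for a SPARQL endpoint. The endpoint address, the query name and the choice of dcterms properties are assumptions for illustration only; the one fixed point is that the SPARQL protocol accepts the query text in a `query` parameter of a GET request.

```python
from string import Template
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint address

# Hypothetical library of canned queries, keyed by a community-friendly name.
CANNED_QUERIES = {
    "titles_by_creator": Template(
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?work ?title WHERE {\n"
        "  ?work dcterms:creator <$creator> ;\n"
        "        dcterms:title ?title .\n"
        "} LIMIT $limit"
    ),
}

_FORBIDDEN_IN_IRI = set('<>"{}|\\^`')


def build_query_url(name: str, creator: str, limit: int = 10) -> str:
    """Fill in a canned query and return the GET URL that would run it."""
    # Reject characters that cannot appear inside <...> in SPARQL, so a
    # user-supplied value cannot break out of the IRI and alter the query.
    if any(ch in _FORBIDDEN_IN_IRI or ord(ch) <= 0x20 for ch in creator):
        raise ValueError("not a usable IRI: %r" % creator)
    query = CANNED_QUERIES[name].substitute(creator=creator, limit=int(limit))
    return ENDPOINT + "?" + urlencode({"query": query})


print(build_query_url("titles_by_creator", "http://example.org/ob/person/1", 5))
```

Keeping the queries in a named table lets non-specialist users run them from a web form while the SPARQL itself is maintained in one place.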
9. Design of bibliographic metadata. An assessment will be made of current and planned
services that provide bibliographic metadata and/or authority metadata consistent with
Linked Open Data, such as the National Library of Sweden [13], the National Library of
Hungary [14], VIAF [15], the JISC-funded Names Project [16], the Library of Congress's LCSH
name authority service [17] and the bibliographic data curated by Talis and hosted on their
Platform [18,19]. The project will be publicised to the bibliographic community and
consultation regarding best practices will be sought.
[7] "Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1, Blue Book, January 2002" http://public.ccsds.org/publications/archive/650x0b1.pdf (ISO Standard 14721:2003)
[8] The Ingest methodology is informed by the "Producer-Archive Interface Methodology Abstract Standard" http://public.ccsds.org/publications/archive/651x0b1.pdf
[9] http://lib.stanford.edu/pasig - Preservation and Archiving Special Interest Group (PASIG) - the Oct 2009 meeting had 175 attendees, including numerous national libraries and large archival libraries and organisations.
[10] http://www.cdlib.org/gateways/technology/ - California Digital Library: List of specifications
[11] http://www.hathitrust.org/ - HathiTrust: ... collaborative digital collections of 13 universities.
[12] http://4store.org - 4Store open source semantic web 'quadstore' - used in work related to http://data.gov.uk
[13] http://libris.kb.se
[14] http://nektar.oszk.hu
[15] http://www.viaf.org/
[16] http://www.jisc.ac.uk/whatwedo/programmes/reppres/sharedservices/names.aspx
[17] http://id.loc.gov/authorities
[18] http://www.talis.com/platform/
[19] http://vocab.org/
10. Metadata Extraction. A major initial aspect of the project is to systematize the
bibliographic metadata into a form that can be converted to RDF Linked Open Data (LOD).
Project members Dr Pollock (of the University of Cambridge and the Open Knowledge
Foundation, OKF) and Mr Ben O'Steen have already created several bibliographic
resources (open and otherwise). The project (OB) will examine the technical issues in
converting MARC and CIF to RDF-based standards suitable for LOD. It will also identify
potential RDF endpoints to which we can link, such as DBpedia, PubChem and Geonames.
11. Initial conversion of metadata. Records will first be converted (from MARC/RDA) to
RDF, including entity extraction and assertion using relevant authority files or by algorithmic
determination. Agile methodology will be used, iteratively improving the accuracy and
completeness of the metadata conversion. The conversion of IUCr XHTML to an
RDF-expressed form will occur in parallel, using similar processes and methodology. For
conversion of metadata into an RDF-expressible form we will engage with both the W3C
semantic web community and the library community (through the existing OKFN Working
Group on Open Bibliographic Data [20] and members of Code4Lib [21] such as Ed Summers,
developer behind the LCSH Linked Data set and 'Chronicling America' [22], and Mike Giarlo,
Digital Library Architect at The Pennsylvania State University) to help refine and inform our
efforts.
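The kind of conversion described in this paragraph can be sketched as follows. This is not the project's converter: the record is a simplified, hypothetical dictionary keyed by MARC 21 tag (001 control number, 100$a personal name, 245$a title, 260$c publication date) rather than a parsed MARC file, the URI base is invented, and the mapping to Dublin Core terms, Bibo and FoaF is one plausible choice among several. The output is a list of N-Triples lines.

```python
BASE = "http://example.org/ob/"  # hypothetical URI base for minted identifiers
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BIBO_BOOK = "http://purl.org/ontology/bibo/Book"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def literal(text: str) -> str:
    """Quote a plain string literal (minimal escaping only)."""
    return '"%s"' % text.replace("\\", "\\\\").replace('"', '\\"')


def marc_to_ntriples(record: dict) -> list:
    """Map a few fields of a simplified MARC-like record to N-Triples lines."""
    rec_id = record["001"]
    work = "<%swork/%s>" % (BASE, rec_id)
    lines = ["%s <%s> <%s> ." % (work, RDF_TYPE, BIBO_BOOK)]
    title = record.get("245", {}).get("a")
    if title:
        # Strip trailing ISBD punctuation such as " /" or " :".
        lines.append("%s <%stitle> %s ." % (work, DCT, literal(title.rstrip(" /:"))))
    author = record.get("100", {}).get("a")
    if author:
        # Provisional per-record person URI; merging is left to disambiguation.
        person = "<%sperson/%s-100>" % (BASE, rec_id)
        lines.append("%s <%screator> %s ." % (work, DCT, person))
        lines.append("%s <%s> %s ." % (person, FOAF_NAME, literal(author.rstrip(" ,"))))
    date = record.get("260", {}).get("c")
    if date:
        lines.append("%s <%sdate> %s ." % (work, DCT, literal(date.rstrip(" ."))))
    return lines


example = {
    "001": "12345",
    "100": {"a": "Darwin, Charles,"},
    "245": {"a": "On the origin of species /"},
    "260": {"c": "1859."},
}
print("\n".join(marc_to_ntriples(example)))
```

Minting a provisional person URI per record keeps the conversion step simple and defers the question of who is the same person to the later disambiguation stage.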
12. Entity disambiguation, 'truth management' and refinement. A major challenge in
the use of bibliography in LOD is the assignment of identifier URIs. Lexical variants in names
are common ("Darwin, Charles", "C Darwin"; "Cambridge Univ.", "University of Cambridge")
and there is much polysemy ("George Bush"). OB will use additional relations in the
complete RDF to resolve ambiguity. This can include co-authors, dates (including
biography), institution names, and syntactic similarities and differences. The IUCr set is
particularly exciting as it was created as part of the author submission process and details
such as phone numbers and emails are common. These are effectively unique identifier
systems and have great power in resolving names and institutions.
13. Disambiguation and co-reference will be achieved using a variety of techniques,
including heuristic and probabilistic methods, as well as using known and trusted authority
lists such as the IUCr's dictionary of crystallographers and the bibliographic authority files.
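A toy version of such a heuristic is sketched below, using the lexical variants mentioned above. The record layout (`name`, `email`, `coauthors`), the function names and the score weights are all assumptions chosen for illustration; a real system would calibrate the weights against manually checked samples and consult authority lists as well.

```python
def name_key(name: str) -> tuple:
    """Reduce a personal name to (surname, first initial), lower-cased."""
    name = name.strip()
    if "," in name:
        surname, _, given = name.partition(",")  # "Darwin, Charles"
    else:
        parts = name.split()  # "C Darwin"
        if not parts:
            return ("", "")
        surname, given = parts[-1], " ".join(parts[:-1])
    given = given.strip()
    surname = "".join(ch for ch in surname.lower() if ch.isalpha())
    return (surname, given[0].lower() if given else "")


def match_score(a: dict, b: dict) -> float:
    """Heuristic score in [0, 1] that two author records denote one person."""
    if name_key(a["name"]) != name_key(b["name"]):
        return 0.0
    # A shared e-mail address is treated as an effectively unique identifier.
    if a.get("email") and a["email"].lower() == (b.get("email") or "").lower():
        return 1.0
    # Otherwise, shared co-authors raise the score above a neutral 0.5.
    co_a = {name_key(n) for n in a.get("coauthors", ())}
    co_b = {name_key(n) for n in b.get("coauthors", ())}
    union = co_a | co_b
    overlap = len(co_a & co_b) / len(union) if union else 0.0
    return 0.5 + 0.5 * overlap


print(match_score({"name": "Darwin, Charles", "email": "cd@example.org"},
                  {"name": "C Darwin", "email": "CD@example.org"}))
```

Names that agree only on surname and initial, such as two different "G Bush" records, stay at the neutral score until further evidence is available.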
14. The model used for the truth management of this disambiguation will draw heavily
on the work done by Hugh Glaser, Iain Millard et al. at the University of Southampton [23],
as displayed notably in the JISC-funded Rapid Innovation project 'dotAC' [24] and the
service http://sameas.org.
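One way to picture co-reference management is as a set of 'bundles' of URIs asserted to denote the same thing, kept separate from the bibliographic data itself. The sketch below is our own illustration using a union-find structure; it is not the Southampton implementation, and the class name, method names and example URIs are hypothetical.

```python
class CorefStore:
    """Keeps bundles of URIs that have been asserted to be equivalent."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri: str) -> str:
        """Return the canonical URI of the bundle containing `uri`."""
        self._parent.setdefault(uri, uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited URI straight at the root.
        while self._parent[uri] != root:
            next_uri = self._parent[uri]
            self._parent[uri] = root
            uri = next_uri
        return root

    def assert_same(self, a: str, b: str) -> None:
        """Record that two URIs denote the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Keep the lexicographically smallest URI as the canonical one.
            self._parent[max(root_a, root_b)] = min(root_a, root_b)

    def bundle(self, uri: str) -> list:
        """Return every URI known to be equivalent to `uri`, sorted."""
        root = self._find(uri)
        return sorted(u for u in list(self._parent) if self._find(u) == root)


store = CorefStore()
store.assert_same("http://example.org/ob/person/1", "http://example.org/ob/person/2")
store.assert_same("http://example.org/ob/person/2", "http://dbpedia.org/resource/Charles_Darwin")
print(store.bundle("http://example.org/ob/person/1"))
```

Holding equivalences in a separate store means a mistaken merge can be revisited without rewriting the underlying records, although undoing a single assertion would need more bookkeeping than this sketch provides.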
15. Semantification and exposing links. All team members are highly experienced in
the creation of RDF, including through JISC-funded projects such as ICE-TheOREm, while the
OKF currently runs triplestores for CKAN.net, data.gov.uk, and a prototype bibliographic
store. We have written CIF-RDF converters for crystallographic data and metadata, as shown
in Crystaleye, which is the UCMSI's derived resource for crystallography. Here we trawl the
Open web for data on publisher sites and transform it to Chemical Markup Language (CML)
and RDF. Crystaleye currently holds ~150,000 retrieved data sets, each with full bibliography.
This technology is now being used in the JISC CLARION project to populate our departmental
repository with experimental data and metadata. OB will seek LOD resources which either
already expose endpoints or which may wish to use ours.
16. Working with community to create user mashup / tool. There is a real need in the
community for bibliography (see Impact) which can be harnessed by software to repurpose
content. The Blue Obelisk Open Source chemistry community (http://www.blueobelisk.org,
co-founder PMR) has already created a bibliographic mashup
(http://blueobelisk.sourceforge.net/wiki/index.php/Using_Javascript_and_Greasemonkey_for_Chemistry)
which decorates publisher articles with links to the blogosphere and
crystallography. This community will react enthusiastically to incentives (including prize(s)
from the project) to create mashups, and we expect similar activity in other domains.
[20] http://wiki.okfn.org/wg/bibliography
[21] http://code4lib.org
[22] http://chroniclingamerica.loc.gov/
[23] http://eprints.ecs.soton.ac.uk/15245/
[24] http://www.dotac.info/about/
3 Quality of Proposal and Robustness of Workplan
3.1 Work Packages
17. WP 1: Project Management. Project management will be provided by RP (see Experience
section) and will strongly emphasize iterative development using Agile methods, with
intensive involvement of the community (scientists, librarians). Software and content will be
available to the community through nightly Open builds. There will be quarterly meetings
with IUCr, and meetings with Cambridge University Library as necessary. An advisory group
will draw on Ed Summers, Mike Giarlo, members of our current advisory boards for CheTA
and CLARION, Southampton and UKOLN (I2S2), and Simon Coles and the National
Crystallographic Service. The project website and code repository will be set up at the
beginning of the project and all code made available under an Open Source License
recommended for use by OSSWatch.
Participants: RP, UCMSI, BoS
Deliverables: Project plan. Progress reports. Final report.
18. WP 2: Infrastructure development. Initial in-depth meeting with IUCr editorial staff
to define valuable outcomes from the exposure of Open crystallographic bibliography.
Iterative planning on transfer of bibliography to IUCr and feedback on quality.
Meeting with CUL and BL and acquisition of initial set of bibliographic metadata from them.
Planning for any subsequent provision of metadata.
Participants: PMR, IUCr, RP, CUL, OKF, BoS
Deliverables: Protocol for Open crystallographic bibliography
Project website, code repository, blog and wiki (OKF)
19. WP 3: User tools and services. All created RDF data will be loaded into the triplestore.
Exploration of useful SPARQL queries informed by community use. Customisation of
production access to SPARQL end-point to make it user-friendly using direct and continuous
communication with user groups including local chemistry researchers, the IUCr and other
stakeholders. The SPARQL endpoint will also be publicised to encourage mashups.
Participants: RP, OKF, BoS, PMR, IUCr
Deliverables: SPARQL endpoint, web interface to the created RDF utilising standard
Linked Data publishing patterns. Potential mashups with other data sources. Bounty
(prize) for community-based mashups.
20. WP 4: Evaluation and dissemination. Needs-driven evaluation of SPARQL endpoint by IUCr editorial staff. Iterative refinement of metadata in triplestore and third-party
links. Publication in Acta Crystallographica and at IUCr congresses and exposure on Acta
Crystallographica website. (See Evaluation in Impact)
Participants: RP, OKF, BoS, PMR, IUCr
Deliverables: Evaluation report. Publications and exposure/promotion by IUCr.
Engagement with Blue Obelisk (chemistry) and Bio2RDF communities for blogging,
and viral dissemination.
21. WP 5: Sustainability and service transfer. All organisations currently run servers
which are guaranteed to run for up to 2 years beyond the project end. They agree to
maintain these and collect usage and maintenance metrics in a public fashion. OB will
transfer all software to both partnering institutions (OKF, IUCr) and the appropriate LOD
bibliography to each.
Participants: RP, OKF, BoS, PMR, IUCr
Deliverables: (i) Transfer of live and growing OB to IUCr editorial staff for alpha
release. (ii) Identification of OB project team within OKF for handover of prototype.
3.2 Project Timetable
22. The following shows the timings for each work package. Degree of effort is
proportional to shading intensity.
[Timetable chart: WP1 to WP5 (rows) against Months 1 to 9 (columns)]
3.3 Risk Assessment
Risk | Probability /5 | Severity /5 | Score (P x S) | Action to Prevent/Manage Risk
Staffing
Staff retention | 3 | 5 | 15 | Ensure staff are satisfied and challenged and have chance to give feedback by means of regular one-to-ones. Ensure sharing of expertise thus enabling cover.
Key academic staff leave | 1 | 5 | 5 | There is sufficient overlap in expertise to allow a reduced delivery. Rescoping would be necessary.
Technical
Unexpected insuperable technical problems | 1 | 4 | 4 | Similar problems already solved by project members. Iterative development allows for graceful degradation of deliverables.
Hardware failure resulting in loss of data | 1 | 4 | 4 | Preventative approaches to data and service backup including automated backup, off-site replication and server redundancy.
External suppliers
Collections are unavailable or intractable | 2 | 4 | 8 | Engage with data acquisition as early as possible to allow alternatives to be arranged (if necessary) in good time.
OKF / IUCr services not supplied | 1 | 5 | 5 | Ongoing engagement with these key stakeholders to maintain level of commitment.
Legal
Copyright infringement | 1 | 5 | 5 | Close consultation with University legal services; establish clear project staff guidelines w.r.t. commercial partners.
3.4 Intellectual Property Position
23. The OKF uses CC-BY for authored content and PDDL for Open data. The University
of Cambridge asserts its rights to IP created by employees in the course of their
employment. All software is distributed under the Artistic Licence (BSD style). The IUCr
uses CC-BY for its Open Access material and will use the services of the OKF to advise on
the best ways of Opening services.
4 Engagement with the Community
24. The OKF and UCMSI have created large communities who enthusiastically adopt
their outputs (see Appendix C and letters of support from IUCr, PLoS, and Carleton).
Demand for OB is high (see Impact section) and a granted project will attract interest and
requests for participation including publishers, researchers and readers. Both partners have
extensive experience of voluntary contributions in Open Source and Open Content projects.
4.1 Dissemination
25. The main targets for dissemination are the primary stakeholders, focusing on the
library community and scientific publishing. OB will disseminate through a number of
channels: (i) presentations, posters and demonstration stands at JISC Information
Environment, Repositories, Infrastructure programme events and All Hands Meetings, and
presentation of papers/posters at IUCr and Brit. Cryst. Assoc. meetings; (ii) formal
publication(s) in Open Access journals; (iii) the OREChem project (app. C); (iv) OKF
meetings (attended by academia, government, public and private research, and publishers);
and (v) blog postings.
26. The project will use part of its dissemination budget to offer prize(s) for the best
mashup(s) using OB outputs.
4.2 Evaluation
27. Bibliography is by default usually not Open (i.e. it may require permission to view,
redistribute or repurpose) and this causes major impedance in creating the web of Linked
Open Data. The impact in the LOD community of the project from this baseline is simple:
Can the project deliver two examples of bibliography which fulfill the following criteria?
1. The resource is fully Open according to the OKD, and there is agreement from the
providers of the bibliography that no IPR appears to have been violated.
2. The bibliography is linked to at least one resource in the LOD collection
(http://en.wikipedia.org/wiki/Linked_Data) such as names, places, chemical entities, etc.
28. This forms a clear evaluation criterion. The Cambridge UL catalogue is not currently
semantically linked and it will be possible to measure the use of the new resource by
online accesses or downloads. The IUCr bibliographic data is not currently disambiguated.
The project will analyse the confusion of identities through manual sampling and repeat this
at project end. The value of a new linked bibliography can be measured by IUCr server stats.
4.3 Stakeholder Analysis
29. Key Stakeholders
UCMSI [High interest, high power] The crystallographic bibliography will immediately
enhance the value of the Crystaleye knowledge base. Bibliographic data will
allow the JISC-CLARION project to link data to publications. The links will be of
value to UCMSI and partners using ORE in OREChem (Microsoft-funded).
OKF [High interest, high power] Immediate enhancement of bibliographic software for
re-use in community. High visibility and prospect of future engagement in
bibliographic projects.
IUCr [High interest, high power] The RDF resource will be of great value in managing
the publication process (e.g. author disambiguation), and supporting online
searching. High visibility in showing quality and commitment of IUCr.
JISC [High interest, high power] Reaping the benefit of investment in Open Access
funding (IUCr) and proving the value of RDF for managing collections and
metadata.
30. Primary Stakeholders
Academic libraries [High interest, high power] The ability to connect libraries to the Open
web will greatly enhance their visibility (especially among non-academics).
Crystallography Researchers [Indirect medium-high interest, medium interest] Enhanced Open
searching will be used. Author disambiguation may help citation management.
31. Secondary Stakeholders
General Library technology [Medium interest, medium power] Technology and standards
developed in this program will support interoperability in libraries.
Researchers [high interest, medium power] The discovery of difficult resources by whatever
means may be critical to (non)academic research.
Semantic Web community [high interest, medium power] Any successful exemplar of Linked
Open Data is of great value in proving the value of the approach.
5 Impact
32. The collaboration of IUCr (see letter) is a very high-profile exemplar to the Open
Access community and to re-distributors of OA material such as the British Library and
UKPubMedCentral. IUCr also maintain and develop the CIF ontology for crystallography
(including bibliographic concepts) and have a communal exposed World Directory of
Crystallographers (an invaluable aid in disambiguation of names, places and institutions).
Another Open Access publisher (PLoS, see letter) will eagerly follow what we do.
33. Researchers are desperate for good metadata; a typical one writes:
"...publishers seem reluctant to maintain quality feeds. I have, in the past, written my own
'screen scrapers' that digest publishers' websites and present them as XML, but the layouts
of the websites often changes frequently and it became too time consuming to keep up. I
would eagerly use any open-source software package that automated this process and
allowed me to aggregate the tables of contents of various journals, customizable feeds of
various databases, etc..."
5.1 Sustainability
34. The project is committed to Open sustainable outcomes for both collections at the
end of the project. The key strategy is for the IUCr to embed the ongoing creation of
bibliography into their publication process; this will incur at worst marginal costs and will
certainly enhance quality. IUCr and OKF commit to sustaining the resources on the web for a
further 2 years with active maintenance and ingest and cost-effective enhancements to
software.
5.2 Evaluation
35. (see also Evaluation section in Engagement) The primary evaluation is through the
OKF and IUCr participating staff. Success is measured by increased use and interest above
the baseline; success in making resources Open; and in linking to existing LOD targets. The
project will collect community feedback, through forms on exposed resources and responses
to blog posts. Bio2RDF (see letters) will be asked for an assessment of the value of the
exposed metadata and the tools used to create and maintain it.
6 Budget
Directly Incurred Staff | Apr10–Mar11 | TOTAL £
Peter Murray-Rust, PI, Grade 11, 30% | £16,089 | £16,089
Total Directly Incurred Staff (A) | £16,089 | £16,089
Non-Staff | Apr10–Mar11 | TOTAL £
Consultant, Technical Lead. 100 days @£400 | £40,000 | £40,000
Consultant, Project manager and developer. 50 days @£400 | £20,000 | £20,000
IUCr, integration work, 10 days @£500 | £5,000 | £5,000
OKF, support and service maintenance for project + 2 years | £3,000 | £3,000
IUCr maintenance and hosting for Crystaleye bibliography | £7,000 | £7,000
Travel and expenses | £1,150 | £1,150
Hardware/software | £500 | £500
Dissemination | £250 | £250
Total Directly Incurred Non-Staff (B) | £76,900 | £76,900
Directly Incurred Total (C = A + B) | £92,989 | £92,989
Directly Allocated | Apr10–Mar11 | TOTAL £
Estates | £3,965 | £3,965
Directly Allocated Total (D) | £3,965 | £3,965
Indirect Costs (E) | £8,733 | £8,733
Total Project Cost (C + D + E) | £105,687 | £105,687
Amount Requested from JISC | £77,050 | £77,050
Institutional and Partner Contributions† | £28,637 | £28,637
Percentage Contributions over the life of the project
JISC | Uni of Cambridge | OKF | IUCr | Total
72.9% | 12.9% | 2.8% | 11.4% | 100%
#FTEs used to calculate indirect & estates charges (staff included): 0.3 (PM-R)
† Institutional Contributions include specific directly incurred contributions from partners
OKF and IUCr, 20% of Cambridge’s fEC and discounts on the rates from both of the
consultants involved. IUCr are also contributing the use of the Acta E corpus. It is difficult to
put a figure to this contribution, but the journal itself represents many years of effort. CUL
are similarly contributing a unique catalogue with many unique records.
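For readers who wish to check the arithmetic of the budget, the short script below recomputes the subtotals, the total project cost, the partner contribution and the JISC share from the line items listed above. The figures are copied from the budget; the variable names are ours.

```python
# Line items in pounds, copied from the budget above.
staff = {"Peter Murray-Rust, PI, Grade 11, 30%": 16089}
non_staff = {
    "Consultant, Technical Lead (100 days @ 400)": 100 * 400,
    "Consultant, Project manager and developer (50 days @ 400)": 50 * 400,
    "IUCr, integration work (10 days @ 500)": 10 * 500,
    "OKF, support and service maintenance": 3000,
    "IUCr maintenance and hosting": 7000,
    "Travel and expenses": 1150,
    "Hardware/software": 500,
    "Dissemination": 250,
}

A = sum(staff.values())      # Total Directly Incurred Staff
B = sum(non_staff.values())  # Total Directly Incurred Non-Staff
C = A + B                    # Directly Incurred Total
D = 3965                     # Directly Allocated (Estates)
E = 8733                     # Indirect Costs
total = C + D + E            # Total Project Cost

requested = 77050            # Amount Requested from JISC
contributions = total - requested
jisc_share = round(100 * requested / total, 1)

print(B, C, total, contributions, jisc_share)
```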
6.1 Value for Money
36. Costs have been kept to a minimum: hardware is limited to an inexpensive laptop and
travel to the minimum needed for dissemination and project F2F meetings. The benefits from
the project are primarily in favour of the community rather than the University of Cambridge
specifically; accordingly we feel an institutional commitment of 20% of the fEC is appropriate.
Taking the impact (especially the sustainability of the outputs) into consideration with the
contributions by the OKF, IUCr and the day rate discounts from the individual partners, we
believe this project represents excellent value for money.
7 Experience of the Project Team
37. The Unilever Centre for Molecular Science Informatics (UCMSI) at the University
of Cambridge is a world-leading centre for chemical informatics with over 300 publications,
many grants and substantial continued funding from Unilever.
38. The Open Knowledge Foundation (OKF) is a not-for-profit organization founded in
2004 and dedicated to promoting open knowledge in all its forms. The Foundation has
pioneered work in developing robust legal mechanisms for sharing data and has taken a
central role in helping to develop standards for openness in knowledge and services. It also
works on the infrastructure for open data users and producers. This includes CKAN, a
community-driven resource for open data, which is currently being used by the UK
Government for its Data.gov.uk project. It hosts a number of open data and content projects,
from a web application that represents UK public finance using best-of-breed visualization
technologies to a database of artistic works that have fallen into the public domain.
The Foundation has a working group for bibliographic information
(http://wiki.okfn.org/wg/bibliography) with many key individuals and institutions working on
open bibliographic information in the EU and elsewhere. Its Bibliographica project is funded
by the University of Edinburgh's IDEALab; it is collaborating with Wikimedia Germany.
39. The International Union of Crystallography (IUCr) has been a longstanding
partner of the UCMSI and has funded summer studentships and contributed in kind (to the
EPSRC/SciBorg project). OB will concentrate on bibliographic metadata and identify the
most useful areas for automation. IUCr staff are experienced in document searching and
manual annotation of metadata in documents and will be invaluable in project evaluation.
specific dictionaries, as in project CheTA.
40. Principal Investigator: Dr Peter Murray-Rust (PMR, UCMSI) is an Emeritus Reader
in Molecular Informatics with nearly 200 publications. He develops the Chemical Semantic
Web with support from DTI/EPSRC ("Molecular Standards for the Grid"), Unilever Research
("polymer informatics"), Microsoft Research (Chem4Word, OREChem), EPSRC (SciBorg),
JISC (SPECTRa, SPECTRa-T (with Cambridge UL), eCrystals, TheOREm, ICE-Theorem,
CLARION, CheTA, AMI), Royal Soc. Chem. (OSCAR), IBM/Accelrys ("Materials Grid", with
Earth Sci), IUCr, Nature, Merck, MDL and COST. He is a member of the OKF Advisory Board
and a co-author of their IsItOpen resource and Panton Principles; a founder member of the
Blue Obelisk Open Source group for chemistry; a member of the Advisory Board of
UKPubMedCentral (British Library); and a member of the IUCr's COMCIFS
(dictionary/metadata) group. Peter will focus on the project direction and the co-ordination
of the project partners, and will contribute major software enhancements to his JUMBO
library for converting CIF to RDF.
http://en.wikipedia.org/wiki/Peter_Murray-Rust, http://wwmm.ch.cam.ac.uk/
41. Dr. Rufus Pollock (RP; Mead Fellow in Economics, Emmanuel College, University of
Cambridge; Chairman, OKF) will contribute to most areas of OB’s work, focusing on the
project direction, management and dissemination. He will also develop the metadata design,
store architecture and disambiguation work. RP has managed several similar projects
including a 2-yr EU-funded project, CKAN development for data.gov.uk and several others.
He is co-founder of the Open Knowledge Foundation and a member of its Board, with
extensive experience of the legal, social and technical aspects of open information and
bibliographic data in particular. He was lead economist on a 2-yr EU-funded project (size and
value of the public domain) [http://www.rightscom.com/Default.aspx?tabid=20397]. He has
worked extensively with bibliographic metadata, including the full CUL catalogue, and on
developing databases, processing of bibliographic formats (including MARC) and matching
of entities from different datasets.
42. Ben O'Steen (BoS) has 13 years' IT development experience and has most recently
worked at the Oxford University Library Service as the software architect for the Bodleian
Library's DAMS (Digital Asset Management System). Extensive experience working with
RDF, bibliographic and related metadata standards and distributed system design. Lead
developer for Oxford on many projects, including the JISC-funded BRII
[http://brii.bodleian.ox.ac.uk/] project and the Preserv2 project (partnered with the British
Library and The National Archives) [http://preserv.eprints.org/]. The infrastructure he has
designed and built enables the storage and preservation of complex semantic metadata
models, a system that underpins projects and services such as the joint JISC and Mellon
Foundation funded 'Cultures of Knowledge' [http://www.history.ox.ac.uk/cofk/infrastructure]
project and 'Medieval libraries of Great Britain'
[http://www.history.ox.ac.uk/research/projects/MLGB3.htm]. He was part of the winning team
of Repository Challenge 08 with an entry that provided an RDF Linked Data view on two of
the leading open source repository systems [http://bit.ly/dgtZvR]. O'Steen will contribute to
most areas of OB’s infrastructure, including metadata design and realisation, triple storage
and SPARQL endpoints, and user-facing interfaces and query systems.
Appendix A: Letters of Support
43. Eight Letters of Support are attached as .DOCs or .PDFs in the .ZIP archive:
1 = Prof. Jeremy Sanders, Cambridge University, Head of School of Physical Sciences
“Letter of support - Jeremy Sanders Apr10 - JISC funding.doc”
2 = Dr Peter Strickland, International Union of Crystallography
“Letter of support - Peter Strickland - IUCr Apr10.pdf”
3 = Dr Anne Jarvis, Cambridge University Librarian
“Letter of support - Anne Jarvis - Cambridge Uni Library Apr10 - Jisc Bid 14-04-2010.pdf”
4 = Dr Mark Patterson, Public Library of Science (PLoS)
“Letter of support - Mark Patterson - PLoS Apr10.docx”
5 = Dr Rufus Pollock, Open Knowledge Foundation
“Letter of support - Rufus Pollack - Open Knowledge Foundation 17Apr10 cambridge_support.pdf”
6 = Dr Benjamin White, British Library
“Letter of support - Benjamin White - British Library Apr10 - jisc bid letter.pdf”
7 = Dr Michel Dumontier, Bio2RDF, Carleton University
“Letter of support - Bio2RDF Apr10 - peter_murray-rust_LOS-open_bibliograph.pdf”
8 = Dr David Shotton, Oxford University, “Open Citation” project leader
“Letter of Support - David Shotton Oxford for Peter Murray-Rust JISC application.pdf”
Appendix B: Freedom of Information
FOI Withheld Information Form
We would like JISC to consider withholding the following sections or paragraphs from
disclosure, should the contents of this proposal be requested under the Freedom of
Information Act, or if we are successful in our bid for funding and our project proposal is made
available on JISC’s website.
We acknowledge that the FOI Withheld Information Form is of indicative value only and that
JISC may nevertheless be obliged to disclose this information in accordance with the
requirements of the Act. We acknowledge that the final decision on disclosure rests with
JISC.
Section / Paragraph No. | Relevant exemption from disclosure under FOI | Justification
Appendix C: Additional JISC Projects
44. UCSMI has had an ongoing programme of developing semantic resources in science
including repositories (SPECTRa, SPECTRa-T, and CLARION), semantic infrastructure
(TheOREm, ICE-Theorem) and software (OMII-Engage for OSCAR3). Currently (see below)
Dr Murray-Rust leads 3 projects (CLARION, CheTA, AMI) and participates in I2S2.
The common theme is to develop an Open infrastructure and applications with a primary aim
to get community use and participation. The success of this is shown in projects such as
OREChem funded by Microsoft (PI Carl Lagoze, Cornell) which uses software developed in
SPECTRa and other projects to create RDF resources. As evidence of community
engagement our recent release of one item of software (Chem4Word) has had over 80,000
downloads within one month of release. In this we have worked with Microsoft to create one
of their first fully Open Source scientific offerings.
We have had in-depth discussion with David Shotton on his "Open Citation" project in this
strand. Our projects are complementary; Open Citations will extract citations from OA
publications, and Open Bibliography will extract bibliographic information and enhance it
with Open metadata. This is an excellent example of two groups which have communicated
regularly, shared advisory functions and designed projects deliberately to be
complementary.
CLARION. Chemical Laboratory Repository In/Organic Notebooks.
http://www.jisc.ac.uk/whatwedo/programmes/inf11/sue2/clarion.aspx
The Cambridge Chemistry Department has a basic repository which stores crystallographic
data.
Project CLARION (Cambridge Laboratory Repository In/Organic Notebooks) will create an
enhanced repository that captures core types of chemistry data and ensures their access and
preservation. The Chemistry Department is implementing a commercial Electronic Laboratory
Notebook (ELN) system; CLARION will work closely with the ELN team to create a system for
ingesting chemistry data directly into the repository with minimum effort by the researcher.
CheTA. Chemistry Using Text Annotations:
http://www.jisc.ac.uk/whatwedo/programmes/inf11/resdis/cheta.aspx
This project (CheTA) will integrate Cambridge's chemical text mining tool OSCAR with
the U-Compare workflow infrastructure developed by NaCTeM and others. This integration
adds chemistry to the world's largest public collection of interoperable text mining tools and
will be highly valued by influential stakeholders both in the JISC community and the wider
chemistry community.
AMI. The chemist's amanuensis:
http://www.jisc.ac.uk/publications/briefingpapers/2010/bpvrev3.aspx
45. AMI is developing a speech interface to enable chemists to access files and archives
when their hands are occupied in the lab.
I2S2: Infrastructure for Integration in Structural Sciences
http://www.ukoln.ac.uk/projects/I2S2/
I2S2 will identify requirements for a data-driven research infrastructure in "Structural
Science", focusing on the domain of Chemistry, but with a view towards inter-disciplinary
application. I2S2 will develop use cases that explore perspectives of scale and
complexity and research discipline throughout the data lifecycle.
46. Current JISC-funded projects involving Ben O'Steen
ADMIRAL: A Data Management Infrastructure for Research
JISC PIMS Record: https://pims.jisc.ac.uk/projects/view/1523
Project site: http://imageweb.zoo.ox.ac.uk/wiki/index.php/ADMIRAL
Ben O'Steen as collaborator on behalf of Bodleian Libraries (as providers of 'Oxford
University Data Store')
The purpose of the ADMIRAL Project is to create a two-tier federated data management
infrastructure for use by life science researchers, that will provide services (a) to meet their
local data management needs for the collection, digital organization, metadata annotation and
controlled sharing of biological datasets; and (b) to provide an easy and secure route for
archiving annotated datasets to an institutional repository, The Oxford University Data Store,
for long-term preservation and access, complete with assigned Digital Object Identifiers and
Creative Commons open access licences.
Sudamih: Supporting Data Management Infrastructure for the Humanities
JISC PIMS Record: https://pims.jisc.ac.uk/projects/view/1524
Project site: http://sudamih.oucs.ox.ac.uk/
Ben O'Steen as 'Metadata Consultant' on behalf of Bodleian Libraries
The Supporting Data Management Infrastructure for the Humanities (Sudamih) project aims
to address a coherent range of requirements for the more effective management of data
(broadly defined) within the Humanities at an institutional level. The project is an
intra-institutional collaboration led by Computing Services (OUCS) and bringing together the
Library (OULS) and the Research Service Office (RSO) with researchers from the Institute of
Archaeology and the Faculty of History.
Recently completed JISC projects - Ben O'Steen
Building the Research Information Infrastructure
JISC PIMS record: https://pims.jisc.ac.uk/projects/view/1205
Project site: http://brii.ouls.ox.ac.uk/
Ben O'Steen as lead developer/architect.
BRII will enable efficient sharing of research management information using semantic web
technologies. Ontologies and taxonomies will define and describe data objects (e.g. people,
research groups, funding agencies, publications, research ‘themes’) to forge connections
between them and provide web-based services to disseminate and reuse this information in
new contexts. It will create efficiencies, greater accuracy of data, and better discovery of
research activities at Oxford. University data sources will include academic departments and
central services. Half the project will be devoted to stakeholder input, collaboration and
‘buy-in’ aimed at evolving current work practices and processes.
The project started in September 2008 and ended in March 2010.
Open Access Repository System for Forced Migration Online
JISC PIMS Record: https://pims.jisc.ac.uk/projects/view/417
Service site: http://www.forcedmigration.org/
Ben O'Steen as infrastructure developer/collaborator on behalf of Bodleian Libraries
This project will migrate a fragmented digital repository of scholarly resources, currently
managed by two proprietary software systems, to a single open source platform. This
repository, based at the Refugee Studies Centre, University of Oxford, is the largest in the
world on its subject area of forced migration. It is a unique, widely used and constantly
expanding collection of resources. The enhancement of this repository will make it more
manageable for those maintaining it, and also make it globally interoperable with other open
systems, as well as with the University of Oxford’s institutional repository.
DISC-UK DataShare
https://pims.jisc.ac.uk/projects/view/398
Ben O'Steen as collaborator on behalf of Bodleian Libraries at the University of Oxford
DISC-UK DataShare, led by Edina, arises from an existing UK consortium of data support
professionals working in departments and academic libraries in universities (Data Information
Specialists Committee-UK), and builds on an international network with a tradition of data
sharing and data archiving dating back to the 1960s in the social sciences. By working
together across four universities and internally with colleagues already engaged in managing
open access repositories for e-prints, this partnership will introduce and test a new model of
data sharing and archiving to UK research institutions. By supporting academics within the
four partner institutions who wish to share datasets on which written research outputs are
based, this network of institution-based data repositories develops a niche model for deposit
of ‘orphaned datasets’ currently filled neither by centralised subject-domain data
archives/centres/grids nor by e-print based institutional repositories (IRs).