from rpi.edu - Tetherless World Constellation
PRE_PUBLICATION DRAFT --- DO NOT CITE

The Importance of Authoritative URI Design Schemes for Open Government Data

Alexei Bulazel, Dominic DiFranzo, John S. Erickson, James A. Hendler
Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY
{bulaza, difrad, erickj4, hendlj2}@rpi.edu

Abstract

A major challenge when working with open government data is managing, connecting, and understanding the links between references to entities found across multiple datasets when these datasets use different vocabularies to refer to identical entities (e.g., one dataset may refer to Microsoft as “Microsoft”, another may refer to the company by its SEC filing number, “0000789019”, and a third may use its stock ticker, “MSFT”). In this paper we propose a naming scheme based on Web URLs that enables unambiguous naming and linking of datasets and, more importantly, data elements, across the Web. We further describe our ongoing work to demonstrate the implementation and authoritative management of such schemes through a class of web service we refer to as the “instance hub”.

When working with linked government data, provided either directly from governments via open government programs or through other sources, the issue of resolving inconsistencies in naming schemes is particularly important, as various agencies have disparate conventions for referring to the same concepts and entities. Using linked data technologies we have created instance hubs to assist in the management and linking of entity references for collections of categorically and hierarchically related entities. Instance hubs are of particular interest to governments engaged in the publication of linked open government data, as they can help data consumers make better sense of published data and can provide a starting point for development of linked data applications.
In this paper we present our findings from the ongoing development of a prototype instance hub at the Tetherless World Constellation at Rensselaer Polytechnic Institute (TWC RPI). The TWC RPI Instance Hub enables experimentation and verification of proposed URI design schemes for open government data, especially those developed at TWC in collaboration with the United States Data.gov program. We discuss core principles of the TWC RPI Instance Hub design and implementation, and summarize how we have used our instance hub to demonstrate the possibilities for authoritative entity references across a number of heterogeneous categories commonly found in open government data, including countries, federal agencies, states, counties, crops, and toxic chemicals.

Introduction

Motivated by the Obama administration's government transparency initiatives, in May 2009 the United States launched the Data.gov web portal with a catalog of 47 datasets containing government information that had previously not been easily available online [Kirkpatrick, 2009]. The aggregation of datasets in a single online portal at Data.gov made them easier to find and search, and has led to the publication of datasets previously not available online. During its first year Data.gov grew to include more than 250,000 datasets and has inspired the creation of many hundreds of applications and services [Kundra 2010].

The launch of Data.gov was a key event near the beginning of a worldwide movement of open government data¹ publication as governments, NGOs, and many other institutions began to make their data openly accessible to interested parties.
During Data.gov’s first year, US municipalities including San Francisco and New York City, and states including California, Utah, Michigan, and Massachusetts, launched open data portals; around the world, countries including the UK, Canada, and Australia, as well as organizations such as the World Bank, followed the Data.gov lead. By early 2013, the International Open Government Dataset Search² project of the Tetherless World Constellation at Rensselaer Polytechnic Institute (TWC RPI) had recorded nearly 200 catalogs from over 40 countries, totaling more than a million datasets spanning a vast array of topics [Erickson et al, 2011].

The significant growth in number and size of open government data catalogs since 2009 has been made possible by the emergence of an open government data ecosystem consisting of policy makers, agencies (as providers and consumers), data experts, independent software developers and service providers, academia, and citizen stakeholders. The publication of widely varied data has inspired a wide range of applications and services, has provided essential data for journalists, bloggers and activists, and has fueled academic research. In turn, demand from stakeholders has increased the quantity, quality and diversity of this data.

1 "Open government data" for the purposes of this paper refers to data released by government agencies without license restrictions, usually for the purpose of fulfilling a transparency policy. In the case of open government data released through the US Data.gov portal, this was the "Open Government Directive" of December 2009. [Orszag 2009] Note that non-governmental groups have proposed objective criteria for defining open government data, including and especially "The Annotated Eight Principles of Open Government Data" [http://opengovdata.org/], but since government providers such as Data.gov have not formally adopted its tenets, it is not appropriate to use that definition here.
2 http://logd.tw.rpi.edu/page/international_dataset_catalog_search

The growth in the availability of open government data from providers around the world is encouraging, but for potential users of this data, including developers of applications and services, the variety of formats and practices used to publish the data can cause interoperability, scalability, and usability problems. Government datasets are typically published "as is" (i.e., using a variety of structures and formats), requiring substantial human effort to clean them up for machine processing and to make them comprehensible. One of the most challenging issues for consumers of government datasets is establishing the equivalence of named entities within different datasets.

The Importance of Unambiguous Naming

An emerging issue for providers and consumers of open government data is the creation and interconnection of names for common entities referenced in datasets. For example, one dataset may refer to the state of New York as “NY” while another may use the Federal Information Processing Standards (FIPS) code "36" in reference to the state. The lack of common naming schemes across datasets hinders developers attempting to build applications or to otherwise derive insights through data “mashups,” wherein datasets are combined and analyzed together to show a broader context.

The process of finding common references to entities across multiple datasets is further complicated by unintentional naming clashes. For example, “NY” might refer either to the city or to the state of New York. It would be wrong to make inferences about the entire state from data about the city, and vice versa. The name “NY” is not sufficiently unique and does not provide us with enough information to accurately link multiple datasets together.
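The mismatch described above can be illustrated with a minimal sketch. The two datasets, their values, and the `naive_join` helper are hypothetical and exist only for illustration; the only facts taken from real conventions are the postal abbreviations and the FIPS state codes ("36" for New York, "09" for Connecticut).

```python
# Two hypothetical datasets describing the same states with different keys.
# All numeric values are illustrative placeholders, not real figures.
spending_by_abbrev = {"NY": 120, "CT": 45}   # keyed by two-letter abbreviation
facilities_by_fips = {"36": 7, "09": 3}      # keyed by FIPS state code


def naive_join(left, right):
    """Join two keyed datasets on identical key strings only."""
    return {key: (left[key], right[key]) for key in left if key in right}


# Both datasets describe New York and Connecticut, yet nothing matches,
# because the key strings differ even though the entities are the same.
joined = naive_join(spending_by_abbrev, facilities_by_fips)
print(joined)  # prints {}
```

A join on literal strings finds no rows in common here, and it would silently produce wrong rows if one dataset used “NY” for the city and the other used it for the state.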
One solution would be to name these common concepts using Uniform Resource Identifiers (URIs), the fundamental identification scheme for resources on the Web.³

3 The acronym “URI” stands for “uniform resource identifier”, which includes both URLs (uniform resource locators) and URNs (uniform resource names). URLs specify how to reach a specific resource online (e.g., http://example.com), and URNs identify a specific resource within a given namespace (e.g., urn:isbn:0-12-385965-4 uniquely identifies “Semantic Web for the Working Ontologist” by James Hendler and Dean Allemang within the space of all books with ISBNs). URIs may be URLs, URNs, or both at once, and we use the term to broadly describe web naming schemes that may be used to describe or identify data entities. For more discussion on URIs, see <http://www.w3.org/TR/uri-clarification/> and RFC 3986 <http://tools.ietf.org/html/rfc3986>.

URIs are ideal for uniquely naming and linking common entities found throughout open government data. Given careful design, the structure and syntax of URIs can provide clarity as to the controlling authority for the entities they name and some insight as to their purpose. For example, a news item found on the "rpi.edu" domain referring to an event at RPI would likely be considered more trustworthy than news from other sources, since Rensselaer Polytechnic Institute is known to control the domain "rpi.edu." Similarly, government agencies can use URIs in domains they control (e.g., “.gov” for the US government) to assert authority over the names of entities they oversee.

URIs, Structured Data and Instance Hubs

The structural freedom of the Web has been vital to its growth, but in practice has also led to the propagation of a large amount of factually incorrect information.
As the Web has grown and become increasingly vital to human civilization around the world, online resources have naturally stratified into varying levels of authority within their respective topical domains. A government agency's official web presence can serve as the authoritative source of information about its activities, in contrast to sources with no official status like Wikipedia, the news media, or social networks. The use of URIs with clear, systematically structured syntax to name resources allows providers to express authoritative references that both name and describe entities to both computers and humans.

An instance hub is a web-based service that implements an authoritative, structured URI scheme for collections of categorically and hierarchically related entities. Instance hubs are of particular interest when applied to linked open government data (LOGD), a method of structured data publication that exposes, shares, and connects data, information, and knowledge using URIs and the Resource Description Framework (RDF).⁴ [Ding 2012] Instance hubs can provide key infrastructure for managing references to entities commonly found in government published datasets. In the next section we present a proposed set of URI design principles suitable for adoption by governments involved in the publication of LOGD, which we have used as the basis for the TWC RPI Instance Hub implementation. These URI Design Principles are also available online on the TWC website.⁵

4 http://www.w3.org/RDF/
5 http://logd.tw.rpi.edu/instance-hub-uri-design

Design Principles

An increasing number of governments and government agencies have begun publishing linked open government data. From this work, policies and best practices have emerged, and continue to emerge, for the use of URIs in open government data release.
TWC RPI is working extensively with key players in the United States and international linked open government data initiatives to develop URI schemes that are useful today and in the future. Developing ways of handling authoritative data release is especially important in the open government data space, as developers of linked data and Semantic Web applications have few tools with which to verify that the data they are using is factually correct or at least has been obtained from authoritative sources. Methods and infrastructure for clearly establishing a government's authority over such data are essential. Governments in turn have similar concerns, and want to ensure that users can tell the difference between data they directly publish and data that has been released unofficially by non-authoritative sources.

Linking to authoritative data can be achieved by associating dataset entity references (to government agencies, states, etc.) with named instance hub URIs that present authoritative metadata about these entities (names, statistics, official logos, etc.). These authoritative URIs may then in turn be used by developers to foster easy linking between datasets and to disambiguate entity references.

The Linked Data Cloud [LODCloud 2011] illustrates the Semantic Web’s heavy dependence on DBpedia as a central locus for online data linking and its role as the de facto source of authoritative URIs. While DBpedia has been immensely useful for building data infrastructure online, it is not well suited for this role when applied to government data. DBpedia presents crowdsourced information from Wikipedia that has been converted to RDF, but the organizations with authority over the actual entities referenced in DBpedia data have no way to correct inconsistencies or otherwise control the publication of information about entities under their control, especially those appearing in other linked sources.
Great strides have been made in the publication of open government data over the past several years using DBpedia as a source of entity identifiers, but linking government data using DBpedia or other non-governmental sources as central loci is not ideal. Instance hubs provide a way for government data providers to share authoritative information about the entities referenced in their linked data publications.

[Figure: The Linked Open Data Cloud⁶ represents the inter-linking of datasets published in Linked Data format by contributors to the Linking Open Data community project and other individuals and organisations. The depiction is based on metadata collected and curated by contributors to datahub.io.]

Instance Hubs and Entity Name Reconciliation

Instance hubs can serve as central loci for entity linking on the Semantic Web, presenting authoritative, descriptive URIs which allow for easy disambiguation of references. A common issue encountered when combining linked data from disparate datasets is the use of different naming schemes to refer to the same entity across datasets. Instance hubs were first conceived as a solution to this problem, with the intent that they could present multiple alternative names for a given entity and thereby foster linking between datasets using different naming schemes. For example, states may be referred to by commonly used full names (e.g., “New York”), official legal names (e.g., “Commonwealth of Massachusetts”), or any number of other schemes such as two-letter abbreviations or FIPS codes.

6 http://lod-cloud.net/versions/2011-09-19/lod-cloud.html

URIs managed through instance hubs are expected to be dereferenceable to obtain descriptive data about the entities they represent, both as human-readable web pages and as RDF exposed through HTTP content negotiation.
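As a rough illustration of the content negotiation mentioned above, the sketch below picks a representation from an HTTP Accept header. It is not the instance hub's actual implementation: the `REPRESENTATIONS` mapping and the `choose_representation` function are hypothetical, wildcards such as `*/*` are deliberately ignored, and only the media types themselves (text/html, text/turtle, application/rdf+xml) are standard.

```python
# Assumed mapping from standard media types to the representation served.
REPRESENTATIONS = {
    "text/html": "html",
    "text/turtle": "ttl",
    "application/rdf+xml": "xml",
}


def choose_representation(accept_header, default="html"):
    """Pick the supported media type with the highest q-value in an Accept header."""
    best, best_q = None, 0.0
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # a media range without an explicit q parameter defaults to 1
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if media_type in REPRESENTATIONS and quality > best_q:
            best, best_q = REPRESENTATIONS[media_type], quality
    # Fall back to the human-readable page when nothing supported was requested.
    return best or default
```

Under these assumptions a browser sending `text/html` would receive the human-readable page, while a linked data client sending `text/turtle` would receive RDF from the same URI.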
Instance hubs may also be used to aid in dataset discovery, by presenting links to datasets associated with each instance housed within the instance hub. Instance hubs serve several purposes at once: providing disambiguation of entity references in datasets, presenting easily accessible metadata about entities, fostering the linking of online open data, and aiding in dataset discovery.

To help address these many potential uses of instance hub technology, we created a URI scheme that provides rich descriptive information about entities for humans examining open government data, while encouraging intuitive navigation and exploration of this data. The implementation of this scheme in the instance hub shows its utility for data publishing, and we hope to see government support and adoption of this scheme in order to further test its viability.

Design Goal 1: Easily Rehosted URIs

One of our most important goals in creating the TWC instance hub was to provide a way for government agencies to act as an authority in publishing linked data, using entity naming schemes that they create and manage. Our solution was to create a URI scheme that is easily rehostable, meaning that URIs are not tightly bound to the specific domains at which they are hosted. These URIs should be easily transformable across multiple “base” domain names. For example, a URI pattern used to refer to toxic chemicals as classified by the US Environmental Protection Agency is:

http://logd.tw.rpi.edu/id/us/fed/agency/Environmental_Protection_Agency/chemical/XXXXX

This URI can easily be rehosted on other sites without losing any meaning:

http://example.com/id/us/fed/agency/Environmental_Protection_Agency/chemical/XXXXX

A more interesting use case is if the Environmental Protection Agency itself decided to publish RDF datasets on chemicals.
In this case, since the agency’s interest would be in providing its own authoritative but agency-specific view of toxic chemicals and not necessarily serving as the sole representation of that knowledge for the government, it might choose to adopt an alternative domain-specific URI scheme such as:

http://epa.gov/id/chemical/XXXXX

If Data.gov decided to also publish chemical data from EPA, they might do so at

http://data.gov/id/fed/agency/Environmental_Protection_Agency/chemical/XXXXX

skipping the “us” in the URI, as Data.gov’s association with the US government is obvious and implied. Shortening these URIs to minimal length is by no means a design requirement, but it is a possibility for publishers not interested in presenting a full explicit URI hierarchy.

When chemicals are referenced in datasets from these agencies, the responsible agencies can link the instance hub URI for each chemical, helping consumers of this data to further link it with their own or with other datasets, as well as providing them with additional metadata about the chemicals beyond what is available in the datasets they are linking. This additional metadata can be used to enrich applications and data mashups created by consumers and developers.

Design Goal 2: Concise URIs

A second guideline for creating instance hub URIs is that they should be concise, with as little extra material as possible. Our practice in building the TWC Instance Hub has been to create concise URIs that enable instance hub users to intuitively navigate through the instance hub by interpreting and manipulating URIs. For the TWC instance hub implementation it was important that users should be able to navigate to any slash-delimited prefix of a given URI and at the least be presented with a logical HTML page that reflects the content one would expect to find at that page, including category and subcategory listings.
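Both properties discussed so far, rehosting and navigation by URI prefix, reduce to simple string manipulation. The following is a minimal sketch under the assumption that every instance hub URI contains the “/id/” root described in this paper; the function names `rehost` and `ancestors` are hypothetical and not part of the instance hub software.

```python
def rehost(uri, new_base):
    """Move an instance hub URI to another base, keeping the /id/... path."""
    _, sep, tail = uri.partition("/id/")
    if not sep:
        raise ValueError("not an instance hub URI: " + uri)
    return new_base.rstrip("/") + sep + tail


def ancestors(uri):
    """List the URIs reached by repeatedly removing the last path segment."""
    head, sep, tail = uri.partition("/id/")
    if not sep:
        raise ValueError("not an instance hub URI: " + uri)
    segments = tail.strip("/").split("/")
    # Nearest ancestor first, ending with the first segment after /id/.
    return [head + sep + "/".join(segments[:n])
            for n in range(len(segments) - 1, 0, -1)]


epa_uri = ("http://logd.tw.rpi.edu/id/us/fed/agency/"
           "Environmental_Protection_Agency/chemical/XXXXX")
print(rehost(epa_uri, "http://example.com"))
print(ancestors("http://logd.tw.rpi.edu/id/us/state/New_York"))
```

The shortened forms such as http://epa.gov/id/chemical/XXXXX are not produced by this sketch, since dropping path segments is a publisher's policy decision rather than a mechanical transformation.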
Concise and clean URI design has been an important part of realizing this goal.

Phil Archer’s “10 Rules For Persistent URIs” [Archer 2012] provides design guidance based on a 2012 EU survey studying approximately twenty cases from EU agencies and services, and EU Member States and standardization bodies and initiatives, where URI management and persistence have been subject to policy. These rules have been met for the most part in the design and implementation of the TWC instance hub.⁷

By keeping URI schemes consistent and only changing the extension associated with the final “token” in the URI, we enable easy and consistent navigation through URIs. For example, a user viewing the page at /id/us/state/New_York can simply remove the last token of the URI (“/New_York”) and view all states at /id/us/state, and from there move to /id/us/state/Connecticut, all without the confusion of switching between id and id_page as would be necessary in previous instance hub versions.

Design Goal 3: Multi-domain URIs

This goal stipulates that URI design patterns should be applicable to many domains of authority, including national identifiers (e.g., governmental agencies, states, provinces, zip codes); state-level identifiers (e.g., counties, congressional districts); and agency-level identifiers (e.g., EPA facilities). This is critical to human-friendly design and contributes to intuitive interpretation of the generated URIs, but can also potentially be the cause of lengthy URIs. While the resultant URIs may be long, in the TWC RPI instance hub we have made them only as long as needed, keeping them as concise as possible while maintaining a logical hierarchical structure.

7 One rule we did not follow is Archer’s recommendation that URIs not contain file formats (“.html”, “.ttl”, “.xml”, etc.). When visiting a given URI in a web browser, the user is redirected to the same URI with “.html” appended to the end, and they may then change this extension to view other Semantic Web file formats. Traditional Semantic Web “content negotiation” via HTTP request headers is also still supported. This behavior comes from the LODSPeaKr publishing framework used to build the instance hub, discussed later in this paper.

In the case of entities that are logical sub-parts of other entities, URIs simply take the larger entity’s URI and append a unique token to it. This pattern is used in sub-agency and state county URIs; for example, the Department of Energy’s Office of Science can be found at /id/us/fed/agency/Department_of_Energy/Office_of_Science, and New York state’s Rensselaer county is found at /id/us/state/New_York/Rensselaer. In the RPI instance hub these sub-entities are associated with their larger “parent” entities⁸, and are listed in documents describing the larger entities (both HTML for humans and RDF documents about the entities).

In other cases, a category of entities may be logically related to another entity, but may not be a direct sub-class, so a category identifier is appended to the entity before an identifier token for a related entity is placed at the end. Examples of this pattern include crops as defined by the USDA and toxic chemicals as defined by the EPA; a listing of crops may be found at /id/us/fed/agency/Department_of_Agriculture/crop, with a crop-specific identifier token following “crop” (e.g., “soybean”, “cauliflower”), and toxics may be found at /id/us/fed/agency/Environmental_Protection_Agency/chemical, with specific identifiers following “chemical” (e.g., “hydrochloric_acid”, “hexachlorocyclopentadiene”).

URI Design Details

The TWC RPI instance hub URI patterns implement and build upon a set of design goals for persistent open government data URIs recommended earlier by TWC researchers.
As noted in the previous section, the three main design goals set forward in our recommendation are that URIs should be easily rehosted, should be concise, and should be cross-domain. The basic URI structure upon which the instance hub is based is:

“http://” [base] / “id” / ( [category] / [token]* )+ / [token]+

8 Via skos:broader relations.

[Figure: Finite state machine representation of the URI structure, generated via http://www.regexper.com]

“Base” denotes the base domain of the site hosting the instance hub. This allows other government organizations to easily rehost our instance hub URIs under their own domain names. The “id” part of the URI scheme ensures that the base namespace of the site is not "polluted" with identifiers for entities; from this root, all other instance hub URIs follow.

Instance hub entities are disambiguated by a combination of categories and tokens included in their URIs. Categories are broad and provide a classification for multiple entities. For example, “agency” could be a category that contains the government agencies of a particular country. No entity URI may end with a category token, since categories are themselves not entities. Examples of URIs ending with categories include /id/country and /id/us/fed/agency; these URIs might resolve to a listing of the entities described by the URI category hierarchy, but no actual entity exists at the URI.

[Figure: A tree-style representation of URI “categories”, which present listings of the entities that are logically categorized beneath them.]

[Figure: Example “category” listing of country resources available in the instance hub.]

Tokens serve to provide unique identification of a given entity within a specific category, or following from another token.
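The structural rules above (an “id” root, at least one category directly after it, and no entity URI ending in a category) can be checked mechanically once a category vocabulary is fixed. The sketch below is only an illustration: the `CATEGORIES` set is an assumed vocabulary drawn from the path segments used in this paper's examples, and treating “us”, “ca”, and “fed” as category-like segments is our assumption rather than something the scheme states.

```python
import re

# Assumed category vocabulary, taken from path segments in the examples.
CATEGORIES = {"country", "us", "ca", "fed", "agency", "state", "crop", "chemical"}

# Path segments are limited here to letters, digits, underscores, and hyphens.
SEGMENT = re.compile(r"^[A-Za-z0-9_\-]+$")


def classify(uri):
    """Return 'entity', 'category', or 'invalid' for an instance hub URI."""
    head, sep, tail = uri.partition("/id/")
    if not sep or not head.startswith("http://"):
        return "invalid"
    segments = tail.strip("/").split("/")
    if not all(SEGMENT.match(segment) for segment in segments):
        return "invalid"
    if segments[0] not in CATEGORIES:
        return "invalid"  # at least one category must follow the "id" root
    # A URI ending in a category names a listing, not an entity.
    return "category" if segments[-1] in CATEGORIES else "entity"
```

With this vocabulary, http://logd.tw.rpi.edu/id/us/fed/agency is classified as a category listing, while the same URI with a trailing agency token is classified as an entity.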
For example, the token “Department_of_Justice” is ambiguous by itself, but could uniquely identify the United States Department of Justice at /id/us/fed/agency/Department_of_Justice versus the Canadian Department of Justice at /id/ca/fed/agency/Department_of_Justice.

While all entities must proceed from the root of “id” with at least one category in their URI, entity URIs may then be formed by any combination of categories and tokens so long as the URI ends with a token. Federal government sub-agencies may be specified as a token proceeding from a federal government token: /id/us/fed/agency/Department_of_Defense is a valid entity URI, but we may also view the Navy as a sub-agency at /id/us/fed/agency/Department_of_Defense/Department_of_the_Navy. Other combinations are also possible, such as presenting a listing of crops as defined by the US Department of Agriculture at /id/us/fed/agency/Department_of_Agriculture/crop (token Department_of_Agriculture followed by category crop), and a specific crop (cauliflower) at /id/us/fed/agency/Department_of_Agriculture/crop/cauliflower.

If crop data from the Canadian agriculture and food inspection agency, Agriculture and Agri-Food Canada, were incorporated into the same instance hub it would be located at /id/ca/fed/agency/Agriculture_and_Agri-Food_Canada/crop. If the Canadian government attaches the same meaning to an entity called “cauliflower” as the USDA does, links denoting their equivalence could then be established between /id/ca/fed/agency/Agriculture_and_Agri-Food_Canada/crop/cauliflower and /id/us/fed/agency/Department_of_Agriculture/crop/cauliflower.

On the Semantic Web, equivalence between entities is commonly expressed using the “sameAs” property within the OWL web ontology language (denoted “owl:sameAs”). Stating in an RDF document that “entity_A owl:sameAs entity_B” means that (as “sameAs” would imply) these entities are in fact the same thing.
This may be useful as entities may be hosted on different sites, have different local names, or, as in the example of cauliflower given previously, come from different naming authorities (USDA vs. AAFC). The owl:sameAs property is commonly used on the Semantic Web to establish links between RDF documents describing the same entity or concept. For example, in the TWC instance hub, the US Army is found at http://logd.tw.rpi.edu/ih2/id/us/fed/agency/Department_of_the_Army/Department_of_the_Army, while on DBpedia the Army is found at http://dbpedia.org/resource/United_States_Department_of_the_Army. The instance hub RDF data on the Army states that it is owl:sameAs the DBpedia URI for the Army, so anyone who views this RDF knows that it is appropriate to treat any dataset reference to one of these as a reference to the other, and that any property of the Army asserted on one site is also true for the Army entity referenced on the other. In the cauliflower example, the “links denoting their equivalence” would be owl:sameAs links.

Entity equality may be explicitly indicated across the instance hub for unique names for given entities; for example, crop “XXXX” under a given classification system could be owl:sameAs linked to crop “YYYY” in another if they are in fact scientifically the same. If two dissimilar entities are referred to by the same name in two different countries, no owl:sameAs link would be established between them, since it would not be appropriate for data consumers to infer that the entities are the same.

By allowing for specific classification of entities in this way, instance hubs can enable developers of linked data applications to determine exactly what sort of entities they are dealing with, while also not impeding them from gaining access to a large amount of data about these entities.
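To make the consequence of owl:sameAs links concrete, the sketch below groups URIs connected by such links and shares their properties across each group. It is a simplified illustration, not an OWL reasoner: the base domain, the property names, and the property values are hypothetical, and owl:sameAs is modeled simply as a symmetric, transitive relation using a small union-find structure.

```python
def same_as_groups(links):
    """Group URIs connected by owl:sameAs links (symmetric and transitive)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps lookups short
            uri = parent[uri]
        return uri

    for left, right in links:
        parent[find(left)] = find(right)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


def merged_properties(links, properties):
    """Give every URI in a sameAs group the union of the group's properties."""
    merged = {uri: dict(props) for uri, props in properties.items()}
    for group in same_as_groups(links):
        combined = {}
        for uri in group:
            combined.update(properties.get(uri, {}))
        for uri in group:
            merged[uri] = dict(combined)
    return merged


# Hypothetical URIs following the paper's pattern, with made-up properties.
US = "http://example.com/id/us/fed/agency/Department_of_Agriculture/crop/cauliflower"
CA = "http://example.com/id/ca/fed/agency/Agriculture_and_Agri-Food_Canada/crop/cauliflower"
merged = merged_properties(
    [(US, CA)],
    {US: {"label_en": "cauliflower"}, CA: {"label_fr": "chou-fleur"}},
)
```

After merging, both URIs carry both labels. If two linked URIs asserted conflicting values for the same property, this sketch would keep an arbitrary one, which is one reason equivalence links should only be asserted when the entities truly are the same.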
If a mapping exists between “cauliflower” from US data sets and “cauliflower” from Canadian data sets, any properties only associated with one of these entities may easily be associated with the other.

While crops may seem a trivial example, the implications of this sort of linked data technology become more important in contexts where datasets may not share so simple a common vocabulary as vegetable names, or where a single entity may have a number of alternative identifiers. References in datasets to corporate entities are a good example of this: they may use a common name, a legal name, a stock ticker, a tax ID, or various other identifier standards, any of which might differ for the same corporation when conducting business in different countries. [OC 2011] For example, datasets making references to Microsoft may refer to it by names such as “Microsoft”, “Microsoft Corporation”, and “MICROSOFT CORP”, as well as by its SEC filings identifier, “0000789019”, its US Senate lobbying disclosure filings identifier, “25204”, its ticker symbol, “MSFT”, its US federal tax identifier, “91-1144442”, or any number of other unique identifiers.⁹,¹⁰

9 http://tw.rpi.edu/orgpedia/page/company/0000327629
10 “The Federal Tax Identification Number for Microsoft“, http://support.microsoft.com/kb/834344

TWC RPI Instance Hub Implementation

TWC RPI’s prototype instance hub was first implemented in PHP as an extension to the TWC Linking Open Government Data (LOGD) Portal. [Ding 2011] This version was an effective proof-of-concept but presented usability and system development constraints, so for the second (current) iteration we adopted the LODSPeaKr platform¹¹, a linked data publishing framework developed at TWC. [Graves 2012] The move to LODSPeaKr has made instance hub development and management significantly easier, while also improving the general interactive user experience.
LODSPeaKr uses a templating system, whereby templated HTML pages are associated with individual RDF types. When the URI of an entity within the instance hub is dereferenced (requested), LODSPeaKr initiates a SPARQL query¹² to determine the RDF type of the entity being requested and serves the appropriate page as determined by the type. Once the type of the URI being requested has been established, LODSPeaKr chooses a type template to display the URI that is being referenced. LODSPeaKr’s templating system allows developers to define SPARQL queries for data associated with each type that they present in the instance hub, and it uses the results of these queries to fill in HTML page templates for each “type” in the instance hub. For the TWC Instance Hub we defined specific type templates for the classes of entities contained within it (countries, US federal agencies, toxic chemicals, etc.), allowing us to style and incorporate available data differently for each one.

11 http://lodspeakr.org
12 SPARQL, the “SPARQL Protocol and RDF Query Language”, is a query language for RDF data, much like SQL is a query language for tabular data.

[Figure: Two example instance hub pages: the State of New York and the US President’s Council of Economic Advisors. LODSPeaKr’s templates allowed us to style these pages differently; for example, as a geographic entity, New York is shown on an embedded map at the top of the page alongside its state flag. See appendix for New York state RDF document.]

Principles for both publishing and consuming linked data were applied in the design of the TWC Instance Hub. In particular, some entity properties which the instance hub presents are not natively stored in TWC’s SPARQL endpoint, but are dynamically pulled from other sources. The primary source of external data on the instance hub is DBpedia.
Each entity in the instance hub has an associated owl:sameAs DBpedia URI, and selected DBpedia data is presented alongside locally hosted data on instance hub HTML pages and RDF documents. As the instance hub’s goal is to facilitate data linking and entity disambiguation, not to serve as a large-scale repository for data itself, each page presents far less information than the DBpedia page on the same entity. The data that the instance hub pulls in from other sites enhances the user experience by providing additional descriptive resources (such as country or state flags or, for geographic entities, an embedded Google map displaying the location of the entity in question, with the location coordinates derived from DBpedia).

The TWC Instance Hub has been designed to demonstrate the full potential of the instance hub concept, and therefore provides both internally and externally sourced data. A government-hosted and managed instance hub may instead choose to provide only internally sourced data. Authoritative government-run instance hubs might provide links to external resources that reference the entities that they house, but directly pulling resources from external sites could expose users of the instance hub to incorrect, inappropriate, or unvetted data, and could give malicious individuals a vector for carrying out cross-site scripting attacks13. An early demo instance hub created by one of the authors (Bulazel) during an internship with Data.gov intentionally used only government-hosted resources for this reason.

13 Cross-site scripting, or XSS, refers to a class of computer security vulnerabilities commonly found in web applications. XSS enables malicious attackers to inject JavaScript into vulnerable pages, and when this code is executed by other users it can present a serious security issue.
Relying on external data sources for instance hub data could present an XSS vulnerability for governments. Even if the government sites do not themselves have a vulnerability allowing malicious code to be injected, if an XSS vulnerability in an external data provider exists and is exploited, and this code is then pulled into a government instance hub, users are harmed just the same (and in turn, this code may then propagate to end users of applications built on government instance hub resources). Simply not using external data at all significantly reduces the attack surface available to malicious individuals.

In order to generate instance hub pages that provide listings of categorical information at descriptive partial URI fragments (e.g., /id/us/state for US states and everything else “below” them, or /id/country for countries and everything below them), LODSPeaKr “service” pages were created, allowing for the presentation of custom developer-defined content. Each page is populated by queries that retrieve all entities of each type that should be displayed on the page. In the interest of extensibility and ease of management, rather than implementing a single query to retrieve all entities that should be presented on a page (e.g., all US states, agencies, counties, etc. under /id/us), we implemented a number of queries, one for each of these categories. The queries used in service pages are all stored in a common directory, and new service pages can easily be created: the developer simply writes a few lines of LODSPeaKr template code and associates it with each query needed to populate the page.

Instance Hub Data

The data presented in the current TWC RPI Instance Hub was chosen to show the possibilities for the use of this technology within government.
A variety of hierarchically related and heterogeneously typed entities were chosen to demonstrate the URI design of the instance hub. While most of the instance hub data is currently from the United States, we plan to expand its scope to include agency and state/province level entities from other countries involved in the publication of open government data. As of April 2014 the TWC instance hub housed information about world countries, United States federal agencies and subagencies, US states and their counties, toxic chemicals as defined by the US EPA, and crops as defined by the USDA.

Instance hub pages and RDF documents provide a variety of high-level metadata about the entities that they describe. Alternative entity names and links to external sites (DBpedia, Freebase, etc.) are presented for each entity to assist with linking data. Additional data, such as a brief description of the entity from DBpedia and a link to a flag or logo, is also provided to help developers build linked data applications using instance hub entities. Links between related entities (sub-agencies, state counties, etc.) are also presented.

One important use case for the instance hub is data exploration and discovery. To this end, we have worked on integrating links to TWC-hosted linked open government datasets with related instance hub entities. Currently, this data comes from TWC’s International Open Government Dataset Catalogue Search (IOGDS) dataset repositories. Each time an instance hub page is loaded, a query for pertinent IOGDS data is sent to the TWC endpoint to retrieve a related dataset listing. Because of the very large number of IOGDS datasets available from TWC (over one million), only a small selection of relevant datasets is presented on each instance page, as a way to pique a viewer’s interest in further exploring IOGDS.
Better integrating these datasets with the instance hub is still an area of ongoing research, and we hope to develop new techniques that further enhance the instance hub’s capabilities as a linked data discovery tool.

Future Work

In order to take full advantage of linked data, named entities expressed in natural language form in datasets must be recognized and associated with corresponding URIs administered in the relevant instance hub. When aliases or incorrect spellings have been used to name a given entity, disambiguation is required to ensure that the canonical URI is assigned. Additionally, if identifiers from proprietary naming systems are used in the data, these should be reconciled with instance hub URIs. An emerging body of work focused on named entity recognition, leveraging progress in natural language processing, is contributing tools and methods to the problem of recognizing entity mentions within data. [WoLI 2012] Future work should include a focus on the development of efficient, automated workflows that apply named entity recognition algorithms at scale and integrate effectively with a network of instance hubs, producing high-quality, useful linked data.

The core value of an instance hub lies in its implementation of an authoritative URI scheme, but that value is greatly enhanced by the extent to which the hub provides links to alternative data. The TWC RPI Instance Hub accomplishes this by providing links for certain entities to alternative entity identifier schemes appropriate to the category; for example, for toxic chemicals we link to identifiers for chemicals in other canonical systems, including PubChem and ChemSpider.14,15 In the current TWC RPI Instance Hub these alternative schemes have been identified by hand; future work should include identifying scalable mechanisms to automate this process using the global web graph and incorporating resources such as Freebase and DBpedia.
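The disambiguation step described in this section can be illustrated with a minimal sketch. It assumes a small, hand-built alias table and uses the Python standard library's difflib for approximate matching; the canonical URI, the alias table, and the function names are hypothetical and are not part of the TWC RPI Instance Hub. A production workflow would more likely rely on a dedicated named entity recognition tool.

```python
import difflib

# Hypothetical canonical URI, used only for illustration.
MICROSOFT = "http://example.org/id/company/microsoft"

# Hypothetical alias table: normalized alias -> canonical URI.
ALIASES = {
    "microsoft": MICROSOFT,
    "microsoft corporation": MICROSOFT,
    "microsoft corp": MICROSOFT,
    "msft": MICROSOFT,
    "0000789019": MICROSOFT,
}


def normalize(mention):
    """Lower-case a mention, drop periods and commas, collapse whitespace."""
    cleaned = mention.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def reconcile(mention, cutoff=0.85):
    """Return the canonical URI for a mention, or None if nothing matches."""
    key = normalize(mention)
    if key in ALIASES:
        return ALIASES[key]
    # Approximate matching is applied to names only; a numeric identifier
    # that is off by one digit probably denotes a different entity.
    names = [alias for alias in ALIASES if not alias.isdigit()]
    close = difflib.get_close_matches(key, names, n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None
```

Exact lookups handle known aliases and identifiers, while the approximate step catches simple misspellings such as "Microsft". The cutoff value is an arbitrary choice here and would need tuning against real data.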
Challenges and Guidance for Government Adopters

Government adoption of the implementation approach we have tested will likely require several steps. First, governments will need to internally compile the authoritative entity information that they would like to publish online for open data consumers. Second, following the instance hub pattern, this data will need to be published as linked open data, requiring conversion into RDF. [Ding 2011, Ding 2012] Third, a web application will be required to present this data, and finally the data should be published online. Governments should engage the open data community during this process to ensure that they are in fact creating usable systems that present desirable, useful data.

The United States’ Data.gov has been a world leader in promoting the use of Semantic Web technologies in the publication of open government data [Kundra 2010]. During the summer of 2012 the site hosted one of the authors (Bulazel) as a summer intern exploring the use of instance hub technology for Data.gov’s open data publication work. Much of TWC’s recent work on developing instance hubs builds on this experience, which offers many valuable lessons for governments seeking to publish linked data in Semantic Web formats.

14 http://pubchem.ncbi.nlm.nih.gov/
15 http://www.chemspider.com/

We recommend that government agencies seeking to implement instance hub technology do the following: plan their URI schemes before implementation, consider the security and authority implications of their applications, and finally carry out implementation having fully prepared the content that will be presented. Community and intra-government collaboration is crucial to this process, and these stakeholders should be involved at each step.
For these reasons, we strongly recommend that agencies transparently plan out the URI schemes that they propose using in their instance hubs, seeking input from developers, academia, and other agencies. All three of these stakeholder groups have valuable contributions to make to the planning process. Developers who will ultimately use the instance hub in their open data applications have a vested interest in ensuring that the hub is usable and meets their needs. Academics have valuable insight into best practices for the deployment of experimental Semantic Web applications like instance hubs, and should also be engaged in the process. Finally, agencies should collaborate with other agencies that may be developing instance hubs, sharing best practices and implementation tips, and ensuring the interoperability and linkability of their data.

The security and authority implications of instance hubs are another important consideration for governments. While linked data applications often feature information dynamically drawn from multiple sources, we recommend that governments not do this, and instead rely either entirely on internally sourced data or on vetted data from other sources that is cached for presentation and not dynamically sourced. Given the relatively high profile of government sites, there is a risk of malicious actors using dynamically sourced data as a way to spread embarrassing misinformation or, worse, as a vector to spread malware via so-called “cross-site scripting” attacks using JavaScript.

The adoption of Semantic Web technologies by many government agencies would be a large undertaking. For this reason, we recommend that open data agencies with previous experience in the Semantic Web space (such as Data.gov or Data.gov.uk) take the lead in exploring the possibilities for the use of instance hub technologies in their respective governments.
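As a complement to the security recommendation in this section, the sketch below shows one common precaution when vetted, cached external text is placed into an HTML page: escaping it first so that any embedded markup is displayed as text rather than executed. It uses Python's standard html.escape; the function name and page fragment are hypothetical, and this is not a description of how LODSPeaKr or the TWC RPI Instance Hub handles external data.

```python
import html


def render_entity_fragment(title, description):
    """Build a small HTML fragment, escaping every field before insertion.

    Both arguments are treated as untrusted, even when they come from a
    vetted cache, so that injected markup cannot become live HTML.
    """
    return "<h1>{}</h1>\n<p>{}</p>".format(
        html.escape(title), html.escape(description)
    )
```

Escaping addresses only script injection through text fields; it does not replace the vetting and caching of external data recommended above.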
Outcomes For Open Data Consumers

If governments use instance hubs to link entities within their hubs to datasets referencing these entities, as the TWC RPI instance hub does, these instance hubs can then serve as an entry point for dataset discovery. After developers have chosen the datasets that they would like to work with, instance hubs can play an important role in linking these datasets and making them interoperable.

In an ideal world, government-published open datasets would be easily available in machine-readable RDF, and elements within these datasets would be linked via authoritative instance hub URIs. In reality, many open datasets are not even in simple, parseable open formats like CSV, let alone RDF. [Ding 2011] When documents are not in RDF format, developers will first need to convert them to RDF. Instance hubs can be used during this process to craft richer RDF documents, as developers can integrate metadata about referenced resources into these documents during conversion. For data that is in RDF format but not linked via instance hub URIs, developers can use instance hubs to link this data to other data, based on commonalities in entity reference schemes as negotiated by the instance hub.

After developers have linked their datasets together, instance hubs can further help them by providing high-quality authoritative metadata about the entities referenced in these datasets. Instance hub metadata can include government agency logos, official websites, short descriptions, or anything else that governments think would be of use to developers. This metadata can then be integrated into open data applications with ease, while assuring developers that they are dealing with authoritative data.
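The conversion step described above can be sketched as follows, assuming a CSV file with a free-text "state" column and a numeric "count" column. The alias table stands in for a lookup against an instance hub; the New York URI and its aliases follow the sample record in this paper's appendix, while the property URI, column names, and function name are hypothetical.

```python
import csv
import io

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
COUNT_PROPERTY = "http://example.org/vocab/count"  # hypothetical property

# Stand-in for an instance hub lookup: alias -> authoritative URI.
NEW_YORK = "http://logd.tw.rpi.edu/id/us/state/New_York"
STATE_URIS = {"36": NEW_YORK, "ny": NEW_YORK, "new york": NEW_YORK}


def csv_to_ntriples(csv_text):
    """Convert CSV rows to N-Triples lines keyed by instance hub URIs."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uri = STATE_URIS.get(row["state"].strip().lower())
        if uri is None:
            continue  # unresolved mentions are skipped for manual review
        value = int(row["count"])
        triples.append(
            '<{}> <{}> "{}"^^<{}> .'.format(
                uri, COUNT_PROPERTY, value, XSD_INTEGER
            )
        )
    return triples
```

Because every row that mentions "NY", "New York", or FIPS code "36" resolves to the same subject URI, triples produced from different source files can be merged without further reconciliation.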
Overall, we believe that instance hubs can provide great value to developers interested in working with open data, and that they would lower the barriers to entry for this sort of work by making data more easily interoperable.

Conclusion

The RPI instance hub project is an area of ongoing research and will continue to develop. Instance hubs provide a way of making open government data (or any data) more readily usable in linked data applications.

What is needed now is adoption of instance hubs by government publishers of open data. Instance hub technology and the URI schemes we have proposed have not been developed in a vacuum; rather, they are the product of dialogue with colleagues in government actively involved in open data publication. The TWC Instance Hub is a realization of the ideas put forward in these discussions and serves as an important demonstration of the viability of these ideas for use in real-world applications.

TWC has proposed a scheme for the naming of government entities and has created a demonstration of its implementation, but its adoption remains within the domain of governments themselves. While TWC will continue to be involved in the promotion of linked open government data and of technologies, such as instance hubs, that may be used to promote it, backing for this technology is now necessary from governments themselves in the form of implementation in government information systems. As with the greater Web, the Semantic Web needs the authoritative power of government agencies to help ensure the quality and accountability of linked open data. We have provided the first technological step towards that solution, but now need support from government agencies to embrace and evolve it.

Acknowledgments

This work was supported by a generous gift to TWC RPI from Microsoft Research. Jim Hendler serves as Open Data Advisor to the New York State Government and to the US Data.gov project.
The ideas expressed in this paper are those of the authors and do not necessarily reflect the opinions of any of these organizations.

References

[Archer 2012] Phil Archer, "D7.1.3 - Study on persistent URIs, with identification of best practices and recommendations on the topic for the MSs and the EC." Deliverable for ISA Action 1.1 on Semantic Interoperability (Dec 2012). http://bit.ly/19Rj2M5
[Ding 2011] Ding, L., T. Lebo, J. S. Erickson, D. DiFranzo, A. Graves, G. T. Williams, X. Li, J. Michaelis, J. Zheng, Z. Shangguan, et al., "TWC LOGD: A Portal for Linked Open Government Data Ecosystems", Web Semantics: Science, Services and Agents on the World Wide Web, vol. 9, no. 3: Elsevier, 2011. http://bit.ly/16tmY9q
[Ding 2012] Ding, L., V. Peristeras, and M. Hausenblas, "Linked Open Government Data", IEEE Intelligent Systems, vol. 27, no. 3, Los Alamitos, CA, USA, IEEE Computer Society, pp. 11-15, 2012. http://bit.ly/16YYb7s
[DOI] The Digital Object Identifier System. http://www.doi.org/
[Erickson 2011] Erickson, J. S., E. Rozell, Y. Shi, J. Zheng, L. Ding, and J. A. Hendler, "TWC International Open Government Dataset Catalog", Proceedings of the 7th International Conference on Semantic Systems (I-Semantics '11), Graz, Austria: ACM Press, 2011. http://bit.ly/1aD4GQq
[Graves 2012a] Alvaro Graves, "Creating Web Applications with LODSPeaKr." (Feb 22, 2012) http://slidesha.re/16YUyOT
[Graves 2012b] Alvaro Graves, "Publishing Linked Data with LODSPeaKr." (Jul 07, 2012) http://slidesha.re/16YVsLo
[Juty 2012] Juty N., Le Novère N., Laibe C., "Identifiers.org and MIRIAM Registry: community resources to provide persistent identification." Nucleic Acids Research. 2012; 40 (Database issue): D580-D586
[Kirkpatrick 2009] Marshall Kirkpatrick, "Data.gov Now Live; Looks Nice But Short on Data."
ReadWriteWeb (created 21 May 2009; accessed 19 Mar 2014) http://bit.ly/1nEmAwk
[Kundra 2010] Vivek Kundra, "Data.gov: Pretty Advanced for a One-Year-Old." White House web site (created 21 May 2010; accessed 19 Mar 2014) http://1.usa.gov/15BNGfC
[LinkedData 2006] Tim Berners-Lee, "Linked Data." W3C Design Issues (created 2006; latest revision 2009). http://www.w3.org/DesignIssues/LinkedData.html
[LODCloud 2011] Richard Cyganiak and Anja Jentzsch (eds.), The Linking Open Data cloud diagram. (Edited Sep 2011). http://lod-cloud.net/
[OC 2011] The Sunlight Foundation, “Unique Corporate Identifiers.” OpenCongress Wiki (accessed 30 Oct 2013). http://www.opencongress.org/wiki/Unique_Corporate_Identifiers
[Orszag 2009] Peter R. Orszag, "Open Government Directive." M-10-06, Memorandum for the Heads of Executive Departments and Agencies (8 Dec 2009). http://1.usa.gov/1nEqANo
[Pereira 2012] Bianca Pereira, João C. P. da Silva, and Adriana S. Vivacqua, "Discovering Names in Linked Data Datasets." In Proceedings of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference (ISWC 2012), Boston, USA (November 11, 2012).
[WoLI 2012] Web of Linked Entities Workshop 2012. http://ceur-ws.org/Vol-906/
[W3 2002] Klyne, Graham and Carroll, Jeremy (eds.), “Resource Description Framework (RDF): Concepts and Abstract Data Model” (work in progress) (Aug 29, 2002), http://www.w3.org/TR/2002/WD-rdf-concepts-20020829

Appendix

Example of a machine-readable “Turtle” format representation of RDF data on the State of New York:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ns0: <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/vocab/> .
@prefix ns1: <http://www.w3.org/2002/07/owl#> .
@prefix ns2: <http://purl.org/dc/terms/> .
@prefix ns3: <http://xmlns.com/foaf/0.1/> .
@prefix ns4: <http://open.vocab.org/terms/> .
@prefix ns5: <http://rdfs.org/ns/void#> .
@prefix ns6: <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/vocab/enhancement/1/> .
@prefix ns7: <http://www.w3.org/2004/02/skos/core#> .

<http://logd.tw.rpi.edu/id/us/state/New_York>
    rdf:type ns0:State ;
    ns1:sameAs <http://dbpedia.org/resource/New_York> ;
    ns2:isReferencedBy <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/version/2011-Oct09> ;
    ns2:identifier "36" , "ny" , "new york" , "New York" , "NEW YORK" , "NY" ;
    ns2:title "New York" ;
    ns3:isPrimaryTopicOf <http://logd.tw.rpi.edu/id/us/state_page/New_York> ;
    ns4:csvRow "33"^^<http://www.w3.org/2001/XMLSchema#integer> ;
    ns5:inDataset <http://logd.tw.rpi.edu/source/twc-rpiedu/dataset/instance-hub-us-states-and-territories/version/2011-Oct09> ;
    ns6:fips "36" .

<http://logd.tw.rpi.edu/id/us/state/New_York/Chautauqua>
    ns2:title "Chautauqua" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .

<http://logd.tw.rpi.edu/id/us/state/New_York/Chemung>
    ns2:title "Chemung" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .

<http://logd.tw.rpi.edu/id/us/state/New_York/Erie>
    ns2:title "Erie" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .

<http://logd.tw.rpi.edu/id/us/state/New_York/Herkimer>
    ns2:title "Herkimer" ;
    ns7:broader <http://logd.tw.rpi.edu/id/us/state/New_York> .

[county listing truncated]
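In the sample record above, each county URI extends its state's URI by one path segment, mirroring the skos:broader link between the two. The sketch below assumes that this convention holds throughout the hierarchy, which the sample suggests but does not prove; the function names are hypothetical, and in practice the broader relation should be read from the RDF rather than inferred from the URI string.

```python
from urllib.parse import quote

BASE = "http://logd.tw.rpi.edu/id"


def entity_uri(*segments):
    """Join path segments under BASE, writing spaces as underscores."""
    cleaned = [quote(s.strip().replace(" ", "_")) for s in segments]
    return "/".join([BASE] + cleaned)


def broader_uri(uri):
    """Return the URI one path level up, or None at the top of the hierarchy."""
    if not uri.startswith(BASE + "/"):
        return None
    path = uri[len(BASE) + 1:]
    if "/" not in path:
        return None  # already a top-level category such as /id/country
    return uri.rsplit("/", 1)[0]
```

For example, entity_uri("us", "state", "New York", "Erie") reproduces the Erie County URI from the record, and broader_uri applied to it yields the New York state URI that the record gives as its skos:broader value.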