HDT (Queryable compression format for Linked Data)

ISWC2017 Proposal for a Tutorial

Abstract

The steady adoption of Linked Data in recent years has led to a significant increase in the volume of RDF datasets. The potential of this Big Semantic Data is under-exploited when data management is based on traditional, human-readable RDF representations, which add unnecessary overheads when storing, exchanging and consuming RDF in the context of a large-scale and machine-understandable Semantic Web. HDT tackles this issue by proposing a binary representation for RDF data. HDT can be seen as a compressed, self-contained triple store for RDF. On the one hand, HDT represents RDF with compact data structures that enable the storage, parsing and loading of Big Semantic Data in compressed space. On the other hand, “the HDT data are the index”, so it can be used as a graph data store with competitive querying performance. In this tutorial we will focus on providing hands-on experience with HDT. We will also welcome external presentations on related topics and a discussion of next steps for the interested community.

Motivation

Although the amount of RDF data has grown impressively over the last decade, traditional RDF representations are dominated by a document-centric, human-readable view. They therefore suffer from scalability problems: they need huge amounts of space, require powerful resources to manage, and take a long time to retrieve over the Web.

This scenario calls for efficient and functional representation formats for RDF as an essential tool for RDF preservation, sharing, and management. HDT fills this gap and proposes a compact data structure and binary serialization format that keeps big datasets compressed, saving space while maintaining search and browse operations without prior decompression. This makes it an ideal format for storing and sharing RDF datasets on the Web.

HDT has been adopted by the Semantic Web community because of its simplicity and its performance for data retrieval operations. It is successfully deployed in projects such as Linked Data Fragments, which provides a uniform and lightweight interface to access RDF on the Web; in indexing/reasoning systems such as HDT-FoQ and WaterFowl; in recommender systems; and in mobile applications. It is also the main store behind the LOD Laundromat project, which serves a crawl of a very large subset of the Linked Open Data Cloud. We therefore expect this tutorial to be of particular relevance to ISWC, since it raises awareness of a practical technology for managing and serving Big Semantic Data.

Detailed Description

We propose a full-day tutorial. It will be held in an innovative format in order to elicit unanswered questions about the HDT technology and the Big Semantic Data community. Thus, the program will be split into two parts, with a knowledge sharing session in the morning (tutorial) and participant presentations and open discussion in the afternoon (workshop).

The knowledge sharing session will be composed of practical and hands-on lessons to acquire the following skills:

The interactive workshop in the afternoon will consist of short presentations and demos from invited speakers, as well as responses to our call for papers and call for action. The last session will be dedicated to establishing collaborations and defining concrete next steps towards better organizing our community and carrying out actions that increase its impact on Linked Data management.

Time Activity
9:00 - 9:10 Welcome and introduction by organizers
9:15 - 9:30 Participant presentations: Getting acquainted session
9:30 - 10:30 HDT foundations
10:30 - 11:00 Coffee Break
11:00 - 12:30 Practical uses: Linked Data Fragments and LOD Laundromat
12:30 - 14:00 Lunch
14:00 - 15:30 Presentations - Different perspectives on the topic
15:30 - 16:00 Coffee Break
16:00 - 16:50 Discussion: concrete next steps
16:50 - 17:00 Closing Remarks

Tutorial Material

The material will consist of slides, code snippets, datasets and the existing libraries at https://github.com/rdfhdt. All tutorial material will be freely available on GitHub and on the project website http://rdfhdt.org.

Audience

We expect to attract Linked Data researchers and practitioners, in particular data publishers and consumers. The audience will benefit from attending this tutorial by learning about ways to scale up large semantic data management and data retrieval, as well as by being able to discuss their expectations, requirements and experiences with current RDF representations and triple stores at large scale. We aim for an audience of at least 20 people.

Requirements

The organisation of the tutorial requires standard, basic needs (projector and Internet connection).

Presenters

Wouter Beek

VU University Amsterdam, The Netherlands

http://wouterbeek.com/

Wouter Beek received his Master’s in Logic from the Institute for Logic, Language and Computation (ILLC). He is currently a PhD researcher at VU University Amsterdam (VUA), working in the Knowledge Representation & Reasoning (KR&R) group. His research focuses on the development, deployment and analysis of large-scale heterogeneous knowledge bases and the way in which they enable unanticipated and innovative reuse. Wouter is the principal developer of the LOD Laundromat and LOD Lab. He has taught over ten courses in Artificial Intelligence and Philosophy.

Javier D. Fernández (primary contact)

Vienna University of Economics and Business, Austria

https://www.wu.ac.at/en/infobiz/team/fernandez/

Javier D. Fernández holds a PhD in Computer Science from the University of Valladolid (Spain) and the University of Chile (Chile). His thesis addressed efficient management of Big Semantic Data, proposing HDT, a binary RDF representation for scalable publishing, exchange and consumption in the Web of Data. Dr. Fernández is currently a post-doctoral research fellow under an FWF (Austrian Science Fund) Lise-Meitner grant. His current research focuses on efficient management of Big Semantic Data, RDF streaming, and archiving and querying dynamic Linked Data. He has published more than 40 articles in international conferences and workshops and was editor of the HDT W3C Member Submission.

Ruben Verborgh

Ghent University – imec, Belgium

https://ruben.verborgh.org/

Ruben Verborgh is a researcher in semantic hypermedia at Ghent University – imec, Belgium and a postdoctoral fellow of the Research Foundation Flanders. He explores the connection between Semantic Web technologies and the Web’s architectural properties, with the ultimate goal of building more intelligent clients. Along the way, he became fascinated by Linked Data, REST/hypermedia, Web APIs, and related technologies. He’s a co-author of two books on Linked Data, and has contributed to more than 200 publications for international conferences and journals on Web-related topics.

Program Committee

The following people could be potential PC members for the call for papers and would help in the dissemination of the tutorial.