Linked data allows data stored in heterogeneous and autonomous information sources to be integrated, making the combined information more valuable than what any isolated source provides: aggregating related data can yield knowledge not available in individual sources. Linked data sources expose Resource Description Framework (RDF) data through SPARQL endpoints, which can be queried with the SPARQL query language.
The problem of decomposing queries in distributed environments is rooted in the information integration problem; thus, it is not new. However, the increasing relevance of the linked open data (LOD) cloud poses new challenges to the information integration community: the data model is different, and federation is taken to its extreme because data sources are completely autonomous from the agent in charge of query distribution.
This paper tackles query distribution in LOD data sources when there are replicated data fragments. What distinguishes the replication problem in the LOD context is that data fragmentation and replication cannot be designed in advance to obtain better performance when querying the data sources. Moreover, the availability of sources is unpredictable.
The paper addresses the query decomposition problem with fragment replication (QDP-FR) and offers a solution called LILAC (SPARQL query decomposition against federations of replicated data sources). Its main components are four algorithms: a decompose algorithm, a reduceunions algorithm, a reducebgps algorithm, and an increaseselectivity algorithm. Together, they locate the relevant sources (that is, select nonredundant sets of fragments and candidate endpoints) and join the relevant fragments obtained from the different sources.
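To make the source-selection idea concrete, the following is a minimal sketch in Python. It is not LILAC's decompose algorithm or any of its reduction steps; it is a generic greedy set-cover heuristic, under the simplifying assumption that each fragment is identified by a single predicate and that each endpoint advertises the fragments it replicates. The function name `select_endpoints`, the endpoint labels, and the predicates are all hypothetical.

```python
from typing import Dict, List, Set


def select_endpoints(
    query_predicates: List[str],
    replicas: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Assign each relevant fragment to exactly one endpoint.

    query_predicates: predicates of the query's triple patterns
        (assumption: one fragment per predicate).
    replicas: endpoint name -> set of fragments it replicates.
    Returns fragment -> chosen endpoint. Fragments that no endpoint
    holds are left out of the result.
    """
    needed = set(query_predicates)
    assignment: Dict[str, str] = {}
    if not replicas:
        return assignment
    while needed:
        # Greedy step: take the endpoint covering the most fragments
        # that are still unassigned; ties go to the first name in
        # alphabetical order so the result is deterministic.
        best = max(sorted(replicas), key=lambda e: len(replicas[e] & needed))
        covered = replicas[best] & needed
        if not covered:
            break  # the remaining fragments are not available anywhere
        for fragment in covered:
            assignment[fragment] = best
        needed -= covered
    return assignment


# Hypothetical federation: foaf:knows and dbo:birthPlace are replicated.
replicas = {
    "ep1": {"foaf:name", "foaf:knows"},
    "ep2": {"foaf:knows", "dbo:birthPlace"},
    "ep3": {"dbo:birthPlace"},
}
plan = select_endpoints(["foaf:name", "foaf:knows", "dbo:birthPlace"], replicas)
```

Because every fragment is assigned to a single endpoint, no replicated fragment is retrieved twice, which is the nonredundancy property described above. A real decomposer would also have to decide which triple patterns to group into subqueries and how to join their results.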
The paper presents the problem and formalizes it. Then, the LILAC solution is proposed. The algorithms that constitute LILAC are formalized, their complexity is measured, and proofs of theorems are presented. A validation with experiments on four real datasets and one synthetic dataset is included. In these experiments, the performance of LILAC in two query engines, FedX and ANAPSID, is compared with that of competing approaches. Performance is measured in terms of execution time, answer completeness, and number of transferred tuples.
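As an illustration of one of these measures, answer completeness is commonly computed as the fraction of the expected answers that a query actually returns. The sketch below assumes that definition; the exact formula used in the paper is not stated here, and the function name is hypothetical.

```python
from typing import Hashable, Iterable


def answer_completeness(
    returned: Iterable[Hashable],
    expected: Iterable[Hashable],
) -> float:
    """Fraction of the expected answers present in the returned answers."""
    expected_set = set(expected)
    if not expected_set:
        # Assumption: a query with no expected answers counts as complete.
        return 1.0
    return len(set(returned) & expected_set) / len(expected_set)


# Two of the four expected answers were retrieved.
score = answer_completeness(["a", "b"], ["a", "b", "c", "d"])
```

Answers outside the expected set do not raise the score, so the measure captures only completeness, not correctness.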
This is a sound paper, of interest to the community of researchers working with information integration in linked data environments. I particularly appreciated the “Related Work” section, which makes an admirable effort to compare the problem with other known problems (and solutions), such as distributed databases, data fragmentation, and data replication.