- Laboratoire de Recherche en Informatique (LRI), Paris Sud University
- IATE Joint Research Unit (Ingénierie des Agropolymères et Technologies Emergentes), Montpellier
- INRIA Project Team « GraphIK », Montpellier
- Institut National de l’Audiovisuel (INA), Paris
- Agence Bibliographique de l’Enseignement Supérieur (ABES), Montpellier
Abstract
The main difficulty in data fusion is handling conflicts between property values, that is, cases where several values are possible for the same property. These conflicts stem mainly from the heterogeneity of the data: sources use different vocabularies and conventions to describe it. Poor data quality (stale, erroneous or incomplete information) can further amplify the conflicts between values.
The aim of this PhD project is to develop a data fusion approach in which: (i) several criteria (e.g., freshness, frequency, reliability of sources, functional and semantic dependencies) can be selected and combined to choose the right values; (ii) the schema constraints are checked against the merged data; and (iii) provenance information is exploited, covering both the original data sources and the mappings applied during schema integration of those sources. In addition, the richness of the Web of Data will be exploited by navigating the graph of owl:sameAs links.
Context
Recently, with the « Linked Open Data cloud (LOD) » initiative, the growing number of structured data sources made available on the Web has led to explosive growth of the global data space, which now holds billions of assertions (61 billion in January 2014). In this data space, semantic links can be established between data. These links allow crawlers, browsers or applications to navigate through the data sources and combine information from different sources. However, in an open environment like the Web, different URIs are regularly created to identify the same object. Generating identity links (owl:sameAs) between resources is therefore crucial to allow applications to exploit the richness of the LOD. To do so, these applications should be able to fuse the resources connected by owl:sameAs links in order to obtain a unified representation. This is the data fusion problem, which arises once the data linking problem has been solved. It is this data fusion problem that this PhD project addresses.
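To make the starting point of fusion concrete, the following minimal Python sketch groups resources connected by owl:sameAs links into one cluster and lists the properties for which the cluster holds several values. It is only an illustration under stated assumptions: the triples, the ex: identifiers and the function names are hypothetical, and identity links are treated as plain symmetric and transitive links with no quality check.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"

# Hypothetical toy triples (subject, predicate, object); not real LOD data.
triples = [
    ("ex:a1", SAME_AS, "ex:b7"),
    ("ex:b7", SAME_AS, "ex:c3"),
    ("ex:a1", "ex:birthYear", "1802"),
    ("ex:b7", "ex:birthYear", "1802"),
    ("ex:c3", "ex:birthYear", "1803"),
    ("ex:d9", "ex:birthYear", "1900"),
]


def find(parent, x):
    """Return the representative of x's cluster (union-find with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def fuse_candidates(triples):
    """Group property values by owl:sameAs cluster."""
    parent = {}
    for s, p, o in triples:
        if p == SAME_AS:
            root_s, root_o = find(parent, s), find(parent, o)
            if root_s != root_o:
                # Keep the lexicographically smallest URI as representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    merged = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p != SAME_AS:
            merged[find(parent, s)][p].add(o)
    return merged


def conflicts(merged):
    """Return the (cluster, property) pairs that hold more than one value."""
    return {
        (entity, prop): sorted(values)
        for entity, props in merged.items()
        for prop, values in props.items()
        if len(values) > 1
    }


print(conflicts(fuse_candidates(triples)))
```

In this toy input the three linked resources end up in one cluster with two candidate birth years, which is exactly the kind of conflict a fusion approach has to resolve.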
Objectives
Study the theoretical and practical aspects of a data fusion approach in order to: (i) avoid data redundancy and (ii) summarize information from a given point of view.
Work program
1. A first direction of work concerns the annotation of linked data. Once linking has been achieved, the interpretability and explanation of the data fusion result become essential elements of data quality. It is important to be able to identify and keep track of the criteria that are used to choose the "best value" or to rank the possible values of a property during the fusion step. Such annotations can be rich enough to store excluded values, synonymy relations, specialization/generalization relations, mereology relations, and so on. For each criterion, a quality score is computed. These scores should also be kept and represented in the annotations.
2. A second direction of work concerns the case where several data items can be grouped with the goal of creating a more generic entity that carries their common characteristics. A characteristic case is the concept of "work" in bibliographical data sources. A fusion approach serving this data summarization objective needs to be developed.
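The annotation idea in the first direction of work can be sketched as follows. This is a minimal Python illustration, not the approach the project will develop: the three criteria, their weights, the source reliabilities, the weighted-sum combination and all names (Claim, WEIGHTS, RELIABILITY, fuse) are assumptions chosen for the example. What it shows is a fusion step that returns the chosen value together with the excluded values, the per-criterion scores and the supporting sources.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    value: str
    source: str
    year: int  # year in which the source asserted the value


# Hypothetical weights and source reliabilities, for illustration only.
WEIGHTS = {"frequency": 0.4, "freshness": 0.2, "reliability": 0.4}
RELIABILITY = {"srcA": 0.9, "srcB": 0.6, "srcC": 0.5}


def score_values(claims):
    """Compute one score per criterion for each candidate value."""
    counts = Counter(c.value for c in claims)
    oldest = min(c.year for c in claims)
    span = (max(c.year for c in claims) - oldest) or 1  # avoid division by zero
    scores = {}
    for value in counts:
        supporting = [c for c in claims if c.value == value]
        scores[value] = {
            "frequency": counts[value] / len(claims),
            "freshness": max((c.year - oldest) / span for c in supporting),
            "reliability": max(RELIABILITY.get(c.source, 0.0) for c in supporting),
        }
    return scores


def fuse(prop, claims):
    """Choose a value by weighted sum and keep the evidence as an annotation."""
    scores = score_values(claims)
    total = {v: sum(WEIGHTS[k] * s[k] for k in WEIGHTS) for v, s in scores.items()}
    ranking = sorted(total, key=lambda v: (-total[v], v))
    return {
        "property": prop,
        "chosen": ranking[0],
        "excluded": ranking[1:],
        "criteria_scores": scores,
        "sources": {v: sorted(c.source for c in claims if c.value == v) for v in scores},
    }


claims = [
    Claim("1802", "srcA", 2012),
    Claim("1802", "srcB", 2010),
    Claim("1803", "srcC", 2014),
]
annotation = fuse("ex:birthYear", claims)
print(annotation["chosen"], annotation["excluded"])
```

Keeping the per-criterion scores instead of only the final ranking lets a user see why a value was preferred, for example that the excluded value was the freshest but came from a less reliable source.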
Extra information
We are looking for PhD candidates and for funding for this PhD project.
Prerequisite
Good skills in databases, Web technologies (XML, RDF, OWL) and in knowledge representation.