The problem of reference reconciliation has been studied extensively for relational data and, more recently, for hierarchical and graph data. Given the tremendous amount of Web data, current solutions for graph duplicate detection are not applicable at this scale. Scalability is not the only issue: the second crucial aspect is the quality of the result. To improve result quality, this project will investigate how reference reconciliation can benefit from data provenance, i.e., information about the source of the data and the processes that produced them. A further improvement in this direction is to make reconciliations reversible, so that a reconciliation can be revoked when conflicting information appears.
The goal of the doctoral project is to develop reference reconciliation algorithms that take provenance and reversibility into account while scaling to large volumes of Web data.
Context
When managing massive, heterogeneous, and distributed data, both producing and using these data raise questions of quality and reliability. In this context, we consider the problem of reference reconciliation, which aims at identifying different representations of the same real-world object (person, product, etc.) among data from different sources. This problem, also known as entity resolution or duplicate detection, is of high practical relevance in industry.
Objectives
The goal is to design algorithms for reference reconciliation that improve the state of the art in three respects: (i) they apply to large volumes of data, (ii) they make use of data provenance, and (iii) they allow reconciliation decisions to be revoked, which requires studying how revocation affects efficiency and result quality. In our group, we strive to develop practically relevant solutions, so the proposed solutions will be implemented and evaluated on real data.
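To make objectives (ii) and (iii) more concrete, the following is a minimal illustrative sketch, not the project's method. It assumes that each reconciliation decision is stored together with the source that supported it, and that revoking a decision simply deactivates it before the groups of co-referent references are recomputed. The names Decision, ReconciliationLog, and revoke_source are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """One reconciliation decision: references a and b denote the same object."""
    a: str
    b: str
    source: str          # provenance (assumed here to be a plain source label)
    active: bool = True  # set to False when the decision is revoked


@dataclass
class ReconciliationLog:
    """Hypothetical log of decisions that can be revoked later."""
    decisions: list = field(default_factory=list)

    def reconcile(self, a, b, source):
        """Record that a and b were reconciled based on evidence from source."""
        self.decisions.append(Decision(a, b, source))

    def revoke_source(self, source):
        """Revoke every active decision supported by `source`; return the count."""
        revoked = 0
        for d in self.decisions:
            if d.active and d.source == source:
                d.active = False
                revoked += 1
        return revoked

    def clusters(self):
        """Recompute groups of co-referent references from active decisions."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for d in self.decisions:
            # Register both references even for revoked decisions, so that
            # they still show up as (possibly singleton) clusters.
            ra, rb = find(d.a), find(d.b)
            if d.active:
                parent[ra] = rb

        groups = {}
        for x in parent:
            groups.setdefault(find(x), set()).add(x)
        return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    log = ReconciliationLog()
    log.reconcile("p1", "p2", source="siteA")
    log.reconcile("p2", "p3", source="siteB")
    print(log.clusters())       # all three references in one group
    log.revoke_source("siteB")  # conflicting information about siteB appears
    print(log.clusters())       # p3 is separated again
```

Recomputing all clusters after each revocation is the simplest possible design and would not scale to large volumes of Web data; how to revoke decisions incrementally and at what cost is exactly the kind of question objective (iii) raises.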
Work program
The overall goal will be achieved following a four-step work program:
1) Inclusion of data provenance
2) Study of reversibility of reconciliations
3) Large-scale graph reference reconciliation
4) Software implementation and evaluation