Databases-Web-Information Retrieval

Items

Domain - extra

Data Provenance

Year

2012

Starting

September 2012

Status

Open

Subject

Foundations and Algorithms to Compute the Provenance of Missing Data

Thesis advisor

BIDOIT Nicole

Co-advisors

Melanie Herschel (LRI-INRIA : BD/OAK) will be the main advisor
http://www.lri.fr/~herschel/research.html

Laboratory

LRI LaHDAK

Collaborations

Abstract

Complex data transformations appear in numerous applications, such as data warehousing, data integration, and data cleaning. With increasing transformation complexity, the complexity of developing and understanding these transformations increases as well. Data provenance techniques, which trace back transformation output data to the input data contributing to the output, can help in understanding such complex data transformations by explaining how the output was produced. However, especially during transformation development, a crucial question is not only to explain existing output data, but also to explain why expected data is missing from the output.

Our goal is to devise the theoretical foundations and to propose novel algorithms that automatically compute the data provenance of missing output data. The resulting explanations should tell a data transformation developer why the desired output is not obtained, based on data examples and intuitive transformation representations.
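As a minimal illustration of the question above, the following Python sketch runs a small SQL transformation with the standard sqlite3 module and then checks which selection condition prevents an expected tuple from appearing in the output. The table emp, its contents, and the helper failing_predicates are hypothetical and chosen only for illustration; this is not one of the algorithms cited in this proposal, and it only handles a conjunction of selection predicates over a single table.

```python
import sqlite3

# Hypothetical toy data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO emp VALUES
        ('Ana', 'R&D', 5200),
        ('Bob', 'R&D', 3900),
        ('Cy', 'Sales', 6100);
""")

# The transformation under development, split into its selection predicates.
predicates = ["dept = 'R&D'", "salary > 4000"]
query = "SELECT name FROM emp WHERE " + " AND ".join(predicates)

output = [row[0] for row in conn.execute(query)]


def failing_predicates(conn, predicates, name):
    """Return the predicates that the source row for `name` does not satisfy."""
    failed = []
    for pred in predicates:
        hit = conn.execute(
            f"SELECT 1 FROM emp WHERE name = ? AND {pred}", (name,)
        ).fetchone()
        if hit is None:
            failed.append(pred)
    return failed


# The developer expected 'Bob' in the output: why is he missing?
why_not_bob = failing_predicates(conn, predicates, "Bob")
print(output)
print(why_not_bob)
```

Here the expected name is absent from the output, and the sketch reports the condition responsible. Real transformations involve joins, unions, and aggregation, where such a per-predicate check is no longer sufficient.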

Context

This thesis topic is part of the Nautilus project (http://nautilus-system.org, (HG11)), which we are pursuing in the database group of Université Paris Sud. The goal of Nautilus is to support developers in developing, analyzing, debugging, fixing, testing, and evolving complex data transformations by providing a suite of algorithms and tools that accompany this process. The work proposed here will contribute to the query analysis and debugging components of Nautilus.

This project also fits in the work plan we propose within the KIC EIT ICT Labs Activity 2013 "DataBridges", a renewal of the successful activity of prior years to be coordinated by Melanie Herschel.

Objectives

The goal of this research is to show that, for a significant fraction of SQL data transformations, the question of why some data is missing from a transformation's output can be answered by computing the provenance of the missing data in the form of so-called explanations. This core hypothesis dictates the following goals:
Development of a framework that (i) unifies the existing, differing representations of missing data provenance and (ii) analyzes and defines interesting properties of the input and output.
Definition of efficient and effective algorithms to compute missing data provenance in compliance with the proposed framework. We envision a new type of algorithm whose explanations unify the multiple explanation types that exist today.
Experimental validation of the proposed solutions to assess both the efficiency and the usability of the computed explanations for analyzing and debugging complex data transformations.

Work program

The work program consists of eight work packages, briefly outlined below. Work packages 1 to 3 contribute to the framework development, 4 through 6 devise new algorithms to compute missing data provenance in the form of explanations, and 7 and 8 target the experimental validation.

1. Supported SQL transformations
2. Generated explanation types
3. Complexity analysis for different types of transformations and explanations
4. Develop / extend algorithms computing instance-based explanations, thus pursuing our work started with the Artemis algorithm (HH10).
5. Develop / extend algorithms computing query-based explanations such as (CJ09).
6. Develop algorithms computing hybrid explanations (that unify query-based and instance-based explanations).
7. Implement the proposed algorithms in Java as part of an Eclipse Plugin, the general framework chosen to implement Nautilus (HHT09).
8. Evaluation of the proposed algorithms, including a comparative evaluation with existing approaches.
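To make the notion of an instance-based explanation more concrete, here is a rough Python sketch using the standard sqlite3 module. It tries candidate edits to the source data and keeps those after which the missing tuple appears in the query result. The table emp, the query, the candidate edits, and the function names are all assumptions made for illustration; this brute-force search is not the Artemis algorithm of (HH10) and ignores efficiency entirely.

```python
import sqlite3

# Hypothetical toy source data and transformation, for illustration only.
SETUP = """
    CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO emp VALUES
        ('Ana', 'R&D', 5200),
        ('Bob', 'R&D', 3900),
        ('Cy', 'Sales', 6100);
"""
QUERY = "SELECT name FROM emp WHERE dept = 'R&D' AND salary > 4000"


def make_db():
    """Build a fresh in-memory copy of the source so edits never accumulate."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def instance_based_explanations(expected, candidate_edits):
    """Return the single-value source edits after which `expected` is output."""
    explanations = []
    for column, value in candidate_edits:
        conn = make_db()
        # Column names are hard-coded by the caller in this sketch,
        # so interpolating them into the statement is acceptable here.
        conn.execute(
            f"UPDATE emp SET {column} = ? WHERE name = ?", (value, expected)
        )
        names = {row[0] for row in conn.execute(QUERY)}
        conn.close()
        if expected in names:
            explanations.append((column, value))
    return explanations


candidates = [("salary", 4500), ("dept", "Sales"), ("salary", 4000)]
explanations = instance_based_explanations("Bob", candidates)
print(explanations)
```

A query-based explanation would instead leave the data untouched and point to the part of the query (here, the salary condition) that filters the expected tuple out. A hybrid explanation would combine both views.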

Extra information

(CJ09) A. Chapman, H.V. Jagadish. Why Not? In Proceedings of the Conference on the Management of Data (SIGMOD), 2009.
(HHT09) M. Herschel, M.A. Hernandez, W.C. Tan. Artemis: A System for Analyzing Missing-Answers. Proceedings of the VLDB Endowment, Volume 2, August 2009.
(HH10) M. Herschel, M.A. Hernandez. Explaining Missing Answers to SPJUA Queries. Proceedings of the VLDB Endowment, Volume 3, September 2010.
(HG11) M. Herschel, T. Grust. In Proceedings of the VLDB QDB Workshop, 201

Prerequisite

The candidate should have advanced knowledge in the area of databases and should have prior experience in Java programming.

Details

Expected funding

Institutional funding

Status of funding

Expected

Candidates

User

melanie.herschel

Created

Thursday 03 May 2012 15:02:20 CEST

Last modified

Monday 21 May 2012 11:51:39 CEST

Ecole Doctorale Informatique Paris-Sud

Director
Nicole Bidoit
Assistant
Stéphanie Druetta
Thesis counselor
Dominique Gouyou-Beauchamps

ED 427 - Université Paris-Sud
UFR Sciences Orsay
Building 650 - north wing - 417
Tel : 01 69 15 63 19
Fax : 01 69 15 63 87
email: ed-info at lri.fr