Complex data transformations appear in numerous applications, such as data warehousing, data integration, and data cleaning. As transformations grow more complex, so does the effort of developing and understanding them. Data provenance techniques, which trace output data back to the input data that contributed to it, can help in understanding such transformations by explaining how the output was produced. However, especially during transformation development, it is crucial not only to explain existing output data but also to explain why expected data is missing from the output.
Our goal is to devise the theoretical foundation and to propose novel algorithms to automatically compute the data provenance of missing output data. The aim is to explain to data transformation developers why they are not obtaining the desired output, based on data examples and intuitive transformation representations.
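To make the question concrete, the following is a minimal sketch of a missing-answer scenario. The tables, column names, data, and the helper `first_stage_dropping` are all hypothetical and chosen for illustration only; this is not an algorithm from Nautilus or from the cited work. The developer expects 'Homer' in the output, and the sketch checks after which step of the transformation that value disappears.

```python
import sqlite3

# Hypothetical toy data; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author(aid INTEGER, name TEXT);
    CREATE TABLE book(aid INTEGER, title TEXT, price REAL);
    INSERT INTO author VALUES (1, 'Homer'), (2, 'Sophocles');
    INSERT INTO book VALUES (1, 'Odyssey', 49.0), (2, 'Antigone', 15.0);
""")

# The transformation under development, split into cumulative stages:
# first the join alone, then the join followed by the selection.
stages = [
    ("join author-book",
     "SELECT a.name FROM author a JOIN book b ON a.aid = b.aid"),
    ("selection price < 20",
     "SELECT a.name FROM author a JOIN book b ON a.aid = b.aid "
     "WHERE b.price < 20"),
]

def first_stage_dropping(conn, stages, expected_name):
    """Return the label of the first stage whose result lacks expected_name."""
    for label, sql in stages:
        names = {row[0] for row in conn.execute(sql)}
        if expected_name not in names:
            return label
    return None  # the value survives every stage

# Output of the full transformation (the last stage).
final_output = [row[0] for row in conn.execute(stages[-1][1])]
culprit = first_stage_dropping(conn, stages, "Homer")

print(final_output)  # ['Sophocles']
print(culprit)       # selection price < 20
```

Here the join still produces 'Homer', but the selection removes it because the only matching book costs 49.0. Pointing the developer to that step is the kind of answer this research aims to compute automatically.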
Context
This thesis topic is part of the Nautilus project (http://nautilus-system.org, (HG11)), which we are pursuing in the database group of Université Paris Sud. The goal of Nautilus is to support developers in developing, analyzing, debugging, fixing, testing, and evolving complex data transformations by providing a suite of algorithms and tools that accompany this process. The work proposed here will contribute to the query analysis and debugging components of Nautilus.
This project also fits into the work plan we propose within the KIC EIT ICT Labs Activity 2013 "DataBridges", a renewal of the successful activity of prior years, to be coordinated by Melanie Herschel.
Objectives
The goal of this research is to show that the question of why some data is missing from a data transformation's output can be answered for a significant fraction of SQL data transformations by computing the provenance of the missing data in the form of so-called explanations. This core hypothesis dictates the following goals:
Development of a framework that (i) unifies the existing, differing representations of missing data provenance and (ii) analyzes and defines interesting properties of the input and output.
Definition of efficient and effective algorithms to compute missing data provenance complying with the proposed framework. We envision a new type of algorithm that computes explanations unifying the multiple explanation types that exist today.
Experimental validation of the proposed solutions to assess both the efficiency and the usability of the computed explanations for analyzing and debugging complex data transformations.
Work program
The work program consists of eight work packages, briefly outlined below. Work packages 1 to 3 contribute to the framework development, 4 through 6 devise new algorithms to compute missing data provenance in the form of explanations, and 7 and 8 address the experimental validation.
Supported SQL transformations
Generated explanation types
Complexity analysis for different types of transformations and explanations
Develop / extend algorithms computing instance-based explanations, thus pursuing our work started with the Artemis algorithm (HH10).
Develop / extend algorithms computing query-based explanations such as (CJ09).
Implement the proposed algorithms in Java as part of an Eclipse Plugin, the general framework chosen to implement Nautilus (HHT09).
Evaluation of the proposed algorithms, including a comparative evaluation with existing approaches.
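The work packages above distinguish instance-based from query-based explanations. Roughly, a query-based explanation points to the part of the query responsible for a missing answer, whereas an instance-based explanation describes how the source data would have to differ for the answer to appear. The following brute-force sketch illustrates the instance-based idea only. It is not the Artemis algorithm; the table, the data, the candidate values, and the function names `run_query` and `instance_explanations` are assumptions made for this example.

```python
import sqlite3

def run_query(books):
    """Evaluate the toy transformation on a list of (title, price) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book(title TEXT, price REAL)")
    conn.executemany("INSERT INTO book VALUES (?, ?)", books)
    rows = conn.execute("SELECT title FROM book WHERE price < 20").fetchall()
    conn.close()
    return {r[0] for r in rows}

def instance_explanations(books, missing_title, candidate_prices):
    """List single-value price changes that make missing_title appear.

    Each explanation is (title, attribute, old_value, new_value).
    """
    explanations = []
    for i, (title, price) in enumerate(books):
        if title != missing_title:
            continue
        for new_price in candidate_prices:
            modified = list(books)          # copy, so the source stays intact
            modified[i] = (title, new_price)
            if missing_title in run_query(modified):
                explanations.append((title, "price", price, new_price))
    return explanations

# Hypothetical source data and candidate replacement values.
books = [("Odyssey", 49.0), ("Antigone", 15.0)]
explanations = instance_explanations(books, "Odyssey", [10.0, 19.99, 25.0])
for e in explanations:
    print(e)
```

Enumerating candidate values and re-running the query for each one does not scale, which is one reason the complexity analysis and dedicated algorithms listed in the work program are needed.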
Extra information
(CJ09) A. Chapman, H.V. Jagadish. Why Not? In Proceedings of the Conference on the Management of Data (SIGMOD), 2009.
(HHT09) M. Herschel, M.A. Hernandez, W.C. Tan. Artemis: A System for Analyzing Missing-Answers. Proceedings of the VLDB Endowment, Volume 2, August 2009.
(HH10) M. Herschel, M.A. Hernandez. Explaining Missing Answers to SPJUA Queries. Proceedings of the VLDB Endowment, Volume 3, September 2010.
(HG11) M. Herschel, T. Grust. In Proceedings of the VLDB QDB Workshop, 201
Prerequisite
The candidate should have advanced knowledge in the area of databases and should have prior experience in Java programming.
Details
Expected funding
Institutional funding
Status of funding
Expected
Candidates
User
melanie.herschel
Created
Thursday 03 May 2012 15:02:20 CEST
Last modified
Monday 21 May 2012 11:51:39 CEST