Ulf Leser (Professor for Knowledge Management in Bioinformatics at Humboldt Universität in Berlin)
The design and evaluation of the methods will be done in close collaboration with oncologists from the Institut Curie and the Children's Hospital of Philadelphia.
Abstract
The extremely rapid growth in the number of public biomedical data sources associated with the exponential growth of data available in these sources makes the research phase of biomedical information increasingly complex and tedious. Several query systems of biomedical sources are now available and used by physicians but none of them offers to rank the data obtained in response to a query. This is particularly problematic in a context where simple queries can return thousands of results.
Many classical ranking techniques can be considered for biomedical data, from very simple (eg, focus on the most recent data) to complex. The aim of this thesis is to propose a uniform framework able to represent and compare (conceptually and on real data) ranking methods and bring up new approaches for ranking large masses of biological data.
Context
Many classical ranking techniques can be considered for biomedical data, varying from simple (eg, most recent data first) to complex (e.g., techniques such as PageRank exploiting links between data) approaches. These algorithms exploit two key aspects of biological data: (i) linked to each other’s, forming an intricate network (graph-based algorithms) and (ii) plain-text file (text mining like algorithms). Ranking algorithms have then very different time and space complexities.
On the other hand, techniques able to calculate consensus rankings (aka median) have been proposed Fagin et al, PODS 2004, Ailon, Algorithmica 2010,Cohen-Boulakia et al, SSDBM 2011. They propose to generate a new ranking highlighting the common points of a set of rankings while minimizing their disagreements. The computation of the median of a set of m rankings is then known to be NP-hard when m>4. Several (exact and approximated) algorithms and heuristics have then been proposed.
Objectives
The aim of this thesis is to introduce a uniform framework able to represent and compare (conceptually and on real data) both classical and consensual ranking methods and bring up new approaches for ranking large masses of biological data.
Work program
(1) Formal study of the ranking algorithms. Criteria of comparison will be introduced to describe all algorithms in a uniform way. The complexity of each algorithm will be evaluated. As for NP-hard problems, assumptions and parameters causing the complexity will be underlined.
(2) Evaluation of the ranking algorithms. Data obtained by the various ranking algorithms will be compared to expected data and random data. Several metrics will be considered in collaboration with Prof. Ulf Leser.
(3) Reducing the search space (graph-topology). Study the size of the search space to be explored for the results to be still reliable while obtained in a reduced time and using a reduced space.
4) Mixing graphs-based and text mining-based techniques to rank biological data
This part of the thesis will introduce text-mining techniques into ranking algorithms to exploit the textual aspect of biological data, following research directions described in Cohen-Boulakia and Leser, ICDE 2011.
Extra information
Prerequisite
Knowledge of the structures of biological data sources and skills in algorithmics and in object-oriented programming are expected.