Natural Language Speech and Audio Processing

Domain
Natural Language Speech and Audio Processing
Domain - extra
Machine Learning, Multimedia, Computer Vision
Year
2014
Starting
September
Status
Open
Subject
Unsupervised Character Identification in TV Series
Thesis advisor
BARRAS Claude
Co-advisors
BREDIN Hervé
Laboratory
Collaborations
Karlsruher Institut für Technologie (KIT, Germany)
Abstract
Automatic character identification in multimedia videos is a broad and challenging problem. Person identities can serve as a foundation and building block for many higher-level video analysis tasks, for example semantic indexing, search and retrieval, interaction analysis and video summarization.

The goal of this project is to exploit both audio and video information to automatically identify characters in TV series and movies without requiring any manual annotation for training character models. A fully automatic and unsupervised approach is especially appealing given the huge amount of available multimedia data (and its growth rate). Audio and video provide complementary cues to the identity of a person, and combining them allows a person to be identified more reliably than from either modality alone.
Context
While many generic approaches to unsupervised clustering exist, they are not adapted to heterogeneous audiovisual data (face tracks vs. speech turns) and do not perform as well on challenging TV series and movie content as they do on more controlled data. Our general approach is therefore to first over-cluster the data and make sure that clusters remain pure, before assigning names to these clusters in a second step. On top of domain-specific improvements for each modality, we also expect joint multimodal clustering to take advantage of both modalities and improve clustering performance over each modality alone.
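
The over-clustering idea can be illustrated with a minimal sketch. It assumes each face track or speech turn has already been reduced to a feature vector, and it uses complete-linkage agglomerative clustering with a strict stopping threshold as a stand-in; the function names, the `max_diameter` parameter and the toy data are hypothetical and are not the project's actual method.

```python
from itertools import combinations
from math import dist


def over_cluster(points, max_diameter):
    """Complete-linkage agglomerative clustering with a strict stopping rule.

    Two clusters are merged only while the largest pairwise distance inside
    the merged cluster stays at or below `max_diameter` (hypothetical
    parameter). A small threshold yields many small clusters
    (over-clustering) that are more likely to stay pure.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        (x, y), d = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > max_diameter:
            break  # any further merge would risk mixing identities
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = 0
    for c in clusters:
        counts = {}
        for i in c:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        majority += max(counts.values())
    return majority / total


# Toy 2-D "embeddings" and reference identities (illustrative only).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.5, 0.5)]
labels = ["A", "A", "B", "B", "A"]
strict = over_cluster(points, max_diameter=0.3)
loose = over_cluster(points, max_diameter=2.0)
print(len(strict), purity(strict, labels))
print(len(loose), purity(loose, labels))
```

On this toy data the strict threshold leaves more clusters than there are identities but keeps them pure, while the loose threshold merges everything and purity drops. This is the trade-off that motivates naming the clusters only in a second step.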

Objectives
The goal of this project is to exploit both audio and video information to automatically identify characters in TV series and movies without requiring any manual annotation for training character models. A fully automatic and unsupervised approach is especially appealing given the huge amount of available multimedia data. Audio and video provide complementary cues to the identity of a person, and combining them allows a person to be identified more reliably than from either modality alone.

To this end, we will address three main research questions: unsupervised clustering of speech turns (i.e. speaker diarization) and face tracks in order to group similar tracks of the same person without prior labels or models; unsupervised identification by propagation of automatically generated weak labels from various sources of information (such as subtitles and speech transcripts); and multimodal fusion of acoustic, visual and textual cues at various levels of the identification pipeline.
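
As an illustration of the third question, the sketch below shows one simple form of multimodal fusion: late fusion, where per-character scores from the audio and visual modalities are combined by a weighted sum. The function names, the `audio_weight` parameter and the example scores are assumptions made for this sketch; the posting does not specify which fusion scheme will be used.

```python
def fuse_scores(audio, visual, audio_weight=0.5):
    """Late fusion of per-character scores from two modalities.

    `audio` and `visual` map character names to scores in [0, 1]. A modality
    may be missing (None), for example a face track with no overlapping
    speech turn, in which case the other modality is used alone.
    """
    if audio is None and visual is None:
        return {}
    if audio is None:
        return dict(visual)
    if visual is None:
        return dict(audio)
    names = set(audio) | set(visual)
    return {
        name: audio_weight * audio.get(name, 0.0)
        + (1.0 - audio_weight) * visual.get(name, 0.0)
        for name in names
    }


def best_identity(scores):
    """Name with the highest fused score, or None if there is no evidence."""
    return max(scores, key=scores.get) if scores else None


# Hypothetical scores for one segment: the voice is ambiguous,
# the face is more decisive.
audio = {"Alice": 0.6, "Bob": 0.4}
visual = {"Alice": 0.2, "Bob": 0.8}
print(best_identity(fuse_scores(audio, visual)))
print(best_identity(fuse_scores(audio, None)))
```

Late fusion is only one of the "various levels" mentioned above; fusing features before clustering, or fusing cluster assignments afterwards, would be alternatives at other stages of the pipeline.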
Work program
Unsupervised identification aims at assigning character names to clusters in a completely automatic manner (i.e. using only information already present in the speech and video). In TV series and movies, character names are usually introduced and reiterated throughout the video. We will detect and use addresser-addressee relationships in both speech (using named entity detection techniques) and video (using mouth movements, viewing direction and focus of attention of faces). This allows us to assign names to some clusters, learn discriminative models from them and then assign names to the remaining clusters.
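
The two-step naming described above can be sketched as follows. This is a simplified illustration under stated assumptions: weak labels are given as (cluster, name) votes, each cluster is summarised by a centroid vector, and a nearest-named-centroid rule stands in for the discriminative models. All names, parameters and data here are hypothetical.

```python
from collections import Counter
from math import dist


def name_clusters(centroids, weak_labels, min_votes=2):
    """Two-step naming of clusters.

    centroids:   dict cluster_id -> feature vector (tuple of floats)
    weak_labels: list of (cluster_id, name) pairs, one per weak cue
                 (e.g. a name detected in speech while a face is addressed)
    min_votes:   votes required before a cluster is named directly
    """
    votes = {}
    for cluster_id, name in weak_labels:
        votes.setdefault(cluster_id, Counter())[name] += 1

    # Step 1: name clusters that have enough agreeing weak labels.
    names = {}
    for cluster_id, counter in votes.items():
        name, count = counter.most_common(1)[0]
        if count >= min_votes:
            names[cluster_id] = name

    # Step 2: give each remaining cluster the name of its nearest
    # directly named cluster (a stand-in for learned discriminative models).
    seeds = dict(names)
    for cluster_id, centroid in centroids.items():
        if cluster_id not in seeds and seeds:
            nearest = min(seeds, key=lambda c: dist(centroid, centroids[c]))
            names[cluster_id] = seeds[nearest]
    return names


centroids = {0: (0.0, 0.0), 1: (0.2, 0.1), 2: (5.0, 5.0), 3: (5.1, 4.9)}
weak_labels = [(0, "Alice"), (0, "Alice"),
               (2, "Bob"), (2, "Bob"), (2, "Alice"),
               (3, "Alice")]
print(name_clusters(centroids, weak_labels))
```

In this toy run, cluster 3 has a single weak vote for "Alice", which is below `min_votes`, so it is named in the second step from its nearest named neighbour instead. Requiring several agreeing cues before naming a cluster directly is one way to limit the effect of noisy weak labels.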

For evaluation, we will extend and further annotate a corpus of three TV series (49 episodes) and one movie series (8 movies), a total of about 50 hours of video. This diverse data covers different filming styles, types of stories and challenges in both video and audio. We will evaluate the different steps of this project on this corpus, and also make our annotations public.
Extra information
This PhD offer is proposed by the Spoken Language Processing Group of the CNRS/LIMSI lab in Orsay (France).

Contact: Hervé Bredin (bredin@limsi.fr / http://herve.niderb.fr)
Prerequisite
  • Knowledge of machine learning, computer vision and/or speech processing.
  • Skills in the Python programming language.

Details
Expected funding
Research contract
Status of funding
Expected
Candidates
User
herve.bredin
Created
Thursday 06 March 2014 11:55:51 CET
Last modified
Thursday 06 March 2014 11:55:51 CET



Ecole Doctorale Informatique Paris-Sud


Director
Nicole Bidoit
Assistant
Stéphanie Druetta
Thesis counsellor
Dominique Gouyou-Beauchamps

ED 427 - Université Paris-Sud
UFR Sciences Orsay
Bat 650 - aile nord - 417
Tel: 01 69 15 63 19
Fax: 01 69 15 63 87
Email: ed-info at lri.fr