Research (Corpus Linguistics) : Jean-Yves Antoine

Presentation

Corpus linguistics - Corpus-driven approaches have a growing influence on natural language processing. Although they have enabled the successful development of many operational systems, data-based NLP methods most often come down to black-box approaches in which corpora are seen only as input for some learning mechanism. In my opinion, corpus linguistics should greatly help NLP go beyond this restrictive approach, thereby enabling the development of advanced (linguistically motivated) language models. Consider, for instance, task-oriented man-machine dialogue:

the analysis of large task-specific corpora should provide a precise characterization of the phenomena that occur in the corresponding application domains. This characterization is also helpful for evaluation purposes;
the comparison (by means of statistical data analyses) of corpora from different application domains should usefully assess the influence of the task. It should therefore provide answers to the important problem of genericity.

I conduct corpus analyses that aim to characterize more precisely several linguistic phenomena that directly affect the robustness of spoken language processing:

Speech disfluencies (hesitations, repetitions, self-repairs...)
Word order variations (WOV) in conversational speech: our studies on a large variety of corpora have shown that although WOV are very frequent in spontaneous spoken French, they follow striking regularities which demonstrate that spoken French remains an SVO (Subject-Verb-Object) language.
Anaphora and co-reference in spoken language. Our studies, to be continued in the ANCOR project, have shown that it would be risky to treat number agreement between coreferent entities as a mandatory constraint in conversational spoken French. In particular, we have precisely quantified the influence of metonymy on violations of this constraint.

Corpus building and diffusion - The availability of spoken language resources differs greatly from one language to another. In contrast with the large amount of data available for English, there is a real need for large French corpora of conversational speech. This is why I worked for several years on the PAROLE PUBLIQUE program. This project aimed at the collection of a large corpus of spoken French dialogues restricted to several specific tasks (tourist information, switchboard, child interaction). All of the collected corpora are freely distributed on the Web for any academic use.

I now pursue this objective in the framework of two projects (CO2 and ANCOR) that concern the coreference annotation of large speech corpora. This work has led to ANCOR_Centre, an annotated corpus of spontaneous speech including around 500 000 words, 100 000 mentions and 50 000 coreference relations. It represents the largest corpus of spoken French with coreference and anaphora annotations. In addition, we have extended this work towards temporal annotation in the framework of the TEMPORAL project. Our first aim here is to question the relevance of the TimeML standard.

Annotation reliability - Together with Jeanne Villaneau (IRISA), I am conducting experimental studies on the reliability ... of reliability measures. This investigation concerns the factors that may bias standard inter-coder agreement measures such as Cohen's Kappa, Scott's Pi or Krippendorff's alpha.
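As a rough illustration of two of the agreement measures named above, the following Python sketch computes Cohen's Kappa and Scott's Pi for two coders on a nominal annotation task. The function names and the label sequences are hypothetical and chosen only for illustration; they are not data from our studies. Both measures correct the observed agreement for chance agreement, and differ only in how chance agreement is estimated.

```python
from collections import Counter


def observed_agreement(coder_a, coder_b):
    """Proportion of items on which the two coders chose the same label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must annotate the same non-empty item list")
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)


def cohen_kappa(coder_a, coder_b):
    """Cohen's Kappa: chance agreement uses each coder's own label distribution."""
    n = len(coder_a)
    p_obs = observed_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    p_exp = sum((freq_a[k] / n) * (freq_b[k] / n) for k in labels)
    return (p_obs - p_exp) / (1 - p_exp)


def scott_pi(coder_a, coder_b):
    """Scott's Pi: chance agreement uses one distribution pooled over both coders."""
    n = len(coder_a)
    p_obs = observed_agreement(coder_a, coder_b)
    pooled = Counter(coder_a) + Counter(coder_b)
    p_exp = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical annotations of ten items with two categories.
coder_a = ["A", "A", "A", "A", "B", "B", "B", "B", "A", "B"]
coder_b = ["A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]

print("observed agreement:", observed_agreement(coder_a, coder_b))
print("Cohen's Kappa:", round(cohen_kappa(coder_a, coder_b), 3))
print("Scott's Pi:", round(scott_pi(coder_a, coder_b), 3))
```

On this toy example the coders agree on 7 items out of 10, yet the two coefficients differ slightly because the coders do not use the categories with the same frequencies. This is one simple instance of how the choice of measure can affect the reported reliability.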

Works and projects

PAROLE PUBLIQUE program - free distribution of French spoken dialogue corpora (tourist information, switchboard, children's dialogues).
EPAC project (2007-2010) - Rich annotation (POS, chunking, named entities) of large corpora of conversational speech.
VARILING project (2007-2010) - Named entity tagging in the ESLO2 corpus (Enquête Sociolinguistique de l'Oral d'Orléans).
CO2 and ANCOR projects - linguistic study of co-reference in speech corpora.
TEMPORAL project on temporal annotation.

Some publications

Coreference

Muzerelle J., Lefeuvre A., Schang E., Antoine J.-Y., Pelletier A., Maurel D., Eshkol I., Villaneau J. (2014) ANCOR_Centre, a Large Free Spoken French Coreference Corpus: Description of the Resource and Reliability Measures. Proc. LREC’2014, Reykjavik, Iceland [HAL_01075679].

Judith MUZERELLE, Aurore BOYER, Jean-Yves ANTOINE, Emmanuel SCHANG, Iris ESHKOL, Denis MAUREL (2012) Annotation en relations anaphoriques d'un corpus de discours oral spontané en français, Actes Congrès Mondial de Linguistique Française, Lyon [HAL-00788164].

Emmanuel SCHANG, Aurore BOYER, Judith MUZERELLE, Jean-Yves ANTOINE, Iris ESHKOL, Denis MAUREL (2011) Coreference and anaphoric annotations for spontaneous speech corpora in French. Proc. DAARC'2011, Discourse Anaphora and Anaphor Resolution Colloquium, Faro, Portugal [HAL-00831414].

Data Reliability

Antoine J.-Y., Villaneau J., Lefeuvre A. (2014) Weighted Krippendorff's alpha is a more reliable metrics for multi-coders ordinal annotations: experimental studies on emotion, opinion and coreference annotation. Proc. 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL’2014, Gothenburg, Sweden [ACL Anthology E14-1058; HAL-01001811].

Corpus Linguistics

Jean-Yves ANTOINE, Jérôme GOULIAN, Jeanne VILLANEAU, Marc LE TALLEC (2009) Word Order Phenomena in Spoken French: a Study on Four Corpora of Task-Oriented Dialogue and its Consequences on Language Processing. Proc. Corpus Linguistics’2009, Liverpool, UK, July 2009 [HAL-00483777].

Pascale NICOLAS, Sabine LETELLIER-ZARSHENAS, Igor SCHADLE, Jean-Yves ANTOINE, Jean CAELEN (2002) Towards a large corpus of spoken dialogue in French that will be freely available: the PAROLE PUBLIQUE project. Proc. LREC’2002, Las Palmas de Gran Canaria, Spain, pp. 649-655.
Jean-Yves ANTOINE, Jérôme GOULIAN (2001) Word order variations and spoken man-machine dialogue in French: a corpus analysis on the ATIS domain. Proc. Corpus Linguistics'2001, Lancaster, UK, pp. 22-29.