Jean-Yves ANTOINE


Corpus Linguistics and spoken language processing

Presentation


Corpus linguistics - Corpus-driven approaches are increasingly influential in natural language processing. Although they have enabled the successful development of many operational systems, data-driven NLP methods most often come down to black-box approaches in which corpora are seen only as input to some learning mechanism. In my opinion, corpus linguistics can help NLP go beyond this restrictive approach, thereby enabling the development of advanced (linguistically motivated) language models. Consider, for instance, task-oriented man-machine dialogue:
  • the analysis of large task-specific corpora should provide a precise characterization of the phenomena that occur in the corresponding application domains. This characterization is also helpful for evaluation purposes,
  • the comparison (by means of statistical data analyses) of corpora from different application domains should help assess the influence of the task. It should therefore provide answers to the important problem of genericity.
I conduct corpus analyses that aim to characterize more precisely several linguistic phenomena that directly affect the robustness of spoken language processing:
  • Speech disfluencies (hesitations, repetitions, self-repairs, etc.)
  • Word order variations (WOV) in conversational speech: our studies on a large variety of corpora have shown that although WOV are very frequent in spontaneous spoken French, they follow striking regularities which demonstrate that spoken French remains an SVO (Subject-Verb-Object) language.
  • Anaphora and co-reference in spoken language. Our studies, to be continued in the ANCOR project, have shown that it would be risky to treat number agreement between coreferent entities as a mandatory constraint in conversational spoken French. In particular, we have precisely quantified the influence of metonymy on violations of this constraint.

Corpus building and diffusion - The availability of spoken language resources varies greatly from one language to another. In contrast to the large amount of data available in English, there is a real need for large French corpora of conversational speech. This is why I worked for several years on the PAROLE PUBLIQUE program. This project aimed to collect a large corpus of spoken French dialogues restricted to several specific tasks (tourist information, switchboard, child interaction). All of the collected corpora are freely distributed on the Web for academic use.

I am now pursuing this objective within two projects (CO2 and ANCOR) that concern the coreference annotation of large speech corpora. This work has led to ANCOR_Centre, an annotated corpus of spontaneous speech comprising around 500,000 words, 100,000 mentions and 50,000 coreference relations. It is the largest corpus of spoken French with coreference and anaphora annotations. We have also extended this work toward temporal annotation within the TEMPORAL project. Our first aim here is to question the relevance of the TimeML standard.

Annotation reliability - Together with Jeanne Villaneau (IRISA), I am conducting experimental studies on the reliability ... of reliability measures. This investigation concerns the factors that may bias standard inter-coder agreement measures such as Cohen's Kappa, Scott's Pi or Krippendorff's alpha.
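
To make the two-coder measures named above concrete, here is a minimal Python sketch of Cohen's Kappa and Scott's Pi for nominal labels. It is only an illustration under stated assumptions, not the protocol used in these studies: the function names and the toy labels are hypothetical. Both measures correct the observed agreement for chance agreement; they differ in how chance agreement is estimated (per-coder label distributions for Kappa, a pooled distribution for Pi), which is one reason they can diverge when coders have different labelling habits.

```python
from collections import Counter


def observed_agreement(coder_a, coder_b):
    """Proportion of items on which the two coders chose the same label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    return sum(x == y for x, y in zip(coder_a, coder_b)) / len(coder_a)


def cohen_kappa(coder_a, coder_b):
    """Cohen's Kappa: chance agreement uses each coder's own distribution."""
    n = len(coder_a)
    p_o = observed_agreement(coder_a, coder_b)
    count_a, count_b = Counter(coder_a), Counter(coder_b)
    labels = set(count_a) | set(count_b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in labels)
    # Undefined (division by zero) when both coders use one single label.
    return (p_o - p_e) / (1 - p_e)


def scott_pi(coder_a, coder_b):
    """Scott's Pi: chance agreement uses the pooled label distribution."""
    n = len(coder_a)
    p_o = observed_agreement(coder_a, coder_b)
    pooled = Counter(coder_a) + Counter(coder_b)
    p_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    # Undefined (division by zero) when only one label is ever used.
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical toy annotations: two coders, four items, two labels.
    a = ["x", "x", "y", "y"]
    b = ["x", "y", "y", "y"]
    print(cohen_kappa(a, b), scott_pi(a, b))
```

On this toy example the coders agree on three items out of four, yet the two coefficients differ because coder b uses the label "y" more often than coder a. Krippendorff's alpha, which also handles more than two coders and ordinal distances, is not covered by this sketch.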

Works and projects


  • PAROLE PUBLIQUE program - free distribution of French spoken dialogue corpora on tourist information, switchboard and children's dialogues.
  • EPAC project (2007-2010) - Rich annotation (POS, chunking, named entities) of large corpora of conversational speech.
  • VARILING project (2007-2010) - Named entity tagging in the ESLO2 corpus (Enquête Sociolinguistique de l'Oral d'Orléans).
  • CO2 and ANCOR projects - linguistic study of co-reference in speech corpora.
  • TEMPORAL project on temporal annotation.

Some publications


Coreference
  • Muzerelle J., Lefeuvre A., Schang E., Antoine J.-Y., Pelletier A., Maurel D., Eshkol I., Villaneau J. (2014) ANCOR_Centre, a Large Free Spoken French Coreference Corpus: Description of the Resource and Reliability Measures. Proc. LREC’2014, Reykjavik, Iceland [HAL_01075679].
  • Judith MUZERELLE, Aurore BOYER, Jean-Yves ANTOINE, Emmanuel SCHANG, Iris ESHKOL, Denis MAUREL (2012) Annotation en relations anaphoriques d'un corpus de discours oral spontané en français, Actes Congrès Mondial de Linguistique Française, Lyon [HAL-00788164].
  • Emmanuel SCHANG, Aurore BOYER, Judith MUZERELLE, Jean-Yves ANTOINE, Iris ESHKOL, Denis MAUREL (2011) Coreference and anaphoric annotations for spontaneous speech corpus in French. Proc. DAARC'2011, Discourse Anaphora and Anaphor Resolution Colloquium, Faro, Portugal [HAL-00831414].
Data Reliability
  • Antoine J.-Y., Villaneau J., Lefeuvre A. (2014) Weighted Krippendorff's alpha is a more reliable metrics for multi-coders ordinal annotations: experimental studies on emotion, opinion and coreference annotation. Proc. 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL’2014, Gothenburg, Sweden [ACL Anthology E14-1058; HAL-01001811].
Corpus Linguistics
  • Jean-Yves ANTOINE, Jerome GOULIAN, Jeanne VILLANEAU, Marc LE TALLEC (2009) Word Order Phenomena in Spoken French : a Study on Four Corpora of Task-Oriented Dialogue and its Consequences on Language Processing. Proc. Corpus Linguistics’2009, Liverpool, UK, July 2009 [HAL-00483777].
  • Pascale NICOLAS, Sabine Letellier-Zarshenas, Igor SCHADLE, Jean-Yves ANTOINE, Jean CAELEN (2002) Towards a large corpus of spoken dialogue in French that will be freely available: the PAROLE PUBLIQUE project. Proc. LREC’2002, Las Palmas de Gran Canaria, Spain, pp. 649-655.
  • Jean-Yves ANTOINE, Jérôme GOULIAN (2001) Word order variations and spoken man-machine dialogue in French: a corpus analysis on the ATIS domaine. Proc. Corpus Linguistics'2001, Lancaster, UK, pp. 22-29.

Jean-Yves ANTOINE - Last update: September 30th, 2014