Presentation
Corpus linguistics - The influence of
corpus-driven approaches is increasingly important in natural language
processing. Although they have enabled the successful development of
many operational systems, data-based NLP methods most often come
down to black-box approaches in which corpora are seen only as input
for some learning mechanism. In my opinion, corpus linguistics should
help NLP go beyond this restrictive
approach, thereby enabling the development of advanced (linguistically
motivated) language models. Consider for instance task-oriented
man-machine dialogue:
- the analysis of large task-specific corpora should provide a precise
characterization of the phenomena that occur in the corresponding
application domains. This characterization is also helpful for
evaluation purposes;
- the comparison (by means of statistical data analyses) of corpora
from different application domains should help assess the influence of
the task. It should therefore provide answers to the important problem
of genericity.
I conduct corpus analyses that aim to
characterize more precisely several linguistic phenomena that have a
direct influence on the robustness of spoken language processing:
- Speech disfluencies (hesitations, repetitions, self-repairs, etc.)
- Word order variations (WOV) in conversational speech: our studies on
a large variety of corpora have shown that although WOV are very
frequent in spontaneous spoken French, they follow striking
regularities which demonstrate that spoken French remains an SVO
(Subject-Verb-Object) language.
- Anaphora and coreference in spoken language. Our studies, to be
continued in the ANCOR project, have shown that it would be risky to
treat number agreement between coreferent entities as a mandatory
constraint in conversational spoken French. In particular, we have
precisely quantified the influence of metonymy on violations of
this constraint.
Corpus building and diffusion -
The availability of spoken language resources varies greatly from
one language to another. Aside from the large amount of data
available in English, there is a real need for large
French-language corpora of conversational speech. This is why I worked
for several years on the PAROLE PUBLIQUE program.
This project aimed at collecting a large corpus of spoken French
dialogues restricted to several specific tasks (tourist information,
switchboard, child interaction). All of the collected corpora are
freely distributed on the web for any academic use.
I now pursue this objective within
two projects (CO2 and ANCOR) that concern the
coreference annotation of large speech corpora.
This work has led to ANCOR_Centre, an annotated corpus of spontaneous
speech comprising around 500,000 words, 100,000 mentions and 50,000
coreference relations. It represents the largest corpus of spoken
French with coreference and anaphora annotations. In addition, we have
extended this work towards temporal annotation within the TEMPORAL
project. Our first aim here is to question the relevance of the TimeML standard.
Annotation reliability -
I am conducting, with Jeanne Villaneau (IRISA), experimental studies on
the reliability ... of reliability measures. This investigation
concerns the factors that may bias standard
inter-coder agreement measures such as Cohen's Kappa, Scott's Pi or
Krippendorff's alpha.
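As an illustration of two of the measures named above, here is a minimal sketch of Cohen's Kappa and Scott's Pi for two coders labelling the same items with nominal categories. It is not the code used in our studies; the function names and the toy labels are hypothetical. Both measures correct the observed agreement for chance agreement and differ only in how chance is estimated: Kappa uses each coder's own label distribution, Pi uses the distribution pooled over both coders.

```python
from collections import Counter

def observed_agreement(a, b):
    """Proportion of items on which the two coders chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's Kappa: chance agreement from each coder's own distribution."""
    n = len(a)
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    # Undefined (division by zero) when pe == 1, i.e. a single label overall.
    return (po - pe) / (1 - pe)

def scott_pi(a, b):
    """Scott's Pi: chance agreement from the distribution pooled over coders."""
    n = len(a)
    po = observed_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    pe = sum((c / (2 * n)) ** 2 for c in pooled.values())
    # Undefined (division by zero) when pe == 1, i.e. a single label overall.
    return (po - pe) / (1 - pe)

# Hypothetical toy annotation: four items, two coders.
coder_a = ["yes", "yes", "no", "no"]
coder_b = ["yes", "no", "no", "no"]
print(cohen_kappa(coder_a, coder_b))  # 0.5
print(scott_pi(coder_a, coder_b))     # about 0.467 (7/15)
```

On this toy example the two coders agree on 3 items out of 4, yet the two coefficients differ because the coders do not use the labels with the same frequencies. This sensitivity to the label distribution is one example of the kind of factor that can bias such measures.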
Works and projects
- PAROLE PUBLIQUE program - free distribution of French spoken
dialogue corpora on tourist information, switchboard and children's
dialogues.
- EPAC project (2007-2010) - Rich annotation (POS, chunking, named
entities) of large corpora of conversational speech.
- VARILING project (2007-2010) - Named entity tagging in the ESLO2
corpus (Enquête Sociolinguistique de l'Oral d'Orléans).
- CO2 and ANCOR projects - linguistic study of coreference in speech
corpora.
- TEMPORAL project on temporal annotation.
Some publications
Coreference
- Muzerelle J., Lefeuvre A., Schang E., Antoine J.-Y., Pelletier A.,
Maurel D., Eshkol I., Villaneau J. (2014) ANCOR_Centre, a Large Free
Spoken French Coreference Corpus: Description of the Resource and
Reliability Measures. Proc. LREC’2014, Reykjavik, Iceland
[HAL_01075679].
- Judith MUZERELLE, Aurore BOYER, Jean-Yves ANTOINE, Emmanuel SCHANG,
Iris ESHKOL, Denis MAUREL (2012) Annotation en relations anaphoriques
d'un corpus de discours oral spontané en français. Actes Congrès
Mondial de Linguistique Française, Lyon [HAL-00788164].
- Emmanuel SCHANG, Aurore BOYER, Judith MUZERELLE, Jean-Yves ANTOINE,
Iris ESHKOL, Denis MAUREL (2011) Coreference and anaphoric annotations
for spontaneous speech corpora in French. Proc. DAARC'2011, Discourse
Anaphora and Anaphor Resolution Colloquium, Faro, Portugal
[HAL-00831414].

Data Reliability
- Antoine J.-Y., Villaneau J., Lefeuvre A. (2014) Weighted
Krippendorff's alpha is a more reliable metrics for multi-coders
ordinal annotations: experimental studies on emotion, opinion and
coreference annotation. Proc. 14th Conference of the European Chapter
of the Association for Computational Linguistics, EACL’2014,
Gothenburg, Sweden [ACL Anthology E14-1058; HAL-01001811].
Corpus Linguistics
- Jean-Yves ANTOINE, Jérôme GOULIAN, Jeanne VILLANEAU, Marc LE TALLEC
(2009) Word Order Phenomena in Spoken French: a Study on Four Corpora
of Task-Oriented Dialogue and its Consequences on Language Processing.
Proc. Corpus Linguistics’2009, Liverpool, UK, July 2009
[HAL-00483777].
- Pascale NICOLAS, Sabine LETELLIER-ZARSHENAS, Igor SCHADLE, Jean-Yves
ANTOINE, Jean CAELEN (2002) Towards a large corpus of spoken dialogue
in French that will be freely available: the PAROLE PUBLIQUE project.
Proc. LREC’2002, Las Palmas de Gran Canaria, Spain, pp. 649-655.

- Jean-Yves ANTOINE, Jérôme GOULIAN (2001) Word order variations and
spoken man-machine dialogue in French: a corpus analysis on the ATIS
domain. Proc. Corpus Linguistics'2001, Lancaster, UK, pp. 22-29.