Research (Text mining) : Jean-Yves Antoine

Presentation

Our team at the LI lab conducts research on dictionaries and shallow parsing. Within this general framework, I am currently working on the recognition of named entities. Named Entity Recognition (NER) is an information extraction (IE) task that aims at extracting and categorizing specific entities (proper names or multi-word expressions, but also dedicated linguistic units such as time expressions, amounts, etc.) that denote a single element in the universe of discourse.

While NER is often considered a fairly simple task, there is still room for improvement in difficult contexts. For instance, NER systems may have to cope with noisy data, such as word sequences containing errors produced by automatic speech recognition (ASR). In addition, NER is no longer restricted to proper names (e.g. "Albert Einstein"), but may also involve common nouns (e.g. "the famous scientist") or complex multi-word expressions (e.g. "the Computer Science department of the New York University"). These complementary needs for robust and detailed processing explain why knowledge-based and data-driven approaches remain equally competitive on NER tasks, as shown by numerous evaluation campaigns. This is why the development of hybrid systems has been investigated by the NER community for years. In most cases, hybridization for NER relies on a fairly simple principle: the outputs of knowledge-based systems are used as features by a machine learning algorithm, among which Conditional Random Fields (CRF) are the most widely used. By contrast, our team is currently investigating (PhD of Damien Nouvel, 2012) the hybridization of a symbolic NER system with a data-driven system based on text mining techniques.
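The common hybridization principle mentioned above (knowledge-based outputs used as features by a learner) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the toy gazetteer `FIRST_NAMES`, the `rule_based_tags` tagger, and the feature names are assumptions made for illustration, not part of CasEN or of any specific CRF toolkit. The per-token feature dictionaries it builds are only the kind of input a feature-based sequence learner such as a CRF could consume.

```python
from typing import Dict, List

# Hypothetical toy gazetteer standing in for a real knowledge-based system.
FIRST_NAMES = {"albert", "marie"}


def rule_based_tags(tokens: List[str]) -> List[str]:
    """Toy symbolic tagger: a known first name starts a person entity,
    and a capitalized word right after it continues the entity."""
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() in FIRST_NAMES:
            tags[i] = "B-PERS"
            if i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
                tags[i + 1] = "I-PERS"
    return tags


def token_features(tokens: List[str], i: int, kb_tags: List[str]) -> Dict[str, object]:
    """Build the feature dictionary of token i, including the symbolic tag."""
    feats: Dict[str, object] = {
        "lower": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "kb_tag": kb_tags[i],  # output of the knowledge-based system
    }
    if i > 0:
        feats["prev_kb_tag"] = kb_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Run the symbolic tagger, then expose its tags as learner features."""
    kb_tags = rule_based_tags(tokens)
    return [token_features(tokens, i, kb_tags) for i in range(len(tokens))]
```

With this design the learner remains free to override the symbolic system: the rule-based tag is just one piece of evidence among the other features.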

Our team has developed CasEN, a knowledge-based NER system based on finite state transducers, which took part in the French-speaking Ester 2 evaluation campaign. Despite its encouraging performance, manually extending the coverage of such a hand-crafted system is a difficult and tedious task. This is why we have investigated the use of text mining techniques to automatically extract sequential patterns correlated with NEs. Our first aim was to automatically obtain detection patterns that could be integrated into CasEN as transduction rules. We are also investigating a hybrid approach in which hand-crafted and automatically extracted detection rules are applied concurrently.

Our pattern mining approach is based on the principle of hierarchical sequence mining. This means that we mine generalized patterns whose items can be words, their lemmas, their POS categories or even their semantic types. The second original feature of our approach is that we try to detect the start and the end of every named entity separately. We expect this separate detection to be more robust to speech recognition errors and to speech disfluencies as well. This point was assessed during the ETAPE evaluation campaign. Our symbolic NER system CasEN won this competitive evaluation, in which our text-mining system mXS presented encouraging results (ranked 3rd or 4th, depending on the task considered).
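The two ideas above (patterns generalized over a word / lemma / POS / semantic type hierarchy, and separate detection of entity starts and ends) can be sketched as follows. This is a simplified illustration under stated assumptions, not the mXS implementation: the `Token` layout, the `word:` / `lemma:` / `pos:` / `sem:` prefixes, the brute-force enumeration of n-grams, and the greedy pairing strategy in `pair_boundaries` are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import product
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical token layout: (word, lemma, POS category, semantic type or None).
Token = Tuple[str, str, str, Optional[str]]
Pattern = Tuple[str, ...]


def token_levels(tok: Token) -> List[str]:
    """All hierarchy levels at which a token may appear in a pattern."""
    word, lemma, pos, sem = tok
    levels = ["word:" + word.lower(), "lemma:" + lemma, "pos:" + pos]
    if sem:
        levels.append("sem:" + sem)
    return levels


def generalizations(ngram: Sequence[Token]) -> Set[Pattern]:
    """Every generalized pattern matching the n-gram (one level per token)."""
    return set(product(*(token_levels(t) for t in ngram)))


def pattern_support(sentences: List[List[Token]], n: int = 2) -> Counter:
    """Count, for each generalized n-gram pattern, the sentences it occurs in."""
    support: Counter = Counter()
    for sent in sentences:
        seen: Set[Pattern] = set()
        for i in range(len(sent) - n + 1):
            seen |= generalizations(sent[i:i + n])
        support.update(seen)  # each pattern counts once per sentence
    return support


def pair_boundaries(begins: List[int], ends: List[int]) -> List[Tuple[int, int]]:
    """Greedily pair separately detected start and end positions into
    non-overlapping (start, end) spans; unmatched markers are dropped."""
    spans: List[Tuple[int, int]] = []
    last_end = -1
    sorted_ends = sorted(ends)
    for b in sorted(begins):
        if b <= last_end:
            continue  # this start falls inside the previous span
        for e in sorted_ends:
            if e >= b:
                spans.append((b, e))
                last_end = e
                break
    return spans
```

In this sketch a pattern such as `("word:mr", "sem:person")` gains support from both "Mr Smith" and "Mr Jones", which is the benefit of mixing hierarchy levels within one pattern. Detecting boundaries separately also means that a recognition error inside an entity does not necessarily prevent its start and end markers from being found.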

Our NER systems are freely available for download:

mXS
CasEN

More explanations: Wikipedia

Works and projects

PhD of Damien Nouvel (2012, co-supervised with Nathalie Friburger and Arnaud Soulet) - Text mining techniques for the detection of named entities.

EPAC project (2007-2010) - Detection of named entities in large corpora of conversational speech.
VARILING project (2007-2010) - Named entities tagging in the ESLO2 corpus (Enquête Sociolinguistique de l'Oral d'Orléans).
Ester 2 (2009) and ETAPE (2012) evaluation campaigns.
CO2 (2010) and ANCOR (2012) projects: linguistic study of co-reference chains in speech corpora.

Selection of publications

Damien NOUVEL, Jean-Yves ANTOINE (2014) Adapting Data Mining for German Named Entity Recognition, Proc. Konvens'2014 Conference, GermEval satellite workshop, Hildesheim, Germany, October 2014, 149-152 [HAL_01075678]

Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER (2014) Pattern-Mining for Named Entity Recognition. Lecture Notes in Computer Science / LNAI subseries, LNCS-LNAI 8387 (revised selected papers of LTC'2011 Conference), Springer, 226-237 [authors' version HAL_01076157]

Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Denis MAUREL (2012) Coupling Knowledge-Based and Data-Driven Systems for Named Entity Recognition, Proc. EACL'2012 Joint Workshop W4 Hybrid'12: Innovative Hybrid Approaches to Process Textual Data, Avignon, France, 69-77 [HAL-00788166]

Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Arnaud SOULET (2011) Recognizing Named Entities using Automatically Extracted Transduction Rules, Proc. LTC'2011, Language Technology Conference, Poznan, Poland, 136-140 [HAL-00664610]

Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Denis MAUREL (2010) An analysis of the performances of the CasEN named entities detection system in the Ester2 evaluation campaign. Proc. 9th European conference on Language Resources and Evaluation, LREC'2010, Valletta, Malta, May 2010 [HAL-00502370]

Jean-Yves ANTOINE, Abdenour MOKRANE, Nathalie FRIBURGER (2008) Automatic rich annotation of large corpus of conversational transcribed speech, Proc. 8th European conference on Language Resources and Evaluation, LREC'2008, Marrakesh, Morocco [LREC_2008-172] [HAL-00484046]