Jean-Yves ANTOINE


Named entities recognition and text mining

Presentation


The BDTLN team of the LI lab conducts research on dictionaries and shallow parsing. Within this general framework, I am currently working on the recognition of named entities. Named Entity Recognition (NER) is an information extraction (IE) task that aims at extracting and categorizing specific entities (proper names or multi-word expressions, but also dedicated linguistic units such as time expressions, amounts, etc.) that denote a single element in the universe of discourse.

While NER is often considered quite a simple task, there is still room for improvement when it is confronted with difficult contexts. For instance, NER systems may have to cope with noisy data such as word sequences containing errors produced by automatic speech recognition (ASR). In addition, NER is no longer restricted to proper names (e.g. "Albert Einstein"), but may also involve common nouns (e.g. "the famous scientist") or complex multi-word expressions (e.g. "the Computer Science department of the New York University"). These complementary needs for robust and detailed processing explain why knowledge-based and data-driven approaches remain equally competitive on NER tasks, as shown by numerous evaluation campaigns. This is why the development of hybrid systems has been investigated by the NER community for years. In most cases, hybridization for NER relies on a fairly simple principle: the outputs of knowledge-based systems are used as features by a machine learning algorithm, among which Conditional Random Fields (CRF) are the most widely used. In contrast, our team is currently investigating (PhD of Damien Nouvel, 2012) the hybridization of a symbolic NER system with a data-driven system based on text mining techniques.
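The usual hybridization principle mentioned above (knowledge-based outputs used as features by a learner such as a CRF) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny lexicon, the `rule_based_tags` toy tagger and the feature names are hypothetical and are not taken from CasEN or from any published system. The sketch only builds the per-token feature dictionaries that a sequence learner would consume.

```python
from typing import Dict, List

# Hypothetical mini-lexicon standing in for a real knowledge-based system.
FIRST_NAMES = {"Albert", "Marie"}


def rule_based_tags(tokens: List[str]) -> List[str]:
    """Toy symbolic tagger: a known first name followed by a
    capitalized word is tagged as a person (BIO scheme)."""
    tags = ["O"] * len(tokens)
    for i, tok in enumerate(tokens[:-1]):
        if tok in FIRST_NAMES and tokens[i + 1][:1].isupper():
            tags[i] = "B-PERS"
            tags[i + 1] = "I-PERS"
    return tags


def token_features(tokens: List[str], i: int, symbolic: List[str]) -> Dict[str, object]:
    """Features for token i: surface clues plus the symbolic system's decision."""
    return {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
        # Output of the knowledge-based system, exposed as plain features.
        "symbolic_tag": symbolic[i],
        "prev_symbolic_tag": symbolic[i - 1] if i > 0 else "BOS",
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dictionary per token, ready for a sequence learner."""
    symbolic = rule_based_tags(tokens)
    return [token_features(tokens, i, symbolic) for i in range(len(tokens))]


if __name__ == "__main__":
    for feats in sentence_features(["Albert", "Einstein", "was", "born"]):
        print(feats)
```

In this setting the learner remains free to follow or to override the symbolic decision, depending on how reliable that feature proved to be on the training data.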

Our team has developed CasEN, a knowledge-based NER system based on finite-state transducers, which took part in the French-language Ester 2 evaluation campaign. Despite its encouraging performance, manually extending the coverage of such a hand-crafted system is a difficult and tedious task. This is why we have investigated the use of text mining techniques to automatically extract sequential patterns correlated with NEs. Our first aim was to automatically obtain detection patterns that could be integrated into CasEN as transduction rules. We are also investigating a hybrid approach in which hand-crafted and automatically extracted detection rules are applied concurrently.

Our pattern mining approach is based on the principle of hierarchical sequence mining. This means that we mine generalized patterns whose items may be words, their lemmas, their POS categories or even their semantic types. The second original feature of our approach is that we try to detect the start and the end of every named entity separately. We expect this separate detection to be more robust to speech recognition errors and to speech disfluencies as well. This point was assessed during the ETAPE evaluation campaign. Our symbolic NER system CasEN won this competitive evaluation, in which our text-mining system mXs achieved encouraging results (ranked 3rd or 4th, depending on the task considered).
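The two ideas above (generalized pattern items taken from a word / lemma / POS hierarchy, and separate detection of entity starts and ends) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the mXs implementation: the token representation, the functions `find_boundaries` and `pair_boundaries`, the two example rules and the POS labels are all hypothetical, and the rules are written by hand here whereas the approach described above mines them from annotated data.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]   # assumed keys: "word", "lemma", "pos"
Item = Tuple[str, str]   # (level, value), e.g. ("pos", "PROPN")


def item_matches(item: Item, token: Token) -> bool:
    """A pattern item constrains one level of the hierarchy only."""
    level, value = item
    return token.get(level) == value


def find_boundaries(tokens: List[Token], left: List[Item], right: List[Item]) -> List[int]:
    """Return every position k (between tokens k-1 and k) whose left
    context matches `left` and whose right context matches `right`."""
    hits = []
    for k in range(len(left), len(tokens) - len(right) + 1):
        left_ok = all(item_matches(it, tokens[k - len(left) + j])
                      for j, it in enumerate(left))
        right_ok = all(item_matches(it, tokens[k + j])
                       for j, it in enumerate(right))
        if left_ok and right_ok:
            hits.append(k)
    return hits


def pair_boundaries(starts: List[int], ends: List[int]) -> List[Tuple[int, int]]:
    """Pair each detected start with the nearest later end (greedy choice)."""
    spans = []
    for s in starts:
        later = [e for e in ends if e > s]
        if later:
            spans.append((s, min(later)))
    return spans


if __name__ == "__main__":
    sentence = [
        {"word": "Mr", "lemma": "mister", "pos": "NOUN"},
        {"word": "Albert", "lemma": "Albert", "pos": "PROPN"},
        {"word": "Einstein", "lemma": "Einstein", "pos": "PROPN"},
        {"word": "said", "lemma": "say", "pos": "VERB"},
    ]
    # Start rule: lemma "mister" | <start> | proper noun
    starts = find_boundaries(sentence, [("lemma", "mister")], [("pos", "PROPN")])
    # End rule: proper noun | <end> | lemma "say"
    ends = find_boundaries(sentence, [("pos", "PROPN")], [("lemma", "say")])
    for s, e in pair_boundaries(starts, ends):
        print(" ".join(t["word"] for t in sentence[s:e]))
```

Because each boundary is detected from its own local context, an error inside the entity (for example a misrecognized word) does not prevent the start and end rules from firing, which is the intuition behind the expected robustness on speech transcripts.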

Our NER systems are freely available for download.

Further explanations: Wikipedia

Works and projects


  • EPAC project (2007-2010) - Detection of named entities in large corpora of conversational speech.
  • VARILING project (2007-2010) - Named entity tagging in the ESLO2 corpus (Enquête Sociolinguistique de l'Oral d'Orléans).
  • Ester 2 (2009) and ETAPE (2012) evaluation campaigns.
  • CO2 (2010) and ANCOR (2012) projects: linguistic study of co-reference chains in speech corpora.

Selection of publications
  • Damien NOUVEL, Jean-Yves ANTOINE (2014) Adapting Data Mining for German Named Entity Recognition, Proc. Konvens'2014 Conference, GermEval satellite workshop, Hildesheim, Germany, October 2014, 149-152 [HAL_01075678].
  • Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER (2014) Pattern-Mining for Named Entity Recognition. Lecture Notes in Computer Science / LNAI subseries, LNCS-LNAI 8387 (revised selected papers of LTC'2011 Conference), Springer, 226-237 [authors' version HAL_01076157].
  • Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Denis MAUREL (2012) Coupling Knowledge-Based and Data-Driven Systems for Named Entity Recognition, Proc. EACL’2012 Joint Workshop W4 Hybrid’12: Innovative Hybrid Approaches to Process Textual Data, Avignon, France, pp. 69-77 [HAL-00788166].
  • Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Arnaud SOULET (2011) Recognizing Named Entities using Automatically Extracted Transduction Rules, Proc. LTC’2011, Language Technology Conference, Poznan, Poland, 136-140 [HAL-00664610].
  • Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Denis MAUREL (2010) An analysis of the performances of the CasEN named entities detection system in the Ester2 evaluation campaign. Proc. 9th European conference on Language Resources and Evaluation, LREC’2010, Valletta, Malta, May 2010 [HAL-00502370].
  • Jean-Yves ANTOINE, Abdenour MOKRANE, Nathalie FRIBURGER (2008) Automatic rich annotation of large corpus of conversational transcribed speech, Proc. 8th European conference on Language Resources and Evaluation, LREC'2008, Marrakesh, Morocco [LREC_2008-172] [HAL-00484046].

    Jean-Yves ANTOINE - Last update: October 19th, 2014