 
Named entity recognition and text mining
Presentation
The BDTLN team of the LI lab conducts research on dictionaries and shallow parsing. Within this general framework, I am currently working on the recognition of named entities.
Named Entity Recognition (NER) is an information extraction (IE) task that aims at extracting and categorizing specific entities (proper names or multi-word expressions, but also dedicated linguistic units such as time expressions, amounts, etc.) that denote a single element in the universe of discourse.
While NER is often considered a fairly simple task, there is still room for improvement when it is confronted with difficult contexts. For instance, NER systems may have to cope with noisy data, such as word sequences containing errors produced by automatic speech recognition (ASR). In addition, NER is no longer restricted to proper names (e.g. "Albert Einstein"), but may also involve common nouns (e.g. "the famous scientist") or complex multi-word expressions (e.g. "the Computer Science department of the New York University").
These complementary needs for robust and detailed processing explain why knowledge-based and data-driven approaches remain equally competitive on NER tasks, as shown by numerous evaluation campaigns. This is why the development of hybrid systems has been investigated by the NER community for years. In most cases, hybridization for NER relies on a fairly simple principle: the outputs of knowledge-based systems are used as features by a machine learning algorithm, among which Conditional Random Fields (CRF) are the most widely used. By contrast, our team is currently investigating (PhD of Damien Nouvel, 2012) the hybridization of a symbolic NER system with a data-driven system based on text mining techniques.
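The usual hybridization principle mentioned above (knowledge-based outputs used as features for a learner) can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy gazetteer, the function names and the feature names are all hypothetical, and the resulting per-token feature dictionaries are simply in the shape that CRF toolkits commonly accept. It is not the code of any system described on this page.

```python
from typing import Dict, List

# Hypothetical toy gazetteer standing in for a real knowledge-based system.
GAZETTEER = {("albert", "einstein"): "PERS", ("new", "york"): "LOC"}


def knowledge_based_tags(tokens: List[str]) -> List[str]:
    """Assign BIO tags by exact lookup of gazetteer entries in the sentence."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for entry, label in GAZETTEER.items():
        n = len(entry)
        for i in range(len(tokens) - n + 1):
            if tuple(lowered[i:i + n]) == entry:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return tags


def token_features(tokens: List[str], kb_tags: List[str], i: int) -> Dict[str, object]:
    """Surface features plus the knowledge-based decision around position i."""
    return {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
        "kb_tag": kb_tags[i],  # output of the symbolic system
        "prev_kb_tag": kb_tags[i - 1] if i > 0 else "BOS",
        "next_kb_tag": kb_tags[i + 1] if i < len(tokens) - 1 else "EOS",
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dictionary per token, ready to feed to a sequence labeller."""
    kb_tags = knowledge_based_tags(tokens)
    return [token_features(tokens, kb_tags, i) for i in range(len(tokens))]
```

In this setting the learner remains free to override the symbolic decision, since `kb_tag` is just one feature among others.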
Our team has developed CasEN, a knowledge-based NER system based on finite-state transducers, which took part in the French-language Ester 2 evaluation campaign. Despite its encouraging performance, manually extending the coverage of such a hand-crafted system is a difficult and tedious task. This is why we have investigated the use of text mining techniques to automatically extract sequential patterns correlated with NEs. Our first aim was to automatically obtain detection patterns that would be integrated into CasEN as transduction rules. We are also investigating a hybrid approach in which hand-crafted and automatically extracted detection rules are applied concurrently.
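To give an idea of what "patterns correlated with NEs" can mean in practice, here is a deliberately simple sketch. It assumes a toy corpus in which a hypothetical `<pers>` marker is placed before each person entity, and it keeps the word n-grams that precede the marker both often enough (support) and reliably enough (confidence). The corpus, the marker name and the thresholds are assumptions made for illustration; this plain frequency filter is not the mining algorithm used by our systems.

```python
from collections import Counter
from typing import List, Tuple

START = "<pers>"  # hypothetical marker placed before each person entity

# Toy annotated corpus (assumption, for illustration only).
CORPUS = [
    "mister <pers> Dupont said hello".split(),
    "mister <pers> Martin left".split(),
    "the mister spoke".split(),
    "<pers> Durand met mister <pers> Petit".split(),
]


def mine_left_contexts(
    corpus: List[List[str]],
    n: int = 1,
    min_support: int = 2,
    min_confidence: float = 0.6,
) -> List[Tuple[Tuple[str, ...], int, float]]:
    """Return (n-gram, support, confidence) for n-grams that precede START."""
    total = Counter()   # occurrences of each n-gram in the plain word sequence
    before = Counter()  # occurrences immediately followed by the marker
    for sent in corpus:
        words = [t for t in sent if t != START]
        for i in range(len(words) - n + 1):
            total[tuple(words[i:i + n])] += 1
        for i, tok in enumerate(sent):
            if tok == START:
                left = [t for t in sent[:i] if t != START][-n:]
                if len(left) == n:
                    before[tuple(left)] += 1
    rules = []
    for gram, hits in before.items():
        confidence = hits / total[gram]
        if hits >= min_support and confidence >= min_confidence:
            rules.append((gram, hits, confidence))
    # Most reliable patterns first, then most frequent.
    return sorted(rules, key=lambda r: (-r[2], -r[1]))
```

On this toy corpus, "mister" occurs four times and precedes the marker three times, so `mine_left_contexts(CORPUS)` keeps it with a support of 3 and a confidence of 0.75. A pattern retained this way could then be rewritten by hand or automatically as a detection rule.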
Our pattern mining approach is based on the principle of hierarchical sequence mining. This means that we mine generalized patterns whose items may be words, their lemmas, their POS categories or even their semantic types. The second original feature of our approach is that we try to detect the start and the end of every named entity separately.
We expect this separate detection to be more robust to speech recognition errors and to speech disfluencies as well. This point was assessed during the ETAPE evaluation campaign. Our symbolic NER system CasEN won this competitive evaluation, while our text-mining system mXs obtained encouraging results (ranked 3rd or 4th, depending on the task considered).
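The two ideas described above, generalized patterns over a hierarchy of annotation levels and separate detection of entity starts and ends, can be illustrated with the following sketch. It rests on several assumptions: tokens are plain dictionaries with hypothetical `word`, `lemma`, `pos` and `sem` keys, a rule is a pair (left context, right context) around a boundary, labels at different levels never collide, and each start is paired with the nearest following end. None of these choices is claimed to match our actual implementation.

```python
from typing import Dict, List, Tuple

Token = Dict[str, str]  # hypothetical levels: word, lemma, pos, and optionally sem


def item_matches(item: str, token: Token) -> bool:
    """A pattern item matches a token if it equals any level of its hierarchy."""
    return item in token.values()


def find_boundaries(tokens: List[Token], left: List[str], right: List[str]) -> List[int]:
    """Positions p where `left` matches just before p and `right` matches from p on."""
    positions = []
    for p in range(len(left), len(tokens) - len(right) + 1):
        window_left = tokens[p - len(left):p]
        window_right = tokens[p:p + len(right)]
        if all(item_matches(i, t) for i, t in zip(left, window_left)) and all(
            item_matches(i, t) for i, t in zip(right, window_right)
        ):
            positions.append(p)
    return positions


def pair_boundaries(starts: List[int], ends: List[int]) -> List[Tuple[int, int]]:
    """Pair each start with the nearest following end; unpaired starts are dropped."""
    spans = []
    sorted_ends = sorted(ends)
    for s in sorted(starts):
        nearest = next((e for e in sorted_ends if e > s), None)
        if nearest is not None:
            spans.append((s, nearest))
    return spans


# Toy sentence: "mister Dupont said doctor Martin left" (annotations are assumptions).
TOKENS: List[Token] = [
    {"word": "mister", "lemma": "mister", "pos": "NOUN", "sem": "TITLE"},
    {"word": "Dupont", "lemma": "Dupont", "pos": "PROPN"},
    {"word": "said", "lemma": "say", "pos": "VERB"},
    {"word": "doctor", "lemma": "doctor", "pos": "NOUN", "sem": "TITLE"},
    {"word": "Martin", "lemma": "Martin", "pos": "PROPN"},
    {"word": "left", "lemma": "leave", "pos": "VERB"},
]

# Start rule mixes a semantic type and a POS tag; end rule uses POS tags only.
starts = find_boundaries(TOKENS, left=["TITLE"], right=["PROPN"])
ends = find_boundaries(TOKENS, left=["PROPN"], right=["VERB"])
spans = pair_boundaries(starts, ends)
entities = [[t["word"] for t in TOKENS[s:e]] for s, e in spans]
```

Because the start and end rules fire independently, an error in the middle of an entity (for instance a misrecognized word) does not prevent either boundary from being found, which is the intuition behind the expected robustness.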
Our NER systems can be downloaded freely.
Further explanations: Wikipedia
Work and projects
- EPAC project (2007-2010) - Detection of named entities in large corpora of conversational speech.
- VARILING project (2007-2010) - Named entity tagging in the ESLO2 corpus (Enquête Sociolinguistique de l'Oral d'Orléans).
- Ester 2 (2009) and ETAPE (2012) evaluation campaigns.
- CO2 (2010) and ANCOR (2012) projects: linguistic study of co-reference chains in speech corpora.
Selected publications
Damien NOUVEL, Jean-Yves ANTOINE (2014) Adapting Data Mining for German Named Entity Recognition, Proc. Konvens'2014 Conference, GermEval satellite workshop, Hildesheim, Germany, October 2014, 149-152 [HAL_01075678].
Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER (2014) Pattern-Mining for Named Entity Recognition. Lecture Notes in Computer Science / LNAI subseries, LNCS-LNAI 8387 (revised selected papers of LTC'2011 Conference), Springer, 226-237 [authors' version HAL_01076157].
Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Denis MAUREL (2012) Coupling Knowledge-Based and Data-Driven Systems for Named Entity Recognition, Proc. EACL'2012 Joint Workshop W4 Hybrid'12: Innovative Hybrid Approaches to Process Textual Data, Avignon, France, pp. 69-77 [HAL-00788166].
Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Arnaud SOULET (2011) Recognizing Named Entities using Automatically Extracted Transduction Rules, Proc. LTC'2011, Language Technology Conference, Poznan, Poland, 136-140 [HAL-00664610].
Damien NOUVEL, Jean-Yves ANTOINE, Nathalie FRIBURGER, Denis MAUREL (2010) An analysis of the performances of the CasEN named entities detection system in the Ester2 evaluation campaign. Proc. 9th European conference on Language Resources and Evaluation, LREC'2010, Valletta, Malta, May 2010 [HAL-00502370].
Jean-Yves ANTOINE, Abdenour MOKRANE, Nathalie FRIBURGER (2008) Automatic rich annotation of large corpus of conversational transcribed speech, Proc. 8th European conference on Language Resources and Evaluation, LREC'2008, Marrakesh, Morocco [LREC_2008-172] [HAL-00484046].
Jean-Yves ANTOINE - Last update: October 19th, 2014