NLTK tag and tag_sents give different results

posted on 2024-11-07 20:00


I want to use the nltk StanfordNERTagger to clean up a list of names (e.g. it contains organizations that I want to remove), and I stumbled on a weird issue. The tags assigned to one sentence seem to depend on which other sentences are passed in along with it, which isn't very intuitive.

Here is how to reproduce:

from nltk.tag import StanfordNERTagger

# The model and jar paths are placeholders.
tagger = StanfordNERTagger(
    '/path/to/english.all.3class.distsim.crf.ser.gz',
    '/path/to/stanford-ner-2017-06-09/stanford-ner.jar',
    encoding='utf-8')
things_to_tag = ["Star Trek".split(),
                 "Star Jones".split(),
                 "Star Wars".split()]

# tagging all sentences in one call using tag_sents
print(tagger.tag_sents(things_to_tag))

# tagging each sentence separately using tag
for t in things_to_tag:
    print(tagger.tag(t))

Output:

[[(u'Star', u'ORGANIZATION'), (u'Trek', u'ORGANIZATION')],
[(u'Star', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION')],
[(u'Star', u'ORGANIZATION'), (u'Wars', u'ORGANIZATION')]]

[(u'Star', u'O'), (u'Trek', u'O')]
[(u'Star', u'PERSON'), (u'Jones', u'PERSON')]
[(u'Star', u'O'), (u'Wars', u'O')]

I also tried removing Star Wars from the list, and the results change again: 'Trek' becomes PERSON and 'Star' becomes O.

I looked into nltk/tag/stanford.py and it's not really clear why this would happen. I was hoping someone could lend a hand in identifying what the issue might be, or at least confirm I'm not the only one seeing this.

NLTK version 3.2.5, Python version 2.7.13
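
To make the impact on the stated goal concrete, here is a minimal sketch of the filtering step, fed with the two outputs shown above. The helper name drop_organizations is hypothetical, and the rule it applies (drop an entry if any token is tagged ORGANIZATION) is only one possible reading of "remove organizations".

```python
def drop_organizations(tagged_sents):
    """Keep only the entries in which no token is tagged ORGANIZATION."""
    return [sent for sent in tagged_sents
            if not any(tag == 'ORGANIZATION' for _, tag in sent)]


def as_names(tagged_sents):
    """Join the tokens of each tagged entry back into one string."""
    return [' '.join(word for word, _ in sent) for sent in tagged_sents]


# Tags copied from the tag_sents output above.
from_tag_sents = [
    [('Star', 'ORGANIZATION'), ('Trek', 'ORGANIZATION')],
    [('Star', 'ORGANIZATION'), ('Jones', 'ORGANIZATION')],
    [('Star', 'ORGANIZATION'), ('Wars', 'ORGANIZATION')],
]

# Tags copied from the per-sentence tag output above.
from_tag = [
    [('Star', 'O'), ('Trek', 'O')],
    [('Star', 'PERSON'), ('Jones', 'PERSON')],
    [('Star', 'O'), ('Wars', 'O')],
]

kept_from_tag_sents = as_names(drop_organizations(from_tag_sents))
kept_from_tag = as_names(drop_organizations(from_tag))
print(kept_from_tag_sents)
print(kept_from_tag)
```

With the tag_sents tags every entry is dropped, while with the per-sentence tags every entry is kept, so the discrepancy directly changes which names survive the cleanup.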


solution


OK, so it comes down to the tokenizeNLs tokenizer option. If you leave it as false, the whole input is treated as one giant string, which means the predicted tags for each sentence now depend on everything else in that string. In my view, this is wrong. Changing the option to true and removing the surrounding quotes gives me the desired output.

To be extra clear, modify: '\"tokenizeNLs=false\"' --> 'tokenizeNLs=true'
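
As a rough illustration of that substitution, the sketch below applies it to a list of command-line arguments. The helper name enable_newline_tokens and the example argument list are assumptions made for illustration; the list that NLTK actually builds may contain different or additional arguments, so treat this as a description of the edit, not as NLTK's own code.

```python
def enable_newline_tokens(cmd):
    """Return a copy of an argument list with tokenizeNLs switched to true.

    The quoted form '"tokenizeNLs=false"' is the string named in the
    answer; the unquoted form is accepted as well.
    """
    old_values = ('"tokenizeNLs=false"', 'tokenizeNLs=false')
    return ['tokenizeNLs=true' if arg in old_values else arg for arg in cmd]


# Illustrative argument list only (an assumption, not copied from NLTK).
example_cmd = [
    'edu.stanford.nlp.ie.crf.CRFClassifier',
    '-tokenizerOptions',
    '"tokenizeNLs=false"',
]

patched_cmd = enable_newline_tokens(example_cmd)
print(patched_cmd)
```

If editing the installed library is not an option, calling tag once per sentence, as in the loop from the question, also keeps each sentence independent, since each call only ever sees one sentence.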

Author:qs

link:http://www.pythonblackhole.com/blog/article/246851/fd47096ee66269c30e83/

source:python black hole net