Course: Information Retrieval and Extraction (CSE474), Spring ’16

Professor: Vasudeva Verma

Mentor: Priyanka Bajaj

Team Members:

Project Links to artifacts

Project Description:

Task: Named entity recognition (NER) is one of the first steps in most information extraction (IE) pipelines. The diverse and noisy style of user-generated social media text presents serious challenges, however: performance still lags far behind that on formal text genres such as newswire. The goal of this shared evaluation is to promote research on NER in noisy text.

The dataset and baseline code were provided by the organizers of NER-Shared-Task.

Overview of baseline code

The baseline code consists of:

  1. Featurizer.py: Python code that generates features in the format required by the crfsuite module.

  2. Baseline.sh: runs feature generation, crfsuite model training, the model dump, and tagging on the test data.

  3. Connlleval.pl: Perl script that evaluates the tagging output and calculates accuracy, precision, recall, and F1.
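To illustrate the feature-generation step, here is a minimal sketch of turning one tokenized tweet into crfsuite's tab-separated training format (one token per line with the label first, and a blank line ending the sequence). The function names and the feature set are hypothetical and chosen only for illustration; they are not taken from Featurizer.py.

```python
def escape(value):
    """Escape backslashes and colons, which crfsuite treats specially in attribute names."""
    return value.replace("\\", "\\\\").replace(":", "\\:")


def token_features(tokens, i):
    """Return illustrative (hypothetical) feature strings for tokens[i]."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return [
        "word.lower=" + escape(word.lower()),
        "is_title=" + str(word.istitle()),
        "is_upper=" + str(word.isupper()),
        "has_digit=" + str(any(c.isdigit() for c in word)),
        "suffix3=" + escape(word[-3:].lower()),
        "prev.lower=" + escape(prev_word),
        "next.lower=" + escape(next_word),
    ]


def to_crfsuite(tokens, labels):
    """Format one sentence as crfsuite input: label<TAB>feat<TAB>feat..., then a blank line."""
    rows = []
    for i, label in enumerate(labels):
        rows.append("\t".join([label] + token_features(tokens, i)))
    return "\n".join(rows) + "\n\n"


if __name__ == "__main__":
    print(to_crfsuite(["Justin", "Bieber", "rocks"], ["B-person", "I-person", "O"]), end="")
```

Each output row starts with the BIO label, followed by the features for that token. Lexicon lookups (for example, whether a token appears in a list of known company names) would be added as further feature strings in the same way.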

Modifications Done

  1. Code changes

  2. Lexicon changes

Output

Baseline output (the last number on each per-type line is the count of phrases predicted for that type)

  
processed 11570 tokens with 356 phrases; found: 244 phrases; correct: 128.
accuracy: 96.07%; precision: 52.46%; recall: 35.96%; FB1: 42.67
company: precision: 72.41%; recall: 51.22%; FB1: 60.00 29
facility: precision: 40.00%; recall: 30.00%; FB1: 34.29 15
geo-loc: precision: 64.44%; recall: 50.00%; FB1: 56.31 45
movie: precision: 11.11%; recall: 33.33%; FB1: 16.67 9
musicartist: precision: 16.67%; recall: 8.33%; FB1: 11.11 6
other: precision: 35.00%; recall: 11.48%; FB1: 17.28 20
person: precision: 60.44%; recall: 47.01%; FB1: 52.88 91
product: precision: 26.67%; recall: 22.22%; FB1: 24.24 15
sportsteam: precision: 25.00%; recall: 16.67%; FB1: 20.00 12
tvshow: precision: 50.00%; recall: 12.50%; FB1: 20.00 2

Build output (with our modifications)

processed 11570 tokens with 356 phrases; found: 261 phrases; correct: 157.
accuracy: 96.57%; precision: 60.15%; recall: 44.10%; FB1: 50.89
company: precision: 70.97%; recall: 53.66%; FB1: 61.11 31
facility: precision: 50.00%; recall: 30.00%; FB1: 37.50 12
geo-loc: precision: 68.09%; recall: 55.17%; FB1: 60.95 47
movie: precision: 33.33%; recall: 33.33%; FB1: 33.33 3
musicartist: precision: 12.50%; recall: 8.33%; FB1: 10.00 8
other: precision: 42.42%; recall: 22.95%; FB1: 29.79 33
person: precision: 68.27%; recall: 60.68%; FB1: 64.25 104
product: precision: 62.50%; recall: 27.78%; FB1: 38.46 8
sportsteam: precision: 40.00%; recall: 22.22%; FB1: 28.57 10
tvshow: precision: 20.00%; recall: 12.50%; FB1: 15.38 5
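The overall scores in both runs follow directly from the phrase counts on the first line of each output. As a quick check, the sketch below recomputes precision, recall, and FB1 (the F-measure with beta = 1) from the gold, found, and correct phrase counts reported above. The function name `prf` is our own and is not part of the evaluation script.

```python
def prf(total_gold, found, correct):
    """Phrase-level precision, recall and F1 (all in percent) from raw counts."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / total_gold if total_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (gold phrases, found phrases, correct phrases) taken from the outputs above
runs = {
    "baseline": (356, 244, 128),
    "build": (356, 261, 157),
}

for name, counts in runs.items():
    p, r, f = prf(*counts)
    print(f"{name}: precision: {p:.2f}%; recall: {r:.2f}%; FB1: {f:.2f}")
```

Running this reproduces the reported overall figures: FB1 rises from 42.67 for the baseline to 50.89 for the build, a gain of about 8.2 points, driven by 29 more correct phrases (157 vs. 128) at the cost of only 17 more predictions (261 vs. 244).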