Final Report - Complete

May 15, 2003
Simson L. Garfinkel
Guido Zarrella
Howard Chou

Introduction

Electronic mail is a popular method of communication in both academic and industrial settings.  One common email format is an announcement.  People receive dozens of announcement emails every day about activities taking place around them, including seminars, conferences, and parties.  There is a wealth of information in each announcement about a specific activity, and it can become tedious and time consuming to manually extract the useful and relevant information from the email.  People can spend a significant amount of time everyday reading announcements, deciding if they have already received the announcement before, and updating their PDA or paper calendar with the relevant information.

The goal of the project is to explore and compare a number of techniques for automatically extracting relevant information from announcement emails. The ultimate objective is to use the lessons learned to create a system that can perform the task with high recall and precision, and to present the information in an organized fashion on an electronic calendar.  Therefore, users no longer have to spend time reading through dozens of announcements in order to find out what activities are taking place around them.  Instead, all of that information is summarized and conveniently sorted for the user in the calendar.

We decided to focus on seminar announcements because they are one of the most common types of emails received on a daily basis by members of the MIT community. Seminars are also a great source of food. However, our system can easily be applied to handle other types of announcement emails as well. 

We implemented and compared three different techniques for analyzing seminar announcements:

*         Using Flex1 and hand-crafted regular expressions to create a pattern matching program that takes advantage of the structure common to most announcement emails. 
*         Using the Brill tagger2 and handmade grammars for the SCOL3 skipping parser to match part-of-speech patterns.  
*         Using trained machine learning algorithms to autonomously weigh and combine many different types of features present in the messages.

Various attempts have already been made at trying to solve the problem of automatically extracting information from announcement emails, and most techniques take advantage of the regular format typical of most announcement emails.  Cheong4 discusses knowledge-based engineering approaches, such as recognizing domain-relevant patterns and relationships amongst grammatical constituents in the text and parsing with finite state transducers (FST).  Freitag and McCallum5 apply a machine learning approach to information extraction of announcement emails and compare the performance of a grown Hidden Markov Model with stochastic optimizations to the performance of static HMM models. Several approaches merge knowledge-based engineering with machine learning to increase recall while maintaining high precision.  Almgren and Berglund6 describe a system focused on parsing time and location fields from announcements by using a pattern matcher together with a Hidden Markov Model.  (LP)2, described by Ciravegna7 combines machine learning with shallow NLP to create an algorithm for parsing announcement emails, by training the system on a corpus tagged with SGML tags and using shallow NLP to generalize the rules for higher recall. Freitag and McCallum were also the creators of the tagged CMU corpus used in our project.


Semantic Understanding with Very Large Regular Expressions

Semantic Understanding with Very Large Regular Expressions

 

Semantic Understanding with Very Large Regular Expressions (SLVR) is based on the observations that information is commonly presented with common typographical or lexical idioms. For example, many talks at universities are structured as a word, a colon, and a brief explanation, as shown in Example 1. Other times, there is implied structure from the typographical layout, as shown in Example 2. Some announcements are quite unstructured, as shown in Example 3.

 

********************************REMINDER***********************************

THESIS DEFENSE TODAY AT 10:00AM!

*****************************************************************************

Hello,

Please join us on Monday, May 5, 2003 at 10:00am in NE43-941 for Paul Fitzpatrick's thesis defense.  For your reference, the details are as follows:

 

DATE:         Monday, May 5, 2003

TIME:         10:00am

LOCATION: NE43-941

SPEAKER: Paul Fitzpatrick

WHAT:         Ph.D. Thesis Defense

 

Example 1: A structured talk announcement

 

 

 

 

****************** H C I  S E M I N A R  S E R I E S ******************

******************       5/9  1:30PM  NE43-941       ******************

 

Mobile Devices for Control

 

Brad Myers

Human Computer Interaction Institute, CMU

 

May 9, 2003

1:30PM (refreshments at 1:15PM)

NE43-941

 

Abstract:

 

With today's wireless technologies, such as BlueTooth and IEEE 802.11,

connecting handheld computers and conventional computers together are

 

Example 2: A semi-structured talk announcement

 

 

 

 

<0.09.11.93.22.01.32.Mark_Kantrowitz@GLINDA.OZ.CS.CMU.EDU.0>

Type:     cmu.andrew.org.epp

Topic:    Rayman talk: Competition vs. Community  11/10/93 4:30 Benedum Hall

Dates:    09-Nov-93

Time:     4:30

PostedBy: Mark Kantrowitz on 09-Nov-93 at 22:01 from GLINDA.OZ.CS.CMU.EDU

Abstract:

 

------- Blind-Carbon-Copy

 

 To: gso+@ANDREW.CMU.EDU

 Subject: Rayman talk: Competition vs. Community  11/10/93 4:30 Benedum Hall

 Date: Tue, 09 Nov 93 22:01:32 EST

 Message-ID: <6738.752900492@GLINDA.OZ.CS.CMU.EDU>

 From: Mark Kantrowitz <Mark_Kantrowitz@GLINDA.OZ.CS.CMU.EDU>

 

On Wednesday, November 10 at 4:30 p.m. Paula Rayman, Associate Professor

of Sociology and Director, Pathways for Women in Science Project at

Wellesley College will speak at 1175 Benedum Hall, University of

Pittsburgh.  Her topic will be "Competition Vs. Community: Building

Equity in the Sciences, Math & Engineering." This is the third

event of the CMU/Pitt Women's Speakers Series.

 

 Benedum Hall is located on O'Hara Street off of Bouquet near the Pitt

Stadium. Take Forbes to Bouquet, turn right and walk up the hill to

O'Hare.

 

Pathways is a longitudinal study about women in science and math: what

attracts them and keeps them in these fields at critical stages in their

lives.  According to Rayman, four factors keep undergraduates engaged in

 science: parents, mentors, hands-on experience research, and career

 advice.  In graduate school, the importance of mentors makes the

 difference.

 

For further information and to RSVP, contact Kathleen Minadeo Johnson,

x8-7970 or kj26.

 

- ------- End of Forwarded Message

 

------- End of Blind-Carbon-Copy

Example 3: An unstructured talk announcement

cmu.andrew.org.epp-344-0.txt [tagdemo]

 

 

 

We believe that once these idioms are introduced into a community, they are picked up and repeated by others within the community. As they become more widely used, we believe that people learn how to rapidly recognize them and extract information from them. As an idiom becomes more widely used still, there is an advantage to community members to use the idioms further. This process explains how typographical conventions are adopted.

 

Because many typographical conventions appear to be arbitrary, we believe that it is difficult to anticipate or enumerate them from first principles. To use the examples from the previous paragraph, there isn’t any real reason why academics should title some talks with a word followed by a colon and a brief explanation, but some academics starting doing so and the practice caught on. Likewise, there is no obvious reason why other academics should put the titles of their talks in quotes, or follow them with a line of delimiters. Whatever the reason, these typographical conventions can be manually enumerated by consulting examples, rules can be written, and then those rules can successfully find examples of the conventions in a corpus.

 

Regular expressions can also be used for extracting information from otherwise unstructured textual descriptions. For example, the announcement for a talk may not contain any obvious title or subtitle, but the announcement may include a sentence which reads ‘the topic of his talk will be “Semantic Understanding with Very Large Regular Expressions.”’ In our experience, such idioms are surprisingly common --- especially in the University environment.

 

At first glance, it may seem that the process of creating very large regular expressions for semantic analysis would be a never-ending task. In our experience, a relatively small number of rules can handle an extraordinarily large number of cases. And as new idioms emerge, it is a simple matter to create new regular expressions and add them to the list.

 

Overall, we are very pleased with the ease of constructing VLREs and impressed by their power. Frequently there are many different VLREs that can extract information from a given talk announcement. If one VLRE fails to match because of a typo or configuration error, another VLRE can compensate.

 

Tagging Talk Announcements with VLRE

Our procedure for tagging talk announcements follows these steps:

 

The talk announcement is processed by a perl script called “filter.” This script:

1.      Reads every line of the announcement.

2.      Removes every MIME type that is not text/plain.

3.      Searches the text lines for all lines that have prefixes or suffixes that commonly arise from forwarding email messages. For example, all lines that are prefixed with a greater-than sign have the greater than sign removed. This has the side-effect of eliminating many instances in which special characters are used for typographical decoration.

4.      Finds all blocks of text that have the same amount of indentation and removes that indentation.

5.      Flows all multi-line paragraphs into a single line of text.

6.      Sends the resulting “filtered” message to standard output.

 

After being filtered, it is processed with the VLRE tagger. The tagger reads each line of its input and makes multiple passes with different very large regular expressions created with the “flex” lexical generator system. After lexical processing, a series of

 

In order, these expressions:

1.      Recognize typographical conventions, including lines that end with a colon, lines that end with a question mark, lines that begin or end with quotes, lines of delimiters, and blank lines. (parse_typography.flex)

2.      Recognize the speaker of a talk. Speakers are recognized by RFC822-style headers that say “Speaker:,” by sentences by say so-and-so “will be speaking” or “will be presenting” on a certain topic, and so forth.  (parse_speaker.flex)

3.      Recognize the topic of a talk, using heuristics that are similar to those that are used to recognize a speaker. (parse_topic.flex)

4.      Recognize other common items from announcements:

a.       rooms and buildings in which the talk is likely to be given, using a set of regular expressions that match the conventions of room numbers and the popular buildings at MIT and CMU. (parse_announce.flex and parse_iannounce.flex)

b.      words that are commonly used in academic talk announcements, such as the names of universities, the names of industrial research labs, and words commonly used in job titles. (parse_announce.flex and parse_iannounce.flex)

5.      Recognize dates and times. (parse_time.flex and parse_month.flex)

 

As indicated above, each of these regular expression systems is created using flex, a system for creating lexical analyzers. A special flex preprocessor was used so that each set of regular expressions could use a common set of flex macros. One commonly used macro is {CAPWORDS} which matches a set of capitalized words; {CAPWORDS} also matches on words that are not capitalized but which commonly appear in titles without capitalization, such as the words “for,” “of” and “not.” A complete list of flex macros used in this project is in the file flex-macros and appears in the table below.

 

A significant number of macros were created for working with times. Because times follow fairly well-determined rules, it is possible to create a regular expression that matches on valid times such as 3pm, 3:30pm, 1530 and 15:30, but do not match on invalid times such as 15:30pm or 115:30. We believe that the amount of attention that has been applied to these regular expressions is partially responsible for the success that VLRE has with locating times.

                                                                                                                                                     

 

 

Macro

Purpose

S

Whitespace

SP

Mandatory white space

NOTW

Not white space

END

End of a word

P

Punctuation

H

Alphanumeric

 

 

QUOTED_STRING

A quoted string

JAN

Jan, Jan., or January

FEB .. DEC

Other calendar months

DAY

The word “DAY”

MON

Mon or Monday

TUE .. SUN

Other calendar days

NAMED_DAY

Any named day of the week

 

 

NDAY

Any numeric day of the month

SEC

Any numeric second

MIN

Any numeric minute

HOUR   

An hour in 12-hour format

HOUR24   

An hour in 24-hour format

NMONTH

A numeric month

AMONTH

An alphabetic month

 

 

YEAR

A year

YEAR4

A 4-digit year

NUM

Any number

NOTNUM

Not a number

 

 

TIMECHARS

Any character that tends to be used in a time representation

AM

 

PM

 

 

 

TIME

A time

TIME_RANGE

A time range

TIME_OR_RANGE

Either a time or a time range

 

 

DATE1

Number-month-year

DATE2

Month Number Year

DATE

Either date1 or date2

 

 

TALK

(talk|presentation)

DOMAIN

Common top-level Internet domains

 

 

ORDINAL

1st, 2nd, 3rd, 4th, and so on.

 

 

PRO

Possessive pronouns.

PREPOSITIONS

Common English prepositions

PERSON_TITLES

Typical titles that a person might have

PERSON_DEGREES

Typical degrees that a person might have.

CAPWORD0

A capitalized word at the beginning of a capitalized word group

CAPWORD

A capitalized word

CAPWORDSP

A capitalized word with a space

CAPWORDS

A string of capitalized words; this regular expression understands that some words such as “and” and “but” can appear without being capitalized.

 

 

NAMED_ROOMS

Common rooms at MIT; “Kresge” and “AI Lab Playroom.”

 

 

NAMED_FLOOR

Matched “5th Floor”

ROOM3NUMBER

A 3-digit room number

ROOM4NUMBER

A 4-digit room number

ROOM_NN_NUMBER

Either 3 or 4 digit room number

ROOMNUMBER

Any room number

 

 

MIT_MAIN_BUILDINGS

All of the numbered MIT buildings on main campus

MIT_E_BUILDINGS

All of the MIT buildings beginning with E.

MIT_NE_BUILDINGS

All of the MIT buildings beginning with NE

MIT_N_BUILDINGS

All of the MIT buildings beginning with N

MIT_NW_BUILDINGS

All of the MIT buildings beginning with NW

MIT_W_BUILDINGS

All of the MIT buildings beginning with W

 

 

OBVIOUS_MIT_BUILDINGS

Any MIT building in standard form

MIT_SPECIAL_ROOMS

Special MIT buildings, like “Kresge

MIT_BUILDING_ROOM

Any MIT building and a room number

 

 

 

 

BUILDING

Building names common at other schools

 

 

SCHOOL_YEAR

Matches all-caps and a year, like MIT ‘84

ROOM_TYPE

Matches types of rooms, like Lounge, Conference Room, etc.

Common regular expressions in flex-macros

 

When a regular expression matches, it can either set a numeric flag. A list of numeric flags appears below. Alternatively, a regular expression can also register its matching value under a property name. This is called a “found” property. Each regular expression that creates a found property records the matching text, the name of the regular expression that matched, the line and offset that matched, and the rule’s “shurity.” The shurity of the rule is a percentage that reflects the likelihood that the rule actually found what it was supposed to have found. Currently these percentages are estimated by the person who created the rules, but at some future point in time they might be determined by a naive Bayesian analysis or a genetic algorithm.  A list of found properties appears below as well.

 

Flags

Meaning

fIGNORE

Ignore this match

fSOMETHING

We found something

fDELIM

A line of delimiters

fBLANK_LINE

A blank ilne

fTEXT

A line with text

fABSTRACT

A line with the word “Abstract”

fPARENS

A line surrounded by parenthesis.

fORGANIZATION

A word that tends to be in organizational names, like University.

fFEATURING

Words like “Featuring,” “Starring,” etc.

fTIME

Something that looks like a time

fDATE

Something that looks like a date

fOR

The word “or” on a line by itself

fEND_WITH_COLON

A line that ends with a colon

fEND_WITH_QUESTION

A line that ends with a question mark

fBEGIN_WITH_QUOTES

A line that beings with a quote mark

fEND_WITH_QUOTES

A line that ends with quotes.

Paragraph flags  in found.h.

 

SPEAKER

Who is giving the talk

SPEAKER_TITLE

The speaker’s title

SPEAKER_AFFILIATION

Where the speaker works

TALK_TOPIC

What the talk is about

TALK_TIME

What time the talk is given; might be a range.

TALK_TIME_START

When the talk starts.

TALK_TIME_END

When the talk ends

TALK_DATE

The date of the talk

TALK_ABSTRACT

What the talk is about

TALK_ROOM

The room that the talk is being given in

TALK_BUILDING

The building where the talk is being given

TALK_ROOM_BUILDING

Both the talk room and the building; this might be matched all at the same time.

RFC822_DATE

If the talk was announced in an email message, this it the Date: field of that message. Useful for determining the year in which a talk will be given if the message failed to mention this important fact.

TALK_CONTACT

Self-explanatory.

REFRESHMENTS_TIME

When is the food going to be served?

REFRESHMENTS_ROOM

Where will the food be? Might be different.

HOST

Who is hosting the talk.

HOST_AFFILIATION

Where is the host from?

RSVP_DATE

Some talks require an RSVP by a particular date.

Properties that can be found in a talk; declared in found.h.

 

 

After the initial parse with large regular expressions, a series of “short paragraph rules” are applied to all paragraphs that are “short.” Each rule operates by passing a window over the lines in the announcement, looking for a group of lines that are “short” (different rules have different definitions of short), and then looking for lines that match a certain pattern of text, recognized regular expressions, and recognized words. If a rule fires,  found” objects are created for each line in the window. The rule specifies what is found and the shurity of each finding.

 

There are 17 short paragraph rules in the production system. These rules are typically used to code for multi-line typographical conventions, since the flex-generated regular expressions are only used to process a line at a time. A typical short paragraph rule is that a line that is less than 80 characters long followed by a line of delimiters is 50% likely to be the topic of the talk. Another rule is that a line followed by a line in parenthesis that is less than 300 characters long is likely to be the topic of the talk. Yet another is that if there are three lines less than 60 characters and the second line contains a word that is commonly seen in job titles (like “professor” or “assistant”) and the third line contains a word that is commonly seen in the name of an organization (like “university” or “laboratory”) then the first line is likely to be the speaker, the second line is likely to be the speaker’s title, and the third line is likely to be the speaker’s affiliation.

 

A list of the short-paragraph rules appears below:

 

TITLE -DELIM-

If a line is preceeded by a blank line and comes before a line of delimiters, then it is the name of the talk.

 

 

TITLE (SUBTITLE) 

A line preceeded by a blank line followed by a name in parens is probably the title of the talk.

 

 

TITLE: SUBTITLE   

A line ending with a colon followed by another line is the title of the talk

 

 

TITLE: SUBTITLE SUBTITLE  

A line ending with a colon followed by another line is the title of the talk

 

 

TITLE? SUBTITLE

A line ending with a question mark followed by another line is the title of the talk

 

 

``TITLE''  

A line in quotes surrounded by blank line is probably the title

 

 

``TITLE---REST_OF_TITLE''

Two lines with quotes

 

 

TITLE -OR- SUBTITLE

a tricky title with an -or-

 

 

ABSTRACT   

If a line says abstract and it is followed by a paragraph, we have found the abstract!

 

 

SPEAKER-TITLE 

A line followed with a line that has a title is probably the name of somebody and the following line is probably that person's title.

 

 

KNOWNSPEAKER-ORGANIZATION

If we don't have a speaker_affiliation, and a line is found in the message that has an organization, and this line follows a line that contains the speaker’s name, and this line is shorter than 40 characters, then this line must be the affiliation

 

 

SPEAKER-ORGANIZATION 

If we don't have a speaker_affiliation, and a line is found in the message that has an organization, and this line follows a line that contains the speaker’s name, and this line is shorter than 40 characters, then this line must be the affiliation

 

 

INTRODUCING-SPEAKER-ORGANIZATION 

If we don't have a speaker_affiliation, and a line is found in the message that has an organization, and this line follows a line that contains the speaker’s name, and this line is shorter than 40 characters, then this line must be the affiliation

 

 

SPEAKER-TITLE-ORGANIZATION

If we don't have a name of a speaker, but we can find a block of three doc.lines with no tag and one ORG and one TITLE line, then almost certainly the first line is the person, the second line is their title, and the third line is their organization

 

 

SPEAKER-ORGANIZATION-ORGANIZATION

If we don't have a name of a speaker, but we can find a block of three doc.lines with no tag and two ORGANIZATION tags then almost certainly the first line is the person, the second line is their title, and the third line is their organization

 

 

SPEAKER-TITLE-ORGANIZATION-ORGANIZATION    

If we don't have a name of a speaker, but we can find a block of three doc.lines with no tag and two ORGANIZATION tags then almost certainly the first line is the person, the second line is their title, and the third line is their organization

 

 

JUNK:-QUOTEDTITLE:-SUBTITLE-ENDQUOTE

If we have a line with a colon, followed by a line that begins with a quote and has a colon, followed by a line that ends the quote then the second and third lines are the title

 

 

ANYNAME-TITLE   

A line followed with a line that has a title is probably the name of somebody and the following line is probably that person's title.

 

 

TITLE (bare)

If we haven't found a topic, just grab the first line surrounded by blanks that's unidentified

 

 

LAST-DITCH-SPEAKER

Find a line with 2 or 3 words followed by a line with an organizational name

short paragraph rules from short.cpp; each is implemented with the C++ class rule.cpp.

After all of the single-line rules and short paragraph rules run, the tagger evaluates all of the “found” objects. For each recognized tag (e.g. “speaker,” “speaker_affiliation,” “talk_time”) the found object with the largest shurity value is chosen and written to standard output.

The final VLRE step is time processing. Two regular expression systems are used. The first, parse_time, can recognize roughly 25 different time formats. When a time format is matched, the appropriate fields in a unix struct tm structure are filled in. The parse_time system relies on parse_month to turn alphabetic months into numeric months. Special attention needs to be paid to boundary cases such as “12:15pm,” which is 12 hours after midnight, while “1:15pm” is 13 hours after midnight.

A separate entry point in the parse_time system can be used to parse time ranges. These are ranges such as “5:00 to 7:00”, “5:00-7:00pm”, and “10:00 until 3pm”. The parse_range system looks for a range delimiter and attempts to apply the parse_time system to either side. If a match is found, the following rules are evoked:

·        If neither AM nor PM is specified on the first part of the time range, and the second part of the time range has PM specified, and the hour in the first part is lower than the hour in the second part, then increase the hour for the first part of the time range by 12. This correctly interprets “3:00-4:00pm” to mean “15:00-16:00” without misinterpreting  11:00-12:30pm

·        If the time range ends up with the first part being later in time than the second part, then the range is invalid.

As part of the time processing, any discovered tag that has the word “time” in it is checked to see if is actually a time range. If the tag is a time range, it is split into two tags --- one for the start time, one for the end time. These derived tags are given the same shurity value as the original tag.

Finally, the highest-rated tags are printed. If the “-a” option is given to the tagger, then all of the found tags are printed. This output mode might be useful as further input to a neural network.

 

 

 


1

Knowledge-based engineering (KE) and NLP to parse seminar announcements

Seminar announcements have a unique topological structure that allows information to be conveyed quickly to the reader.  The header of a typical seminar announcement usually contains information about the talk, such as its type, the speaker, the host, the speaker's affiliation with the host, the contact, the time, the place, and the topic.

It is a difficult task to use KE with NLP to extract useful information from announcements, because the goal is to parse small fragments of ungrammatical phrases, not to parse sentences with defined grammatical structures.  Therefore, grammars cannot be written based on relationships between words with different parts of speech; instead, grammars have to be written based on keywords and symbols in the announcement and the topological structure the information is presented in.

The two types of NLP techniques used to extract relevant information from announcements are part of speech tagging (POS) and parsing.  POS is accomplished with the Brill tagger and parsing with the Cass partial parser.  The information that the system tries to extract from an announcement are the following: date, time, room, type of talk, speaker, speaker affiliation, host, topic, and contact information.  Contact information can appear in the form of URLs or emails.

The order in which an announcement is processed is the following:

  1. format announcement with a filter program to remove MIME and spacer2
  2. tag announcement with Brill
  3. format output from Brill with spacer2 and debrill
  4. use tagfixes
  5. parse with Cass
  6. extract information with Parser.java

The different components - spacer2, Brill, debrill, tagfixes, Cass, and Parser.java are described in detail below.

POS tagging with Brill

Seminar announcements are tagged with the Brill tagger.  Before feeding the announcement into Brill, the input is formatted with spacer2 to meet Brill's textual requirement concerning spaces in the text.  Tagging the announcement is necessary in order to parse the announcement with grammars.  Figure 1 shows an untagged announcement, and Figure 2 shows the corresponding announcement after tagging with Brill.

Topic : " MHC Class II : A Target for Specific Immunomodulation of the Immune Response "
Dates : 3 - May - 95
Time : 3 : 30 PM


Seminar : Departments of Biological Sciences
Carnegie Mellon and University of Pittsburgh
Name : Dr . Jeffrey D . Hermes
. . .

Figure 1: Untagged announcement.

Topic/NNP :/: "/" MHC/NNP Class/NNP II/NNP :/: A/DT Target/NNP for/IN Specific/JJ Immunomodulation/NNP of/IN the/DT Immune/NNP Response/NNP "/"
Dates/NNP :/: 3/CD -/: May/NNP -/: 95/CD
Time/NNP :/: 3/CD :/: 30/CD PM/NNP
Seminar/NNP :/: Departments/NNPS of/IN Biological/JJ Sciences/NNPS
Carnegie/NNP Mellon/NNP and/CC University/NNP of/IN Pittsburgh/NNP
Name/NN :/: Dr/NNP ./. Jeffrey/NNP D/NN ./. Hermes/NNP
. . .

Figure 2: Announcement after being tagged by Brill.

One major flaw with Brill, and any other type of tagger that is currently available, is that the tag-set provided by the tagger is not robust enough for tagging announcements.  There is a significant difference between parsing announcements and parsing newspaper text.  Brill provides a rich set of tags for differentiating between various parts of speech, but POS tags are useless for tagging announcements, because announcements do not have complex grammatical structures.  Furthermore, Brill does not differentiate many subtle details that occur in an announcement.  For example, Brill considers any noun that begins with a capital letter to be a singular proper noun (NNP), and this over generalized rule has the undesired side-effect of tagging both John and Laboratory as NNP.

The way Brill mishandles capitalization and overlooks many textual details that occur in announcements hinders the goal of extraction information from announcements, because these textual details are extremely common in announcements and play an important part in differentiating between different types of information.  For example, we would like the parser to recognize that John is part of a person's name and Laboratory is part of building or a host's name based on the tags that Brill gives John and Laboratory.  Unfortunately, this differentiation is not possible with the current implementation of Brill, because Brill tags both John and Laboratory as NNP.

Brill's rich tag-set for POS tagging is also a hindrance when parsing announcements with grammars, because the tags are not useful for parsing announcements and make the grammars more complicated, as described in the section "Parsing with SCOL".  The ideal tagger for tagging announcements would recognize people's names, numbers, months, and company names.  The tagger would tag everything else as either a noun, a verb, or an adjective without worrying about irrelevant details for parsing announcements, such as tense and pluralizing.

Parsing with SCOL

SCOL is a toolbox with many different NLP tools for analyzing text.  The two tools used for analyzing announcements are tagfixes and cass.  In order to write any useful grammar for parsing announcements, it was necessary to create a subset of more useful tags for various words after tagging with Brill.  Grammars are written using the Cass partial parser.  A partial parser is used instead of a normal parser, because extracting information from announcements does not require parsing every sentence and phrase in the announcement.  Instead, only the relevant information from the announcement are of interest, and it is desirable for the parser to skip over the extraneous information that the parser cannot parse with the grammars provided.  The input from Brill is also formatted before parsing with Cass by debrill.

Creating new tags with tagfixes

Tagfixes translates a subset of tags associated with specific words to a new set of tags.  This functionality is extremely useful for generating a new set of tags that is more useful for parsing announcements with grammars.  Figure 3 shows the tagfixes file used for parsing announcements.  The first column specifies the new tag of the string  indicated in the second column, and the third column shows the original tag associated with the string.  Tagfixes is performed after tagging with Brill and before parsing with Cass.

month January NNP
month Jan NNP
month February NNP
month Feb NNP
month March NNP
month Mar NNP
month April NNP
month Apr NNP
month May NNP
month June NNP
month Jun NNP
month July NNP
month Jul NNP
month August NNP
month Aug NNP
month September NNP
month Sep NNP
month October NNP
month Oct NNP
month November NNP
month Nov NNP
month December NNP
month Dec NNP
month Spring NNP
month SPRING NN
month Summer NNP
month SUMMER NN
month Fall NNP
month FALL NN
month Winter NNP
month WINTER NN
loc Room NNP
loc room NN
loc Auditorium NNP
loc Hall NNP
loc Wing NNP
title Professor NNP
title Prof
title Doctor NNP
title Dr NNP
title Mr NNP
title Mrs NNP
title Ms NNP
title Doctoral NNP
title PhD NNP
talk Seminars NNP
talk seminars NNS
talk Seminar NNP
talk seminar NN
talk Colloquium NNP
talk colloquium NN
talk talk VB
talk Talk NN
talk TALK NNP
talk Lecture NNP
school department NN
school Department NNP
school Deparments NNPS
school University NNP
school College NNP
school Institute NNP
school School NNP
 
cmp Committee NNP
cmp Council NNP
cmp Center NNP
cmp Company NNP
cmp Companies NNPS
cmp Laboratory NNP
cmp Laboratories NNPS
cmp LABORATORIES NNPS
cmp LABORATORIES NNP
cmp Foundations NNS
col : :
hyph - :
at @ IN
cma , ,
qa " "
sq ~ NN
SYM ( (

on on IN
fr from IN
and and CC

dn edu NN
dn com NN
dn net NN
dn org NN

tm PM NNP
tm pm NN
tm AM NNP
tm am NN
tm noon NN

ADJ Russian NNP
CD 5 NN
loc romm NN

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 3: Tagfixes translates one set of tags to a new set of tags for specified words.

The tag month facilitates parsing dates.  Brill tags all month and their abbreviations as NNP.  Month is a more meaningful tag for months and their abbreviations, so tagfixes translates all NNP tags associated with months to the tag month. were translated into month.  Seasons, such as spring, summer, fall, and winter were also tagged as month, because they are associated with groups of months. 

The tag loc is used for parsing locations.  Words that suggest a type of location, such as room, auditorium, hall, and wing, are tagged as either NNP or singular nouns (NN) by the Brill tagger.  Again, the Brill tags do not convey the important information that these words are associated with places where the seminars take place.  Therefore, the loc tag was added to the tag-set to facilitate the process of writing grammars to parse locations from announcements.

People's names are one of the most challenging types of information to parse, because there is no distinguishing factor that separates names from other types of information or from the background text.  Brill tags most names as a pair of nouns - some combination of NN and NNP.  If such a simple scheme were used to parse names, then any other noun pair in the announcement would also be considered as names, which reduces the precision of the system.  Names are also difficult to parse because of their variation.  Various symbols can occur within names depending on the origin of the name, such as  Javier L~Pez.  A more detailed discussion on parsing names is provided in "Writing Grammar with Cass".  However, many speakers in announcements have titles in their names, such as Professor, Dr, Mr, or Mrs.  Brill also tags the titles as NNP or NNPS, which again does not convey any useful information.  Tagfixes translates the tags into the new tag title to provide contextual clues for finding the speaker's name during parsing.

The tag talk identifies keywords related to the type of talk in the announcement; whether it is a talk, lecture, or seminar series.  Brill tags most of these keywords as NNP and NNPS.  Therefore, the talk tag serves two purposes: identifying talk type and distinguishing these words from being parts of names.

Similarly, the tag school identifies keywords related to the speaker's affiliation, such as department and university.  Again, Brill tags most of these keywords as NNP and NNPS, so it is important to distinguish them from other words that might be parts of names.

Finally, tag cmp identifies keywords related to the host of the talk.  Keywords include words like company, laboratory, and center.  Again, Brill tags most of these keywords as NNP and NNPS, so it is important to distinguish them from other words that might be parts of names.  Identifying the host correctly is almost as challenging as identifying people's names, because the host is a name as well, such as IBM.  However, methods have to be implemented to distinguish a company name from a person's name.  A similar argument is applicable to the speaker's affiliation, if the affiliation is the name of a school, such as Carnegie Mellon, which could easily be the name of a person as well.

Other miscellaneous tags incorporated into the tag-set include dn for tagging parts of emails and URLs to indicate possible contact information; in, on, and tm for keywords identifying certain types of relevant information preceded or followed by keywords, such as in, on, am, and pm; col and at for special symbols such as : and @ that indicate the presence of time and emails.  The last three entries in tagfixes are for fixing some of the mistakes Brill makes when tagging announcements and fixing spelling mistakes in the corpus, suggesting that the system has to be able to tolerate human errors and errors with Brill when trying to extract information from announcements.

A quick glance at Figure 3 reveals how much richer the tag-set is after incorporating tagfixes for tagging announcements.  Many of the words that are important for parsing announcements are tagged by Brill as either NNP, NN, or their plural forms.  Brill's impoverished tag-set for parsing announcements makes writing grammars extremely difficult, so tagfixes is used to expand the tag-set to make parsing more feasible.

Writing Grammar with Cass

A grammar rule is associated with each type of information extracted from an announcement and each rule consists of one or more sub-rules.  Figure 4 shows the grammar rules used to parse announcements.  Grammar rules were written using KE and whatever specific information that could be extracted about the domain.  For example, most topics are encapsulated inside of quotes.  These types of information were sometimes generalized or made more specific in order to capture the correct information from the announcement.  If the rule was too general, then it would misinterpret other types of information.  If the rule was too specific, then many rules were required to capture one type of information.  Therefore, finding a good balance for rule-specificity was a challenge.  How many exceptional cases handled by the rules also affected the size of each rule.  Many of the rules were complicated by the presence of Brill's overabundance of tags associated with different parts of speech.  Therefore, rules were often augmented with plural forms and different tenses of words.  The order that the rules are applied also affects how well information is parsed from the announcement.  More general rules, such as that for parsing the speaker, are placed towards the end so that more specific rules can be applied first during parsing to increase accuracy.

 ## Date
date -> (month .? CD cma CD) | ((CD | month .?) (hyph | /SL) (CD | month .?) (hyph | /SL) CD) | (CD? month .? CD) | (month CD hyph month CD (cma CD)+);

## Time
time -> CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))? (hyph (CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))?))?;

## Room
room -> ((CD? (NNP | NN | school)* loc+ (CD | (CD col CD))?) | (CD NNP) | (NNP CD) | (NNP hyph CD)) ((IN NNP) | cma (NNP | NNPS)+ loc CD?)?;

## Type
type -> ((NN | NNS| NNP | NNPS| JJ | VB | DT | month)+ talk (IN (NN | NNS | NNP | NNPS)+)? (NNP | NNS)?) | (talk NNP? IN (NN | NNS | NNP | NNPS| JJ | VB | DT)+) | (NNP . NNP . talk);

## Topic
topic -> (qa (WP | NN | NNS | NNP | NNPS | JJ | VB | DT | VBN | VBG) (NN | NNS | NNP | NNPS | POS | JJ | VB | DT | . | IN | TO | CC | col | cma | and | hyph | VB | VBG | VBN | MD)+ qa);

## University
univ -> (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? school) | (NNP NNP? and (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? (cma NNP)? and? NNP school))? | (school IN (NNP | NNPS)+ (and | CC) DT? NNP+ school?);


## Company
company -> ((NNP | NNPS)+ cmp (SYM NNP+ SYM)? (and DT (NNP | NNPS)+ cmp)?) | (cmp IN (NNP | NNPS)+ (IN NNP)? (cma NNP)?) | (fr (NNP | NN | NNPS | NNS)+) | (school NNP+ cmp);

## Person
person -> ((title .?)? (NNP | NNP . | NNP ((NN | NNP) .))+ (hyph (NNP | NN))? (NNP | NNS) (sq (NN | NNS | NNP))? | NNP NNP VBZ) | (title . ((NN | NNP) .)+ (SYM (NNP | NN) SYM)? NNP) | (NNP cma NNP (and NNP)?);

## URL
url -> ((NN | NNS | NNP | JJ | VB | DT) . (NN | NNS | NNP | JJ | VB | DT)?)+ . dn;

## E mails
email -> (NN | NNP | JJ | VB | DT | CD) SYM? at (url | NN | NNS);

Figure 4: Grammar rules for parsing announcements.

Grammar Rule Analysis

The grammar rule for parsing a date is the following: (month .? CD cma CD) | ((CD | month .?) (hyph | /SL) (CD | month .?) (hyph | /SL) CD) | (CD month .? CD) | (month CD hyph month CD (cma CD)+) and each sub-rule handles a different representation of a date, as shown in Table 1.  The sub-rules are pretty straightforward, and it was observed that the structure of dates is relatively consistent throughout the corpus.  Dates are one of the easier pieces of information to parse, because the information and the arrangement of the information that makes up a date is unique.  For example, no other piece of information consists of a word with the month tag or a specific arrangement of numbers separated by symbols (other than time, which is discussed later).  The difference between a date and a time is the symbol that separates the numbers in each type of information.  In the case of a date, the symbols are usually "-", "/", or sometimes ".", and in the case of time, the symbol is usually a ":".

Sub-Rule Examples
(month .? CD cma CD) Jan. 19, 2004; Jan 19, 2004
((CD | month .?) (hyph | /SL) (CD | month .?) (hyph | /SL) CD) 1 - 19 - 2004; 1 / 19 / 2004; 19 - Jan - 2004
(CD? month .? CD) 19 Jan 2004; Jan 2004
(month CD hyph month CD (cma CD)+) Jan 19 - Jan 22

Table 1: Examples of each sub-rule for parsing date.

The grammar rule for parsing a time is the following: CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))? (hyph (CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))?))?.  Examples of the rule are shown in Table 2.  There is only one rule for parsing a time.  Similar to dates, the structure of times is also relatively consistent throughout the corpus, and make times one of the easier pieces of information to parse.  The information and the arrangement of the information that makes up a time is unique, because no other piece of information extracted consists of a specific arrangement of numbers separated by a colon and sometimes a dash if the start and end time of the talk are on the same line.  No other information is followed by a string associated with the tm tag either to indicate AM and PM.

Sub-Rule Examples
CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))? (hyph (CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))?))? 9:00; 9:00 PM; 9:00 P.M.; 9:00 - 10:00; 9:00 PM - 10:00 PM

Table 2: Examples of rule for parsing time.

The grammar rule for parsing a room is the following: ((CD? (NNP | NN | school)* loc+ (CD | (CD col CD))?) | (CD NNP) | (NNP CD) | (NNP hyph CD)) ((IN NNP) | cma (NNP | NNPS)+ loc CD?)?.  Examples of the rule are shown in Table 3.  Similar to time, room consists of one grammar rule.  Unlike the previous time and date, the representation of room is less consistent throughout the corpus, because room numbers can appear before or after building names and not all rooms are associated with numbers.  Therefore, in an effort to capture all of the different types of representations of rooms, some of the sub-rules are made to be more general and overlap with sub-rules for extracting other types of information.  This is a tradeoff for being able to find information about rooms in announcements.  Therefore, the order in which the grammars are executed affects what type of information is parsed, as discussed in the Results section.  However, most rooms have some type of structure with one or more words with the loc tag and some combination of numbers, which helps to write grammar rules for parsing rooms.

Sub-Rule Examples
((CD? (NNP | NN | school)* loc+ (CD | (CD col CD))?) | (CD NNP) | (NNP CD) | (NNP hyph CD)) ((IN NNP) | cma (NNP | NNPS)+ loc CD?)? Mellon Institute Conference Room; room 261 of GSIA; Wean Hall 7500; 2315 Doherty Hall

Table 3: Examples of rules for parsing room.

The grammar rule for parsing the type of talk is the following ((NN | NNS| NNP | NNPS| JJ | VB | DT | month)+ talk (IN (NN | NNS | NNP | NNPS)+)? (NNP | NNS)?) | (talk NNP? IN (NN | NNS | NNP | NNPS| JJ | VB | DT)+) | (NNP . NNP . talk) and each sub-rule handles a different representation of a type of talk, as shown in Table 4.  The structure of the information for talk type is relatively consistent throughout the corpus, since most announcements indicate what type of seminar the announcement is for by stating the type followed or preceded by a word with the talk tag.  What isn't consistent is the type of words used in talk type, and this is one place where Brill's POS tags become a hindrance, because the rule has to account for different noun and verb types, as well as other parts of speech, including determiners, prepositions and adjectives.  If these differentiations were not  made, then the rules could be more concise, and it would not be necessary to worry about whether the rule has accounted for every possible POS tag in order to parse the information correctly.

Sub-Rule Examples
((NN | NNS| NNP | NNPS| JJ | VB | DT | month)+ talk (IN (NN | NNS | NNP | NNPS)+)? (NNP | NNS)?) UTC Summer Seminars for Graduate Students; Marketability Talk
(talk NNP? IN (NN | NNS | NNP | NNPS| JJ | VB | DT)+) Lecture Series
(NNP . NNP . talk) Chem. Eng. Seminar

Table 4: Examples of rules for parsing talk type.

The grammar rule for parsing a topic is the following: (qa (WP | NN | NNS | NNP | NNPS | JJ | VB | DT | VBN | VBG) (NN | NNS | NNP | NNPS | POS | JJ | VB | DT | . | IN | TO | CC | col | cma | and | hyph | VB | VBG | VBN | MD)+ qa).  Examples of the rule are shown in Table 5.  After studying many examples from the CMU corpus, it was noticed that many topics are distinguished from the rest of the text by quotation marks and consist of at least two words within the quotation marks.  However, the topics of many announcements are also not in quotes and those topics are distinguished by their placement within the announcement or consist of all capital letters.  There are also cases where the topic is not distinguished by quotes or distinct text formatting.

Using quotation marks as markers to distinguish a topic is a good start at parsing topics consistently, since grammars cannot take advantage of any placement or text (capitalization) information within an announcement.  A grammar rule for parsing topics without using quotation marks would be too broad and could potentially correspond to any consecutive series of two or more strings.  Unfortunately, topics do not have any keywords to distinguish them from other types of information and could include any word from a person's vocabulary.  This diversity makes topics difficult to parse without some type of knowledge about the domain to write good grammars with.

One problem associated with using quotation marks to identify a topic is information within the body of the announcement sometimes uses quotation marks for literary purposes, not to demarcate a topic.  Therefore, an algorithm would be useful for selecting the correct quoted phrase as the topic, as described in the section "Extracting information with Parser.java".  This is another place where the richness of Brill's POS tags is a hindrance, because all of the different tags that cannot be used for parsing announcements still have to be accounted for in order to parse the topic, making the grammar rule more cumbersome.

Sub-Rule Examples
(qa (WP | NN | NNS | NNP | NNPS | JJ | VB | DT | VBN | VBG) (NN | NNS | NNP | NNPS | POS | JJ | VB | DT | . | IN | TO | CC | col | cma | and | hyph | VB | VBG | VBN | MD)+ qa) " MHC Class II : A Target for Specific Immunomodulation of the Immune Response "; " Graft Modification of Polymers in Twin Screw Extruders , "

Table 5: Examples of rules for parsing topic.

The grammar rule for parsing the speaker affiliation is the following (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? school) | (NNP NNP? and (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? (cma NNP)? and? NNP school?)) | (school IN (NNP | NNPS)+ (and | CC) DT? NNP+ school?) and each sub-rule handles a different representation of a speaker affiliation, as shown in Table 6.  Speaker affiliation is usually a department title or the name of a school; therefore, the school tag facilitates identifying the speaker affiliation.  Not unlike names, the structure of the speaker affiliation varies, and writing a grammar rule that encapsulates all of the different ways the name of a department or school can be represented is difficult.

Sub-Rule Examples
(NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) University of Pittsburgh
(NNP NNP? school) McGill University; Carnegie Mellon University
(NNP NNP? and (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? (cma NNP)? and? NNP school?)) Carnegie Mellon and University of Pittsburgh; University of Tennessee , Oak
(school IN (NNP | NNPS)+ (and | CC) DT? NNP+ school?) department of Electrical and Computer Engineering

Table 6: Examples of rules for parsing speaker affiliation.

The grammar rule for parsing the host is the following ((NNP | NNPS)+ cmp (SYM NNP+ SYM)? (and DT (NNP | NNPS)+ cmp)?) | (cmp IN (NNP | NNPS)+ (IN NNP)? (cma NNP)?) | (fr (NNP | NN | NNPS | NNS)+) | (school NNP+ cmp) and each sub-rule handles a different representation of a host, as shown in Table 7.  Many of the problems encountered in parsing the speaker affiliation are also encountered in parsing the host.  There are several keywords that uniquely identify a host, such as those tagged with cmp.  However, the names of the hosts are often inconsistent and vary greatly in structure, and sometimes there are substructures, such as information surrounded by parentheses.  Trying to find a rule that encapsulates the different ways that the name of a host can be represented is a challenge.  Aside from the cmp tag, the from tag was also used to parse structures such as "John, from IBM".

Sub-Rule Examples
((NNP | NNPS)+ cmp (SYM NNP+ SYM)? (and DT (NNP | NNPS)+ cmp)?) Merck Research Laboratories; Dow Chemical Company; Student Alumni Relations Council ( SARC ) and the Career Center
(cmp IN (NNP | NNPS)+ (IN NNP)? (cma NNP)?) Laboratory of Computer Science
(fr (NNP | NN | NNPS | NNS)+) from IBM
(school NNP+ cmp) Department of Biology Committee

Table 7: Examples of rules for parsing host.

The grammar rule for parsing the speaker is the following ((title .?)? (NNP | NNP . | NNP ((NN | NNP) .))+ (hyph (NNP | NN))? (NNP | NNS) (sq (NN | NNS | NNP))? | NNP NNP VBZ) | (title . ((NN | NNP) .)+ (SYM (NNP | NN) SYM)? NNP) | (NNP cma NNP (and NNP)?) and each sub-rule handles a different representation of a speaker, as shown in Table 8.  Parsing the speaker, a person's name, is one of the most challenging tasks in extracting information from an announcement.  Names can take a variety of forms depending on ethnicity and sometimes are preceded or followed by a word tagged as title.  The title tag is extremely useful in identifying names, because no other type of information contains the tag.  Most of the time, a name is simply two NNP, and in some cases just one NNP.  Therefore, a simple rule that covers NNPs could correspond to almost anything in the announcement, including information about location, host, topic, talk type, and speaker affiliation.  The wide variety of forms that names can take - some names even include symbols such as - and ~, and the wide variety of arrangements the title and name can take all contribute to the difficulty of parsing names.

Due to the difficulty in parsing names and the lack of a proper tag-set for distinguishing names, the grammar rule for parsing names is particularly general.  The rule depends on the title tag to indicate names that do have titles.  However, for announcements with names that do not have the title tag, then the accuracy of the rule depends on the grammar rules for parsing the other information.   As shown in Figure 4, names are one of the last information that the parser parses.  Therefore, if the other information in the announcement is parsed correctly, then there should be a limited number of pairs of NNPs that can be the name of the speaker.  Whether the speaker's name is parsed correctly also depends on the extracting function described in "Extracting Information with Parser.java".

Sub-Rule Examples
((title .?)? (NNP | NNP . | NNP ((NN | NNP) .))+ (hyph (NNP | NN))? (NNP | NNS) (sq (NN | NNS | NNP))? | NNP NNP VBZ) Dr . Jeffrey D . Hermes; Shelley Brozenick
(title . ((NN | NNP) .)+ (SYM (NNP | NN) SYM)? NNP) Dr . R . J . ( Bob ) Pangborn
(NNP cma NNP (and NNP)?) Ermolinsky , Meleshchenko and Plaksiyev

Table 8: Examples of rules for parsing speaker.

The grammar rule for parsing the contact information are the following ((NN | NNS | NNP | JJ | VB | DT) . (NN | NNS | NNP | JJ | VB | DT)?)+ . dn and (NN | NNP | JJ | VB | DT | CD) SYM? at (url | NN | NNS).  Examples of the rule is shown in Table 9.  The only type of contact information parsed are URLs and emails.  Phone numbers were are parsed, because they take the form of CD hyph CD after being tagged by Brill, which is too similar to the structure for dates.  URLs and emails are relatively easy to parse, because URLs take the form of a series of strings separated by "." and usually ends with a domain name tagged with dn, such as org or edu.  Similarly, emails are distinguished by the @ sign.

Sub-Rule Examples
((NN | NNS | NNP | JJ | VB | DT) . (NN | NNS | NNP | JJ | VB | DT)?)+ . dn frc . ri . cmu . edu; gp . cs . cmu . edu
(NN | NNP | JJ | VB | DT | CD) SYM? at (url | NN | NNS) sutner @ aiki; plp @ cs

Table 9: Examples of rules for parsing contact information.

Figure 5 shows a sample output from Cass with text tagged by these rules.

Extracting information with Parser.java

After parsing the announcement with the Cass partial parser, the parsed information is extracted from the rest of the text and the correct information is chosen from the parsed information based the tags Cass associates with the parsed information, such as date, time, room, type, topic, univ, company, person, url, and emailParser.java is a java program that extracts the parsed information from the irrelevant text, decides which information to extract, and translates the Cass tags into standard tags for conveying the extracted information, such as date, time, speaker, host, speaker_affiliation, contact, topic, room, and series.  The program returns one instance of each type of information parsed by Cass.  parseInput collects all of the parsed information generated by Cass, and parseOutput selects one instance of each type of information parsed.

Currently, parseOutput selects the first instance of each type of information and assumes that the first instance is correct.  This assumption is based on the knowledge that most announcements place the relevant information in the headers or beginning of the announcement.  Therefore, the earliest information parsed should contain the pertinent information to be extracted.  The grammar does a relatively good job of parsing the announcement, so if most of the header information is parsed correctly, then the information returned to the user should also be correct.  Trying to write more complicated functions to determine which information to parse is an extremely difficult task, since the program has no semantic understanding of the contents of the announcement.  Furthermore, the Brill tags do not provide any information about the contents of the announcement, so the grammars depend mainly on the few tags incorporated with tagfixes.


Neural Analysis of Multiple Inputs

Neural Analysis of Multiple Inputs

The problem of locating an informative segment buried inside a larger text can be broken up into two subproblems: 1) determining which words are likely to be a part of that segment and 2) identifying contextual cues that provide readers with hints that the segment will appear nearby. For instance, finding the time of an event requires knowledge of what a time might look like (eg "4:30" or "noon") and also what words indicate that a particular time is the time of the event (eg "the talk will start at").

Humans performing these tasks can employ strategies that make use of both structural and semantic content from the text. A computer’s representation of the message is necessarily much weaker since this high level natural language understanding is not available to the machine. However, it is possible to simulate partial understanding by combining a number of elementary structural or semantic features into a higher level model. This requires not only detecting the elementary features but also weighing them such that all basic pieces of information complement the others and form the building blocks of the program’s final decision.

This is the model for our third and final system for automated message parsing. The Neural Analysis of Multiple Inputs (NAOMI) uses machine learning classifiers to automatically learn weights for each piece of evidence that contributes to the end solution. A collection of homegrown feature generators read the text and provide inputs to backpropagation neural networks. Each network is trained to identify message segments that contain a specific type of information – speaker name, talk location, start time, or end time. The networks’ outputs are collected, sanity checked, and stored to serve as a summary of each message.

Architecture

The feature generators are fairly simple natural language processing programs that tag input at the level of individual tokens. When one is run on an input message, it produces an output file containing the original message and a separate set of standoff tags. A standoff tag consists of a tag type and character offset pointers into the message body indicating which text the tag applies to. Standoff tags were used because they offer a simple method for merging tagged files with overlapping and nonstandard tags.

The feature generators built for this project are:

  • Inverse Document Frequency. This calculates a score for each token based on how often it occurs in the message relative to how often it occurs in the entire corpus. The formula used is
    [# of appearances in message] * log([# of stories] / [# of stories containing token])
    This will penalize common words that occur in many documents while rewarding words that are typically important and unique to one particular message, such as names, dates, and talk topics. The top ten scored tokens of each message are tagged as "interesting". This is often useful for distinguishing between human names and other capitalized or proper nouns that could be mistaken for names, since the speaker's name tends to appear repeatedly within the message but infrequently in other messages.

    Our IDF program is also able to chunk words together into phrases, when appropriate. If an "interesting" word is frequently co-located with other interesting words, they are combined to form a larger phrase. This is again useful for picking out names which typically consist of multiple tokens. Talk topics are also commonly recognized as phrases.

    The program can be run in batch mode, reading in and tagging an entire directory of messages. It can also be run on individual messages using saved frequency data. Sample run, code [~700 lines]

  • Semantic Word Class Analysis. This provides a very simplistic analysis of which tokens have been seen as a part of or immediately preceding a tag in the training set. The program is run on a set of tagged messages and builds up histogram data recording which words occur as part of a tag. This effectively builds up a very large equivalency class of words for each type of tag. Classes are also built for leadup phrases, specifically the two words previous to a tag are recorded. Suppose the training set includes "Randy Pausch will be presenting in <location>WeH 8400</location>". The program then adds "WeH" and "8400" to the location class and "presenting" and "in" to the near-location class.

    This is clearly an oversimplified metric allowing the program to determine which phrases "look like" they could be part of or near a tag. No attempt was made to improve this system; rather, we wanted to test how well the algorithm would perform with an impoverished model of semantic word classes. Since most of the previous work in information extraction of seminar announcements has been focused on building up these classes (as with HMMs or the handcrafted parsers discussed above), it will be straightforward to replace this feature generator program in the future with a more sophisticated model of semantic class analysis.

    The program can be "trained" on a directory of tagged files or produce its own tags on individual or batch sets of untagged message files using saved frequency data. Sample run, code [~400 lines].

  • Capitalization. This detects capitalized words. This is simple but useful since it provides cues for proper nouns such as names or locations. The capitalized words are tagged using offset tags like all other features, making it easy to combine the feature tagged files into a format the neural networks can use.
  • Numerics. Similar to the previous feature generator, except this tags only tokens that include numbers. This is useful for finding room numbers or times.
  • Non alphanumerics. This will tags only tokens containing a character that is not alphanumeric. Common examples include dashes, periods, or colons. This is helpful in detecting times or leadup tokens like "Speaker:".

These programs illustrate the range of complexity that can be found in the feature generators for the NAOMI system. Features can be extemely simple or as complicated as full blown NLP systems. Adding a new type of input to NAOMI is as simple as plugging in a new feature generator that produces some type of potentially helpful tagged output.

The above programs certainly do not cover the entire breadth of possibly helpful data. It is also clear that no single one of these features is sufficient to provide the information needed to reach a conclusion. However, the neural networks are able to use these basic inputs to generate meaningful results with surprising accuracy. Future work to include more sophisticated feature generators should only continue to improve the performance of the system.

The neural networks are the other important component of NAOMI. Each neural network is charged with the job of learning how to combine the different tags in the input into a decision about which tokens should be tagged as a location, speaker, start time or end time. They process each message one token at a time, building an array of binary inputs representing the standoff feature tags that are present in the current or nearby tokens in the text.

Inter-esting

token in 'near location' class

token in 'location' class

prev token in 'near location' class

prev token in 'location' class

prev[2] token in 'near location' class

prev[2] token in 'location' class

Caps

Non-alpha

Numeric

 

LOCATION

TOKEN

-1.0

1.0

1.0

1.0

-1.0

1.0

-1.0

-1.0

-1.0

-1.0

=

0.0

in

-1.0

1.0

1.0

1.0

1.0

1.0

-1.0

1.0

-1.0

-1.0

=

1.0

WeH

-1.0

1.0

1.0

1.0

1.0

1.0

1.0

-1.0

-1.0

1.0

=

1.0

7220

Figure 1. Sample rows of a neural network’s input data

Each network has 10 inputs, 5 hidden units, and 1 output. The units have continuous activations and are fully connected. The networks were trained on 6000 examples with a learning rate of .3 until the difference in sum squared error stabilized at less than .0001. Each network was also tested on a validation set of 10000 examples and corrected for overfitting by choosing the weights that optimized both training and validation scores. The best resulting trained weights are stored in a file and used to score future messages. The length of time needed to train each network is on the order of five minutes, while the length of time needed to run the entire system on a new message is about three seconds. The neural network software in NAOMI is another homemade piece of code recycled from a previous project.

The result of running a network on a new message will be an array of tokens marked with a 0 or 1, where a 1 indicates that the network believes the token should be included in the output tag. There is a small amount of post-processing that takes place here to add a "sanity check". For instance, if a token is tagged as a name but its neighboring tokens are not, the output is flipped to 0. This is an arbitrary (handmade) rule to enforce the constraint that names should contain at least two tokens. Rules of this type generally improve the precision of the system with minimal impact upon the recall. The idea is to screen false positives, hopefully without causing false negatives in the process. The other rules here filter out times like "p.m." with no attached number, times containing dashes, and locations consisting of "banned words" that tend to turn up in false positives often.

Ideally, we should be able to add new features to the input stream that would help the network distinguish these cases on its own. That would likely take the form of improving the semantic word classes to exclude common false positives. Due to time constraints and the aforementioned desire to keep the semantic classes simple, the postprocessing filter seems to perform the task adequately.

Other pieces of software developed as parts of NAOMI include:

  • TagHelper: a program that translates XML and Brill style tags into standoff tag formats.
  • TagWindow: a GUI for displaying a message and editing its tags. This is useful for rapidly hand tagging the corpus with new types of tags, which will be necessary for NAOMI to go beyond the four basic tags of the CMU corpus.
  • Echo: a program to combine multiple standoff files into a comprehensively tagged text

Here is a sample output segment of a run of the location neural net (negative values indicate the postprocessing filter stepped in to help). Here is the original message. Here is a live run of the demo in action.


Results

We performed a statistical evaluation of the results of two of the systems, calculating recall, precision, and F1 score based on a set of examples from the CMU corpus. F1 score is defined as (2 * recall * precision) / (recall + precision). This gives us a single score for each system's performance that takes into account both pieces of the recall/precision tradeoff.

For the sake of comparison, here are the scores from Freitag's thesis on the same CMU tagset.

Freitag SRVLocationSpeakerStart TimeEnd Time
Recall69.558.398.392.6
Precision73.854.998.466.7
F1.716.565.983.775

VLRE Tagger F1

Computing F1 for the VLRE tagger on the CMU corpus was complicated by several issues. Foremost was the impoverished nature of the CMU corpus. Aside from the pervasive <paragraph> and <sentence> tagging in the corpus, the CMU corpus only has four kinds of tags. Also, many announcements have repeated tags. For example, if an announcement in one place says that the time of the talk is 1pm and another place the announcement is listed as 1:00pm, the CMU corpus will contain two <stime> tags. When duplicate tags within the same announcement were eliminated, the distribution of tags looked like this:

Original

Duplicates removed

Etime

427

227

Location

635

458

Speaker

749

408

Stime

972

484

Key things are missing from this tag set, including the dates of each talk, the abstract, the speaker’s affiliation, and so on. The VLRE tagger employs a much richer tag set.  Running the VLRE tagger over the CMU corpus, the following tags were found with the following frequencies:

center 

19

host   

87

refreshments_room      

1

refreshments_time      

12

rfc822_time    

3

series 

42

speaker

155

speaker_affiliation    

134

speaker_title  

37

talk_abstract  

14

talk_building  

67

talk_contact   

1

talk_date      

481

talk_room       

201

talk_room_building     

138

talk_time      

485

talk_time_end  

117

talk_time_start

117

talk_topic     

485

urls   

1

Clearly, the VLRE tagger is able to find more kinds of information in the training set. One of the advantages of the regular expression technique is that it is robust and frequently can find partial information. (IE: if it can’t find the room and building together, it may be able to identify the room or the building.) Unfortunately, the VLRE tagger also makes mistakes.

Alas, computing F1 for the VLRE tagger is complicated by the fact that the VLRE tagger does significant post-processing. Times, for instance, are rewritten into a canonical 24-hour form.  To use the example from above, the VLRE tagger would report that the talk would start at 13:00, not “1pm” or “1:00pm”. Calculating F1 therefore requires placing all of the times in the CMU corpus into a canonical time form as well.

Having written such a program (made simpler with the parse_time() function that uses a regular expression library), it is possible to compute an F1 for the talk_time/talk_time_start and talk_time_end tags.  To compute the number of possible matches in each announcement in the CMU corpus, the tags were extracted from each talk and turned into a 24-hour format. We then ran the VLRE tagger on the untagged version of the talk announcement. The talk_time_start tag was used if present, otherwise the talk_time tag was used. A successfully found tag is one in which the VLRE tag was found to be equal to the canonical CMU tag.

The final are presented below; errors from this tagging are in the file vlre-errors.txt.

VLRESpeakerStart TimeEnd Time
Recall.539.961.903
Precision.8151.00.995
F1.649.980.947

All of the source code for the VLRE system can be found in the /scan directory of this webserver.

NAOMI F1

F1 was computed for NAOMI on a slightly different way. Since NAOMI does not postprocess the output in any way that alters the content or format of the message, it is possible to compare the system's output to the correct output for each individual token. Also, NAOMI does not combine multiple tags in the same document (regardless of whether or not they match), instead simply returning all matched tokens to the user. This means that for the purposes of F1 score, if a speaker tag occurs twice in the same document, NAOMI is expected to correctly tag all tokens in each segment.

Note that since NAOMI was trained on a portion of the CMU corpus, for the sake of validity these test results are only from the half of the tagged corpus not used for training. Distribution of tags is as follows:

Tagged Tokens

Etime

185

Location

724

Speaker

617

Stime

469

The following table lists the scores of each of NAOMI's networks on the test set:

NAOMILocationSpeakerStart TimeEnd Time
Recall.950.957.809.960
Precision.857.800.774.846
F1.901.871.791.900

The following excel spreadsheets contain full logs of the test run as well as statistical analysis of the correct and incorrect tokens. Error rows are marked in orange, and rows containing negative values for NAOMI's guesses represent spots in which the postprocessor made a change to the original value.

Building the Calendar

The tags found by our systems can be thought of as name/value pairs. These pairs are converted into an SQL statement and are used to populate a database table. Duplicates are suppressed by first removing any SQL records that contain the same date, time, and room number. (This elimination is based on the theory that there will not be two talks simultaneously scheduled for the same room at the same time. Unfortunately, this methodology depends upon room tagging working properly --- something that we have not seen in all cases.)

Some announcements simply cannot be properly parsed. Consider http://www.nitroba.com/msgs/2003/4/msg.22.Ke1iB.txt, an email message that was sent to the cis-seminars@theory.lcs.mit.edu mailing list. This message seeks to reschedule a talk that was originally scheduled for 4/23/03. The new date is “FRIDAY, NOT sure when yet.. probably MAY.” Analyzing this message, the taggers are not able to determine the appropriate date.

Three scripts have been written to report from the database. The calendar.cgi script displays all of the matching talks for a given month. The today.cgi script displays the talks for a given day and the following day. The dump.cgi script dumps a few fields for the entire database. Whenever an entry is displayed, the entry itself can be clicked on to display the original talk announcement. Alternatively, the user can click on the link that says [demo] to see the results of parsing the entry with all three systems that we developed.

Discussion and Conclusions

Brill/SCOL

This system was not run against the CMU tagset to calculate F1 accuracy, but we are able to make other direct observations about the feasibility of using this approach to solve the problem of automated announcement parsing.

The difficulty in writing grammar rules for Brill tagged output is that the Brill tag set limits the accuracy of extracting the relevant information from the messages. Cass cannot take advantage of many of the formatting cues found in announcements, including the placement of information in the announcement and the use of capital letters for titles.  Therefore, ad hoc grammars have to be invented to extract the information desired.  Furthermore, Brill and Cass were designed for purposes other than to parse announcements, and the format for announcements is extremely different from the format of the type of text Brill and Cass were built for or trained upon.  Most of the information extracted from announcements are phrases or a series of strings that do not have any grammatical meaning.  Therefore, it is difficult to write rules matching a tag set meant for text with grammatical structures.

The order in which the rules are applied also affects the accuracy of the parser.  Therefore the most general rules, such as those for parsing people, are placed at the end and more specific rules, such as those for parsing the topic, are placed at the beginning.  This placement allows the more specific rules to be applied first and hopefully filter out most of the information, leaving a smaller set of potential matches the more general rules. Any attempt to use this type of filter based approach must be mindful of this constraint

The grammar could have been written to take advantage of the header style and therefore make the task of parsing announcements much easier.  For example to parse the topic, the grammar could have searched for the string "topic:" and parse the entire line.  However, this approach was not pursued, because it defeats the purpose of using NLP to extract the information from the announcement body.  The grammar that does not rely on information such as "topic:" is more robust, because it can handle announcements without those header information as well.

Future work to improve the system include developing a better function for Parser.java to select the correct information, studying a larger corpus in order to write better grammars for parsing the different types of information, and even developing a better tagger for tagging announcements.  This last idea is the major roadblock to successfully using SCOL to parse the announcement messages. Without a tagger more well adapted to the types of structures and word classes found in these messages, writing rules to match the message content is not a feasible approach.

VLRE

Sincere thanks go to MIT professor Rob Miller, who said that it was mandatory to do regression testing on the VLRE technique in order to gauge the effectiveness of new rules. A simple regression system was created which tagged the entire corpus and compared the tag results to previous runs. This regression system made it very easy to determine the effect of new regular expressions added to the system.

Another form of regression testing was to compare the tags found by the VLRE system with the tags in the tagged CMU corpus. By exploring the differences it was possible to find systematic errors in the VLRE tagger and correct them. This became an iterative process that was in principle similar to the learning that a neural network might perform, but more effective in practice since the programmer was free to make a variety of different changes to the source code.

One obvious area for improvement for this system is in the selection of regular expression weights.  Currently these were assigned by the programmer and hand-tuned as necessary to change rule priority. A genetic algorithm could be used to tune the weights, similar to the way in which SpamAssassin weights are tuned.

Overall, we feel that we have shown the VLRE approach to be a successful technique for automated information retrieval from announcement messages. The VLRE system, especially when combined with the NAOMI architecture as discussed below, appears to be powerful enough to be populate a calendar and serve a useful purpose for the users of that service.

NAOMI

NAOMI scored fairly well on the four types of tags present in the test corpus, despite the fact that the semantic word class input features were intentionally very weak, as discussed above. This is a good sign that the networks are able to take advantage of the other types of structural and semantic information presented to them, even though no single one of the inputs is sufficient to independently solve the problem. The architecture is flexible enough that it can accommodate a diverse group of input features, and the neural system has proven to be up to the task of performing data fusion by weighing the inputs and producing human readable output.

Speaker identification compared very well to other systems with an F1 of .871. The network did a good job of identifying names, using a combination of the leadup word classes, capitalization, and the inverse document frequency count. Since names are fairly unique to a document, the IDF tagger does particularly well at locating them. Interestingly, a common error made by the system here is tagging the name of the lecture host in addition to the speaker. A feature generator capable of detecting the presence of a "host" attribute would help the system to improve its performance. Other errors frequently occurred in the presence of initials in names. In general, erroneously tagging names that aren't the speaker is a larger problem (.8 precision) than actually finding the names (.95 recall). This suggests that future improvements will need to focus on providing the system with inputs that allow it to differentiate between speaker names and ordinary names, which can probably be done by improving the accuracy of the equivalency classes of words found nearby speaker tags.

The location tagger also scored relatively well. It was able to find room numbers by searching for numeric tokens nearby words in the semantic equivalency classes for other locations. Building names were found in a similar fashion except they required capitalized words instead of numbers. However this approach was not without problems. The largest source of errors was the situation in which NAOMI would correctly find the location but add extra tokens to the start or end of the tag. The extra tokens were almost always capitalized or numerical in nature. Although this is a problem, it is a relatively small error to make in comparison with completely missing or inventing a tag. The tags with additional tokens appended are still human readable and adequately express the intended information.

The tagging of end times for the events also worked very well. The types of errors made here are roughly 33% times that are incorrectly labeled as start times, and 67% extra tokens that are incorrectly added to the end or beginning of an existing tag (usually "-" or "pm"). These types of errors are similar to the errors in the speaker and location modules respectively; again, it seems reasonable to conclude that as the word class inputs are built up in a more robust manner, performance on these cases will improve. As an aside, the neural network in the etime task used a threshold of .25 instead of .5 to determine when a token should be tagged. This was likely because there were so few examples of end times in the training set, it was difficult for the neural network's weights to push the value all the way over .5 in the marginal cases. Setting the threshold lower improved performance by a decent amount.

Finally, the start time tagger worked adequately but showed room for improvement. As with the speaker tagging task, a large source of errors here was in differentiating between start times and the times that refreshment will be served (or in general, any other time). It is surprising to see this, given the accuracy of the end time tagger, but the difference is likely again in the classes of words that lead up to the time tag. For an end time, the leadup is likely to be another time (as in the format "1:30 to 2:30"). For a start time, the leadup is much more complicated and is therefore not handled by our simple word class system.

Another frequent problem with start times was mislabeling numerical date formats and phone numbers as times. These problems could be fixed if we had phone numbers, dates, or times tagged and used as inputs to the system. This would allow NAOMI to select only actual times as parts of tags for the output. Furthermore, if another input to the system could represent refreshment times or other types of non-start times (such as RFC header time), NAOMI would then be able to prevent from overgeneralizing on times and could do a better job of selecting only start times.

Future Work

We see here a common theme throughout the types of errors produced by NAOMI: specifically, the equivalency classes of words surrounding tags are weak. Also, the feature generators we have provided to the system have gaps when it comes to detecting certain types of information that is not tagged in the corpus (like dates or the time that refreshments are served). Furthermore, we see that one of the most problematic aspects of the VLRE system is that there is no intuitive way to weigh and combine the hundreds of regular expressions required to extract the interesting types of data from the messages.

Given the shortcomings of each system, it appears that there is a natural marriage waiting to be made between VLRE and NAOMI. By using VLRE's outputs as input tags to NAOMI, we accomplish two things. VLRE gets an effective algorithm for automatically weighing its outputs to determine which matching patterns are informative and which patterns are noise. NAOMI gets a powerful new host of feature generators capable of covering a much broader and more complicated range of information. These generators would serve as the improved semantic words classes, in that all tokens that match a VLRE pattern can be thought to be in the same equivalence class.

The two systems clearly would work well together in symbiosis. The resulting system would be a much more powerful and dynamic method for automatic information extraction of message announcements (or any natural text, for that matter). This will be the direction of our future work.


Bibliography

References

1. Fast Lexical Analyzer Generator. http://www.gnu.org/software/flex

2. Brill, E. "Some advances in rule based art of speech tagging", in Proceedings of the Third International Workshop on Parsing Technologies.

3. Abney, S. "Partial parsing via finite-state cascades", in Proceedings of the ESSLLI '96 Robust Parsing Workshop, 1996.

4. Cheong, K.  "Extracting Information from Conference Announcements: High Recall, High Precision", in D. Estival (ed.) Abstracts for the ANLP Post Graduate Workshop, pp 11-20.

5. Freitag, D. and McCallum, A.  "Information Extraction with HMM Structures Learned by Stochastic Optimization", in Proceedings of AAAI-2000.

6. Almgren, M. and Berglund, J.  "Information Extraction of Seminar Information", http://nlp.stanford.edu/courses/cs224n/2000/berglund/report.pdf

7. Ciravegna, F.  "Adaptive Information Extraction from Text by Rule Induction and Generalisation", in 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle.