Electronic mail is a popular method of communication in both academic and industrial settings. One common email format is an announcement. People receive dozens of announcement emails every day about activities taking place around them, including seminars, conferences, and parties. There is a wealth of information in each announcement about a specific activity, and it can become tedious and time consuming to manually extract the useful and relevant information from the email. People can spend a significant amount of time everyday reading announcements, deciding if they have already received the announcement before, and updating their PDA or paper calendar with the relevant information.
The goal of the project is to explore and compare a number of techniques for automatically extracting relevant information from announcement emails. The ultimate objective is to use the lessons learned to create a system that can perform the task with high recall and precision, and to present the information in an organized fashion on an electronic calendar. Therefore, users no longer have to spend time reading through dozens of announcements in order to find out what activities are taking place around them. Instead, all of that information is summarized and conveniently sorted for the user in the calendar.
We decided to focus on seminar announcements because they are one of the most common types of emails received on a daily basis by members of the MIT community. Seminars are also a great source of food. However, our system can easily be applied to handle other types of announcement emails as well.
We implemented and compared three different techniques for analyzing seminar announcements:
* Using Flex1 and hand-crafted regular expressions to create a pattern matching program that takes advantage of the structure common to most announcement emails.Various attempts have already been made at trying to solve the problem of automatically extracting information from announcement emails, and most techniques take advantage of the regular format typical of most announcement emails. Cheong4 discusses knowledge-based engineering approaches, such as recognizing domain-relevant patterns and relationships amongst grammatical constituents in the text and parsing with finite state transducers (FST). Freitag and McCallum5 apply a machine learning approach to information extraction of announcement emails and compare the performance of a grown Hidden Markov Model with stochastic optimizations to the performance of static HMM models. Several approaches merge knowledge-based engineering with machine learning to increase recall while maintaining high precision. Almgren and Berglund6 describe a system focused on parsing time and location fields from announcements by using a pattern matcher together with a Hidden Markov Model. (LP)2, described by Ciravegna7 combines machine learning with shallow NLP to create an algorithm for parsing announcement emails, by training the system on a corpus tagged with SGML tags and using shallow NLP to generalize the rules for higher recall. Freitag and McCallum were also the creators of the tagged CMU corpus used in our project.
Semantic Understanding with Very Large Regular Expressions (SLVR) is based on the observations that information is commonly presented with common typographical or lexical idioms. For example, many talks at universities are structured as a word, a colon, and a brief explanation, as shown in Example 1. Other times, there is implied structure from the typographical layout, as shown in Example 2. Some announcements are quite unstructured, as shown in Example 3.
|
********************************REMINDER*********************************** THESIS DEFENSE TODAY AT ***************************************************************************** Hello, Please join us on DATE: TIME: LOCATION: NE43-941 SPEAKER: Paul Fitzpatrick WHAT: Ph.D. Thesis Defense Example 1: A structured talk announcement |
|
****************** H C
I S E M I N A R S E R I E S ****************** ****************** 5/ Mobile Devices for
Control Brad Myers Human Computer
Interaction Institute, CMU NE43-941 Abstract: With today's wireless technologies,
such as BlueTooth and IEEE 802.11, connecting handheld
computers and conventional computers together are … Example 2: A semi-structured talk announcement |
|
<0.09.11.93.22.01.32.Mark_Kantrowitz@GLINDA.OZ.CS.CMU.EDU.0> Type: cmu.andrew.org.epp Topic: Rayman talk:
Competition vs. Community Dates: Time: PostedBy: Mark Kantrowitz on Abstract: -------
Blind-Carbon-Copy To: gso+@ANDREW.CMU.EDU Subject: Rayman
talk: Competition vs. Community Date: Message-ID:
<6738.752900492@GLINDA.OZ.CS.CMU.EDU> From: Mark Kantrowitz
<Mark_Kantrowitz@GLINDA.OZ.CS.CMU.EDU> On
Wednesday, November 10 at of
Sociology and Director, Pathways for Women in Science Project at Equity
in the Sciences, Math & Engineering." This is the third event of the CMU/Pitt Women's Speakers Series. Benedum Hall is
located on Stadium.
Take Forbes to Bouquet, turn right and walk up the hill to O'Hare. Pathways
is a longitudinal study about women in science and math: what attracts
them and keeps them in these fields at critical stages in their lives. According
to Rayman, four factors keep undergraduates engaged
in science: parents, mentors, hands-on
experience research, and career advice. In graduate school, the importance of
mentors makes the difference. For
further information and to RSVP, contact Kathleen Minadeo
Johnson, x8-7970 or kj26. -
------- End of Forwarded Message -------
End of Blind-Carbon-Copy Example 3: An unstructured talk announcement cmu.andrew.org.epp-344-0.txt [tagdemo] |
We believe that once these idioms are introduced into a community, they are picked up and repeated by others within the community. As they become more widely used, we believe that people learn how to rapidly recognize them and extract information from them. As an idiom becomes more widely used still, there is an advantage to community members to use the idioms further. This process explains how typographical conventions are adopted.
Because many typographical conventions appear to be arbitrary, we believe that it is difficult to anticipate or enumerate them from first principles. To use the examples from the previous paragraph, there isn’t any real reason why academics should title some talks with a word followed by a colon and a brief explanation, but some academics starting doing so and the practice caught on. Likewise, there is no obvious reason why other academics should put the titles of their talks in quotes, or follow them with a line of delimiters. Whatever the reason, these typographical conventions can be manually enumerated by consulting examples, rules can be written, and then those rules can successfully find examples of the conventions in a corpus.
Regular expressions can also be used for extracting information from otherwise unstructured textual descriptions. For example, the announcement for a talk may not contain any obvious title or subtitle, but the announcement may include a sentence which reads ‘the topic of his talk will be “Semantic Understanding with Very Large Regular Expressions.”’ In our experience, such idioms are surprisingly common --- especially in the University environment.
At first glance, it may seem that the process of creating very large regular expressions for semantic analysis would be a never-ending task. In our experience, a relatively small number of rules can handle an extraordinarily large number of cases. And as new idioms emerge, it is a simple matter to create new regular expressions and add them to the list.
Overall, we are very pleased with the ease of constructing VLREs and impressed by their power. Frequently there are many different VLREs that can extract information from a given talk announcement. If one VLRE fails to match because of a typo or configuration error, another VLRE can compensate.
Our procedure for tagging talk announcements follows these steps:
The talk announcement is processed by a perl script called “filter.” This script:
1. Reads every line of the announcement.
2. Removes every MIME type that is not text/plain.
3. Searches the text lines for all lines that have prefixes or suffixes that commonly arise from forwarding email messages. For example, all lines that are prefixed with a greater-than sign have the greater than sign removed. This has the side-effect of eliminating many instances in which special characters are used for typographical decoration.
4. Finds all blocks of text that have the same amount of indentation and removes that indentation.
5. Flows all multi-line paragraphs into a single line of text.
6. Sends the resulting “filtered” message to standard output.
After being filtered, it is processed with the VLRE tagger. The tagger reads each line of its input and makes multiple passes with different very large regular expressions created with the “flex” lexical generator system. After lexical processing, a series of
In order, these expressions:
1. Recognize typographical conventions, including lines that end with a colon, lines that end with a question mark, lines that begin or end with quotes, lines of delimiters, and blank lines. (parse_typography.flex)
2. Recognize the speaker of a talk. Speakers are recognized by RFC822-style headers that say “Speaker:,” by sentences by say so-and-so “will be speaking” or “will be presenting” on a certain topic, and so forth. (parse_speaker.flex)
3. Recognize the topic of a talk, using heuristics that are similar to those that are used to recognize a speaker. (parse_topic.flex)
4. Recognize other common items from announcements:
a. rooms and buildings in which the talk is likely to be given, using a set of regular expressions that match the conventions of room numbers and the popular buildings at MIT and CMU. (parse_announce.flex and parse_iannounce.flex)
b. words that are commonly used in academic talk announcements, such as the names of universities, the names of industrial research labs, and words commonly used in job titles. (parse_announce.flex and parse_iannounce.flex)
5. Recognize dates and times. (parse_time.flex and parse_month.flex)
As indicated above, each of these regular expression systems
is created using flex, a system for creating lexical analyzers. A special flex
preprocessor was used so that each set of regular expressions could use a
common set of flex macros. One commonly used macro is
A significant number of macros were created for working with times. Because times follow fairly well-determined rules, it is possible to create a regular expression that matches on valid times such as 3pm, 3:30pm, 1530 and 15:30, but do not match on invalid times such as 15:30pm or 115:30. We believe that the amount of attention that has been applied to these regular expressions is partially responsible for the success that VLRE has with locating times.
|
Macro |
Purpose |
|
S |
Whitespace |
|
SP |
Mandatory white space |
|
NOTW |
Not white space |
|
END |
End of a word |
|
P |
Punctuation |
|
H |
Alphanumeric |
|
|
|
|
QUOTED_STRING |
A quoted string |
|
JAN |
Jan, Jan., or January |
|
FEB .. DEC |
Other calendar months |
|
DAY |
The word “DAY” |
|
MON |
Mon or Monday |
|
TUE .. SUN |
Other calendar days |
|
NAMED_DAY
|
Any named day of the week |
|
|
|
|
NDAY |
Any numeric day of the month |
|
SEC |
Any numeric second |
|
MIN |
Any numeric minute |
|
HOUR |
An hour in 12-hour format |
|
HOUR24 |
An hour in 24-hour format |
|
NMONTH |
A numeric month |
|
AMONTH |
An alphabetic month |
|
|
|
|
YEAR |
A year |
|
YEAR4 |
A 4-digit year |
|
NUM |
Any number |
|
NOTNUM |
Not a number |
|
|
|
|
TIMECHARS |
Any character that tends to be used in a time
representation |
|
AM |
|
|
PM |
|
|
|
|
|
TIME |
A time |
|
TIME_RANGE |
A time range |
|
TIME_OR_RANGE |
Either a time or a time range |
|
|
|
|
DATE1 |
Number-month-year |
|
DATE2 |
Month Number Year |
|
DATE |
Either date1 or date2 |
|
|
|
|
TALK |
(talk|presentation) |
|
DOMAIN |
Common top-level Internet domains |
|
|
|
|
ORDINAL |
1st, 2nd, 3rd, 4th,
and so on. |
|
|
|
|
PRO |
Possessive pronouns. |
|
PREPOSITIONS |
Common English prepositions |
|
PERSON_TITLES |
Typical titles that a person might have |
|
PERSON_DEGREES |
Typical degrees that a person might have. |
|
CAPWORD0 |
A capitalized word at the beginning of a capitalized word
group |
|
CAPWORD |
A capitalized word |
|
CAPWORDSP |
A capitalized word with a space |
|
CAPWORDS |
A string of capitalized words; this regular expression
understands that some words such as “and” and “but” can appear without being capitalized. |
|
|
|
|
NAMED_ROOMS |
Common rooms at MIT; “Kresge”
and “AI Lab Playroom.” |
|
|
|
|
NAMED_FLOOR |
Matched “5th Floor” |
|
ROOM3NUMBER |
A 3-digit room number |
|
ROOM4NUMBER |
A 4-digit room number |
|
ROOM_NN_NUMBER |
Either 3 or 4 digit room number |
|
ROOMNUMBER |
Any room number |
|
|
|
|
MIT_MAIN_BUILDINGS |
All of the numbered MIT buildings on main campus |
|
MIT_E_BUILDINGS |
All of the MIT buildings beginning with E. |
|
MIT_NE_BUILDINGS
|
All of the MIT buildings beginning with NE |
|
MIT_N_BUILDINGS |
All of the MIT buildings beginning with N |
|
MIT_NW_BUILDINGS
|
All of the MIT buildings beginning with NW |
|
MIT_W_BUILDINGS |
All of the MIT buildings beginning with W |
|
|
|
|
OBVIOUS_MIT_BUILDINGS
|
Any MIT building in standard form |
|
MIT_SPECIAL_ROOMS |
Special MIT buildings, like “Kresge” |
|
MIT_BUILDING_ROOM |
Any MIT building and a room number |
|
|
|
|
|
|
|
BUILDING |
Building names common at other schools |
|
|
|
|
SCHOOL_YEAR |
Matches all-caps and a year, like MIT ‘84 |
|
ROOM_TYPE |
Matches types of rooms, like Lounge, Conference Room, etc. |
Common regular
expressions in flex-macros
When a regular expression matches, it can either set a numeric flag. A list of numeric flags appears below. Alternatively, a regular expression can also register its matching value under a property name. This is called a “found” property. Each regular expression that creates a found property records the matching text, the name of the regular expression that matched, the line and offset that matched, and the rule’s “shurity.” The shurity of the rule is a percentage that reflects the likelihood that the rule actually found what it was supposed to have found. Currently these percentages are estimated by the person who created the rules, but at some future point in time they might be determined by a naive Bayesian analysis or a genetic algorithm. A list of found properties appears below as well.
|
Flags |
Meaning |
|
fIGNORE |
Ignore this match |
|
fSOMETHING |
We found something |
|
fDELIM |
A line of delimiters |
|
fBLANK_LINE |
A blank ilne |
|
fTEXT |
A line with text |
|
fABSTRACT |
A line with the word “Abstract” |
|
fPARENS |
A line surrounded by parenthesis. |
|
fORGANIZATION |
A word that tends to be in organizational names, like
University. |
|
fFEATURING |
Words like “Featuring,” “Starring,” etc. |
|
fTIME |
Something that looks like a time |
|
fDATE |
Something that looks like a date |
|
fOR |
The word “or” on a line by itself |
|
fEND_WITH_COLON |
A line that ends with a colon |
|
fEND_WITH_QUESTION |
A line that ends with a question mark |
|
fBEGIN_WITH_QUOTES |
A line that beings with a quote mark |
|
fEND_WITH_QUOTES |
A line that ends with quotes. |
Paragraph flags in found.h.
|
SPEAKER |
Who is giving the talk |
|
SPEAKER_TITLE |
The speaker’s title |
|
SPEAKER_AFFILIATION |
Where the speaker works |
|
TALK_TOPIC |
What the talk is about |
|
TALK_TIME |
What time the talk is given; might be a range. |
|
TALK_TIME_START |
When the talk starts. |
|
TALK_TIME_END |
When the talk ends |
|
TALK_DATE |
The date of the talk |
|
TALK_ABSTRACT |
What the talk is about |
|
TALK_ROOM |
The room that the talk is being given in |
|
TALK_BUILDING |
The building where the talk is being given |
|
TALK_ROOM_BUILDING |
Both the talk room and the building; this might be matched all at the same time. |
|
RFC822_DATE |
If the talk was announced in an email message, this it the Date: field of that message. Useful for determining the year in which a talk will be given if the message failed to mention this important fact. |
|
TALK_CONTACT |
Self-explanatory. |
|
REFRESHMENTS_TIME |
When is the food going to be served? |
|
REFRESHMENTS_ROOM |
Where will the food be? Might be different. |
|
HOST |
Who is hosting the talk. |
|
HOST_AFFILIATION |
Where is the host from? |
|
RSVP_DATE |
Some talks require an RSVP by a particular date. |
Properties that can be
found in a talk; declared in found.h.
After the initial parse with large regular expressions, a series of “short paragraph rules” are applied to all paragraphs that are “short.” Each rule operates by passing a window over the lines in the announcement, looking for a group of lines that are “short” (different rules have different definitions of short), and then looking for lines that match a certain pattern of text, recognized regular expressions, and recognized words. If a rule fires, “found” objects are created for each line in the window. The rule specifies what is found and the shurity of each finding.
There are 17 short paragraph rules in the production system. These rules are typically used to code for multi-line typographical conventions, since the flex-generated regular expressions are only used to process a line at a time. A typical short paragraph rule is that a line that is less than 80 characters long followed by a line of delimiters is 50% likely to be the topic of the talk. Another rule is that a line followed by a line in parenthesis that is less than 300 characters long is likely to be the topic of the talk. Yet another is that if there are three lines less than 60 characters and the second line contains a word that is commonly seen in job titles (like “professor” or “assistant”) and the third line contains a word that is commonly seen in the name of an organization (like “university” or “laboratory”) then the first line is likely to be the speaker, the second line is likely to be the speaker’s title, and the third line is likely to be the speaker’s affiliation.
A list of the short-paragraph rules appears below:
|
TITLE
-DELIM- |
If
a line is preceeded by a blank line and comes
before a line of delimiters, then it is the name of the talk. |
|
|
|
|
TITLE
(SUBTITLE) |
A
line preceeded by a blank line followed by a name
in parens is probably the title of the talk. |
|
|
|
|
TITLE:
SUBTITLE |
A
line ending with a colon followed by another line is the title of the talk |
|
|
|
|
TITLE:
SUBTITLE SUBTITLE
|
A
line ending with a colon followed by another line is the title of the talk |
|
|
|
|
TITLE?
SUBTITLE |
A
line ending with a question mark followed by another line is the title of the
talk |
|
|
|
|
``TITLE'' |
A
line in quotes surrounded by blank line is probably the title |
|
|
|
|
``TITLE---REST_OF_TITLE'' |
Two
lines with quotes |
|
|
|
|
TITLE
-OR- SUBTITLE |
a
tricky title with an -or- |
|
|
|
|
ABSTRACT |
If
a line says abstract and it is followed by a paragraph, we have found the
abstract! |
|
|
|
|
SPEAKER-TITLE |
A
line followed with a line that has a title is probably the name of somebody
and the following line is probably that person's title. |
|
|
|
|
KNOWNSPEAKER-ORGANIZATION
|
If
we don't have a speaker_affiliation, and a line is
found in the message that has an organization, and this line follows a line
that contains the speaker’s name, and this line is shorter than 40
characters, then this line must be the affiliation |
|
|
|
|
SPEAKER-ORGANIZATION |
If
we don't have a speaker_affiliation, and a line is
found in the message that has an organization, and this line follows a line
that contains the speaker’s name, and this line is shorter than 40
characters, then this line must be the affiliation |
|
|
|
|
INTRODUCING-SPEAKER-ORGANIZATION |
If
we don't have a speaker_affiliation, and a line is
found in the message that has an organization, and this line follows a line
that contains the speaker’s name, and this line is shorter than 40
characters, then this line must be the affiliation |
|
|
|
|
SPEAKER-TITLE-ORGANIZATION |
If
we don't have a name of a speaker, but we can find a block of three doc.lines with no tag and one ORG and one TITLE line,
then almost certainly the first line is the person, the second line is their
title, and the third line is their organization |
|
|
|
|
SPEAKER-ORGANIZATION-ORGANIZATION
|
If
we don't have a name of a speaker, but we can find a block of three doc.lines with no tag and two ORGANIZATION tags then
almost certainly the first line is the person, the second line is their
title, and the third line is their organization |
|
|
|
|
SPEAKER-TITLE-ORGANIZATION-ORGANIZATION |
If
we don't have a name of a speaker, but we can find a block of three doc.lines with no tag and two ORGANIZATION tags then
almost certainly the first line is the person, the second line is their
title, and the third line is their organization |
|
|
|
|
JUNK:-QUOTEDTITLE:-SUBTITLE-ENDQUOTE |
If
we have a line with a colon, followed by a line that begins with a quote and
has a colon, followed by a line that ends the quote then the second and third
lines are the title |
|
|
|
|
ANYNAME-TITLE |
A
line followed with a line that has a title is probably the name of somebody
and the following line is probably that person's title. |
|
|
|
|
TITLE
(bare) |
If
we haven't found a topic, just grab the first line surrounded by blanks
that's unidentified |
|
|
|
|
LAST-DITCH-SPEAKER |
Find
a line with 2 or 3 words followed by a line with an organizational name |
short paragraph rules from short.cpp; each
is implemented with the C++ class rule.cpp.
After all of the single-line rules and short paragraph rules run, the
tagger evaluates all of the “found” objects. For each recognized tag (e.g.
“speaker,” “speaker_affiliation,” “talk_time”) the found object with the largest shurity value is
chosen and written to standard output.
The final VLRE step is time processing. Two regular
expression systems are used. The first, parse_time,
can recognize roughly 25 different time formats. When a time format is matched,
the appropriate fields in a unix
struct tm structure are filled in. The parse_time system relies on parse_month
to turn alphabetic months into numeric months. Special attention needs to be
paid to boundary cases such as “
A separate entry point in the parse_time
system can be used to parse time ranges. These are ranges such as “
·
If neither AM nor PM is specified on the first
part of the time range, and the second part of the time range has PM specified,
and the hour in the first part is lower than the hour in the second part, then
increase the hour for the first part of the time range by 12. This correctly interprets
“
· If the time range ends up with the first part being later in time than the second part, then the range is invalid.
As part of the time processing, any discovered tag that has the word “time” in it is checked to see if is actually a time range. If the tag is a time range, it is split into two tags --- one for the start time, one for the end time. These derived tags are given the same shurity value as the original tag.
Finally, the highest-rated tags are printed. If the “-a” option is given to the tagger, then all of the found tags are printed. This output mode might be useful as further input to a neural network.
Seminar announcements have a unique topological structure that allows information to be conveyed quickly to the reader. The header of a typical seminar announcement usually contains information about the talk, such as its type, the speaker, the host, the speaker's affiliation with the host, the contact, the time, the place, and the topic.
It is a difficult task to use KE with NLP to extract useful information from announcements, because the goal is to parse small fragments of ungrammatical phrases, not to parse sentences with defined grammatical structures. Therefore, grammars cannot be written based on relationships between words with different parts of speech; instead, grammars have to be written based on keywords and symbols in the announcement and the topological structure the information is presented in.
The two types of NLP techniques used to extract relevant information from announcements are part of speech tagging (POS) and parsing. POS is accomplished with the Brill tagger and parsing with the Cass partial parser. The information that the system tries to extract from an announcement are the following: date, time, room, type of talk, speaker, speaker affiliation, host, topic, and contact information. Contact information can appear in the form of URLs or emails.
The order in which an announcement is processed is the following:
The different components - spacer2, Brill, debrill, tagfixes, Cass, and Parser.java are described in detail below.
Seminar announcements are tagged with the Brill tagger. Before feeding the announcement into Brill, the input is formatted with spacer2 to meet Brill's textual requirement concerning spaces in the text. Tagging the announcement is necessary in order to parse the announcement with grammars. Figure 1 shows an untagged announcement, and Figure 2 shows the corresponding announcement after tagging with Brill.
| Topic : " MHC Class II : A Target for Specific
Immunomodulation of the Immune Response " Dates : 3 - May - 95 Time : 3 : 30 PM Seminar : Departments of Biological Sciences Carnegie Mellon and University of Pittsburgh Name : Dr . Jeffrey D . Hermes . . . |
Figure 1: Untagged announcement.
| Topic/NNP :/: "/" MHC/NNP Class/NNP II/NNP :/: A/DT Target/NNP
for/IN Specific/JJ Immunomodulation/NNP of/IN the/DT Immune/NNP Response/NNP
"/" Dates/NNP :/: 3/CD -/: May/NNP -/: 95/CD Time/NNP :/: 3/CD :/: 30/CD PM/NNP Seminar/NNP :/: Departments/NNPS of/IN Biological/JJ Sciences/NNPS Carnegie/NNP Mellon/NNP and/CC University/NNP of/IN Pittsburgh/NNP Name/NN :/: Dr/NNP ./. Jeffrey/NNP D/NN ./. Hermes/NNP . . . |
Figure 2: Announcement after being tagged by Brill.
One major flaw with Brill, and any other type of tagger that is currently available, is that the tag-set provided by the tagger is not robust enough for tagging announcements. There is a significant difference between parsing announcements and parsing newspaper text. Brill provides a rich set of tags for differentiating between various parts of speech, but POS tags are useless for tagging announcements, because announcements do not have complex grammatical structures. Furthermore, Brill does not differentiate many subtle details that occur in an announcement. For example, Brill considers any noun that begins with a capital letter to be a singular proper noun (NNP), and this over generalized rule has the undesired side-effect of tagging both John and Laboratory as NNP.
The way Brill mishandles capitalization and overlooks many textual details that occur in announcements hinders the goal of extraction information from announcements, because these textual details are extremely common in announcements and play an important part in differentiating between different types of information. For example, we would like the parser to recognize that John is part of a person's name and Laboratory is part of building or a host's name based on the tags that Brill gives John and Laboratory. Unfortunately, this differentiation is not possible with the current implementation of Brill, because Brill tags both John and Laboratory as NNP.
Brill's rich tag-set for POS tagging is also a hindrance when parsing announcements with grammars, because the tags are not useful for parsing announcements and make the grammars more complicated, as described in the section "Parsing with SCOL". The ideal tagger for tagging announcements would recognize people's names, numbers, months, and company names. The tagger would tag everything else as either a noun, a verb, or an adjective without worrying about irrelevant details for parsing announcements, such as tense and pluralizing.
SCOL is a toolbox with many different NLP tools for analyzing text. The two tools used for analyzing announcements are tagfixes and cass. In order to write any useful grammar for parsing announcements, it was necessary to create a subset of more useful tags for various words after tagging with Brill. Grammars are written using the Cass partial parser. A partial parser is used instead of a normal parser, because extracting information from announcements does not require parsing every sentence and phrase in the announcement. Instead, only the relevant information from the announcement are of interest, and it is desirable for the parser to skip over the extraneous information that the parser cannot parse with the grammars provided. The input from Brill is also formatted before parsing with Cass by debrill.
Tagfixes translates a subset of tags associated with specific words to a new set of tags. This functionality is extremely useful for generating a new set of tags that is more useful for parsing announcements with grammars. Figure 3 shows the tagfixes file used for parsing announcements. The first column specifies the new tag of the string indicated in the second column, and the third column shows the original tag associated with the string. Tagfixes is performed after tagging with Brill and before parsing with Cass.
| month January NNP month Jan NNP month February NNP month Feb NNP month March NNP month Mar NNP month April NNP month Apr NNP month May NNP month June NNP month Jun NNP month July NNP month Jul NNP month August NNP month Aug NNP month September NNP month Sep NNP month October NNP month Oct NNP month November NNP month Nov NNP month December NNP month Dec NNP month Spring NNP month SPRING NN month Summer NNP month SUMMER NN month Fall NNP month FALL NN month Winter NNP month WINTER NN |
loc Room NNP loc room NN loc Auditorium NNP loc Hall NNP loc Wing NNP |
title Professor
NNP title Prof title Doctor NNP title Dr NNP title Mr NNP title Mrs NNP title Ms NNP title Doctoral NNP title PhD NNP |
talk Seminars NNP talk seminars NNS talk Seminar NNP talk seminar NN talk Colloquium NNP talk colloquium NN talk talk VB talk Talk NN talk TALK NNP talk Lecture NNP |
school department
NN school Department NNP school Deparments NNPS school University NNP school College NNP school Institute NNP school School NNP |
cmp
Committee NNP cmp Council NNP cmp Center NNP cmp Company NNP cmp Companies NNPS cmp Laboratory NNP cmp Laboratories NNPS cmp LABORATORIES NNPS cmp LABORATORIES NNP cmp Foundations NNS |
col
: : hyph - : at @ IN cma , , qa " " sq ~ NN SYM ( ( on on IN |
Figure 3: Tagfixes translates one set of tags to a new set of tags for specified words.
The tag month facilitates parsing dates. Brill tags all month and their abbreviations as NNP. Month is a more meaningful tag for months and their abbreviations, so tagfixes translates all NNP tags associated with months to the tag month. were translated into month. Seasons, such as spring, summer, fall, and winter were also tagged as month, because they are associated with groups of months.
The tag loc is used for parsing locations. Words that suggest a type of location, such as room, auditorium, hall, and wing, are tagged as either NNP or singular nouns (NN) by the Brill tagger. Again, the Brill tags do not convey the important information that these words are associated with places where the seminars take place. Therefore, the loc tag was added to the tag-set to facilitate the process of writing grammars to parse locations from announcements.
People's names are one of the most challenging types of information to parse, because there is no distinguishing factor that separates names from other types of information or from the background text. Brill tags most names as a pair of nouns - some combination of NN and NNP. If such a simple scheme were used to parse names, then any other noun pair in the announcement would also be considered as names, which reduces the precision of the system. Names are also difficult to parse because of their variation. Various symbols can occur within names depending on the origin of the name, such as Javier L~Pez. A more detailed discussion on parsing names is provided in "Writing Grammar with Cass". However, many speakers in announcements have titles in their names, such as Professor, Dr, Mr, or Mrs. Brill also tags the titles as NNP or NNPS, which again does not convey any useful information. Tagfixes translates the tags into the new tag title to provide contextual clues for finding the speaker's name during parsing.
The tag talk identifies keywords related to the type of talk in the announcement; whether it is a talk, lecture, or seminar series. Brill tags most of these keywords as NNP and NNPS. Therefore, the talk tag serves two purposes: identifying talk type and distinguishing these words from being parts of names.
Similarly, the tag school identifies keywords related to the speaker's affiliation, such as department and university. Again, Brill tags most of these keywords as NNP and NNPS, so it is important to distinguish them from other words that might be parts of names.
Finally, tag cmp identifies keywords related to the host of the talk. Keywords include words like company, laboratory, and center. Again, Brill tags most of these keywords as NNP and NNPS, so it is important to distinguish them from other words that might be parts of names. Identifying the host correctly is almost as challenging as identifying people's names, because the host is a name as well, such as IBM. However, methods have to be implemented to distinguish a company name from a person's name. A similar argument is applicable to the speaker's affiliation, if the affiliation is the name of a school, such as Carnegie Mellon, which could easily be the name of a person as well.
Other miscellaneous tags incorporated into the tag-set include dn for tagging parts of emails and URLs to indicate possible contact information; in, on, and tm for keywords identifying certain types of relevant information preceded or followed by keywords, such as in, on, am, and pm; col and at for special symbols such as : and @ that indicate the presence of time and emails. The last three entries in tagfixes are for fixing some of the mistakes Brill makes when tagging announcements and fixing spelling mistakes in the corpus, suggesting that the system has to be able to tolerate human errors and errors with Brill when trying to extract information from announcements.
A quick glance at Figure 3 reveals how much richer the tag-set is after incorporating tagfixes for tagging announcements. Many of the words that are important for parsing announcements are tagged by Brill as either NNP, NN, or their plural forms. Brill's impoverished tag-set for parsing announcements makes writing grammars extremely difficult, so tagfixes is used to expand the tag-set to make parsing more feasible.
A grammar rule is associated with each type of information extracted from an announcement and each rule consists of one or more sub-rules. Figure 4 shows the grammar rules used to parse announcements. Grammar rules were written using KE and whatever specific information that could be extracted about the domain. For example, most topics are encapsulated inside of quotes. These types of information were sometimes generalized or made more specific in order to capture the correct information from the announcement. If the rule was too general, then it would misinterpret other types of information. If the rule was too specific, then many rules were required to capture one type of information. Therefore, finding a good balance for rule-specificity was a challenge. How many exceptional cases handled by the rules also affected the size of each rule. Many of the rules were complicated by the presence of Brill's overabundance of tags associated with different parts of speech. Therefore, rules were often augmented with plural forms and different tenses of words. The order that the rules are applied also affects how well information is parsed from the announcement. More general rules, such as that for parsing the speaker, are placed towards the end so that more specific rules can be applied first during parsing to increase accuracy.
| ##
Date date -> (month .? CD cma CD) | ((CD | month .?) (hyph | /SL) (CD | month .?) (hyph | /SL) CD) | (CD? month .? CD) | (month CD hyph month CD (cma CD)+); ## Time time -> CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))? (hyph (CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))?))?; ## Room room -> ((CD? (NNP | NN | school)* loc+ (CD | (CD col CD))?) | (CD NNP) | (NNP CD) | (NNP hyph CD)) ((IN NNP) | cma (NNP | NNPS)+ loc CD?)?; ## Type type -> ((NN | NNS| NNP | NNPS| JJ | VB | DT | month)+ talk (IN (NN | NNS | NNP | NNPS)+)? (NNP | NNS)?) | (talk NNP? IN (NN | NNS | NNP | NNPS| JJ | VB | DT)+) | (NNP . NNP . talk); ## Topic topic -> (qa (WP | NN | NNS | NNP | NNPS | JJ | VB | DT | VBN | VBG) (NN | NNS | NNP | NNPS | POS | JJ | VB | DT | . | IN | TO | CC | col | cma | and | hyph | VB | VBG | VBN | MD)+ qa); ## University univ -> (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? school) | (NNP NNP? and (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? (cma NNP)? and? NNP school))? | (school IN (NNP | NNPS)+ (and | CC) DT? NNP+ school?); ## Company company -> ((NNP | NNPS)+ cmp (SYM NNP+ SYM)? (and DT (NNP | NNPS)+ cmp)?) | (cmp IN (NNP | NNPS)+ (IN NNP)? (cma NNP)?) | (fr (NNP | NN | NNPS | NNS)+) | (school NNP+ cmp); ## Person person -> ((title .?)? (NNP | NNP . | NNP ((NN | NNP) .))+ (hyph (NNP | NN))? (NNP | NNS) (sq (NN | NNS | NNP))? | NNP NNP VBZ) | (title . ((NN | NNP) .)+ (SYM (NNP | NN) SYM)? NNP) | (NNP cma NNP (and NNP)?); ## URL url -> ((NN | NNS | NNP | JJ | VB | DT) . (NN | NNS | NNP | JJ | VB | DT)?)+ . dn; ## E mails email -> (NN | NNP | JJ | VB | DT | CD) SYM? at (url | NN | NNS); |
Figure 4: Grammar rules for parsing announcements.
The grammar rule for parsing a date is the following: (month .? CD cma CD) | ((CD | month .?) (hyph | /SL) (CD | month .?) (hyph | /SL) CD) | (CD month .? CD) | (month CD hyph month CD (cma CD)+) and each sub-rule handles a different representation of a date, as shown in Table 1. The sub-rules are pretty straightforward, and it was observed that the structure of dates is relatively consistent throughout the corpus. Dates are one of the easier pieces of information to parse, because the information and the arrangement of the information that makes up a date is unique. For example, no other piece of information consists of a word with the month tag or a specific arrangement of numbers separated by symbols (other than time, which is discussed later). The difference between a date and a time is the symbol that separates the numbers in each type of information. In the case of a date, the symbols are usually "-", "/", or sometimes ".", and in the case of time, the symbol is usually a ":".
| Sub-Rule | Examples |
| (month .? CD cma CD) | Jan. 19, 2004; Jan 19, 2004 |
| ((CD | month .?) (hyph | /SL) (CD | month .?) (hyph | /SL) CD) | 1 - 19 - 2004; 1 / 19 / 2004; 19 - Jan - 2004 |
| (CD? month .? CD) | 19 Jan 2004; Jan 2004 |
| (month CD hyph month CD (cma CD)+) | Jan 19 - Jan 22 |
Table 1: Examples of each sub-rule for parsing date.
The grammar rule for parsing a time is the following: CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))? (hyph (CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))?))?. Examples of the rule are shown in Table 2. There is only one rule for parsing a time. Similar to dates, the structure of times is also relatively consistent throughout the corpus, and make times one of the easier pieces of information to parse. The information and the arrangement of the information that makes up a time is unique, because no other piece of information extracted consists of a specific arrangement of numbers separated by a colon and sometimes a dash if the start and end time of the talk are on the same line. No other information is followed by a string associated with the tm tag either to indicate AM and PM.
| Sub-Rule | Examples |
| CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))? (hyph (CD col CD (tm | ((NN | NNP) .? (NN | NNP) .))?))? | 9:00; 9:00 PM; 9:00 P.M.; 9:00 - 10:00; 9:00 PM - 10:00 PM |
Table 2: Examples of rule for parsing time.
The grammar rule for parsing a room is the following: ((CD? (NNP | NN | school)* loc+ (CD | (CD col CD))?) | (CD NNP) | (NNP CD) | (NNP hyph CD)) ((IN NNP) | cma (NNP | NNPS)+ loc CD?)?. Examples of the rule are shown in Table 3. Similar to time, room consists of one grammar rule. Unlike the previous time and date, the representation of room is less consistent throughout the corpus, because room numbers can appear before or after building names and not all rooms are associated with numbers. Therefore, in an effort to capture all of the different types of representations of rooms, some of the sub-rules are made to be more general and overlap with sub-rules for extracting other types of information. This is a tradeoff for being able to find information about rooms in announcements. Therefore, the order in which the grammars are executed affects what type of information is parsed, as discussed in the Results section. However, most rooms have some type of structure with one or more words with the loc tag and some combination of numbers, which helps to write grammar rules for parsing rooms.
| Sub-Rule | Examples |
| ((CD? (NNP | NN | school)* loc+ (CD | (CD col CD))?) | (CD NNP) | (NNP CD) | (NNP hyph CD)) ((IN NNP) | cma (NNP | NNPS)+ loc CD?)? | Mellon Institute Conference Room; room 261 of GSIA; Wean Hall 7500; 2315 Doherty Hall |
Table 3: Examples of rules for parsing room.
The grammar rule for parsing the type of talk is the following ((NN | NNS| NNP | NNPS| JJ | VB | DT | month)+ talk (IN (NN | NNS | NNP | NNPS)+)? (NNP | NNS)?) | (talk NNP? IN (NN | NNS | NNP | NNPS| JJ | VB | DT)+) | (NNP . NNP . talk) and each sub-rule handles a different representation of a type of talk, as shown in Table 4. The structure of the information for talk type is relatively consistent throughout the corpus, since most announcements indicate what type of seminar the announcement is for by stating the type followed or preceded by a word with the talk tag. What isn't consistent is the type of words used in talk type, and this is one place where Brill's POS tags become a hindrance, because the rule has to account for different noun and verb types, as well as other parts of speech, including determiners, prepositions and adjectives. If these differentiations were not made, then the rules could be more concise, and it would not be necessary to worry about whether the rule has accounted for every possible POS tag in order to parse the information correctly.
| Sub-Rule | Examples |
| ((NN | NNS| NNP | NNPS| JJ | VB | DT | month)+ talk (IN (NN | NNS | NNP | NNPS)+)? (NNP | NNS)?) | UTC Summer Seminars for Graduate Students; Marketability Talk |
| (talk NNP? IN (NN | NNS | NNP | NNPS| JJ | VB | DT)+) | Lecture Series |
| (NNP . NNP . talk) | Chem. Eng. Seminar |
Table 4: Examples of rules for parsing talk type.
The grammar rule for parsing a topic is the following: (qa (WP | NN | NNS | NNP | NNPS | JJ | VB | DT | VBN | VBG) (NN | NNS | NNP | NNPS | POS | JJ | VB | DT | . | IN | TO | CC | col | cma | and | hyph | VB | VBG | VBN | MD)+ qa). Examples of the rule are shown in Table 5. After studying many examples from the CMU corpus, it was noticed that many topics are distinguished from the rest of the text by quotation marks and consist of at least two words within the quotation marks. However, the topics of many announcements are also not in quotes and those topics are distinguished by their placement within the announcement or consist of all capital letters. There are also cases where the topic is not distinguished by quotes or distinct text formatting.
Using quotation marks as markers to distinguish a topic is a good start at parsing topics consistently, since grammars cannot take advantage of any placement or text (capitalization) information within an announcement. A grammar rule for parsing topics without using quotation marks would be too broad and could potentially correspond to any consecutive series of two or more strings. Unfortunately, topics do not have any keywords to distinguish them from other types of information and could include any word from a person's vocabulary. This diversity makes topics difficult to parse without some type of knowledge about the domain to write good grammars with.
One problem associated with using quotation marks to identify a topic is information within the body of the announcement sometimes uses quotation marks for literary purposes, not to demarcate a topic. Therefore, an algorithm would be useful for selecting the correct quoted phrase as the topic, as described in the section "Extracting information with Parser.java". This is another place where the richness of Brill's POS tags is a hindrance, because all of the different tags that cannot be used for parsing announcements still have to be accounted for in order to parse the topic, making the grammar rule more cumbersome.
| Sub-Rule | Examples |
| (qa (WP | NN | NNS | NNP | NNPS | JJ | VB | DT | VBN | VBG) (NN | NNS | NNP | NNPS | POS | JJ | VB | DT | . | IN | TO | CC | col | cma | and | hyph | VB | VBG | VBN | MD)+ qa) | " MHC Class II : A Target for Specific Immunomodulation of the Immune Response "; " Graft Modification of Polymers in Twin Screw Extruders , " |
Table 5: Examples of rules for parsing topic.
The grammar rule for parsing the speaker affiliation is the following (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? school) | (NNP NNP? and (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? (cma NNP)? and? NNP school?)) | (school IN (NNP | NNPS)+ (and | CC) DT? NNP+ school?) and each sub-rule handles a different representation of a speaker affiliation, as shown in Table 6. Speaker affiliation is usually a department title or the name of a school; therefore, the school tag facilitates identifying the speaker affiliation. Not unlike names, the structure of the speaker affiliation varies, and writing a grammar rule that encapsulates all of the different ways the name of a department or school can be represented is difficult.
| Sub-Rule | Examples |
| (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | University of Pittsburgh |
| (NNP NNP? school) | McGill University; Carnegie Mellon University |
| (NNP NNP? and (NNP? school IN JJ? (NNP | NNPS) (cma (NNP | NN | NNPS | NNS))?) | (NNP NNP? (cma NNP)? and? NNP school?)) | Carnegie Mellon and University of Pittsburgh; University of Tennessee , Oak |
| (school IN (NNP | NNPS)+ (and | CC) DT? NNP+ school?) | department of Electrical and Computer Engineering |
Table 6: Examples of rules for parsing speaker affiliation.
The grammar rule for parsing the host is the following ((NNP | NNPS)+ cmp (SYM NNP+ SYM)? (and DT (NNP | NNPS)+ cmp)?) | (cmp IN (NNP | NNPS)+ (IN NNP)? (cma NNP)?) | (fr (NNP | NN | NNPS | NNS)+) | (school NNP+ cmp) and each sub-rule handles a different representation of a host, as shown in Table 7. Many of the problems encountered in parsing the speaker affiliation are also encountered in parsing the host. There are several keywords that uniquely identify a host, such as those tagged with cmp. However, the names of the hosts are often inconsistent and vary greatly in structure, and sometimes there are substructures, such as information surrounded by parentheses. Trying to find a rule that encapsulates the different ways that the name of a host can be represented is a challenge. Aside from the cmp tag, the from tag was also used to parse structures such as "John, from IBM".
| Sub-Rule | Examples |
| ((NNP | NNPS)+ cmp (SYM NNP+ SYM)? (and DT (NNP | NNPS)+ cmp)?) | Merck Research Laboratories; Dow Chemical Company; Student Alumni Relations Council ( SARC ) and the Career Center |
| (cmp IN (NNP | NNPS)+ (IN NNP)? (cma NNP)?) | Laboratory of Computer Science |
| (fr (NNP | NN | NNPS | NNS)+) | from IBM |
| (school NNP+ cmp) | Department of Biology Committee |
Table 7: Examples of rules for parsing host.
The grammar rule for parsing the speaker is the following ((title .?)? (NNP | NNP . | NNP ((NN | NNP) .))+ (hyph (NNP | NN))? (NNP | NNS) (sq (NN | NNS | NNP))? | NNP NNP VBZ) | (title . ((NN | NNP) .)+ (SYM (NNP | NN) SYM)? NNP) | (NNP cma NNP (and NNP)?) and each sub-rule handles a different representation of a speaker, as shown in Table 8. Parsing the speaker, a person's name, is one of the most challenging tasks in extracting information from an announcement. Names can take a variety of forms depending on ethnicity and sometimes are preceded or followed by a word tagged as title. The title tag is extremely useful in identifying names, because no other type of information contains the tag. Most of the time, a name is simply two NNP, and in some cases just one NNP. Therefore, a simple rule that covers NNPs could correspond to almost anything in the announcement, including information about location, host, topic, talk type, and speaker affiliation. The wide variety of forms that names can take - some names even include symbols such as - and ~, and the wide variety of arrangements the title and name can take all contribute to the difficulty of parsing names.
Due to the difficulty in parsing names and the lack of a proper tag-set for distinguishing names, the grammar rule for parsing names is particularly general. The rule depends on the title tag to indicate names that do have titles. However, for announcements with names that do not have the title tag, then the accuracy of the rule depends on the grammar rules for parsing the other information. As shown in Figure 4, names are one of the last information that the parser parses. Therefore, if the other information in the announcement is parsed correctly, then there should be a limited number of pairs of NNPs that can be the name of the speaker. Whether the speaker's name is parsed correctly also depends on the extracting function described in "Extracting Information with Parser.java".
| Sub-Rule | Examples |
| ((title .?)? (NNP | NNP . | NNP ((NN | NNP) .))+ (hyph (NNP | NN))? (NNP | NNS) (sq (NN | NNS | NNP))? | NNP NNP VBZ) | Dr . Jeffrey D . Hermes; Shelley Brozenick |
| (title . ((NN | NNP) .)+ (SYM (NNP | NN) SYM)? NNP) | Dr . R . J . ( Bob ) Pangborn |
| (NNP cma NNP (and NNP)?) | Ermolinsky , Meleshchenko and Plaksiyev |
Table 8: Examples of rules for parsing speaker.
The grammar rule for parsing the contact information are the following ((NN | NNS | NNP | JJ | VB | DT) . (NN | NNS | NNP | JJ | VB | DT)?)+ . dn and (NN | NNP | JJ | VB | DT | CD) SYM? at (url | NN | NNS). Examples of the rule is shown in Table 9. The only type of contact information parsed are URLs and emails. Phone numbers were are parsed, because they take the form of CD hyph CD after being tagged by Brill, which is too similar to the structure for dates. URLs and emails are relatively easy to parse, because URLs take the form of a series of strings separated by "." and usually ends with a domain name tagged with dn, such as org or edu. Similarly, emails are distinguished by the @ sign.
| Sub-Rule | Examples |
| ((NN | NNS | NNP | JJ | VB | DT) . (NN | NNS | NNP | JJ | VB | DT)?)+ . dn | frc . ri . cmu . edu; gp . cs . cmu . edu |
| (NN | NNP | JJ | VB | DT | CD) SYM? at (url | NN | NNS) | sutner @ aiki; plp @ cs |
Table 9: Examples of rules for parsing contact information.
Figure 5 shows a sample output from Cass with text tagged by these rules.
After parsing the announcement with the Cass partial parser, the parsed information is extracted from the rest of the text and the correct information is chosen from the parsed information based the tags Cass associates with the parsed information, such as date, time, room, type, topic, univ, company, person, url, and email. Parser.java is a java program that extracts the parsed information from the irrelevant text, decides which information to extract, and translates the Cass tags into standard tags for conveying the extracted information, such as date, time, speaker, host, speaker_affiliation, contact, topic, room, and series. The program returns one instance of each type of information parsed by Cass. parseInput collects all of the parsed information generated by Cass, and parseOutput selects one instance of each type of information parsed.
Currently, parseOutput selects the first instance of each type of information and assumes that the first instance is correct. This assumption is based on the knowledge that most announcements place the relevant information in the headers or beginning of the announcement. Therefore, the earliest information parsed should contain the pertinent information to be extracted. The grammar does a relatively good job of parsing the announcement, so if most of the header information is parsed correctly, then the information returned to the user should also be correct. Trying to write more complicated functions to determine which information to parse is an extremely difficult task, since the program has no semantic understanding of the contents of the announcement. Furthermore, the Brill tags do not provide any information about the contents of the announcement, so the grammars depend mainly on the few tags incorporated with tagfixes.
Neural Analysis of Multiple Inputs
The problem of locating an informative segment buried inside a larger text can be broken up into two subproblems: 1) determining which words are likely to be a part of that segment and 2) identifying contextual cues that provide readers with hints that the segment will appear nearby. For instance, finding the time of an event requires knowledge of what a time might look like (eg "4:30" or "noon") and also what words indicate that a particular time is the time of the event (eg "the talk will start at").
Humans performing these tasks can employ strategies that make use of both structural and semantic content from the text. A computer’s representation of the message is necessarily much weaker since this high level natural language understanding is not available to the machine. However, it is possible to simulate partial understanding by combining a number of elementary structural or semantic features into a higher level model. This requires not only detecting the elementary features but also weighing them such that all basic pieces of information complement the others and form the building blocks of the program’s final decision.
This is the model for our third and final system for automated message parsing. The Neural Analysis of Multiple Inputs (NAOMI) uses machine learning classifiers to automatically learn weights for each piece of evidence that contributes to the end solution. A collection of homegrown feature generators read the text and provide inputs to backpropagation neural networks. Each network is trained to identify message segments that contain a specific type of information – speaker name, talk location, start time, or end time. The networks’ outputs are collected, sanity checked, and stored to serve as a summary of each message.
Architecture
The feature generators are fairly simple natural language processing programs that tag input at the level of individual tokens. When one is run on an input message, it produces an output file containing the original message and a separate set of standoff tags. A standoff tag consists of a tag type and character offset pointers into the message body indicating which text the tag applies to. Standoff tags were used because they offer a simple method for merging tagged files with overlapping and nonstandard tags.
The feature generators built for this project are:
[# of appearances in message] * log([# of stories] / [# of stories containing token])This will penalize common words that occur in many documents while rewarding words that are typically important and unique to one particular message, such as names, dates, and talk topics. The top ten scored tokens of each message are tagged as "interesting". This is often useful for distinguishing between human names and other capitalized or proper nouns that could be mistaken for names, since the speaker's name tends to appear repeatedly within the message but infrequently in other messages.
Our IDF program is also able to chunk words together into phrases, when appropriate. If an "interesting" word is frequently co-located with other interesting words, they are combined to form a larger phrase. This is again useful for picking out names which typically consist of multiple tokens. Talk topics are also commonly recognized as phrases.
The program can be run in batch mode, reading in and tagging an entire directory of messages. It can also be run on individual messages using saved frequency data. Sample run, code [~700 lines]
This is clearly an oversimplified metric allowing the program to determine which phrases "look like" they could be part of or near a tag. No attempt was made to improve this system; rather, we wanted to test how well the algorithm would perform with an impoverished model of semantic word classes. Since most of the previous work in information extraction of seminar announcements has been focused on building up these classes (as with HMMs or the handcrafted parsers discussed above), it will be straightforward to replace this feature generator program in the future with a more sophisticated model of semantic class analysis.
The program can be "trained" on a directory of tagged files or produce its own tags on individual or batch sets of untagged message files using saved frequency data. Sample run, code [~400 lines].
These programs illustrate the range of complexity that can be found in the feature generators for the NAOMI system. Features can be extemely simple or as complicated as full blown NLP systems. Adding a new type of input to NAOMI is as simple as plugging in a new feature generator that produces some type of potentially helpful tagged output.
The above programs certainly do not cover the entire breadth of possibly helpful data. It is also clear that no single one of these features is sufficient to provide the information needed to reach a conclusion. However, the neural networks are able to use these basic inputs to generate meaningful results with surprising accuracy. Future work to include more sophisticated feature generators should only continue to improve the performance of the system.
The neural networks are the other important component of NAOMI. Each neural network is charged with the job of learning how to combine the different tags in the input into a decision about which tokens should be tagged as a location, speaker, start time or end time. They process each message one token at a time, building an array of binary inputs representing the standoff feature tags that are present in the current or nearby tokens in the text.
|
Inter-esting |
token in 'near location' class |
token in 'location' class |
prev token in 'near location' class |
prev token in 'location' class |
prev[2] token in 'near location' class |
prev[2] token in 'location' class |
Caps |
Non-alpha |
Numeric |
|
LOCATION |
TOKEN |
|
-1.0 |
1.0 |
1.0 |
1.0 |
-1.0 |
1.0 |
-1.0 |
-1.0 |
-1.0 |
-1.0 |
= |
0.0 |
in |
|
-1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
-1.0 |
1.0 |
-1.0 |
-1.0 |
= |
1.0 |
WeH |
|
-1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
1.0 |
-1.0 |
-1.0 |
1.0 |
= |
1.0 |
7220 |
Figure 1. Sample rows of a neural network’s input data
Each network has 10 inputs, 5 hidden units, and 1 output. The units have continuous activations and are fully connected. The networks were trained on 6000 examples with a learning rate of .3 until the difference in sum squared error stabilized at less than .0001. Each network was also tested on a validation set of 10000 examples and corrected for overfitting by choosing the weights that optimized both training and validation scores. The best resulting trained weights are stored in a file and used to score future messages. The length of time needed to train each network is on the order of five minutes, while the length of time needed to run the entire system on a new message is about three seconds. The neural network software in NAOMI is another homemade piece of code recycled from a previous project.The result of running a network on a new message will be an array of tokens marked with a 0 or 1, where a 1 indicates that the network believes the token should be included in the output tag. There is a small amount of post-processing that takes place here to add a "sanity check". For instance, if a token is tagged as a name but its neighboring tokens are not, the output is flipped to 0. This is an arbitrary (handmade) rule to enforce the constraint that names should contain at least two tokens. Rules of this type generally improve the precision of the system with minimal impact upon the recall. The idea is to screen false positives, hopefully without causing false negatives in the process. The other rules here filter out times like "p.m." with no attached number, times containing dashes, and locations consisting of "banned words" that tend to turn up in false positives often.
Ideally, we should be able to add new features to the input stream that would help the network distinguish these cases on its own. That would likely take the form of improving the semantic word classes to exclude common false positives. Due to time constraints and the aforementioned desire to keep the semantic classes simple, the postprocessing filter seems to perform the task adequately.
Other pieces of software developed as parts of NAOMI include:
Here is a sample output segment of a run of the location neural net (negative values indicate the postprocessing filter stepped in to help). Here is the original message. Here is a live run of the demo in action.
We performed a statistical evaluation of the results of two of the systems, calculating recall, precision, and F1 score based on a set of examples from the CMU corpus. F1 score is defined as (2 * recall * precision) / (recall + precision). This gives us a single score for each system's performance that takes into account both pieces of the recall/precision tradeoff.
For the sake of comparison, here are the scores from Freitag's thesis on the same CMU tagset.
| Freitag SRV | Location | Speaker | Start Time | End Time |
| Recall | 69.5 | 58.3 | 98.3 | 92.6 |
| Precision | 73.8 | 54.9 | 98.4 | 66.7 |
| F1 | .716 | .565 | .983 | .775 |
Computing F1 for the VLRE tagger on the CMU corpus was
complicated by several issues. Foremost was the impoverished nature of the CMU
corpus. Aside from the pervasive <paragraph> and <sentence> tagging
in the corpus, the CMU corpus only has four kinds of tags. Also, many
announcements have repeated tags. For example, if an announcement in
one place says that the time of the talk is
|
Original |
Duplicates removed |
|
|
Etime |
427 |
227 |
|
Location |
635 |
458 |
|
Speaker |
749 |
408 |
|
Stime |
972 |
484 |
Key things are missing from this tag set, including the dates of each talk, the abstract, the speaker’s affiliation, and so on. The VLRE tagger employs a much richer tag set. Running the VLRE tagger over the CMU corpus, the following tags were found with the following frequencies:
|
center |
19 |
|
host |
87 |
|
refreshments_room |
1 |
|
refreshments_time |
12 |
|
rfc822_time |
3 |
|
series |
42 |
|
speaker |
155 |
|
speaker_affiliation |
134 |
|
speaker_title |
37 |
|
talk_abstract |
14 |
|
talk_building |
67 |
|
talk_contact |
1 |
|
talk_date |
481 |
|
talk_room |
201 |
|
talk_room_building |
138 |
|
talk_time |
485 |
|
talk_time_end |
117 |
|
talk_time_start |
117 |
|
talk_topic |
485 |
|
urls |
1 |
Clearly, the VLRE tagger is able to find more kinds of information in the training set. One of the advantages of the regular expression technique is that it is robust and frequently can find partial information. (IE: if it can’t find the room and building together, it may be able to identify the room or the building.) Unfortunately, the VLRE tagger also makes mistakes.
Alas, computing F1 for the VLRE tagger is complicated by the
fact that the VLRE tagger does significant post-processing. Times, for instance,
are rewritten into a canonical 24-hour form.
To use the example from above, the VLRE tagger would report that the talk
would start at
Having written such a program (made simpler with the parse_time() function that uses a regular expression library), it is possible to compute an F1 for the talk_time/talk_time_start and talk_time_end tags. To compute the number of possible matches in each announcement in the CMU corpus, the tags were extracted from each talk and turned into a 24-hour format. We then ran the VLRE tagger on the untagged version of the talk announcement. The talk_time_start tag was used if present, otherwise the talk_time tag was used. A successfully found tag is one in which the VLRE tag was found to be equal to the canonical CMU tag.
The final are presented below; errors from this tagging are in the file vlre-errors.txt.
| VLRE | Speaker | Start Time | End Time |
| Recall | .539 | .961 | .903 |
| Precision | .815 | 1.00 | .995 |
| F1 | .649 | .980 | .947 |
All of the source code for the VLRE system can be found in the /scan directory of this webserver.
F1 was computed for NAOMI on a slightly different way. Since NAOMI does not postprocess the output in any way that alters the content or format of the message, it is possible to compare the system's output to the correct output for each individual token. Also, NAOMI does not combine multiple tags in the same document (regardless of whether or not they match), instead simply returning all matched tokens to the user. This means that for the purposes of F1 score, if a speaker tag occurs twice in the same document, NAOMI is expected to correctly tag all tokens in each segment.
Note that since NAOMI was trained on a portion of the CMU corpus, for the sake of validity these test results are only from the half of the tagged corpus not used for training. Distribution of tags is as follows:
|
Tagged Tokens |
|
|
Etime |
185 |
|
Location |
724 |
|
Speaker |
617 |
|
Stime |
469 |
The following table lists the scores of each of NAOMI's networks on the test set:
| NAOMI | Location | Speaker | Start Time | End Time |
| Recall | .950 | .957 | .809 | .960 |
| Precision | .857 | .800 | .774 | .846 |
| F1 | .901 | .871 | .791 | .900 |
The following excel spreadsheets contain full logs of the test run as well as statistical analysis of the correct and incorrect tokens. Error rows are marked in orange, and rows containing negative values for NAOMI's guesses represent spots in which the postprocessor made a change to the original value.
The tags found by our systems can be thought of as name/value pairs. These pairs are converted into an SQL statement and are used to populate a database table. Duplicates are suppressed by first removing any SQL records that contain the same date, time, and room number. (This elimination is based on the theory that there will not be two talks simultaneously scheduled for the same room at the same time. Unfortunately, this methodology depends upon room tagging working properly --- something that we have not seen in all cases.)
Some announcements simply cannot be properly parsed.
Consider http://www.nitroba.com/msgs/2003/4/msg.22.Ke1iB.txt,
an email message that was sent to the cis-seminars@theory.lcs.mit.edu
mailing list. This message seeks to reschedule a talk that was originally
scheduled for
Three scripts have been written to report from the database. The calendar.cgi script displays all of the matching talks for a given month. The today.cgi script displays the talks for a given day and the following day. The dump.cgi script dumps a few fields for the entire database. Whenever an entry is displayed, the entry itself can be clicked on to display the original talk announcement. Alternatively, the user can click on the link that says [demo] to see the results of parsing the entry with all three systems that we developed.
This system was not run against the CMU tagset to calculate F1 accuracy, but we are able to make other direct observations about the feasibility of using this approach to solve the problem of automated announcement parsing.
The difficulty in writing grammar rules for Brill tagged output is that the Brill tag set limits the accuracy of extracting the relevant information from the messages. Cass cannot take advantage of many of the formatting cues found in announcements, including the placement of information in the announcement and the use of capital letters for titles. Therefore, ad hoc grammars have to be invented to extract the information desired. Furthermore, Brill and Cass were designed for purposes other than to parse announcements, and the format for announcements is extremely different from the format of the type of text Brill and Cass were built for or trained upon. Most of the information extracted from announcements are phrases or a series of strings that do not have any grammatical meaning. Therefore, it is difficult to write rules matching a tag set meant for text with grammatical structures.
The order in which the rules are applied also affects the accuracy of the parser. Therefore the most general rules, such as those for parsing people, are placed at the end and more specific rules, such as those for parsing the topic, are placed at the beginning. This placement allows the more specific rules to be applied first and hopefully filter out most of the information, leaving a smaller set of potential matches the more general rules. Any attempt to use this type of filter based approach must be mindful of this constraint
The grammar could have been written to take advantage of the header style and therefore make the task of parsing announcements much easier. For example to parse the topic, the grammar could have searched for the string "topic:" and parse the entire line. However, this approach was not pursued, because it defeats the purpose of using NLP to extract the information from the announcement body. The grammar that does not rely on information such as "topic:" is more robust, because it can handle announcements without those header information as well.
Future work to improve the system include developing a better function for Parser.java to select the correct information, studying a larger corpus in order to write better grammars for parsing the different types of information, and even developing a better tagger for tagging announcements. This last idea is the major roadblock to successfully using SCOL to parse the announcement messages. Without a tagger more well adapted to the types of structures and word classes found in these messages, writing rules to match the message content is not a feasible approach.
Sincere thanks go to MIT professor Rob Miller, who said that it was mandatory to do regression testing on the VLRE technique in order to gauge the effectiveness of new rules. A simple regression system was created which tagged the entire corpus and compared the tag results to previous runs. This regression system made it very easy to determine the effect of new regular expressions added to the system.
Another form of regression testing was to compare the tags found by the VLRE system with the tags in the tagged CMU corpus. By exploring the differences it was possible to find systematic errors in the VLRE tagger and correct them. This became an iterative process that was in principle similar to the learning that a neural network might perform, but more effective in practice since the programmer was free to make a variety of different changes to the source code.
One obvious area for improvement for this system is in the selection of regular expression weights. Currently these were assigned by the programmer and hand-tuned as necessary to change rule priority. A genetic algorithm could be used to tune the weights, similar to the way in which SpamAssassin weights are tuned.
Overall, we feel that we have shown the VLRE approach to be a successful technique for automated information retrieval from announcement messages. The VLRE system, especially when combined with the NAOMI architecture as discussed below, appears to be powerful enough to be populate a calendar and serve a useful purpose for the users of that service.
NAOMI scored fairly well on the four types of tags present in the test corpus, despite the fact that the semantic word class input features were intentionally very weak, as discussed above. This is a good sign that the networks are able to take advantage of the other types of structural and semantic information presented to them, even though no single one of the inputs is sufficient to independently solve the problem. The architecture is flexible enough that it can accommodate a diverse group of input features, and the neural system has proven to be up to the task of performing data fusion by weighing the inputs and producing human readable output.
Speaker identification compared very well to other systems with an F1 of .871. The network did a good job of identifying names, using a combination of the leadup word classes, capitalization, and the inverse document frequency count. Since names are fairly unique to a document, the IDF tagger does particularly well at locating them. Interestingly, a common error made by the system here is tagging the name of the lecture host in addition to the speaker. A feature generator capable of detecting the presence of a "host" attribute would help the system to improve its performance. Other errors frequently occurred in the presence of initials in names. In general, erroneously tagging names that aren't the speaker is a larger problem (.8 precision) than actually finding the names (.95 recall). This suggests that future improvements will need to focus on providing the system with inputs that allow it to differentiate between speaker names and ordinary names, which can probably be done by improving the accuracy of the equivalency classes of words found nearby speaker tags.
The location tagger also scored relatively well. It was able to find room numbers by searching for numeric tokens nearby words in the semantic equivalency classes for other locations. Building names were found in a similar fashion except they required capitalized words instead of numbers. However this approach was not without problems. The largest source of errors was the situation in which NAOMI would correctly find the location but add extra tokens to the start or end of the tag. The extra tokens were almost always capitalized or numerical in nature. Although this is a problem, it is a relatively small error to make in comparison with completely missing or inventing a tag. The tags with additional tokens appended are still human readable and adequately express the intended information.
The tagging of end times for the events also worked very well. The types of errors made here are roughly 33% times that are incorrectly labeled as start times, and 67% extra tokens that are incorrectly added to the end or beginning of an existing tag (usually "-" or "pm"). These types of errors are similar to the errors in the speaker and location modules respectively; again, it seems reasonable to conclude that as the word class inputs are built up in a more robust manner, performance on these cases will improve. As an aside, the neural network in the etime task used a threshold of .25 instead of .5 to determine when a token should be tagged. This was likely because there were so few examples of end times in the training set, it was difficult for the neural network's weights to push the value all the way over .5 in the marginal cases. Setting the threshold lower improved performance by a decent amount.
Finally, the start time tagger worked adequately but showed room for improvement. As with the speaker tagging task, a large source of errors here was in differentiating between start times and the times that refreshment will be served (or in general, any other time). It is surprising to see this, given the accuracy of the end time tagger, but the difference is likely again in the classes of words that lead up to the time tag. For an end time, the leadup is likely to be another time (as in the format "1:30 to 2:30"). For a start time, the leadup is much more complicated and is therefore not handled by our simple word class system.
Another frequent problem with start times was mislabeling numerical date formats and phone numbers as times. These problems could be fixed if we had phone numbers, dates, or times tagged and used as inputs to the system. This would allow NAOMI to select only actual times as parts of tags for the output. Furthermore, if another input to the system could represent refreshment times or other types of non-start times (such as RFC header time), NAOMI would then be able to prevent from overgeneralizing on times and could do a better job of selecting only start times.
We see here a common theme throughout the types of errors produced by NAOMI: specifically, the equivalency classes of words surrounding tags are weak. Also, the feature generators we have provided to the system have gaps when it comes to detecting certain types of information that is not tagged in the corpus (like dates or the time that refreshments are served). Furthermore, we see that one of the most problematic aspects of the VLRE system is that there is no intuitive way to weigh and combine the hundreds of regular expressions required to extract the interesting types of data from the messages.
Given the shortcomings of each system, it appears that there is a natural marriage waiting to be made between VLRE and NAOMI. By using VLRE's outputs as input tags to NAOMI, we accomplish two things. VLRE gets an effective algorithm for automatically weighing its outputs to determine which matching patterns are informative and which patterns are noise. NAOMI gets a powerful new host of feature generators capable of covering a much broader and more complicated range of information. These generators would serve as the improved semantic words classes, in that all tokens that match a VLRE pattern can be thought to be in the same equivalence class.
The two systems clearly would work well together in symbiosis. The resulting system would be a much more powerful and dynamic method for automatic information extraction of message announcements (or any natural text, for that matter). This will be the direction of our future work.
References
1. Fast Lexical Analyzer Generator. http://www.gnu.org/software/flex
2. Brill, E. "Some advances in rule based art of speech tagging", in Proceedings of the Third International Workshop on Parsing Technologies.
3. Abney, S. "Partial parsing via finite-state cascades", in Proceedings of the ESSLLI '96 Robust Parsing Workshop, 1996.
4. Cheong, K. "Extracting Information from Conference Announcements: High Recall, High Precision", in D. Estival (ed.) Abstracts for the ANLP Post Graduate Workshop, pp 11-20.
5. Freitag, D. and McCallum, A. "Information Extraction with HMM Structures Learned by Stochastic Optimization", in Proceedings of AAAI-2000.
6. Almgren, M. and Berglund, J. "Information Extraction of Seminar Information", http://nlp.stanford.edu/courses/cs224n/2000/berglund/report.pdf
7. Ciravegna, F. "Adaptive Information Extraction from Text by Rule Induction and Generalisation", in 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle.