
Who Is That “He”? Using Pronouns and Anaphors in Text Extraction

Text extraction is a powerful tool for finding and categorizing elements in unstructured documents. These elements, or entities, are connected to form the relationships, facts and events in a document. Often, the surface forms of an entity do not capture all of the information needed for further analysis. To categorize the elements in a document correctly, a text extraction tool must capture and link three underlying pieces of information: pronouns, anaphors and inferable attributes. This article first describes how leveraging pronominal information improves text extraction. Second, it explains how anaphors are useful to text extraction. Finally, it describes how inferable attributes further enhance text extraction.

Pronouns

Very often in a document, a relationship occurs that cannot be determined solely from its surface structure. Pronouns, such as he, she and it, are only valuable if they can be related back to the entities to which they refer. Consider the following example.

he was the operational chief of the organization

This phrase has little value for text extraction unless one can determine who “he” refers to. Once a text extraction tool can determine pronominal reference, relationships between entities become more apparent. Now consider the previous example when one knows the pronominal referent.

he[Ali Ghufron] was the operational chief of the organization

This gets us closer to understanding the sentence, but we still need to know which organization “the organization” refers to.
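As a rough illustration of pronominal reference, the sketch below links a pronoun to the most recent preceding mention of a compatible entity type. It is a deliberately naive heuristic, not a description of how any particular extraction product works. The `Mention` class, the `PRONOUN_TYPES` table, the function name and the token positions are all hypothetical, and the sketch assumes that entity mentions have already been extracted from earlier text that names Ali Ghufron.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical table: the entity type each pronoun is allowed to refer to.
PRONOUN_TYPES = {"he": "PERSON", "she": "PERSON", "it": "ORGANIZATION"}


@dataclass
class Mention:
    text: str         # surface form of the entity
    entity_type: str  # e.g. "PERSON" or "ORGANIZATION"
    position: int     # token offset within the document


def resolve_pronoun(pronoun: str, position: int,
                    mentions: List[Mention]) -> Optional[Mention]:
    """Return the closest earlier mention whose type fits the pronoun."""
    wanted = PRONOUN_TYPES.get(pronoun.lower())
    if wanted is None:
        return None  # not a pronoun this sketch knows about
    candidates = [m for m in mentions
                  if m.position < position and m.entity_type == wanted]
    # The candidate with the largest offset is the most recent one.
    return max(candidates, key=lambda m: m.position, default=None)


# Assumed prior context: both entities were mentioned before the pronoun.
mentions = [
    Mention("Jemaah Islamiah", "ORGANIZATION", 3),
    Mention("Ali Ghufron", "PERSON", 8),
]
referent = resolve_pronoun("he", 12, mentions)
if referent is not None:
    print(f"he[{referent.text}] was the operational chief of the organization")
```

A recency heuristic like this one will pick the wrong referent whenever several people are mentioned close together, so a practical tool would need to weigh more evidence, such as grammatical role and gender agreement.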

Anaphors

Anaphors are expressions that refer back to some other entity mentioned within a document. A related kind of reference that is not discussed in this article is exophora, in which the referent lies outside of the document and requires real-world knowledge to identify. While the pronouns mentioned above are also anaphors, anaphors are often full noun phrases that stand in for other entities. Look again at our previous example.

he[Ali Ghufron] was the operational chief of the organization

An anaphor resolution capability can enhance a text extraction tool. Relating anaphors to their referents provides additional information when discovering relationships, facts and events. Now consider the previous example when one knows the referent of the organization.

he[Ali Ghufron] was the operational chief of the organization[Jemaah Islamiah]

The relationship is now apparent: Ali Ghufron is the operational chief of Jemaah Islamiah. But is there any more information we can glean from this example?
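To make the noun-phrase case concrete, here is a minimal sketch that resolves a phrase such as the organization by matching its head noun against the types of previously mentioned entities, then records the resulting relationship as a simple triple. The `HEAD_NOUN_TYPES` table, the function name and the list of prior mentions are assumptions made for illustration only; the source example does not show the text that precedes the sentence.

```python
from typing import List, Optional, Tuple

# Hypothetical mapping from the head noun of a noun phrase to an entity type.
HEAD_NOUN_TYPES = {
    "organization": "ORGANIZATION",
    "group": "ORGANIZATION",
    "man": "PERSON",
    "city": "LOCATION",
}

# A prior mention is (surface text, entity type), listed in document order.
PriorMention = Tuple[str, str]


def resolve_noun_phrase(phrase: str,
                        prior_mentions: List[PriorMention]) -> Optional[str]:
    """Return the most recent prior mention whose type matches the head noun."""
    words = phrase.lower().split()
    if not words:
        return None
    wanted = HEAD_NOUN_TYPES.get(words[-1])  # treat the last word as the head
    if wanted is None:
        return None
    for text, entity_type in reversed(prior_mentions):
        if entity_type == wanted:
            return text
    return None


# Assumed prior context for the running example.
prior = [("Jemaah Islamiah", "ORGANIZATION"), ("Ali Ghufron", "PERSON")]
organization = resolve_noun_phrase("the organization", prior)

# With both references resolved, the relationship can be stored as a triple.
relation = ("Ali Ghufron", "operational chief of", organization)
print(relation)
```

Storing the result as a (subject, role, object) triple is only one possible design; it is used here because it keeps the resolved relationship easy to inspect.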

Inferable Attributes

Inferable attributes are properties of an entity that can be determined from other parts of the document or that are implicit in the entity itself. For example, the gender of a name like John Smith can be inferred to be male simply because the name John tends to refer to males. However, some names are ambiguous with respect to their attributes. In our example, the gender of the name Ali is ambiguous, since the name is used for both males and females. To resolve such ambiguity, one could merge the attributes of all occurrences of an entity into a composite entity. We learned in our previous example that the pronoun he is linked to Ali Ghufron. Since he is a pronoun that refers to males, one can infer that the entity Ali Ghufron is also male. Reconsider our previous example.

he[Ali Ghufron, GENDER=Male] was the operational chief of the organization[Jemaah Islamiah]

We now know that the male Ali Ghufron is the operational chief of Jemaah Islamiah, even though that relationship was never explicitly stated in our document.
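The merging step can be sketched as follows. Each occurrence of an entity contributes whatever gender it implies, and the composite entity receives a GENDER attribute only when the occurrences agree. The small name and pronoun tables and the function names are hypothetical stand-ins for whatever lexicons a real tool would use; an ambiguous name such as Ali is simply left out of the name table.

```python
from typing import Dict, List, Optional

# Hypothetical lexicons. Ambiguous first names (e.g. "Ali") are omitted.
NAME_GENDER = {"john": "Male", "mary": "Female"}
PRONOUN_GENDER = {"he": "Male", "him": "Male", "his": "Male",
                  "she": "Female", "her": "Female", "hers": "Female"}


def infer_gender(mention: str) -> Optional[str]:
    """Infer gender from a single occurrence, or None if it is uninformative."""
    lowered = mention.lower()
    if lowered in PRONOUN_GENDER:
        return PRONOUN_GENDER[lowered]
    words = lowered.split()
    if not words:
        return None
    return NAME_GENDER.get(words[0])  # look up the first name only


def merge_attributes(coreferent_mentions: List[str]) -> Dict[str, str]:
    """Build composite attributes from all occurrences of one entity."""
    genders = {g for g in map(infer_gender, coreferent_mentions)
               if g is not None}
    attributes: Dict[str, str] = {}
    if len(genders) == 1:  # assign only when the evidence is unanimous
        attributes["GENDER"] = genders.pop()
    return attributes


# "Ali Ghufron" alone is uninformative; the linked pronoun settles it.
print(merge_attributes(["Ali Ghufron"]))
print(merge_attributes(["Ali Ghufron", "he"]))
```

Leaving the attribute unset when occurrences disagree is a conservative choice made for this sketch: conflicting evidence more likely signals a faulty link between occurrences than a fact about the entity.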

Structured Language

In this article, I showed how leveraging pronominal information can provide more meaningful results to a text extraction tool. I also demonstrated how anaphors add value to text extraction by making relationships more apparent. Finally, I described how a text extraction tool could use inferable attributes, either implied by the entity itself or obtained by merging the attributes of all occurrences of the entity, to further enhance text extraction. These are just a few examples of how the underlying structures of natural language can be used by text extraction tools to enhance the value of extracted information for real-world applications.


Headquartered in Herndon, VA, Lockheed Martin's Integrated Systems & Solutions (IS&S) was formed in June 2003, in response to the increasing demand for solutions that promise a comprehensive, real-time information picture for faster, better informed decisions. Developed with more than 20 years of Lockheed Martin experience, AeroText is a high-performance data extraction engine and development environment that worldwide companies and governments use to find and correlate relevant information in text documents.
