How Do Function Words Function in Text Analytics? (Video)
In a presentation at Text Analytics Forum 2018, co-located with KMWorld, Kiki Adams, head of science at Receptiviti, focused on the importance of function words in text analytics.
Most text analysis methods include the removal of stopwords, which generally overlap with the linguistic category of function words, as part of pre-processing. While this makes sense in the majority of use cases, function words can be extremely powerful.
Research within the field of language psychology, largely centered around linguistic inquiry and word count (LIWC), has shown that function words are indicative of a range of cognitive, social, and psychological states. This makes an understanding of function words vital to making appropriate decisions in text analytics. In model design, differences in expected distributions of function words compared with content words have an impact on feature engineering.
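To illustrate the feature-engineering point, here is a minimal Python sketch that keeps function words rather than discarding them and turns them into per-category rates. The category lists and the function_word_features helper are hypothetical and deliberately tiny; they are not the LIWC dictionary or Receptiviti's method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny category lists for illustration only.
# These are NOT the LIWC dictionary.
FUNCTION_WORD_CATEGORIES = {
    "pronoun": {"i", "you", "he", "she", "it", "we", "they", "me", "my", "that"},
    "article": {"a", "an", "the"},
    "preposition": {"about", "at", "in", "of", "on", "to", "with"},
    "conjunction": {"and", "but", "or", "so"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_features(text):
    """Return each category's share of all tokens in the text."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    features = {}
    for category, words in FUNCTION_WORD_CATEGORIES.items():
        hits = sum(counts[w] for w in words)
        features[category] = hits / total if total else 0.0
    return features


features = function_word_features("I love that table, and you love it too.")
print(features)
```

Rates are used instead of raw counts so that texts of different lengths stay comparable, which matters because function words are so frequent.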
“The Linguistic Inquiry and Word Count system is based on function words. And many of you, if you are practitioners of natural-language processing, may know them as stopwords, and may mostly be familiar with just throwing them out as the first step in pre-processing,” said Adams.
“For many natural-language processing techniques, they kind of get in the way. They’re extremely frequent,” she noted, adding that “they are the things that glue words together, kind of the types of words that you may remember from English class as prepositions, pronouns, conjunctions, all of those.”
Part of what's unique about them is how high-frequency they are, Adams said. “The function words, such as ‘and,’ ‘about,’ ‘the,’ make up less than 0.04% of all of the words that we know, but over half of the words that we actually use, which makes them quite unique, in terms of distribution, for dealing with, in natural-language processing.” The most common function words, basically the top 10 of all function words—including a, an, the, and, but—are even much higher-frequency than the rest of the function words, said Adams. “And then content words, which is what we call nouns, verbs, and adjectives, anything that's not function words, are much less frequent.”
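The contrast between how few function words exist and how often they are used can be checked on any text. The sketch below is one way to do it in Python, assuming a small hypothetical FUNCTION_WORDS set and a simple regex tokenizer. A toy sentence will not reproduce the figures Adams cites, which describe English as a whole.

```python
import re

# Hypothetical, partial list of function words for illustration only.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "about", "of", "to",
    "in", "on", "i", "you", "it", "that", "is",
}


def function_word_shares(text):
    """Return (share of tokens, share of distinct words) that are function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    vocab = set(tokens)
    token_share = (
        sum(t in FUNCTION_WORDS for t in tokens) / len(tokens) if tokens else 0.0
    )
    vocab_share = len(vocab & FUNCTION_WORDS) / len(vocab) if vocab else 0.0
    return token_share, vocab_share


sample = "The cat sat on the mat and the dog sat on the rug."
token_share, vocab_share = function_word_shares(sample)
print(f"share of tokens: {token_share:.2f}, share of vocabulary: {vocab_share:.2f}")
```

Even in this short sample, function words account for a larger share of the running tokens than of the distinct words, and the gap would be expected to widen on a larger corpus as the content vocabulary grows.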
Why are function words important and why do we use them in such a different way than we use content words? asked Adams. “To answer that question, we kind of have to dive into some neuroscience.”
When someone says the word “table,” there may be a mental image of a table that pops up, and that is thanks to the Wernicke’s Area of the brain, which stores images associated with words, sociocultural connections, memories, and emotional connections, said Adams. “If I say the word ‘love,’ all of the associations that you have with the word love are stored in that Wernicke’s Area.”
Now, how to actually put those words in a sentence is what's stored in the Broca’s Area, Adams continued. “And, so the difference between ‘I love that table’ and ‘I tabled that love’ is thanks to the Broca’s Area. When you're actually speaking, it’s the Wernicke’s Area that you’re consciously aware of. So, if you're talking about a table, you're not thinking about how to put that word in a sentence correctly; you're thinking about the table. The Broca’s Area is always working as you process language, either hearing it, speaking, reading, writing—all of those—but it’s very subconscious.”
View the video.
Many speakers have made their slide decks available at www.text-analytics-forum.com/2018/Presentations.aspx
Learn more at Text Analytics Forum 2019, coming to Washington, DC, Nov. 6-7.