EXCLUSIVE – Coupling top-down and bottom-up approaches to Natural Language Processing
“I love deadlines. I love the whooshing noise they make as they go by.”
Most humans would be able to detect the sarcasm in the quote above, even if it takes them a moment or two. But imagine making a computer understand the sentiment expressed in the above sentence.
That is the sort of challenge that Dr. Erik Cambria (Assistant Professor at the School of Computer Science and Engineering at Nanyang Technological University) and his team at SenticNet are trying to tackle. They are dealing with the fundamental problems of natural language processing (NLP) for sentiment analysis. Natural language, which is the language we use for communicating with each other, is rather different from the way we communicate with computers. Natural language is ambiguous, complex and chaotic. Constructed languages, such as programming languages, adhere to strict rules and logic.
Wikipedia defines sentiment analysis as the use of NLP, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Applications involve analysing the positive, negative and neutral sentiments in online customer reviews, surveys, feedback and social media postings. This has great utility in a range of fields, from marketing to finance and healthcare.
The problem is much more complicated than it seems. For instance, if a statement is sarcastic, like the one above, something which looks positive is actually negative (love is hate). Understanding this polarity (whether a sentiment is positive or negative) is a core aspect of sentiment analysis. It involves the use of deep learning, psychology, and also linguistics, demonstrating the multi-disciplinary nature of the field.
Deep learning helps detect some patterns, such as the big shift in polarity that usually occurs in a sarcastic comment (positive followed by negative). Linguistics provides insights on sentence structure, while psychology is important because whether a statement is sarcastic can depend on the personality of the individual.
To take another example, saying “This phone is expensive but nice” is not the same as saying “This phone is nice but expensive.” In fact, the sentiments expressed are polar opposites, though the words used are the same. Here, the understanding of sentence structure based on linguistics is key. When the ‘but’ conjunction is used, positive followed by negative yields negative, but negative followed by positive yields positive.
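The ‘but’ rule described above can be sketched in a few lines of Python. This is only a minimal illustration and not SenticNet's implementation: the toy lexicon `POLARITY` and the function names `clause_polarity` and `but_polarity` are assumptions made for this example.

```python
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
POLARITY = {"nice": 1, "expensive": -1}


def clause_polarity(clause):
    """Sum the polarity of known words in a clause; unknown words count as 0."""
    words = clause.lower().replace(".", "").split()
    return sum(POLARITY.get(word, 0) for word in words)


def but_polarity(sentence):
    """Apply the 'but' rule: the clause after 'but' decides the overall polarity."""
    before, sep, after = sentence.partition(" but ")
    if not sep:
        # No 'but' in the sentence: fall back to the plain sum over all words.
        return clause_polarity(sentence)
    return clause_polarity(after)


print(but_polarity("This phone is expensive but nice"))  # positive score
print(but_polarity("This phone is nice but expensive"))  # negative score
```

A plain word count would give both sentences the same score of zero, since they contain the same words. Letting the clause after ‘but’ decide is what separates them.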
To understand the approach of SenticNet to dealing with such challenges and improving sentiment analysis, we need to look at its origins.
Origins from a commonsense knowledge base
SenticNet started as a project at the MIT Media Lab in 2009.
“They had this big knowledge base of commonsense and I thought, why don’t we use it for sentiment analysis,” Dr. Cambria said. “Back then, sentiment analysis was not very popular but in the past few years, its popularity has increased dramatically. Because of the research challenges, and also because of the business opportunities. For instance, so many companies want to know what their customers like about their products.”
In AI research, commonsense knowledge is the collection of facts and information that an ordinary person is expected to know. These are facts so obvious and so trivial that no one would think of mentioning them explicitly, such as that a chair is for sitting on or that we drink water to quench our thirst.
Natural language is generally used only to communicate knowledge that we do not already share through common experience. The challenge is to represent this general knowledge, which most people possess, in a way that makes it available to AI programs.
A knowledge base here refers to a semantic network with millions of nodes, connected by links that encode pieces of commonsense information. For example, ‘beer’ and ‘drink’ could be two nodes, and a link connecting the two would represent the taken-for-granted information that “beer is a drink”.
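To make the idea of nodes and links concrete, here is a minimal sketch of such a semantic network in Python. The handful of edges, the relation labels `IsA` and `UsedFor`, and the helper names `build_graph` and `related` are all assumptions chosen for illustration; a real commonsense knowledge base holds millions of nodes.

```python
from collections import defaultdict

# Hypothetical miniature semantic network: each edge is (source, relation, target).
EDGES = [
    ("beer", "IsA", "drink"),
    ("water", "IsA", "drink"),
    ("chair", "UsedFor", "sitting down"),
    ("water", "UsedFor", "quenching thirst"),
]


def build_graph(edges):
    """Index edges by source node so each node lists its outgoing links."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return graph


def related(graph, node, relation):
    """Return every target linked from `node` by `relation`."""
    return [target for rel, target in graph.get(node, []) if rel == relation]


graph = build_graph(EDGES)
print(related(graph, "beer", "IsA"))       # ['drink']
print(related(graph, "water", "UsedFor"))  # ['quenching thirst']
```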
The MIT Media Lab has a portal called Open Mind Common Sense (OMCS), which collects pieces of knowledge from volunteers on the Internet by enabling them to enter commonsense into the system with no special training or knowledge of computer science.
Volunteers on the web would answer questions like “what is a bed used for?”, “what is a beer for?” and “where do you usually find the knife?”. Only those answers which occurred more than a few times would be inserted into the semantic graph. “If many people said that the bed is for sleeping, you take that as a good piece of commonsense,” Dr. Cambria said.
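The filtering step can be pictured as a simple frequency threshold. The sketch below is an assumption about how such a filter might look, not a description of the OMCS pipeline: the sample answers, the threshold `MIN_COUNT` and the function name `accepted_answers` are all made up for illustration.

```python
from collections import Counter

# Hypothetical volunteer answers to "what is a bed used for?"
answers = ["sleeping", "sleeping", "resting", "sleeping", "jumping", "sleeping", "resting"]

# Assumed threshold; the article only says "more than a few times".
MIN_COUNT = 3


def accepted_answers(answers, min_count):
    """Keep only answers given at least `min_count` times."""
    counts = Counter(answers)
    return {answer for answer, n in counts.items() if n >= min_count}


print(accepted_answers(answers, MIN_COUNT))  # {'sleeping'}
```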
ConceptNet is a semantic network based on the information in the OMCS database. SenticNet was built based on ConceptNet, focusing on concepts that are either positive or negative, because the eventual objective of SenticNet is to conduct sentiment analysis.
“We started as just a knowledge base, then from there we went on into the fundamental problems of natural language processing for sentiment analysis. While before we were just focusing on knowledge representation, later we got more and more interested in commonsense reasoning and linguistics. We went from having just SenticNet to having Sentic patterns and other reasoning techniques like AffectiveSpace and things that altogether allow us to do sentiment analysis in a human-like way,” Dr. Cambria said, describing the evolution of SenticNet.
Machine learning is not enough
Dr. Cambria said, “We try to take inspiration from how the human brain actually understands things, which is a very different approach from pure machine learning.”
The big difference between Sentic computing and other techniques is that Sentic computing is a hybrid approach that uses machine learning alongside knowledge representation, reasoning and linguistics.
With recent developments in machine learning methods like deep networks, most researchers are pinning their hopes on feeding massive volumes of data to algorithms. Dr. Cambria believes that commonsense is key to improving AI. Relying only on statistics, probabilities and co-occurrence frequencies is not enough.
He went on to highlight three big issues with machine learning. The first is ‘Dependency’, as machine learning requires a lot of training data and is domain-dependent.
The second issue is ‘Consistency’, as changes or tweaks in the learning model may lead to different results. The third is ‘Transparency’, that is, the way machine learning performs decision-making is a black box. We do not know why the algorithms arrived at the conclusions they did. In fact, this very property makes machine learning a powerful tool. Researchers don’t need to understand the data. They can just feed data to a neural network or whatever learning algorithm they are using, which learns the features automatically and then takes decisions. But we never know why the algorithm takes those decisions. This lack of transparency can be a major problem if we are using AI to perform activities that involve ethics, such as selecting candidates for a job opening.
In the context of NLP, Dr. Cambria said that these issues are crucial because, unlike in other fields, they prevent AI from achieving human-like performance. AI researchers need to bridge the gap between statistical NLP and many other disciplines that are necessary for understanding human language, such as linguistics, commonsense reasoning, and affective computing (affective computing is the study and development of systems and devices that can recognise, interpret and process human affects or emotions).
Coupling top-down and bottom-up AI
Because of the reasons discussed above, Dr. Cambria advocates a combination of symbolic and sub-symbolic AI. Symbolic models, such as semantic networks, represent a top-down approach to encode meaning. Sub-symbolic methods, such as neural networks, represent a bottom-up approach to infer syntactic patterns from data (syntax is the set of rules, principles, and processes that govern word order and sentence structure). The top-down approach helps gain transparency, while data-driven deep learning enables the automatic detection of patterns.
In a paper titled “SenticNet 5: Discovering Conceptual Primitives for Sentiment Analysis by Means of Context Embeddings”, Dr. Cambria and his co-authors explore how the two approaches might complement each other. The paper discusses the use of the bag-of-concepts model (as opposed to bag-of-words, in which a text is represented as a bag or set of its constituent words) for sentiment analysis. The bag-of-concepts model has the advantage over bag-of-words of being able to deal with multiword expressions like ‘pretty ugly’ or ‘sad smile’. In the latter model these would be split up and hence lose their polarity, i.e., their positive or negative meaning (for example, ‘pretty’ would be read as a positive adjective rather than as an adverb modifying ‘ugly’). It also avoids the blind use of keywords and word co-occurrence counts.
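The difference between the two representations can be shown with a small Python sketch. It is an illustration under stated assumptions rather than the method used in the paper: the concept list `CONCEPTS` is hypothetical, and the greedy pairing of adjacent words is a deliberately simple stand-in for real concept extraction.

```python
# Hypothetical list of known multiword concepts.
CONCEPTS = {("pretty", "ugly"), ("sad", "smile")}


def bag_of_words(text):
    """Split the text into individual lowercase words."""
    return text.lower().split()


def bag_of_concepts(text):
    """Greedily merge adjacent word pairs that form a known concept."""
    words = bag_of_words(text)
    result = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in CONCEPTS:
            # Keep the multiword expression together as a single unit.
            result.append("_".join(pair))
            i += 2
        else:
            result.append(words[i])
            i += 1
    return result


print(bag_of_words("that dress is pretty ugly"))
# ['that', 'dress', 'is', 'pretty', 'ugly']
print(bag_of_concepts("that dress is pretty ugly"))
# ['that', 'dress', 'is', 'pretty_ugly']
```

In the first output, ‘pretty’ stands alone and a word-level lexicon would likely score it as positive. In the second, ‘pretty_ugly’ survives as one unit that can carry its own negative polarity.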
The problem, however, is that the bag-of-concepts model cannot achieve a comprehensive coverage of meaningful concepts, i.e., a full list of multiword expressions that actually make sense. Models could be used to extract concepts from raw data, but such approaches are prone to errors due to the richness and ambiguity of natural language.
The paper goes on to propose the generalisation of concepts with related meaning, such as ‘munch toast’ and ‘slurp noodles’, into the conceptual primitive ‘EAT FOOD’. This is based on the idea that there is a finite set of mental primitives for affect-bearing concepts and a finite set of principles of mental combination governing their interaction. Sub-symbolic AI could now be used to automatically discover the conceptual primitives that can better generalise SenticNet’s commonsense knowledge.
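The generalisation step can be pictured as a lookup from specific words to primitives. In the paper the primitives are discovered automatically; in the sketch below, hand-written tables stand in for that discovery step. The tables `VERB_PRIMITIVES` and `NOUN_PRIMITIVES` and the function `to_primitive` are hypothetical names used only for this illustration.

```python
# Hypothetical lookup tables mapping words to conceptual primitives.
VERB_PRIMITIVES = {"munch": "EAT", "slurp": "EAT", "devour": "EAT"}
NOUN_PRIMITIVES = {"toast": "FOOD", "noodles": "FOOD", "sandwich": "FOOD"}


def to_primitive(concept):
    """Map a 'verb noun' concept to its primitive form, or None if unknown."""
    parts = concept.lower().split()
    if len(parts) != 2:
        return None
    verb, noun = parts
    if verb in VERB_PRIMITIVES and noun in NOUN_PRIMITIVES:
        return f"{VERB_PRIMITIVES[verb]} {NOUN_PRIMITIVES[noun]}"
    return None


print(to_primitive("munch toast"))    # EAT FOOD
print(to_primitive("slurp noodles"))  # EAT FOOD
```

Because both expressions collapse to the same primitive, whatever is known about ‘EAT FOOD’ applies to each of them, including combinations such as ‘devour sandwich’ that were never listed explicitly.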
This approach would also help in tackling the symbol grounding problem. Our understanding of language is grounded in the physical world, in sensations, in memory. A computer does not learn meaning like that. A meaning of a word on a page or computer screen is ungrounded. And looking it up in a dictionary would not help.
One article on the subject explains the problem like this: “If I tried to look up the meaning of a word I did not understand in a (unilingual) dictionary of a language I did not already understand, I would just cycle endlessly from one meaningless definition to another. My search for meaning would be ungrounded. In contrast, the meaning of the words in my head -- the ones I do understand -- are "grounded" (by a means that cognitive neuroscience will eventually reveal to us). And that grounding of the meanings of the words in my head mediates between the words on any external page I read (and understand) and the external objects to which those words refer.”
In the approach presented in the paper, several adjectives and verbs are defined in terms of a single ‘primitive’ item, thereby grounding those meanings in that one primitive. This does not solve the symbol grounding problem, but it reduces it.
SenticNet’s research is being applied in several projects ranging from fundamental knowledge representation problems to applications of commonsense reasoning in contexts such as big social data analysis and human-computer interaction.
For instance, a project in collaboration with Prof. Roy Welsch from MIT Sloan School of Management focuses on natural language based financial forecasting (NLFF). Markets are driven by sentiments, so sentiments extracted from data can be used to predict market movements.
SenticNet is also developing tools that allow patients to easily and efficiently measure their health-related quality of life, and is working to improve human-computer interaction (HCI) by developing dialogue systems with commonsense.
Another project, called PONdER (Public Opinion of Nuclear Energy), aims to collect, aggregate, and analyse opinions towards nuclear energy in different languages across Singapore, Malaysia, Indonesia, Thailand, and Vietnam. Understanding how the public perceives nuclear energy in the region enables policymakers to make informed national policies and decisions pertaining to nuclear energy, as well as shape communication strategies to inform the public about nuclear energy.
Dr. Cambria said that he is personally more interested in the fundamental problems of AI and sentiment analysis, such as solving the symbol grounding problem or building machines that can really understand language (IQ), emotions (EQ), and culture (CQ).
“Today, we still don’t have machines that really understand natural language. Siri does not understand natural language, Watson is an amazing answering machine but it does not understand language. At SenticNet, we want to go beyond rule-based and stats-based systems. What we are working on is not really NLP research anymore; it is natural language understanding.”