Chennai Mathematical Institute


Data Science Seminar
Date: 26/11/2021
Day: Friday
Time: 2 pm to 3.15 pm
Information extraction in NLP

Prakash Selvakumar
Assistant Vice President at Genpact digital.


The process of information extraction (IE) turns the unstructured information embedded in texts into structured data, for example, to populate a relational database and enable further processing. As part of this talk, I will discuss the following topics in detail:

Relation Extraction

Relation Extraction Algorithms

Extracting Times

Extracting Events and their Times

Template Filling

I will begin with the task of relation extraction: finding and classifying semantic relations among entities mentioned in a text, like child-of (X is the child-of Y), or part-whole or geo-spatial relations. Relation extraction has close links to populating a relational database and to knowledge graphs. Datasets of structured relational knowledge, such as knowledge graphs, are a useful way for search engines to present information to users. Next, I will discuss three tasks related to events. Event extraction is finding events in which these entities participate, for example, the fare increases by United and American and the reporting events "said" and "cite". Event coreference is needed to figure out which event mentions in a text refer to the same event; the two instances of "increase" and the phrase "the move" all refer to the same event. To figure out when the events in a text happened, we extract temporal expressions such as days of the week ("Friday" and "Thursday"), relative expressions ("two days from now") and times ("3:30 P.M."), and normalize them onto specific calendar dates or times. We will need to link Friday to the time of United's announcement, Thursday to the previous day's fare increase, and produce a timeline in which United's announcement follows the fare increase and American's announcement follows both of those events. Finally, many texts describe recurring stereotypical events or situations. The task of template filling is to find such situations in documents and fill in the template slots.
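To make the normalization step concrete, here is a minimal Python sketch. It is an illustration only, not the method presented in the talk or in the reference text. The function name normalize, the small vocabulary tables, and the rule that a bare weekday name resolves to the most recent such day on or before the document date are all assumptions made for this example.

```python
import re
from datetime import date, time, timedelta

# Hypothetical lookup tables for this sketch only.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7}


def normalize(expression, anchor):
    """Map a temporal expression to a date or time, relative to `anchor`.

    `anchor` is the document date. Returns a `date`, a `time`,
    or None when the expression is not covered by these toy rules.
    """
    text = expression.strip().lower()

    # Bare weekday: assume the most recent such day on or before the anchor.
    if text in WEEKDAYS:
        days_back = (anchor.weekday() - WEEKDAYS.index(text)) % 7
        return anchor - timedelta(days=days_back)

    # Relative offset such as "two days from now" or "3 days from now".
    match = re.fullmatch(r"(\w+) days? from now", text)
    if match:
        word = match.group(1)
        count = NUMBER_WORDS.get(word)
        if count is None and word.isdigit():
            count = int(word)
        if count is not None:
            return anchor + timedelta(days=count)

    # Clock time such as "3:30 P.M." on a 12-hour clock.
    match = re.fullmatch(r"(\d{1,2}):(\d{2})\s*([ap])\.?m\.?", text)
    if match:
        hour, minute = int(match.group(1)), int(match.group(2))
        if 1 <= hour <= 12 and minute < 60:
            hour = hour % 12 + (12 if match.group(3) == "p" else 0)
            return time(hour, minute)

    return None


# Usage: take the seminar date, a Friday, as the document date.
anchor = date(2021, 11, 26)
for phrase in ["Friday", "Thursday", "two days from now", "3:30 P.M."]:
    print(phrase, "->", normalize(phrase, anchor))
```

A real system would also need to decide, from tense and context, whether a weekday refers to the past or the future; the backward-looking rule here is only the simplest choice for news-style text.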

Reference: Speech and Language Processing by Dan Jurafsky and James H. Martin

About the speaker: Prakash is the Assistant Vice President of Data Science and Insights at Genpact, India. He has more than 15 years of experience in the industry. Prior to Genpact, he worked at Wipro, Philips, IBM, Siemens and Logic Soft. Prakash has a PhD in opinion mining and sentiment analysis from Bharathidasan University, Tiruchirappalli.