NLP+CSS research review with hints for undergrad research
Published (updated: ) in computation.
Many undergrad students I know feel intimidated by the idea of a thesis or guided study. There are many reasons for this, but the most common one is probably that the whole thing suddenly lands on you at the beginning of the final year of undergrad. You are thrown into an ocean you know nothing about. After multiple heartbreaks and much distress, you might think of something like a pet project in machine learning, bioinformatics, IoT, 5G, or blockchain, but it is easy to get stuck looking for a good, well-defined problem to work on. There is plenty of adversity on this path: no domain knowledge, no hints about reference materials, no idea about the computational/coding side, no guidance!
The core purpose of this article is to introduce a few interesting tasks in NLP+CSS, along with a brief discussion of the domains, hints for future work, and references to handy materials. I chose these two fields because I think they are worth exploring and are well suited to people from a low-resource background like us.
Generating ideas, formulating problems, finding the gaps, and connecting the dots is a daunting job for an early researcher. But ideas and problems matter a lot when we think of a research project. Some ideas add no value to the research, or even hurt the student's learning curve. Sometimes you spend a lot of time on a very hard problem that can't be solved in the time available, which leads to burnout or frustration. A lack of domain knowledge and working on the wrong problem may also harm your motivation and eventually cause failure. I emphasize this because I myself wasted a lot of time wandering here and there looking for a project to start with. Finding a suitable research topic is hard, but proper resources can help you find one. It is not easy to judge a topic beforehand when you're just getting started, and that is where a good research overview helps: it tells you what's going on in the areas you're interested in and which problems are good enough to deserve your attention.
Okay! Let’s get started. First, I will point to some interesting CSS+NLP research problems for undergrad research, each with a title, short commentary, and rough hints. Please review the linked PDFs and code to get a proper understanding. At the end of this article, I will briefly discuss the mainstream research tasks and problems in natural language processing and machine learning to provide a rough intro to those fields. I will try to point out common research questions, proper references, and pointers to research materials. I hope you can use this material to build an undergrad research project and formulate viable research questions, provided you have reviewed the materials.
NLP + CSS research on COVID-19 outbreak
We are coping with the ongoing pandemic; it’s been six months since the outbreak. With this new normal, a lot of our life has changed. Social media has never been as popular and useful as it is now. The global increase in online interaction and the sheer volume of communication open up a lot of interest in societal and language processing research. This section tries to connect NLP+CSS studies to the COVID-19 outbreak. Topics such as sentiment analysis, emotion recognition, online abuse detection, misinformation, event extraction, and empathetic chatbots are very common in the NLP community. One can think of many potential projects in the interdisciplinary study of computational social science (CSS) and natural language processing (NLP) amid the COVID-19 outbreak. Given the enormous scope for research in these areas, the Association for Computational Linguistics (ACL) called for two workshops and detailed possible research directions. Here I quote their call to give a first look at potential project ideas.
[ACL] We welcome submissions related to any aspect of NLP applied to combat the COVID-19 pandemic, including (but not limited to):
Text mining of scientific literature related to COVID-19 (e.g. CORD-19)
Analysis of text from the web, social media or clinical data in support of public health activities related to COVID-19
Sentiment analysis, mental health, or well-being analysis in social media or clinical data related to COVID-19
Application of NLP to analysis of the collateral effects of COVID-19. Collateral effects include anything that is happening as a result of the virus, including economic effects.
Multi-lingual or cross-lingual analysis of COVID-19 related textual data
NLP for semantic search of COVID-19 related textual data
Chatbots and other interactive support systems related to COVID-19
Analysis of spoken language related to COVID-19
Below I point to a few project ideas, each with a simple project title, key questions, and references. These are rough ideas and lack proper details. If any of you (my readers) are interested, I can provide detailed pointers to the research challenges, reference materials, datasets, coding tools, etc. if asked.
1. Natural Language Processing for Understanding Mental Health around Covid-19
collect mental health-related data and analyze it using topic modeling, top hashtags, top keywords, and visualizations; run a broader analysis on a large corpus; create temporal graphs to show increased use of social media, discussed topics, mental health, and well-being; network propagation analysis, cross-lingual and regional data analysis, etc.
- An “Infodemic”: Leveraging High-Volume Twitter Data to Understand Public Sentiment for the COVID-19 Outbreak [pdf]
- The COVID-19 Social Media Infodemic [pdf]
- Health, Psychosocial, and Social issues emanating from COVID19 pandemic based on Social Media Comments using Natural Language Processing [pdf]
- #lockdown: Network-Enhanced Emotional Profiling in the Time of COVID-19 [pdf]
- Cross-language sentiment analysis of European Twitter messages during the COVID-19 pandemic (https://openreview.net/pdf?id=VvRbhkiAwR)
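As a first step for the analysis sketched in this project idea, here is a minimal Python sketch that counts top hashtags and posts per day. The `posts` list is made-up example data, and the function names `top_hashtags` and `posts_per_day` are my own; a real study would read a collected corpus instead.

```python
import re
from collections import Counter

# Made-up example posts as (date, text); real data would come from a collected corpus.
posts = [
    ("2020-03-01", "Feeling anxious about the news #COVID19 #anxiety"),
    ("2020-03-01", "Stay home, stay safe #StayHome #COVID19"),
    ("2020-03-02", "Day 5 of isolation, really lonely #lockdown #mentalhealth"),
    ("2020-03-02", "Online therapy helped me a lot #MentalHealth #covid19"),
    ("2020-03-02", "Can't sleep again #anxiety #lockdown"),
]

HASHTAG = re.compile(r"#(\w+)")


def top_hashtags(posts, k=3):
    """Count hashtags case-insensitively and return the k most frequent."""
    counts = Counter(
        tag.lower() for _, text in posts for tag in HASHTAG.findall(text)
    )
    return counts.most_common(k)


def posts_per_day(posts):
    """Number of posts per date, the raw series behind a temporal graph."""
    return dict(sorted(Counter(date for date, _ in posts).items()))


print(top_hashtags(posts))
print(posts_per_day(posts))
```

The per-day counts can be fed to any plotting tool to draw the temporal graph; the same counting pattern works for keywords if you swap the hashtag regex for a tokenizer.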
2. Sentiments and emotions evoked by news headlines of COVID-19 outbreak in Bangladesh and India
collect news headlines from Bangladeshi and Indian news outlets, use pre-trained sentiment and emotion classifiers (python package / GitHub) for annotation, create a summary of the annotated labels, showcase emotion dynamics, extract topics and discuss a set of example data
- Sentiments and emotions evoked by news headlines of coronavirus disease (COVID-19) outbreak (https://www.nature.com/articles/s41599-020-0523-3.pdf)
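The annotate-then-summarize pipeline for this project can be sketched as below. This is only a toy: the tiny word lists stand in for a pre-trained sentiment classifier, and both the lexicon and the headlines are invented for illustration. In a real study you would replace `label_headline` with a call to whichever classifier you pick.

```python
from collections import Counter

# Tiny illustrative lexicon (an assumption); a real study would use a
# pre-trained sentiment or emotion classifier instead.
POSITIVE = {"recover", "recovered", "hope", "relief", "success", "eases"}
NEGATIVE = {"death", "deaths", "fear", "crisis", "surge", "panic"}


def label_headline(headline):
    """Return 'positive', 'negative', or 'neutral' from lexicon word counts."""
    words = [w.strip(".,:;!?'\"").lower() for w in headline.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize(headlines):
    """Share of each label across the annotated headlines."""
    counts = Counter(label_headline(h) for h in headlines)
    return {label: n / len(headlines) for label, n in counts.items()}


# Made-up headlines for demonstration only.
headlines = [
    "Hope rises as more patients recover",
    "Deaths surge amid hospital crisis",
    "Government announces new testing schedule",
    "Lockdown eases, bringing relief to traders",
]

print(summarize(headlines))
```

Grouping the headlines by date or by outlet before calling `summarize` gives the emotion dynamics mentioned in the project description.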
3. Online abuse, gender bias
- On Analyzing Antisocial Behaviors Amid COVID-19 Pandemic [pdf]
- When does a Compliment become Sexist? Analysis and Classification of Ambivalent Sexism using Twitter Data [pdf]
- Mitigating Gender Bias in Natural Language Processing: Literature Review [pdf]
- Reducing Gender Bias in Abusive Language Detection [pdf]
4. Misinformation amid pandemic
analysis of misinformation, fake news, and conspiracy theories about the COVID-19 pandemic; outrage, stress, and anxiety among social media users caused by misinformation; fact-checking mechanisms
- NLP-based Feature Extraction for the Detection of COVID-19 Misinformation Videos on YouTube [pdf]
- Detecting COVID-19 Misinformation on Social Media [pdf]
- Conspiracy in the time of corona: Automatic detection of covid-19 conspiracy theories in social media and the news [pdf]
- Fake News Research: Theories, Detection Strategies, and Open Problems [pdf]
5. Topic modeling reveals social concerns of COVID-19 pandemic
use topic modeling tools such as LDA and LSA to analyze topics (healthcare, COVID-19 cases, jobs, mental health, education, economy) in pandemic Twitter data
- Top Concerns of Tweeters During the COVID-19 Pandemic: Infoveillance Study [pdf]
- Unsupervised Modeling of Twitter Conversations [pdf]
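To get a feel for what LDA does before reaching for a library, here is a small pure-Python sketch of LDA trained with collapsed Gibbs sampling. It is a teaching sketch under my own assumptions (the hyperparameters `alpha` and `beta`, the iteration count, and the made-up tokenized documents), not the method used in the papers above. For real Twitter data you would use an established topic modeling package.

```python
import random


def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs.

    Returns (doc_topic_counts, topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]       # n(d, k)
    topic_word = [[0] * V for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                     # n(k)
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, k=3):
    """The k highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:k]]
        for row in topic_word
    ]


# Made-up, already tokenized example documents.
docs = [
    ["job", "salary", "layoff", "job"],
    ["vaccine", "hospital", "doctor"],
    ["layoff", "salary", "job"],
    ["doctor", "vaccine", "hospital", "vaccine"],
]

doc_topic, topic_word, vocab = lda_gibbs(docs)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

The sampler is stochastic, so the topics you get depend on the seed and the number of iterations; on a corpus this small the output is only meant to show the mechanics.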
6. What school students talk about in the days of the current pandemic
a focused study on students and their education; uncertainty, key concerns, etc
7. Question answering system from pandemic data released by Govt. outlets
develop a question answering system/chatbot from tabular or press release type data
- SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums [pdf]
- COVID-QA: A Question Answering Dataset for COVID-19 [pdf]
- What Are People Asking About COVID-19? A Question Classification Dataset [pdf]
- SQuAD: 100,000+ Questions for Machine Comprehension of Text [pdf]
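A simple baseline for such a question answering system is retrieval: return the stored passage that is most similar to the question. The sketch below does this with TF-IDF vectors and cosine similarity in plain Python. The class name `TfidfRetriever` and the passages are hypothetical placeholders I made up, not real government data, and a serious system would go well beyond this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfRetriever:
    """Return the stored passage most similar to a question (TF-IDF cosine)."""

    def __init__(self, passages):
        self.passages = passages
        tokenized = [tokenize(p) for p in passages]
        n = len(passages)
        df = Counter(term for toks in tokenized for term in set(toks))
        # Smoothed inverse document frequency, so every term keeps a positive weight.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        self.vectors = [self._vectorize(toks) for toks in tokenized]

    def _vectorize(self, tokens):
        # Terms never seen in the passages are ignored.
        tf = Counter(t for t in tokens if t in self.idf)
        return {t: c * self.idf[t] for t, c in tf.items()}

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
            sum(w * w for w in b.values())
        )
        return dot / norm if norm else 0.0

    def answer(self, question):
        """Return (best_passage, similarity_score) for the question."""
        q = self._vectorize(tokenize(question))
        scores = [self._cosine(q, v) for v in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.passages[best], scores[best]


# Placeholder passages, invented for illustration only.
passages = [
    "The national hotline for booking a test is 0000 (placeholder).",
    "Schools remain closed until further notice from the ministry.",
    "Masks are mandatory on public transport and in markets.",
]

retriever = TfidfRetriever(passages)
print(retriever.answer("Are schools closed?"))
```

A score of 0.0 means no question word appeared in any passage, which is a useful signal for the chatbot to say it does not know instead of returning an unrelated answer.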
8. News headline / Short answer generation from Corona bulletin and press releases
generate news headlines from tabular data or news stories; generate answers from crowd-sourced data
- Automatic Dialogue Generation with Expressed Emotions [pdf]
- A survey on empathetic dialogue systems [pdf]
- Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models [pdf]
- Adversarial Learning for Neural Dialogue Generation [pdf]
9. CoronaViz: An interactive visualization of a pandemic using Twitter data
an interactive visualization, use d3, release API
- Using Named Entity Recognition and Natural Language Processing to build a map of accumulated infections of n-Cov2019 [html]
- COVID-19 Outbreak Prediction with Machine Learning [pdf]
- COVID-19 Future Forecasting Using Supervised Machine Learning Models [pdf]
- Estimating Uncertainty and Interpretability in Deep Learning for Coronavirus (COVID-19) Detection [pdf]
10. Information retrieval on CORD-19 corpus
BERT-based transfer learning, relation extraction, and named entity recognition on COVID-19 data for a variety of tasks such as scientific content extraction, event detection, etc.
- COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter [pdf]
- Transfer learning for health-related Twitter data [pdf]
- Transfer Learning for Named-Entity Recognition with Neural Networks [pdf]
- Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition [pdf]
- Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set [resources on github]
- Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society [resources on github]
- CORD-19: COVID-19 Open Research Dataset [data]
Whom to follow
Here I list a few researchers whose work is worth following. They are well-known names in NLP+CSS. I will add each author’s most interesting papers and summarize their research and key takeaways later.
- Professor Oren Tsur – Social network dynamics, Information diffusion
- Professor Erik Cambria – Sentiment analysis
- Professor Soujanya Poria – Sentiment analysis, Affective computing
- Dr. Saif M. Mohammad – Emotion recognition, Lexical Semantics
- Aditya Joshi – Sarcasm Detection
- Zeerak Waseem – Hate speech detection
- Professor Kai Shu – Fake news detection
- James Thorne – Automated Fact-checking
Optional Reading: Mainstream Natural language processing research
Natural language processing is a huge research field, and there are tons of open research problems in areas such as language modeling, neural machine translation, multilingual NLP, text summarization, question answering, NLP for low-resource languages, transfer learning, robustness, and explainability. In this section, I will provide brief details on tasks that have received research attention in the last few years. Working on these problems requires a deeper understanding of natural language processing techniques. But one should not feel intimidated by the difficulty of the problem. If you can work on any of these, it will definitely expand and enrich your understanding of machine learning and NLP. These problems are also widely explored by top research labs, which will help you keep up with state-of-the-art NLP systems.
I will update this part later.
Neural Machine Translation
- Neural machine translation by jointly learning to align and translate [pdf]
- Sequence to Sequence Learning with Neural Networks [pdf]
- Memory Networks [pdf]
- Ask Me Anything: Dynamic Memory Networks for Natural Language Processing [pdf]
- SQuAD: 100,000+ Questions for Machine Comprehension of Text [pdf]
Neural Text Generation
- Neural Text Generation: Past, Present and Beyond [pdf]
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates [pdf]
(Aspect based) Sentiment Analysis
- BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis [pdf]
- Interactive Attention Networks for Aspect-Level Sentiment Classification [pdf]
Named Entity Recognition
- CamemBERT: a Tasty French Language Model [pdf]
- Deep contextualized word representations [pdf]
- Named Entity Recognition with Bidirectional LSTM-CNNs [pdf]
Trending research in machine learning
A few genuinely groundbreaking ideas have shifted the field into a new dimension, many of them in the last 5-10 years. They contribute not only to NLP but also advance machine learning, computer vision, robotics, etc. as a whole. To name a few, from the old days to today: Backpropagation, Long Short-Term Memory networks, AlexNet, GANs, and Transformers.
The Transformer network by Vaswani et al. (2017), proposed in the paper Attention Is All You Need, is at the core of recent advances in several natural language processing tasks. It was a paradigm shift away from the standard way of building NLP applications on recurrent neural nets. It led to many of the most influential models in today’s NLP, such as BERT, GPT-3, ERNIE, and XLNet. (ELMo, another influential model from the same period, is built on bidirectional LSTMs rather than the Transformer.) Another very recent and developing idea is Graph Neural Networks. Here is a nice review paper by Zhou et al. 2019.
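To make the attention idea behind the Transformer a bit more concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the formula softmax(Q K^T / sqrt(d_k)) V from Attention Is All You Need. It works on small list-of-list matrices for readability; the function names and the toy inputs are my own, and real implementations use tensor libraries, multiple heads, and learned projections.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-list matrices.

    Q has one row per query, K and V one row per key/value pair.
    Returns (outputs, attention_weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value rows.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy inputs: one query that matches the first key more than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
print(weights)
print(outputs)
```

Each output row is a weighted mix of the value rows, with more weight on the values whose keys look like the query. That is the whole trick; the rest of the Transformer stacks and parallelizes it.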
Transformer – TBA
Graph Neural Networks – TBA