Introduction

Natural language processing (NLP) is a subfield of linguistics and artificial intelligence that aims to help computers understand human language and interact with humans. With the rapid development of data science, NLP has made great progress in creating applications that benefit everyday life. Applications of NLP include machine translation, chatbots, social media monitoring, survey analysis, targeted advertising, hiring and recruitment, voice assistants, and spelling correction.

Our research group focuses on exploiting machine learning and deep learning techniques, combined with NLP features and other knowledge, to develop high-performance NLP applications. We also investigate methods to construct knowledge bases and taxonomies for specific NLP tasks, and to create large datasets for training NLP models. See the slides here for more detail.

Contact: Assoc. Prof. Le Thanh Huong, Email: huonglt@soict.hust.edu.vn

Research Directions

We exploit machine learning and deep learning techniques, together with NLP features, to research and develop NLP applications in the following directions:

  • Information extraction: We investigate several tasks, including named entity recognition, relation extraction, and event extraction.
  • Chatbot/question answering: Generating answers to questions based on different sources such as paragraphs, knowledge bases, and databases. Chatbots and question answering are used in many real-life applications such as customer service and study counseling. We address several problems in this research direction, including intent classification, slot tagging, question similarity, and dialog management.
  • Speech technologies: Focusing on expressive speech synthesis, speech synthesis using state-of-the-art methods, automatic speech recognition, speaker verification, and speaker identification.
  • Text summarization: Summarizing single or multiple documents, either by selecting important sentences or by generating new summaries with condensed content. We also look at query-based summarization, in which the answer is generated by summarizing all the documents returned by the query.
  • Sentiment analysis: Detecting positive/negative sentiment in text. Sentiment analysis is often used by businesses to detect sentiment in social data, and to understand customers.
  • Machine translation: We concentrate on several aspects: developing multilingual neural machine translation; improving system performance (accuracy, speed); dealing with low-resource languages; and automatically building corpora for training machine translation systems.
  • Plagiarism detection: Automatically identifying fragments of a suspicious document that were copied from other source documents. We are also concerned with cross-language plagiarism detection, where the source of the plagiarism is in a different language.
  • Vietnamese spelling correction: Spelling and grammatical errors make input texts difficult to understand, and training on such documents degrades model quality. Texts in real-world NLP problems often contain many typos, so data should be cleaned before use. We focus on correcting spelling errors in two data types: academic text and social data.

Research Problems

  • Synonym discovery from multiple sources: The project aims at discovering synonyms from multiple Web data sources. Synonyms take the form of various aliases of the same entity, or equivalent representations of attribute relationships. The main sources are records of user interaction with web search engines, such as web search logs; semi-structured data, such as web tables; and unstructured data, such as web documents.
  • Weakly supervised aspect extraction: The project aims at extracting domain aspects from user-generated content, which serves as an essential step in opinion mining. It tackles the bottleneck of data annotation by studying the paradigm of weak supervision empowered by neural representation and neural learning frameworks.
  • Weakly supervised taxonomy construction: A taxonomy is a scheme of classification that helps to organize and index knowledge. Generally, the development and maintenance of a taxonomy are labor-intensive tasks requiring significant resources and expertise. Our objective is to explore weak supervision to automate and accelerate this process while keeping manual work to a minimum.
  • Knowledge base construction from semi-structured documents: Today, our data universe is growing exponentially, and more than 70% of that data is unstructured or semi-structured (e.g. Word, PDF, and Excel files). Such data is commonly left untouched because it is not in the right form for data analytics software. Our objective is to develop natural language understanding methods to extract valuable information from semi-structured documents. We are then able to construct knowledge bases, which benefit further analytics and beyond.

Team Members

Assoc. Prof. Le Thanh Huong
Team Leader

Assoc. Prof. Nguyen Thi Kim Anh
Member

Dr. Nguyen Thi Thu Trang
Member

Dr. Nguyen Kiem Hieu
Member

Dr. Tran Viet Trung
Member

Post-doc and PhD Students

Ha Thi Thanh
PhD Student

Luu Minh Tuan
PhD Student

Projects and Solutions

Tools and Resources

Latest Publications

Publications in 2022

  1. Viet-Trung Tran; Van-Sang Tran; Xuan-Bang Nguyen; The-Trung Tran. A liveness detection protocol based on deep visual-linguistic alignment. International Conference on Knowledge and Systems Engineering (KSE). 19/10/2022
  2. Vinh Van Nguyen, Ha Nguyen, Huong Thanh Le, Thai Phuong Nguyen, Tan Van Bui, Luan Nghia Pham, Anh Tuan Phan, Cong Hoang-Minh Nguyen, Viet Hong Tran and Anh Huu Tran. KC4MT: A High-Quality Corpus for Multilingual Machine Translation. The 13th Language Resources and Evaluation Conference. 5494‑5502. Marseille, France. 20/06/2022
  3. Nguyen Thi Thu Trang, Dang Trung Duc Anh, Vu Quoc Viet and Park Woomyoung. Advanced Joint Model for Vietnamese Intent Detection and Slot Tagging. 8th EAI International Conference on Industrial Networks and Intelligent Systems (INISCOM 2022). 125-135. Danang, Vietnam. 21/04/2022
  4. Viet-Trung Tran, Hai-Nam Cao & Tuan-Dung Cao. A Practical Method for Occupational Skills Detection in Vietnamese Job Listings. Asian Conference on Intelligent Information and Database Systems. Ho Chi Minh, Vietnam. 28/11/2022
  5. Hanh Pham Van, Huong Le Thanh. Improving Khmer-Vietnamese Machine Translation with Data Augmentation methods. SoICT 2022: The 11th International Symposium on Information and Communication Technology. 276–282. Vietnam. 01/12/2022
  6. Thi-Thanh Ha, Van-Nha Nguyen, Kiem-Hieu Nguyen, Kim-Anh Nguyen, Quang-Khoat Than. Utilizing SBERT For Finding Similar Questions in Community Question Answering. 13th International Conference on Knowledge and Systems Engineering (KSE). 1-6. Bangkok, Thailand. 10/11/2021
  7. Hai-Nam Cao, Duc-Thai Do, Viet-Trung Tran, Tuan-Dung Cao & Young-In Song. Synonym Prediction for Vietnamese Occupational Skills. Lecture Notes in Computer Science. 351-362. 19/07/2022
  8. Bui Thi Mai Anh, Nguyen Thi Thu Trang, Tran Thi Dinh. A Novel Type-based Genetic Algorithm for Extractive Summarization. Thirty-Fifth International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems. 143-155. 19/07/2022
  9. Tuan Anh Phan, Ngoc Dung Nguyen, Huong Le Thanh, Khac-Hoai Nam Bui. Neural Inverse Text Normalization with Numerical Recognition for Low Resource Scenarios. ACIIDS 2022: Intelligent Information and Database Systems. 582–594. Ho Chi Minh city. 28/11/2022

Publications in 2021

  1. Ha Nguyen Tien, Dat Nguyen Huu, Huong Le Thanh, Vinh Nguyen Van and Minh Nguyen Quang. KC4Align: Improving Sentence Alignment Method for Low-resource Language Pairs. The 35th Pacific Asia Conference on Language, Information and Computation (PACLIC). 358-367. 05/11/2021
  2. Hai-Nam Cao, Viet-Trung Tran. Deep neural network based learning to rank for address standardization. The 2021 RIVF International Conference on Computing and Communication Technologies. 1-6. 19/08/2021
  3. Huong T. Le, Que X. Bui. Keyphrase Extraction Using PageRank and Word Features. RIVF (Research, Innovation and Vision for the Future). 257-261. 02/12/2021
  4. Thi-Trang Nguyen, Huu-Hoang Nguyen, Kiem-Hieu Nguyen. A Study on Seq2seq for Sentence Compression in Vietnamese. Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. 488-495. Hanoi, Vietnam. 24/10/2020
  5. Thi-Thanh Ha, Van-Nha Nguyen, Kiem-Hieu Nguyen, Kim Anh Nguyen, Tien-Thanh Nguyen. Utilizing Bert for Question Retrieval on Vietnamese E-commerce Sites. Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. 92-99. Hanoi, Vietnam. 24/10/2020
  6. Minh-Tuan Luu, Thanh-Huong Le, Minh-Tan Hoang. An Effective Deep Learning Approach for Extractive Text Summarization. IJCSE Indian Journal of Computer Science and Engineering. 434-444. 07/04/2021
  7. Huong T. Le, Dung T. Cao, Trung H. Bui, Long T. Luong and Huy Q. Nguyen. Improve Quora Question Pair Dataset for Question Similarity Task. RIVF (Research, Innovation and Vision for the Future). 279-283. 02/12/2021
  8. Dang Trung Duc Anh, Nguyen Thi Thu Trang. TDP – A Hybrid Diacritic Restoration with Transformer Decoder. The 34th Pacific Asia Conference on Language, Information and Computation (PACLIC 2020). 76-83. Hanoi, Vietnam. 24/10/2020
  9. Tuan Luu Minh, Huong Le Thanh, Tan Hoang Minh. A hybrid model using the pre-trained BERT and deep neural networks with rich feature for extractive text summarization. Journal of Computer Science and Cybernetics. 123-143. 05/06/2021
  10. Bùi Thị Mai Anh, Nguyễn Thị Thu Trang. A Feature-Augmented Deep Learning Model for Extractive Summarization. INISCOM 2021. Vol 379. Le Quy Don University, Hanoi, Vietnam. 22/04/2021
  11. Anh Son TA. Solving problem. NICS. 17/01/2021
  12. Thi Thu Trang Nguyen, Bui Thi-Mai-Anh, Tran Thi Dinh, Nguyen Thi Hoai. A Hybrid PSO-GA for Extractive Text Summarization. PACLIC 2021. 757-766. Shanghai, China. 04/11/2021
  13. Nguyen Van Son, Le Thanh Huong, Nguyen Chi Thanh. A two-phase plagiarism detection system based on multi-layer LSTM Networks. IAES International Journal of Artificial Intelligence. 636-648. 26/02/2021
  14. Nguyen Thi Thu Trang, Nguyen Hoang Ky, Albert Rilliard, Christophe D’Alessandro. Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech. The 22nd Conference of the International Speech Communication Association (Interspeech 2021). 3885-3889. Brno, Czech Republic. 30/08/2021
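  15. Luu Minh Tuan, Le Thanh Huong, Hoang Minh Tan. Một phương pháp kết hợp các mô hình học sâu và kỹ thuật học tăng cường hiệu quả cho tóm tắt văn bản hướng trích rút [An effective method combining deep learning models and reinforcement learning techniques for extractive text summarization]. Tạp chí Khoa học và Công nghệ Đại học Thái Nguyên. 208-215. 09/08/2021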
  16. Thi-Nhung Nguyen, Kiem-Hieu Nguyen, Young-In Song, Tuan-Dung Cao. An Uncertainty-Aware Encoder for Aspect Detection. Findings of the Association for Computational Linguistics: EMNLP 2021. 797–806. Punta Cana, Dominican Republic. 07/11/2021
  17. Tuan Minh Luu, Huong Thanh Le, Tan Minh Hoang. A hybrid model using the pretrained BERT and deep neural networks with rich feature for extractive text summarization. Journal of Computer Science and Cybernetics. 123-143. 13/05/2021