Natural Language Processing

Introduction

Natural language processing (NLP) is a subfield of linguistics and artificial intelligence that aims to help computers understand human language and interact with humans. With the rapid development of data science, NLP has made great progress in creating applications that benefit everyday life. Applications of NLP include machine translation, chatbots, social media monitoring, survey analysis, targeted advertising, hiring and recruitment, voice assistants, and spelling correction.

Our research group focuses on exploiting machine learning and deep learning techniques, combined with NLP features and other knowledge, to develop high-performance NLP applications. We also investigate methods to construct knowledge bases and taxonomies for specific NLP tasks, and to create large datasets for training NLP models. See the slides here for more details.

Contact: Assoc. Prof. Le Thanh Huong, Email: huonglt@soict.hust.edu.vn

Research Directions

We exploit machine learning and deep learning techniques, together with NLP features, to research and develop NLP applications in the following directions:

Information extraction: We investigate several tasks, including named entity recognition, relation extraction, and event extraction.
Chatbot/question answering: Generating answers to questions based on different sources such as paragraphs, knowledge bases, and databases. Chatbots and question answering are used in many real-life applications such as customer service and study counseling. We address different problems in this research direction, including intent classification, slot tagging, question similarity, and dialog management.
Speech technologies: Focusing on expressive speech synthesis, state-of-the-art speech synthesis, automatic speech recognition, speaker verification, and speaker identification.
Text summarization: Summarizing single or multiple documents, either by selecting important sentences or by generating new summaries with condensed content. We also look at query-based summarization, in which the answer is generated by summarizing all the documents returned by the query.

Sentiment analysis: Detecting positive or negative sentiment in text. Businesses often use sentiment analysis to detect sentiment in social data and to understand their customers.
Machine translation: We concentrate on several aspects: developing multilingual neural machine translation; improving system performance (accuracy and speed); dealing with low-resource languages; and automatically building corpora for training machine translation systems.
Plagiarism detection: Automatically identifying fragments in a suspicious document that were copied from other source documents. We are also concerned with cross-language plagiarism detection, where the source of the plagiarism is in a different language.
Vietnamese spelling correction: Spelling and grammatical errors make input texts difficult to understand. If such documents are used for training, model quality suffers. Real-world NLP problems often involve texts with many typos, so data should be cleaned before use. We focus on correcting spelling errors in two data types: academic text and social data.
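
As a rough illustration of the extractive approach mentioned under text summarization (selecting important sentences), the following is a minimal Python sketch. It is not the group's method: it assumes a naive punctuation-based sentence splitter and scores each sentence by the average frequency of its words in the document. The function names split_sentences, tokenize, and summarize are hypothetical and chosen only for this example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased word tokens; a real system would use a language-specific tokenizer.
    return re.findall(r"\w+", sentence.lower())


def summarize(text, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


if __name__ == "__main__":
    doc = "Cats chase mice. Cats sleep. Dogs bark loudly at night."
    print(summarize(doc, k=1))
```

Frequency-based scoring is only a simple baseline. Abstractive summarization, which creates new sentences, and query-based summarization would need a different design.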

Research Problems

Synonym discovery from multiple sources: The project aims to discover synonyms from multiple Web data sources. Synonyms take the form of various aliases of the same entity, or equivalent representations of attribute relationships. The main sources are user interactions with web search engines (such as web search logs), semi-structured data (such as web tables), and unstructured data (such as web documents).
Weakly supervised aspect extraction: The project aims to extract domain aspects from user-generated content, an essential step in opinion mining. It tackles the bottleneck of data annotation by studying the paradigm of weak supervision, empowered by neural representations and neural learning frameworks.
Weakly supervised taxonomy construction: A taxonomy is a classification scheme that helps to organize and index knowledge. Generally, developing and maintaining a taxonomy is a labor-intensive task requiring significant resources and expertise. Our objective is to explore weak supervision to automate and accelerate the process while keeping manual work to a minimum.
Knowledge base construction from semi-structured documents: Today, our data universe is growing exponentially, and more than 70% of the data is unstructured or semi-structured (e.g. Word, PDF, and Excel files). This data is commonly left untouched because it is not in the right form for data analytics software. Our objective is to develop natural language understanding methods to extract valuable information from semi-structured documents. We are then able to construct knowledge bases, which benefit further analytics and beyond.

Team Members

Assoc. Prof. Le Thanh Huong
Team Leader

Assoc. Prof. Nguyen Thi Kim Anh
Member

Dr. Nguyen Thi Thu Trang
Member

Dr. Nguyen Kiem Hieu
Member

Dr. Tran Viet Trung
Member

Post-doc and PhD Students

Ha Thi Thanh
PhD Student

Luu Minh Tuan
PhD Student

Projects and Solutions

Yourway.vn: An Online System for Labor Market Data Collection and Analysis

COOPY for Anti-plagiarism

Tools and Resources

vi_spacy: A Vietnamese language model for spaCy

pyvi: A Python Vietnamese Toolkit

Latest Publications

Publications in 2025

Tho Tran Duc; Huy Nguyen Trong; Huong Le Thanh. Improving Quality of Vietnamese to Khmer Neural Machine Translation Using Multi-stage Fine-Tuning Strategy. Information and Communication Technology. 69-79. Đà Nẵng, Việt Nam. 12/12/2024
Nguyen Hoang-Long, Tran Viet-Trung. BKCrawler: A Scalable Web Data Extraction System Using Weak Supervision. Communications in Computer and Information Science. 16-26. 12/12/2024
Cui Wei, Ullah Ismat, Lin Weiming, Zhang Jupei, Chen Zhaowei, Yang Shuyi, Peng Wei, Zhuang Yin, Chen Wenjin, Cao Yi, Zhang Shujun, Jin Shengyang, Yang Liang. Multifunctional Sr²⁺/Zn²⁺ Co‐Doped Mesoporous Silica Nanoparticles in Injectable Hydrogel for Ameliorating Osteoporotic Osseointegration. Advanced Healthcare Materials. 15/06/2025
Thai Nguyen-Quoc; Hoan Nguyen-Cong; Huong Le-Thanh. Contrastive Perturbation Enhancement for LLM-Based Machine Translation. Information and Communication Technology. 273–283. Da Nang city, Vietnam. 12/12/2024
Ba Thiem Nguyen, Thanh Binh Nguyen, Tran Viet-Trung. CoverNexus: Multi-agent LLM System for Automated Code Coverage Enhancement. Communications in Computer and Information Science. 472-484. 12/12/2025
Tu Huu Tuong, Thanh Long Luong, Huan Vu, Phuong Thao Nguyen Thi, Van Thang Nguyen, Cuong Nguyen Tien, Thi Thu Trang Nguyen. Voice Conversion for Low-Resource Languages via Knowledge Transfer and Domain-Adversarial Training. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1-5. India. 05/04/2025
Tu Huu Tuong, Vu Huan, Nguyen Cuong Tien, Ngo Dien Hy, Trang Nguyen Thi Thu. O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion. Findings of the Association for Computational Linguistics: EMNLP 2025. 16197-16208. Suzhou, China. 04/11/2025
Le Huong, Luu Ngoc, Nguyen Thanh, Dao Tuan, Dinh Sang. Optimizing Answer Generator in Vietnamese Legal Question Answering Systems Using Language Models. ACM Transactions on Asian and Low-Resource Language Information Processing. 1-17. 12/02/2025
Viet Ngo Q., Huong Le T. Open-domain Named Entity Recognition for Low Resource Languages: A Case Study on Vietnamese. Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation. 762–772. Tokyo, Japan. 07/12/2024
Hai Nguyen T. and Huong Le T. Enhancing ColBERT: A Method for Reducing Space Complexity and Accelerating Retrieval Speed. Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation. 820-829. Tokyo, Japan. 07/12/2024
Hoang Long Vu, Phuong Tuan Dat, Pham Thao Nhi, Nguyen Song Hao, Nguyen Thi Thu Trang. VoxVietnam: a Large-Scale Multi-Genre Dataset for Vietnamese Speaker Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1-5. India. 05/04/2025

Publications in 2024

Thang Duc Phan and Huong Thanh Le. Utilize Pre-Trained PhoBERT to Compute Text Similarity and Rerank Documents for Question-Answering Task. 12th International Conference on Control, Automation and Information Sciences (ICCAIS). 200-205. Hanoi. 27/11/2023
Sikandar Ali Qalati, Domitilla Magni, and Faiza Siddiqui. Senior Management's Sustainability Commitment and Environmental Performance: Revealing the Role of Green Human Resource Management Practices. Business Strategy and the Environment. 02/08/2024
T. K. Lai, and I. L. Ngo. A new design and optimization of VD-ECF micro-pump: Advancements in electrohydraulic performance. Physics of Fluids. 29/07/2024
T. K. Lai, and I. L. Ngo. An investigation on the thermo-electrohydraulic performance of novel ECF micro-pump. International Journal of Heat and Mass Transfer. 29/09/2024
Thi-Nhung Nguyen, Bang Tien Tran, Trong-Nghia Luu, Thien Huu Nguyen, Kiem-Hieu Nguyen. BKEE: Pioneering Event Extraction in the Vietnamese Language. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2421-2427. Torino, Italia. 20/05/2024
Sikandar Ali Qalati, MengMeng Jiang, Samuel Gyedu, and Emmanuel Kwaku Manu. Do Strong Innovation Capability and Environmental Turbulence Influence the Nexus Between Customer Relationship Management and Business Performance? Business Strategy and the Environment. 02/07/2024
T. K. Lai, and I. L. Ngo. An investigation on the electrohydraulic performance of novel ECF micro-pump with NACA-shaped electrodes. Theoretical and Computational Fluid Dynamics. 29/02/2024
Tuyen Tran, Khanh Le, Ngoc Dang Nguyen, Minh Vu, Huyen Ngo, Woomyoung Park, Thi Thu Trang Nguyen. VN-SLU: A Vietnamese Spoken Language Understanding Dataset. INTERSPEECH 2024. 1335-1339. Kos, Greece. 01/09/2024
Pham Viet Thanh, Ngo Thi Thu Huyen, Pham Ngoc Quan, Nguyen Thi Thu Trang. A Robust Pitch-Fusion Model for Speech Emotion Recognition in Tonal Languages. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 12386-12390. Seoul, Republic of Korea. 11/04/2024
JYE Tin, WW Tan, AA Bakar, MS Mahali, FF Lothai, NF Mohammad, SSA Hassan & KF Chin. A Conceptual Design of Sustainable Solar Photovoltaic (PV) Powered Corridor Lighting System with IoT Application. ICREEM 2022. 09/03/2024
Trinh Thi Ha, Nguyen Trung Dung, Nguyen Thanh Huong, Tran Trong An, Pham Van Tuan, Vu Ngoc Hung, Chu Manh Hoang. Investigating the coupling length of two triangle hybrid gap plasmonic waveguides. The International Conference on Advanced Materials and Technology (ICAMT 2024). 10-13. Hanoi. 09/10/2024
Vu Hoang, Viet Thanh Pham, Hoa Nguyen Xuan, Pham Nhi, Phuong Dat, Thi Thu Trang Nguyen. VSASV: a Vietnamese Dataset for Spoofing-Aware Speaker Verification. INTERSPEECH 2024. 4288-4292. Kos, Greece. 01/09/2024
