TY - JOUR
T1 - Paraphrase thought
T2 - Sentence embedding module imitating human language recognition
AU - Jang, Myeongjun
AU - Kang, Pilsung
N1 - Funding Information:
This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (No. NRF-2019R1F1A1060338) and Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008691, The Competency Development Program for Industry Specialist).
Publisher Copyright:
© 2020 Elsevier Inc.
PY - 2020/12
Y1 - 2020/12
AB - Sentence embedding is an important research topic in natural language processing. Generating an embedding vector that fully reflects the semantic meaning of a sentence is essential for achieving strong performance on various natural language processing tasks, such as machine translation and document classification. Various sentence embedding models have been proposed to date, and their feasibility has been demonstrated through good performance on downstream tasks such as sentiment analysis and sentence classification. However, because performance on sentence classification and sentiment analysis can also be improved with simple sentence representation methods, good results on such tasks are not sufficient to claim that these models fully capture the meanings of sentences. In this paper, inspired by human language recognition, we propose a property that a good sentence embedding method should satisfy, termed semantic coherence: similar sentences should be located close to each other in the embedding space. We then propose the Paraphrase-Thought (P-thought) model, which pursues semantic coherence as directly as possible. Experimental results on three paraphrase identification datasets (MS COCO, STS benchmark, SICK) show that the P-thought models outperform the benchmarked sentence embedding methods.
KW - Natural language processing
KW - Paraphrase
KW - Recurrent neural network
KW - Semantic coherence
KW - Sentence embedding
UR - http://www.scopus.com/inward/record.url?scp=85087589799&partnerID=8YFLogxK
DO - 10.1016/j.ins.2020.05.129
M3 - Article
AN - SCOPUS:85087589799
SN - 0020-0255
VL - 541
SP - 123
EP - 135
JO - Information Sciences
JF - Information Sciences
ER -