Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim

Research output: Contribution to journal › Article › peer-review

4 Citations (Scopus)

Abstract

Machine translation (MT) systems aim to translate text from a source language into a target language. Recent studies on MT mainly focus on neural machine translation (NMT). One factor that significantly affects NMT performance is the availability of high-quality parallel corpora. However, high-quality parallel corpora involving Korean are scarce compared with those available for high-resource languages such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora using Linguistic Inquiry and Word Count (LIWC) and several related experiments. LIWC is a dictionary-based word-counting program that can analyze corpora in multiple ways and extract linguistic features. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our correlation analysis between LIWC features and NMT performance suggests directions for further research toward obtaining higher-quality parallel corpora.
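As a rough illustration of the two ideas named in the abstract, dictionary-based word counting and correlation with translation quality, the following Python sketch may help. It is not the authors' pipeline: the toy dictionary, its category names, and the function names `category_rates` and `pearson` are hypothetical, the real LIWC lexicon is not reproduced, and the abstract does not say which correlation coefficient was used, so Pearson's r is assumed here.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical toy dictionary (category -> words). This is NOT the LIWC lexicon.
TOY_DICTIONARY = {
    "pronoun": {"i", "you", "we", "they"},
    "negation": {"no", "not", "never"},
}


def category_rates(text, dictionary=TOY_DICTIONARY):
    """Return, for each category, the percentage of tokens that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in dictionary
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In use, one would compute a feature rate per corpus with `category_rates`, pair each rate with a translation-quality score measured on that corpus, and pass the two lists to `pearson`; a value near 1 or -1 would indicate a strong linear relationship between that feature and the score.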

Original language: English
Article number: 5545
Journal: Applied Sciences (Switzerland)
Volume: 12
Issue number: 11
Publication status: Published - 2022 Jun 1

Bibliographical note

Funding Information:
This study was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2018-0-01405) supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP); by an IITP grant funded by the Korean government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques); and by MSIT, Korea, under the ICT Creative Consilience program (IITP-2021-2020-0-01819) supervised by the IITP.

Publisher Copyright:
© 2022 by the authors. Licensee MDPI, Basel, Switzerland.

Keywords

  • AI Hub
  • Korean–English neural machine translation
  • neural machine translation
  • parallel corpus
  • transformer

ASJC Scopus subject areas

  • General Materials Science
  • Instrumentation
  • General Engineering
  • Process Chemistry and Technology
  • Computer Science Applications
  • Fluid Flow and Transfer Processes
