Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Tra


Huu-anh Tran, Yuhang Guo, Ping Jian, Shumin Shi and Heyan Huang

(1.Department of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; 2.Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Application, Beijing Institute of Technology, Beijing 100081, China)

Large-scale parallel corpora play an important role in language study and statistical machine translation (SMT) research. They mainly provide training data for translation models[1-2] and serve as resources for automatic lexical acquisition and enrichment[3-4] as well as for grammar induction[5].

Previous research has focused on building bilingual corpora from the Internet, but few of these studies involved Vietnamese. Ref.[6] collected French-Vietnamese bilingual data from the Vietnam News Agency website and aligned the documents through a series of filtering steps (publishing date, special words and sentence alignment results). They used the Champollion toolkit to align the sentences and eventually obtained 12 100 pairs of bilingual documents and 50 300 bilingual sentence pairs in the news domain. Ref.[7] designed a Chinese-Vietnamese bilingual parallel corpus management platform and collected more than 110 000 Chinese-Vietnamese sentence pairs from the Internet; 53 000 words in several fields were aligned manually in their work. To extract an English-Vietnamese bilingual corpus from the web, Ref.[8] designed two content-based features: cognates and translation segments. These content-based features, together with the structure of the web pages, were fed into a machine learning model to extract the bilingual texts. For bilingual e-book sources, they used some forms of linkage between the blocks of text in the two languages. Three steps were adopted: pre-processing, paragraph alignment and sentence alignment.

All the above works have one feature in common: they have to search for websites containing bilingual data, and then collect, process and align the paragraphs, sentences and words. These methods may produce a high-quality corpus, but it is difficult to construct a bilingual corpus on a large scale with them.

In addition, several sentence-alignment tools have been published, such as those in Refs.[9-12]. However, none of them supports sentence alignment between Vietnamese and other languages.

Fortunately, there is a kind of large-scale multilingual corpus that is ever growing and easily accessible on-line as a resource for SMT, i.e., movie subtitles. It is easy to obtain a movie or TV subtitle bilingual corpus on the scale of a million sentences in various language pairs, and the best part is that the data possess a natural rough alignment at the sentence level. Refs.[13-15] worked in this direction, and all the sources adopted in these works were obtained from Open-Subtitles.

Open-Subtitles has been released in several versions (2011, 2012, 2013 and 2016), the latest being Open-Subtitles-2016. From a linguistic perspective, subtitles cover a wide and interesting breadth of genres, from colloquial language or slang to narrative and expository discourse (e.g. documentaries). Open-Subtitles-2016 includes a total of 1 689 bi-texts extracted from a collection of subtitles containing 2.6 billion sentences (17.2 billion tokens) distributed over 60 languages[16].

Ref.[16] pre-processed the raw subtitle files to generate XML subtitle files, and then performed a cross-lingual alignment to generate XML alignment files (1 XML file per language pair, encoded as a collection of alignments). Finally, they produced a bilingual corpus in the Moses format.

Although the amount of this kind of data is very large, it still has many problems, such as sentence mismatching, translation errors, free translations, font errors, etc. In order to take full advantage of this large-scale subtitle corpus, three filtering methods are introduced to pick out the sentence pairs of good quality: ① the sentence length difference, ② the semantic similarity, and ③ machine learning.

The rest of this paper is organized as follows. Section 1 presents our proposed model. Section 2 presents a filtering method based on the sentence length. Section 3 presents a filtering method based on machine translation references (using semantic similarity and machine learning methods). Section 4 describes our experiments and experimental results. Finally, section 5 gives the conclusions and future work directions.

1 Proposed Model

The raw data in this paper are obtained from Open-Subtitles-2016 (http://opus.lingfil.uu.se/OpenSubtitles2016.php). To clean up the noise in this corpus, pre-processing is conducted as follows: ① remove unnecessary symbols such as §, #, [, ], ※, -, *, @, 「, 」; ② remove the sentence pairs that contain font errors; ③ remove the sentences that contain English words; and ④ convert traditional Chinese characters to simplified Chinese characters. After pre-processing, a baseline corpus called C0-corpus is obtained. Fig.1 illustrates the architecture of the system, in which pre-processing is the first part. The remaining parts of the system are described in detail in this section.
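The four pre-processing operations can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the Unicode replacement character is taken as the sign of a font error, a run of two or more ASCII letters on the Chinese side is taken as an English word (the Vietnamese side uses Latin letters itself, so it is not checked), and `to_simplified` is a hypothetical hook for any traditional-to-simplified converter.

```python
import re
from typing import Callable, Iterable, Iterator, Tuple

# Symbols listed in the text; the exact set used by the authors is not known.
UNWANTED_SYMBOLS = "§#[]※-*@「」"
_STRIP_TABLE = str.maketrans("", "", UNWANTED_SYMBOLS)

# Assumption: two or more consecutive ASCII letters count as an English word.
_LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

# Assumption: a font (encoding) error shows up as U+FFFD.
_REPLACEMENT_CHAR = "\ufffd"


def clean_text(text: str) -> str:
    """Delete the unwanted symbols and collapse runs of whitespace."""
    return " ".join(text.translate(_STRIP_TABLE).split())


def preprocess(
    pairs: Iterable[Tuple[str, str]],
    to_simplified: Callable[[str], str] = lambda s: s,  # hypothetical hook
) -> Iterator[Tuple[str, str]]:
    """Yield cleaned (Chinese, Vietnamese) pairs, skipping noisy ones."""
    for zh, vi in pairs:
        if _REPLACEMENT_CHAR in zh or _REPLACEMENT_CHAR in vi:
            continue  # font error somewhere in the pair
        if _LATIN_WORD.search(zh):
            continue  # English word on the Chinese side
        zh, vi = to_simplified(clean_text(zh)), clean_text(vi)
        if zh and vi:  # drop pairs that became empty after cleaning
            yield zh, vi
```

The identity default for `to_simplified` keeps the sketch free of third-party dependencies; a real converter would be passed in by the caller.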

C0-corpus is obtained through the pre-processing stage. Next, the first filtering step, which is based on a Chinese-Vietnamese dictionary, removes all sentence pairs whose lengths differ greatly, and C1-corpus is obtained. However, C1-corpus still contains sentences with very poor translation quality, as well as translated sentences that do not match the meaning of the original sentences. We therefore apply a further filter that removes these poor-quality translations by using a machine translation reference. Since the number of sentence pairs in C1-corpus is still too large, we randomly take 2 000 and 5 000 representative sentence pairs from C1-corpus to find the most appropriate threshold values, as described below. This is done by combining two methods, namely automatic labelling (based on the measures Cosine, Jaccard, Dice and smoothed-BLEU (Smoothing 3)[17]) and manual labelling. For each measure, a sentence pair C1-V1 in C1-corpus is kept and placed in C2-corpus if the semantic similarity of V1 and V1-google (V1-google is translated from C1 by using https://translate.google.com/) is greater than or equal to the threshold value. Moreover, the measures from the above steps are also used as features in machine learning methods (support vector machine, SVM, and logistic regression, LR) for classification. In the classification, Yes denotes good-quality sentence pairs and No denotes sentence pairs to be removed. The sentence pairs classified as Yes form the C2-corpus.

Fig.1 Architecture of corpus filtering system

2 Filtering Based on Sentence Length Difference

2.1 Characteristics of Vietnamese and Chinese

In terms of language typology, Chinese and Vietnamese both belong to the isolating type, so there are some similarities between them. The basic unit of Vietnamese is the syllable. In writing, syllables are separated from each other by white space. White space alone cannot be used to determine word boundaries because a word may consist of one or more syllables. Tab.1 presents an example of a Vietnamese sentence and its corresponding Chinese one segmented into syllables and words.

Tab.1 Example of a Chinese-Vietnamese sentence pair segmented into syllables and words

Given these typological similarities, Chinese sentences and their Vietnamese translations tend to have proportional lengths. Therefore, the lengths of the two sentences in a pair are a good criterion for determining whether they are accurate translations of one another.

2.2 Sentence relative length based filtering

As mentioned above, sentence length is a good criterion for determining whether two sentences are accurate translations of each other, and length mismatches are easy to spot in the bilingual corpus. Tab.2 shows examples of sentence alignment errors in C0-corpus.

In detail: ① one Vietnamese sentence is aligned with several Chinese sentences (1st row); ② one Chinese sentence is aligned with several Vietnamese sentences (2nd row); ③ a short Chinese sentence is aligned with a very long Vietnamese sentence (3rd row); ④ a long Chinese sentence is aligned with a very short Vietnamese sentence (4th row). These errors cause a large difference between the lengths of the Chinese and Vietnamese sentences in a pair, which indicates that they are not proper translations of each other.

Tab.2 Examples of sentence alignment errors in C0-corpus

Here, a Chinese-Vietnamese dictionary with 340 000 word pairs is taken as the reference, and the average length difference of the Chinese-Vietnamese word pairs is calculated by the following formula:


where $dic_{threshold}$ is the threshold value of the dictionary, $l_{ch\_word}$ is the length of a Chinese word, $l_{vi\_word}$ is the length of the corresponding Vietnamese word, and $l_{dic}$ is the number of word pairs in the dictionary.

In C0-corpus, the difference of each sentence pair is calculated by


where $dif(ch, vi)$ is the length difference of a Chinese-Vietnamese sentence pair, $l_{ch\_sen}$ is the length of the Chinese sentence, $l_{vi\_sen}$ is the length of the Vietnamese sentence, and $\min(l_{ch\_sen}, l_{vi\_sen})$ is the smaller of $l_{ch\_sen}$ and $l_{vi\_sen}$.

Sentence pairs whose $dif(ch, vi)$ is greater than $dic_{threshold}$ are removed from C0-corpus. To estimate the impact of sentence length on the quality of C1-corpus, several threshold values are tried: the original $dic_{threshold}$ value is multiplied by different coefficients (1, 2, …, 6, as described in section 4.2.1) to obtain new threshold values. For each new threshold value, a corresponding C1-corpus is obtained from C0-corpus.
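The length-based filter can be illustrated with the sketch below. The two formulas are not reproduced in this text, so their form here is an assumption: the difference of a pair is taken as |l_ch - l_vi| / min(l_ch, l_vi), and the dictionary threshold as the average of the same quantity over all dictionary entries. Measuring Chinese length in characters and Vietnamese length in whitespace-separated syllables is likewise an assumption, and all function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (Chinese text, Vietnamese text)


def relative_length_difference(len_a: int, len_b: int) -> float:
    """|len_a - len_b| / min(len_a, len_b); assumed form of dif(ch, vi)."""
    shorter = min(len_a, len_b)
    if shorter == 0:
        return float("inf")  # an empty side can never match
    return abs(len_a - len_b) / shorter


def pair_difference(zh: str, vi: str) -> float:
    # Assumption: Chinese length in characters, Vietnamese in syllables.
    return relative_length_difference(len(zh), len(vi.split()))


def dictionary_threshold(word_pairs: Iterable[Pair]) -> float:
    """Average difference over dictionary entries (assumed form of Eq.(1))."""
    diffs = [pair_difference(zh, vi) for zh, vi in word_pairs]
    if not diffs:
        raise ValueError("the dictionary must contain at least one word pair")
    return sum(diffs) / len(diffs)


def filter_by_length(
    pairs: Iterable[Pair], threshold: float, coefficient: int = 1
) -> List[Pair]:
    """Keep the pairs whose difference does not exceed threshold * coefficient."""
    limit = threshold * coefficient
    return [(zh, vi) for zh, vi in pairs if pair_difference(zh, vi) <= limit]
```

Calling `filter_by_length(c0_pairs, 0.236, coefficient=3)` would correspond to the level 3 setting discussed in section 4.2.1, with `c0_pairs` standing for the C0-corpus.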

3 Filtering Based on Machine Translation Reference

After the previous step, although we have collected sentence pairs in the C1-corpus that have similar sentence lengths, these pairs still contain several mis-translated ones as exemplified below.

Tab.3 shows some common mis-translations: ① sentence pairs in line 1 and line 2 that are aligned but differ completely in meaning; ② mis-alignment, i.e., the Chinese sentence in line 3 is aligned to the Vietnamese sentence in line 4 and the Chinese sentence in line 4 is aligned to the Vietnamese sentence in line 5, which also leads to ③ free translation, i.e., the Vietnamese sentence in line 3 and the Chinese sentence in line 5 have no matching translation aligned to them.

Tab.3 Examples of mis-translated sentences in C1-corpus

Therefore, this step aims to eliminate sentence pairs with translation errors, sentence mismatches, free translations, etc., relying on semantic criteria.

3.1 Filtering based on semantic similarity

As the C1-corpus is very large, it is infeasible to evaluate the translation quality of every sentence pair manually. Instead, we first randomly sample 2 000 sentence pairs to represent C1-corpus. They are denoted as 2K_ch and 2K_vi and are manually labelled: Yes for good-quality sentence pairs and No for bad-quality sentence pairs that will be removed. These 2 000 sentence pairs are used as a training data set.

First, Google Translate (https://translate.google.com/) is utilized to translate the extracted 2 000 Chinese sentences (2K_ch) to Vietnamese (2K_google). Then, the semantic similarity of the 2 000 sentence pairs (2K_vi, 2K_google) is measured by using the measures Cosine, Jaccard, Dice and smoothed-BLEU.

For each measure, the most appropriate threshold value is identified to separate good-quality from bad-quality sentence pairs in C1-corpus. Only sentence pairs (C1-V1) for which the semantic similarity of V1 and V1-google is greater than or equal to the threshold value of that measure are kept in C2-corpus.

Each sentence pair of the 2K_vi-2K_google set has a similarity of 0 if the two Vietnamese sentences are semantically completely different and of 1 if they are identical. The range of the similarity is thus [0, 1].
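Three of the four measures can be sketched as follows. This is an illustrative sketch that assumes whitespace tokenization (Vietnamese syllables), term-frequency vectors for Cosine and token sets for Jaccard and Dice; the text does not state which representation the authors used, and smoothed-BLEU is omitted here.

```python
import math
from collections import Counter
from typing import List


def tokens(sentence: str) -> List[str]:
    """Lower-cased whitespace tokens, i.e. Vietnamese syllables (assumption)."""
    return sentence.lower().split()


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a: str, b: str) -> float:
    """Size of the intersection over size of the union of the token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a: str, b: str) -> float:
    """Twice the intersection size over the sum of the two set sizes."""
    sa, sb = set(tokens(a)), set(tokens(b))
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0
```

All three functions return 0 for sentences with no token in common and 1 for identical sentences, which matches the [0, 1] range described above.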

In this step, we search for the most appropriate threshold value for each measure, i.e., the value at which $F_1$ reaches its maximum. Taking the Cosine measure as an example, the following table is considered (the other measures are handled similarly).

As the threshold value lies in [0, 1], the initial temporary threshold is set at $t_0$ (= 0.0) and the final threshold at 1. For the initial temporary threshold $t_0$, the corresponding values in the $T_{t_0}$ and $N_{t_0}$ columns are calculated as in Tab.4, where $i$ is the index of a manually labelled sentence pair, $i = 1, 2, …, 2 000$.

Tab.4 Cosine measure of threshold value

Here, sentence pairs whose Cosine similarity is greater than $t_0$ have $T_{t_0}$ set to Yes, and No otherwise. The $N_{t_0}$ column combines the Manual labelling column with the $T_{t_0}$ column: $N_{t_0}$ is set to Yes when both of these columns are Yes, and No otherwise. For the other temporary threshold values (from $t_1$ to 1), the corresponding values in the $T_{t_1}$ and $N_{t_1}$ columns, etc., are calculated similarly.



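Finally, we calculate the Precision ($P$), Recall ($R$), $F_1$ and the threshold value for the Cosine measure at the temporary threshold $t_0$ as follows:

$$P_{t_0} = \frac{|\{i : N_{t_0}(i) = \text{Yes}\}|}{|\{i : T_{t_0}(i) = \text{Yes}\}|}$$

$$R_{t_0} = \frac{|\{i : N_{t_0}(i) = \text{Yes}\}|}{|\{i : \text{Manual}(i) = \text{Yes}\}|}$$

$$F_{1,t_0} = \frac{2 P_{t_0} R_{t_0}}{P_{t_0} + R_{t_0}}$$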

$Cosine_{threshold}$ is calculated as:

$$Cosine_{threshold} = \arg\max_{t_j} F_{1,t_j}$$

where $j = 0, 1, 2, …, 10$ indexes the temporary threshold values from 0.0 to 1, the threshold being increased by 0.1 at each iteration. If a finer step is used, with the initial temporary threshold written as 0.00 or 0.000 (i.e., $j = 0, 1, 2, …, 100$ or $j = 0, 1, 2, …, 1 000$), the threshold is increased by 0.01 or 0.001 at each iteration, respectively.

$Cosine_{threshold}$ is the threshold value at which $F_1$ reaches its maximum when the manual and automatic labelling methods are combined. The $Jaccard_{threshold}$, $Dice_{threshold}$ and smoothed-$BLEU_{threshold}$ values are calculated similarly.
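The threshold search described above can be sketched as a sweep over candidate thresholds that keeps the one with the highest F1. This is an illustrative sketch, not the authors' code: Precision, Recall and F1 follow their standard definitions, a pair is predicted Yes when its similarity is greater than or equal to the threshold (the rule used for keeping pairs in C2-corpus), and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    similarities: Sequence[float], manual_labels: Sequence[bool], threshold: float
) -> Tuple[float, float, float]:
    """Score the automatic Yes/No labelling against the manual labels."""
    predicted = [s >= threshold for s in similarities]  # the T column
    # The N column: Yes only when both the manual and automatic labels are Yes.
    true_pos = sum(1 for p, m in zip(predicted, manual_labels) if p and m)
    pred_pos = sum(predicted)
    actual_pos = sum(manual_labels)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(
    similarities: Sequence[float], manual_labels: Sequence[bool], steps: int = 10
) -> Tuple[float, float]:
    """Try t_j = j / steps for j = 0..steps; return (threshold, F1) at max F1."""
    best_t, best_f1 = 0.0, -1.0
    for j in range(steps + 1):
        t = j / steps
        _, _, f1 = precision_recall_f1(similarities, manual_labels, t)
        if f1 > best_f1:  # strict comparison keeps the smallest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Setting `steps` to 100 or 1 000 gives the finer increments of 0.01 or 0.001 mentioned in the text.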

In order to test the stability of the classifier, we similarly sample 5 000 sentence pairs (5K_ch, 5K_vi) at random from the C1-corpora to find the threshold value for each measure. Tab.5 lists the corresponding parameters of each measure for the 2 000 and 5 000 sentence pairs. These values are the criteria for deciding whether a new sentence pair is labelled Yes or No.

Tab.5 Each measure and their parameters

3.2 Filtering based on machine learning

Above, we manually labelled (Yes, No) 2 000 sentence pairs (2K_ch, 2K_vi) and 5 000 sentence pairs (5K_ch, 5K_vi), and calculated the similarity of (2K_vi, 2K_google) and (5K_vi, 5K_google) using the Cosine, Jaccard, Dice and smoothed-BLEU measures (2K_google and 5K_google are translated from 2K_ch and 5K_ch by using https://translate.google.com/). In this step, we build SVM and LR classifiers based on these labelled data and measures. The classification features are as follows:

SVM: Cosine, Jaccard, Dice, smoothed-BLEU, $abs_{dif}$, $rel_{dif}$

LR: Cosine, Jaccard, Dice, smoothed-BLEU, $abs_{dif}$, $rel_{dif}$

where $abs_{dif}$ and $rel_{dif}$ are calculated using the formulas:




where $l_{ch}$ is the length of the Chinese sentence and $l_{vi}$ is the length of the Vietnamese sentence; $abs_{dif}$ and $rel_{dif}$ are the absolute and relative length differences of each Chinese-Vietnamese sentence pair in the C1-corpus.
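The six-dimensional feature vector shared by the two classifiers can be assembled as in the sketch below. The formulas for the two length features are not reproduced in this text, so their form is an assumption: the absolute difference is taken as |l_ch - l_vi| and the relative difference as that value divided by min(l_ch, l_vi). The length units (Chinese characters, Vietnamese syllables), the dictionary keys and the function names are also hypothetical.

```python
from typing import Dict, List, Tuple

# Order of the similarity features in the vector (hypothetical key names).
SIMILARITY_KEYS = ("cosine", "jaccard", "dice", "bleu")


def length_features(zh: str, vi: str) -> Tuple[float, float]:
    """Return (absolute, relative) length difference under the stated assumptions."""
    l_ch = len(zh)          # Chinese length in characters (assumption)
    l_vi = len(vi.split())  # Vietnamese length in syllables (assumption)
    if l_ch == 0 or l_vi == 0:
        raise ValueError("both sentences must be non-empty")
    abs_dif = abs(l_ch - l_vi)
    rel_dif = abs_dif / min(l_ch, l_vi)
    return float(abs_dif), rel_dif


def feature_vector(zh: str, vi: str, similarities: Dict[str, float]) -> List[float]:
    """Cosine, Jaccard, Dice, smoothed-BLEU, then the two length features."""
    abs_dif, rel_dif = length_features(zh, vi)
    return [similarities[key] for key in SIMILARITY_KEYS] + [abs_dif, rel_dif]
```

Vectors built this way, paired with the manual Yes/No labels, could then be passed to any SVM or LR implementation for training.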

For each C1-corpus, corresponding to one threshold value derived from Eq.(1), the classifiers label its sentence pairs Yes or No, and the pairs labelled Yes form the respective C2-corpus.

4 Experiments

4.1 Data setting

We redraw the subsequent steps of the method in Fig.2 and distinguish the data after each filtering step.

The initial corpus includes 1 120 000 sentence pairs. After pre-processing (see section 1) and manually setting aside 5 000 relatively good sentence pairs, we obtained C0-corpus, which consists of 997 424 sentence pairs. The 5 000 extracted sentence pairs were combined with another 5 000 sentence pairs collected from vietnamese.cri.cn to create develop_set and test_set.

For the C1-corpus (C1-V1), Google Translate is used to translate the Chinese sentences (C1) into Vietnamese sentences (V1-google), and the measures Cosine, Jaccard, Dice and smoothed-BLEU are then used to measure the similarity of the sentence pairs (V1, V1-google). In the experiments below, for each threshold value derived from Eq.(1), we extract from C0-corpus the (Chinese, Vietnamese) sentence pairs selected under each measure (Cosine, Jaccard, Dice, smoothed-BLEU) to create the training data sets for C1-corpus and C2-corpus.

Next, 2 000 and 5 000 pairs are randomly extracted from the C1-corpus (see section 3.1) to build training_set for the next experiments. Note that the experiments differ only in the training_set; develop_set and test_set are identical across all of them. Thus, the corpus preparation includes 6 categories, listed in Tab.6.

Tab.6 Data sets for the experiments

In this study, Moses[18] is used as the decoder, SRILM is applied to build the language model, GIZA++ is used for word alignment, and BLEU is used to score the translations.

4.2 Experimental results

In order to empirically evaluate the quality of the C0-corpus, C1-corpora and C2-corpora, we used them as the training sets for an SMT system based on Moses. First, an SMT system is built to evaluate C0-corpus by the BLEU score. With the training set of 997 424 sentence pairs, the C0-BLEU result is 18.78 (baseline).

4.2.1 C1-BLEU scores

Filtering by the different thresholds in the filtering step based on sentence length, different C1-corpora can be obtained from C0-corpus. Tab.7 shows the number of sentence pairs of each corpus and the corresponding BLEU score.

Tab.7 Change of BLEU score with their threshold

The initial threshold calculated by Eq. (1) is 0.236 (level 1). Then this threshold is multiplied with the coefficients from 2 to 6, corresponding to the threshold values from 0.472 to 1.416. For convenience, the levels (1, 2, …,6) are introduced to indicate the different C1-corpora below.
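The six threshold levels follow directly from multiplying the initial value by the coefficients 1 to 6. The short sketch below reproduces that arithmetic; the variable names are hypothetical.

```python
# Initial threshold reported in the text for Eq.(1).
BASE_THRESHOLD = 0.236

# Level k uses the initial threshold multiplied by the coefficient k.
levels = {level: round(BASE_THRESHOLD * level, 3) for level in range(1, 7)}

for level, value in levels.items():
    print(f"level {level}: threshold = {value}")
```

The resulting values are 0.236, 0.472, 0.708, 0.944, 1.18 and 1.416, which agrees with the range from 0.472 to 1.416 given for levels 2 to 6.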

Fig.3 shows that, at level 1, the C1-BLEU score is 18.41, which is lower than the baseline C0-BLEU score (18.78). This is because overly strict filtering discards too many translation pairs, and the remaining data are insufficient for training a translation model.

Fig.3 Thresholds and their corresponding BLEU scores

From level 2 onward, at each threshold, most of C1-BLEU scores are higher than the C0-BLEU score and the highest value occurs at level 3. This demonstrates that the filtering step based on sentence length is effective.

However, from level 4 to level 6, C1-BLEU tends to decrease. The reason is that the filter conditions are loosened, so the data added to the training set at each successive level contain more and more poor-quality sentence pairs. It is therefore consistent that C1-BLEU falls to 19.15 at level 6, its lowest value after the peak at level 3.

4.2.2 C2-BLEU scores

Now, starting from the above C1-corpora obtained with various threshold values, further filtering is conducted with the threshold values of the measures (Cosine, Jaccard, Dice and smoothed-BLEU) and with the machine learning methods (SVM, LR). The thresholds at levels 1, 2, 3, 4, 5 and 6 are taken as representatives, and the results are shown in Fig.4.

This chart (Fig.4) has six intervals, one for each of the six levels (level 1, 2, …, 6). The first point of each interval is the C1-BLEU score. The six remaining points in each interval are the C2-BLEU scores obtained with Cosine, Jaccard, Dice, smoothed-BLEU, SVM and LR, respectively. The horizontal line is the C0-BLEU score.

Fig.4 Thresholds and their C2-BLEU scores

According to this chart, when the threshold is at level 1, both the C1-BLEU score and the C2-BLEU scores are lower than the C0-BLEU score. This is because the threshold value in the filtering step based on sentence length is too small (0.236), so too many good-quality sentence pairs are removed. Consequently, the C1-corpora and C2-corpora contain fewer good-quality sentence pairs than the C0-corpus, which drives the BLEU score down.

In summary, from level 1 to level 3 the C1-BLEU scores tend to increase, reaching the highest value at level 3 (19.60), and then decrease down to level 6 (19.15), as shown in Fig.3. Note that the two data sets (2 000 and 5 000 pairs) extracted in section 3.1 have no effect on the C1-BLEU scores at any level, but only on the C2-BLEU scores. The C2-BLEU scores (with both data sets) are likewise neither good nor stable at level 1, as discussed above. From level 2 onwards, the C2-BLEU scores are stable and higher than the C1-BLEU scores at the respective levels. C2-BLEU reaches its highest value at level 3 with SVM: 20.01 with the data set of 2 000 pairs and 20.10 with the data set of 5 000 pairs.

The BLEU score, percentage of sentence pairs and experimental results are presented in Tab.8.

Tab.8 Original corpus, C0-corpus, C1-corpus, C2-corpus and their BLEU scores

Tab.8 compares the results of our experiments at different filtering steps (i.e., C1-corpus and C2-corpus at level 3) with the baseline (C0-corpus) and the original corpus. The results show that the data selected by our method form better-quality corpora and achieve a higher BLEU score with a smaller amount of data.

Notably, competitive performance is achieved with much less data, which reduces the computational load of training. To illustrate, C1-corpus uses 74.70% of the baseline data and attains a BLEU score of 19.60. Meanwhile, C2-corpus uses 55.47% (with 2 000 pairs) and 57.70% (with 5 000 pairs) of the baseline data and attains BLEU scores of 20.01 and 20.10, respectively (i.e., higher than those of both C1-corpus and the baseline). This means that only slightly more than half of the data is enough to achieve competitive performance compared with using all the data.
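The size of these gains can be made explicit with a short calculation based only on the BLEU scores quoted above. The helper name and the dictionary labels are hypothetical, and the relative gain is simply the absolute gain divided by the baseline score.

```python
from typing import Tuple

BASELINE_BLEU = 18.78  # C0-BLEU reported in the text


def bleu_gain(bleu: float, baseline: float = BASELINE_BLEU) -> Tuple[float, float]:
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    gain = bleu - baseline
    return gain, 100.0 * gain / baseline


# Level 3 scores quoted in the text.
results = {
    "C1-corpus": 19.60,
    "C2-corpus (SVM, 2 000 pairs)": 20.01,
    "C2-corpus (SVM, 5 000 pairs)": 20.10,
}

for name, bleu in results.items():
    absolute, relative = bleu_gain(bleu)
    print(f"{name}: +{absolute:.2f} BLEU ({relative:.1f}% relative)")
```

Under these figures, the best C2-corpus improves on the baseline by 1.32 BLEU points, about 7.0% in relative terms, while using 57.70% of the baseline data.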

5 Conclusions and Future Work

This paper proposes filtering methods based on the sentence length difference, the semantic similarity of sentence pairs, and machine learning for Chinese-Vietnamese parallel corpus construction. In the filtering step based on the sentence length difference, the C1-BLEU scores from level 2 onwards are higher than the C0-BLEU score for all of the threshold values tried. Next, two manually labelled data sets are used for the remaining two filtering steps. In the filtering step based on the semantic similarity of sentence pairs, the quality and quantity of the filtered data depend on the threshold values in step 1. From level 2 onwards, the C2-BLEU scores are consistently higher than the C1-BLEU scores.

In addition, the machine learning methods (SVM, LR) also achieve more stable and higher BLEU scores compared with the single feature methods (based on each measure). Using the manual data sets having the size of 2 000 or 5 000 did not significantly affect C2-BLEUs, as they are extracted randomly to represent the C1-corpora. These methods can also be easily transferred to other language pairs.

In the future, we will attempt to calculate C2-BLEU scores at the remaining threshold values. We hope that, at these thresholds, the C2-BLEU scores can exceed the current best value of 20.10.

