Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models

Marcely Zanon Boito, Alexandre Berard, Aline Villavicencio, Laurent Besacier

Introduction

Computational Language Documentation (CLD) aims at creating tools and methodologies to help automate the extraction of lexical, morphological and syntactic information in languages of interest. This paper focuses on languages (most of them endangered and unwritten) spoken in small communities all across the globe. Specialists believe that more than 50% of them will become extinct by the year 2100, and manually documenting all these languages is not feasible. Initiatives for helping with this issue include organizing tasks and proposing pipelines for automatic information extraction from speech signals.

Methodologies for CLD should consider the nature of the collected data: endangered languages may lack a well-defined written form (they are often oral-tradition languages). Therefore, in the absence of a standard written form, one alternative is to align collected speech to its translation in a well-documented language. Because bilingual speakers who can help in this documentation process are hard to find, the collected corpora are usually small.

One of the tasks involved in the documentation process is word segmentation. Given an unsegmented input, it consists of finding the boundaries between word-like units. This input can be a sequence of characters or phonemes, or even raw speech. Such a system can be very useful to linguists, helping them start the transcription and documentation process. For instance, a linguist can use the output of such a system as an initial vocabulary, and then manually validate the generated words. Popular solutions for this task are Nonparametric Bayesian models and, more recently, Neural Networks. The latter have also been used for related tasks such as speech translation or unsupervised phoneme discovery.

Contribution. This paper is the first attempt to leverage attentional encoder-decoder models for language documentation of a truly unwritten language. We show that it is possible, from very little data, to perform unsupervised word discovery with a performance (F-score) only slightly lower than that of Nonparametric Bayesian models, known to perform very well on this task in limited data settings. Moreover, our approach aligns symbols in the unknown language with words from a known language which, as a by-product, bootstraps a bilingual dictionary. Therefore, in the remainder of this paper, we will use the term word discovery (instead of word segmentation), since our approach does not only find word boundaries but also aligns word segments to their translation in another language.

Another reason why we are interested in attentional encoder-decoder models is that they can easily be modified to work directly from the speech signal, which is our ultimate goal.

Approach. In a nutshell, we train an attention-based Neural Machine Translation (NMT) model, and extract the soft-alignment probability matrices generated by the attention mechanism. These alignments are then post-processed to segment a sequence of symbols (or speech features) in an unknown language (Mboshi) into words. We explore three improvements for our neural-based approach: alignment smoothing presented in , vocabulary reduction discussed in , and Moses-like symmetrization of our soft-alignment probability matrices. We also propose to reverse the translation direction, translating from known language words to unknown language tokens. Lastly, we also study a semi-supervised scenario, where prior knowledge is available, by providing the 100 most frequent words to the system.

Outline. This paper is organized as follows: we present related work in Section 2, and the neural architecture, corpus, and our complete approach in Section 3. Experiments and their results are presented in Sections 4 and 5, and are followed by an analysis in Section 6. We conclude our work with a discussion about possible future extensions in Section 7.

Related Work

Nonparametric Bayesian Models (NB models) are statistical approaches that can be used for word segmentation and morphological analysis. Recent variants of these models are able to work directly with raw speech, or with sentence-aligned translations. The major advantage of NB models for CLD is their robustness to small training sets. Recently, achieved their best results on a subset (1200 sentences) of the same corpus we use in this work by using an NB model. Using the dpseg system (available at http://homepages.inf.ed.ac.uk/sgwater/resources.html), they retrieved 23.1% of the total vocabulary (type recall), achieving a type F-score of 30.48%.

Although NB models are well-established in the area of unsupervised word discovery, we wish to explore what neural-based approaches could add to the field. In particular, attention-based encoder-decoder approaches have been very successful in Machine Translation, and have shown promising results in End-to-End Speech Translation (translation from raw speech, without any intermediate transcription). This latter approach is especially interesting for language documentation, which often uses corpora made of audio recordings aligned with their translation in another language (no transcript in the source language).

While attention probability matrices offer accurate information about word soft-alignments in NMT systems, we investigate whether this is reproducible in scenarios with limited amounts of training data, since a notable drawback of neural-based models is their need for large amounts of training data.

We are aware of only one other work using an NMT system for unsupervised word discovery in a low-resource scenario. This work used an 18,300 Spanish-English parallel corpus to emulate an endangered language corpus. Their approach for unsupervised word discovery is the most similar to ours. However, we go one step further: we apply such a technique to a real language documentation scenario. We work with only five thousand sentences in an unwritten African language (Mboshi), as we believe that this is more representative of what linguists may encounter when documenting languages.

Methodology

We use a 5,157 sentence parallel corpus in Mboshi (Bantu C25), an unwritten African language, aligned to French translations at the sentence level. Even though Mboshi is unwritten, linguists provided a non-standard grapheme form, considered to be close to the language phonology. Mboshi is a language spoken in Congo-Brazzaville, and it has 32 different phonemes (25 consonants and 7 vowels) and two tones (high and low). The corpus was recorded using the LIG-AIKUMA tool in the scope of the BULB project.

For each sentence, we have a non-standard grapheme transcription (the gold standard for segmentation), an unsegmented version of this transcription, a translation in French, a lemmatization of this translation, and an audio file (for tokenization and lemmatization we used TreeTagger). It is important to mention that in this work, we use the unsegmented non-standard grapheme form of Mboshi (close to the language phonology) as the source; direct use of the speech signal is left for future work.

We split the corpus into training and development sets, using 10% for the latter. Table 1 gives a summary of the types (unique words) and tokens (total word counts) on each side of the parallel corpus.
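As a minimal sketch of how such counts can be obtained, assuming whitespace-tokenized sentences (the function name and the toy sentences are illustrative and not taken from the corpus):

```python
def types_and_tokens(sentences):
    """Return (number of types, number of tokens) for whitespace-tokenized sentences."""
    tokens = [word for sentence in sentences for word in sentence.split()]
    return len(set(tokens)), len(tokens)


# Toy example with made-up strings, not corpus data.
toy_sentences = ["a b a", "c a"]
n_types, n_tokens = types_and_tokens(toy_sentences)
print(n_types, n_tokens)  # 3 5
```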

Neural Architecture

The attention function is defined as follows:

where $v$, $W_{1}$, $W_{2}$, and $b_{2}$ are learned jointly with the other parameters of the model. At each time step $t$, a score $e_{i}^{t}$ is computed for each encoder state $h_{i}$, using the current decoder state $s_{t}$. These scores are then normalized using a softmax function, thus giving a probability distribution over the input sequence of length $A$: $\sum_{i=1}^{A}{\alpha_{i}^{t}}=1$ and $\forall{i},0\leq\alpha_{i}^{t}\leq 1$. The context vector $c_{t}$ used by the decoder is a weighted sum of the encoder states. This can be understood as a summary of the useful information in the input sequence for the generation of the next output symbol $z_{t}$. The weights $\alpha^{t}_{i}$ can be seen as a soft-alignment between input $x_{i}$ and output $z_{t}$.
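For reference, a standard additive-attention form consistent with these definitions is sketched below. The placement of the bias $b_{2}$ inside the $\tanh$ and the use of $s_{t}$ as the decoder-state index are assumptions made to match the description above, not a transcription of the authors' equations.

```latex
\begin{align}
e_{i}^{t} &= v^{\top}\tanh\left(W_{1}h_{i} + W_{2}s_{t} + b_{2}\right) \\
\alpha_{i}^{t} &= \frac{\exp(e_{i}^{t})}{\sum_{j=1}^{A}\exp(e_{j}^{t})} \\
c_{t} &= \sum_{i=1}^{A}\alpha_{i}^{t}h_{i}
\end{align}
```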

Our models are trained using the Adam algorithm, with a learning rate of $0.001$ and a batch size ($N$) of $32$. We minimize a cross-entropy loss between the output probability distribution $p_{t}=softmax(y_{t})$ and the reference translation $w_{t}$:
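Written out under the usual conventions, such a loss would read as below. This is a sketch under assumptions: the loss is summed over the $T_{n}$ target positions of sentence $n$ and averaged over the $N$ sentences of a batch, and $p_{t}^{(n)}(w)$ denotes the probability that the model assigns to symbol $w$ at position $t$ of sentence $n$.

```latex
\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\log p_{t}^{(n)}\left(w_{t}^{(n)}\right)
```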

Neural Word Discovery Approach

Our full word discovery pipeline is illustrated in Figure 1. We start by training an NMT system using the Mboshi-French parallel corpus, without the word boundaries on the Mboshi side. This is shown as step 1 in the figure.

We stop training once the training loss stops decreasing. At this point, we expect the alignment model to be the most accurate on the training data. Then we ask the model to force-decode the entire training set. We extract soft-alignment probability matrices computed by the attention model while decoding (step 2).

Finally, we post-process this soft-alignment information and infer a word segmentation (step 3). We first transform the soft-alignment into a hard-alignment, by aligning each source symbol $x_{i}$ with the target word $w_{t}$ such that $t=\arg\max_{t'}{\alpha_{i}^{t'}}$. Then we segment the input (Mboshi) sequence according to these hard-alignments: if two consecutive symbols are aligned with the same French word, they are considered to belong to the same Mboshi word.
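The post-processing step can be sketched as follows. This is a minimal illustration, not the authors' code: the matrix layout (`alpha[t][i]`, one row per French word and one column per Mboshi symbol), the function names, and the toy values are all assumptions.

```python
def hard_align(alpha):
    """Map each source symbol i to the target word t with the highest weight.

    alpha[t][i] is the attention weight between target word t and source
    symbol i (this row/column layout is an assumption of the sketch).
    """
    n_targets, n_sources = len(alpha), len(alpha[0])
    return [max(range(n_targets), key=lambda t: alpha[t][i])
            for i in range(n_sources)]


def segment(symbols, alignment):
    """Group consecutive symbols that are aligned with the same target word."""
    if not symbols:
        return []
    words, current = [], symbols[0]
    for i in range(1, len(symbols)):
        if alignment[i] == alignment[i - 1]:
            current += symbols[i]      # same target word: extend the segment
        else:
            words.append(current)      # alignment changed: close the segment
            current = symbols[i]
    words.append(current)
    return words


# Toy matrix: 2 target words (rows) over 5 source symbols (columns).
alpha = [
    [0.50, 0.40, 0.05, 0.03, 0.02],
    [0.05, 0.05, 0.30, 0.30, 0.30],
]
alignment = hard_align(alpha)
print(alignment)                          # [0, 0, 1, 1, 1]
print(segment(list("abcde"), alignment))  # ['ab', 'cde']
```

Only consecutive symbols are merged in this sketch, so a symbol sequence whose alignment returns to an earlier French word starts a new segment.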

Unsupervised Word Discovery Experiments

For the unsupervised word discovery experiments, we used the unsegmented transcription in Mboshi provided by linguists, aligned with French sentences. This Mboshi unsegmented transcription is made of 44 different symbols.

We experimented with the following variations:

Alignment Smoothing: to deal with source (phones or graphemes) vs. target (words) sequence length discrepancy, we need to encourage many-to-one alignments between Mboshi and French. These alignments are needed in order to cluster Mboshi symbols into word-units. For this purpose, we implemented the alignment smoothing proposed by . The softmax function used by the attention mechanism (see eq. 6) takes an additional temperature parameter: $\alpha_{i}^{t}=\exp{(e_{i}^{t}/T)}/\sum_{j}{\exp{(e_{j}^{t}/T)}}$. A temperature $T$ greater than one (we use $T=10$, like the original paper) will result in a less sharp softmax, which boosts many-to-one alignments. In addition, the probabilities are smoothed by averaging each score with the scores of the two neighboring words: $\alpha^{t}_{i}\leftarrow(\alpha^{t}_{i-1}+\alpha^{t}_{i}+\alpha^{t}_{i+1})/3$ (equivalent to a low-pass filtering on the soft-alignment probability matrix).

Reverse Architecture: in NMT, the soft-alignments are created by forcing the probabilities for each target word $t$ to sum to one (i.e. $\sum_{i}\alpha_{i}^{t}=1$ ). However, there is no similar constraint for the source symbols, as discussed in . Because we are more interested in the alignment than the translation itself, we propose to reverse the architecture. The reverse model translates from French words to Mboshi symbols. This prevents the attention model from ignoring some Mboshi symbols.

Alignment Fusion: statistical machine translation systems, such as Moses, extract alignments in both directions (source-to-target and target-to-source) and then merge them, creating the final translation model. This alignment fusion is often called symmetrization. We investigate whether this Moses-like symmetrization improves our results by merging the soft-alignment probability matrices generated by our base (Mboshi-French) and reverse (French-Mboshi) models. We replace each probability $\alpha_{i}^{t}$ by $\frac{1}{2}(\alpha_{i}^{t}+\beta_{t}^{i})$, where $\beta_{t}^{i}$ is the probability for the same alignment $i\leftrightarrow t$ in the reverse architecture.

Target Language Vocabulary Reduction: to reduce vocabulary size on the known language, we replace French words by their lemmas. The intuition is that, by simplifying the translation information, the model could more easily learn relations between the two languages. For the task of unsupervised word discovery, this technique was recently investigated by .
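The matrix operations behind the first three variations can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: matrices are stored as `alpha[t][i]` (base model) and `beta[i][t]` (reverse model), neighbours that fall outside the sequence are treated as zero during smoothing, and all function names and toy values are hypothetical.

```python
import math


def softmax_with_temperature(scores, temperature=10.0):
    """Softmax over one row of attention scores, flattened by a temperature."""
    peak = max(scores)  # subtracted for numerical stability; result is unchanged
    exps = [math.exp((e - peak) / temperature) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def smooth(row):
    """Average each weight with its two neighbours.

    Assumption: neighbours outside the sequence count as zero.
    """
    padded = [0.0] + list(row) + [0.0]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


def source_coverage(alpha):
    """Total attention received by each source symbol (column sums of alpha[t][i]).

    In the base direction nothing forces these sums to be non-zero, which is
    the motivation for the reverse architecture.
    """
    return [sum(row[i] for row in alpha) for i in range(len(alpha[0]))]


def fuse(alpha, beta):
    """Average alpha[t][i] (base model) with beta[i][t] (reverse model)."""
    return [[0.5 * (alpha[t][i] + beta[i][t]) for i in range(len(alpha[t]))]
            for t in range(len(alpha))]


scores = [2.0, 0.0, -1.0]
print(softmax_with_temperature(scores, temperature=1.0))   # sharp distribution
print(softmax_with_temperature(scores, temperature=10.0))  # flatter distribution
print(source_coverage([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]))  # last symbol gets no attention
```

With zero padding, the smoothed weights at the two ends of a row no longer sum exactly to one with the rest; renormalizing each row afterwards is one possible design choice if a proper distribution is required.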

The base model (Mboshi to French) uses an embedding size and cell size of 12. The encoder stacks two bidirectional LSTM layers, and the decoder uses a single LSTM layer. The reverse model (French to Mboshi) uses an embedding size and cell size of 64, with a single layer bidirectional encoder and single layer decoder.

We present in Table 2 the unsupervised word discovery task results obtained with our base model, and with the reverse model, with and without alignment smoothing (items 1 and 2). We notice that the alignment smoothing technique presented by improved the results, especially for types.

Moreover, we show that the proposed reverse model considerably improves type and token retrieval. This seems to confirm the hypothesis that reversing the alignment direction results in a better segmentation (because the attention model has to align each Mboshi symbol to French words with a total probability of 1). This may also be due to the fact that the reverse model reads words and outputs character-like symbols, which is generally easier than reading sequences of characters. Finally, we achieved our best result by using the reverse model with alignment smoothing (last row in Table 2).

We then used this latter model for testing alignment fusion and vocabulary reduction (items 3 and 4). For alignment fusion, we tested three configurations using matrices generated by the base and reverse models. We tested the fusion of the raw soft-alignment probability matrices (without alignment smoothing), the fusion of already smoothed matrices, as well as this latter fusion followed by a second step of smoothing. All these configurations led to negative results: recall reduction between 3% and 5% for tokens and between 1% and 9% for types. We believe this happens because, by averaging the reverse model’s alignments with the ones produced by the base model (which does not have the constraint of using all the symbols), we degrade the generated alignments rather than exploit the information discovered in both directions.

Lastly, when running the reverse architecture (with alignment smoothing) using French lemmas (vocabulary reduction), we also noticed a reduction in performance. The lemmatized model version had a recall drop of approximately 2% for all tokens and types metrics. We believe this result could be due to the nature of the Mboshi language, and not necessarily a generalizable result. Mboshi has a rich morphology, creating a different word for each verb tense, which includes radical and all tense information. Therefore, by removing this from the French translations, we may actually make the task harder, since the system is forced to learn to align different words in Mboshi to the same word in French.

Semi-supervised Word Discovery Experiments

A language documentation task is rarely totally unsupervised, since linguists usually immerse themselves in the community when documenting its language. In this section, we explore a semi-supervised approach for word segmentation, using our best reverse model from Section 4.

To emulate prior knowledge, we select the 100 most frequent words in the gold standard for Mboshi segmentation. We consider this amount reasonable for representing the information a linguist could acquire after a few days. Our intuition is that providing the segmentation for these words could help improve the performance of the system for the rest of the vocabulary.

To incorporate this prior information into our system, we simply add known tokens on the Mboshi side of the corpus, keeping the remaining symbols unsegmented. This creates a mixed representation, in which the Mboshi input contains both unsegmented symbols and segmented words. Since languages follow Zipfian distributions and we are giving the model the most frequent words in the corpus, evaluating on tokens would be over-optimistic and bias the model evaluation; the analysis is therefore done only in terms of types. Results are presented in Table 3.
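One way to build such a mixed representation is sketched below. It assumes that the gold segmentation is used to locate the known words and that every character of an unknown word counts as one symbol; both choices, as well as the function names and toy data, are assumptions of this sketch and are not stated in the text.

```python
from collections import Counter


def most_frequent_words(segmented_sentences, k=100):
    """Return the set of the k most frequent words in gold-segmented sentences."""
    counts = Counter(word for sentence in segmented_sentences for word in sentence)
    return {word for word, _ in counts.most_common(k)}


def mixed_representation(gold_words, known):
    """Keep known words as single tokens; split every other word into symbols."""
    tokens = []
    for word in gold_words:
        if word in known:
            tokens.append(word)   # known word: one token
        else:
            tokens.extend(word)   # unknown word: one token per character
    return tokens


# Toy data with made-up strings, not corpus data.
sentences = [["wa", "ba", "wa"], ["wa", "ko"]]
known = most_frequent_words(sentences, k=1)
print(known)                                      # {'wa'}
print(mixed_representation(sentences[0], known))  # ['wa', 'b', 'a', 'wa']
```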

For types, we observed an increase of 2.4% in recall. This is not a huge improvement, considering that we are giving 100 words to the model. We discovered that our unsupervised model was already able to discover 97 of these 100 frequent words, which could justify the small performance difference between the models. In addition to the 100 types already known, the semi-supervised model found 50 new types that the unsupervised system was unable to discover.

Finally, it is interesting to notice that, while the performance increase is not huge, the semi-supervised system reduced considerably the number of types generated, from 11,266 to 7,473. This suggests that this additional information helped the model to create a better vocabulary representation, closer to the gold standard vocabulary.

Analysis

As a baseline, we used dpseg which implements a Nonparametric Bayesian approach, where (pseudo)-words are generated by a bigram model over a non-finite inventory, through the use of a Dirichlet-Process.

We used the same hyper-parameters as , which were tuned on a larger English corpus and then successfully applied to the segmentation of Mboshi. We use a random initialization and 19,600 sampling iterations.

Table 4 shows our results for types compared to the NB model. Although our approach is able to retrieve more of the vocabulary (higher recall), the NB model has higher precision, and both are close in terms of F-score. Additionally, our approach has the advantage of providing clues for translation.
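For readers unfamiliar with type-level scores, the sketch below shows one common way to compute them, assuming that a discovered type counts as correct when it appears in the gold vocabulary. The function name and toy vocabularies are illustrative, and the exact scoring protocol used for Table 4 may differ.

```python
def type_scores(discovered_words, gold_words):
    """Return (precision, recall, F-score) computed over vocabularies (types)."""
    discovered, gold = set(discovered_words), set(gold_words)
    correct = len(discovered & gold)
    precision = correct / len(discovered) if discovered else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy vocabularies: 2 of the 4 discovered types are in the 3-type gold vocabulary.
precision, recall, f_score = type_scores(["a", "b", "c", "d"], ["a", "b", "e"])
print(precision, recall, f_score)
```

A large discovered vocabulary tends to raise recall and lower precision, which is the trade-off described above.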

It is interesting to notice that our neural approach, which is not specialized for this task (the soft-alignment scores are only a by-product of translation), was able to achieve close performance to the dpseg method, which is known to be very good in low-resource scenarios. This highlights the potential of our approach for language documentation.

Vocabulary Analysis

To understand the segmentation behavior of our approach, we looked at the generated vocabulary. We compare our unsupervised and semi-supervised methods with the gold standard and the NB baseline, dpseg. The first characteristic we looked at was the word distribution of the generated vocabularies. While we already knew that dpseg constrains the generated vocabulary to follow a power law, we observed that our approaches also display such a behavior. They produce curves that are as close to the real language distribution as dpseg (see Figure 2).
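A common way to inspect whether a vocabulary follows a power law is a rank-frequency curve, which is roughly a straight line on log-log axes for Zipfian data. The sketch below computes the points of such a curve; whether Figure 2 was produced exactly this way is an assumption, and the function name and toy tokens are illustrative.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, frequency)], with rank 1 for the most frequent word."""
    counts = Counter(tokens)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


# Toy token list with made-up strings, not corpus data.
print(rank_frequency(["a", "b", "a", "c", "a", "b"]))  # [(1, 3), (2, 2), (3, 1)]
```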

We also measured the average word length to identify under-segmentation and over-segmentation. To be able to compare vocabularies of varying sizes, we normalized the frequencies by the total number of generated types. The curves are shown in Figure 3. Reading the legend from left to right, the vocabulary sizes are 6,245, 2,285, 11,266, and 7,473.
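The normalization described above can be sketched as follows, assuming that word length is measured in characters over the set of types. The function names and toy vocabulary are illustrative; note that Python's `len` counts Unicode code points, so combining diacritics would be counted as separate symbols.

```python
from collections import Counter


def length_distribution(vocabulary):
    """Share of types for each word length, normalized by the number of types."""
    types = set(vocabulary)
    if not types:
        return {}
    counts = Counter(len(word) for word in types)
    return {length: n / len(types) for length, n in sorted(counts.items())}


def average_length(vocabulary):
    """Average word length over types."""
    types = set(vocabulary)
    return sum(len(word) for word in types) / len(types) if types else 0.0


# Toy vocabulary with made-up strings, not corpus data.
vocabulary = ["ab", "cd", "efg", "hijk"]
print(length_distribution(vocabulary))  # {2: 0.5, 3: 0.25, 4: 0.25}
print(average_length(vocabulary))       # 2.75
```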

Our semi-supervised configuration is the closest to the real vocabulary in terms of vocabulary size, with only 1,228 more types. All the approaches (including dpseg) over-segment the input in a similar way, creating vocabularies with average word length of four (Figure 3).

Since both dpseg and neural-based approaches suffer from the same over-segmentation problem, we believe that this is a consequence of the corpus used for training, and not necessarily a general characteristic of our approach in low-resource scenarios. For our neural approaches, another explanation is that the corpus is small and the average number of tokens per sentence is higher on the French side (shown in Table 1), which can potentially disperse the alignments over the possible translations, creating multiple boundaries.

Moreover, as Mboshi is an agglutinative language, there were several cases in which we had a good alignment but a wrong segmentation. An example is shown in Figure 4, where we see that the word “ímok$\acute{\omega}$s$\acute{\omega}$” was split into two words in order to keep its alignment to both parts of its French translation “suis blessé”. This is also the case for the last word in this figure: Mboshi does not require articles preceding nouns, which caused misalignment. We believe that by exploiting translation alignment, we could constrain our segmentation procedure, creating a more accurate word discovery model. Finally, we were able to create a model of reasonable quality which gives segmentation and alignment information using only 5,157 sentences for training (low-resource scenario).

Conclusion

In this work, we presented a neural-based approach for performing word discovery in low-resource scenarios. We used an NMT system with global attention to retrieve soft-alignment probability matrices between source and target language, and we used this information to segment the language to be documented. A similar approach was presented in , but this work represents the first attempt at training a neural model with a real unwritten language based on a small corpus made of only 5,157 sentences.

By reversing the system’s input order and applying alignment smoothing, we were able to retrieve 27.23% of the vocabulary, which gave us an F-score close to the NB baseline, known for being robust to low-resource scenarios. Moreover, this approach has the advantage of naturally incorporating translation, which can be used for enhancing segmentation and creating a bilingual lexicon. The system is also easily extendable to work with speech, a requirement for most of the approaches in CLD.

Finally, as future work, our objective is to discover lexicon directly from speech, inspired by the encoder-decoder architectures presented in . We will also explore different training objective functions more correlated with segmentation quality, in addition to MT metrics. Lastly, we intend to investigate more sophisticated segmentation methods from the generated soft-alignment probability matrices, identifying the strongest alignments in the matrices, and using their segmentation as prior information to the system (iterative segmentation-alignment process).