GENERATING AND EXPLOITING LANGUAGE RESOURCES FOR INDONESIAN PREPOSITION ERROR CORRECTION

Budi Irmawati (1261204)


Automatic error detection and correction systems are beneficial to second language (L2) learners. They report the types and positions of errors in learners' sentences. Developing such a system requires a large amount of learner data annotated with error tags. Besides error annotations, the system also needs linguistic information, such as morphological, syntactic, or semantic information, from which training features can be extracted. Such data are difficult to obtain for low-resource languages. Moreover, most learner data are paper-based and must be transcribed into a machine-readable format, which is time-consuming and laborious. It is therefore challenging to automatically obtain a large annotated learner corpus for a low-resource language such as Indonesian.

In this study, we developed an error-corrected learner corpus and designed a dependency annotation scheme that takes the linguistic properties of the language into account. We then assigned dependency relations to the sentences so that syntactic information could be extracted from the training data. Although we could construct only a small learner corpus because of the limited availability of raw data, we demonstrated that the corpus can be used to initiate a preposition error correction task. Our preliminary experiments showed that a preposition error correction model trained and tested on learner data using cross-validation performed better than a model trained only on native data, even though the native data were much larger. Then, to cope with the data shortage, we propose two novel language-independent methods for creating artificial learner data and obtaining more training data. Experimental results showed that the artificial learner data improved model performance. We therefore argue that our methods are promising and can be applied to other languages, because they assume only a small amount of learner error data.