Language modeling is a fundamental research problem with wide application in many NLP tasks. To estimate the probabilities of natural language sentences, most research on language modeling uses n-gram-based approaches to factor sentence probabilities. However, the assumption underlying n-gram models is not robust enough to cope with data sparseness, which degrades the final performance of language models.
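As a point of reference (a standard formulation, not taken from the dissertation itself), an n-gram model approximates the chain-rule factorization of a sentence probability by conditioning each word only on its n-1 predecessors:

\[
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
\]

It is this fixed, left-to-right truncation of the conditioning history that makes n-gram models vulnerable to data sparseness.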
In this dissertation, drawing on the basic idea of cognitive grammar, we propose a hierarchical word sequence structure in which different assumptions can be adopted to rearrange word sequences in a fully unsupervised fashion. We present three different methods for constructing these hierarchical word sequence structures. Unlike n-gram models, which factor sentence probabilities strictly from left to right, our model factors them with a more flexible strategy, as sketched below.
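As an illustrative sketch only (the exact formulation is defined in the body of the dissertation), such a hierarchical factorization can be written as conditioning each word on the context given by the constructed structure rather than on its linear predecessors:

\[
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P\bigl(w_i \mid \mathrm{context}(w_i)\bigr)
\]

Here \(\mathrm{context}(w_i)\) is a placeholder symbol for the words adjacent to \(w_i\) in the hierarchical word sequence structure, not in the surface word order.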
For evaluation, we compare our rearranged word sequences with standard n-gram word sequences. Both intrinsic and extrinsic experiments verify that our language models achieve better performance, showing that our method can be considered a better alternative to n-gram language models.