Application of N-Gram IDF on Software Artifacts Classification

Pannavat Terdchanakul (1551209)


Automated tools have become available for many software engineering activities and have proven very useful in helping software developers accelerate their development process and make it more reliable.

Our thesis aims to provide a new and pragmatic tool that identifies the type of any software artifact by automatically capturing the structure of its text and classifying the document based on its context. This is challenging, yet it would be very helpful for people involved in software production, because manually classifying documents is time-consuming and laborious: each document must be inspected individually to be classified correctly.

In this presentation, we describe our proposed software artifact classification model based on N-gram IDF, a theoretical extension of Inverse Document Frequency (IDF) that handles words and phrases of any length. By applying N-gram IDF to a corpus of documents, we can extract key terms of any length, and these key terms can be used as features to classify software artifacts. We introduce and present the results of our classification on three applications: (1) source code algorithm classification, (2) issue report classification, and (3) high-impact bug report classification.
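
To make the pipeline concrete, the following is a minimal sketch of the general idea: collect n-grams of several lengths from a corpus, weight them, keep some as key terms, and turn each document into a feature vector. It is an illustration under stated assumptions, not the thesis implementation. In particular, it uses plain IDF computed over n-grams as a simplified stand-in and does not implement the N-gram IDF weighting itself. The function names, the `max_n` length cap, and the `min_df` threshold are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every contiguous n-gram (as a tuple) of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def ngram_document_frequency(docs, max_n=3):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so that an n-gram is counted once per document
        df.update(set(ngrams(doc.split(), max_n)))
    return df


def plain_idf(df, num_docs):
    """Plain IDF per n-gram: log2(|D| / df). A stand-in, NOT N-gram IDF."""
    return {gram: math.log2(num_docs / count) for gram, count in df.items()}


def key_terms(docs, max_n=3, min_df=2):
    """Keep n-grams seen in at least min_df documents (assumed threshold),
    ordered by descending plain IDF, then alphabetically."""
    df = ngram_document_frequency(docs, max_n)
    idf = plain_idf(df, len(docs))
    kept = [gram for gram, count in df.items() if count >= min_df]
    return sorted(kept, key=lambda gram: (-idf[gram], gram))


def featurize(doc, terms, max_n=3):
    """Represent a document as occurrence counts of the selected key terms."""
    counts = Counter(ngrams(doc.split(), max_n))
    return [counts[term] for term in terms]


if __name__ == "__main__":
    # Toy corpus invented for this example
    corpus = [
        "null pointer exception in parser",
        "null pointer exception on startup",
        "add dark mode",
    ]
    terms = key_terms(corpus)
    for doc in corpus:
        print(featurize(doc, terms), doc)
```

The resulting vectors could then be passed to any standard classifier. Which classifier and which term-selection criteria are appropriate for each of the three applications is outside the scope of this sketch.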