Meraka Institute

   

Human Language Technologies (HLT) – Statistical pattern recognition

Classification of patterns in data is used in numerous applications, including speech recognition, text analysis, machine vision, astronomy and medical research. In Human Language Technologies, classification underlies many speech technology applications as well as text classification and text-based language identification.

Classification is thus of great practical importance and theoretical interest to the HLT research group. Our research has two main focuses: (1) how to characterise classification data in order to select the optimal classifier for a classification task, and (2) understanding Naive Bayes classifiers in the context of high-dimensional feature spaces.

A wide variety of classifiers is available today; popular examples include the Naive Bayes, Gaussian, k-nearest-neighbour, decision tree, multilayer perceptron and support vector machine classifiers. There is, however, no single classifier that reliably outperforms all others on all classification problems, and classifier selection is still mainly a matter of trial and error. The optimal classifier for any classification task is determined by the characteristics of the data set employed. It is therefore crucial to understand the relationship between data characteristics and classifier performance.
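The trial-and-error selection described above can be sketched as follows. This is only a minimal illustration, not the group's method: the two candidate classifiers (1-nearest-neighbour and a nearest-centroid rule), the leave-one-out evaluation, the function names and the toy data set are all assumptions chosen for brevity. The toy data has an XOR-like layout, so a classifier that summarises each class by a single mean is expected to do badly on it.

```python
import math

def one_nearest_neighbour(train, query):
    """Predict the label of the closest training point."""
    _, label = min(train, key=lambda pair: math.dist(pair[0], query))
    return label

def nearest_centroid(train, query):
    """Predict the label whose class mean is closest to the query."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    return min(centroids, key=lambda lab: math.dist(centroids[lab], query))

def leave_one_out_accuracy(classifier, data):
    """Hold out each sample in turn, train on the rest, count correct predictions."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if classifier(train, features) == label:
            correct += 1
    return correct / len(data)

def select_classifier(candidates, data):
    """Return the name of the best-scoring candidate and all scores."""
    scores = {name: leave_one_out_accuracy(fn, data) for name, fn in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data: each class occupies two opposite corners (XOR layout).
toy_data = [
    ((0.0, 0.0), "a"), ((0.5, 0.5), "a"), ((4.0, 4.0), "a"), ((4.5, 4.5), "a"),
    ((0.0, 4.0), "b"), ((0.5, 4.5), "b"), ((4.0, 0.0), "b"), ((4.5, 0.5), "b"),
]
candidates = {"1-NN": one_nearest_neighbour, "nearest centroid": nearest_centroid}
best_name, scores = select_classifier(candidates, toy_data)
```

On a data set with a different structure, for example one compact cluster per class, the ranking of the two candidates could change, which is the dependence on data characteristics that the paragraph above refers to.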

The HLT research group is investigating this relationship in order to select the optimal classifiers for speech applications. Understanding this relationship will allow classifiers to be combined in interesting ways; it will also make it possible to automatically change the specific classifier employed for an application as the data characteristics of the application change over time.

Naive Bayesian classifiers are useful in domains of high dimensionality. These classifiers, which assume that all features are conditionally independent given the class, are of great interest because estimating the complete correlation structure between features is often practically infeasible. Their new-found popularity can be observed in HLT applications such as text processing, where high-dimensional feature spaces arise very naturally. The HLT research group is developing theoretical models for understanding naive Bayesian classifiers in the context of frequency counts. Issues addressed by this theoretical approach include feature selection and expected learning curves.
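To make the idea of a naive Bayesian classifier over frequency counts concrete, here is a minimal sketch of a multinomial Naive Bayes classifier applied to a toy text-based language identification task. It is an illustration under stated assumptions, not the group's theoretical model: the add-one (Laplace) smoothing parameter `alpha`, the function names and the four training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from (tokens, label) pairs."""
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_docs[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)

    total_docs = sum(class_docs.values())
    model = {"vocabulary": vocabulary, "log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_docs.items():
        model["log_prior"][label] = math.log(n_docs / total_docs)
        # Smoothed estimate: (count + alpha) / (class total + alpha * |V|).
        denominator = sum(token_counts[label].values()) + alpha * len(vocabulary)
        model["log_likelihood"][label] = {
            token: math.log((token_counts[label][token] + alpha) / denominator)
            for token in vocabulary
        }
    return model

def classify(model, tokens):
    """Pick the class with the highest log posterior; unseen tokens are ignored."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # The naive assumption: per-token log likelihoods simply add up.
        scores[label] = log_prior + sum(
            likelihood[token] for token in tokens if token in model["vocabulary"]
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus for language identification.
training = [
    ("the cat sat on the mat".split(), "english"),
    ("the dog and the cat".split(), "english"),
    ("die kat sit op die mat".split(), "afrikaans"),
    ("die hond en die kat".split(), "afrikaans"),
]
model = train_multinomial_nb(training)
prediction = classify(model, "the dog sat".split())
```

Because each feature contributes an independent term, the model needs only one count per token and class, which is what keeps it tractable when the vocabulary (the feature space) is very large.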

Selected publications

C.M. van der Walt, “Data measures that characterise classification problems”, Master’s dissertation, Department of Electrical, Electronic and Computer Engineering, University of Pretoria, South Africa, February 2008.

C.M. van der Walt and E. Barnard, “Data characteristics that determine classifier performance”, in transactions of SAIEE Africa Research Journal, Vol. 98, No. 3, pp. 87-93, September 2007.

E. van Dyk and E. Barnard, “Naive Bayesian classifiers for multinomial features: a theoretical analysis”, in Proceedings of the Eighteenth Annual Symposium of the Pattern Recognition Association of South Africa, pp. 75-82, November 2007.

   
  Contact: Christiaan van der Walt +27 12 841 4364 cvdwalt@csir.co.za
   
Copyright © Meraka Institute 2007