Meraka Institute

   

Human Language Technologies (HLT) – Text-based topic modelling

Today, large collections of digital data are widely available and continue to grow in size at an increasing pace. Trying to understand the meaning of such data is a difficult task and in general the first option is to perform keyword searches. The results of keyword searches do not always describe the meaning of the data collection in a satisfactory way, especially if the user has limited insight into the collection. A summary of the data would be very useful and would ideally encapsulate the main topics within the data.

Examples of data collections include news articles, conference proceedings and minutes of meetings. For a text corpus of news articles, a summary of topics could include politics, sport, finance, culture and local news. When one thinks of a text corpus as a collection of documents, it makes sense that each document has an underlying semantic context. This semantic context develops as the document is generated and refers to the intended meaning of the document. For example, a newspaper article has the purpose of reporting on a news event, and as we read the article we become aware of the message its author or authors hope to communicate. The semantic context is generally not stated explicitly, but is encoded in the words of a document. Topic modelling addresses the retrieval of semantic context from a text corpus and can be described as a problem of statistical inference.
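
To make the inference problem concrete, one common way to write it down (an assumption here, since this page does not name a specific model) is the mixture form shared by probabilistic topic models such as pLSA and LDA. The symbols w (a word), d (a document), z (a latent topic) and K (the number of topics) are introduced only for this illustration:

```latex
P(w \mid d) = \sum_{k=1}^{K} P(w \mid z = k)\, P(z = k \mid d)
```

Under this reading, only the words in each document are observed; the topic-word distributions P(w | z) and the document-topic proportions P(z | d) are the quantities to be inferred.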

We investigate the use of natural language processing (NLP) techniques to pre-process and structure text data in order to improve the performance of topic models within the scope of text modelling applications. Our research explores a number of related issues:

  • We challenge the assumption that a word is the most suitable unit for deriving topic models. We introduce an alternative basic unit, namely “concepts”. A concept consists of a number of words which are generally, but not necessarily, adjacent (sequential).
  • The orthographic properties of languages differ, which leads to different topic models of the same corpus in different languages; language independence is therefore an interesting test of the consistency of a modelling approach.
  • We anticipate that extracting concepts as a preprocessing task will reduce the dimensionality of the topic model parameter space. We wish to quantify the effect of this modification on the performance of the topic model.
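
The idea of treating concepts as the basic unit can be sketched as a preprocessing step. The following is a minimal illustration, not the method used in this research: the function names, the example sentences and the hand-written concept list are all hypothetical, and only adjacent word sequences are handled (the text above notes that concepts need not be adjacent).

```python
def merge_concepts(tokens, concepts):
    """Replace each listed adjacent word sequence with a single joined token."""
    # Try longer concepts first so a longer match wins over its own prefix.
    # Empty entries are skipped because they would match everywhere.
    ordered = sorted((tuple(c) for c in concepts if c), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for concept in ordered:
            n = len(concept)
            if tuple(tokens[i:i + n]) == concept:
                merged.append("_".join(concept))
                i += n
                break
        else:
            # No concept starts at position i: keep the word unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def vocabulary(docs):
    """Return the set of distinct units (words or concepts) in a corpus."""
    return {token for doc in docs for token in doc}


# Hypothetical toy corpus and concept list, for illustration only.
docs = [
    "the stock market fell as interest rates rose".split(),
    "interest rates drive the stock market".split(),
]
concepts = [("stock", "market"), ("interest", "rates")]

merged_docs = [merge_concepts(doc, concepts) for doc in docs]

print(len(vocabulary(docs)), "distinct units before merging")
print(len(vocabulary(merged_docs)), "distinct units after merging")
```

In this toy corpus the number of distinct units drops because the merged words never occur on their own. On real data the effect depends on how the concepts are chosen, which is the quantity the research sets out to measure.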

One application area of topic modelling is in digital forensics, where it can focus the search of confiscated digital data for relevant evidence.

Selected publications

De Waal A, Venter JP and Barnard E. “Applying Topic Modelling on Forensic Data: A Case Study”. International Federation for Information Processing, Advances in Digital Forensics IV, eds. Shenoi, S., Vol 242, pp 303-315, Springer Boston.

De Waal A, Barnard E and Du Preez E. “Topic Models applied to Multilingual Data”. In Proceedings of the 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), pp 99-103, Pietermaritzburg, South Africa, November 2007.

   
  Contact: Alta de Waal +27 12 841 3792 adewaal@csir.co.za
   
Copyright © Meraka Institute 2007