Human Language Technologies (HLT) – Text-based topic modelling

Today, large collections of digital data are widely available and continue to grow at an increasing pace. Understanding the meaning of such data is difficult, and the usual first step is a keyword search. Keyword search results do not always describe the content of a collection satisfactorily, especially if the user has limited insight into the collection. A summary that captures the main topics within the data would be far more useful. Examples of data collections include news articles, conference proceedings and minutes of meetings. For a text corpus of news articles, for example, a summary of topics could include politics, sport, finance, culture and local news.

When one thinks of a text corpus as a collection of documents, it makes sense that each document has an underlying semantic context. This semantic context develops as the document is generated and refers to the intended meaning of the document. For example, a newspaper article reports on a news event, and as we read it we become aware of the message the author(s) intend to communicate. The semantic context is generally not stated explicitly, but is encoded in the words of the document.

Topic modelling addresses the retrieval of semantic context from a text corpus and can be described as a problem of statistical inference. We investigate the use of natural language processing (NLP) techniques to pre-process and structure text data in order to improve the performance of topic models in text modelling applications. Our research explores a number of related issues:
One application area of topic modelling is digital forensics, where it can focus the search of confiscated digital data for relevant evidence.

Selected publications

De Waal A, Venter JP and Barnard E. “Applying Topic Modelling on Forensic Data: A Case Study”. International Federation for Information Processing, Advances in Digital Forensics IV, eds. Shenoi, S., Vol 242, pp 303-315, Springer Boston.

De Waal A, Barnard E and Du Preez E. “Topic Models applied to Multilingual Data”. In Proceedings of the 18th Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), pp 99-103, Pietermaritzburg, South Africa, November 2007.

Contact: Alta de Waal, +27 12 841 3792, adewaal@csir.co.za

Copyright © Meraka Institute 2007