As one of the team members for the Historie project, I thought I would do a little digging into other attempts (largely from other disciplines) to describe, in the simplest way possible, some of the tools that we will be looking at. In my first entry I will look at text mining. Jonathan has already provided a brief definition: text mining is ‘the derivation of structured, meaningful data from a large body of unstructured data, using automated analytical methods’. But how are other people defining this particular tool?
When I carry out a basic Google search on the question ‘what is text mining?’, the first item that appears on my screen is a short article by Marti Hearst whose title matches my search query. Professor Hearst works in the School of Information at UC Berkeley and researches various digital tools: search engines, social technology, computational linguistics (including text mining), information visualisation, and website usability. Her article, ‘What is Text Mining?’ (17 October 2003), describes text mining as ‘the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources’.
Hearst emphasises that text mining is an aspect of data mining but differs in that it attempts to extract information from natural language text rather than from structured databases of facts. Thus, text mining attempts to dig out new knowledge from free-flowing text such as might be found in an article, monograph, or primary source material. Text mining is not a glorified search: search engines look for something that is already known and cannot easily separate the chaff (irrelevant data) from the corn (relevant data)! Hearst also believes that text mining differs from programmes designed for information extraction:
‘I distinguish between what I call “real” text mining, that discovers new pieces of knowledge, from approaches that find overall trends in textual data’. (Hearst, 2003)
A second, longer article by Hearst, ‘Untangling Text Data Mining’ (which appears online and in the Proceedings of ACL’99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland), looks at how corpus-based computational linguistics engages with text data mining. It concludes that, whilst the field is good at producing better text analysis algorithms, it fails to search for new facts and trends about the actual world. Although this is now quite an antiquated study of text mining, Hearst’s call for a semi-automated system to enable better text mining results has still not been widely taken up, at the very least in the humanities sector.