Definition investigations (1): Text Mining

As one of the team members for the Historie project I thought I would do a little bit of digging into other attempts (largely from other disciplines) at describing, in the simplest way possible, some of the tools that we will be looking at.  In my first entry I will look at text mining.  Jonathan has already provided a brief definition: text mining is “The derivation of structured, meaningful data from a large body of unstructured data, using automated analytical methods”.  But how are other people defining this particular tool?

When I carry out a basic Google search on the question ‘what is text mining?’, the first item that appears on my screen is a short article with that very title, written by Marti Hearst in 2003.  Professor Hearst works in the School of Information at UC Berkeley and researches various digital tools: search engines, social technology, computational linguistics (including text mining), information visualisation, and usability in websites.  Her article, ‘What is Text Mining?’ (17 October 2003), describes text mining as ‘the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources’.

Hearst emphasises that text mining is an aspect of data mining but differs in that it attempts to extract information from natural language text rather than from structured databases of facts.  Thus, text mining attempts to dig out new knowledge from free-flowing text such as might be found in an article, monograph, or primary source material.  Text mining is not a glorified search: search engines look for something that is already known and cannot easily separate the chaff (irrelevant data) from the corn (relevant data)!  Hearst also believes that text mining differs from programmes designed for information extraction:

‘I distinguish between what I call “real” text mining, that discovers new pieces of knowledge, from approaches that find overall trends in textual data’. (Hearst, 2003)    
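To make the difference between search and mining a little more concrete, here is a minimal sketch in Python.  It is my own illustration rather than anything taken from Hearst's article: the three sentences are invented sample data, and the function names (tokenize, search, cooccurrences) are hypothetical.  Counting which words appear together is also a very crude technique, closer to the trend-finding that Hearst sets apart from “real” text mining, but it does show structured data being derived from unstructured text.

```python
import re
from collections import Counter
from itertools import combinations

# Invented sample sentences standing in for a corpus (not real source material).
DOCUMENTS = [
    "The baker was accused of theft and sent to the court.",
    "A servant was accused of theft by her master.",
    "The baker sold bread to the servant.",
]

# A tiny, hand-picked stopword list (an assumption made for this example only).
STOPWORDS = {"the", "a", "of", "and", "to", "was", "by", "her"}


def tokenize(text):
    """Lowercase a text and return its words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def search(documents, term):
    """Search: return the documents containing a term we already know about."""
    return [d for d in documents if term in tokenize(d)]


def cooccurrences(documents):
    """Mining (very crudely): count word pairs that share a document."""
    pairs = Counter()
    for d in documents:
        words = sorted(set(tokenize(d)))
        pairs.update(combinations(words, 2))
    return pairs


hits = search(DOCUMENTS, "theft")
pairs = cooccurrences(DOCUMENTS)
print(len(hits), "documents mention 'theft'")
print(pairs.most_common(3))
```

The search function can only return what we already knew to ask for, whereas the table of word pairs is new structured data that was not stated anywhere in the sentences themselves.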

A second, longer article by Hearst entitled ‘Untangling Text Data Mining’ (which appears online and in the Proceedings of ACL’99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland [1999]) looks at how corpus-based computational linguistics has engaged with text data mining.  It concludes that, whilst the field is good at producing better text analysis algorithms, it fails to search for new facts and trends about the actual world.  Although this is now quite an antiquated study of text mining, Hearst’s call for a semi-automated system to enable better text mining results has still not been fully taken up, at the very least in the humanities sector.


1 thought on “Definition investigations (1): Text Mining”

  1. It would be interesting if project team members could expand on the application of text mining techniques for historical analysis, ideally giving examples from different periods of historical analysis and different thematic areas. is now offering the opportunity to sample selected eighteenth-century texts in the web-based text analysis environment Voyeur.

    The Data Mining with Criminal Intent project has used Zotero and TAPoR tools applied to the Old Bailey trial corpus.

    Technological innovations are of course only ten or twenty percent of the issue when seeking to transform the way that research is done.

    In the pharmaceutical industry, for example, the introduction and adoption of new technologies into research and development programmes and structures requires much evangelisation, reference projects, technology champions and visible success, something I have seen from the inside in my former career.
