Definition investigations (1): Text Mining

As one of the team members for the HISTORE project I thought I would do a little digging into how others (largely from other disciplines) have tried to describe, in the simplest way possible, some of the tools that we will be looking at.  In my first entry I will look at text mining.  Jonathan has already provided a brief definition: text mining is “The derivation of structured, meaningful data from a large body of unstructured data, using automated analytical methods”.  But how are other people defining this particular tool?

When I carry out a basic Google search on the question ‘what is text mining?’, the first item that appears on my screen is a short article, titled with my search query, written by Marti Hearst in 2003.  Professor Marti Hearst works in the School of Information at UC Berkeley and makes a living researching various digital tools: search engines, social technology, computational linguistics (including text mining), information visualisation, and usability in websites.  Her article, ‘What is Text Mining?’ (17 October 2003), describes text mining as ‘the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources’.

Hearst emphasises that text mining is an aspect of data mining but differs in that it attempts to extract information from natural language text rather than from structured databases of facts.  Thus, text mining attempts to dig out new knowledge from free-flowing text such as might be found in an article, monograph, or primary source material.  Text mining is not a glorified search: search engines look for something that is already known and cannot easily remove the chaff (irrelevant data) from the corn (relevant data)!  Hearst also believes that text mining differs from programmes designed for information extraction:

‘I distinguish between what I call “real” text mining, that discovers new pieces of knowledge, from approaches that find overall trends in textual data’. (Hearst, 2003)    
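To make the contrast between search and text mining a little more concrete, here is a minimal sketch in Python.  It is an illustration only: the sample documents, the stopword list and the function names are all invented for this example.  The `search` function looks for a term we already know, whereas `term_counts` derives a small structured table from unstructured text.  Simple counting of this kind sits firmly in the "overall trends" category that Hearst sets apart from "real" text mining, so it should be read as a first step rather than the thing itself.

```python
import re
from collections import Counter

# Hypothetical sample documents, invented for illustration only.
DOCUMENTS = {
    "letter_1": "The harvest failed and the price of corn rose sharply.",
    "letter_2": "Corn was scarce; the parish raised the poor rate.",
    "letter_3": "The parish records note a good harvest this year.",
}

# A tiny, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "of", "a", "was", "this"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def search(documents, term):
    """Search: return the names of documents that contain a known term."""
    return sorted(
        name for name, text in documents.items() if term in tokenize(text)
    )


def term_counts(documents):
    """Derive structured data (word counts) from the unstructured text."""
    counts = Counter()
    for text in documents.values():
        counts.update(w for w in tokenize(text) if w not in STOPWORDS)
    return counts
```

Calling `search(DOCUMENTS, "corn")` only answers a question we already knew how to ask, while `term_counts(DOCUMENTS).most_common()` gives a ranked list of terms that nobody typed in beforehand.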

A second, longer article by Hearst, entitled ‘Untangling Text Data Mining’ (which appears online and in the Proceedings of ACL’99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland [1999]), looks at how corpus-based computational linguistics engages with text data mining.  It concludes that, whilst the field is good at producing better text analysis algorithms, it fails to search for new facts and trends about the actual world.  Although this is now quite an old study, Hearst’s call for a semi-automated system to enable better text mining results has still not been widely taken up, at the very least in the humanities sector.

Three aspects of HISTORE

When drawing up the initial idea for the HISTORE project we broke the deliverables up into three main portions that seemed to make sense from the perspective of both compiling the resources in the first place and then presenting them in a useful way to historians. 

These were:

  1. A tools audit (by example of existing projects)
  2. Case studies (one per tool)
  3. Training modules (two tools demonstrated)

These will be made available through the IHR’s History Online and History SPOT platforms, which are now our primary location for digital data, listings and online training materials.  Much of this material will be produced in-house through our own extensive expertise in these areas; however, there are various parts where we have planned (and budgeted) for external help.  The following is a brief breakdown of what we currently expect these deliverables to contain.

Tools Audit

The tools audit will form a database of current relevant digital projects for historians using one or more of the tools selected for investigation for the HISTORE project.  These will be organised by function, with a faceted browsing interface to allow filtering of tools along multiple dimensions.  The tools audit will be made permanently available on History Online with direct links to the case studies and training modules on History SPOT.  
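As a rough illustration of what faceted browsing involves, here is a minimal Python sketch.  It is not the design of the HISTORE database: the record fields (`function`, `licence`), the sample entries and the function names are all hypothetical.  The idea is simply that each record carries several attributes (facets), and the user narrows the list by fixing a value for one or more of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolRecord:
    """One entry in a hypothetical tools audit."""
    name: str
    function: str
    licence: str


# Invented sample entries, for illustration only.
RECORDS = [
    ToolRecord("Tool A", "text mining", "open source"),
    ToolRecord("Tool B", "data visualisation", "open source"),
    ToolRecord("Tool C", "text mining", "commercial"),
]


def filter_by_facets(records, **facets):
    """Keep only the records that match every requested facet value."""
    return [
        record
        for record in records
        if all(getattr(record, key) == value for key, value in facets.items())
    ]


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    counts = {}
    for record in records:
        value = getattr(record, facet)
        counts[value] = counts.get(value, 0) + 1
    return counts
```

For example, `filter_by_facets(RECORDS, function="text mining", licence="open source")` narrows the list along two dimensions at once, and `facet_counts(RECORDS, "function")` produces the kind of per-category tallies that a browsing interface typically shows beside each filter.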

Case Studies

A representative tool from each of the main areas relevant to historical research will be included in a series of case studies describing what the tool can be used for, providing examples of actual use, and demonstrating how it can be combined with other tools/software.  These case studies will be made available on History SPOT.

Training Modules

The audit will inform the choice of training areas.  Two free online modules will be developed to train historians in the basic use of two digital tools.  The modules will be multimedia in nature and provide a general understanding and awareness of the tools’ use.  Again, these will be made available on History SPOT.

Project Objectives

HISTORE stands for Historians’ Online Research Environments, which admittedly doesn’t tell you much about what the project involves, even though it does state our planned end result very well.  Simply put, HISTORE is an attempt to help identify and demystify the online tools which are of most value for historical research.

Many historians will have heard of text mining or semantic data (for example) but will not have thought about how these tools could be used in their own research.  Indeed many historians will not actually know what these terms really mean or what results they might produce.  The technical jargon can be off-putting and as yet there is practically no discipline-based training available for undergraduates, postgraduates or established academics.

While there is a small group of enthusiasts adopting relevant tools as they become available, many historians have confined themselves to searching a few trusted online collections, such as the Oxford Dictionary of National Biography or the Bibliography of British and Irish History.  Surprisingly, early-career historians differ little from older colleagues, and even a freely available in-browser bibliographic tool like Zotero evokes only moderate name recognition and few declared users.  In particular, a question about the use of VREs (Virtual Research Environments) in stakeholder research carried out by the IHR produced very few respondents who said that they had ever used one (see The Impact and Embedding of an Established Resource: British History Online as a Case Study).  A common response to questions about specific tools was ‘I don’t know what that is’, followed by expressions of interest when its functionality was explained.

All of this suggests that lack of awareness, not active resistance, is the main barrier to the embedding of research tools in the historical community.  The HISTORE project hopes to help demystify several of these tools, raise awareness of what they can do and illustrate that these tools are not the preserve of IT specialists but quite learnable by academic historians. 

The following tools will form the focus for the HISTORE project:

  • Text mining
  • Cloud computing
  • Data visualisation
  • Semantic data
  • Linked data