Text Analysis in the Arts and Humanities

Module by: Tobias Blanke. Edited by: Alex Voss, Elizabeth Vander Meer

Summary: This chapter introduces the use of computer-enabled methods of research in text analysis.

Key Concepts

  • Digital scholarship
  • Data-driven research
  • TextGrid and collaborative working
  • HiTHeR (High ThroughPut Computing in Humanities e-Research) and use of e-Infrastructure

Introduction

According to UNESCO reports, Britain tops the European lists of research publications per year in philology, literature and other text-based studies such as philosophy. In terms of book publications per year, Britain overtook the U.S. in 2006 and, in the latest available year, tops the worldwide list. These figures emphasize the urgent need for the British textual studies communities to explore new ways of dealing with this deluge of research data.

Given these figures, collaboration becomes fundamental to digital scholarship in textual studies: no researcher working alone can cope with the volume of material published each day. Furthermore, text analysis in the humanities can be a tedious and time-consuming task, but advanced computer-enabled methods make the process easier for digital or digitised works. Researchers can search large texts rapidly, conduct complex searches and have the results presented in context. This ease of analysis allows researchers to engage with texts more thoroughly and so supports insightful, well-crafted interpretations of texts.
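The contextual presentation of search results mentioned above can be illustrated with a small keyword-in-context (KWIC) routine. The following Python sketch is purely illustrative and is not drawn from any of the projects discussed in this chapter; the sample text, window size and tokenisation are assumptions.

import re

def kwic(text, keyword, window=5):
    # Return each occurrence of `keyword` with `window` words of context on either side.
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{tok}] {right}")
    return hits

sample = "The whale, the whale! The ship is sinking and the whale swims on."
for line in kwic(sample, "whale", window=3):
    print(line)

Each match is shown with a few words of surrounding context, which is the basic building block of the concordance-style searching described above.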

Various projects have emerged internationally in recent years that allow for a new scale of textual studies research, in keeping with the idea of new data-driven research. Software developed by the US MONK (Metadata Offer New Knowledge) project helps humanities scholars discover and analyze patterns in texts [1], while its sister project SEASR (Software Environment for the Advancement of Scholarly Research) enables digital humanities developers to design, build, and share software applications that support research and collaboration in textual studies [2]. Aus-e-Lit is an Australian project that allows literary scholars to search seamlessly across relevant databases and archives and to retrieve reliable information on a particular author, topic or publication [3]. These are just three projects quite closely linked to e-Research initiatives, but there are many more. For over fifty years, there has been a worldwide academic movement to work on Digital Humanities, resulting in many achievements, especially in the field of textual studies. It is impossible in the space of this chapter to list all of these projects (for a history of Digital Humanities and some of the textual scholarship involved, see Schreibman, Siemens et al. 2004). Instead, we shall concentrate on two projects linked to both Digital Humanities and e-Research, which exemplify in very particular ways two major new developments in textual studies research that are directly linked to the shift in methodologies based on data-driven research: the German TextGrid project illustrates the value of new collaborative research in textual studies, while the UK project HiTHeR (High ThroughPut Computing in Humanities e-Research) demonstrates the effective use of e-Infrastructure to support everyday research in the Digital Humanities.

Collaboration in textual studies - TextGrid

TextGrid [4] is primarily concerned with historical-critical editions for modern cross-language researchers. Such historical-critical editions often form the basis for lighter-weight editions for study and reading. They can be very large and very detailed, and they cannot be produced by one individual researcher alone but have to be the result of a collaborative effort. TextGrid's key innovation is to facilitate such (virtual) collaboration across language and national barriers.

In its first phase of funding, TextGrid delivered a modular platform for collaborative textual editing, mainly based on the community standard of the Text Encoding Initiative (TEI) [5]. As a community grid for textual studies, TextGrid forms a cornerstone of the emerging German e-Humanities agenda. Its success has also been noted in the UK, where the arts and humanities e-Science initiative allowed researchers to experiment with new technologies to cope with the research data deluge in textual studies. The UK e-Science Scoping Study for textual studies, written by Professor Peter Robinson of Birmingham University, cites TextGrid as a prime example of how to advance literary and textual studies with new digital services, because it addresses the need for collaborative resource creation, comparison (that is, collation and alignment), analysis and annotation.

TextGrid focuses on advancing digital scholarship for a particular community: TEI-based textual studies research. At the centre of its technology innovation is the deployment of an integrated development environment for the creation of critical editions called TextGridLab. Based on the Eclipse platform, TextGridLab uses Grid technologies for storage and retrieval of textual studies resources. It supports the activities and stakeholders of the entire textual studies research lifecycle and addresses its challenges. Resource discovery, via the web interface or TextGridLab modules, is aided by searching across the entire TextGrid data pool – either full text or restricted to metadata.

Decentralized and collaborative work makes sense whenever primary sources grow very large and need to be made available and linked to each other through complex metadata schemes, because the quality of such resources demands the integration of different viewpoints. In addition, the sheer quantity of newly available resources calls for high-performance technology to investigate how advanced text mining solutions can support the linking and discovery of textual studies resources. The UK JISC Engage-funded HiTHeR project has taken on this challenge.

Use of e-Infrastructure in textual studies – HiTHeR

In the Digital Humanities, many text-based collections are exposed via searchable websites. One of these resources is the Nineteenth Century Serials Edition (NCSE) in the UK [6]. The NCSE, a free online scholarly edition of nineteenth-century periodicals and newspapers, has been created as a collaborative project between Birkbeck, University of London, King's College London, the British Library, and Olive Software. The UK Arts and Humanities Research Council funded the project from January 2005 to December 2007. The NCSE corpus contains circa 430,000 articles that originally appeared in roughly 3,500 issues of six nineteenth-century periodicals. Published over a span of 84 years, materials within the corpus exist in numbered editions and include supplements, wrapper materials and visual elements. A key challenge in creating a digital system for managing such a corpus is to develop appropriate and innovative tools that will assist scholars in finding materials that support their research, while at the same time stimulating and enabling innovative approaches to the material. One goal would be to create a 'semantic view' that would allow users of the resource to find information more intuitively. Such a semantic view can be created by offering users articles with common content through a browsing interface. This is a typical classification task known from many information retrieval and text mining applications (Nentwich 2003).
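As a rough illustration of this classification task (not the NCSE or HiTHeR implementation), articles can be represented as TF-IDF vectors and grouped so that items with common content end up in the same cluster of a browsing interface. The toy articles, the use of scikit-learn and the cluster count below are all assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Four invented article titles standing in for NCSE-style content.
articles = [
    "Parliamentary debate on the corn laws and free trade",
    "Free trade, tariffs and the repeal of the corn laws",
    "A serialised novel: chapter three of the country parson",
    "Fiction supplement: the parson's daughter, a serial tale",
]

# Represent each article as a TF-IDF vector, then cluster similar articles together.
vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, text in sorted(zip(labels, articles)):
    print(label, text)

A browsing interface could then present the members of each cluster side by side, giving users one simple form of the 'semantic view' described above.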

According to Toms and O’Brien (2008), the work of humanities researchers using digital resources is concerned with access to sources, the presentation of texts and the ability to analyse texts using a well-defined set of analysis tools. HiTHeR promises direct retrieval of relevant primary sources for research on the NCSE collections. It provides an automatically generated browsing interface that supports the 'chain of readings' activity central to most humanities researchers' work: in humanities research processes, the discovery of one relevant resource leads the researcher to further relevant resources. HiTHeR supports this by automatically generating a chain of related documents for reading.

However, the advanced automated methods that could create such a browsing view, using text mining to support users' information retrieval, require greater processing power than is available in standard desktop environments. Prior to the current case study, we experimented with a simple document similarity index to allow journals with similar contents to be presented next to each other. Initial benchmarks on a stand-alone server allowed us to conclude that (assuming the test set was representative) a complete set of comparisons for the corpus would take more than 1,000 years!
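A back-of-envelope calculation makes this figure plausible. With circa 430,000 articles, an all-pairs similarity index requires n(n-1)/2 comparisons; the per-comparison cost below is an assumed value chosen only to show how a serial run reaches the order of 1,000 years, not a measured benchmark.

n = 430_000
pairs = n * (n - 1) // 2              # roughly 9.2e10 document pairs

seconds_per_pair = 0.35               # assumed cost of one comparison on a single server
seconds_per_year = 365 * 24 * 3600

years = pairs * seconds_per_pair / seconds_per_year
print(f"{pairs:.2e} pairs -> about {years:,.0f} years on one machine")

At a few tenths of a second per comparison, a single machine would indeed need on the order of a thousand years, which is what motivates moving the workload onto shared computing infrastructure.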

Governments, private enterprise and funding bodies are investing heavily in digitization of cultural heritage and humanities research resources. With advances in the availability of parallel computing resources and the simultaneous need to process large and complicated historical collections, it seems logical to turn attention towards the best parallel computing infrastructures to support work as envisioned in the HiTHeR project. In HiTHeR we set up an infrastructure based on High Throughput Computing (HTC), which uses many computational resources to accomplish a single computational task.

The HiTHeR project created a prototype infrastructure to demonstrate to textual scholars, and indeed to humanities researchers in general, the utility of HTC methods. It uses Condor to set up a Campus Grid. In our case, we built a Campus Grid from underutilized computers at two institutions that share a building at King’s College London: the Centre for Computing in the Humanities (CCH) and the Centre for e-Research (CeRch). We use two types of computer systems: underutilized ordinary desktops and dedicated servers. Both CCH and CeRch have a large number of desktop machines and servers, used to present their vast archives and online publications. While the servers contain several terabytes of data, they have underused processing capabilities which can be made available for advanced processing. Additionally, the Condor Toolkit can use the UK's national research infrastructure, the National Grid Service (NGS), a free service to UK researchers that provides dedicated advanced computing facilities.
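One way to exploit such a pool is to cut the all-pairs comparison into independent chunks, each small enough to run as a single job, and to merge the partial results afterwards. The following sketch shows this decomposition only; the chunk size and job handling are assumptions, and the actual submission to Condor or the NGS is not shown.

from itertools import combinations

def chunked_pairs(n_docs, chunk_size):
    # Yield lists of (i, j) document-index pairs, each list small enough for one job.
    chunk = []
    for pair in combinations(range(n_docs), 2):
        chunk.append(pair)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Example: 10 documents split into jobs of at most 15 comparisons each.
for job_id, job in enumerate(chunked_pairs(10, 15)):
    print(f"job {job_id}: {len(job)} comparisons, first pair {job[0]}")

Because every chunk can be processed without reference to any other, the jobs map naturally onto an HTC pool of otherwise idle desktops and servers.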

The evaluation showed that the time needed to calculate document similarity could be reduced significantly by using the HTC resource. However, it also showed that more work is needed to determine exactly how text mining for the humanities can best be served by UK research infrastructures. More research is also needed to determine when HTC can serve these needs and when dedicated hardware is required.

Summary: The Potential of e-Research Technologies in Textual Resource Analysis

There is great untapped potential for using e-Research technologies in textual resource analysis. Computation over textual resources is quite well researched, and there are by now many well-performing algorithms and data structures that serve not only the needs of the general user but also the specific needs of researchers. Less work has been done, however, to consider the infrastructural needs of future research based on these methodologies. More user studies are required to analyse existing work in Digital Humanities involving textual resources. We need to better understand how new methods such as text mining could be used, and how the discipline of textual studies, and the humanities in general, is transformed by the ability to do more data-driven empirical research. The humanities have the opportunity to move towards a new, more empirical way of working in which more and more resources, increasing not only in number but also in size, become easily available. Interest in such new working practices already exists, as research reports and conference contributions repeatedly show. This chapter has presented just a few of the many projects working on this agenda. TextGrid looks at how collaboration can enable new research in textual studies, while HiTHeR looks at enhancing online editions in the Digital Humanities using text mining approaches. As the need for research using large digital corpora increases, other projects will emerge that will further advance computational text analysis in arts and humanities research.

References

Brockman, W. S., L. Newmann, et al., Eds. (2001). Scholarly Work in the Humanities and the Evolving Information Environment. Washington DC, Digital Library Federation. Council on Library and Information Resources.

Gietz, P., A. Aschenbrenner, et al. (2006). TextGrid and eHumanities. Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, IEEE Computer Society.

Nentwich, M. (2003). Cyberscience. Research in the Age of the Internet. Vienna, Austrian Academy of Science Press.

Schreibman, S., R. Siemens, et al., Eds. (2004). A Companion to Digital Humanities. Oxford, Blackwell Publishing.

Toms, E. and H. L. O'Brien (2008). "Understanding the information and communication technology needs of the e-humanist." Journal of Documentation 64.

Footnotes

  1. http://monkproject.org/
  2. http://seasr.org/
  3. http://www.itee.uq.edu.au/~eresearch/projects/aus-e-lit/
  4. http://www.textgrid.de
  5. http://www.tei-c.org/index.xml
  6. http://www.ncse.ac.uk
