
Digital Humanities: Collections

Support for researchers using the Library's collections with Digital Humanities techniques.

Library collections with Digital Humanities capabilities

The following collections have been selected as ideal for Digital Humanities research. Supporting information is provided for each, such as the different ways there might be to access the same resource and the pros and cons of each way.

The collections are organised by publisher, in boxes below. The publishers include Adam Matthew Digital, Gale Cengage and ProQuest. See also the Special Collections webpages, list of Historical English Corpora from the Department of Linguistics and English Language at The University of Manchester, and Researcher services: Using our collections and sources on the Library website.


Gale Cengage

Many Gale Cengage databases are available, both via direct access and via the Gale Primary Sources platform. All entry routes provide full-text search, but the Gale Primary Sources platform offers more advanced features such as cross-searching, term-frequency and term-clustering searches, and OCR (Optical Character Recognition) text download.

Data mining with Gale Primary Sources

Documents can be viewed, downloaded as PDF (flat images) or printed. The OCR text can be downloaded (where it exists) on a per document basis.

An entire Gale Cengage database can be used for text and data mining by purchasing a hard drive of XML text. The Library would then be responsible for keeping the hard drive and loaning it to researchers.

Adam Matthew Digital

Adam Matthew Digital provides several text and image-rich collections covering the last 500 years. A data mining agreement is provided up-front. The collections include Apartheid South Africa, Global Commodities, African American Communities, Victorian Popular Culture and Mass Observation.

Data mining with Adam Matthew Digital

Documents can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here.

Some entire Adam Matthew databases can be used for text and data mining by requesting API access to JSON data (structured text) from Adam Matthew via the Library, for no charge.
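Once API access has been arranged, the JSON you receive is structured text that can be parsed with standard tools. The sketch below assumes a hypothetical response shape (a "documents" array with "id" and "ocr_text" fields); the real Adam Matthew schema will differ and should be confirmed with the Library.

```python
import json

# Hypothetical response from an Adam Matthew JSON API; the real schema
# will differ and should be confirmed via the Library.
sample_response = """
{
  "documents": [
    {"id": "mo-001", "title": "Mass Observation diary, 1940",
     "ocr_text": "Tuesday. Air raid warning at noon..."},
    {"id": "mo-002", "title": "Mass Observation diary, 1941",
     "ocr_text": "Rationing queues longer this week..."}
  ]
}
"""

def extract_texts(raw_json: str) -> dict[str, str]:
    """Map each document id to its OCR text from a JSON payload."""
    data = json.loads(raw_json)
    return {doc["id"]: doc["ocr_text"] for doc in data["documents"]}

texts = extract_texts(sample_response)
print(len(texts))  # number of documents parsed
```

From a mapping like this you can feed the OCR text straight into whatever analysis pipeline you prefer.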

ProQuest

The ProQuest collections arguably of most interest for Digital Humanities research are Historical Newspapers and Early European Books Online.

Data mining with ProQuest Historical Newspapers

For newspapers that the University subscribes to:

Use the ProQuest TDM Studio tool to access 10 historical newspapers for full text search and visualisation.


For newspapers that the University has purchased:

Pages can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here, however a temporary link can be emailed to a researcher to gain access to XML text files online.
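Once you have the XML files, the article text can be pulled out with a standard XML parser. This is a minimal sketch assuming a hypothetical record layout with Title, NumericDate and FullText elements; the actual element names vary by ProQuest product and should be checked against the files you receive.

```python
import xml.etree.ElementTree as ET

# Hypothetical ProQuest-style article record; real element names vary
# by product, so check them against the files you are given.
sample_xml = """
<Record>
  <Title>Markets rally after election</Title>
  <NumericDate>1924-11-05</NumericDate>
  <FullText>LONDON. Shares rose sharply on the news...</FullText>
</Record>
"""

def ocr_text(record_xml: str) -> str:
    """Return the full OCR text of one article record, or '' if absent."""
    root = ET.fromstring(record_xml)
    node = root.find("FullText")
    return node.text if node is not None else ""

print(ocr_text(sample_xml)[:6])
```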

An entire ProQuest Historical Newspapers database can be used for text and data mining purposes. This would require the purchase of a hard drive that contains XML text files. The Library would then be responsible for keeping the hard drive and loaning it to researchers.

Currently, the Library does not have any of these hard drives for ProQuest Historical Newspapers collections.

Feature: Early modern books

One of the most widely used collections for Digital Humanities research is Early European Books Online. It is available via several platforms and providers. These include ProQuest, EEBO-TCP, Jisc Historical Texts and the Oxford Text Archive.

The platforms offer the same source material but with different levels of text encoding or search features. For example, Jisc Historical Texts lets you search across fuzzy and variant spellings and variant forms, create histograms, make subsets, and cross-search with eighteenth- and nineteenth-century collections.

Early European Books migrated from the legacy Chadwyck Healey platform to ProQuest in summer 2018. It now has better search and visualisation options, including historical mapping.

Data mining with EEBO

The search features are quite powerful in most of these platforms. You may need to download a set of XML files from the Oxford Text Archive. There are tools available from the Text Creation Partnership GitHub page, including a CSV/JSON bibliography of the EEBO-TCP texts.
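A CSV bibliography like the one on the Text Creation Partnership GitHub page can be filtered with Python's csv module to build a corpus subset before downloading any texts. The column names below (TCP, Title, Author, Date) are illustrative; check them against the actual file.

```python
import csv
import io

# Illustrative rows in the shape of a TCP bibliography; the actual
# column names on the Text Creation Partnership GitHub page may differ.
sample_csv = """TCP,Title,Author,Date
A00001,The first booke,Anon.,1580
A00002,A sermon preached at Paules,J. Donne,1622
"""

def texts_after(raw_csv: str, year: int) -> list[str]:
    """Return TCP identifiers of texts dated after the given year."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [row["TCP"] for row in reader if int(row["Date"]) > year]

print(texts_after(sample_csv, 1600))
```

The resulting identifier list can then be used to fetch just the XML files you need from the Oxford Text Archive.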

JSTOR and Portico

Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who brought you JSTOR and Portico. It is a platform for teaching, learning and performing text analysis using archival repositories of scholarly and primary source content. You can query, search and download full-text articles or upload your own to use with their tools. Constellate works using Python and includes sample Jupyter Notebooks which you can modify and extend.
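The kind of analysis a Constellate notebook walks you through can be sketched in a few lines of plain Python. This is not Constellate's own API, just an illustrative unigram frequency count of the sort its sample Jupyter Notebooks demonstrate.

```python
from collections import Counter
import re

def word_frequencies(text: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Lowercase, tokenise on letters/apostrophes, return the top_n unigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

doc = "The whale, the whale! The boats, the boats!"
print(word_frequencies(doc))  # [('the', 4), ('whale', 2), ('boats', 2)]
```

In a real notebook the same loop would run over thousands of downloaded full-text articles rather than one sentence.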

The University of Manchester has access to the free, public version of Constellate (excluding the Lab).

ITHAKA launched the Text Analysis Pedagogy (TAP) Institute to help instructors and librarians learn and teach text analysis. From 10 July to 11 August 2023, Constellate—in partnership with the Academic Data Science Alliance and the Association of College and Research Libraries—is offering free events and classes for anyone interested in teaching text analysis. Courses are progressive, so you will benefit from taking a single class or the entire series, no matter your skill level.

Register for summer 2023 TAP events now


A series of webinars ran in summer 2022, and recordings are available to watch now.

You may also take their four-session “Introduction to Python” course by working in the Constellate Lab alongside a recording of the class.

Data for Research (DfR) was a separate interface to access journal and pamphlet content on JSTOR ready for analysis and data mining. Searching DfR enabled researchers to find useful patterns, associations and unforeseen relationships in the body of research available in the journal and pamphlet archives on JSTOR. You could search OCR, metadata and key terms to download N-grams and word counts for up to 1,000 documents at a time, in XML or CSV format.

Data for Research has been replaced by Constellate.
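Although DfR itself has been retired, the n-gram and word-count outputs it produced can be recreated locally from any OCR text you hold. A minimal sketch:

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
print(bigrams[("to", "be")])  # the bigram "to be" occurs twice
```

The same Counter can be written out to CSV to mirror the per-document n-gram files DfR used to supply.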

Financial and business databases

Company annual reports are increasingly used for data mining. They are available from the (quoted) company's corporate website, from bodies such as Companies House (chargeable), or from databases such as PI Navigator, Bloomberg or Refinitiv Eikon.

Another popular resource is the U.S. Securities and Exchange Commission search tool EDGAR.
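EDGAR also exposes machine-readable JSON endpoints alongside the search tool. At the time of writing, a company's filing history is served from a submissions endpoint keyed on its CIK number zero-padded to ten digits; the sketch below just builds that URL (check the SEC's current API documentation before relying on it).

```python
def edgar_submissions_url(cik: int) -> str:
    """Build the EDGAR company-submissions URL (CIK zero-padded to 10 digits)."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"

# Apple Inc. has CIK 320193
print(edgar_submissions_url(320193))
```

Note that the SEC asks automated clients to identify themselves via a User-Agent header and to respect its rate limits when actually fetching these URLs.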

Data mining with financial and business databases

The challenges that you might face include the following:

  1. Obtaining files in bulk
  2. Processing PDF files whose underlying plain text is inaccurate or missing entirely
  3. Processing PDF files that lack internal structure, for example when you are only interested in the Chairman's Statement
  4. Processing HTML or other text files with poor, inconsistent or no discernible structure
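For the third challenge, a common workaround once text has been extracted is to slice out a section between two headings with a regular expression. The report text and heading names below are hypothetical, and real reports vary widely in heading style, so treat this as a starting point only.

```python
import re

# Hypothetical extracted report text; real reports vary widely in layout.
report = """
Chairman's Statement
Revenue grew 4% despite a difficult year.
We thank our staff for their dedication.
Chief Executive's Review
Operationally, the year was defined by...
"""

def extract_section(text: str, start: str, end: str) -> str:
    """Return the text between two headings (case-insensitive), or '' if absent."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end),
                         re.DOTALL | re.IGNORECASE)
    match = pattern.search(text)
    return match.group(1).strip() if match else ""

section = extract_section(report, "Chairman's Statement", "Chief Executive's Review")
print(section.splitlines()[0])
```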

These challenges are not unique to financial and business data, although for company reports you may be able to use XBRL (structured, tagged financial reporting data) to overcome them. That still leaves a challenge that is specific to business: mapping company and security identifier codes.

Creative Commons Licence This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International Licence.