The following collections have been selected as ideal for Digital Humanities research. Supporting information is provided for each, such as the different ways of accessing the same resource and the pros and cons of each.
The collections are organised by publisher, in boxes below. The publishers include Adam Matthew Digital, Gale Cengage and ProQuest. See also the Special Collections webpages, the list of Historical English Corpora from the Department of Linguistics and English Language at The University of Manchester, and Researcher services: Using our collections and sources on the Library website.
Many Gale Cengage databases are available, both via direct access and via the Gale Primary Sources platform. All entry routes provide full text searches, but Gale Primary Sources offers more advanced features such as cross-searching, term frequency and term clustering searches, and OCR (Optical Character Recognition) text download.
Documents can be viewed, downloaded as PDF (flat images) or printed. The OCR text can be downloaded (where it exists) on a per document basis.
An entire Gale Cengage database can be used for text and data mining by purchasing a hard drive of XML text. The Library would then be responsible for keeping the hard drive and loaning it to researchers.
Adam Matthew Digital provides several text and image-rich collections covering the last 500 years. A data mining agreement is provided up-front. The collections include Apartheid South Africa, Global Commodities, African American Communities, Victorian Popular Culture and Mass Observation.
Documents can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here.
Some entire Adam Matthew databases can be used for text and data mining by requesting API access to JSON data (structured text) from Adam Matthew via the Library, for no charge.
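As a rough illustration of what working with structured JSON text might involve, here is a minimal Python sketch. The field names used ("title", "pages", "number", "text") and the sample record are hypothetical, invented for this example only; the actual structure of the data supplied by Adam Matthew is not described here and will almost certainly differ.

```python
import json

# Hypothetical record: the keys "title", "pages", "number" and "text" are
# assumptions made for illustration, not Adam Matthew's actual schema.
sample = """
{
  "title": "Example document",
  "pages": [
    {"number": 1, "text": "First page of OCR text."},
    {"number": 2, "text": "Second page of OCR text."}
  ]
}
"""


def document_text(record):
    """Join the text of every page in one record into a single string."""
    return "\n".join(page["text"] for page in record["pages"])


record = json.loads(sample)
full_text = document_text(record)
print(record["title"], "-", len(full_text.split()), "words")
```

Once the text of each document is held as a plain string like this, it can be passed to whichever text analysis tools you prefer.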
The ProQuest collections arguably of most interest to Digital Humanities are Historical Newspapers and Early English Books Online.
For newspapers that the University subscribes to:
Use the ProQuest TDM Studio tool to access 10 historical newspapers for full text search and visualisation.
For newspapers that the University has purchased:
Pages can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here. However, a temporary link can be emailed to a researcher to give access to XML text files online.
An entire ProQuest Historical Newspapers database can be used for text and data mining purposes. This would require the purchase of a hard drive that contains XML text files. The Library would then be responsible for keeping the hard drive and loaning it to researchers.
Currently, the Library does not have any of these hard drives for ProQuest Historical Newspapers collections.
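Where XML text files are available, whether through an emailed link or from a hard drive, pulling the text out of them might look something like the following Python sketch. The element names (Record, Title, PubDate, FullText) and the sample content are assumptions made purely for illustration; they are not taken from ProQuest's actual file format, which should be checked against the supplier's documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical article record: the element names below are assumptions
# for illustration, not ProQuest's actual schema.
sample_xml = """<Record>
  <Title>Example headline</Title>
  <PubDate>1900-01-01</PubDate>
  <FullText>Text of the article as captured by OCR.</FullText>
</Record>"""


def parse_article(xml_string):
    """Return the title, date and full text from one XML record."""
    root = ET.fromstring(xml_string)
    return {
        "title": root.findtext("Title", default=""),
        "date": root.findtext("PubDate", default=""),
        "text": root.findtext("FullText", default=""),
    }


article = parse_article(sample_xml)
print(article["title"], article["date"])
```

For a folder of files, the same function could be applied to each file's contents in turn, building up a list of records for later analysis.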
One of the most widely used collections for Digital Humanities research is Early English Books Online (EEBO). It is available via several platforms and providers. These include ProQuest, EEBO-TCP, Jisc Historical Texts and the Oxford Text Archive.
The platforms offer the same source material but with different levels of text encoding or search features. For example, Jisc Historical Texts allows you to search fuzzy and variant spellings, variant forms, create histograms, make subsets, and cross-search with eighteenth and nineteenth century collections.
Early English Books Online migrated from the legacy Chadwyck Healey platform to ProQuest in summer 2018. It now has better search and visualisation options, including historical mapping.
Constellate is the text analytics service from the not-for-profit ITHAKA, the same people who brought you JSTOR and Portico. It is a platform for teaching, learning and performing text analysis using archival repositories of scholarly and primary source content. You can query, search and download full-text articles or upload your own to use with their tools. Constellate works using Python and includes sample Jupyter Notebooks which you can modify and extend.
The University of Manchester has access to the free, public version of Constellate (excluding the Lab).
ITHAKA launched the Text Analysis Pedagogy (TAP) Institute to help instructors and librarians learn and teach text analysis. From 10 July to 11 August 2023, Constellate—in partnership with the Academic Data Science Alliance and the Association of College and Research Libraries—is offering free events and classes for anyone interested in teaching text analysis. Courses are progressive, so you will benefit from taking a single class or the entire series, no matter your skill level.
There was a series of webinars in summer 2022, which you can watch now:
You may also take their four-session “Introduction to Python” course by working in the Constellate Lab alongside a recording of the class.
Data for Research (DfR) was a separate interface to access journal and pamphlet content on JSTOR ready for analysis and data mining. Searching DfR enabled researchers to find useful patterns, associations and unforeseen relationships in the body of research available in the journal and pamphlet archives on JSTOR. You could search OCR, metadata and key terms to download N-grams and word counts for up to 1,000 documents at a time, in XML or CSV format.
Data for Research has been replaced by Constellate.
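To make the idea of N-grams and word counts concrete, here is a minimal Python sketch that computes both from a short piece of text. It is an illustration of the general technique only, not the method used by Data for Research or Constellate; the function names and the sample sentence are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives word counts)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)

word_counts = ngram_counts(tokens, 1)    # single words
bigram_counts = ngram_counts(tokens, 2)  # pairs of adjacent words

print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
```

A real project would usually add steps such as removing very common words or handling historical spelling variants, which this sketch leaves out.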
Company annual reports are increasingly used for data mining. They are available from the (quoted) company's corporate website, from bodies such as Companies House (chargeable), or from databases such as PI Navigator, Bloomberg or Refinitiv Eikon.
Another popular resource is the U.S. Securities and Exchange Commission search tool EDGAR.
The challenges that you might face include the following:
These challenges are not unique to financial and business data, although you might be able to use XBRL to overcome them. That still leaves a challenge that is specific to business: mapping company and security identifier codes.