The following collections have been selected as ideal for Digital Humanities research. Supporting information is provided for each, such as the different routes for accessing the same resource and the pros and cons of each.
The collections are organised by publisher, in boxes below. The publishers include Adam Matthew Digital, Gale Cengage and ProQuest. See also the Special Collections subject guide and the list of Historical English Corpora from the Department of Linguistics and English Language at The University of Manchester.
Many Gale Cengage databases are available both via direct access and via the Gale Primary Sources platform. All entry routes provide full-text searching, but Gale Primary Sources offers more advanced features such as cross-searching, term frequency and term clustering searches, and OCR (Optical Character Recognition) text download.
Documents can be viewed, downloaded as PDF (flat images) or printed. The OCR text can be downloaded (where it exists) on a per document basis.
An entire Gale Cengage database can be used for text and data mining by purchasing a hard drive of XML text. The Library would then be responsible for keeping the hard drive and loaning it to researchers.
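As a rough illustration of how XML text supplied on a hard drive might be prepared for text mining, the sketch below reads every XML file in a folder and reduces each one to plain text. It makes no assumption about the publisher's schema, because it simply collects all text found in each file; the function names and the folder layout are hypothetical, and real deliveries may need schema-specific handling.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def extract_plain_text(xml_path):
    """Return all text content of one XML file, with whitespace collapsed."""
    root = ET.parse(str(xml_path)).getroot()
    # itertext() walks every element, so no knowledge of the schema is needed.
    return " ".join(" ".join(root.itertext()).split())


def load_corpus(folder):
    """Map each XML file name found under `folder` to its plain text."""
    corpus = {}
    for xml_path in sorted(Path(folder).rglob("*.xml")):
        try:
            corpus[xml_path.name] = extract_plain_text(xml_path)
        except ET.ParseError:
            # Skip files that are not well-formed rather than stopping the run.
            continue
    return corpus
```

Calling load_corpus on the folder where the drive is mounted would return a dictionary of file names and text, which can then be passed to whatever analysis tool you prefer.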
Adam Matthew Digital provides several text and image-rich collections covering the last 500 years. A data mining agreement is provided up-front. The collections include Apartheid South Africa, Global Commodities, African American Communities, Victorian Popular Culture and Mass Observation.
Documents can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here.
Some entire Adam Matthew databases can be used for text and data mining by requesting API access to JSON data (structured text) from Adam Matthew via the Library, for no charge. Adam Matthew provides a data mining agreement.
The ProQuest collections likely to be of most interest to Digital Humanities researchers are Historical Newspapers and Early English Books Online.
The following information applies to newspapers that the University has purchased, not those it subscribes to.
Pages can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here; however, a temporary link can be emailed to a researcher to give access to XML text files online.
An entire ProQuest Historical Newspapers database can be used for text and data mining purposes. This would require the purchase of a hard drive that contains XML text files. The Library would then be responsible for keeping the hard drive and loaning it to researchers.
Currently, the Library does not have any of these hard drives for ProQuest Historical Newspapers collections.
One of the most widely used collections for Digital Humanities research is Early English Books Online (EEBO). It is available via several platforms and providers. These include Chadwyck Healey (ProQuest), EEBO-TCP, Jisc Historical Texts and the Oxford Text Archive.
The platforms offer the same source material but with different levels of text encoding or search features. For example, Jisc Historical Texts allows you to search for fuzzy and variant spellings and variant forms, create histograms, make subsets, and cross-search with eighteenth- and nineteenth-century collections.
Early European Books is migrating from the legacy Chadwyck Healey platform to ProQuest in summer 2018. It will have better search and visualisation options, including historical mapping.
Data for Research (DfR) is a separate interface to access journal and pamphlet content on JSTOR ready for analysis and data mining. Searching DfR enables researchers to find useful patterns, associations and unforeseen relationships in the body of research available in the journal and pamphlet archives on JSTOR. You can search OCR, metadata and key terms to download N-grams and word counts for up to 1,000 documents at a time, in XML or CSV format.
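Once word counts have been downloaded as CSV, they can be combined across documents with a few lines of Python. The sketch below is a minimal example under stated assumptions: it assumes one CSV file per document, each with a header row and two columns named "word" and "count". Those column names and the function name are hypothetical, so check them against the files you actually receive and adjust the parameters to match.

```python
import csv
from collections import Counter
from pathlib import Path


def total_word_counts(folder, word_col="word", count_col="count"):
    """Sum per-document word counts from every CSV file in `folder`."""
    totals = Counter()
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            # DictReader uses the header row to name each column.
            for row in csv.DictReader(handle):
                totals[row[word_col].lower()] += int(row[count_col])
    return totals
```

The returned Counter supports most_common(n), which gives a quick view of the most frequent terms across the whole download.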
Data for Research also has a prepared bulk download option for early journals called “Early Journal Content Data Bundle,” which contains every article from every journal in JSTOR’s database that was published before 1923 in the United States or before 1870 elsewhere. This amounts to more than 450,000 articles from more than 200 journals. The Psyborgs Lab has written an introductory tutorial for using the Early Journal Content Data Bundle; you will need to use a little Python code for this.
Company annual reports are increasingly used for data mining. They are available from the (quoted) company's corporate website, from bodies such as Companies House (chargeable), or from databases such as PI Navigator or Thomson Research.
Another popular resource is the U.S. Securities and Exchange Commission search tool EDGAR.
The challenges that you might face include the following:
These challenges are not unique to financial and business data, although you might be able to use XBRL to overcome some of them. That still leaves one challenge that is specific to business data: mapping company and security identifier codes.
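To give a sense of what mapping identifier codes involves, the sketch below builds a lookup from ticker symbols to EDGAR Central Index Keys (CIKs). It assumes you already have a mapping table from a data provider; the sample rows are hypothetical placeholders, not real companies, and the function names are illustrative. CIKs are normalised to 10-digit, zero-padded strings, which is the form commonly used in EDGAR data URLs, so that the same company matches however the code was originally typed.

```python
def normalise_cik(cik):
    """Return a CIK as a 10-digit, zero-padded string."""
    digits = str(cik).strip().lstrip("0") or "0"
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


def build_lookup(rows):
    """Map upper-cased ticker symbols to normalised CIKs."""
    return {
        row["ticker"].strip().upper(): normalise_cik(row["cik"])
        for row in rows
    }


# Hypothetical rows, standing in for a mapping table from a data provider.
sample_rows = [
    {"ticker": "abc", "cik": 12345},
    {"ticker": "XYZ ", "cik": "0000067890"},
]
lookup = build_lookup(sample_rows)
```

Normalising both sides before joining is the main design choice here: most mismatches between datasets come from differences in padding, letter case or stray spaces rather than from genuinely different identifiers.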