Subject guides: Digital Humanities: Collections

Gale Cengage

Many Gale Cengage databases are available, both via direct access and via the Gale Primary Sources platform. All entry routes provide full-text searching, but Gale Primary Sources offers more advanced features such as cross-searching, term frequency and term clustering searches, and OCR (Optical Character Recognition) text download.

Gale Archives of Sexuality & Gender
The largest digital collection of primary source materials relating to the history and study of sex, sexuality and gender currently available.
Parts I and II cover LGBTQ History and Culture Since 1940. They feature historical documents published in more than 35 countries, with over 15 languages represented, and cover the development, culture, and society of LGBTQ groups in the latter half of the twentieth century.
Part III Sex and Sexuality, Sixteenth to Twentieth Century brings together more than 5,000 rare and unique books covering sex, sexuality, and gender issues across the sciences and humanities and examines topics such as patterns of fertility and sexual practice; prostitution; religion and sexuality; the medical and legal construction of sexualities; and the rise of sexology. The collection not only offers a reflection of the cultural and social attitudes of the past, but also a window into how sexuality and gender roles were viewed and changed over time (1600-1940). As such it facilitates researchers in areas as diverse as medicine, biology, anthropology, law, the classics, art, and erotic literature.
Digital Scholar Lab
Gale Digital Scholar Lab is a cloud-based platform that enables students and researchers to access content and OCR data from Gale Primary Sources and analyse these archives with text and data mining tools. Users at any level will be able to work easily and efficiently with large corpora of text-data, organising custom data sets and digital tools that reflect the unique needs of individual researchers or entire classrooms.
Eighteenth Century Collections Online
ECCO provides digital images of approximately 150,000 books published during the 18th century. Each book is also listed in the Library catalogue.
Gale Cengage Sabin Americana 1500-1926
Based on Joseph Sabin's landmark bibliography, this collection contains works about the Americas published throughout the world from 1500 to the early 1900s.
Included are books, pamphlets, serials and other documents that provide original accounts of exploration, trade, colonialism, slavery and abolition, the western movement, Native Americans, military actions and much more. With over 6 million pages from 29,000 works, this collection is a cornerstone in the study of the western hemisphere.
Gale Primary Sources
Gale Primary Sources provides an interactive research environment that allows researchers to cross-search Eighteenth Century Collections Online (ECCO) and Nineteenth Century Collections Online (NCCO) and to discover and analyse content in new ways.
Nineteenth Century Collections Online
Nineteenth Century Collections Online brings together rare primary source materials - monographs, newspapers, pamphlets, manuscripts, ephemera, maps, photographs and more...

Data mining with Gale Primary Sources

Documents can be viewed, downloaded as PDF (flat images) or printed. The OCR text can be downloaded (where it exists) on a per document basis.

An entire Gale Cengage database can be used for text and data mining by purchasing a hard drive of XML text. The Library would then be responsible for keeping the hard drive and loaning it to researchers.
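
Once OCR text has been downloaded document by document, a simple term frequency count can be run locally. The following is a minimal Python sketch, not a Gale tool: it assumes the OCR text has been saved as plain .txt files in a folder of your choosing, and the function names and sample sentence are illustrative only.

```python
import re
from collections import Counter
from pathlib import Path


def term_frequencies(text, min_length=3):
    """Count lower-cased alphabetic tokens that have at least min_length letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_length)


def corpus_frequencies(folder):
    """Sum term counts over every .txt file in a folder (the folder is an assumption)."""
    total = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # OCR output often contains stray bytes, so replace undecodable characters.
        total += term_frequencies(path.read_text(encoding="utf-8", errors="replace"))
    return total


def relative_frequency(counts, term):
    """Share of all counted tokens accounted for by one term."""
    total = sum(counts.values())
    return counts[term.lower()] / total if total else 0.0


# Illustrative sample standing in for one downloaded OCR document.
sample = "The trade in sugar grew; the sugar trade shaped the colonies."
counts = term_frequencies(sample)
print(counts.most_common(3))
print(relative_frequency(counts, "sugar"))
```

OCR errors in historical material will split or merge words, so counts from a sketch like this are best treated as indicative rather than exact.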

Adam Matthew Digital

Adam Matthew Digital provides several text and image-rich collections covering the last 500 years. A data mining agreement is provided up-front. The collections include Apartheid South Africa, Global Commodities, African American Communities, Victorian Popular Culture and Mass Observation.

Mass Observation Online
Archives of the pioneering social research organisation founded in 1937 to record everyday life in Britain and a key repository for the study of Social History in the modern era. Offers access to primary material gathered by Mass Observation from 1937 to 1955, including print, manuscripts, photographs and interactive features.
Confidential Print: Middle East, 1839-1969
Confidential Print: Middle East, 1839-1969 is an online series of ‘Confidential Print’ documents issued by the United Kingdom Foreign and Colonial Office since c1820.
Apartheid South Africa 1948-1980
Apartheid South Africa makes available British government files from the Foreign, Colonial, Dominion and Foreign and Commonwealth Offices spanning the period 1948 to 1980. These previously restricted letters, diplomatic dispatches, reports, trial papers, activists’ biographies and first-hand accounts of events give unprecedented access to the history of South Africa’s apartheid regime.
The files explore the relationship of the international community with South Africa and chart increasing civil unrest against a backdrop of waning colonialism in Africa and mounting world condemnation.
Global Commodities: Trade, exploration and cultural exchange, ca. 1515-2012
This resource brings together manuscript, printed and visual primary source materials for the study of global commodities in world history. The commodities featured in this resource have been transported, exchanged and consumed around the world for hundreds of years. They helped transform societies, global trading operations, habits of consumption and social practices.
Includes a 360-degree image viewer, visualisation tools and interactive maps.

Data mining with Adam Matthew Digital

Documents can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here.

Some entire Adam Matthew databases can be used for text and data mining by requesting API access to JSON data (structured text) from Adam Matthew via the Library, for no charge.

ProQuest

The ProQuest collections arguably of most interest to Digital Humanities are Historical Newspapers and Early European Books.

Early European Books (EEB)
Early European Books traces the history of printing in Europe from its origins through to the close of the seventeenth century, offering full-colour, high-resolution facsimile images of rare and hard-to-access printed sources. Direct access via ProQuest.
ProQuest Historical Newspapers (various)
See link above for details, in "Newspapers: Overseas Publications: Archives" Subject Guide
Includes:
ProQuest Historical Newspapers: The Baltimore Sun (1837-1988)
ProQuest Historical Newspapers: Boston Globe (1872-1982)
ProQuest Historical Newspapers: Chicago Tribune (1849-1990)
ProQuest Historical Newspapers: Chinese Newspapers Collection
ProQuest Historical Newspapers: Los Angeles Times (1881-1990)
ProQuest Historical Newspapers: New York Tribune (1841-1922)
ProQuest Historical Newspapers: South China Morning Post (1903-1995)
ProQuest Historical Newspapers: The New York Times (1851-2009)
ProQuest Historical Newspapers: The Wall Street Journal (1889-1996)
ProQuest Historical Newspapers: The Washington Post (1877-1997)
ProQuest TDM Studio
ProQuest TDM Studio is a text and data mining tool for research, teaching and learning, providing procedural access to most full-text ProQuest collections. TDM Studio includes two paths: Visualization is designed for users of all levels to quickly spot trends and generate insights into a set of historical newspapers; Workbench is designed for experienced researchers who use their own coding methodologies.
The following newspapers are included in the Visualization tool:
New York Times - 1 Jun 1980 - present
The Guardian - 25 Nov 1996 - present
The Globe and Mail - 14 Nov 1977 - present
Washington Post - 4 Dec 1996 - present
Chicago Tribune - 4 Dec 1996 - present
LA Times - 4 Dec 1996 - present
Sydney Morning Herald - 1 Aug 1996 - present
The Times of India - 6 Jan 2006 - present
The South China Morning Post - 1 Jan 1993 - present

You might also find (TBC): Arizona Republic, The Baltimore Sun, The Boston Globe, Chicago Defender, Hartford Courant, New York Amsterdam News, New York Herald Tribune (1926-1962), Philadelphia Tribune, The Tennessean, The Irish Times, The Jerusalem Post (1950-1988), Wall Street Journal.
The Vogue Archive
The Vogue Archive contains the entire run of Vogue magazine (US Edition), from the first edition in 1892 to the current month, reproduced in high resolution colour page images. Every page, advertisement, cover and fold out has been included, with rich indexing enabling you to find images by garment type, designer and brand name.

Data mining with ProQuest Historical Newspapers

For newspapers that the University subscribes to:

Use the ProQuest TDM Studio tool to access 10 historical newspapers for full text search and visualisation.

For newspapers that the University has purchased:

Pages can be viewed, downloaded as PDF (flat images), or printed. The OCR text cannot be accessed directly from here; however, a temporary link can be emailed to a researcher to give access to XML text files online.

An entire ProQuest Historical Newspapers database can be used for text and data mining purposes. This would require the purchase of a hard drive that contains XML text files. The Library would then be responsible for keeping the hard drive and loaning it to researchers.

Currently, the Library does not have any of these hard drives for ProQuest Historical Newspapers collections.

Feature: Early modern books

One of the most widely used collections for Digital Humanities research is Early English Books Online (EEBO). It is available via several platforms and providers. These include ProQuest, EEBO-TCP, Jisc Historical Texts and the Oxford Text Archive.

The platforms offer the same source material but with different levels of text encoding or search features. For example, Jisc Historical Texts allows you to search for fuzzy and variant spellings and variant forms, create histograms, make subsets, and cross-search with eighteenth and nineteenth century collections.

Early European Books migrated from the legacy Chadwyck Healey platform to ProQuest in summer 2018. It now has better search and visualisation options, including historical mapping.

Early European Books (EEB)
Early European Books traces the history of printing in Europe from its origins through to the close of the seventeenth century, offering full-colour, high-resolution facsimile images of rare and hard-to-access printed sources. Direct access via ProQuest.
EEBO-TCP: Early English Books Online Text Creation Partnership
EEBO-TCP is a partnership with ProQuest and with more than 150 libraries to generate highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database.
Jisc Historical Texts
Historical Texts brings together four historically significant collections for the first time: Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO), 65,000 texts from the British Library 19th Century collection, and the UK Medical Heritage Library (UKMHL). The texts are cross-searchable and presented as both facsimile images and with full text where available.
For learning and teaching resources and suggestions with Jisc Historical Texts, see https://historicaltexts.jisc.ac.uk/landt
CQPWeb
Corpus Query Processor platform, hosted by Lancaster University, containing EEBO version 3. Content is tagged for part of speech. CQPWeb includes VARD, a variant spelling detection tool, and an advanced query language for "phrasal searching". It is available to anyone with a .ac.uk email address.
Oxford Text Archive
The University of Oxford Text Archive (OTA) is a repository of digital literary and linguistic resources for research and teaching. It lets you download the transcribed text files used in EEBO and EEBO-TCP.
Gale Cengage Sabin Americana 1500-1926
Based on Joseph Sabin's landmark bibliography, this collection contains works about the Americas published throughout the world from 1500 to the early 1900s.
Included are books, pamphlets, serials and other documents that provide original accounts of exploration, trade, colonialism, slavery and abolition, the western movement, Native Americans, military actions and much more. With over 6 million pages from 29,000 works, this collection is a cornerstone in the study of the western hemisphere.

Data mining with EEBO

The search features are quite powerful in most of these platforms. For analysis that goes beyond searching, you may need to download a set of XML files from the Oxford Text Archive. Tools are available from the Text Creation Partnership GitHub page, including a CSV/JSON bibliography of the EEBO-TCP texts.
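
Downloaded XML files usually need to be reduced to plain text before analysis. The following is a minimal Python sketch using only the standard library. It assumes the files are TEI P5 XML in the standard TEI namespace; check this against the files you actually download, as older SGML-derived versions are structured differently. The sample document and function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumption: the downloaded files use the standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_title(root):
    """Return the first title in the TEI header, or None if there is none."""
    return root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)


def extract_body_text(root):
    """Return the text of the first <body> with markup removed and whitespace collapsed."""
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    # Join without separators so that inline tags do not split words.
    return " ".join("".join(body.itertext()).split())


# Illustrative sample; for a real file use ET.parse(path).getroot() instead.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A sample tract</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div type="text">
      <head>To the Reader</head>
      <p>Loue is <hi>blinde</hi>, and louers cannot see.</p>
    </div>
  </body></text>
</TEI>"""

root = ET.fromstring(sample)
print(extract_title(root))
print(extract_body_text(root))
```

This sketch discards all structure, including page breaks, notes and gap markers for illegible text. If those matter for your research, handle the relevant elements explicitly rather than flattening everything.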

JSTOR and Portico

Constellate is the text analytics service from the not-for-profit ITHAKA - the same people who brought you JSTOR and Portico. It is a platform for teaching, learning and performing text analysis using archival repositories of scholarly and primary source content. You can query, search and download full-text articles or upload your own to use with their tools. Constellate works using Python and includes sample Jupyter Notebooks which you can modify and extend.

The University of Manchester has access to the free, public version of Constellate (excluding the Lab).

ITHAKA launched the Text Analysis Pedagogy (TAP) Institute to help instructors and librarians learn and teach text analysis. From 10 July to 11 August 2023, Constellate—in partnership with the Academic Data Science Alliance and the Association of College and Research Libraries—is offering free events and classes for anyone interested in teaching text analysis. Courses are progressive, so you will benefit from taking a single class or the entire series, no matter your skill level.

There was a series of webinars in summer 2022, which you can watch now.

You may also take in their four-session “Introduction to Python” course by working in the Constellate Lab alongside a recording of the class.

JSTOR
Used by millions for research, teaching, and learning. With more than a thousand academic journals and over one million images, letters, and other primary sources, JSTOR is one of the world's most trusted sources for academic content.
JSTOR is a not-for-profit service that includes full-text content of more than 1,300 academic journals. This includes scholarship published in over one thousand of the highest-quality academic journals across the humanities, social sciences, and sciences.

JSTOR Labs
Tools and projects using JSTOR content in innovative ways for research and teaching.

Data for Research (DfR) was a separate interface to access journal and pamphlet content on JSTOR ready for analysis and data mining. Searching DfR enabled researchers to find useful patterns, associations and unforeseen relationships in the body of research available in the journal and pamphlet archives on JSTOR. You could search OCR, metadata and key terms to download N-grams and word counts for up to 1,000 documents at a time, in XML or CSV format.

Data for Research has been replaced by Constellate.

Financial and business databases

Company annual reports, which are increasingly used for data mining, are available from the (quoted) company's corporate website, from bodies such as Companies House (chargeable), or from databases such as PI Filings Expert, Bloomberg or Refinitiv Eikon.

Another popular resource is the U.S. Securities and Exchange Commission search tool EDGAR.

Bloomberg Professional
Current and historical financial information on individual equities, stock market indices, fixed-income securities, currencies, commodities, and futures for both international and domestic markets (booking required).
Mergent Online
Internet-based suite of information resources that enables in-depth business and financial research for U.S.-based and many international companies.
PI Filings Expert
A financial and capital markets database providing access to over 14 million global company filings including annual reports, M&A (Mergers and Acquisitions), IPO (Initial Public Offerings), bond prospectuses and news announcements. Offers free text search. Replaces PI Navigator.
Refinitiv Eikon
Features market quotes, earnings estimates, financial fundamentals, press releases, transaction data, corporate filings, ownership profiles and research from Refinitiv (formerly Thomson Reuters). Replaces ThomsonONE.com. You must book a slot with an Eikon ID to use this resource.
(Note: Eikon will be replaced with LSEG Refinitiv Workspace in September 2023.)
Eikon is available via a web interface, also via a desktop application with an Excel add-in. Datastream is available via a second Excel add-in. There is also an Eikon Data API. You must book a slot with an Eikon ID to use any of these platforms. An Eikon ID can be used by one person in one place at one time.

Data mining with financial and business databases

The challenges that you might face include the following:

Obtaining files in bulk
Processing PDF files whose underlying plain text is incorrect or missing
Processing PDF files that lack structure, for example when you are only interested in the Chairman's Statement
Processing HTML or other text files with poor, inconsistent or no discernible structure

These challenges are not unique to financial and business data, although for structured filings you might be able to use XBRL (eXtensible Business Reporting Language) to overcome them. That still leaves a challenge that is specific to business: mapping company and security identifier codes.
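
Where a report has already been converted to plain text, one rough way to isolate a section such as the Chairman's Statement is to search for its heading and cut at the next known heading. The following is a minimal Python sketch under strong assumptions: that headings sit on their own lines, and that you can supply a list of likely following headings. The heading names and sample report are illustrative, and real reports will need more tolerant matching.

```python
import re


def extract_section(text, heading, next_headings):
    """Return the text between `heading` and the first later heading, or None if absent."""
    # Treat curly and straight apostrophes alike, since PDF extraction varies.
    text = text.replace("\u2019", "'")
    flags = re.IGNORECASE | re.MULTILINE
    start = re.search(rf"^\s*{re.escape(heading)}\s*$", text, flags)
    if start is None:
        return None
    rest = text[start.end():]
    end = None
    if next_headings:
        pattern = "|".join(re.escape(h) for h in next_headings)
        end = re.search(rf"^\s*(?:{pattern})\s*$", rest, flags)
    section = rest[:end.start()] if end else rest
    return section.strip()


# Illustrative sample standing in for text extracted from an annual report.
report = """Annual Report 2020

Chairman's Statement
This was a year of steady growth.
Revenue rose across all divisions.

Chief Executive's Review
Operations performed well.
"""

statement = extract_section(
    report,
    "Chairman's Statement",
    ["Chief Executive's Review", "Strategic Report"],
)
print(statement)
```

Headings that also appear in a table of contents, or that are worded differently from year to year, will defeat this approach, so spot-check the output against the source documents.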
