Subject guides: Digital Humanities: Technology

Text and data mining

Some of the Library's collections offer a search interface that includes more advanced text searching tools. These may include:

proximity searching (when two words occur relatively close together).
clustering similar articles together in a search.
fuzzy matching (allowing for spelling, typing or printing variance).
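
As a rough illustration of two of these ideas, the sketch below implements proximity searching and fuzzy matching using only the Python standard library. It is not how any particular Library platform works: the function names, the sample sentence, the token gap and the similarity cutoff are all assumptions chosen for the example.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def within_proximity(text, word_a, word_b, max_gap=5):
    """Return True if word_a and word_b occur within max_gap tokens of each other."""
    tokens = tokenize(text)
    positions_a = [i for i, token in enumerate(tokens) if token == word_a]
    positions_b = [i for i, token in enumerate(tokens) if token == word_b]
    return any(abs(a - b) <= max_gap for a in positions_a for b in positions_b)


def fuzzy_find(text, target, cutoff=0.8):
    """Return the distinct tokens whose similarity to target meets the cutoff."""
    vocabulary = sorted(set(tokenize(text)))
    return difflib.get_close_matches(target, vocabulary, n=10, cutoff=cutoff)


# Invented sample text; "Manchefter" imitates a long-s printing variant.
sample = "The cotton mills of Manchefter employed many workers; Manchester grew quickly."

print(within_proximity(sample, "cotton", "workers", max_gap=6))
print(fuzzy_find(sample, "manchester"))
```

Lowering the cutoff makes the fuzzy match more tolerant of spelling or printing variance, at the cost of more false matches.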

The following links may be useful for those looking to conduct text and data mining on content that has been downloaded or is otherwise available on a hard drive. See also the Collections page of this guide for which collections are most suitable.

AntConc
A freeware corpus analysis toolkit for concordancing and text analysis. There are other useful packages on this site such as AntFileConverter to convert PDF and Word documents to TXT.
A curated collection of resources for data cleaning and text preparation
Some suggested introductory readings, lessons, and software packages that may be of interest for those looking to do some self-directed learning on Data Cleaning and Text Preparation.
Intro to Text Analysis
A Library Guide from Penn State University Libraries about text analysis, finding corpora and selecting methods.
OpenRefine
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
Suggested data mining tools from NaCTeM
The National Centre for Text Mining (NaCTeM) provides a list of suggested tools useful for text and data mining.
Text Creation Partnership (TCP)
The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books. They transcribe and mark up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints.
Text Encoding Initiative (TEI)
The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. [sic]
Text Mining 101
What is text mining, how does it work and why is it useful? Foster and OpenMinted.eu explain.
Weka
Weka is a collection of machine learning algorithms for data mining tasks. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualisation.

USC Libraries' Content Mining Research Guide

This guide provides information about text mining resources and tools at the University of Southern California and whether or not their subscription databases support content mining. The details are similar for the resources at The University of Manchester Library.

Data analysis packages

There are many packages to consider for holding and analysing your texts. The list below includes some of the packages available to researchers at The University of Manchester. Some may already be installed on your PC; otherwise, you can get them through the Software Centre icon on your desktop or from IT Services.

Mathematica
Official website for Wolfram Mathematica.
MATLAB
Official website for MATLAB.
NVivo
Official website for NVivo.

Creating your own database

If an existing data analysis package is not appropriate for your needs, you may need to create your own database. For example, if you are using a collection of XML files which contain the full-text content for news articles, and you make your own interpretations of each text, you will need to store these in a sensible way.

It is not possible to discuss the many considerations in this guide, but some of the desktop tools commonly used are Microsoft Excel (or other spreadsheet package), or databases such as MySQL (relational), MongoDB (document), Neo4j (graph) and Microsoft Access.
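
As one possible starting point, the sketch below stores interpretations of individual files in a small relational database. It uses SQLite through Python's built-in sqlite3 module rather than any of the tools named above, simply because it needs no installation. The table name, column names and file name are hypothetical.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) a database with a single table of interpretations."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS interpretation (
            id INTEGER PRIMARY KEY,
            source_file TEXT NOT NULL,  -- e.g. the XML file the note refers to
            label TEXT NOT NULL,        -- your own category or code
            note TEXT                   -- free-text commentary
        )
        """
    )
    return conn


def add_interpretation(conn, source_file, label, note=""):
    """Record one interpretation; the transaction commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO interpretation (source_file, label, note) VALUES (?, ?, ?)",
            (source_file, label, note),
        )


def labels_for(conn, source_file):
    """Return the labels recorded for one file, in insertion order."""
    rows = conn.execute(
        "SELECT label FROM interpretation WHERE source_file = ? ORDER BY id",
        (source_file,),
    )
    return [row[0] for row in rows]


conn = create_store()  # in-memory database; pass a file path to keep the data
add_interpretation(conn, "article_001.xml", "trade", "Discusses cotton prices")
add_interpretation(conn, "article_001.xml", "editorial")
print(labels_for(conn, "article_001.xml"))
```

Keeping the file name in its own column means the interpretations can later be joined back to the original texts.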

HEURIST
Whatever your field of study, HEURIST can help you to design, create, manage, analyse and publish your own richly-structured database(s) within hours through a simple web interface.
Webinar: Storing data in databases
YouTube video from UK Data Service. This introductory webinar covers: a basic definition of a database and introduction to a variety of database types; how data can be stored in relational databases and how data can be retrieved from them by writing queries in SQL (Structured Query Language); how data can be stored in ‘NoSQL’ databases such as document databases (MongoDB) and graph databases (Neo4j).

Programming

If you need to perform a difficult or repetitive task on your data, or if you want to interpret data in a format designed for use by machines, such as XML, it might be necessary to write a script. Some common scripting and programming languages for Digital Humanities are Python, R and JavaScript.
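
For example, a short Python script can pull a few fields out of an XML file using the standard library's xml.etree.ElementTree module. The sketch below is illustrative only: the element names (article, title, date, body) and the sample content are invented, and a real collection will use its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup, invented for this example.
SAMPLE_XML = """
<article id="a1">
  <title>Cotton prices rise</title>
  <date>1862-03-14</date>
  <body>Prices rose sharply in Manchester this week.</body>
</article>
"""


def extract_article(xml_text):
    """Parse one article and return its key fields as a dictionary."""
    root = ET.fromstring(xml_text.strip())
    body = root.findtext("body", default="")
    return {
        "id": root.get("id"),
        "title": root.findtext("title", default=""),
        "date": root.findtext("date", default=""),
        "word_count": len(body.split()),
    }


print(extract_article(SAMPLE_XML))
```

The same function could be run in a loop over every file in a folder, which is where scripting saves time on repetitive tasks.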

The following links may be useful when considering programming:

Webinar: What are APIs?
YouTube video from UK Data Service. This webinar concentrates on the use of APIs (application programming interfaces) to access and download information from websites for which an API is available (e.g. internet search engines, social media websites, travel information websites etc.). Twitter is used as the main example.
The Programming Historian
Novice-friendly, peer-reviewed tutorials that help humanists learn a wide range of digital tools, techniques, and workflows to facilitate research and teaching. Materials in English, Spanish and French.
Python for Humanists: Why learn Python?
Blog post by Roger Whitson. He sketches out a case for 1) why humanists should learn to code; 2) why Python is a great language for beginning humanists; and 3) what are some great resources for learning Python.
Python programming for the humanities
An interactive tutorial and introduction to programming with Python for the humanities, by Folgert Karsdorp.
Digital Methods Initiative: List of tools
List of convenient tools, instructions and use cases for various digital methods. Examples include a Dorling (bubble) map generator and a webpage language detector.

Geographic Information Systems (GIS)

You may have spatial or geographic data, for example a country of publishing for each text in your collection, or your collection may consist of maps. You may be able to visualise your content using an advanced tool such as ArcGIS or the free QGIS.

Links to support resources for Geographic Information Systems (GIS) are listed below.

A primer on GeoTechnologies
A wiki entry on the Computational Science Community Wiki.
FOSS4G Academy Curriculum
Thirty-five (35) FOSS4G University-level lectures and labs are maintained and made available for download from the Spatial {Query} Lab on behalf of the GeoAcademy. The lectures focus on a vendor-agnostic set of theories and principles. The labs focus on the use of QGIS, GRASS, and Inkscape.
Learn ArcGIS
The new Learn ArcGIS Hub experience provides collections of curated content to guide you in your discovery and exploration of ArcGIS capabilities and products.
MapCrow
MapCrow is a cartographic resource for tasks such as calculating the distance between two places. It features millions of locations and matching by sound.
Unlike smaller sites, it matches place names automatically, including on non-Latin characters, matches by sound, and finds alternate names such as Peking. It includes several million cities rather than just major ones (there are 48 cities called "Boston" and 51 called "Paris"). It also includes an interactive map.
There are pre-filled city lists for every US state and country, along with distance charts for individual continents.

It includes a complex, customisable flight time calculator, meeting time calculator, suburb and mountain peak searches and many related searches.
QGIS Workshop (Lancaster University)
Training slides and data sets covering the principles of spatial data, Geographic Information Systems as a software ecosystem, and a tour of QGIS. By Barry Rowlingson.
Spatial skills screencast tutorials
Access screencast tutorials on some of the common features of ArcMap and IDRISI and the new Urban Design Toolkit. There are also links to other online resources that you will find useful for your studies e.g. Digimap, Borders and Census resources.

Visualisation

Increasingly, publishers are including visualisation tools directly in the web platforms used to access library databases, such as charts showing the rise and fall of key terms with publication date, or topic modelling.

Programming languages like Python and R have libraries which can be utilised for displaying the data you have collected in interesting and informative ways.

Other tools exist with the primary purpose of visualising data, like Tableau and Gephi.

Bookworm
Bookworm is a simple and powerful way to visualise trends in repositories of digitised texts. Essentially, you can use it to build your own Google N-gram viewer.
Data Visualization 101: How to Choose the Right Chart or Graph for Your Data
A 9-minute guide from HubSpot on choosing a chart type that matches your data.
Gephi
Gephi is a leading visualisation and exploration software for all kinds of graphs and networks. It is open-source and free.
ImagePlot
ImagePlot is a free software tool that visualises collections of images and videos of any size. It is implemented as a macro which works with the open source image processing program ImageJ.
Introduction to network analysis with R
Jesse Sadler describes creating static and interactive network graphs using the R programming language.
NC State Data Visualization Guides and Tools
A list of guides, tools and sources for visualisation from NC State.
Python data visualization: Comparing 7 tools
Blog post from Dataquest which uses real world data and introduces the standard matplotlib and vispy, bokeh, seaborn, pygal, folium, and networkx.
Tableau example report
A dummy citation report using Tableau. It is rich, interactive and intuitive. Note that any data you use in the free version will be publicly available.

Linked Data

Linked Data is about publishing structured data on the Web so that it can be interlinked and made more useful through semantic queries. More specifically, Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF".
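
To make the idea of URIs and RDF more concrete, the sketch below represents a few statements as subject, predicate and object triples and runs a simple pattern query over them. It uses plain Python tuples rather than a real RDF library or SPARQL endpoint. The example.org URIs are hypothetical placeholders, while the two predicates are taken from the Dublin Core terms vocabulary.

```python
# Each statement is a (subject, predicate, object) triple, the basic unit of RDF.
DCT_CREATOR = "http://purl.org/dc/terms/creator"
DCT_TITLE = "http://purl.org/dc/terms/title"

# Hypothetical URIs under example.org, used purely for illustration.
GASKELL = "http://example.org/person/gaskell"
TRIPLES = [
    ("http://example.org/book/1", DCT_TITLE, "Mary Barton"),
    ("http://example.org/book/1", DCT_CREATOR, GASKELL),
    ("http://example.org/book/2", DCT_TITLE, "North and South"),
    ("http://example.org/book/2", DCT_CREATOR, GASKELL),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def titles_by(triples, creator_uri):
    """Follow creator links to books, then collect each book's title."""
    books = [s for (s, _, _) in match(triples, predicate=DCT_CREATOR, obj=creator_uri)]
    return [
        title
        for book in books
        for (_, _, title) in match(triples, subject=book, predicate=DCT_TITLE)
    ]


print(titles_by(TRIPLES, GASKELL))
```

Because the creator is identified by a URI rather than a text string, any other dataset using the same URI can be linked to these records without further matching.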

As library collections emerge that offer support for Linked Data, they will be listed below.

Crowd sourcing

Citizen science projects use the efforts and ability of volunteers to help scientists and researchers deal with the flood of data that confronts them.

The best-known example is Zooniverse, which offers a tool to build your own crowd-sourced project. The British Library's crowd-sourced projects are part of the LibCrowds platform.

Computer vision

Computer vision is a term which covers automated extraction, analysis and understanding of useful information from one or more images.

Some examples of tools used or developed by groups such as the University of Oxford Visual Geometry Group follow.

VGG Image Search Engine (VISE)
VISE (instance-based search, i.e. ‘find this thing’) is available for Windows, Mac and Linux through Docker.
Installation is not quite one-click and there are problems in particular with Docker on Windows.
VGG Image Classification (VIC) Engine
VIC (category search, i.e. ‘find all the examples of this sort of thing’) is available for Windows, Mac and Linux through Docker.
Installation is not quite one-click and there are problems in particular with Docker on Windows.
Traherne Digital Collator
Traherne Digital Collator is an image comparison tool developed for the Oxford Traherne project.
VGG Image Annotator (VIA)
VGG Image Annotator (VIA) is a simple annotation tool to define regions in an image and create a textual description of those regions.