Some of the Library's collections offer a search interface that includes more advanced text searching tools. These may include
The following links may be useful for those looking to conduct text and data mining on content that has been downloaded or is otherwise available on a hard drive. See also the Collections page of this guide to find out which collections are most suitable.
This page explains which data mining facilities exist in databases at the University of Southern California. The details are similar for the resources at The University of Manchester Library.
There are many packages to consider for holding your texts for analysis. The list below includes some of the packages available to researchers at The University of Manchester. Some may already be installed on your PC; otherwise, you will need to find them via the Software Centre icon on your desktop or request them from IT Services.
If an existing data analysis package is not appropriate for your needs, you may need to create your own database. For example, if you are using a collection of XML files which contain the full-text content for news articles, and you make your own interpretations of each text, you will need to store these in a sensible way.
It is not possible to discuss the many considerations in this guide, but some of the desktop tools commonly used are Microsoft Excel (or another spreadsheet package) and databases such as MySQL (relational), MongoDB (document), Neo4j (graph) and Microsoft Access.
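As a rough illustration of storing texts alongside your own interpretations, the sketch below loads one XML article into a small relational database and attaches an annotation to it. It uses SQLite through Python's built-in sqlite3 module purely for convenience; SQLite is not one of the tools named above, and the XML element names, table names and sample annotation are all hypothetical. A real collection will have its own schema, which you would need to follow.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical article XML; real collections will use their own schema.
SAMPLE_XML = """<article id="a1">
  <title>Example headline</title>
  <date>1999-01-31</date>
  <body>Full text of the article goes here.</body>
</article>"""


def create_store(conn):
    """Create one table for source texts and one for your own annotations."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id TEXT PRIMARY KEY, title TEXT, published TEXT, body TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations ("
        "article_id TEXT REFERENCES articles(id), label TEXT, note TEXT)"
    )


def load_article(conn, xml_text):
    """Parse one XML document, insert it, and return the article id."""
    root = ET.fromstring(xml_text)
    article_id = root.get("id")
    conn.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
        (article_id, root.findtext("title"),
         root.findtext("date"), root.findtext("body")),
    )
    return article_id


def annotate(conn, article_id, label, note=""):
    """Record one interpretation (a label plus free-text note) for an article."""
    conn.execute(
        "INSERT INTO annotations VALUES (?, ?, ?)", (article_id, label, note)
    )


conn = sqlite3.connect(":memory:")  # use a file path to keep the data
create_store(conn)
aid = load_article(conn, SAMPLE_XML)
annotate(conn, aid, "economy", "Illustrative note only")

# Join each article to the labels you have assigned to it.
rows = conn.execute(
    "SELECT a.title, n.label FROM articles a "
    "JOIN annotations n ON n.article_id = a.id"
).fetchall()
print(rows)
```

Keeping the source texts and the annotations in separate tables means you can reload or correct the texts without losing your own interpretations.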
The following links may be useful when considering programming:
You may have spatial or geographic data, for example a country of publication for each text in your collection, or your collection may consist of maps. You may be able to visualise your content using an advanced tool such as ArcGIS or the free QGIS.
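One way to get such data into a GIS tool is to write it out as GeoJSON, a format that QGIS can open. The sketch below is a minimal example under stated assumptions: the records, their field names and the approximate coordinates are hypothetical, and your own data may need geocoding first to obtain coordinates at all.

```python
import json

# Hypothetical records: a title plus approximate coordinates
# for the place of publication.
records = [
    {"title": "Text A", "place": "Manchester", "lat": 53.48, "lon": -2.24},
    {"title": "Text B", "place": "Paris", "lat": 48.86, "lon": 2.35},
]


def to_geojson(rows):
    """Build a GeoJSON FeatureCollection with one point per record."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [row["lon"], row["lat"]],
            },
            "properties": {"title": row["title"], "place": row["place"]},
        })
    return {"type": "FeatureCollection", "features": features}


collection = to_geojson(records)
print(json.dumps(collection, indent=2))
```

Saving the printed output to a file with a .geojson extension should give you a layer that can be added to a map, with the title and place available as attributes.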
Links to support resources for Geographic Information Systems (GIS) are listed below.
Increasingly, publishers are including visualisation tools directly in the web platforms used to access library databases, such as charts showing the rise and fall of key terms with publication date, or topic modelling.
Programming languages like Python and R have libraries which can be utilised for displaying the data you have collected in interesting and informative ways.
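As a small illustration, the sketch below counts how often a key term appears per publication year and prints a simple text chart. It uses only the Python standard library; the sample documents and the function name are hypothetical. The same counts could be passed to a plotting library such as matplotlib to draw a proper chart.

```python
from collections import Counter

# Hypothetical (year, text) pairs standing in for a downloaded collection.
documents = [
    (1990, "The strike continued into the winter."),
    (1990, "Ministers met to discuss trade."),
    (1991, "A new strike was called; strike leaders met."),
    (1992, "Trade talks resumed."),
]


def term_counts_by_year(docs, term):
    """Count whole-word, case-insensitive occurrences of `term` per year."""
    counts = Counter()
    for year, text in docs:
        # Crude tokenisation: split on whitespace, strip common punctuation.
        words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
        counts[year] += words.count(term.lower())
    return dict(sorted(counts.items()))


counts = term_counts_by_year(documents, "strike")

# One row per year: the year, a bar of '#' characters, and the count.
for year, n in counts.items():
    print(year, "#" * n, n)
```

The tokenisation here is deliberately crude; for real analysis you would probably want a dedicated text processing library and to normalise counts by the number of words published each year.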
Other tools, such as Tableau and Gephi, exist primarily to visualise data.
Linked Data is about publishing structured data on the Web so that it can be interlinked and become more useful through semantic queries. More specifically, Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF".
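To make the idea of URIs and RDF more concrete, the sketch below writes two statements about a text in N-Triples, one of the plain-text RDF syntaxes. Each statement is a subject, a predicate and an object. The predicates come from the Dublin Core terms vocabulary; the example.org URIs and the helper function names are hypothetical placeholders. In practice you would probably use an RDF library rather than building the strings by hand.

```python
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslash, quote and newline."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def triple(subject, predicate, obj):
    """Format one subject-predicate-object statement."""
    return f"{subject} {predicate} {obj} ."


# Hypothetical identifiers for a text and its author.
text_uri = "http://example.org/text/1"
author_uri = "http://example.org/person/42"

lines = [
    # The object is a literal: the title of the text.
    triple(uri(text_uri), uri(DCTERMS + "title"), literal('An "example" text')),
    # The object is another URI, which is what makes the data linkable.
    triple(uri(text_uri), uri(DCTERMS + "creator"), uri(author_uri)),
]
print("\n".join(lines))
```

Because the author is identified by a URI rather than a name string, any other dataset that uses the same URI can be connected to this one.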
As library collections emerge that offer support for Linked Data, they will be listed below.
Citizen science projects use the efforts and ability of volunteers to help scientists and researchers deal with the flood of data that confronts them.
Computer vision is a term which covers the automated extraction, analysis and understanding of useful information from one or more images.
Some examples of tools used or developed by groups such as the University of Oxford Visual Geometry Group follow.