Digital tools for research: Text analysis and data mining

What is text analysis and data mining?

Text analysis

Text analysis (or text mining) uses large collections of text, or "unstructured data", to identify patterns or connections. Automated computer tools process large amounts of text, so the researcher does not need to read or view the materials. For this reason, text analysis is considered ‘non-consumptive’ research.

Adapted from "What is text and data mining" by The University of Adelaide Library, licensed under CC BY-NC-SA 4.0.

How does Text Mining Work? (1:34 mins) by Elsevier (YouTube)
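
As a rough illustration of the idea described above, the following Python sketch counts word frequencies across a small collection of texts without anyone reading them. It uses only the standard library. The sample sentences, the short stop word list and the function name `word_frequencies` are assumptions made for this example, not part of any particular tool.

```python
import re
from collections import Counter

# Hypothetical mini stop word list; real projects usually borrow a published list.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "are"}

def word_frequencies(documents, stop_words=STOP_WORDS):
    """Count how often each non-stop word appears across a collection of texts."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and split it into alphabetic tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

# Illustrative sample texts, not real research data.
corpus = [
    "Text mining is the analysis of large collections of text.",
    "Data mining finds patterns in structured data.",
]
freqs = word_frequencies(corpus)
print(freqs.most_common(3))
```

Frequently occurring terms are one of the simplest patterns a text analysis tool can surface. Real projects would typically add steps such as stemming or a fuller stop word list.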

Data mining

Data mining is the use of computational techniques to find patterns or relationships within large sets of organised or "structured" data. These datasets need to be organised into specific, defined formats before mining processes can be performed.

Adapted from "What is text and data mining" by The University of Adelaide Library, licensed under CC BY-NC-SA 4.0.

All major data mining techniques explained with examples (13:03 mins) by Learn with Whiteboard (YouTube)
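
To make the contrast with text analysis concrete, here is a minimal Python sketch of one simple data mining task: finding values that frequently occur together in structured records. The records, the field name `subjects` and the function `frequent_pairs` are hypothetical and chosen only for illustration. Note that the data is already organised into a defined format before any mining happens.

```python
from collections import Counter
from itertools import combinations

# Hypothetical structured records: each row has a defined "subjects" field.
records = [
    {"loan_id": 1, "subjects": ["statistics", "python", "ethics"]},
    {"loan_id": 2, "subjects": ["python", "statistics"]},
    {"loan_id": 3, "subjects": ["ethics", "law"]},
    {"loan_id": 4, "subjects": ["python", "statistics", "law"]},
]

def frequent_pairs(rows, field, min_support):
    """Return pairs of values that occur together in at least min_support rows."""
    pair_counts = Counter()
    for row in rows:
        # Sort and de-duplicate so ("a", "b") and ("b", "a") count as one pair.
        values = sorted(set(row[field]))
        pair_counts.update(combinations(values, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

pairs = frequent_pairs(records, "subjects", min_support=2)
print(pairs)
```

Raising or lowering `min_support` changes how often a pair must appear before it is reported as a pattern.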

Key considerations

Several factors need to be considered when conducting text analysis and data mining. See below for issues related to ethics, copyright, permissions, licensing, and referencing.

For more details on definitions and legal implications of text and data mining see the Australian Law Reform Commission page.

Ethics

Even if access is permitted when performing text and data mining, it is important that researchers respect the rights of the owners of the content and abide by their terms of access. Researchers also need to respect the privacy of the subjects of research and be aware that data mining may reveal confidential details.

Information on the responsibilities of researchers can be found on this page on Research integrity.

Copyright

There is no Australian copyright exemption for text and data analysis, as explained in this Australian Law Reform Commission discussion paper. Even publicly accessible arrangements of datasets are still protected by copyright and may require permission for use in a text analysis or data mining project.

Permissions

For some data, you may need to acquire permission from the rightsholder before performing analysis on datasets. Be aware that if you are granted permission to use data for your research, this may not extend to use for publication. It is easier to seek permission for all uses of the data upfront.

For tips on permission seeking for researchers, please see the Copyright guide's Seeking permission section.

Licensing

Data and database publishers vary widely in the degree to which they permit text and data mining of their collections. First consult the licence in the LibrarySearch record for the database, as illustrated in the image below:

[Image: LibrarySearch database record with licensing options]

If the 'Show License' option does not appear, or if the information does not mention data mining, contact the Library Research Services team.

Websites and social media platforms have terms of service which may include clauses around data mining and text analysis. Check the website terms of service or terms of use to determine what is allowed for the site you intend to use.

The Australian Research Data Commons (ARDC) has several flowcharts that illustrate the licensing process and a data rights management guide that focuses on rights information and licences.

Referencing

Any material borrowed from others for text analysis or data mining should be acknowledged and cited in your chosen referencing style. This includes data sets and raw data, stop word lists, algorithms, visualisations, and other textual data.

See the RMIT Easy Cite referencing guide to determine how to cite data sources in a variety of referencing styles.

Data sources

Overview

Raw data available for text analysis and data mining can be derived from many sources, including library databases and the open web. See the following tabs for some licensed and open access data sources that may be useful for your research.

Licensed library sources

Note: The following data sources are licensed library resources and permitted for use by RMIT staff and students (RMIT login required).

Elsevier Text and Data mining
Elsevier allows text mining of the ScienceDirect and Scopus databases via an API.
IEEE API Quickstart guide
Steps through the process of acquiring and using the IEEE database API.
JSTOR Text-mining
The JSTOR database Data for Research program provides datasets for text analysis.
PubMed article datasets
The PubMed database provides large datasets of journal articles and other scientific publications.

Open access sources

Note: The Library does not license the following open access resources and does not provide support for API management, text and data mining, or related services.

Australian Data Archive
Provides a national service for the collection and preservation of digital research data. ADA disseminates this data for secondary analysis by academic researchers and other users.
CORE
A large aggregator of open access research papers, with access for text mining.
Crossref
Documentation for the Crossref API which allows text and data mining of the Crossref database.
HathiTrust Digital Library
Documentation for use of the HathiTrust datasets of digitised academic and research titles.
Trove
Use the Trove API to source rich data such as the digitised newspaper collection.
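
Several of the sources above are accessed through an API. As a hedged sketch of what that can look like, the Python below builds a search URL for the Crossref REST API's `works` endpoint and extracts titles from a response. The helper names `build_query_url` and `extract_titles` are hypothetical, the sample response is hand-written rather than real data, and no request is actually sent. Check each provider's documentation and terms of use before making live requests.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"

def build_query_url(search_terms, rows=5, mailto=None):
    """Build a Crossref works search URL; no request is sent here."""
    params = {"query": search_terms, "rows": rows}
    if mailto:
        # Crossref encourages API users to identify themselves with a contact address.
        params["mailto"] = mailto
    return f"{CROSSREF_WORKS}?{urlencode(params)}"

def extract_titles(response):
    """Pull titles out of a parsed JSON response shaped like Crossref's."""
    items = response.get("message", {}).get("items", [])
    # Each item's "title" is a list; take the first entry when present.
    return [item["title"][0] for item in items if item.get("title")]

url = build_query_url("text mining", rows=2)
print(url)

# Hand-written stand-in for a parsed response, for illustration only.
sample = {
    "message": {
        "items": [
            {"title": ["An example title"]},
            {"DOI": "10.0000/example"},  # item without a title is skipped
        ]
    }
}
print(extract_titles(sample))
```

Keeping URL construction separate from fetching makes it easy to review exactly what will be requested before any data is downloaded.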

Coding tools and tool indexes

Overview

Programming languages such as Python and R are well suited to developing text analysis applications because of the many libraries available for natural language processing (NLP).

Note: Some basic familiarity with programming languages may be required to use these tools, and where possible, training resources have been provided for inexperienced users.
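
To give a sense of the kind of function NLP libraries provide, the following Python sketch counts adjacent word pairs (bigrams) using only the standard library. The function name `bigram_counts` and the sample sentence are assumptions made for this example. Dedicated NLP libraries offer more robust tokenisation and many further functions.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs (bigrams) in a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # zip pairs each token with the one that follows it.
    return Counter(zip(tokens, tokens[1:]))

# Illustrative sample sentence, not real research data.
sample = "text mining and data mining: text mining works on unstructured text"
counts = bigram_counts(sample)
print(counts[("text", "mining")])
```

Word pairs that recur often can point to phrases or concepts worth closer study.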

Python

Python is a general-purpose programming language with a focus on code readability for projects of all sizes.

Access

RMIT provides access to Python for staff and students.

myDesktop - go to myDesktop (RMIT login required) and select Python from the Apps tab.
Personal device - to download/install Python on your own device:
- Go to the Python homepage.
- Locate the Downloads menu and follow the on-screen instructions.

Training resources

Python for data science: essential training part 1 (LinkedIn Learning tutorial)
Discover how to clean, transform, analyse, and visualise data, as you build a practical project: an automated web scraper.
Python for data science: essential training part 2 (LinkedIn Learning tutorial)
Discover how to use machine learning to generate predictions and recommendations and automate routine tasks.
RMIT Library e-books on Python for text mining and data analysis
Selection of Library resources on using Python.

R

R is a programming language and software platform for statistical analysis and graphical presentation, and is widely used in data mining.

Access

RMIT provides access to R for staff and students.

myDesktop - go to myDesktop (RMIT login required) and select R from the Apps tab.
Personal device - to download/install R on your own device:
- Go to the R homepage.
- Locate the Download menu and follow the on-screen instructions.

Note: For RStudio access and training resources, see the RStudio section of this guide.

Training resources

Learning R (LinkedIn Learning tutorial)
Learn the basics of R, the free, open-source language for data science. Discover how to use R and RStudio for beginner-level data modeling, visualisation, and statistical analysis.
R essential training: wrangling and visualising data (LinkedIn Learning tutorial)
Learn how to wrangle data and create meaningful visualisations with R, the programming language powering modern data science.
RMIT Library e-books on R for text mining and data analysis
Selection of Library resources on using R.

Tool collections and indexes

Digital Humanities Tools
Curated website with extensive listings of research tools, including text analysis tools and resources.
KDnuggets
Software and research tool page on KDnuggets, a curated website and blog on AI, machine learning and data mining.
TAPoR
Text Analysis Portal for Research (TAPoR) offers a curated listing of text analysis tools.

Visual data analysis

OriginPro
Homepage of OriginLab, the creators of OriginPro, a visual data analysis program. Free trial and paid downloads, as well as support and training, are available from the homepage.
Tecplot
Tecplot is a visual data analysis software program. Free trial and paid downloads, as well as support and training, are available from the homepage.

Web-based tools

Overview

Web-based tools offer visualisation and analysis features that are easy to use and manage. The tools listed below can produce word clouds, charts, and other graphics, and can generate statistical summaries of your text.

Jupyter

Jupyter
Jupyter Notebook is a web-based app that lets users create documents containing live code, visualisations and text. JupyterLab supports the development and configuration of notebooks for data workflows.
How to use Jupyter Notebook: a beginner's tutorial
Tutorial that provides an overview of how to set up and utilise Jupyter Notebooks effectively.

Leximancer

Leximancer
A commercial web-based text mining software program, which allows visual display of results and concept mapping. Tutorials and manuals are available on this page.

Orange

Orange
An open-source data analysis tool using a graphical interface, where you can select widgets and connect them to create workflows to analyse datasets.
Orange - getting started
Installation guide and links to video tutorials, workflow examples and widget guide.
Orange - documentation
Tutorials and manuals for using Orange.

Voyant

Voyant Tools
An open-source, web-based tool that lets you analyse text documents.
Voyant Tools - help
Online guides to using Voyant Tools.
