Text analysis (or text mining) uses large collections of text or "unstructured data" to identify patterns or connections. Automated computer tools are used to process large amounts of text, meaning that no reading or viewing of the materials is necessary. For this reason, text analysis is considered "non-consumptive" research.
Adapted from "What is text and data mining" by The University of Adelaide Library, licensed under CC BY-NC-SA 4.0.
How does Text Mining Work? (1:34 mins) by Elsevier (YouTube)
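As a rough illustration of how automated tools find patterns without anyone reading the text, the sketch below counts the most frequent words across a few short documents. The sample sentences and the `tokenize` helper are invented for this example, and it uses only the Python standard library; it is a minimal sketch, not a complete text analysis workflow.

```python
import re
from collections import Counter

# Hypothetical sample texts, invented for this illustration.
documents = [
    "Text mining finds patterns in large collections of text.",
    "Large collections of text are processed by automated tools.",
    "Automated tools find patterns without anyone reading the text.",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Accumulate word counts over every document in the collection.
word_counts = Counter()
for doc in documents:
    word_counts.update(tokenize(doc))

# The three most frequent words across the whole collection.
top_words = word_counts.most_common(3)
print(top_words)
```

On a real collection the same idea scales to thousands of documents, although common words such as "of" and "the" would usually be filtered out first.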
Data mining is the use of computational techniques to find patterns or relationships within large sets of organised or "structured" data. These datasets need to be organised into specific, defined formats before mining processes can be performed.
Adapted from "What is text and data mining" by The University of Adelaide Library, licensed under CC BY-NC-SA 4.0.
All major data mining techniques explained with examples (13:03 mins) by Learn with Whiteboard (YouTube)
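To make the idea of finding relationships in structured data more concrete, the sketch below counts which subject tags appear together most often in a small set of records. The `loans` records and their field names are hypothetical and exist only for this example. Pair counting is one simple pattern-finding approach and is not meant to represent data mining techniques in general.

```python
from collections import Counter
from itertools import combinations

# Hypothetical structured records: every row has the same defined fields.
loans = [
    {"borrower": "A", "subjects": ["statistics", "python"]},
    {"borrower": "B", "subjects": ["statistics", "python", "law"]},
    {"borrower": "C", "subjects": ["law", "ethics"]},
    {"borrower": "D", "subjects": ["python", "statistics"]},
]

pair_counts = Counter()
for record in loans:
    # Sort the subjects so that ("python", "statistics") and
    # ("statistics", "python") are counted as the same pair.
    for pair in combinations(sorted(set(record["subjects"])), 2):
        pair_counts[pair] += 1

# The pair of subjects that most often occurs in the same record.
most_common_pair, frequency = pair_counts.most_common(1)[0]
print(most_common_pair, frequency)
```

The records must share a consistent format (here, a `subjects` list in every row) before this kind of counting can work, which is why structured data usually needs to be organised before mining.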
There are a number of factors to be aware of when conducting text analysis and data mining. See below for issues related to Ethics, Copyright, Permissions, Licensing, and Referencing.
For more details on definitions and legal implications of text and data mining see the Australian Law Reform Commission page.
Even if access is permitted when performing text and data mining, it is important that researchers respect the rights of the owners of the content and abide by their terms of access. Researchers also need to respect the privacy of the subjects of research and be aware that data mining may reveal confidential details.
Information on the responsibilities of researchers can be found on this page on Research integrity.
There is no Australian copyright exemption for text and data analysis, as explained in this Australian Law Reform Commission discussion paper. Even publicly accessible arrangements of datasets are still protected by copyright and may require permission for use in a text analysis or data mining project.
For some data, you may need to acquire permission from the rightsholder before performing analysis on datasets. Be aware that if you are granted permission to use data for your research, this may not extend to use for publication. It is easier to seek permission for all uses of the data upfront.
For tips on permission seeking for researchers, please see the Copyright guide's Seeking permission section.
Data and database publishers vary widely in the degree to which they permit text and data mining of their collections. First consult the licence in the LibrarySearch record for the database, as illustrated in the image below:
Image: Copyright © Ex Libris. Used under licence.
If the 'Show License' option does not appear, or if the information does not mention data mining, contact the Library Research Services team.
Websites and social media platforms have terms of service which may include clauses around data mining and text analysis. Check the website terms of service or terms of use to determine what is allowed for the site you intend to use.
The Australian Research Data Commons (ARDC) has several flowcharts that illustrate the licensing process and a data rights management guide that focuses on rights information and licences.
Materials borrowed from others for text analysis and data mining, such as data sets and raw data, stop word lists, algorithms, visualisations and other textual data, should be acknowledged and cited appropriately in your chosen referencing style.
See the RMIT Easy Cite referencing guide to determine how to cite data sources in a variety of referencing styles.
Raw data available for text analysis and data mining can be derived from many sources, including library databases and the open web. See the following tabs for some licensed and open access data sources that may be useful for your research.
Note: The following data sources are licensed library resources and permitted for use by RMIT staff and students (RMIT login required).
Note: The Library does not license the following open access resources, and does not assist with API management, text and data mining, and other services.
Programming languages such as Python and R are well suited to developing text analysis applications because of the many libraries available that provide natural language processing (NLP) functions.
Note: Some basic familiarity with programming languages may be required to use these tools, and where possible, training resources have been provided for inexperienced users.
Python is a general-purpose programming language with a focus on code readability for projects of all sizes.
AccessRMIT provides access to Python for staff and students.
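As a small illustration of the kind of text analysis Python supports, the sketch below removes stop words from a sentence and counts adjacent word pairs (bigrams). The stop word list, the sample sentence, and the helper names `content_words` and `bigrams` are all invented for this example. It uses only the standard library, whereas real projects would more likely rely on a dedicated NLP library and a published stop word list.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list. Real projects usually borrow a
# published list, which should then be cited.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for"}


def content_words(text):
    """Return the lower-cased words of the text with stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def bigrams(words):
    """Return each pair of neighbouring words in the list."""
    return list(zip(words, words[1:]))


sample = "Text analysis of the archive and text analysis of the web is a growing field."
words = content_words(sample)

# Note: because stop words are removed first, some pairs join words that
# were not directly adjacent in the original sentence.
bigram_counts = Counter(bigrams(words))
print(words)
print(bigram_counts.most_common(1))
```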
R is a programming language and software platform focused on statistical analysis and graphical presentation, and is widely used in data mining.
AccessRMIT provides access to R for staff and students.
Note: For access to RStudio and related training resources, see the RStudio section of this guide.
Web-based tools provide a variety of visualisation and analysis features that are easy to use and manage. The tools listed below include word clouds, charts, graphics, and other features that visualise and statistically interpret your text.
This Library guide by RMIT University Library is licensed under a CC BY-NC 4.0 licence, except where otherwise noted. All reasonable efforts have been made to clearly label material where the copyright is owned by a third party and ensure that the copyright owner has consented to this material being presented in this library guide. The RMIT University logo is ‘all rights reserved’.