Before we understand what integrity might mean in this context, we must first ask: What Is Research Data? Research data is the information, records, and files that are collected or used during the research process. Data may be numerical, descriptive, visual, raw, analysed, experimental, or observational.
Examples of research data include survey responses, spreadsheets, interview transcripts and recordings, images, lab notebooks, and software code.
Grant and funding bodies require research data to be managed throughout its lifecycle. You may also need to provide the data, or information about it, in other contexts: some journals require it as a condition of publication, for example, and it may be needed to support a patent application.
Creating a research data management plan at the start of the research project is the simplest way to save time in the collection, description, analysis, and reuse of the data. Effective management and documentation of research data means you can verify your research results, replicate the research, and provide access to data.
Consider the case of Gregg Semenza, winner of the 2019 Nobel Prize in Physiology or Medicine.
"Semenza shared his 2019 Nobel Prize with two other researchers, but only his papers keep popping up on PubPeer bearing telltale signs of data fakery. There are some recurrent author names suggesting naughty mentees or collaborators, but still, in many cases, Semenza is the last and corresponding author, so the final responsibility is his. After all, Nobel Prize recognition comes from that same last authorship."
For more examples of data fakery, contested research, and the necessity of data reproducibility, see the article quoted above, "Gregg Semenza: real Nobel Prize and unreal research data". As the piece details, misrepresentation of research data, together with the benefits and rewards that flow from the resulting publications, goes to the heart of integrity and trustworthy research. An inability to substantiate your findings is disastrous for the publication and verifiability of research, as discussed in the Reproducibility section below.
Not only is a clear and consistent methodology important in research, but easily available, reliable, and honestly presented data is crucial. Research without an integrity-driven research data practice is a house built on sand.
The FAIR Data Principles (Findable, Accessible, Interoperable, Reusable) were drafted at a Lorentz Center workshop in Leiden, the Netherlands, in 2015, and have since been recognised by organisations worldwide, including FORCE11, the National Institutes of Health (NIH), and the European Commission, as a useful framework for sharing data in a way that enables maximum use and reuse. They are a way of getting the most out of your research data, and of thinking about its place in the wider research community.
Can your data be found if someone is looking for it? Does it have a DOI or a Handle? Does it have rich metadata? Is it discoverable through a research portal or a repository?
Does your data use a standardised access protocol? Your data does not necessarily have to be "open" - there are sometimes good reasons why data cannot be made open, e.g. privacy concerns, national security, or commercial interests - but if it is not, there should be clarity and transparency around the conditions governing access and reuse.
To be interoperable, the data will need to use community-agreed formats, languages, and vocabularies. Will someone who finds your data be able to meaningfully reuse it, and build on or reproduce your work? The metadata you use will also need to follow community-agreed standards and vocabularies, and contain links to related information using identifiers.
Reusable data should maintain its initial richness. For example, it should not be diminished for the purpose of explaining the findings in one particular publication. It needs a clear, machine-readable licence and provenance information on how the data was formed. It should also follow discipline-specific data and metadata standards, giving it the rich contextual information that allows for reuse.
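The four principles can be made concrete in a single metadata record attached to a dataset. The sketch below is illustrative only: field names are loosely modeled on the DataCite metadata schema, and every value (DOI, names, provenance note) is hypothetical.

```python
import json

# A minimal, FAIR-style metadata record. Field names loosely follow the
# DataCite schema; all values are hypothetical, for illustration only.
record = {
    # Findable: a persistent identifier and a descriptive title
    "identifier": {"identifierType": "DOI", "identifier": "10.0000/example.123"},
    "titles": [{"title": "Example survey dataset"}],
    # Linked to related information via identifiers (here, an ORCID)
    "creators": [{"name": "Researcher, Example",
                  "nameIdentifier": "https://orcid.org/0000-0000-0000-0000"}],
    # Accessible: clear, transparent conditions governing access
    "accessConditions": "Open access",
    # Interoperable: a community-agreed format
    "formats": ["text/csv"],
    # Reusable: a machine-readable licence and provenance information
    "rights": {"rights": "CC BY 4.0",
               "rightsURI": "https://creativecommons.org/licenses/by/4.0/"},
    "provenance": "Collected via questionnaire, 2020; cleaned and anonymised",
}

# Serialised as JSON, the record is both human- and machine-readable.
print(json.dumps(record, indent=2))
```

A record like this is what a repository indexes to make the dataset discoverable, and what a future reuser reads to decide whether the data fits their purpose.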
Reproducibility is a crucial consideration in research, and it depends on data managed in exactly this way: in the words of the Australian National Data Service, data should maintain its initial richness and should not be diminished for the purpose of explaining the findings in one particular publication.
In his journal article "No raw data, no science: another possible source of the reproducibility crisis", Molecular Brain editor-in-chief Tsuyoshi Miyakawa argues that "inappropriate practices of science, such as HARKing, p-hacking, and selective reporting of positive results, have been suggested as causes of irreproducibility", but "a lack of raw data or data fabrication is another possible cause of irreproducibility".
Miyakawa analyses the often parlous state of data availability and reproducibility in his field, where nearly a quarter of submissions since 2017 were marked as "revise before review". Of these 41 manuscripts, 21 were withdrawn owing to an inability to provide raw data, while another 19 were rejected because the raw data supplied was insufficient. Thus, Miyakawa concludes:
"more than 97% of the 41 manuscripts did not present the raw data supporting their results when requested by an editor, suggesting a possibility that the raw data did not exist from the beginning, at least in some portions of these cases. Considering that any scientific study should be based on raw data, and that data storage space should no longer be a challenge, journals, in principle, should try to have their authors publicize raw data in a public database or journal site upon the publication of the paper to increase reproducibility of the published results and to increase public trust in science."
The reproducibility crisis is not limited to the sciences, though they are perhaps the best lens through which to view it. In Nature Human Behaviour's collective "A manifesto for reproducible science", the dangers of falsity in data collection, retention, and reuse are bluntly stated:
"What proportion of published research is likely to be false? Low sample size, small effect sizes, data dredging (also known as P-hacking), conflicts of interest, large numbers of scientists working competitively in silos without combining their efforts, and so on, may conspire to dramatically increase the probability that a published finding is incorrect."
As the manifesto concludes:
"These cautions are not a rationale for inaction. Reproducible research practices are at the heart of sound research and integral to the scientific method. How best to achieve rigorous and efficient knowledge accumulation is a scientific question; the most effective solutions will be identified by a combination of brilliant hypothesizing and blind luck, by iterative examination of the effectiveness of each change, and by a winnowing of many possibilities to the broadly enacted few. True understanding of how best to structure and incentivize science will emerge slowly and will never be finished. That is how science works. The key to fostering a robust metascience that evaluates and improves practices is that the stakeholders of science must not embrace the status quo, but instead pursue self-examination continuously for improvement and self-correction of the scientific process itself.
As Richard Feynman said, “The first principle is that you must not fool yourself – and you are the easiest person to fool.”
Figshare is a best-in-class data publishing platform for RMIT researchers and Higher Degree Research students to store, manage, share, and discover research.
Figshare offers a range of benefits for managing and sharing your research data, some of which are outlined below.
For an overview of the Open Data landscape, see Figshare's annual State of Open Data 2021 report, produced in collaboration with Digital Science and Springer Nature.
Figshare offers not only visibility for your research data but also security. When it comes to verification, provability, and the FAIR data principles, data repositories like Figshare are crucial in supporting your publications and securing the integrity of your research long into the future. Figshare provides version control, allowing you to track changes to your data post-publication, and all work is stored on Amazon AWS S3 servers located in Australia that perform regular, systematic data integrity checks.
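Integrity checks of this kind are typically based on comparing a cryptographic checksum of the stored file against the value recorded at deposit time. A minimal sketch of the idea (the choice of SHA-256 and the sample data are illustrative, not a description of Figshare's actual implementation):

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

# At deposit time: record a checksum alongside the dataset.
deposited = b"participant_id,score\n001,42\n002,37\n"
recorded = checksum(deposited)

# Later, a systematic integrity check re-reads the file, recomputes the
# checksum, and compares it with the recorded value; any mismatch
# indicates corruption or alteration of the data.
retrieved = deposited  # in practice, re-read from storage
assert checksum(retrieved) == recorded
print("integrity check passed")
```

The same comparison supports verification by third parties: anyone holding a copy of the data can recompute the checksum and confirm it matches the published record.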