
Future Open Research

In this piece, I hope to give an overall view of what I believe the future research “flow” should look like, given the ideas in the Open Science and FAIR spaces, and how data stewards fit into the overall picture.

I’d like to place a disclaimer here. I do not think, in any shape or form, that the ideas presented here would be easy to implement. However, I believe that delineating the gold standard, and keeping it in mind, is essential while pursuing it. These ideas are aimed primarily at publicly funded research. Privately funded research might not fit this schema, although I’d argue that many aspects carry over.

Research Flow

The proposed Open Research workflow. Please note that “Mimir protocols” refer to Structured protocols.

First, we can define three large areas in the research process: planning, active research, and preservation and dissemination.

Data stewards and Open Science principles can help streamline and open up all three steps, creating a new research procedure of fundamentally higher quality. I refer to this proposed structure as “Open Research” (OR).

I often talk about Structured protocols. You can find more information on them here and here. They are implemented well in the (paid) platform of protocols.io.

Bullet points starting with $ are implementation questions that have to be effectively addressed before this research flow can be used in practice.

Planning

The project planning phase is arguably the most important. OR begins as any research project does, by having an idea. This idea, as is routinely done when applying for grants, is often fleshed out and plainly explained in a specific document. OR extends this idea to every project, big or small, funded through grants or not.

When an individual or team is committed to the execution of the project, a project space is created with the ability to store project-level documentation and metadata. This platform is then used as a centralized place to effectively plan all the other steps.

During this phase, many aspects of the overall project are defined:

Depending on the scope of the project, this document, the project plan, may be large or small. It is important to underline that OR prescribes that all projects, of any size, are explicitly planned.

If required, the funding proposal is created and included in the project space.

Crucially, after the project plan is created, a data steward is assigned to the project. The data steward has the role of guiding the researchers in all aspects of data management, from dealing with the fine points of FAIR data implementation to the education of researchers on Open Science concepts.

The data steward begins guiding the group through the creation of a data management plan. The data types that are expected to be handled in the project are defined. Then, the data steward determines if FAIR Implementation Profiles (FIPs) created in the past can be used to deal with these data types. If relevant FIPs exist, they can be reused or adapted to the current situation. Otherwise, new FIPs need to be created by the data steward, and deposited in the FIP archive for later reuse.
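
To make the idea of a reusable FIP more concrete, here is a minimal sketch of what a machine-readable FIP entry could look like, written as a plain Python dictionary that maps FAIR principles to the chosen FAIR Enabling Resources for one data type. The field names, values and the tiny lookup function are assumptions made for illustration, not an official FIP schema.

```python
# A minimal, illustrative sketch of a FAIR Implementation Profile (FIP)
# for a hypothetical data type. Field names and values are assumptions
# made for this example, not an official FIP schema.

fip_rna_seq = {
    "data_type": "bulk RNA-seq counts",          # what the profile applies to
    "community": "Example Genomics Lab",         # who declares these choices
    "declarations": {
        # FAIR principle -> chosen FAIR Enabling Resource(s)
        "F1 (identifiers)": ["DOI minted by the institutional repository"],
        "F2 (rich metadata)": ["schema.org Dataset profile"],
        "I1 (knowledge representation)": ["RDF / JSON-LD"],
        "I2 (vocabularies)": ["EDAM", "OBI"],
        "R1.1 (license)": ["CC-BY-4.0"],
        "A1 (access protocol)": ["HTTPS, open access after embargo"],
    },
}


def find_reusable_fip(archive: list[dict], data_type: str) -> dict | None:
    """Return an existing FIP for this data type, if one was deposited before."""
    for fip in archive:
        if fip["data_type"] == data_type:
            return fip
    return None


fip_archive = [fip_rna_seq]
print(find_reusable_fip(fip_archive, "bulk RNA-seq counts") is not None)  # True
```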

Once data types, volumes, privacy concerns, corresponding FIPs and all other aspects regarding data management are determined, the data management plan can be drafted and deposited as a deliverable in the project space.

After - and only after - the DMP is created, the true research process can begin.

Active Research

It is important to note that I am a researcher in the biological space, so my concept of an experiment aligns with this field. I admit my ignorance of other research areas and, while I try to present these ideas in a field-agnostic way, this discrepancy in conceptualization might cause some concepts presented here not to apply, at least directly, to all research fields.

The research process itself involves running experiments or, in general, gathering and processing knowledge.

It is crucial that each technically sound experiment or data collection (even if not ultimately useful to the goal of the project) is considered for long-term preservation (see, for this, the discard problem).

Tools such as open notebooks and integrated research environments are particularly useful to open up research during this step. However, due to the fast pace of the active research process, the complete transparency of this stage might not be an achievable goal.

A key aspect that enables future FAIR data creation is the usage of Structured protocols. They enable more or less automatic deposit of experimental data after the experiment is over. Of course, the guiding light during the experimental step is the specific FIP(s) previously defined in the DMP.
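
As a toy illustration of how a Structured protocol could enable near-automatic deposit, the sketch below declares the protocol steps and their expected output files up front, so the declared outputs can be collected and staged once the run is over. The protocol fields and the staging function are hypothetical, not part of any existing platform.

```python
# Toy sketch: a Structured protocol declares its steps and expected output
# files up front, so that once the run is complete the declared outputs can
# be collected and staged for deposit. All names here are illustrative.

from pathlib import Path

protocol = {
    "id": "proto-0042",
    "title": "Total RNA extraction",
    "steps": [
        {"n": 1, "action": "Lyse cells", "records": "lysis_notes.txt"},
        {"n": 2, "action": "Column purification", "records": "yield_table.csv"},
        {"n": 3, "action": "Quality check", "records": "bioanalyzer_report.pdf"},
    ],
}


def stage_for_deposit(protocol: dict, run_dir: Path, staging_dir: Path) -> list[Path]:
    """Copy every output declared by the protocol into a staging area."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for step in protocol["steps"]:
        src = run_dir / step["records"]
        if src.exists():                      # tolerate missing files, but warn
            dst = staging_dir / src.name
            dst.write_bytes(src.read_bytes())
            staged.append(dst)
        else:
            print(f"WARNING: declared output missing: {src}")
    return staged
```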

The Hot data produced during this step is temporarily stored in an agile repository.

The experiments produce considerations, which drive more experiments, and so on. At any point, the DMP may be updated (with the same procedure followed when creating it) if and when new data types need to be used.

It is useful to apply a standardized structure for the hot data storage, to ease later processing.
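
One possible standardized layout, sketched below, is a folder per experiment with fixed sub-folders for raw data, processed data, metadata and the protocol actually followed. The layout itself is an assumption, shown only to illustrate the idea of a uniform structure that eases later processing.

```python
# One possible standardized layout for hot (in-progress) data: one folder
# per experiment, each with the same fixed sub-folders. The specific layout
# is an assumption made for illustration.

from pathlib import Path

SUBFOLDERS = ["raw", "processed", "metadata", "protocol"]


def new_experiment_space(hot_root: Path, experiment_id: str) -> Path:
    """Create a uniformly structured folder for a new experiment."""
    exp_dir = hot_root / experiment_id
    for sub in SUBFOLDERS:
        (exp_dir / sub).mkdir(parents=True, exist_ok=True)
    # A minimal placeholder for the experiment-level metadata.
    (exp_dir / "metadata" / "experiment.json").write_text("{}")
    return exp_dir


# Example: new_experiment_space(Path("hot_data"), "EXP-2024-001")
```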

The data steward acts as both support and supervisor, periodically checking that data quality is upheld and that the processes detailed in the FIPs are respected regarding data formats, content, and so on.

OR prescribes that every experiment is explicitly documented, with a mini-rationale statement and a record of the considerations made on the results. This has two benefits. First, the researcher must consciously process why each experiment is run, and what information can be derived from it. Secondly, encapsulating data from each experiment individually makes later FAIRification easier and, in some cases, possible at all. Structured protocols should help in this regard, if implemented and followed correctly.

It might be possible to collate FIPs and Structured protocols into a single entity, which could be called FAIR-Enabled Structured Protocols.

All experiments, regardless of “positive” or “negative” outcome, should be considered for preservation. Only identifiable violations of the structured protocols may warrant the outright deletion of the experimental data due to technical failure. If the researcher still believes that the experiment failed (regardless of outcome), but no protocol breakage can be determined, the experiment can be retained but marked as potentially faulty.

A single experiment may be trying to answer more than one question, or might contain internal controls for quality assurance. Additionally, information regarding laboratory quality controls and known confounding factors needs to be included in the metadata of the experiment. This quality information must be recorded properly and reflected in the final metadata.

To summarize, each experiment “block” has to carry the following information: a mini-rationale for why it was run; the Structured protocol that was followed; the data it produced; the considerations made on its results; quality-control information, including laboratory controls and known confounding factors; and, where applicable, a flag marking it as potentially faulty.

To group these different pieces of information, an identifier can be used to mark all related files. FAIR containers such as Research Object Crates (RO-Crates) might be useful.
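
As a sketch, a minimal RO-Crate-style metadata file could group one experiment “block”, with the root dataset entity carrying the experiment identifier, rationale and considerations, and listing the data files it contains. The JSON-LD skeleton follows the RO-Crate 1.1 conventions; the specific identifiers, names and file paths are invented for this example.

```python
# Minimal RO-Crate-style metadata grouping one experiment "block".
# The overall JSON-LD skeleton follows RO-Crate 1.1; the identifiers,
# names and files below are illustrative placeholders.

import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "identifier": "EXP-2024-001",          # project-wide experiment identifier
            "name": "RNA yield comparison, buffer A vs buffer B",
            "description": "Rationale: ... Considerations on results: ...",
            "hasPart": [{"@id": "raw/yield_table.csv"}],
        },
        {
            "@id": "raw/yield_table.csv",
            "@type": "File",
            "name": "Yield measurements",
        },
    ],
}

with open("ro-crate-metadata.json", "w") as handle:
    json.dump(crate, handle, indent=2)
```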

Preservation and dissemination

After the (main) experimental phase is completed, it is important to collate all experiments that were not discarded (again, regardless of outcome) and all considerations made on them into a large document which I’ll call the interpretation. This document includes everything that was done during the experimental phase, and may or may not coincide with the laboratory notebook.

This document, once drafted, will be the starting point of both the research articles and the process by which data is ultimately deposited in long-term repositories.

The researchers leave the data with the data stewards and may begin writing the manuscript for the publication process.

The data steward then begins working on archival and long-term preservation of the data, including the determination of access policies and other FAIR-enabling infrastructural choices. According to the FIPs, data archival packages are created for the data that requires archival (the Discard problem can be re-evaluated here), and expiration dates (if needed) are placed on the data (see, for instance, Access-based lifetime determination). Data archival packages are collections of data, metadata, identifiers and generally all knowledge that is required for the data to be archived, including all of the information for its effective later reuse.
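
A bare-bones sketch of assembling such an archival package is shown below: every file gets a checksum, and the package manifest records the identifier and expiration policy decided with the data steward. The field names are illustrative, loosely inspired by BagIt-style packaging, and not a full implementation.

```python
# Bare-bones sketch of a data archival package manifest: each file gets a
# checksum, and the package records its identifier and expiration policy.
# Field names are illustrative, loosely inspired by BagIt-style packaging.

import hashlib
import json
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path, identifier: str, expires: date | None) -> dict:
    """List every file in the package with its checksum, plus package metadata."""
    return {
        "identifier": identifier,
        "created": date.today().isoformat(),
        "expires": expires.isoformat() if expires else None,
        "files": {
            str(p.relative_to(package_dir)): sha256_of(p)
            for p in sorted(package_dir.rglob("*"))
            if p.is_file()
        },
    }


# Example:
# manifest = build_manifest(Path("EXP-2024-001_package"), "doi:10.1234/example", None)
# Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```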

When archival packages are ready, they are deposited in the relevant repositories. It is useful to create a space to collate all identifiers of all the data created in a specific project, to preserve the identity of the project itself. A way to do this might be through Nanopublications (see later). While project identity is not necessarily useful for data reuse, it might be needed for quality evaluation and authorship rights in the later phases.

Nanopublications are minted from the archival packages, recording the salient metadata and forming a knowledge graph that may be used to access this data again later.
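
Conceptually, a nanopublication bundles a tiny assertion with its provenance and publication information as RDF named graphs. The sketch below models that structure as plain Python lists of (subject, predicate, object) triples; the URIs and values are placeholders, and a real nanopublication would be serialized as RDF with a head graph linking the three parts.

```python
# Conceptual sketch of a nanopublication minted from an archival package:
# an assertion graph, a provenance graph and a publication-info graph,
# modelled here as plain lists of triples. URIs and values are placeholders.

NP = "https://example.org/np/EXP-2024-001#"

nanopub = {
    "assertion": [
        (NP + "dataset", "http://purl.org/dc/terms/conformsTo", "FIP for bulk RNA-seq"),
        (NP + "dataset", "http://purl.org/dc/terms/identifier", "doi:10.1234/example"),
    ],
    "provenance": [
        (NP + "assertion", "http://www.w3.org/ns/prov#wasDerivedFrom",
         "archival package EXP-2024-001"),
    ],
    "publication_info": [
        (NP, "http://purl.org/dc/terms/creator", "https://orcid.org/0000-0000-0000-0000"),
        (NP, "http://purl.org/dc/terms/created", "2024-01-01"),
    ],
}
```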

If software was used to analyse the data, or the project consisted of the development of novel software, the software has to be statically deposited in an archive.

The interpretation document can also be used (with the help of the original researcher) to derive basic assertions that can become nanopublications, enabling researchers to distill their findings and increasing their own and the community’s awareness of all the outcomes of their research, whether expected or otherwise. These nanopublications can then enlarge the global body of knowledge for further (automated) examination.

After all data is processed, checked for consistency and archived, and all relevant nanopublications are created and interlinked with the archived data, the project may be considered to be finished. All local data may be deleted, and project spaces closed.

This marks the end of the project. However, post-project data support may still be needed if third parties require access to the data, for example in the case of controlled-access data. This can be handled by either the data steward or the researchers themselves.

It is important to note that the publication procedure does not influence whether or not the resulting data and assertions are deposited. The data is deposited regardless of whether a publication is attempted (as for a “regular” project) or not (as for, e.g., master’s theses), and regardless of whether the manuscript is accepted. This ensures that no insight is lost, whatever the (potentially fickle) publication procedures decide.

Green open access, when needed, has to be enforced by the institutions, arguably as an ethical responsibility for publicly funded research.

Self-evaluation

To keep this process aligned with the needs of researchers, post-project meetings have to be held to establish what worked, what did not, what can be done better, and any other critical issues. This self-audit has to be accessible so that aggregate statistics can be computed, so standardized instruments (such as feedback forms) and other techniques should be used.

Periodically, these guidelines and policies can be changed following researcher feedback. In particular, FIPs can be re-evaluated for relevance and fitness for purpose.

Roles and responsibilities

In this piece, I talk at length about Data Stewards but never define their role in detail. I hope that the reader has picked up the outline of the Data Steward’s role from my conceptualization of the research process. In any case, I explicitly explain it here. As sources, I use the work of the FAIRsFAIR competence framework for data stewards (my notes can be found here), and the Data Stewards: Minimum Viable Skills Profile report from the Skills4EOSC consortium (available on Zenodo in short form here and in long form here).

A Data Steward (DS for short) is a professional figure with expertise in the handling of data, with a multifaceted skillset. A Data Steward’s primary mission is to preserve data for later reuse, be it for the benefit of an organization, a sovereign state or the whole community.

First, as modern data is overwhelmingly digitized, a DS has technical knowledge of the strategies to harvest, store, manipulate and preserve digital data. This includes the creation of data-processing pipelines (e.g. for data ingestion), the design and administration (but potentially not the maintenance) of data archives and repositories, and the day-to-day handling of edge cases encountered during data management. Concrete skills include programming, knowledge of data formats and computer systems, concepts of data and systems security, and more. While a DS’s day-to-day work may not involve significant data wrangling or other data handling per se, DSs must understand what these processes involve, as they both define the boundary of what is possible to do with data and how data is ultimately used by data (re)users.

Additionally, as software can be, in a way, seen as data that can be executed, knowledge in the preservation of software for later execution is important.

Secondly, to unlock data for reuse, the fine conceptualization of what the data describes and the epistemic need it tries to fulfill must be understood, made explicit, and preserved. For this reason, a Data Steward has factual knowledge and expertise in specific application domains (read: Arts, Biology, Earth Sciences, Medicine, and so on), as well as familiarity with the ways in which the reusable data they strive to create will be used, in order to meet the demands of their fields and organizations. Additionally, as ontologies and supra-ontological frameworks are routinely used to address these concerns, knowledge of the creation and usage of these systems is fundamental for any Data Steward.

Thirdly, soft skills such as communication, mediation, leadership, the ability to work in and coordinate large teams, and engagement, as well as hard skills such as data privacy, ethics, and the ability to draft and put into practice policies regarding data, are extremely useful for Data Stewards. Indeed, there is a significant need for professionals who have both the domain and the technical knowledge to guide the whole data lifecycle, with its multifaceted aspects (see here and here for more on the data lifecycle). Given the centrality of this role, soft communication skills, as well as robust core abilities and the expertise to deal with the many issues which arise when several varied stakeholders collide (researchers/data creators, data (re)users, organizations, laws, legal entities, states and the whole community), are of enormous importance.

Finally, and more relevant for those Data Stewards in academia and generally working for the benefit of the community, Data Stewards are trained in the principles, philosophy and methods of Open Science. In this context, Open Science is understood as the practice that aims to render public research more useful for the public, in particular by being more trustworthy, accessible, and aligned with human and societal needs. According to this framework, Data Stewards are both enablers and teachers of Open Science, in many different contexts. For this point, the raw knowledge of the history, practice and philosophy of Open Science is fundamental, as are other soft skills such as community engagement, lecturing and public outreach.

FAIR Implementation Profiles

As the keen reader will have undoubtedly noticed, I’ve delegated many important and hard tasks to the FAIR Implementation Profiles (FIPs). Here I attempt to delineate these issues, providing my idea of what FIPs are and some of the characteristics that they require in the context of Open Research. While tackling these issues is deferred to a later date and remains open to debate (it is an open question in the text above), I hope here to bound the conversation within more defined areas.

As a summary and basis for this section, I list here all the aspects of Open Research that depend on the FIPs:

Potentially more aspects should be covered by the FIPs, but they are as yet unclear to me. In any case, the burden placed on the FIPs is enormous. However, this key role makes FIPs ideal candidates as facilitators of data handling. One can envision executable FIPs, where an infrastructure guides the primary data generators through the process of creating, storing, preprocessing, annotating with metadata, and ultimately depositing the data.

Of course, the task of creating such a FIP for all the potential data types is enormous, and potentially infinite. However, such strategies can be reused over time, slowly eroding the problem of non-reusable data.
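
To make the idea of an executable FIP slightly more tangible, the sketch below walks the declarations of a FIP (structured as in the earlier planning example) and checks a candidate deposit against them, producing a to-do list for the data generator. The check rules, FIP fields and deposit fields are assumptions made purely for illustration.

```python
# Sketch of an "executable" FIP: infrastructure compares a candidate deposit
# against the declarations of a FIP and returns a to-do list. The check
# rules and field names are illustrative assumptions.

def check_deposit(fip: dict, deposit: dict) -> list[str]:
    """Return human-readable issues found when comparing a deposit to a FIP."""
    issues = []
    if not deposit.get("identifier"):
        issues.append("No persistent identifier assigned (see F1 declaration).")
    declared_licenses = fip["declarations"].get("R1.1 (license)", [])
    if deposit.get("license") not in declared_licenses:
        issues.append(f"License should be one of {declared_licenses}.")
    if not deposit.get("metadata_schema"):
        issues.append("No metadata schema declared (see F2 declaration).")
    return issues


fip_example = {"declarations": {"R1.1 (license)": ["CC-BY-4.0"]}}
deposit_candidate = {"identifier": None, "license": "CC-BY-4.0", "metadata_schema": None}
for issue in check_deposit(fip_example, deposit_candidate):
    print("TODO:", issue)
```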

FIPs are not to be intended as prescriptive. In no way does the presence (or more importantly the absence) of a FIP have any influence on the freedom of experimentation that every researcher has. It is not the job of FIPs - or Data Stewards for that matter - to legislate what can and cannot be done with data. Their only role - and I cannot stress this enough - is to support the proper handling of data for the benefit of the researchers themselves and crucially for future data (re)users.

Benefits of Open Research

This research framework has several inherent benefits. I detail some of them here.

For Researchers

For the community

For institutions and funding agencies

For society
