
Background

In the fourth research paradigm, research data has transitioned from being solely a product of scientific research activities to serving as the foundation for such activities, owing to its explosive growth, aggregation, reuse and linkage on an unprecedented scale. Numerous disciplines, such as life science, geoscience and material science, have demonstrated significant features of data-intensive knowledge discovery (Borgman & Brand, 2022; Gettelman et al., 2022; Li et al., 2021; Li et al., 2023; Scheffler et al., 2022; Yuille, 2011). Research data have become a fundamental strategic resource in the new paradigm, which emphasizes openness, integration, intelligence and security (Wan & Gu, 2022; Yang et al., 2023), and they serve as one of the driving forces behind research and innovation.

Research data infrastructures (RDIs) form the cornerstone, in both cyber and physical spaces, for the development of data-intensive scientific research (Stewart et al., 2023). Typically, an RDI consists of data resources, the technologies used across the whole life cycle of those resources, and the associated guidance, standards, policies and rules (NASEM, 2023). In specific cases, these infrastructures also encompass individuals, capacity, and expertise.

In this context, RDIs affect the mechanisms of science. They continuously search for, collect, organize and accumulate research data and knowledge (Orhean et al., 2022), including not only the relatively homogeneous data generated by “big science” using large-scale research infrastructures (RIs) or giant instruments, but also the heterogeneous data scattered among researchers and research teams (Donaldson & Koepke, 2022; Sales & Sayao, 2019). RDIs play an increasingly important role in optimizing the flow of research data across disciplines, reducing wasteful duplication of resources and work, and improving the means and connectivity of scientific research collaboration (Guo et al., 2023). Meanwhile, the influx of research data necessitates a corresponding increase in appropriate computing resources, tools, and data experts (Kanza et al., 2022). Scientific activities have always been based on data and always will be; however, in the era of the fourth research paradigm, data-centered research entails a systematic shift in the focus of research from concepts to methods (Edgar et al., 2022; Kalhor et al., 2023; Marshall et al., 2023). This drives the development of an integrated RDI ecosystem and an international cooperation framework.

The trend of shaping research data infrastructure within the research infrastructure framework

The publication of a roadmap is the primary method by which countries plan and construct their national-level RIs, and it reflects their strategic path for the future. A survey of the digital infrastructures and RDIs that have been planned, built and operated since 2000 (Figure 1) shows that RDIs have emerged as a strategic topic of interest and a priority for decision-makers and political bodies in both developed and developing countries. These countries aim to enhance their scientific and technological competitiveness in the new paradigm, including in their responses to major challenges such as the COVID-19 pandemic, and to align with the promise of social progress (Aarden, 2023; Song & Ran, 2022).

Figure 1.

Landscape of global RDIs (compiled from the RI roadmaps of the USA, EU, France, Germany, UK and China, with scheduled running time in parentheses. Many high-performance computers are carriers of research data analytics and processing technologies, so they are treated and listed as a subcategory).

In recent years, several countries have reoriented their national RI frameworks to facilitate the effective establishment and operation of RDIs. Europe and the USA offer two reference paths (Figure 2).

Figure 2.

Two paths for reshaping RDIs in the RI ecosystem (adapted from ESFRI (2016) and OSTP (2021)).

In the European path, scientific e-infrastructure (data, computing and network infrastructures) is an independent domain, and the strategic position of data infrastructure within it has been increasingly emphasized. To support a strategic approach to policy-making on RIs, the EU established a specialized body, the European Strategy Forum on Research Infrastructures (ESFRI), in 2002. It is responsible for creating the Roadmap of European Research Infrastructures and updating it every two to five years. In all editions of the ESFRI Roadmap, coherent infrastructures, resources and services have covered the main scientific and technological domains. Table 1 shows the evolution of the categories of RDIs. RDIs have now become a central part of the RI ecosystem and its priorities, enabling researchers and other stakeholders from research, education, society and business to use, reuse and exploit data for the benefit of science and society (Aarden, 2023; Candela et al., 2015).

Table 1.

The evolution of the categories of RDIs.

Version: Description of the categories of RDIs and their predecessors

2006 (ESFRI 2006)
● Seven domains or topics were organized in the first edition in 2006. One of them was Computation and Data Treatment, which mainly focused on high-performance and high-throughput computing, grid architectures, software and middleware for performing computing, and data management and curation techniques for handling vast masses of data.

2008 (ESFRI 2008)
● A European high-performance computing service, later called PRACE, was established to form the partnership and network for advanced computing in Europe, aiming to use top-level computing machines to fulfill the requirements of different scientific domains.
● Emphasized the efficient use of e-infrastructure to deal with the production and use of unprecedented quantities of research data, not only those coming directly from facilities, but also those contained in scientific publications.

2016 (ESFRI 2016)
● Presented an overall picture of the European RI Ecosystem.
● In the landscape analysis, the e-infrastructure was transversal to all domains, in order to develop convergent and sustainable services.

2018 (ESFRI 2018)
● Enhanced the concept of the European RI Ecosystem and the role of e-infrastructure.
● The importance of data infrastructures for solving cutting-edge, interdisciplinary and complex scientific and social problems was widely acknowledged and accepted by policy-makers, researchers, funders, industry and society.
● Renamed the e-infrastructure domain to Data, Computing and Digital Research Infrastructures (DIGIT), highlighting the more general and explicit mandate of data.
● Ensured convergence of strategies and implementation actions with the European Open Science Cloud (EOSC).

2021 (ESFRI 2021)
● Continued to deepen the ecological concept of the European RI Ecosystem.
● Announced 11 new projects filling gaps in European RI capacities. Eight of them were data infrastructures (horizontal) and data-driven facilities (vertical), involving computing and data science, brain science, population, ecology and environment, energy and social science.

Recently, the USA redefined the elements of RIs to accommodate the evolving landscape of rich, integrated, interconnected research data workflows and discovery pathways. In the United States, RIs, also known as large-scale facilities or user facilities, have received their primary funding from the Department of Energy (DOE) and the National Science Foundation (NSF) through Major Research Equipment and Facilities Construction Funding (MREFC) and Major Facilities funding (Matthews, 2012). Additionally, several specific infrastructures are operated by the National Institutes of Health (NIH) and the National Aeronautics and Space Administration (NASA). These infrastructures have been categorized by discipline, such as basic energy science, high-energy physics, nuclear physics, fusion science, advanced scientific computing, biological science, geosciences, mathematical and physical sciences, polar science, and more. They are managed by the corresponding governing sectors, which establishes disciplines and scientific domains as the fundamental framework of RIs in the USA and shapes the current RI landscape.

With the transition of the research paradigm, a national strategic overview document released by the National Science and Technology Council (NSTC) in 2021 (OSTP, 2021) pointed out that RI should consist of three elements (Table 2). Compared with the European path, the RI framework of the USA constructs a new data-centric vision across all disciplines, illustrating the effort to strengthen the construction of RDIs and promote the transformation toward a data-intensive research paradigm.

Table 2.

Three elements of the USA’s new RI framework.

Elements: Description and the position in the dataflow

Experimental and observational infrastructure
● Closer to the traditional concept of physical or “hard” facilities, such as accelerators and telescopes, and provided the tools and platforms for scientists.
● From the perspective of dataflow, played the role of observing, generating and collecting research data with extreme experimental approaches and conditions.

Knowledge infrastructures
● Included research data assets and resources, such as scientific collections, reference libraries, data repositories and intellectual property; the standards, protocols, and services enabling data management and remote access; analytic and computational algorithms; and even human capital infrastructure.
● Covered the aggregation, storage, management, sharing and provision procedures in the research dataflow, as well as the mechanisms to guarantee them. It should be noted that these infrastructures maintain historical data, information, references and knowledge that may be demanded in the future.

Research cyberinfrastructure
● Offered an interconnected ecosystem, such as co-designed advanced computing resources, data and software services, and research and education networks.
● Scientific results and outputs from research were shared, analyzed, distributed in near real time and accessed across global physical or virtual networks through them, with efficiency and trustworthiness ensured.

Four new missions for research data infrastructures

As a pioneer, to transcend disciplinary borders and address complex, cutting-edge scientific and social challenges with problem- and data-oriented insights

In previous decades, national scientific data centers distributed across scientific domains were a prevalent pattern in most countries. Since the 1990s, the US, UK and Germany have constructed and enhanced their national science data center systems by relying on national institutions (Wang & Wang, 2022). These centers served as the pipelines and key nodes for aggregating research data from national laboratories, research institutions, universities, research communities, researchers and other innovation bodies across various disciplines. In fulfilling this duty, national scientific data centers ultimately achieved vertical connectivity of data within a given domain.

However, data from a single discipline is inadequate for responding to complex scientific problems. Scientists often require diverse data, different analyses and an integrated research data platform spanning various disciplines to enhance their comprehensive understanding of natural phenomena and to address the increasing complexity of socio-economic development challenges. The global RI landscape contains an increasing number of problem-oriented RDIs, rather than research domain-oriented ones (Figure 3). In line with the trend of interdisciplinarity and the evolution of the fourth research paradigm, they are pioneering breakthroughs across the boundaries of traditional research fields, while also overcoming existing geographical and organizational information barriers.

Figure 3.

Pattern of RDIs (discipline-based vs problem-oriented).

These research data or data-driven infrastructures coordinate integrated data and knowledge across various domains, including natural science and social science. This integration is progressing gradually. For example, PANdata, the Photon and Neutron data infrastructure initiative, brought together 13 major European analytical RIs in 2011 to establish a fully integrated, seamless, pan-European information infrastructure supporting the scientific process (Bicarregui et al., 2015). This initiative developed a common data format and common data management, offering a unified framework for the data generated by synchrotron radiation light sources, free electron laser facilities, laser light sources, and spallation neutron sources. As a result, researchers could take advantage of experimental data from more than one analytical infrastructure without replication, in order to create a more detailed, broader-spectrum portrayal of the material world and to break down the barriers between experimental means. The Research Infrastructure for EnvIRonmental Exposure assessmeNt in Europe (EIRENE RI) was included in the ESFRI Roadmap 2021 (ESFRI, 2021). This distributed data-driven infrastructure aims to support large-scale, wide-ranging research into the interdisciplinary assessment of environmental determinants of health. It will harmonize integrated services for data, knowledge and tools across a cross-cutting area of environmental and analytical chemistry, biology and toxicology, epidemiology, environmental and human exposure and risk assessment, pharmacokinetics, and geospatial modelling. However, it mainly spans domains of natural science. The Intergovernmental Panel on Climate Change (IPCC) provides an example of cross-cutting between natural science and social science. The objective of the IPCC is to provide governments with scientific information that they can use to develop climate policies (Dong & Zhang, 2014). Its Data Distribution Centre offers climate, environmental and socio-economic data, both from the past and in scenarios projected into the future, and has demonstrated good practice in moving from research data to information, knowledge, decisions and finally policy at the international level (Stockhause & Lautenschlager, 2022).

As an architect, to establish a digital, intelligent, flexible research and knowledge services environment

With the evolution of the fourth paradigm, two notable branches have gained increasing significance: X-info, for the perception, collection and analysis of data and information, and Comp-X, for computing and simulation (Zhang et al., 2019). Research data has become a new type of innovation capital. Some scientific activities do not necessarily rely on physical experimental or observational facilities or instruments from the outset, but rather on the accumulation of data collected for other purposes, a practice usually called data reuse. As a one-stop, unified, comprehensive digital research environment, RDIs will serve as the primary interface for direct interaction with researchers. The networked and flexible nature of science leads to a significant transformation of the scientific service mode.

The role of X-info has been demonstrated in the evolution of traditional RIs. The responses of analytical RIs during and after the COVID-19 era provide clues about the transformation of the digital and knowledge service environment of infrastructures. Travel restrictions during the COVID-19 era necessitated more standardized and digitalized procedures for remote user access across infrastructures, as well as enhanced interaction and collaboration between users and technical experts, and a comprehensive overview of information and knowledge to facilitate digital sample handling (LEAPS, 2021). These upgrades enabled the flexible transmission of scientific data to researchers without requiring on-site operation, and established a permanent capability.

Thanks to intelligent computing, RDIs with a scaling effect have become a goldmine for scientific discoveries. In recent years, AI scientists and robot scientists have emerged in various disciplines, which pushes data and knowledge services to become intelligent (Sun & Han, 2021). ChatGPT and other large language models (LLMs) have shaped AI research assistants that may be capable of tasks ranging from fine-tuning language and extracting references and metadata from papers to helping evaluate the trustworthiness of findings by aggregating the heuristics of many experts (Cramer, 2021; Davies et al., 2021; OECD, 2023; Zhao et al., 2023). This trend has occurred not only in disciplines with a high level of digitization, good data accumulation and relatively well-defined problems, but also in the search for solutions to complex problems. The European Integrated Infrastructure for Social Mining and Big Data Analytics (SoBigData++) has created a platform for designing and implementing large-scale social science data experiments, serving as a digital academic laboratory for understanding complex, globally interconnected societies.

With AI scientists (Boiko et al., 2023; Koscher et al., 2023; Merchant et al., 2023; Romera-Paredes et al., 2023) working in digital, autonomous labs (Szymanski et al., 2023), RDIs will serve as the interface for providing efficient, flexible and intelligent knowledge services in physical or virtual digital academic laboratories.

As a platform, to foster high-end academic communication

The high-end characteristics of RDIs as a communication platform are primarily evident in the efficient communication pattern, the aggregation of elite talents, and the delivery of high-quality content within academic discourse (Figure 5).

Figure 4.

The interface of a one-stop panoramic digital research environment.

Figure 5.

The role of RDI as a high-end communication platform.

Firstly, problem-oriented RDIs offer a highly efficient communication pattern, fostering the unrestricted dissemination of knowledge for the benefit of researchers, scholars, students and, more broadly, society worldwide during the implementation of mission-driven research endeavors. In Europe, Open scholarly communication in the European Research Area for Social Sciences and Humanities (OPERAS) constitutes a distributed RI that aims to promote Open Science by establishing a scholarly communication system with specific forms (monographs, critical editions, and edited bibliographies, amongst others). Meanwhile, communication not only crosses geographical and disciplinary boundaries but also transcends time. Scientific big data, whether historical or generated in real time, can be transformed into robust information and knowledge through simulation, modelling and analysis, bridging the gap between past, present and future. LLMs currently serve as a new medium, reshaping how researchers interact with knowledge itself and making that interaction more efficient, in order to resolve the pronounced conflict between the ever-increasing volume of knowledge and the limited capacity of the human brain (Zhang et al., 2023).

Secondly, RDIs have formed scientific communities that foster the convergence of high-caliber talent and wisdom. They serve as pivotal avenues for generating more opportunities and for the flow of knowledge, fostering in-depth collaborative efforts on a global scale.

Thirdly, RDIs will provide well-organized, high-quality, credible research and communication content, based on knowledge-based data curation (Sakai et al., 2019). On the one hand, open access journals, open data, data journals, traditional academic journals, academic conferences and other knowledge repositories will be organized in a manner that is more readily understandable and usable by AI for discovery, integration and synthesis (Carter et al., 2023), and harmonized with the whole life cycle of physical RIs (Bunakov & Matthews, 2013). On the other hand, RDIs focus not only on polished outputs, such as academic articles and report series, but also on the whole process of experiments, including the automated collection and aggregation, standardization and traceability of scientific research data with quality control. They will be remarkable and credible infrastructures in the data-driven research paradigm.

As a coordinator, to balance scientific openness with ethical needs

Openness of research data is a prevalent requirement for innovation activities in the fourth research paradigm. However, conflicts exist between openness and ethics because of the various rights and demands of stakeholders, such as privacy and intellectual property (Li et al., 2022). Legal data ingestion, integration, retrieval and replication, as well as transparency, fairness and citation as academic norms, have been hotly debated and widely considered as themes in research data curation (Elliott et al., 2023; Kraemer et al., 2021; Li & Shen, 2024; Tong et al., 2022). As central nodes in cyberspace, RDIs should play a crucial role in conducting curation and governance effectively, while balancing the needs of scientific openness and ethics. As research data converges and is reused on a larger scale and in more deeply connected contexts, issues of safety and ethics will manifest in more diverse forms. RDIs will shoulder more responsibility for balancing openness and ethics, as well as for addressing security considerations. To do so, they will capitalize fully on data resources, technical tools, standards and protocols, and working systems.

In recent years, new concepts and frameworks have been proposed and implemented in governance practice. In healthcare, the Personal Health Train, in which questions and algorithms are submitted to data-centered infrastructures and networks (Beyan et al., 2020), is considered an alternative to directly sharing the underlying data. This approach is intended to address the privacy risks associated with distributed analytics scenarios (van Soest et al., 2018; Welten et al., 2022).
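The idea of sending the algorithm to the data, rather than the data to the analyst, can be illustrated with a minimal Python sketch. It is not the Personal Health Train implementation or its API: the `Station` class, the `count_and_sum_age` function and the sample records are hypothetical names and values introduced here only for illustration, and a real deployment would add authentication, vetting of incoming algorithms, and disclosure controls on small counts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not part of any real framework.
Record = Dict[str, float]
Aggregate = Dict[str, float]


@dataclass
class Station:
    """A data holder that executes visiting algorithms on its own records."""
    name: str
    _records: List[Record]  # row-level data never leaves the station

    def run(self, algorithm: Callable[[List[Record]], Aggregate]) -> Aggregate:
        # Only the aggregate returned by the algorithm is sent back.
        return algorithm(self._records)


def count_and_sum_age(records: List[Record]) -> Aggregate:
    """The visiting algorithm: returns local summary statistics only."""
    return {"n": float(len(records)), "sum_age": sum(r["age"] for r in records)}


def pooled_mean_age(stations: List[Station]) -> float:
    """Combine per-station aggregates into a global mean (assumes n > 0)."""
    results = [s.run(count_and_sum_age) for s in stations]
    total_n = sum(r["n"] for r in results)
    total_age = sum(r["sum_age"] for r in results)
    return total_age / total_n


# Made-up example data held by two hypothetical stations.
stations = [
    Station("hospital_a", [{"age": 40.0}, {"age": 60.0}]),
    Station("hospital_b", [{"age": 50.0}]),
]
print(pooled_mean_age(stations))  # prints 50.0
```

In this sketch the coordinator sees only a count and a sum from each station, which is enough to compute the pooled mean without any individual record being shared.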

Moreover, advanced information technologies such as blockchain, secure multiparty computation, and trusted execution environments are gaining traction in addressing research data ethics and safety issues, ensuring the integrity of each step of data workflows in RDIs. Oak Ridge National Laboratory developed a secure computation platform that allocates research data generated by infrastructures or instruments to provide a safe collaboration environment and capabilities for training a differential privacy enabled deep learning network on the supercomputer Summit, both within and outside of the laboratory (Yoginath et al., 2021). China's National Marine Data Center launched a blockchain platform for data transmission in the Science and Technology Program in 2022. A model for data management and certification was constructed, and smart contracts were designed for tasks such as automated control of the remittance process and data certification, so as to ensure the authenticity, consistency, completeness, and traceability of the data files and documents.

Conclusion

As the strategic importance of research data as a foundational asset becomes increasingly prominent, RDIs have emerged as the material and technological foundation of the intelligent, data-driven research paradigm, playing a crucial role in determining innovation capabilities. Many countries are accelerating the deployment of high-performance and efficiently operated RDIs, shaping innovative policy frameworks and fostering innovation that aligns with their service capabilities and application scenarios. In the future, RDIs will enhance their capabilities in efficient aggregation and analysis, and will further leverage their role in supporting research on comprehensive complex problems, providing advanced intelligent research and knowledge environments, facilitating high-level academic communication, and upholding research data safety and ethics, thereby serving as a driving force in accelerating the research paradigm shift.

eISSN:
2543-683X
Language:
English
Publication frequency:
4 times per year
Journal subjects:
Computer Sciences, Information Technology, Project Management, Databases and Data Mining