The Data Lake Task Force

The Data Lake Task Force (TFDL) collects, discusses and harmonises the needs of the communities and institutions represented in ICDI, and the architectural solutions they have adopted, in the field of data infrastructures supporting scientific research.

The aim is to define a relatively high-level design of a 'data lake': an ecosystem of existing and future data infrastructures that on the one hand maintain their specificity and functionality, and on the other hand allow a minimum set of inter-domain activities. In our understanding, the data lake is a system (a "thin middleware layer") that supports basic interoperability between existing, heterogeneous systems, enabling, for instance, multidisciplinary research activities on originally unrelated databases.

The data lake that the Task Force is going to propose will be able to:

  1. allow the creation of a connected data ecosystem, promoting and ensuring interoperability among the different existing data infrastructures at the national level;
  2. define a minimal, modular and extensible data and metadata interface that is able to abstract from the specificities of each application domain;
  3. preserve existing functionalities and thus the operability of legacy systems;
  4. guarantee federation aspects, taking into account the Open Science and FAIR principles.
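
As an illustration of points 2 and 3 above, the following is a minimal sketch of what a "thin middleware layer" could look like. It is not the Task Force's design: every name in it (CoreMetadata, DataLakeAdapter, the two legacy systems, federated_search, the example.org URLs) is a hypothetical assumption made for this example. The sketch assumes a small set of domain-independent metadata fields plus an open "extensions" slot for domain-specific ones, and it wraps each legacy system in an adapter so that the legacy system itself is left unchanged.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class CoreMetadata:
    """Hypothetical minimal, domain-independent metadata record."""
    identifier: str
    title: str
    domain: str
    access_url: str
    # Domain-specific fields travel here, so the core stays minimal.
    extensions: Dict[str, Any] = field(default_factory=dict)


class DataLakeAdapter(Protocol):
    """Hypothetical interface each participating infrastructure would expose."""
    def list_records(self) -> Iterable[CoreMetadata]: ...


class LegacyAstroArchive:
    """Stand-in for an existing system with its own native schema."""
    def query(self) -> List[Dict[str, str]]:
        return [{"obs_id": "A1", "target": "M31",
                 "url": "https://example.org/astro/A1", "band": "optical"}]


class LegacyClimateStore:
    """Stand-in for a second, unrelated existing system."""
    def fetch_all(self) -> List[tuple]:
        return [("C7", "Alpine temperature series",
                 "https://example.org/climate/C7", "1961-1990")]


class AstroAdapter:
    """Maps the astronomy archive's rows onto the shared record."""
    def __init__(self, archive: LegacyAstroArchive) -> None:
        self._archive = archive

    def list_records(self) -> Iterable[CoreMetadata]:
        for row in self._archive.query():
            yield CoreMetadata(f"astro:{row['obs_id']}", row["target"],
                               "astronomy", row["url"],
                               {"band": row["band"]})


class ClimateAdapter:
    """Maps the climate store's tuples onto the shared record."""
    def __init__(self, store: LegacyClimateStore) -> None:
        self._store = store

    def list_records(self) -> Iterable[CoreMetadata]:
        for rec_id, name, url, period in self._store.fetch_all():
            yield CoreMetadata(f"climate:{rec_id}", name,
                               "climate", url, {"period": period})


def federated_search(adapters: Iterable[DataLakeAdapter],
                     keyword: str) -> List[CoreMetadata]:
    """Case-insensitive title search across all participating systems."""
    needle = keyword.lower()
    return [rec for adapter in adapters
            for rec in adapter.list_records()
            if needle in rec.title.lower()]
```

Under these assumptions, a cross-domain query needs only the shared record: `federated_search([AstroAdapter(LegacyAstroArchive()), ClimateAdapter(LegacyClimateStore())], "m31")` returns the single astronomy record, while neither legacy class had to be modified to take part.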

The participation of a data infrastructure in the data lake will depend on its adoption of the minimum requirements to be suggested by the Task Force.

The Task Force is committed to defining an architectural design of a federated infrastructure. To this end it analyses, through an online survey, the requirements of the communities involved, the solutions adopted by existing infrastructures, and the solutions available in the open source market. The goal is to propose a high-level architecture that can be promoted by ICDI and possibly implemented in the framework of national and European calls.