by Willem Elbers
The REPLIX project is studying and implementing the next level in grid-based replication and synchronization at a logical level by using iRODS. REPLIX is a joint project between DEISA, represented by Rechenzentrum Garching, and CLARIN and DOBES, both represented by the MPI for Psycholinguistics.
The two main goals are data preservation and authenticity control:
1) Data preservation means guaranteeing that future generations can access and use the data we are archiving now. This includes managing different copies of the data and associated metadata at different physical locations, which is called replication. Metadata in this context includes system metadata (such as file size and creation date), complex user metadata (anything defined by the user, including user-defined relations) and access restrictions (which user has access to which files and operations).
2) Authenticity control means making sure the information remains authentic: not only the data files, but also the metadata associated with them. Since the data and metadata are replicated, the authenticity of each copy needs to be controlled. Moreover, access to files is also part of authenticity control: only authorized editors should be able to edit the information and associated metadata.
The current infrastructure takes care of replication at a physical level, using tools like rsync and the Andrew File System (AFS). At the moment this is similar to copying files from one location to another. For future use, this approach is too limited, since replication causes source collections to be placed in different contexts, which cannot be properly handled by AFS or rsync. To ensure consistency of the collections, a new approach is needed.
To identify the archived objects in a unique way, MPI uses the Handle System. The Handle System creates persistent identifiers (PIDs) and associates them with file properties (such as a reliable checksum). The use of PIDs ensures that archived objects can be identified, now and in the future, by a single identifier.
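The link between a PID and a checksum can be illustrated with a minimal sketch. It does not use the Handle System API; the `PidRecord` type, the `is_authentic` helper, the example PID value and the choice of SHA-256 are all assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PidRecord:
    """Hypothetical record: what a PID resolves to in this sketch."""
    pid: str        # persistent identifier (made-up value below)
    checksum: str   # SHA-256 hex digest stored when the object was archived


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_authentic(record: PidRecord, data: bytes) -> bool:
    """True if the bytes match the checksum registered under the PID."""
    return sha256_hex(data) == record.checksum


# Register an object once, then check two copies against the record.
original = b"field recording, session 1"
record = PidRecord(pid="hdl:0000/example-1", checksum=sha256_hex(original))

print(is_authentic(record, original))          # an unchanged copy
print(is_authentic(record, original + b"x"))   # a copy that was altered
```

Because the checksum is stored with the identifier rather than with any one copy, every replica can be checked against the same record regardless of where it is stored.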
The current infrastructure consists of one central archive, located at the MPI in Nijmegen. The central archive is replicated (at the file level) to two large data centers, each managing two copies of the archive. Around the world, several satellite archives exist. Researchers use these satellite archives to ingest the information they collect. The first step in the current preservation process ensures proper ingestion into the central archive. The second step ensures proper ingestion into the two remote archives. This is shown in Illustration 1.
REPLIX is researching possibilities to overcome the limitations of existing replication and authenticity control methods. More specifically, REPLIX is researching how iRODS can be used to create a solution in which information is ingested into the archive and replicated while ensuring the integrity and authenticity of the data and metadata. The solution should also take care of synchronizing the central archive to the backup data centers. This step also has to verify the integrity and authenticity of the data.
Although it is tested on the MPI infrastructure, the approach should be easy to generalize, so that any community can deposit information into a central zone and take advantage of the preservation facilities. Before achieving this, we will first explore iRODS in general. Then we will create a setup to synchronize the central archive to one of the two data centers. The next step is to include both data centers, and finally the satellite centers have to be included.
iRODS is a storage grid which uses rules to enforce policies on the actions performed on the data inside the storage grid, or to execute policies at regular intervals. One of the policies could be replication inside the storage grid: as soon as a file is ingested into the storage grid, it is automatically replicated onto several storage resources (hard discs, tapes, …). Another policy could make sure the file remains authentic, by checking the file hashes and repairing any damaged replica(s). It is also possible to create a connection between two or more storage grids. Each storage grid manages its own data collection, and policies can be created to synchronize information between the storage grids.
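The two example policies, replication on ingest and periodic hash checking with repair, can be sketched in plain Python. This is not iRODS rule code and does not use any iRODS API; the `ToyGrid` class, its method names, the resource names and the use of SHA-256 are assumptions chosen only to show the logic.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


class ToyGrid:
    """In-memory stand-in for a storage grid; not the iRODS API."""

    def __init__(self, resource_names):
        # resource name -> {object path -> bytes}
        self.resources = {name: {} for name in resource_names}
        # object path -> checksum recorded at ingest
        self.catalog = {}

    def ingest(self, path, data):
        """Ingest policy: record the checksum, then replicate everywhere."""
        self.catalog[path] = digest(data)
        for store in self.resources.values():
            store[path] = data

    def verify_and_repair(self):
        """Periodic policy: re-hash each replica, restore damaged ones."""
        repaired = []
        for path, expected in self.catalog.items():
            # Collect the replicas that still match the recorded checksum.
            good = [store[path] for store in self.resources.values()
                    if path in store and digest(store[path]) == expected]
            if not good:
                continue  # no intact replica left; nothing to repair from
            for name, store in self.resources.items():
                if path not in store or digest(store[path]) != expected:
                    store[path] = good[0]
                    repaired.append((name, path))
        return repaired


grid = ToyGrid(["disk", "tape"])
grid.ingest("/archive/a.wav", b"audio bytes")
grid.resources["tape"]["/archive/a.wav"] = b"corrupted"   # simulate damage
print(grid.verify_and_repair())
```

In this sketch a damaged replica is restored from any replica that still matches the checksum recorded at ingest. If no intact replica remains, the object is left untouched rather than overwritten with unverified data.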
Ideally, the iRODS policies should use the PID system to identify the information in all storage grids. Based on these identifiers and the information associated with them, such as a checksum, the policies then perform synchronization between multiple (n) storage grids. This synchronization process will verify the integrity and authenticity of the synchronized data. This is shown in Illustration 2.
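A minimal sketch of such PID-based synchronization is given below. It assumes that each storage grid can report a mapping from PID to checksum and that the central archive is the authoritative copy; the `plan_sync` function, the grid names and the PID and checksum values are hypothetical.

```python
def plan_sync(reference, others):
    """Compare each grid's PID -> checksum map against the reference grid.

    Returns {grid name: {"copy": [...], "replace": [...]}}, where "copy"
    lists PIDs the grid lacks and "replace" lists PIDs whose checksum
    differs from the reference.
    """
    plan = {}
    for name, holdings in others.items():
        missing = sorted(pid for pid in reference if pid not in holdings)
        stale = sorted(pid for pid in reference
                       if pid in holdings and holdings[pid] != reference[pid])
        plan[name] = {"copy": missing, "replace": stale}
    return plan


# Made-up holdings: the central archive and two remote data centers.
central = {"pid-1": "aa11", "pid-2": "bb22", "pid-3": "cc33"}
data_centers = {
    "center_a": {"pid-1": "aa11", "pid-2": "bb22", "pid-3": "cc33"},
    "center_b": {"pid-1": "aa11", "pid-2": "ffff"},
}

plan = plan_sync(central, data_centers)
for name, actions in plan.items():
    print(name, actions)
```

Comparing checksums per PID, rather than file names or paths, lets the same object be recognized in every grid even when the collections are organized differently at each location.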