
     WP1 RI nominated resources 6: PDB-REDO Cloud: FAIR protein structures with deep versioning for scientific reproducibility and data provenance tracking

    Summary

    Structural biology research depends on high-quality structure models, often derived from X-ray crystallography. The Protein Data Bank (PDB) is the primary source for such models, but it has drawbacks for large-scale and high-throughput studies: the deposited models were created by different people in different eras using different methods, and many contain correctable flaws. For over a decade, the PDB-REDO databank has drastically reduced these drawbacks. It collects optimised PDB entries that are re-created from their original experimental data using a fully automated protocol. This procedure makes the structure models more comparable while also removing many imperfections.

    Until this project, PDB-REDO had limited provenance tracking and used proprietary data formats for metadata. Moreover, because the PDB-REDO databank is a living entity, entries are replaced frequently, e.g. to incorporate algorithmic advances from our ongoing research. As a result, models used in structural biology research could ‘disappear’, which hampered their re-use and scientific reproducibility. This EOSC-Life WP1 project addressed these limitations:

    • Each PDB-REDO databank entry now includes a detailed provenance record documenting the version of the input from the PDB, as well as the versions of the more than 60 programs used in the PDB-REDO protocol. Any use of non-standard settings in the procedure is also documented.
    • All PDB-REDO metadata have been recast as JSON files, with the data structures described in detail by JSON schemas (see this website). Other key data were already provided in community standard formats such as mmCIF and MTZ.
    • A robust version roll-over mechanism has been implemented that retains previous versions of PDB-REDO structure model entries in the so-called ‘attic’ of the entry. Each versioned model has a persistent identifier based on the provenance record. This ensures the long-term availability of the data and thus improves the reproducibility of studies based on PDB-REDO data.
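
    The idea of a provenance-based identifier combined with an attic can be illustrated with a minimal Python sketch. The summary does not describe how PDB-REDO actually derives its identifiers or lays out its files, so everything here is an assumption made for illustration: the hashing scheme, the field names (`current`, `attic`, `pdb_version`, `programs`) and the functions `version_id` and `roll_over` are all hypothetical.

```python
import hashlib
import json


def version_id(provenance: dict) -> str:
    """Derive a deterministic identifier from a provenance record.

    Hypothetical scheme: hash a canonical JSON serialisation, so the same
    provenance always yields the same identifier regardless of key order.
    """
    canonical = json.dumps(provenance, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def roll_over(entry: dict, new_model: dict, new_provenance: dict) -> dict:
    """Install a new model version and keep the previous one in the attic."""
    attic = list(entry.get("attic", []))
    current = entry.get("current")
    if current is not None:
        # The superseded version is retained, not deleted.
        attic.append(current)
    return {
        "current": {
            "id": version_id(new_provenance),
            "provenance": new_provenance,
            "model": new_model,
        },
        "attic": attic,
    }
```

    In this sketch, a study that cites the identifier of a superseded model can still locate that exact version in the attic after the entry has been updated.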

    The updated data structure of the PDB-REDO databank also allowed us to create a cloud-ready API for research dataset generation. It allows users to select structure models based on provenance data and/or structural and model validation parameters stored as metadata. A graphical interface to this API is available here (Fig 1). Search results can be stored as a JSON structure and used as a dataset descriptor for scientific publications.
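
    As a rough illustration of selecting entries on metadata and saving the result as a dataset descriptor, the following Python sketch filters locally held metadata records. It does not call the PDB-REDO API, whose endpoints and parameters are not given in this summary. The field names (`pdb_id`, `version_id`, `r_free`, `protocol_version`) and the descriptor layout are assumptions, not the actual PDB-REDO schema.

```python
import json
from typing import Iterable, List, Optional


def select_entries(
    entries: Iterable[dict],
    max_r_free: Optional[float] = None,
    protocol_version: Optional[str] = None,
) -> List[dict]:
    """Keep entries that satisfy every criterion that was supplied."""
    selected = []
    for entry in entries:
        # Validation criterion (hypothetical field name).
        if max_r_free is not None and entry["r_free"] > max_r_free:
            continue
        # Provenance criterion (hypothetical field name).
        if protocol_version is not None and entry["protocol_version"] != protocol_version:
            continue
        selected.append(entry)
    return selected


def dataset_descriptor(query: dict, selected: List[dict]) -> str:
    """Serialise the query and the selected versioned entries as JSON."""
    descriptor = {
        "query": query,
        "entries": [
            {"pdb_id": e["pdb_id"], "version_id": e["version_id"]}
            for e in selected
        ],
    }
    return json.dumps(descriptor, indent=2, sort_keys=True)
```

    Recording the query together with the versioned identifiers, rather than only the PDB codes, is what would let a reader of a publication retrieve exactly the same set of models later.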

    The combined results of this project have made the PDB-REDO databank a stronger resource for structural biology research inside and outside of the European Open Science Cloud.

    Testimonials

    We have long-standing connections between PDBe and PDB-REDO. Now with the improved FAIRness and overall structure of the PDB-REDO databank we can make more extensive connections between our resources. This will create more added value to our combined userbases.

    – Sameer Velankar, PDBe

    The PDB-REDO project has always supported Open Science and focused on accessibility and availability. EOSC-Life has given us the resources and support to bring our databank to the next level of usability and FAIRness which allows us to better serve our existing users and make our databank more suited for new users in the field of structural biology and bioinformatics.

    – Robbie P. Joosten, PDB-REDO