Commit 891b1c50 authored by Timm Schoening
Update README.md

parent 0d07b305
> **Preamble:** We strive to make marine image data [FAIR](https://www.go-fair.org/fair-principles/). We maintain [data profiles](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/metadata-profiles-fdos) to establish a common language of marine imagery, we develop - here in this repository - best-practice [operating procedures](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures) for handling marine images and we develop [software tools](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/software/mar-iqt) to apply the vocabulary and procedures to marine imagery.
# Introduction
## Preface
Publishing marine image data in a FAIR and open way requires data curation for both image data and image metadata. These quality control steps need to follow common standard operating procedures (SOPs) to facilitate joint data interpretation. This repository mainly collects SOPs for the QA/QC steps between acquisition and publication. It also contains some example SOPs on image acquisition for the interested user. It does not yet provide operational steps for the publication phase (e.g. physical file transfer to Pangaea).
# Fundamentals
## Data Structure:
How to structure data on disk should not be enforced by any SOP. However, we recommend the following structure to aid the automation of data curation and publication workflows. It is based on several best practices (e.g. https://github.com/drivendata/cookiecutter-data-science).
```
/<volume>/<project>/
├── <event_1>
│   ├── <sensor_x>
│   │   ├── external/      (Optional) External data that affects the creation of raw data (e.g. calibration curves)
│   │   ├── raw/           The raw data as recorded by the sensor (e.g. acoustic soundings)
│   │   ├── intermediate/  (Optional) Intermediate data that will not be archived. Playground or sandbox for working with the raw data
│   │   ├── processed/     Processed data that has been QA/QC'd and is ready for publication (e.g. map grids)
│   │   ├── products/      (Optional) Data products created from the raw or processed data for visualization or as combinations of data of several events (e.g. geological maps)
│   │   └── protocol/      Documentation on how the data was created, curated, processed, visualized, etc.
│   ├── <sensor_y>         Same as above for the next sensor deployed during this event
│   └── protocol/          General information on this event (e.g. ROV deployment plan)
└── <event_2>              The same as above for the next event
    ├── <sensor_x>
    └── <sensor_z>
```
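As an illustration only, the recommended layout could be created with a small Python helper like the one below. This is a minimal sketch: the function name `create_event_folders`, the `REQUIRED`/`OPTIONAL` split, and the `with_optional` switch are assumptions made for this example and are not part of any MareHub tool.

```python
from pathlib import Path

# Sub-folders per sensor as listed in the tree above. The folders marked
# "(Optional)" there are only created on request.
REQUIRED = ("raw", "processed", "protocol")
OPTIONAL = ("external", "intermediate", "products")


def create_event_folders(root, event, sensors, with_optional=False):
    """Create <root>/<event>/<sensor>/<sub-folder>/ and the event-level protocol/ folder."""
    event_dir = Path(root) / event
    subfolders = REQUIRED + (OPTIONAL if with_optional else ())
    created = []
    for sensor in sensors:
        for sub in subfolders:
            path = event_dir / sensor / sub
            # exist_ok=True makes the helper safe to call repeatedly
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    # General information on the event itself (e.g. ROV deployment plan)
    protocol = event_dir / "protocol"
    protocol.mkdir(parents=True, exist_ok=True)
    created.append(protocol)
    return created
```

A call such as `create_event_folders("/<volume>/<project>", "<event_1>", ["<sensor_x>", "<sensor_y>"])` would then create the three required sub-folders for each sensor plus the event-level `protocol/` folder.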
On German research vessels, the "scientists folder" on the network or the new "Mass-Data-Module" (MDM, installed in 2021) will mostly act as the root folder `/<volume>/<project>` but for some researchers, who bring their own mass storage or NAS devices, it may be some path on their own hardware. Some disciplines/groups like to split their data by sensor first. This is not recommended but certainly possible. In that case, the paths would look like this:
```
/<volume>/<project_i>/
├── <sensor_x>
│   ├── <event_1>
│   └── <event_2>
└── <sensor_y>
    ├── <event_1>
    └── <event_3>
```
# Standard operating procedures (SOPs)
This section is your starting point for exploring the MareHub AG V/I SOP documents. This readme provides some background and context information on how visualizations of the SOPs are structured and how the data workflow is explained. It is like an SOP for SOPs. You can find the existing marine imaging SOPs [here](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/tree/master/SOPs).
## Provenance documentation:
Provenance documentation of (automated) SOP steps is required to enable data reuse and validity checks. Provenance information needs to document the agents, entities and activities involved. It should facilitate reproducibility, but its main purpose is to document the execution steps rather than to enable fully automated re-execution, which would additionally require an automated setup of the software environment (e.g. through Docker). Provenance of individual SOP steps should be recorded in a machine-readable format (i.e. a **yaml** file) like so:
```
provenance:
  - action:
      executable:
        name: <executable name>
        version: <version string of executable>
      parameter:
        - name: <param-x_name>
          value: <param-x_value>
          [hash: <md5 hash of file at <param-x_value> (optional, only for files)>]
        - name: <param-y_name>
          value: <param-y_value>
          hash: null
      time: <time of execution: in utc, human-readable, with milliseconds (%Y%m%d %H:%M:%S.%f%z)>
  - action:
      executable:
        ...
      parameter:
        ...
      hash: <sha256 hash of previous provenance file>
      time: ...
```
If an additional processing step is applied to an entity, the additional provenance information shall be appended to the provenance file that was created together with the entity. Together with the SHA256 hash of the previous provenance file, this enables a blockchain-like behaviour.
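The chaining of provenance records described above could look roughly like the following Python sketch. It is not the implementation used by any MareHub tool: the function names `file_hash`, `format_action` and `append_provenance` are hypothetical, the indentation of the emitted YAML is an assumption, and values are written unquoted, so a real implementation would more likely use a YAML library. Note also that Python's `%f` yields six-digit microseconds.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_hash(path, algorithm):
    """Hex digest of a file, read in chunks so that large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def format_action(name, version, parameters, previous_hash=None, when=None):
    """Render one provenance action as YAML text (assumed layout, unquoted values)."""
    when = when or datetime.now(timezone.utc)
    lines = [
        "  - action:",
        "      executable:",
        f"        name: {name}",
        f"        version: {version}",
        "      parameter:",
    ]
    for p_name, p_value in parameters.items():
        # Parameters that point to an existing file get an MD5 hash, others null
        is_file = Path(str(p_value)).is_file()
        p_hash = file_hash(p_value, "md5") if is_file else "null"
        lines += [
            f"        - name: {p_name}",
            f"          value: {p_value}",
            f"          hash: {p_hash}",
        ]
    if previous_hash is not None:
        lines.append(f"      hash: {previous_hash}")
    lines.append("      time: " + when.strftime("%Y%m%d %H:%M:%S.%f%z"))
    return "\n".join(lines) + "\n"


def append_provenance(prov_file, name, version, parameters):
    """Append an action to prov_file and chain it to the file's previous state."""
    prov_file = Path(prov_file)
    if prov_file.exists():
        # SHA256 of the provenance file as it was before this step
        previous_hash = file_hash(prov_file, "sha256")
        text = prov_file.read_text()
    else:
        previous_hash = None
        text = "provenance:\n"
    text += format_action(name, version, parameters, previous_hash)
    prov_file.write_text(text)
    return previous_hash
```

Because each appended action stores the SHA256 hash of the file as it existed before the append, a later change to an earlier action can be detected by recomputing the hashes along the chain.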
### General workflows
![MareHub AG Videos/Images SOP documentation: general workflow](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/raw/master/SOPs/graphics/AGVI_SOP-documentation-1-workflow.jpeg "MareHub AG Videos/Images SOP documentation: general workflow")
SOPs generally describe how processes create or modify entities which are managed in infrastructure.
# Standard operating procedures (SOPs)
Currently (March 2021), the few available SOPs are just bullet-point lists, but detailed versions and Jupyter notebooks to execute the curation steps are in preparation.
### A project's data workflow
![MareHub AG Videos/Images SOP documentation: project workflow](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/raw/master/SOPs/graphics/AGVI_SOP-documentation-2-project.jpeg "MareHub AG Videos/Images SOP documentation: project workflow")
In terms of research data, this is expressed by a data creation process that produces a data set entity which is managed in a data repository.
## QA/QC steps between acquisition and publication
- [Creating navigation data per image item](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/MareHub_AGVI_SOP_navigation-data.md)
- [Overview of image curation steps](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/MareHub_AGVI_SOP_iFDO.md)
- Creating iFDOs (image FAIR Digital Objects) for a deployment (in prep.)
- Creating pFDOs (proxy FAIR Digital Objects) for a deployment (in prep.)
- Creating sFDOs (semantic FAIR Digital Objects) for a deployment (in prep.)
### Actors in a workflow
![MareHub AG Videos/Images SOP documentation: project workflow with actors](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/raw/master/SOPs/graphics/AGVI_SOP-documentation-3-project-actors.jpeg "MareHub AG Videos/Images SOP documentation: project workflow with actors")
Processes are conducted by an actor - in this case researchers - and similarly infrastructure is operated by actors - in this case a research data management (RDM) team. Infrastructure is further characterized by whether it is publicly accessible and whether it is machine-accessible. Accessibility by humans is always expected.
## Image acquisition
- At sea using ROVs (in prep.)
- At sea using AUVs (in prep.)
- At sea using OFOSs (in prep.)
### Documentation entities
![MareHub AG Videos/Images SOP documentation: project workflow with documentation](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/raw/master/SOPs/graphics/AGVI_SOP-documentation-4-project-documentation.jpeg "MareHub AG Videos/Images SOP documentation: project workflow with documentation")
Processes and entities can be accompanied by one or many documentation entities. These can take various forms, depending on use cases, SOPs, software tools used, etc. Some may see these documentation entities as just another data entity (_"one wo:man's data is another wo:man's metadata"_), but we like to keep them separate. The format of the documentation entities (file format, information content, etc.) is defined by one or several actors. If the file format is machine-readable, the documentation entity will be marked as such. Like data entities, documentation entities cannot be the end of a workflow. They need to be further processed or be placed into an infrastructure. This infrastructure might also be access-restricted.
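The concepts used in the workflow figures (processes, entities, infrastructure, actors and documentation entities) can be sketched as a small data model. The following Python example is purely illustrative: all class, field and function names are assumptions made for this sketch and do not come from an existing tool. The helper `dangling_entities` checks the rule that an entity must not be the end of a workflow.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Infrastructure:
    name: str
    operated_by: str                  # actor, e.g. an RDM team
    public: bool = False              # publicly accessible?
    machine_accessible: bool = False  # human access is always expected


@dataclass
class Entity:
    name: str
    is_documentation: bool = False
    machine_readable: bool = False
    stored_in: Optional[Infrastructure] = None


@dataclass
class Process:
    name: str
    actor: str                        # e.g. researchers
    inputs: List[Entity] = field(default_factory=list)
    outputs: List[Entity] = field(default_factory=list)


def dangling_entities(processes):
    """Names of entities that are neither stored in an infrastructure nor processed further."""
    consumed = {entity.name for process in processes for entity in process.inputs}
    dangling = []
    for process in processes:
        for entity in process.outputs:
            ends_workflow = entity.stored_in is None and entity.name not in consumed
            if ends_workflow and entity.name not in dangling:
                dangling.append(entity.name)
    return dangling


repo = Infrastructure("data repository", operated_by="RDM team",
                      public=True, machine_accessible=True)
images = Entity("image data set", stored_in=repo)
log = Entity("cruise log", is_documentation=True)
acquisition = Process("image acquisition", actor="researchers", outputs=[images, log])
print(dangling_entities([acquisition]))  # ['cruise log']
```

In this example, the cruise log is reported because it is neither placed into an infrastructure nor used as input to another process.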
## Other
- Publishing image data in Pangaea (in prep.)
- [Image curation workflow at GEOMAR](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/GEOMAR_image_curation_SOP_2021.png)
### Status information
![MareHub AG Videos/Images SOP documentation: project workflow with status information](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/raw/master/SOPs/graphics/AGVI_SOP-documentation-5-project-status.jpeg "MareHub AG Videos/Images SOP documentation: project workflow with status information")
Of course, all SOPs should represent the best-case scenario, and all their components should be in place and work as described. But as the marine imaging community is still developing its best practices, some elements of the SOPs are still under development. This is color-coded in the workflow figures: either there is no solution yet for the entire concept (of a process, entity, infrastructure or documentation), or there is one but it is not commonly supported, operated, executed or maintained.