README.md 5.17 KB
Timm Schoening committed
# Introduction
Publishing marine image data in a FAIR and open way requires data curation for both image data and image metadata. These quality control steps need to follow common standard operating procedures (SOPs) to facilitate joint data interpretation. This repository mainly collects SOPs for the QA/QC steps between acquisition and publication. It also contains some example SOPs on image acquisition for the interested user. It does not yet provide operational steps for the publication phase (e.g. physical file transfer to Pangaea).

# Fundamentals

## Data Structure:
No SOP should enforce how data is structured on disk. Nevertheless, we recommend the following structure to aid the automation of data curation and publication workflows. It is based on several best practices (e.g. https://github.com/drivendata/cookiecutter-data-science).

```
/<volume>/<project>/
├── <event_1>
│   ├── <sensor_x>
│   │   ├── external/ (Optional) External data that affects the creation of raw data (e.g. calibration curves)
│   │   ├── raw/ The raw data as recorded by the sensor (e.g. acoustic soundings)
│   │   ├── intermediate/ (Optional) Intermediate data that will not be archived. Playground or sandbox for working with the raw data
│   │   ├── processed/ Processed data that has been QA/QC'd and is ready for publication (e.g. map grids)
│   │   ├── products/ (Optional) Data products created from the raw or processed data for visualization or as combinations of data of several events (e.g. geological maps)
│   │   └── protocol/ Documentation on how the data was created, curated, processed, visualized, etc.
│   ├── <sensor_y> Same as above for the next sensor deployed during this event
│   └── protocol/ General information on this event (e.g. ROV deployment plan)
└── <event_2> The same as above for the next event
    ├── <sensor_x>
    └── <sensor_z>
```
On German research vessels, the "scientists folder" on the network or the new "Mass-Data-Module" (MDM, installed in 2021) will usually serve as the root folder `/<volume>/<project>`. Researchers who bring their own mass storage or NAS devices may instead use a path on their own hardware. Some disciplines or groups prefer to split their data by sensor first. This is not recommended but possible. In that case, the paths would look like this:
```
/<volume>/<project_i>/
├── <sensor_x>
│   ├── <event_1>
│   └── <event_2>
└── <sensor_y>
    ├── <event_1>
    └── <event_3>
```
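
As an illustration, the recommended event-first layout could be created with a few lines of Python. This is only a minimal sketch: the function name `create_event_layout` and the constant `SENSOR_SUBFOLDERS` are hypothetical and not part of any tool in this repository, and the sketch creates the optional folders as well.

```python
from pathlib import Path

# Sub-folders per sensor, taken from the recommended layout above.
# "external", "intermediate" and "products" are optional in the layout;
# this sketch (an assumption) simply creates all of them.
SENSOR_SUBFOLDERS = (
    "external",
    "raw",
    "intermediate",
    "processed",
    "products",
    "protocol",
)


def create_event_layout(root, event, sensors):
    """Create <root>/<event>/<sensor>/<subfolder> for every sensor.

    `root` corresponds to /<volume>/<project>/ in the layout above.
    Existing folders are left untouched.
    """
    event_dir = Path(root) / event
    # Event-level protocol folder (e.g. for the deployment plan).
    (event_dir / "protocol").mkdir(parents=True, exist_ok=True)
    for sensor in sensors:
        for subfolder in SENSOR_SUBFOLDERS:
            (event_dir / sensor / subfolder).mkdir(parents=True, exist_ok=True)
    return event_dir
```

Calling `create_event_layout("/<volume>/<project>", "<event_1>", ["<sensor_x>", "<sensor_y>"])` with real names substituted would create the folders for one event. Running it again is harmless because existing folders are kept.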

## Provenance documentation:
Provenance documentation of (automated) SOP steps is required to enable data reuse and validity checks. Provenance information needs to document the agents, entities and activities involved. It should facilitate reproducibility, but its main purpose is to document the execution steps rather than to enable fully automated re-execution, which would additionally require automated setup of the software environment (e.g. through Docker). Provenance of individual SOP steps should be recorded in a machine-readable format (i.e. a **yaml** file) like so:
```
provenance:
    - action:
        executable:
            name: <executable name>
            version: <version string of executable>
        parameter:
            - name: <param-x_name>
              value: <param-x_value>
              [hash: <md5 hash of file at <param-x_value> (optional, only for files)>]
            - name: <param-y_name>
              value: <param-y_value>
      hash: null
      time: <time of execution: in utc, human-readable, with milliseconds (%Y%m%d %H:%M:%S.%f%z)>
    - action:
        executable:
            ...
        parameter:
            ...
      hash: <sha256 hash of previous provenance file>
      time: ...
```
If an additional processing step is applied to an entity, the additional provenance information shall be appended to the provenance file that was created together with the entity. Together with the SHA256 hash of the previous provenance file, this enables blockchain-like behaviour.
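
The appending and hashing described above could look roughly like the following Python sketch. It rests on several assumptions that the text does not settle: the "previous provenance file" is taken to be the file's content immediately before the new record is appended, the YAML is written by a small hand-rolled emitter (JSON-quoted scalars, which YAML 1.2 accepts) rather than a YAML library, and the optional md5 `hash` for file parameters is omitted. The names `append_provenance` and `render_record` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Time format as given in the template above. Note that Python's %f
# emits microseconds (six digits), not milliseconds.
TIME_FORMAT = "%Y%m%d %H:%M:%S.%f%z"


def render_record(name, version, parameters, previous_hash, when):
    """Render one provenance list item as YAML text."""
    q = json.dumps  # JSON scalars are valid YAML scalars; None becomes null
    lines = [
        "    - action:",
        "        executable:",
        f"            name: {q(name)}",
        f"            version: {q(version)}",
        "        parameter:",
    ]
    for key, value in parameters.items():
        lines.append(f"            - name: {q(key)}")
        lines.append(f"              value: {q(str(value))}")
    lines.append(f"      hash: {q(previous_hash)}")
    lines.append(f"      time: {q(when.strftime(TIME_FORMAT))}")
    return "\n".join(lines) + "\n"


def append_provenance(prov_file, name, version, parameters, when=None):
    """Append one record; return the hash that was stored in it.

    The first record gets `hash: null`. Every later record stores the
    SHA256 of the file as it was before appending (an assumption).
    """
    prov_file = Path(prov_file)
    when = when or datetime.now(timezone.utc)
    if prov_file.exists():
        previous_hash = hashlib.sha256(prov_file.read_bytes()).hexdigest()
        header = ""
    else:
        previous_hash = None
        header = "provenance:\n"
    record = render_record(name, version, parameters, previous_hash, when)
    with prov_file.open("a", encoding="utf-8", newline="\n") as handle:
        handle.write(header + record)
    return previous_hash
```

Because each record stores the hash of everything written before it, changing an earlier record afterwards makes the stored hash of the next record no longer match. A check could recompute the SHA256 over the file content up to each record and compare it with that record's `hash` value.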

# Standard operating procedures (SOPs)
Currently (March 2021), the few available SOPs are just bullet-point lists, but detailed versions and Jupyter notebooks to execute the curation steps are in preparation.

## QA/QC steps between acquisition and publication
- [Creating navigation data per image item](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/MareHub_AGVI_SOP_navigation-data.md)
- [Overview of image curation steps](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/MareHub_AGVI_SOP_iFDO.md)
- Creating iFDOs (image FAIR Digital Objects) for a deployment (in prep.)
- Creating pFDOs (proxy FAIR Digital Objects) for a deployment (in prep.)
- Creating sFDOs (semantic FAIR Digital Objects) for a deployment (in prep.)

## Image acquisition
- At sea using ROVs (in prep.)
- At sea using AUVs (in prep.)
- At sea using OFOSs (in prep.)

## Other
- Publishing image data in Pangaea (in prep.)
- [Image curation workflow at GEOMAR](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/GEOMAR_image_curation_SOP_2021.png)