# Introduction
Publishing marine image data in a FAIR and open way requires data curation for both image data and image metadata. These quality control steps need to follow common standard operating procedures (SOPs) to facilitate joint data interpretation. This repository mainly collects SOPs for the QA/QC steps between acquisition and publication. It also contains some example SOPs on image acquisition for the interested user. It does not yet provide operational steps for the publication phase (e.g. physical file transfer to Pangaea).

# Fundamentals

## Data Structure:
How data is structured on disk should not be enforced by any SOP. Nevertheless, we recommend the following structure to aid the automation of data curation and publication workflows. It is based on several best practices (e.g. https://github.com/drivendata/cookiecutter-data-science).

```
/<volume>/<project>/
├── <event_1>
│   ├── <sensor_x>
│   │   ├── external/ (Optional) External data that affects the creation of raw data (e.g. calibration curves)
│   │   ├── raw/ The raw data as recorded by the sensor (e.g. acoustic soundings)
│   │   ├── intermediate/ (Optional) Intermediate data that will not be archived. Playground or sandbox for working with the raw data
│   │   ├── processed/ Processed data that has been QA/QC'd and is ready for publication (e.g. map grids)
│   │   ├── data_products/ (Optional) Data products created from the processed data for visualization or as combinations of data of several events (e.g. geological maps)
│   │   └── protocol/ Documentation on how the data was created, curated, processed, visualized, etc.
│   ├── <sensor_y> Same as above for the next sensor deployed during this event
│   └── protocol/ General information on this event (e.g. ROV deployment plan)
└── <event_2> The same as above for the next event
    ├── <sensor_x>
    └── <sensor_z>
```
On German research vessels, the "scientists folder" on the network or the new "Mass-Data-Module" (MDM, installed in 2021) will mostly act as the root folder `/<volume>/<project>`. For researchers who bring their own mass storage or NAS devices, the root may instead be a path on their own hardware. Some disciplines/groups prefer to split their data by sensor first. This is not recommended but certainly possible. In that case, the paths would look like this:
```
/<volume>/<project_i>/
├── <sensor_x>
│   ├── <event_1>
│   └── <event_2>
└── <sensor_y>
    ├── <event_1>
    └── <event_3>
```
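
The recommended event-first layout can be created automatically before a deployment. The following is only a minimal sketch in Python: the function name `create_event_layout` and the constant `SENSOR_SUBFOLDERS` are hypothetical and not part of any existing tool, and the sketch creates the optional folders as well.

```python
from pathlib import Path

# Sub-folder names taken from the recommended layout above.
# Creating the optional ones too is an assumption of this sketch.
SENSOR_SUBFOLDERS = (
    "external",
    "raw",
    "intermediate",
    "processed",
    "data_products",
    "protocol",
)


def create_event_layout(project_root, event, sensors):
    """Create the folders for one event and return the created paths."""
    event_dir = Path(project_root) / event
    created = []
    for sensor in sensors:
        for sub in SENSOR_SUBFOLDERS:
            path = event_dir / sensor / sub
            # exist_ok=True makes repeated calls harmless
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    # Event-level protocol folder (e.g. for the ROV deployment plan)
    protocol = event_dir / "protocol"
    protocol.mkdir(parents=True, exist_ok=True)
    created.append(protocol)
    return created
```

Calling `create_event_layout("/<volume>/<project>", "<event_1>", ["<sensor_x>", "<sensor_y>"])` with real names in place of the placeholders would create six folders per sensor plus the event-level `protocol/` folder.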

## Provenance documentation:
Provenance documentation of (automated) SOP steps is required to enable data reuse and validity checks. Provenance information needs to document the entities, agents and activities involved. It should facilitate reproducibility, but its main purpose is to document execution steps rather than to enable fully automated re-execution, which would require automated setup of the software environment (e.g. through Docker). Provenance of individual SOP steps should be recorded in a machine-readable format (e.g. a **YAML** or JSON file) like so:
```
executable:
    path: <executable name>
    hash: <md5 hash of executable binary>
    time: <utc time of execution, milliseconds since epoch>
    version: <version string of executable>
parameter:
    - name: <param-x_name>
      value: <param-x_value>
      hash: <md5 hash of file at param-x_value>  # optional, only for files
    - name: <param-y_name>
      value: <param-y_value>
```
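
A record with the fields above could be assembled as follows. This is a hedged sketch, not an existing implementation: the function names `md5_of_file`, `build_provenance` and `write_provenance` are hypothetical, and JSON is used only because Python's standard library supports it directly (writing YAML would need a third-party library such as PyYAML).

```python
import hashlib
import json
import time
from pathlib import Path


def md5_of_file(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance(executable, version, parameters):
    """Build a provenance record for one executed SOP step.

    `parameters` maps parameter names to values. Values that point to
    an existing file additionally get a `hash` entry.
    """
    record = {
        "executable": {
            "path": str(executable),
            "hash": md5_of_file(executable),
            # UTC time of execution in milliseconds since epoch
            "time": int(time.time() * 1000),
            "version": version,
        },
        "parameter": [],
    }
    for name, value in parameters.items():
        entry = {"name": name, "value": value}
        if Path(str(value)).is_file():
            entry["hash"] = md5_of_file(value)
        record["parameter"].append(entry)
    return record


def write_provenance(record, out_path):
    """Write the record as JSON; default=str serializes Path values."""
    text = json.dumps(record, indent=2, default=str)
    Path(out_path).write_text(text, encoding="utf-8")
```

Storing the resulting file in the `protocol/` folder of the respective sensor would keep it next to the data it describes, although where to store it is a choice of this sketch and not prescribed above.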

# Standard operating procedures (SOPs)
Currently (March 2021), the few available SOPs are just bullet-point lists, but detailed versions and Jupyter notebooks to execute the curation steps are in preparation.

## QA/QC steps between acquisition and publication
- [Creating navigation data per image item](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/MareHub_AGVI_SOP_navigation-data.md)
- [Image curation steps](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/MareHub_AGVI_SOP_iFDO.md): creating iFDOs (image FAIR Digital Objects) for a deployment (in prep.)
- Creating pFDOs (proxy FAIR Digital Objects) for a deployment (in prep.)
- Creating sFDOs (semantic FAIR Digital Objects) for a deployment (in prep.)

## Image acquisition
- At sea using ROVs (in prep.)
- At sea using AUVs (in prep.)
- At sea using OFOSs (in prep.)

## Other
- Publishing image data in Pangaea (in prep.)
- [Image curation workflow at GEOMAR](https://gitlab.hzdr.de/datahub/marehub/ag-videosimages/standard-operating-procedures/-/blob/master/SOPs/GEOMAR_image_curation_SOP_2021.png)