GitLab HPC Driver issues
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues
Updated: 2023-03-02T11:12:56+01:00

Breaking changes in GitLab 17.0
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/17
Updated: 2023-03-02T11:12:56+01:00
Author: Huste, Tobias

GitLab Runner ~~14.0~~ 17.0 will introduce breaking changes that also require this implementation to be updated. In essence, the step so far called `build_script` will be renamed to `step_script`.
This was already unintentionally introduced in runner version `13.1.0` and fixed in `13.1.1` with this Merge Request: https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/2227
Update: Now this is apparently due in version 17.0. See output of gitlab-runner:
```
WARNING: Starting with version 17.0 the 'build_script' stage will be replaced with 'step_script': https://gitlab.com/groups/gitlab-org/-/epics/6112
```
[GitLab epic here.](https://gitlab.com/groups/gitlab-org/-/epics/6112)
Due date: 2021-04-21

Hemera-tagged job doesn't start
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/18
Updated: 2021-06-18T11:45:56+02:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

We have recurring problems with our jobs in the Cases repository.
https://gitlab.hzdr.de/openfoam/fwdc/Cases/-/jobs/161939
```
Executing "step_script" stage of the job script
WARNING: Starting with version 14.0 the 'build_script' stage will be replaced with 'step_script': https://gitlab.com/gitlab-org/gitlab-runner/-/issues/26426
Slurm job (2993342) state: PENDING
slurm_load_jobs error: Socket timed out on send/recv operation
terminate called after throwing an instance of 'std::out_of_range'
  what(): basic_string::substr: __pos (which is 9) > this->size() (which is 0)
Uploading artifacts for failed job
```
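The `std::out_of_range` in the log above comes from calling `substr` with position 9 on an empty string, right after the Slurm query timed out. The following is only a minimal sketch of a defensive alternative, assuming the driver strips a fixed-length prefix from scheduler output; the helper name `ValueAfterPrefix` and the `JobState=` prefix are hypothetical and not taken from the driver's source.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not part of the existing driver code).
// Returns the text following `prefix`, or std::nullopt when `line` is too
// short or does not start with `prefix`, instead of letting substr() throw.
std::optional<std::string> ValueAfterPrefix(const std::string& line,
                                            const std::string& prefix) {
    if (line.size() < prefix.size() ||
        line.compare(0, prefix.size(), prefix) != 0) {
        return std::nullopt;  // e.g. empty output after a scheduler timeout
    }
    return line.substr(prefix.size());
}
```

A caller could treat `std::nullopt` as "state unknown" and simply poll again later, rather than terminating the job.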
The last successful job before this was https://gitlab.hzdr.de/openfoam/fwdc/Cases/-/jobs/161538

No git command found
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/20
Updated: 2021-03-01T10:50:04+01:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Hey, our regularly scheduled job failed on Friday:
```
Fetching changes with git depth set to 1...
/tmp/custom-executor790368454/script770287592/script.: line 127: git: command not found
ERROR: Job failed: exit status 1
```
https://gitlab.hzdr.de/openfoam/fwdc/Cases/-/jobs/235101

HPC-Runner is off
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/19
Updated: 2020-12-22T10:02:16+01:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Hi,
our runner no longer seems to be alive after the recent Hemera failure. Is it possible to get jobs running again?
![image](/uploads/a1bad29c69cbabab7618b748f5aad74f/image.png)

Security concept
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/1
Updated: 2020-12-14T13:26:35+01:00
Author: David Pape (d.pape@hzdr.de)

The program needs a security concept.
For now, all jobs are run in sibling directories without any encapsulation/containerization/... This means that a job, running concurrently with other jobs, can easily access their data. This is a problem when dealing with jobs that contain sensitive information that must not be made public.
A possible solution might be a pool of cluster users that run incoming jobs in a round-robin manner. Their working directories can be secured using UNIX/LDAP permissions.

15 minutes in PENDING state
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/9
Updated: 2020-09-04T09:25:57+02:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Hey, I've started a test pipeline without snakemake and hemera can't start.
Is it a bug, or are there no available runners?
Pipeline https://gitlab.hzdr.de/openfoam/Cases/-/jobs/95223

Job failure: Error in file util/session_dir.c
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/16
Updated: 2020-06-25T15:18:08+02:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Hey, I got this error in one of three similar jobs:
```
Error in rule hoso_tomi_h1_solve:
jobid: 50
output: /home/gitlabrun/builds/55-349-3-134250/openfoam/fwdc/Cases/baseline/2013_Hosokawa_and_Tomiyama/H1/log.solver
shell:
mpirun -oversubscribe -np 8 HZDRreactingMultiphaseEulerFoam -case /home/gitlabrun/builds/55-349-3-134250/openfoam/fwdc/Cases/baseline/2013_Hosokawa_and_Tomiyama/H1 -parallel > /home/gitlabrun/builds/55-349-3-134250/openfoam/fwdc/Cases/baseline/2013_Hosokawa_and_Tomiyama/H1/log.solver
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job hoso_tomi_h1_solve since they might be corrupted:
/home/gitlabrun/builds/55-349-3-134250/openfoam/fwdc/Cases/baseline/2013_Hosokawa_and_Tomiyama/H1/log.solver
Job failed, going on with independent jobs.
[csk088.cluster:137555] [[3914,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[csk088.cluster:137555] [[3914,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 382
[csk088.cluster:137558] [[3919,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[csk088.cluster:137558] [[3919,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 382
[csk088.cluster:137563] [[3906,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[csk088.cluster:137563] [[3906,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 382
[csk088.cluster:137571] [[3962,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[csk088.cluster:137575] [[3966,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[csk088.cluster:137571] [[3962,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 382
[csk088.cluster:137575] [[3966,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 382
[Thu Jun 25 10:43:29 2020]
```
Maybe it signals a problem on a certain node?
Full log https://gitlab.hzdr.de/openfoam/fwdc/Cases/-/jobs/134250/raw
Two others (RUNNING now)
https://gitlab.hzdr.de/openfoam/fwdc/Cases/-/jobs/134229
https://gitlab.hzdr.de/openfoam/fwdc/Cases/-/jobs/134231

Snakemake update
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/15
Updated: 2020-05-13T17:08:39+02:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

We have the error
```
TypeError in line 54 of /home/gitlabrun/builds/55-349-1-118277/openfoam/fwdc/Cases/analyze.rules:
42 report() got an unexpected keyword argument 'patterns'
```
in our Snakemake jobs. This points to an older Snakemake version. Although the library was updated on the cluster, the question remains whether the runner account has its own installation paths where an older version of Snakemake could still be living.
Please check which Snakemake version the runner account uses. This can be done via `snakemake --version`.

Help needed related to bash loop
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/14
Updated: 2020-04-07T18:15:33+02:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Hi, I'm trying this construction in a CI/CD script:
```
for CASE in $NEW_CASES ; do export CASE_DIR=$(dirname $CASE) ; cd CASE_DIR && python3 -m snakemake --cores $SNAKEMAKE_CORES && python3 -m snakemake --report report.html ; done
```
It leads to error:
```
333 /var/spool/slurmd/job2254278/slurm_script: line 115: cd: CASE_DIR: No such file or directory
```
Is there any problem with `dirname`? Could you suggest another tool to extract the directory name from a file path?

Set time limit in Slurm job (automatically)
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/13
Updated: 2020-03-17T10:29:31+01:00
Author: David Pape (d.pape@hzdr.de)

This would allow for faster access to the cluster if a small time limit is set.
If the user doesn't specify a limit in their `.gitlab-ci.yml`, it can be inferred from their project's settings. Unfortunately, there is no variable set by GitLab that exposes this information.

NFS file handle
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/12
Updated: 2020-03-02T16:16:13+01:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

We have some problems on our side (like large logs), but there are also failures with the runner. It takes longer and also produces this error:
```
WARNING: File ignored: lstat baseline/MTLoop/fixedPolydisperse/086/fourVelocityGroups/0.00011999/alpha.water: stale NFS file handle
```

LICENSE
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/11
Updated: 2020-02-28T15:07:39+01:00
Author: David Pape (d.pape@hzdr.de)

The project needs a license. The CodeCoverage.cmake script includes its own license text. FindSphinx.cmake and other parts of the CMake setup should differ significantly enough from a blogpost referenced in earlier commits.

"GitLab HPC-Runner" is not an ideal name
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/10
Updated: 2020-02-19T15:44:00+01:00
Author: David Pape (d.pape@hzdr.de)

Actually, this GitLab project should have been called "GitLab HPC **driver**".
- runner = the service / computer / compute node / infrastructure
- executor = the "scenario" a job is run in (in the case of this project "custom")
- driver = the actual program (in use [here](https://docs.gitlab.com/runner/executors/custom.html#driver-examples) and in other places)
Should we rename the project, @frust45? I don't see any harm (yet) in doing so. I'm definitely going to use the driver terminology in my internship report (Praktikumsbericht) because it's more concise and doesn't overload terms. I'll refer to the whole project as "GitLab-Runner" and to the program that is maintained in this repository as "HPC-Driver".

No space left on device (suppressing repeats)
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/7
Updated: 2020-02-04T15:26:18+01:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Finally, I have about 1500 snakemake jobs. About 300 of them are real CFD solutions... so I will need much more space than I have now.
This error happened when I just launched a test run with a single iteration per case.

Job's log exceeded limit of 4194304 bytes
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/8
Updated: 2020-02-04T15:08:18+01:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

Is this a default limit? Can we increase it?

How to access group quotas
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/6
Updated: 2020-01-28T10:24:08+01:00
Author: Huste, Tobias

E.g. FWDC has additional quotas. How can they be made available via CI/CD on hemera?

python snakemake
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/5
Updated: 2020-01-27T15:56:12+01:00
Author: Evdokimov, Dr. Ilya (FWDC) - 141789

I'm launching snakemake in GitLab CI/CD bash as a Python module, so the original command looks like
```
python3 -m snakemake -s baseline.rules --report report_baseline.html
```
Only in reports I get this error and pipeline fails.
```
627 Traceback (most recent call last):
628 File "/home/gitlabrun/.local/lib/python3.6/site-packages/snakemake/__init__.py", line 633, in snakemake
629 keepincomplete=keep_incomplete,
630 File "/home/gitlabrun/.local/lib/python3.6/site-packages/snakemake/workflow.py", line 680, in execute
631 auto_report(dag, report)
632 File "/home/gitlabrun/.local/lib/python3.6/site-packages/snakemake/report/__init__.py", line 663, in auto_report
633 text = f.read() + rst_links
634 File "/trinity/shared/pkg/devel/python/3.6.5/lib/python3.6/encodings/ascii.py", line 26, in decode
635 return codecs.ascii_decode(input, self.errors)[0]
636 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 452: ordinal not in range(128)
```
Actually, I don't understand the exact cause, and since my local setup works fine, I can't reproduce and debug it locally. Any suggestions on how to fix it are welcome.

Handle all possible states in SlurmJob.cpp
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/3
Updated: 2019-12-16T15:53:13+01:00
Author: David Pape (d.pape@hzdr.de)

The current implementation does not cover all possible states a Slurm job can be in. A full list can be found [in the official Slurm documentation](https://slurm.schedmd.com/squeue.html#lbAG). These should be divided into categories like
- still running (pending, running, completing, ...) → wait
- successfully finished (completed, ...) → `return 0;`
- job ended "user's fault" (failed, ...) → `return HPCJob::GetEnv()->GetExitBuildFailure;`
- job ended "system's fault" (node_fail, out_of_memory, ...) → `return HPCJob::GetEnv()->GetExitSystemFailure;`

Move SLURM docker image to FWCC group
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/2
Updated: 2019-12-16T13:46:52+01:00
Author: Huste, Tobias

The slurm image used for testing should be moved to the FWCC group. @pape58 Do you have the right to do so? If not, I can do it.
Assignee: David Pape (d.pape@hzdr.de)

Notify SlurmJob::OutputFile of changes in file
https://codebase.helmholtz.cloud/fwcc/gitlab-hpc-driver/-/issues/4
Updated: 2019-12-16T11:35:51+01:00
Author: David Pape (d.pape@hzdr.de)

At the moment SlurmJob::OutputFile outputs new lines every 100 milliseconds. Can this be improved by using inotify or similar methods?