Update on this. Thanks @a.pirogov, I looked through the changes; they look very good so far.
One point: pytest usually does not show prints in the test output if a test passes; if it fails, however, most of the prints are helpful or even needed to understand what was going on. So I am not sure that removing all the prints in the tests is a good idea.
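To illustrate pytest's capture behavior (a minimal sketch; the test file and messages are made up, not from our test suite):

```python
# test_capture_demo.py -- hypothetical example of pytest's output capturing.
# pytest swallows stdout of passing tests; only failing tests get their
# captured output replayed (unless pytest is run with -s / --capture=no).

def test_passing():
    print("never shown in the report, because this test passes")
    assert 1 + 1 == 2

def test_failing():
    print("shown under 'Captured stdout call' in the failure report")
    assert 1 + 1 == 3
```

Running `pytest` on this file prints only the second message, in the failure report. So prints left in the tests cost nothing on green runs but help when something breaks.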
On to the remaining problems:
Sitemap harvester: we have talked about that issue, and on the pipeline branch (which will be merged into dev) it should not be a problem anymore. The bug shows up in the sitemap harvester when harvesting for a specific source; the generalized harvester interface addresses this.
I like the metadata modeling through pydantic classes. The threefold split is also okay, though it takes a while to find things in 3 different places. For now this is fine. In the future there might be harvester-specific metadata models; I think that is not needed at this point, but it would be the better solution for harvesters that have richer metadata.
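As a rough sketch of what harvester-specific models could look like (class and field names are hypothetical, not taken from the codebase): a shared base model for the common fields, extended per harvester with whatever richer metadata that source offers.

```python
from typing import Optional
from pydantic import BaseModel

class BaseRecordMetadata(BaseModel):
    """Common fields that every harvester must provide (hypothetical names)."""
    identifier: str
    title: str
    source_url: str

class SitemapRecordMetadata(BaseRecordMetadata):
    """Extra fields only the sitemap harvester can fill in (hypothetical)."""
    last_modified: Optional[str] = None
    change_frequency: Optional[str] = None
```

Downstream code could keep working against `BaseRecordMetadata`, while richer harvesters expose their extra fields through the subclass.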
@j.broeder I re-enabled the test so you can take a look at it!
Anton Pirogov (cb799ae7) at 28 Mar 15:09
re-enabled broken test for sitemap harvester for debugging
Fiona D'Mello (3c8917bb) at 28 Mar 12:04
cleaned up and added new tests for search api development
Anton Pirogov (884b3b69) at 28 Mar 11:54
allow swapping out rdflib graph store, fix 'well-known' URLs (they ...
There are three possible bottlenecks.
I will try to untangle these things.
I had hopes for oxrdflib as a drop-in improvement for general rdflib Graph operations, but after some simple benchmarks it seems to have no measurable speed gain for loading from serialized files, and it still appears to be somewhat buggy. Furthermore, they say that SPARQL updates (unlike queries) are done via native rdflib anyway, for technical reasons.
Nevertheless, I will keep the refactorings that allow swapping out the rdflib store backend more easily, just in case there is a better option. I'll see if I can also try evaluating the Virtuoso rdflib backend.
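For reference, the kind of micro-benchmark meant here looks roughly like this (a minimal sketch; `dump.ttl` is a placeholder file, and the `"Oxigraph"` store name assumes oxrdflib is installed, which registers itself as an rdflib store plugin):

```python
import time
from rdflib import Graph

def time_load(store: str, path: str = "dump.ttl") -> float:
    """Time loading a serialized Turtle file into a Graph with the given store."""
    graph = Graph(store=store)
    start = time.perf_counter()
    graph.parse(path, format="turtle")
    return time.perf_counter() - start

print("default (in-memory) store:", time_load("default"))
print("Oxigraph store via oxrdflib:", time_load("Oxigraph"))
```

Because the store is just a plugin name passed to `Graph`, swapping in another backend only means changing that string, which is what the refactoring keeps easy.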
This will be postponed, since we decided to try a Kubernetes deployment, i.e. to change the deployment; then this is partially solved already.
In order to migrate the Helmholtz KG to the JSC cloud, some downtime is to be expected. We need to set up a banner in the frontend to announce the downtime to users.
Dates decided: 09.-10.04.2024; Gabriel and Fiona will set up a banner soon.
Smaller tests and the migration of smaller projects come first.
It was communicated that this requires some downtime, from 20 minutes up to 4 hours.
For this, a notice should be put up on the frontend a week before the move.
organise a "brainstorming meeting" to discuss potential changes to the frontend
will be updated by @f.dmello & potentially split up into sub-issues that can be closed in sprints (discussion on Mar 28th)
With #50 done, we will have a running pipeline for the API and frontend.
For introducing the pipeline in the beginning, I disabled all pylint issues we had with the code (see pyproject.toml for the disabled problems). To have clean code, each warning needs to be removed from the disable list and the underlying issue fixed, except for wrong-import-order.
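As a sketch of what that list in pyproject.toml looks like (the example message names below are illustrative, not the exact set we disabled):

```toml
[tool.pylint."messages control"]
disable = [
    "wrong-import-order",          # stays disabled permanently
    "missing-function-docstring",  # example: fix the code, then delete this line
    "too-many-arguments",          # example: fix the code, then delete this line
]
```

Shrinking this list entry by entry, fixing the warnings as they are re-enabled, is the cleanup path described above.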
Issues we have (check if fixed):
This is not relevant anymore due to the transition to Next.js.
Problem:
Currently, to know when a harvester last ran, we store a file with a timestamp. The file currently has a fixed name. If the same harvester type now runs in parallel but harvests other sources, both instances read and write the same last-run file.
Proposed solution: add the functionality to save and load last-run information based on a given string. That way one can write and load a last-run file for every source.
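A minimal sketch of that (function names and the file layout are hypothetical, not the actual implementation):

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

LAST_RUN_DIR = Path("harvester_state")  # hypothetical state directory

def save_last_run(source: str) -> None:
    """Write the current UTC timestamp to a last-run file keyed by the source."""
    LAST_RUN_DIR.mkdir(parents=True, exist_ok=True)
    path = LAST_RUN_DIR / f"last_run_{source}.txt"
    path.write_text(datetime.now(timezone.utc).isoformat())

def load_last_run(source: str) -> Optional[datetime]:
    """Read the last-run timestamp for a source; None if it never ran."""
    path = LAST_RUN_DIR / f"last_run_{source}.txt"
    if not path.exists():
        return None
    return datetime.fromisoformat(path.read_text().strip())
```

With the source name baked into the file name, two instances of the same harvester type harvesting different sources no longer read and write each other's state.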
Issue was closed manually after discussion.
https://github.com/Materials-Data-Science-and-Informatics/somesy
Instead of using cffconvert/codemetapy, maybe this package should use somesy (and not duplicate metadata from poetry to setuptools).
somesy is set up, but disabled, because it overwrites the codemeta.json with a lesser version of it. The current codemeta.json was generated by codemetapy and then completed by hand.
Once the additional information can be given to somesy and piped into codemeta.json, it will be enabled. Though I am not sure that the duplication can be avoided, since I do not count on every tool reading metadata out of tool-specific sections.
I hate poetry for not using the Python standard but creating their own.
Closes #65