chore: update readme #160

Merged · 1 commit · May 13, 2024

35 changes: 25 additions & 10 deletions README.md
@@ -1,12 +1,34 @@
# fdk-dataservice-harvester

fdk-dataservice-harvester will harvest catalogs of dataservices according to the upcoming [DCAT-AP-NO v.0 specification](https://informasjonsforvaltning.github.io/dcat-ap-no/).
The harvest process is triggered by messages from RabbitMQ with the routing key `dataservice.*.HarvestTrigger`; each message calls the method `initiateHarvest` in the class `HarvesterActivity`. The actual harvest starts when `activitySemaphore` has an available permit; when no permits are available, messages are queued by the semaphore.
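
A minimal Kotlin sketch of this gating; the class, method, and semaphore names come from the text above, while the signature and surrounding wiring are assumptions rather than the project's actual code:

```kotlin
import java.util.concurrent.Semaphore

// Sketch only: a single permit means one harvest runs at a time; additional
// trigger messages wait on acquire() and are effectively queued.
class HarvesterActivity(private val activitySemaphore: Semaphore = Semaphore(1)) {

    // Invoked for every RabbitMQ message with routing key dataservice.*.HarvestTrigger.
    fun initiateHarvest(runHarvest: () -> Unit) {
        activitySemaphore.acquire()
        try {
            runHarvest() // download, parse and store; see the steps described below
        } finally {
            activitySemaphore.release() // released after the harvest reports are published
        }
    }
}
```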

The catalogs will then be stored and made available at a standardized endpoint.
The body of the trigger message has three relevant parameters (see the sketch after the list):
- `dataSourceId` - Triggers the harvest of a specific source from fdk-harvest-admin.
- `publisherId` - Triggers the harvest of all sources for the specified organization number.
- `forceUpdate` - Indicates that the harvest should be performed even when no changes are detected in the source.
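
An illustrative Kotlin model of that body; the field names come from the list above, but the exact class used by the harvester may differ:

```kotlin
// Illustrative only; the harvester's real trigger class may look different.
data class HarvestTrigger(
    val dataSourceId: String? = null, // harvest one specific source from fdk-harvest-admin
    val publisherId: String? = null,  // harvest all sources for this organization number
    val forceUpdate: Boolean = false  // harvest even when the source is unchanged
)
```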

A triggered harvest will download all relevant sources from fdk-harvest-admin, download everything from each source, and try to read it as an RDF graph via a Jena Model. If the source is successfully parsed as a Jena Model, it is compared to the last harvest of the same source. The harvest process continues if the source is not isomorphic to the last harvest or `forceUpdate` is true.
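
Roughly, the change detection could be expressed like this with Jena; the serialization format and function shape are assumptions:

```kotlin
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdf.model.ModelFactory
import java.io.StringReader

// Sketch of the change-detection step: parse the downloaded source as a Jena
// Model and compare it with the previous harvest of the same source.
fun shouldContinueHarvest(downloaded: String, lastHarvest: Model?, forceUpdate: Boolean): Boolean {
    val model: Model = ModelFactory.createDefaultModel()
    model.read(StringReader(downloaded), null, "TURTLE") // format is an assumption
    return forceUpdate || lastHarvest == null || !model.isIsomorphicWith(lastHarvest)
}
```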

The actual harvest process first finds all catalogs, i.e. resources with the type `dcat:Catalog`; blank node catalogs are ignored. It then finds all data services each catalog contains, indicated by the predicate `dcat:service` and the type `dcat:DataService`; blank node data services are also ignored.
When all catalogs and data services have been found, a recursive function creates a graph with every contained triple for each catalog and data service.
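
A Kotlin sketch of these two steps, finding catalogs with their data services and then collecting all reachable triples; the DCAT terms are spelled out explicitly and the helper names are illustrative:

```kotlin
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdf.model.ModelFactory
import org.apache.jena.rdf.model.Resource
import org.apache.jena.vocabulary.RDF

private const val DCAT = "http://www.w3.org/ns/dcat#"

// Find URI catalogs and, per catalog, the URI data services referenced by dcat:service.
fun extractCatalogs(harvested: Model): Map<Resource, List<Resource>> {
    val catalogType = harvested.createResource(DCAT + "Catalog")
    val dataServiceType = harvested.createResource(DCAT + "DataService")
    val serviceProperty = harvested.createProperty(DCAT + "service")

    return harvested.listResourcesWithProperty(RDF.type, catalogType).toList()
        .filter { it.isURIResource }                 // blank node catalogs are ignored
        .associateWith { catalog ->
            catalog.listProperties(serviceProperty).toList()
                .map { it.getObject() }
                .filter { it.isURIResource }         // blank node data services are ignored
                .map { it.asResource() }
                .filter { it.hasProperty(RDF.type, dataServiceType) }
        }
}

// Recursively copy every triple reachable from a resource into a new model.
fun collectTriples(
    resource: Resource,
    into: Model = ModelFactory.createDefaultModel(),
    visited: MutableSet<Resource> = mutableSetOf()
): Model {
    if (!visited.add(resource)) return into
    resource.listProperties().forEach { statement ->
        into.add(statement)
        val obj = statement.getObject()
        if (obj.isResource) collectTriples(obj.asResource(), into, visited)
    }
    return into
}
```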

The process will save metadata for both data services and catalogs (an illustrative model follows the list):
- `uri` - The IRI of the resource; used as the database id.
- `fdkId` - The UUID used for the resource in the context of FDK; generated as a hash of the uri if nothing else is set.
- `isPartOf` - Only relevant for data services; the uri of the catalog the data service belongs to.
- `removed` - Only relevant for data services; set to true if the data service has been removed from the source.
- `issued` - The timestamp of the first time the resource was harvested.
- `modified` - The timestamp of the last time a harvest of the resource found changes in the resource graph.
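
An illustrative Kotlin shape for this metadata, together with one way the `fdkId` hash could be derived; both are assumptions, not the harvester's actual classes:

```kotlin
import java.util.UUID

// Illustrative only; the stored metadata classes in the code base may differ.
data class DataServiceMeta(
    val uri: String,              // IRI of the resource, used as the database id
    val fdkId: String,            // UUID used for the resource within FDK
    val isPartOf: String? = null, // uri of the catalog the data service belongs to
    val removed: Boolean = false, // true when the data service is gone from the source
    val issued: Long,             // timestamp of the first harvest of the resource
    val modified: Long            // timestamp of the last harvest that found changes
)

// One way to derive an fdkId as a hash of the uri when no id is already set.
fun generatedFdkId(uri: String): String =
    UUID.nameUUIDFromBytes(uri.toByteArray()).toString()
```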

All blank nodes will be [skolemized](https://www.w3.org/wiki/BnodeSkolemization) in the resource graphs.
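
A sketch of how blank nodes could be rewritten to skolem IRIs with Jena utilities; the base IRI and labelling scheme here are assumptions:

```kotlin
import org.apache.jena.rdf.model.Model
import org.apache.jena.rdf.model.Resource
import org.apache.jena.util.ResourceUtils

// Rename every blank node to a URI resource so the stored graph has no anonymous nodes.
fun skolemize(model: Model, base: String = "https://example.org/.well-known/genid/"): Model {
    val blankNodes: Set<Resource> = (
        model.listSubjects().toList() +
        model.listObjects().toList().filter { it.isResource }.map { it.asResource() }
    ).filter { it.isAnon }.toSet()

    blankNodes.forEach { node ->
        ResourceUtils.renameResource(node, base + node.id.labelString)
    }
    return model
}
```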

When all sources from the trigger have been processed, a new RabbitMQ message is published with the routing key `dataservices.harvested`. The message body is a list of harvest reports, one report for each source from fdk-harvest-admin.
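
Assuming Spring AMQP is used for messaging, the final publish step could look roughly like this; the report fields and the exchange name are illustrative:

```kotlin
import org.springframework.amqp.rabbit.core.RabbitTemplate

// Illustrative report shape; the real harvest report carries project-specific fields.
data class HarvestReport(
    val id: String,                                  // id of the source in fdk-harvest-admin
    val url: String,                                 // url the source was downloaded from
    val harvestError: Boolean = false,
    val changedResources: List<String> = emptyList()
)

// Publish one report per source under the routing key dataservices.harvested.
fun publishHarvestReports(rabbitTemplate: RabbitTemplate, exchange: String, reports: List<HarvestReport>) {
    rabbitTemplate.convertAndSend(exchange, "dataservices.harvested", reports)
}
```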

When the RabbitMQ message has been published, the semaphore permit is released and the next harvest trigger can be processed.

## Requirements
- maven
- java 8
- java 17
- docker
- docker-compose

@@ -15,10 +37,6 @@ Make sure you have an updated docker image with the tag "eu.gcr.io/digdir-fdk-in
```
mvn verify
```
Optionally, if you want to use an image with another tag:
```
mvn verify -DtestImageName="<image-tag>"
```

## Run locally
```
@@ -31,6 +49,3 @@ Then in another terminal e.g.
% curl http://localhost:8081/catalogs
% curl http://localhost:8081/dataservices
```

## Datastore
To inspect the Fuseki triple store, open your browser at http://localhost:3030/fuseki/