sul-dlss/technical-metadata-service

Technical Metadata Service

This API provides methods for creating technical metadata for files in the DOR. It persists the technical metadata and allows it to be queried.

The metadata creation process runs Siegfried to identify each file's format and then runs the appropriate tools for that file type (e.g. exiftool, poppler).
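The identify-then-dispatch flow described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `TOOLS_BY_MIMETYPE` table and the `tool_for` helper are hypothetical names, and the MIME-type patterns are assumptions based on the tools listed under Requirements and the MIME types shown in the report output further down.

```ruby
# Hypothetical mapping from MIME-type patterns to characterization tools.
# The tool names come from this README; the mapping itself is an assumption.
TOOLS_BY_MIMETYPE = [
  [%r{\Aimage/}, :exiftool],
  [%r{\Aapplication/pdf\z}, :poppler],
  # application/mp4 is included because the sample report lists .mp4/.m4a
  # files with that MIME type alongside a duration.
  [%r{\A(audio|video)/|\Aapplication/mp4\z}, :mediainfo]
].freeze

# Returns the tool to run for a MIME type (as reported by an identification
# step such as Siegfried), or nil when no extra characterization applies.
def tool_for(mimetype)
  match = TOOLS_BY_MIMETYPE.find { |pattern, _tool| pattern.match?(mimetype) }
  match&.last
end

tool_for('image/jpeg')      # => :exiftool
tool_for('application/pdf') # => :poppler
tool_for('text/plain')      # => nil
```

A first-match table keeps the dispatch rules in one place, so supporting another file type would only mean adding a row.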

Before this service is invoked, the files must be on the /dor/workspace NFS mounts. The accessionWF technical-metadata robot then invokes the service by making a REST request. In the near term, the service will directly update the workflow system once it has finished generating the technical metadata; the accessionWF can then proceed and remove the files from the workspace. In the longer term, we would like to do this update via a messaging service so that it does not require the robots and is not tightly coupled to the workflow service.

This will only store technical metadata for files in the current version; technical metadata for files that were in earlier versions and are not in the current version will be deleted.
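One way to picture this rule is as a set difference between the filenames already stored for a druid and the filenames in the current version. The sketch below is only an illustration under that assumption; `reconcile` is a hypothetical helper, not part of the service.

```ruby
# Given the filenames already stored for a druid and the filenames in the
# current version, returns which stored records would be deleted and which
# files are new. Files present in both lists are left out of the result.
def reconcile(stored_filenames, current_filenames)
  {
    delete: stored_filenames - current_filenames,
    create: current_filenames - stored_filenames
  }
end

reconcile(%w[a.txt b.pdf], %w[b.pdf c.jpg])
# => { delete: ["a.txt"], create: ["c.jpg"] }
```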

Rake

In addition to the web service, technical metadata can also be generated with rake tasks. To generate technical metadata for an item, run this:

$ bundler exec rake techmd:generate['druid:bc123df4567','spec/fixtures/test/0001.html spec/fixtures/test/bar.txt spec/fixtures/test/brief.pdf spec/fixtures/test/foo.jpg spec/fixtures/test/max.webm spec/fixtures/test/noam.ogg','true']
Success

This happens synchronously and will not update the workflow service.

To generate for an item from a Moab (from preservation storage):

$ bundler exec rake techmd:generate_for_moab['druid:bc123df4567','true']
Queued

Or from a list of druids (druid.txt):

$ bundler exec rake techmd:generate_for_moab_list
Queued druid:bc123df4567

Background processing

Background processing is performed by Sidekiq.

Sidekiq can be monitored from /queues. For more information on configuring and deploying Sidekiq, see this doc.

Monitoring / statistics

Basic monitoring and statistics are available from /.

Reports

The service includes a Rake task that outputs CSV for files belonging to druids (as specified in an argument to the rake task) if and only if the file has a duration value in its audiovisual metadata. It outputs the druid, the filename, the MIME type, and the duration (in seconds):

$ RAILS_ENV=production bin/rake techmd:reports:media_durations[/tmp/druids.txt]
druid:bk586kk6146,cb147tv8205_pm.wav,audio/x-wav,1683.739
druid:bk586kk6146,cb147tv8205_sh.wav,audio/x-wav,1646.118
druid:bk586kk6146,cb147tv8205_sl.m4a,application/mp4,1646.179
druid:cm856pm4228,gt507vy5436_sl.mp4,application/mp4,3816.201
druid:ck227dm7693,bb761mb4522_FV4298_eng_sl.mp4,application/mp4,621.0
druid:ck227dm7693,bb761mb4522_FV4298_ger_sl.mp4,application/mp4,621.0
druid:ck227dm7693,bb761mb4522_FV4298_v1_sl.mp4,application/mp4,620.72
druid:ck227dm7693,bb761mb4522_FV4298_v2_sl.mov,video/quicktime,620.96
druid:ck227dm7693,bb761mb4522_FV4298_v3_sl.mp4,application/mp4,621.014
druid:ck227dm7693,bb761mb4522_FV4298_v4_sl.mp4,application/mp4,620.96
druid:nr582tm3161,Redivis_GMT20220303-205959_Recording_1920x1186.mp4,application/mp4,3322.912
druid:nr582tm3161,Redivis_GMT20220303-205959_Recording.mp4,application/mp4,3322.912
druid:pf759xf5671,qf378nj5000_sh.mpeg,video/mpeg,2261.04
druid:pf759xf5671,qf378nj5000_sl.mp4,application/mp4,2294.956
druid:rz125dy0428,bw689yg2740_sl.mp4,application/mp4,5080.485

where /tmp/druids.txt looks like:

druid:bk586kk6146
druid:cm856pm4228
foobar
druid:ck227dm7693
druid:nr582tm3161
druid:pf759xf5671
druid:rz125dy0428
druid:bf342vg1682
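The sample input contains a line (`foobar`) that is not a druid, and the sample output has no row for it. How the rake task handles such lines is not stated here; the sketch below shows one plausible approach, filtering the list with a pattern and emitting rows only for files that have a duration. `DRUID_PATTERN`, `valid_druids`, and `duration_rows` are hypothetical names, the pattern is a loose assumption about the druid format, and the rows are joined with plain commas (filenames containing commas would need real CSV quoting).

```ruby
# Loose, assumed druid shape: two letters, three digits, two letters, four digits.
DRUID_PATTERN = /\Adruid:[a-z]{2}\d{3}[a-z]{2}\d{4}\z/

# Keeps only the lines that look like druids, ignoring surrounding whitespace.
def valid_druids(lines)
  lines.map(&:strip).select { |line| DRUID_PATTERN.match?(line) }
end

# records: hashes with :druid, :filename, :mimetype and :duration keys.
# Files without a duration are skipped, mirroring the "if and only if" rule.
def duration_rows(records)
  records.reject { |record| record[:duration].nil? }
         .map { |record| record.values_at(:druid, :filename, :mimetype, :duration).join(',') }
end

valid_druids(["druid:bk586kk6146\n", "foobar\n"])
# => ["druid:bk586kk6146"]
```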

Requirements

Siegfried

Siegfried (version 1.8.0+) is used for file identification.

To install on OS X:

brew install richardlehane/digipres/siegfried

Note that if you are using an earlier version, you may encounter problems as the output format has changed.
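A quick way to guard against an outdated install is to compare the reported version against the 1.8.0 minimum. The helper below is a sketch, not part of this project: `siegfried_version_ok?` is a hypothetical name, and it assumes the version output contains a dotted `x.y.z` version number somewhere in its text.

```ruby
require 'rubygems' # provides Gem::Version; loaded by default in most Ruby setups

MINIMUM_SIEGFRIED_VERSION = Gem::Version.new('1.8.0')

# Extracts the first x.y.z version number from the given text and checks it
# against the minimum. Returns false when no version number is found.
def siegfried_version_ok?(version_output)
  match = version_output.match(/(\d+\.\d+\.\d+)/)
  return false unless match

  Gem::Version.new(match[1]) >= MINIMUM_SIEGFRIED_VERSION
end

siegfried_version_ok?('siegfried 1.9.1')  # => true
siegfried_version_ok?('siegfried 1.7.13') # => false
```

In practice the argument would be the output of Siegfried's version command (assumed here to be `sf -version`). `Gem::Version` is used rather than string comparison so that, for example, 1.10.0 sorts after 1.8.0.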

Exiftool

Exiftool is used for image characterization.

To install on OS X:

brew install exiftool

Poppler

Poppler is used for PDF characterization.

To install on OS X:

brew install poppler

MediaInfo

MediaInfo is used for A/V characterization.

To install on OS X:

brew install mediainfo

Testing

CI build

Spin up the database using docker-compose:

$ docker compose up db # use -d to run in background

Run the linters and the test suite:

$ bin/rake

Integration

Spin up all the docker-compose services for dev/testing:

$ docker compose up # use -d to run in background

Then create the accession workflow for the test object:

$ rails c
> client = Dor::Workflow::Client.new(url: 'http://localhost:3001')
> client.create_workflow_by_name('druid:bc123df4567', 'accessionWF', version: '1')

Get a JWT token for authentication:

bundle exec rake generate_token

Hit the technical-metadata-service's HTTP API:

$ curl -i -H "Authorization: Bearer #{TOKEN}" -H 'Content-Type: application/json' --data '{"druid":"druid:bc123df4567","files":["file:///app/README.md","file:///app/openapi.yml"]}' http://localhost:3000/v1/technical-metadata
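The same request can be built from Ruby with the standard library. This is a minimal sketch mirroring the curl call above (a POST with a bearer token and a JSON body); `build_techmd_request` is a hypothetical helper, and the default base URL simply reuses the localhost address from this section.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds (but does not send) a POST request equivalent to the curl example.
def build_techmd_request(token:, druid:, file_uris:, base_url: 'http://localhost:3000')
  uri = URI.join(base_url, '/v1/technical-metadata')
  request = Net::HTTP::Post.new(uri)
  request['Authorization'] = "Bearer #{token}"
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(druid: druid, files: file_uris)
  request
end

request = build_techmd_request(
  token: 'TOKEN', # placeholder; use the token printed by the generate_token task
  druid: 'druid:bc123df4567',
  file_uris: ['file:///app/README.md', 'file:///app/openapi.yml']
)

# To send it against a running service:
# response = Net::HTTP.start(request.uri.hostname, request.uri.port) { |http| http.request(request) }
```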

Verify that technical metadata was created:

$ docker compose exec app rails c
> DroFile.pluck(:druid, :filename, :mimetype, :filetype)
# should look like: [["druid:bc123df4567", "openapi.yml", "text/plain", "x-fmt/111"], ["druid:bc123df4567", "README.md", "text/markdown", "fmt/1149"]]

And that the object's workflow was updated:

$ rails c
> client = Dor::Workflow::Client.new(url: 'http://localhost:3001')
> client.workflow_status({druid: 'druid:bc123df4567', workflow: 'accessionWF', process: 'technical-metadata'})
# should be "completed"

Run locally

First install foreman (foreman is not supposed to be in the Gemfile; see this wiki article):

gem install foreman

Then you can run:

bin/dev

This starts CSS/JS bundling and the development server.

Docker

Note that this project's continuous integration build will automatically create and publish an updated image whenever there is a passing build from the main branch. If you do need to manually create and publish an image, do the following:

Build image:

docker build -t suldlss/technical-metadata-service:latest -f docker/app/Dockerfile .

Publish:

docker push suldlss/technical-metadata-service:latest

Generating techmd from preservation storage

For details, see https://github.com/sul-dlss/technical-metadata-service/wiki/Generating-techmd-from-preservation-storage

Reset Process (for QA/Stage)

Steps

  1. Reset the database: bin/rails -e p db:reset