Implement annif upload and annif download commands for Hugging Face Hub integration #762

Merged 39 commits, Apr 23, 2024
Changes from 5 commits
Commits
f6d2b7d Initial functionality for HF Hub upload (juhoinkinen, Feb 1, 2024)
ab5e4bf Use tempfile module and file-like objects for uploads (juhoinkinen, Feb 5, 2024)
d3dd888 Separate files for each project, vocab and config (juhoinkinen, Feb 6, 2024)
9d030c6 Catch also HFValidationError in HFH uploads (juhoinkinen, Feb 6, 2024)
3135114 Initial functionality for HF Hub download (juhoinkinen, Feb 7, 2024)
038d86d Upgrade to huggingface-hub 0.21.* (juhoinkinen, Feb 29, 2024)
5afb251 Drop -projects part from upload/download CLI commands (juhoinkinen, Feb 29, 2024)
13191fc Speed up CLI startup by moving imports in functions (juhoinkinen, Feb 29, 2024)
7666de8 Add --force option to allow overwrite local contents on download (juhoinkinen, Mar 1, 2024)
301d787 Resolve CodeQL complaint about imports (juhoinkinen, Mar 1, 2024)
d5b4abe Restore datafile timestamps after unzipping (juhoinkinen, Mar 4, 2024)
a1e7605 Add comment to zip file with used Annif version (juhoinkinen, Mar 4, 2024)
25a46dc Catch HFH Errors in listing files in repo (juhoinkinen, Mar 4, 2024)
86714d8 Unzip archive contents to used DATADIR (juhoinkinen, Mar 6, 2024)
6ba1e08 Add tests (juhoinkinen, Mar 7, 2024)
4d06be6 Create /.cache/huggingface/ with full access rights in Dockerimage (juhoinkinen, Mar 7, 2024)
a4f0f6f Merge branch 'update-dependencies-v1.1' into issue760-hugging-face-hu… (juhoinkinen, Mar 8, 2024)
7575fff Fix and improve tests and increase coverage (juhoinkinen, Mar 8, 2024)
16bacfb Remove todos (juhoinkinen, Mar 8, 2024)
2952f64 Create /Annif/projects.d/ for tests in Dockerfile (juhoinkinen, Mar 8, 2024)
ed3cf2c Refactor to address quality complains; improve names (juhoinkinen, Mar 8, 2024)
5b16952 Add docstrings (juhoinkinen, Mar 12, 2024)
c87675c Add type hints (juhoinkinen, Mar 12, 2024)
2fe5b73 Update RTD CLI commands page (juhoinkinen, Mar 12, 2024)
d7be137 Remove --revision option of download command (juhoinkinen, Mar 13, 2024)
47f7ee4 Upgrade to huggingface-hub 0.22.* (juhoinkinen, Mar 25, 2024)
a488d07 Revert "Remove --revision option of download command" (juhoinkinen, Mar 26, 2024)
0c57bf2 Preupload lfs files (juhoinkinen, Mar 26, 2024)
df105a3 Fix HF Hub caching in Dockerfile (juhoinkinen, Mar 27, 2024)
d14ff30 Refactor to address quality complains (juhoinkinen, Apr 12, 2024)
cc0c989 Again: Refactor & simplify to address quality complains (juhoinkinen, Apr 12, 2024)
9443c8f Fix typo in mocked filenames in repo (juhoinkinen, Apr 19, 2024)
156bbf5 Detect projects present in repo by .cfg files, not .zip files (juhoinkinen, Apr 19, 2024)
3f60456 Add --revision option to upload command (juhoinkinen, Apr 19, 2024)
2dd359d Enable completion of project_id argument in upload command (juhoinkinen, Apr 19, 2024)
63076cd Adapt test for adding revision option to upload command (juhoinkinen, Apr 19, 2024)
a0a3850 Move functions for HuggingFaceHub interactions to own file (juhoinkinen, Apr 23, 2024)
638aa07 Move unit tests for HuggingFaceHub util fns to own file (juhoinkinen, Apr 23, 2024)
6f35fff Make io import conditional to TYPE_CHECKING (juhoinkinen, Apr 23, 2024)
109 changes: 108 additions & 1 deletion annif/cli.py
@@ -1,13 +1,14 @@
"""Definitions for command-line (Click) commands for invoking Annif
operations and printing the results to console."""


import collections
import importlib
import json
import os.path
import re
import shutil
import sys
from fnmatch import fnmatch

import click
import click_log
@@ -583,6 +584,112 @@
click.echo("---")


@cli.command("upload-projects")
@click.argument("project_ids_pattern")
@click.argument("repo_id")
@click.option(
    "--token",
    help="""Authentication token, obtained from the Hugging Face Hub.
    Will default to the stored token.""",
)
@click.option(
    "--commit-message",
    help="""The summary / title / first line of the generated commit.""",
)
@cli_util.common_options
def run_upload_projects(project_ids_pattern, repo_id, token, commit_message):
    """
    Upload selected projects to a Hugging Face Hub repository
    \f
    This command zips the project directories and vocabularies of the projects
    that match the given `project_ids_pattern`, and uploads the archives along
    with the projects configuration to the specified Hugging Face Hub repository.
    An authentication token and commit message can be given with options.
    """
    projects = [
        proj
        for proj in annif.registry.get_projects(min_access=Access.private).values()
        if fnmatch(proj.project_id, project_ids_pattern)
    ]
    click.echo(f"Uploading project(s): {', '.join([p.project_id for p in projects])}")

    commit_message = (
        commit_message
        if commit_message is not None
        else f"Upload project(s) {project_ids_pattern} with Annif"
    )

    project_dirs = {p.datadir for p in projects}
    vocab_dirs = {p.vocab.datadir for p in projects}
    data_dirs = project_dirs.union(vocab_dirs)

    for data_dir in data_dirs:
        zip_path = data_dir.split(os.path.sep, 1)[1] + ".zip"  # TODO Check this
        fobj = cli_util.archive_dir(data_dir)
        cli_util.upload_to_hf_hub(fobj, zip_path, repo_id, token, commit_message)
        fobj.close()

    for project in projects:
        config_path = project.project_id + ".cfg"
        fobj = cli_util.write_config(project)
        cli_util.upload_to_hf_hub(fobj, config_path, repo_id, token, commit_message)
        fobj.close()
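A minimal sketch of the two name-handling steps in `run_upload_projects`: selecting projects with a wildcard pattern and deriving the archive path inside the repository. The project IDs and the helper names `select_projects` and `zip_path_for` are hypothetical; the real command reads projects from `annif.registry`.

```python
import os.path
from fnmatch import fnmatch

# Hypothetical project IDs, standing in for the keys of the Annif registry
project_ids = ["yso-fi", "yso-en", "tfidf-fi", "ensemble-en"]


def select_projects(pattern, ids):
    """Return the IDs that match a shell-style wildcard pattern."""
    return [pid for pid in ids if fnmatch(pid, pattern)]


def zip_path_for(data_dir):
    """Drop the first path component and append .zip, as the diff does.

    Assumes a relative path with at least two components; a path
    without a separator raises IndexError.
    """
    return data_dir.split(os.path.sep, 1)[1] + ".zip"


print(select_projects("yso-*", project_ids))
print(zip_path_for(os.path.join("data", "projects", "yso-fi")))
```

Note that `fnmatch` normalizes case on Windows, so a pattern can match more IDs there than on Linux; `fnmatch.fnmatchcase` behaves the same on every platform. The path split is probably what the `TODO Check this` comment refers to, since it only works when the data directory is relative and nested.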


@cli.command("download-projects")
@click.argument("project_ids_pattern")
@click.argument("repo_id")
@click.option(
    "--token",
    help="""Authentication token, obtained from the Hugging Face Hub.
    Will default to the stored token.""",
)
@click.option(
    "--revision",
    help="""
    An optional Git revision id which can be a branch name, a tag, or a commit
    hash.
    """,
)
@cli_util.common_options
def run_download_projects(project_ids_pattern, repo_id, token, revision):
    """
    Download selected projects from a Hugging Face Hub repository
    \f
    This command downloads the project and vocabulary archives and the
    configuration files of the projects that match the given
    `project_ids_pattern` from the specified Hugging Face Hub repository and
    unzips the archives to `data/` directory and places the configuration files
    to `projects.d/` directory. An authentication token and revision can
    be given with options.
    """

    project_ids = cli_util.get_selected_project_ids_from_hf_hub(
        project_ids_pattern, repo_id, token, revision
    )
    click.echo(f"Downloading project(s): {', '.join(project_ids)}")

    if not os.path.isdir("projects.d"):
        os.mkdir("projects.d")
    vocab_ids = set()
    for project_id in project_ids:
        project_zip_local_cache_path = cli_util.download_from_hf_hub(
            f"projects/{project_id}.zip", repo_id, token, revision
        )
        cli_util.unzip(project_zip_local_cache_path)
        local_config_cache_path = cli_util.download_from_hf_hub(
            f"{project_id}.cfg", repo_id, token, revision
        )
        vocab_ids.add(cli_util.get_vocab_id(local_config_cache_path))
        shutil.copy(local_config_cache_path, "projects.d")  # TODO Disallow overwrite

    for vocab_id in vocab_ids:
        vocab_zip_local_cache_path = cli_util.download_from_hf_hub(
            f"vocabs/{vocab_id}.zip", repo_id, token, revision
        )
        cli_util.unzip(vocab_zip_local_cache_path)
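The `TODO Disallow overwrite` next to `shutil.copy` could be addressed along these lines. This is only a sketch under assumptions: `copy_without_overwrite` is a hypothetical helper, not something in this diff, and the later `--force` commit in this PR may handle the case differently.

```python
import os
import shutil


def copy_without_overwrite(src_path, dest_dir):
    """Copy src_path into dest_dir, refusing to replace an existing file."""
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    if os.path.exists(dest_path):
        raise FileExistsError(f"{dest_path} already exists")
    # shutil.copy returns the path of the newly created file
    return shutil.copy(src_path, dest_path)
```

The existence check and the copy are two separate steps, so this guards against accidental overwrites rather than against concurrent writers.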


@cli.command("completion")
@click.option("--bash", "shell", flag_value="bash")
@click.option("--zsh", "shell", flag_value="zsh")
95 changes: 94 additions & 1 deletion annif/cli_util.py
@@ -1,18 +1,27 @@
"""Utility functions for Annif CLI commands"""

from __future__ import annotations

import collections
import configparser
import io
import itertools
import os
import pathlib
import sys
import tempfile
import zipfile
from fnmatch import fnmatch
from typing import TYPE_CHECKING

import click
import click_log
from flask import current_app
from huggingface_hub import HfApi, hf_hub_download, list_repo_files
from huggingface_hub.utils import HfHubHTTPError, HFValidationError

import annif
from annif.exception import ConfigurationException
from annif.exception import ConfigurationException, OperationFailedException
from annif.project import Access

if TYPE_CHECKING:
@@ -230,6 +239,90 @@
return list(itertools.product(limits, thresholds))


def _is_train_file(fname):
    train_file_patterns = ("-train", "tmp-")
    for pat in train_file_patterns:
        if pat in fname:
            return True
    return False
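To make the filter's behaviour concrete, here is a small standalone copy of the helper applied to a few made-up file names. The names are assumptions chosen for illustration, not files that Annif is known to produce.

```python
def is_train_file(fname):
    """Standalone mirror of the helper in the diff: substring-based check."""
    train_file_patterns = ("-train", "tmp-")
    return any(pat in fname for pat in train_file_patterns)


# Hypothetical file names in a project data directory
for name in ("vectorizer", "fasttext-train.txt", "tmp-model", "my-training-notes"):
    print(name, is_train_file(name))
```

Because the check is a plain substring test, a name such as `my-training-notes` is also excluded from the archive, which may or may not be intended.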


def archive_dir(data_dir):
    fp = tempfile.TemporaryFile()
    path = pathlib.Path(data_dir)
    fpaths = [fpath for fpath in path.glob("**/*") if not _is_train_file(fpath.name)]
    with zipfile.ZipFile(fp, mode="w") as zfile:
        for fpath in fpaths:
            logger.debug(f"Adding {fpath}")
            zfile.write(fpath)
    fp.seek(0)
    return fp
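The pattern used by `archive_dir` (write a zip into an anonymous temporary file, rewind, hand the file object to the uploader) can be tried in isolation. The sketch below is a variant, not the code from the diff: it skips the training-file filter, stores only regular files, and records paths relative to the parent of the data directory, which is an assumption made here so the output is predictable.

```python
import pathlib
import tempfile
import zipfile


def archive_dir(data_dir):
    """Zip every regular file under data_dir into a temporary file."""
    fp = tempfile.TemporaryFile()
    root = pathlib.Path(data_dir)
    with zipfile.ZipFile(fp, mode="w") as zfile:
        for fpath in sorted(root.glob("**/*")):
            if fpath.is_file():
                # Assumption: store names relative to the parent directory
                zfile.write(fpath, arcname=str(fpath.relative_to(root.parent)))
    fp.seek(0)  # rewind so the caller reads the archive from the start
    return fp


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = pathlib.Path(tmp) / "demo-project"
    demo_dir.mkdir()
    (demo_dir / "vectorizer").write_text("dummy")
    with archive_dir(demo_dir) as fobj, zipfile.ZipFile(fobj) as zfile:
        print(zfile.namelist())
```

Returning an open file object leaves closing to the caller, which is why `run_upload_projects` calls `fobj.close()` after each upload.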


def write_config(project):
    fp = tempfile.TemporaryFile(mode="w+t")
    config = configparser.ConfigParser()
    config[project.project_id] = project.config
    config.write(fp)  # This needs tempfile in text mode
    fp.seek(0)
    # But for upload fobj needs to be in binary mode
    return io.BytesIO(fp.read().encode("utf8"))
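The text-to-binary conversion in `write_config` can also be done without a temporary file. A possible sketch, assuming the project settings are available as a plain dict; `config_to_bytes` and the sample settings are hypothetical names used only for illustration.

```python
import configparser
import io


def config_to_bytes(project_id, settings):
    """Serialize one config section and return it as a binary file object."""
    config = configparser.ConfigParser()
    config[project_id] = settings
    text_buffer = io.StringIO()
    config.write(text_buffer)  # ConfigParser.write() expects a text-mode file
    # Encode the text so the result can be passed where bytes are required
    return io.BytesIO(text_buffer.getvalue().encode("utf8"))


fobj = config_to_bytes("demo", {"name": "Demo project", "vocab": "yso"})
print(fobj.read().decode("utf8"))
```

Both approaches end with a `BytesIO` positioned at the start, so the upload call sees the same content either way.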


def upload_to_hf_hub(fileobj, filename, repo_id, token, commit_message):
    api = HfApi()
    try:
        api.upload_file(
            path_or_fileobj=fileobj,
            path_in_repo=filename,
            repo_id=repo_id,
            token=token,
            commit_message=commit_message,
        )
    except (HfHubHTTPError, HFValidationError) as err:
        raise OperationFailedException(str(err))


def get_selected_project_ids_from_hf_hub(project_ids_pattern, repo_id, token, revision):
    all_repo_file_paths = _list_files_in_hf_hub(repo_id, token, revision)
    return [
        path.rsplit(".zip")[0].split("projects/")[1]  # TODO Try-catch this
        for path in all_repo_file_paths
        if fnmatch(path, f"projects/{project_ids_pattern}.zip")
    ]
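The ID extraction marked `TODO Try-catch this` relies on `rsplit(".zip")` without a split limit, so a project ID that itself contains `.zip` would be cut short. A hedged alternative is to slice off the known prefix and suffix after the `fnmatch` test has confirmed both are present. The helper name and the repository listing below are made up for illustration.

```python
from fnmatch import fnmatch


def select_project_ids(repo_paths, pattern):
    """Extract project IDs from paths of the form projects/<id>.zip."""
    prefix, suffix = "projects/", ".zip"
    return [
        # fnmatch guarantees the literal prefix and suffix are present
        path[len(prefix):-len(suffix)]
        for path in repo_paths
        if fnmatch(path, f"{prefix}{pattern}{suffix}")
    ]


# Hypothetical listing of files in a Hub repository
repo_paths = [
    "projects/yso-fi.zip",
    "projects/yso-en.zip",
    "vocabs/yso.zip",
    "yso-fi.cfg",
    "projects/tfidf-fi.zip",
]
print(select_project_ids(repo_paths, "yso-*"))
```

With slicing there is no indexing into a split result, so no `IndexError` can occur for paths that passed the filter.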


def _list_files_in_hf_hub(repo_id, token, revision):
    return [
        repofile
        for repofile in list_repo_files(repo_id=repo_id, token=token, revision=revision)
    ]


def download_from_hf_hub(filename, repo_id, token, revision):
    try:
        return hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            token=token,
            revision=revision,
        )
    except (HfHubHTTPError, HFValidationError) as err:
        raise OperationFailedException(str(err))


def unzip(source_path):
    with zipfile.ZipFile(source_path, "r") as zfile:
        zfile.extractall()  # TODO Disallow overwrite
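One way the `TODO Disallow overwrite` in `unzip` might be handled is to check the archive's member names against the destination before extracting anything. This is a sketch only: `unzip_without_overwrite` and its `dest_dir` parameter are hypothetical and not part of this diff.

```python
import os
import zipfile


def unzip_without_overwrite(source_path, dest_dir="."):
    """Extract an archive only if none of its files already exist."""
    with zipfile.ZipFile(source_path, "r") as zfile:
        existing = [
            name
            for name in zfile.namelist()
            if os.path.isfile(os.path.join(dest_dir, name))
        ]
        if existing:
            raise FileExistsError(f"Refusing to overwrite: {', '.join(existing)}")
        zfile.extractall(dest_dir)
```

Checking everything up front means a conflict leaves the destination untouched instead of half-extracted.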


def get_vocab_id(config_path):
    config = configparser.ConfigParser()
    config.read(config_path)
    section = config.sections()[0]
    return config[section]["vocab"]
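`get_vocab_id` reads the `vocab` setting from the first section of a downloaded config file. The sketch below shows the same lookup on an in-memory string, with an explicit error when no section exists. The function name and the sample config are assumptions for illustration.

```python
import configparser


def get_vocab_id_from_text(config_text):
    """Return the vocab setting of the first section in a config string."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    sections = config.sections()
    if not sections:
        raise ValueError("configuration has no project section")
    return config[sections[0]]["vocab"]


# Hypothetical single-project configuration
sample = "[demo-project]\nname = Demo\nlanguage = en\nvocab = yso\n"
print(get_vocab_id_from_text(sample))
```

The explicit check matters for the file-based version too: `ConfigParser.read()` silently skips files it cannot open, in which case `config.sections()[0]` fails with a bare `IndexError`.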


def _get_completion_choices(
    param: Argument,
) -> dict[str, AnnifVocabulary] | dict[str, AnnifProject] | list:
7 changes: 7 additions & 0 deletions docs/source/commands.rst
@@ -66,6 +66,13 @@ Project administration

N/A

.. click:: annif.cli:run_upload_projects
   :prog: annif upload-projects

**REST equivalent**

N/A

****************************
Subject index administration
****************************
1 change: 1 addition & 0 deletions pyproject.toml
@@ -48,6 +48,7 @@ python-dateutil = "2.8.*"
tomli = { version = "2.0.*", python = "<3.11" }
simplemma = "0.9.*"
jsonschema = "4.17.*"
huggingface-hub = "0.20.*"

fasttext-wheel = {version = "0.9.2", optional = true}
voikko = {version = "0.5.*", optional = true}