BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation

Code to reproduce experiments in:

@article{BelhdImprovinGarda2024,
  archiveprefix = {arXiv},
  author = {Garda, Samuele and Leser, Ulf},
  eprint = {2401.05125v1},
  month = {Jan},
  primaryclass = {cs.CL},
  title = {BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation},
  url = {http://arxiv.org/abs/2401.05125v1},
  year = {2024},
}

Setup

Install the belb library in your python environment:

git clone https://github.com/sg-wbi/belb
cd belb
pip install -e .

Then install the additional requirements specific to BELHD:

(belhd) user $ pip install -r requirements.txt

Results

The predictions of all models and the gold labels are stored in the data directory. The commands below reproduce all tables reported in the paper.

BELB

Reproduce the results on BELB.

Main table:

(belhd) user $ python -m scripts.evaluate 

BELHD ablations:

(belhd) user $ python -m scripts.evaluate_ablations

Ad-hoc solutions for homonyms. Abbreviations:

(belhd) user $ python -m scripts.evaluate_ar

and species assignment:

(belhd) user $ python -m scripts.evaluate_sa 

BioRED

(belhd) user $ python -m biored.evaluate 

Run

If you wish to use our code with BELB, you first need to follow the belb instructions to set up a directory with all the data (corpora and KBs).

Homonym Disambiguation

To create KB versions with disambiguated homonyms:

(belhd) user $ python -m scripts.disambiguate_kbs --dir /path/to/belb/dir

Note that belb deals with large KBs and its code is not optimized. This step takes quite a while, especially for NCBI Gene.
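For orientation, a homonym here is a name shared by more than one KB entry. The sketch below only illustrates how such names can be detected on a toy dictionary. It is not the logic of scripts.disambiguate_kbs, and the function name, the data layout, and the identifiers are hypothetical.

```python
from collections import defaultdict


def find_homonyms(kb):
    """Return the names that refer to more than one identifier.

    `kb` maps an entity identifier to its list of names. This is a toy
    stand-in for a knowledge base, not the data model used by belb.
    """
    ids_by_name = defaultdict(set)
    for identifier, names in kb.items():
        for name in names:
            # Compare names case-insensitively (an assumption of this sketch).
            ids_by_name[name.lower()].add(identifier)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


# Hypothetical identifiers; "p53" is shared by two entries.
toy_kb = {
    "ID:1": ["TP53", "p53"],
    "ID:2": ["Trp53", "p53"],
    "ID:3": ["BRCA1"],
}
homonyms = find_homonyms(toy_kb)
print(homonyms)
```

Names returned by such a check are the ones that cannot be linked unambiguously by string alone, which is the case the disambiguated KB versions are meant to address.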

BELHD

To train BELHD, you first need to convert the BELB data into the required input format.

Edit data/configs/data.yaml:

belb_dir : 'path/to/belb/directory'
exp_dir : 'path/to/experiments/directory'
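Since the later steps depend on these two paths, it can help to check them before starting a long run. The following is a minimal sketch, assuming the file contains only flat `key : 'value'` lines as shown above; the helper names are hypothetical and the parser is deliberately not a general YAML parser.

```python
from pathlib import Path


def read_flat_config(text):
    """Parse flat `key : 'value'` lines (not a general YAML parser)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip().strip("'\"")
    return config


def missing_dirs(config, keys=("belb_dir", "exp_dir")):
    """Return the keys whose value is unset or not an existing directory."""
    return [
        key
        for key in keys
        if not config.get(key) or not Path(config[key]).is_dir()
    ]


# The placeholder values from above; replace them with real paths.
example = """
belb_dir : 'path/to/belb/directory'
exp_dir : 'path/to/experiments/directory'
"""
config = read_flat_config(example)
print(missing_dirs(config))
```

With the placeholder values left unchanged, both keys are reported as missing, which is the signal to edit the file before preparing the data.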

Prepare data with:

(belhd) user $ python -m scripts.tokenize_corpora

and

(belhd) user $ python -m scripts.tokenize_dkbs

Then you can use the helper scripts bin/train.sh to train the models and bin/predict.sh to obtain the predictions for each corpus.
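The steps described so far can be collected in the order in which they appear in this README. The sketch below only builds and prints the commands. It assumes that the helper scripts are invoked with bash and take no arguments, which is not stated here, and the function name is hypothetical.

```python
import shlex


def preparation_commands(belb_dir):
    """List the commands from this README in the order they are described."""
    return [
        ["python", "-m", "scripts.disambiguate_kbs", "--dir", belb_dir],
        ["python", "-m", "scripts.tokenize_corpora"],
        ["python", "-m", "scripts.tokenize_dkbs"],
        # Assumption: the helper scripts run under bash without arguments.
        ["bash", "bin/train.sh"],
        ["bash", "bin/predict.sh"],
    ]


commands = preparation_commands("/path/to/belb/dir")
for command in commands:
    # Print instead of executing, so nothing long-running starts by accident.
    print(shlex.join(command))
```

To actually run the steps, each command could be passed to `subprocess.run(command, check=True)` so that a failing step stops the sequence.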

BELHD Ablations

Run the scripts bin/train_ablations.sh and bin/predict_ablations.sh.

Ad-hoc solutions for homonyms

You first need to train BELHD without HD and with abbreviation resolution (bin/train_nohd.sh) and then obtain the predictions (bin/predict_nohd.sh). This requires a version of the data with abbreviation resolution, which you create with:

(belhd) user $ python -m scripts.tokenize_corpora abbres=true

Similarly, you need to rerun the baselines with abbreviation resolution. Gene corpora with species assignment are stored in ./data/belb/species_assign (see SpeciesAssignment.md for details).

Baselines

For each baseline we use the original code. We provide detailed instructions on how to run them in separate files.
