- Python 3.10 or higher
- PyTorch
- bitsandbytes
- flash_attn
- datasets
- transformers
- peft
- trl
Install the required packages using:

```bash
pip install -r requirements.txt
```
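Note that flash_attn is compiled against your local CUDA and PyTorch setup and can fail to build as part of a bulk install. If that happens, installing it separately, as its own documentation recommends, usually helps:

```bash
pip install flash-attn --no-build-isolation
```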
```
.
├── app/
├── code/
│   ├── img/
│   └── notebook/
│       ├── BioQwen-manuscript.ipynb
│       └── README.md
├── result/
│   ├── ner/
│   │   ├── bc5cdr/
│   │   ├── cmeee/
│   │   └── ncbi/
│   └── qa/
│       ├── cmedqa2/
│       ├── icliniq/
│       └── webmedqa/
├── script/
│   ├── README.md
│   ├── script_stage1.py
│   └── script_stage2.py
├── README.md
└── requirements.txt
```
The app/ directory contains the download link for the BioQwen mobile deployment APK file.
- img/: Contains image files related to the project.
- notebook/: Contains Jupyter notebooks for detailed exploration and documentation.
  - BioQwen-manuscript.ipynb: Notebook detailing the model training and inference process.
  - README.md: Documentation for the notebook.
- ner/: Results and outputs related to Named Entity Recognition tasks.
  - bc5cdr/: Results for the BC5CDR dataset.
  - cmeee/: Results for the CMEEE dataset.
  - ncbi/: Results for the NCBI-DISEASE dataset.
- qa/: Results and outputs related to Question Answering tasks.
  - cmedqa2/: Results for the cMedQA2 dataset.
  - icliniq/: Results for the iCliniq dataset.
  - webmedqa/: Results for the WebMedQA dataset.
- README.md: Documentation for the scripts.
- script_stage1.py: Script for Stage 1 training, extracted from BioQwen-manuscript.ipynb.
- script_stage2.py: Script for Stage 2 training, extracted from BioQwen-manuscript.ipynb.
The top-level README.md is this file; it provides an overview of and instructions for the repository.
The requirements.txt file lists the dependencies required to run the scripts and notebooks in this repository. Some dependencies may be missing from this list; feel free to open an issue to report them.
The training is performed in two stages using the scripts provided. Both stages follow the same outline (a minimal setup sketch is shown after the list):
1. Load and Filter Data:
   - Load datasets from various sources.
   - Filter and preprocess the data.
   - Tokenize the data with language checks.
2. Model Setup:
   - Load the pre-trained model and tokenizer.
   - Configure BitsAndBytes for efficient training.
   - Prepare the model for QLoRA/LoRA training.
3. Training Configuration:
   - Define training arguments.
   - Create a Trainer instance and start training.
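As a reference, here is a minimal sketch of what this three-step setup might look like with datasets, transformers, peft, and bitsandbytes. The base checkpoint, dataset path, "text" column, and all hyperparameters below are illustrative assumptions, not values taken from this repository; see script/script_stage1.py and script/script_stage2.py for the actual configuration.

```python
# Minimal sketch of the training outline above (placeholder names and values).
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "Qwen/Qwen1.5-0.5B"   # placeholder base checkpoint
DATA_PATH = "data/stage1.jsonl"    # placeholder dataset file
OUTPUT_DIR = "output/stage1"       # placeholder output directory

# 1. Load and filter data, then tokenize (the real scripts also apply language checks).
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset("json", data_files=DATA_PATH, split="train")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

# 2. Model setup: 4-bit quantization via bitsandbytes, then QLoRA adapters via peft.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, trust_remote_code=True
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    ),
)

# 3. Training configuration: define the arguments, build a Trainer, and train.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```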
To run the training on a single GPU, use the `python` command:

```bash
python script/script_stage1.py
python script/script_stage2.py
```
To run the training on multiple GPUs, use the `torchrun` command:

```bash
torchrun --nproc_per_node=NUM_GPUS script/script_stage1.py
torchrun --nproc_per_node=NUM_GPUS script/script_stage2.py
```

Replace `NUM_GPUS` with the number of GPUs available on your machine.
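For example, on a machine with four GPUs:

```bash
torchrun --nproc_per_node=4 script/script_stage1.py
torchrun --nproc_per_node=4 script/script_stage2.py
```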
For any questions or further information, please submit an issue on this repository.