
SWIFT 2.4 TO DO LIST #1617

Open
tastelikefeet opened this issue Aug 7, 2024 · 10 comments
Comments

@tastelikefeet (Collaborator)
Dataset

  1. Refactor the self-cognition dataset to support multilingual QAs.

Megatron PreTrain

  1. Support more Megatron models
  2. Support dataset split

Fine-tuning

  1. RAG LLM training investigation

RLHF

  1. PPO training investigation

Multi-modal

  1. GPTQ/AWQ quantization
  2. vLLM inference

Inference&Deployment

  1. PyTorch batch inference
  2. DeepSpeed-Zero inference investigation
  3. Output logits

WEB-UI

  1. Video/Audio chatbot
@tastelikefeet tastelikefeet pinned this issue Aug 7, 2024
@WSC741606
Hoping for Megatron support for 01.AI's Yi-1.5 series. Thanks!

@WSC741606
There is also the multi-node, multi-GPU dataset-loading problem: network jitter on the NFS mount causes the local cache to fail to load. For now I patched `def _msdataset_ddp_load(*args, **kwargs):` in swift/llm/utils/utils.py to:

    def _msdataset_ddp_load(*args, **kwargs):
        # Retry indefinitely until the dataset loads successfully.
        while True:
            try:
                with safe_ddp_context():
                    return _old_msdataset_load(*args, **kwargs)
            except Exception:
                pass  # NFS hiccup; try again

Hoping for a more elegant solution.
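A slightly more graceful variant of the retry above, as a sketch: it caps the number of attempts and backs off between them, so a persistent failure eventually surfaces instead of looping forever. The `max_attempts` and `base_delay` values are arbitrary, and catching `OSError` assumes NFS jitter surfaces as an I/O error; the loader would be `_old_msdataset_load` wrapped in `safe_ddp_context()` as in the patch.

```python
import time

def load_with_retry(load_fn, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call load_fn, retrying with exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return load_fn(*args, **kwargs)
        except OSError:  # NFS jitter typically surfaces as an I/O error
            if attempt == max_attempts:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

With this helper, the patched function body becomes a single call, and a genuinely broken mount raises a real traceback after the final attempt rather than spinning silently.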

@WSC741606
Also, it would be nice if datasets could be given a tag on the command line, with the loss then computed per tag, e.g. general-data loss, code-data loss, domain-data loss, so each can be tracked in TensorBoard.
A reference implementation idea I saw:

    # Per-channel loss tracking inside the training loop.
    # model, train_dataloader, device, world_size, writer, epoch,
    # num_batches, to_device, print_rank_0 and args come from the
    # surrounding training script.
    import torch.distributed as dist

    channel_loss = {}
    for step, batch in enumerate(train_dataloader):
        batch = to_device(batch, device)
        channel = batch['channel'][0]  # one channel tag per batch

        del batch['channel']  # the model must not see the tag
        outputs = model(**batch)
        loss = outputs.loss

        # Accumulate [loss_sum, count] for this rank's channel
        if channel in channel_loss:
            channel_loss[channel][0] += loss.item()
            channel_loss[channel][1] += 1
        else:
            channel_loss[channel] = [loss.item(), 1]

        # Gather every rank's per-channel accumulators
        all_channel_loss = [None for _ in range(world_size)]
        dist.all_gather_object(all_channel_loss, channel_loss)

        # Merge the gathered dicts into global [loss_sum, count] pairs
        merged_channel_loss = {}
        for lst in all_channel_loss:
            for k, v in lst.items():
                if k in merged_channel_loss:
                    merged_channel_loss[k][0] += v[0]
                    merged_channel_loss[k][1] += v[1]
                else:
                    merged_channel_loss[k] = [v[0], v[1]]

        for k, v in merged_channel_loss.items():
            avg_loss = v[0] / v[1] if v[1] != 0 else 0.0
            print_rank_0("The Channel {} loss is {}".format(k, avg_loss), args.global_rank)

            # Log channel loss to TensorBoard (rank 0 only)
            if dist.get_rank() == 0:
                writer.add_scalar(f'Loss/channel_{k}', avg_loss, epoch * num_batches + step)

        channel_loss = {}  # reset so each step logs a fresh per-step average
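Stripped of the distributed plumbing, the core of the idea above is just merging per-rank `{channel: [loss_sum, count]}` accumulators into global averages. A minimal single-process sketch of that merge step (the function name is illustrative, not from any library):

```python
def merge_channel_losses(per_rank_losses):
    """Merge per-rank {channel: [loss_sum, count]} dicts into
    a {channel: average_loss} dict."""
    merged = {}
    for rank_losses in per_rank_losses:
        for channel, (loss_sum, count) in rank_losses.items():
            total = merged.setdefault(channel, [0.0, 0])
            total[0] += loss_sum
            total[1] += count
    # Skip channels with zero count to avoid division by zero
    return {ch: s / n for ch, (s, n) in merged.items() if n}
```

In the distributed snippet, `all_gather_object` produces exactly such a list of per-rank dicts, so the merge loop there computes the same result as this helper.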

@WSC741606
There is also the long-standing DDP+MP problem. Also, the logs report MP; is there any chance this could evolve into PP? With naive MP the pipeline bubbles are just too long. I have not managed to run it successfully, though, so I am not sure whether it has already been optimized.

@Jintao-Huang (Collaborator)

> There is also the long-standing DDP+MP problem. Also, the logs report MP; is there any chance this could evolve into PP? With naive MP the pipeline bubbles are just too long. I have not managed to run it successfully, though, so I am not sure whether it has already been optimized.

This device_map is mainly used to save GPU memory. If you want PP, you can use deepspeed; if you want TP, you will probably need to wait for megatron.

@WSC741606
> This device_map is mainly used to save GPU memory. If you want PP, you can use deepspeed; if you want TP, you will probably need to wait for megatron.

Got it, thanks!

@beamind
beamind commented Aug 16, 2024

Please support training RM (reward model) models.

@WSC741606
> There is also the multi-node, multi-GPU dataset-loading problem: network jitter on the NFS mount causes the local cache to fail to load. Hoping for a more elegant solution.

Solved.

@PancakeAwesome
Please support qwenvl2 / internvl2 vLLM multi-image and video inference. Thanks!

@ljqnb

ljqnb commented Sep 10, 2024

Please support PPO! Thanks
