
[Hardware] Initial TPU integration #5292

Merged
193 commits merged from torch-xla into main on Jun 12, 2024

Conversation

@WoosukKwon (Collaborator) commented on Jun 5, 2024

This PR implements the initial integration of the Google TPU backend. It uses PyTorch XLA to maximize reuse of the existing code base.

The PR features:

  • Seamless support for popular HF models such as Llama, Mistral, and Gemma. The model's head size must be either 128 or 256.
  • Basic functionalities of vLLM, including continuous batching
  • Optimized Pallas kernels for FlashAttention and PagedAttention
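The head-size constraint above can be illustrated with a small sketch. This is not vLLM code; the function name and error message are hypothetical, and only the supported sizes (128 and 256) come from the PR description.

```python
# Hypothetical sketch of the head-size constraint stated in this PR.
# The helper name and message are illustrative, not part of vLLM's TPU backend.
SUPPORTED_HEAD_SIZES = {128, 256}

def check_tpu_head_size(hidden_size: int, num_attention_heads: int) -> int:
    """Return the per-head size if it is supported on the TPU backend."""
    head_size = hidden_size // num_attention_heads
    if head_size not in SUPPORTED_HEAD_SIZES:
        raise ValueError(
            f"Head size {head_size} is not supported on TPU; "
            f"expected one of {sorted(SUPPORTED_HEAD_SIZES)}.")
    return head_size

# Llama-2-7B style config: hidden_size=4096, 32 heads -> head size 128.
print(check_tpu_head_size(4096, 32))  # -> 128
```

For example, a model with hidden size 4096 and 64 heads would have head size 64 and be rejected under this constraint.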

TODOs (next steps):

  • (Fast) top-p sampling (disabled for now due to performance issues)
  • Distributed (tensor-parallel) inference
  • INT8 quantization
  • MoE
  • Support best_of > 1
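To make the PagedAttention feature above concrete, here is a toy sketch of the block-table bookkeeping the technique is based on: each sequence's KV cache lives in fixed-size physical blocks, and a per-sequence table maps logical blocks to physical ones. All names and sizes here are illustrative; none of this is vLLM's actual implementation.

```python
# Toy sketch (not vLLM code) of PagedAttention-style block-table bookkeeping.
BLOCK_SIZE = 16  # tokens per KV-cache block; illustrative value

class ToyBlockTable:
    def __init__(self, num_physical_blocks: int = 64):
        # Pool of free physical block ids; illustrative pool size.
        self.free_blocks = list(range(num_physical_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, position: int) -> list:
        """Allocate a new physical block when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first token of a new logical block
            table.append(self.free_blocks.pop())
        return table

pool = ToyBlockTable()
for pos in range(40):  # 40 tokens span ceil(40/16) = 3 logical blocks
    table = pool.append_token("seq0", pos)
print(len(table))  # -> 3
```

Because blocks are allocated on demand, sequences of different lengths can be batched together without padding each to the longest sequence, which is what makes continuous batching cheap.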

@WoosukKwon (Collaborator, Author) commented:
@alanwaketan Please take a look!

@WoosukKwon WoosukKwon changed the title [WIP][Hardware] Initial TPU integration [Hardware] Initial TPU integration Jun 11, 2024
@WoosukKwon WoosukKwon marked this pull request as ready for review June 11, 2024 17:47
@WoosukKwon WoosukKwon requested a review from JackCaoG June 11, 2024 17:57
@JackCaoG left a comment:

LGTM, thanks!

@rkooo567 (Collaborator) left a comment:

Mostly asking for comments clarifying some of the code!

For testing, are we planning to add relevant CI in the future?

Review comments were left on:

  • vllm/worker/tpu_worker.py
  • vllm/worker/tpu_model_runner.py
@WoosukKwon (Collaborator, Author) replied:
@rkooo567 Thanks for the quality review!

@WoosukKwon WoosukKwon merged commit 1a8bfd9 into main Jun 12, 2024
20 of 24 checks passed
@WoosukKwon WoosukKwon deleted the torch-xla branch June 12, 2024 18:53
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 16, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Labels
tpu Related to Google TPUs
3 participants