
This will solve CPU-only, CUDA-only and any mix of them. #98

Merged: 2 commits, May 7, 2024

Conversation

@AlessandroFlati (Contributor) commented May 6, 2024

This solves the post-fix_symbolic problem with CUDA, the initialize_from_another_model problem with CUDA, and the related CPU problem (already mentioned in this PR) that forced users to use CUDA.
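For context, a minimal sketch of the calls this PR touches (assuming the device keyword added here on the KAN constructor; the width/grid values, the target function f, and the fix_symbolic indices are just placeholders):

```python
import torch
from kan import KAN, create_dataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the model directly on the chosen device (device keyword from this PR)
model = KAN(width=[2, 5, 1], grid=5, k=3, device=device)

# Placeholder target function and dataset, moved to the same device
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)
for key in ['train_input', 'train_label', 'test_input', 'test_label']:
    dataset[key] = dataset[key].to(device)

model.train(dataset, opt="LBFGS", steps=20, device=device.type)

# Calls that previously failed on CUDA:
model.fix_symbolic(0, 0, 0, 'sin')  # placeholder layer/indices/function
model2 = KAN(width=[2, 5, 1], grid=10, k=3, device=device)
model2.initialize_from_another_model(model, dataset['train_input'])
```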

@AlessandroFlati (Contributor, Author)

@KindXiaoming This should close many issues related to using CUDA. For it to work properly, I recommend updating requirements.txt to the following:

matplotlib==3.6.2
numpy==1.26.4
scikit-learn==1.4.2
setuptools==69.5.1
sympy==1.11.1
torch==2.2.2
tqdm==4.66.2

Please let me know whether you want me to make another PR or you'd rather handle this yourself.

@Jim137 (Contributor) commented May 6, 2024

There's another missing device argument in https://github.com/KindXiaoming/pykan/blob/master/kan/KAN.py#L205. I've addressed it in my fork at https://github.com/Jim137/pykan/tree/develop. Would you be open to merging my changes and submitting a pull request together?

@AlessandroFlati (Contributor, Author)

Good point, I added it.

@brainer3220 commented May 7, 2024

I don't know why, but if I use MPS (Apple Silicon), the loss is nan.

model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device.type);
train loss: nan | test loss: nan | reg: nan : 100%|█████████████████| 20/20 [00:03<00:00,  5.11it/s]

@AlessandroFlati (Contributor, Author)

@brainer3220 I'm afraid I can't help much with MPS, but it nonetheless seems to be a common issue between MPS and Torch (see pytorch/pytorch#112834, for example).

@rajdeepbanerjee-git commented May 7, 2024

I am trying to run the given KAN example in Colab with the AlessandroFlati:develop implementation:
[screenshot of the error]

Still getting the above error. I used the following requirements:
matplotlib==3.6.2
numpy==1.26.4
scikit-learn==1.4.2
setuptools==69.5.1
sympy==1.11.1
torch==2.2.1
tqdm==4.66.2

If I instead try to run on CPU, it says no NVIDIA drivers were found.

Any help to resolve this is appreciated. Thanks!

@SimoSbara mentioned this pull request May 7, 2024

@SimoSbara

> I am trying to run the given KAN example in Colab with the AlessandroFlati:develop implementation: [...] In case I want to run on CPU, it says no NVIDIA drivers were found.

First you need to initialize a torch.device, like this:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Then pass the device to every constructor with the device=device argument.

Finally, you will need to put the dataset tensors on the device by doing this:

dataset['train_input'] = dataset['train_input'].to(device)
dataset['train_label'] = dataset['train_label'].to(device)
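Putting those three steps together, a minimal end-to-end sketch (assuming the device keyword from this PR on the KAN constructor; the width/grid values and the target function f are placeholders):

```python
import torch
from kan import KAN, create_dataset

# Step 1: initialize a torch.device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Step 2: pass the device to the constructor (keyword added by this PR)
model = KAN(width=[2, 5, 1], grid=5, k=3, device=device)

# Step 3: move the dataset tensors to the same device
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)  # placeholder target
dataset = create_dataset(f, n_var=2)
dataset['train_input'] = dataset['train_input'].to(device)
dataset['train_label'] = dataset['train_label'].to(device)
dataset['test_input'] = dataset['test_input'].to(device)
dataset['test_label'] = dataset['test_label'].to(device)

model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10., device=device.type)
```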

@rajdeepbanerjee-git

Thanks, now I am able to run on colab GPU. But the CPU problem persists.

@SimoSbara

> Thanks, now I am able to run on colab GPU. But the CPU problem persists.

This pull request solves it; you can try modifying pykan as in these commits:
d606bd8
c857dd6

I had the same problem #75.

@KindXiaoming (Owner)

Hi @AlessandroFlati, I would appreciate it if you made another PR for me! Thanks in advance :)

@alpaca202204

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device(type='cuda')

print(torch.cuda.is_available())
# True

model.to(device)

dataset['train_input'] = dataset['train_input'].to(device)
dataset['train_label'] = dataset['train_label'].to(device)

but there is still a problem:

--> 170 x = torch.einsum('ij,k->ikj', x, torch.ones(self.out_dim, device=self.device)).reshape(batch, self.size).permute(1, 0)
171 preacts = x.permute(1, 0).clone().reshape(batch, self.out_dim, self.in_dim)
172 base = self.base_fun(x).permute(1, 0) # shape (batch, size)

File E:\anaconda\envs\4torch2\lib\site-packages\torch\functional.py:380, in einsum(*args)
375 return einsum(equation, *_operands)
377 if len(operands) <= 2 or not opt_einsum.enabled:
378 # the path for contracting 0 or 1 time(s) is already optimized
379 # or the user has disabled using opt_einsum
--> 380 return _VF.einsum(equation, operands) # type: ignore[attr-defined]
382 path = None
383 if opt_einsum.is_available():

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

@AlessandroFlati (Contributor, Author)

You shouldn't just call model.to(device); rather, create both the model and the dataset passing the device=device argument. Besides, you're missing the test_input and test_label keys of the dataset; see the sketch below.
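A minimal sketch of the corrected setup (assuming the device keyword from this PR; the width/grid values and the target function are placeholders):

```python
import torch
from kan import KAN, create_dataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create the model on the device up front instead of calling model.to(device) afterwards
model = KAN(width=[2, 5, 1], grid=5, k=3, device=device)  # placeholder width/grid

# Move all four dataset tensors, not just the training ones
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)
for key in ['train_input', 'train_label', 'test_input', 'test_label']:
    dataset[key] = dataset[key].to(device)
```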
