Not being able to load data that has been previously saved #6052

Open
VolodyaCO opened this issue Aug 20, 2023 · 6 comments

@VolodyaCO

Description

I am trying to load a Dataset that I previously saved, but I cannot get the data itself back. I know the file is being read because the feature names are loaded, but the raw data is not.

Reproducible example

The following loads some data into a Dataset:

>>> from sklearn.datasets import load_iris
>>> import lightgbm as lgb
>>> data = load_iris()
>>> X = data.data
>>> y = data.target
>>> ds = lgb.Dataset(X, y)
>>> ds.save_binary("example.bin")

Now that I have my data saved, I want to load it:

>>> ds2 = lgb.Dataset("example.bin")
>>> ds2.data
'example.bin'
>>> ds2.get_data()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vladimirvargas/micromamba/envs/real-estate/lib/python3.9/site-packages/lightgbm/basic.py", line 2768, in get_data
    raise Exception("Cannot get data before construct Dataset")
Exception: Cannot get data before construct Dataset

As expected, I cannot call get_data() yet, so I construct the Dataset explicitly:

>>> ds2.construct()
[LightGBM] [Info] Load from binary file example.bin
<lightgbm.basic.Dataset object at 0x10061d6d0>
>>> ds2.data
>>> ds2.get_data()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/vladimirvargas/micromamba/envs/real-estate/lib/python3.9/site-packages/lightgbm/basic.py", line 2787, in get_data
    raise LightGBMError("Cannot call `get_data` after freed raw data, "
lightgbm.basic.LightGBMError: Cannot call `get_data` after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

This is rather unexpected, as I don't know why the raw data was freed in the first place. Let us follow the message's advice, though:

>>> ds2 = lgb.Dataset("example.bin", free_raw_data=False)
>>> ds2.data
'example.bin'
>>> ds2.construct()
[LightGBM] [Info] Load from binary file example.bin
<lightgbm.basic.Dataset object at 0x144e820a0>
>>> ds2.data
'example.bin'
>>> ds2.get_data()
'example.bin'

This is not what I expected: I expected get_data() to return the data itself, not the file path. How can I access the data?

Environment info

LightGBM version or commit hash:
v4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

in a virtual environment.

@VolodyaCO
Author

Also, I've noticed that loading a saved Dataset and then using it for something is not covered by the test at https://github.com/microsoft/LightGBM/blob/v4.0.0/tests/python_package_test/test_basic.py#L247

@jrvmalik

I believe saving the binary doesn't save the raw data. Here, "binary" refers to the binned data and "raw" refers to the data prior to binning (floats, for instance). What if you skip get_data() and call lgb.train({}, ds2) directly? Do you also get an error?

@jrvmalik

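import lightgbm as lgb
import numpy as np

# Small synthetic regression problem
np.random.seed(1)
X = np.random.randn(300, 3)
y = np.random.randn(300)

# Build the Dataset from in-memory arrays, keep the raw data, and save it
data = lgb.Dataset(X, label=y, init_score=np.zeros_like(y), free_raw_data=False).construct()
data.save_binary('data.bin')

# Reload from the binary file and ask for the data
data2 = lgb.Dataset('data.bin', free_raw_data=False).construct()
data2.get_data()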

For what it's worth, this returns just the path to the data.

@VolodyaCO
Author

lgb.train({}, ds2)

Running this gets me

>>> lgb.train({}, ds2, num_boost_round=3)
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000556 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 98
[LightGBM] [Info] Number of data points in the train set: 150, number of used features: 4
[LightGBM] [Info] Start training from score 1.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
<lightgbm.basic.Booster object at 0x100a32160>

I don't think any training happened 🤔

@VolodyaCO
Author

Though, if I do it with the original dataset I get the same output:

>>> lgb.train({}, ds, num_boost_round=3)
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000464 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 98
[LightGBM] [Info] Number of data points in the train set: 150, number of used features: 4
[LightGBM] [Info] Start training from score 1.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
<lightgbm.basic.Booster object at 0x100a18490>

@VolodyaCO
Author

I would suggest emitting a warning when get_data() is called on a Dataset that was loaded from a .bin file, since the raw data cannot be returned in that case.
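
A rough sketch of what such a warning could look like from user code, assuming (as observed earlier in this thread) that get_data() returns the file path when the Dataset came from a file. `get_data_or_warn` and `FakeDataset` are hypothetical names used for illustration; this is not how LightGBM itself behaves.

```python
import os
import warnings


def get_data_or_warn(dataset):
    """Return raw data, or warn and return None if only a file path is held."""
    data = dataset.get_data()
    if isinstance(data, (str, os.PathLike)):
        warnings.warn(
            f"Dataset was created from the file {os.fspath(data)!r}; "
            "raw data is not available.",
            UserWarning,
            stacklevel=2,
        )
        return None
    return data


class FakeDataset:
    """Hypothetical stand-in that mimics only Dataset.get_data()."""

    def __init__(self, data):
        self._data = data

    def get_data(self):
        return self._data
```

Calling `get_data_or_warn(FakeDataset("example.bin"))` emits the warning and returns None, while a stand-in holding in-memory data is returned unchanged.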
