
[BUG] Target parsing for MultiLabelFewShotGPTClassifier (extract_labels and _to_numpy) #114

Open
CarloLepelaars opened this issue Sep 23, 2024 · 1 comment

Comments

@CarloLepelaars

CarloLepelaars commented Sep 23, 2024

Hello, I'm having trouble understanding MultiLabelFewShotGPTClassifier. The dummy example with skllm.datasets.get_multilabel_classification_dataset works fine but it breaks as soon as I start applying it to my own data.

Using DataFrame input for y

I'm using a multi-label DataFrame with 2 targets target_0, target_1. The expected label parsing is:

y_train.columns.tolist()
# ['target_0', 'target_1']

y shape is (5, 2) (pd.DataFrame).

However, when fitting the classifier, extract_labels parses the labels as:

clf.fit(X_train, y_train)
clf.classes_
# ['t', 'a', 'r', 'g', 'e', '_', '0', '1']

X shape is (5,) (pd.Series with strings)

Using list or np.array input for y.

If I instead input y as an array I get the following error coming from _to_numpy:

clf.fit(X_train, y_train)
# --> y = _to_numpy(y)
# ---> X = np.squeeze(X, axis=tuple([i for i in range(1, len(X.shape))]))
# -> return squeeze(axis=axis)
# ValueError: cannot select an axis to squeeze out which has size not equal to one

The same error occurs when converting y to a list (list[list[str]]).

Do you have an idea why the classes are incorrectly parsed, or why it fails when trying to squeeze? I would be happy to work on a PR, but first wanted to figure out whether it's a bug or my input is wrong.

@CarloLepelaars CarloLepelaars changed the title [BUG] Target parsing for MultiLabelFewShotGPTClassifier extract_labels [BUG] Target parsing for MultiLabelFewShotGPTClassifier (extract_labels and _to_numpy) Sep 23, 2024
@AndreasKarasenko
Contributor

AndreasKarasenko commented Sep 24, 2024

To trace your issue back to how scikit-llm works:
Start here, which leads to here.
y_train is a DataFrame, so it matches none of pd.Series, list, or np.ndarray. None of the conversions in _to_numpy apply and it is returned as is.
self.classes_ is then built using self._get_unique_targets(y), which leads you here and, since the classifier is multi-label, then here.

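Since your y is the unaltered DataFrame, you pass a DataFrame to a nested for loop. Iterating over a DataFrame yields its column names, and iterating over each column name yields its characters, which is where the single-letter classes come from.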

from skllm.models.gpt.classification.few_shot import MultiLabelFewShotGPTClassifier
import pandas as pd

X = [
    "I love reading science fiction novels, they transport me to other worlds.", # example 1 - book - sci-fi
    "A good mystery novel keeps me guessing until the very end.", # example 2 - book - mystery
    "Historical novels give me a sense of different times and places.", # example 3 - book - historical
    "I love watching science fiction movies, they transport me to other galaxies.", # example 4 - movie - sci-fi
    "A good mystery movie keeps me on the edge of my seat.", # example 5 - movie - mystery
    "Historical movies offer a glimpse into the past.", # example 6 - movie - historical
]

y = ["books", "books", "books", "movies", "movies", "movies"]
df = pd.DataFrame({"text": X, "label": y})

clf = MultiLabelFewShotGPTClassifier()
clf.fit(df.text, df)
clf.classes_
# > ['t', 'e', 'x', 'l', 'a', 'b']

Note that it does not matter what df.text (the X) contains since the issue is how you pass y.
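To make the mechanism concrete, here is a minimal sketch that reproduces the character-level classes with pandas alone. The function unique_labels_nested is a hypothetical stand-in for a nested unique-label loop, not scikit-llm's actual code, and the label values are made up for illustration.

```python
import pandas as pd


def unique_labels_nested(y):
    """Hypothetical stand-in for a multi-label unique-label loop (not scikit-llm's code)."""
    seen = []
    for row in y:          # iterating a DataFrame yields its column names
        for label in row:  # iterating a str yields single characters
            if label not in seen:
                seen.append(label)
    return seen


# Made-up labels; only the column names matter for the DataFrame case.
y_df = pd.DataFrame({"target_0": ["a", "b"], "target_1": ["c", "d"]})
y_lists = y_df.values.tolist()  # one list of labels per sample

classes_from_df = unique_labels_nested(y_df)
classes_from_lists = unique_labels_nested(y_lists)

print(classes_from_df)     # characters of the column names
print(classes_from_lists)  # the actual label values
```

With the DataFrame the loop only ever sees the column names, so the cell values never reach the result. With a list of per-sample label lists the same loop returns the real labels.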

If you instead pass a list, the issue in your case is that np.asarray(y_list, dtype=object) returns an array of shape (n, 2), because every sample has the same number of labels (note how y in their example can have 2 or 3 items per sample, making it a ragged array of shape (n,)).
Next they squeeze out every axis after the first, and it fails because axis 1 has size 2, not 1.
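The shape difference can be checked with NumPy alone. The sketch below reuses the squeeze expression from the traceback above. The helper name squeeze_trailing and the label values are assumptions made for illustration, not part of scikit-llm.

```python
import numpy as np

uniform = [["a", "b"], ["c", "d"]]      # every sample has 2 labels
ragged = [["a", "b"], ["c", "d", "e"]]  # samples have 2 or 3 labels

uniform_arr = np.asarray(uniform, dtype=object)  # 2-D array of strings
ragged_arr = np.asarray(ragged, dtype=object)    # 1-D array of Python lists


def squeeze_trailing(arr):
    """Squeeze every axis after the first, as in the traceback (hypothetical helper)."""
    return np.squeeze(arr, axis=tuple(range(1, arr.ndim)))


# 1-D input: the axis tuple is empty, so nothing is squeezed.
squeezed_ragged = squeeze_trailing(ragged_arr)

# 2-D input: axis 1 has size 2, so squeeze raises ValueError.
try:
    squeeze_trailing(uniform_arr)
    uniform_error = None
except ValueError as exc:
    uniform_error = str(exc)

print(uniform_arr.shape, ragged_arr.shape, uniform_error)
```

The ragged input survives only because NumPy cannot build a rectangular array from it and falls back to a 1-D array of lists.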

Hope that helps you debug it.

tldr: scikit-llm does not support DataFrames.
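One possible workaround, sketched here with pandas and NumPy only, is to convert the DataFrame into a 1-D object array that holds one list of labels per sample, so the array keeps shape (n,) even when every sample has the same number of labels. The helper to_label_rows is hypothetical, the sketch assumes each cell holds a label string, and it has not been tested against scikit-llm.

```python
import numpy as np
import pandas as pd


def to_label_rows(y_df):
    """Hypothetical helper: one list of labels per sample in a 1-D object array."""
    rows = y_df.values.tolist()
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row  # element-wise assignment keeps each row as a Python list
    return out


# Made-up labels for illustration.
y_df = pd.DataFrame({"target_0": ["a", "b", "a"], "target_1": ["c", "d", "d"]})
y_rows = to_label_rows(y_df)

print(y_rows.shape)  # 1-D, regardless of how many labels each sample has
print(y_rows[0])
```

Whether the classifier then fits and predicts correctly on such an array is an assumption that would need to be verified.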

EDIT: just for the sake of it I commented out lines 25-27 and it at least produces the classes. I have not tested actual prediction though.
