Is TextGenerationEvaluator incomplete? #537

Open · inf3rnus opened this issue Jan 20, 2024 · 5 comments

inf3rnus commented Jan 20, 2024

Hey all, fascinating library you got going on here.

I was trying to get a working example of computing perplexity on EleutherAI/lambada_openai with gpt2.

Unfortunately, the only way I could get it working was by doing something like:

from transformers import (
    AutoTokenizer,
    pipeline as trans_pipeline,
)
import evaluate
from datasets import load_dataset

task = "text-generation"

task_evaluator = evaluate.evaluator(task)

dataset_name = "EleutherAI/lambada_openai"


data = load_dataset(dataset_name, split="test").shuffle(seed=42).select(range(10))

model = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model)

pipe = trans_pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # accelerator="ort",
)

perplexity = evaluate.load("perplexity", module_type="metric")


references = data["text"]

predictions = pipe(references)

predictions = list(map(lambda prediction: prediction[0]["generated_text"], predictions))

perplexity.add_batch(predictions=predictions, references=references)

value = perplexity.compute(model_id="gpt2")

print("Perplexity is: ", value)

Okay, so that works. What doesn't work is the task evaluator class for text generation. This code throws:

import evaluate
from datasets import load_dataset

task = "text-generation"

task_evaluator = evaluate.evaluator(task)

dataset_name = "EleutherAI/lambada_openai"


data = load_dataset(dataset_name, split="test").shuffle(seed=42).select(range(10))

model = "gpt2"


eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    metric="perplexity",
    # label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

print(eval_results)

I get the error:

Exception has occurred: ValueError
Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.

And this is likely because `compute()` in `evaluator/base.py` needs the metric inputs returned from `self.prepare_data()` in the `TextGenerationEvaluator` class, but the first element of the tuple returned here is empty:

    def prepare_data(self, data: Dataset, input_column: str, *args, **kwargs) -> Tuple[Dict, DatasetColumn]:
        """
        Prepare data.

        Args:
            data ([`Dataset`]):
                Specifies the dataset we will run evaluation on.
            input_column (`str`, defaults to `"text"`):
                The name of the column containing the text feature in the dataset specified by `data`.
        Returns:
            `dict`:  metric inputs.
            `list`:  pipeline inputs.
        """

        self.check_required_columns(data, {"input_column": input_column})

        return {}, DatasetColumn(data, input_column)

Those (empty) metric inputs are then consumed by `Evaluator.compute()` in `evaluator/base.py` (pasted below), so the perplexity metric ends up with nothing to compare against. That's why the error is produced: in `module.py`, inside `metric.compute()`, the branch

        if any(v is not None for v in inputs.values()):
            self.add_batch(**inputs)

never gets called before `self._finalize()`.

For reference, here is `Evaluator.compute()` from `evaluator/base.py`:
    def compute(
        self,
        model_or_pipeline: Union[
            str,
            "Pipeline",
            Callable,
            "PreTrainedModel",
            "TFPreTrainedModel",  # noqa: F821
        ] = None,
        data: Union[str, Dataset] = None,
        subset: Optional[str] = None,
        split: Optional[str] = None,
        metric: Union[str, EvaluationModule] = None,
        tokenizer: Optional[Union[str, "PreTrainedTokenizer"]] = None,  # noqa: F821
        feature_extractor: Optional[
            Union[str, "FeatureExtractionMixin"]
        ] = None,  # noqa: F821
        strategy: Literal["simple", "bootstrap"] = "simple",
        confidence_level: float = 0.95,
        n_resamples: int = 9999,
        device: int = None,
        random_state: Optional[int] = None,
        input_column: str = "text",
        label_column: str = "label",
        label_mapping: Optional[Dict[str, Number]] = None,
    ) -> Dict[str, float]:
        result = {}

        self.check_for_mismatch_in_device_setup(device, model_or_pipeline)

        # Prepare inputs
        data = self.load_data(data=data, subset=subset, split=split)
        metric_inputs, pipe_inputs = self.prepare_data(
            data=data, input_column=input_column, label_column=label_column
        )
        pipe = self.prepare_pipeline(
            model_or_pipeline=model_or_pipeline,
            tokenizer=tokenizer,
            feature_extractor=feature_extractor,
            device=device,
        )
        metric = self.prepare_metric(metric)

        # Compute predictions
        predictions, perf_results = self.call_pipeline(pipe, pipe_inputs)
        predictions = self.predictions_processor(predictions, label_mapping)

        metric_inputs.update(predictions)

        # Compute metrics from references and predictions
        metric_results = self.compute_metric(
            metric=metric,
            metric_inputs=metric_inputs,
            strategy=strategy,
            confidence_level=confidence_level,
            n_resamples=n_resamples,
            random_state=random_state,
        )

        # TODO: To clarify why `wer` and `cer` return float
        # even though metric.compute contract says that it
        # returns Optional[dict].
        if type(metric_results) == float:
            metric_results = {metric.name: metric_results}

        result.update(metric_results)
        result.update(perf_results)

        return result
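For what it's worth, one possible workaround would be to subclass TextGenerationEvaluator so the generated texts land under the `predictions` key the perplexity metric expects, and so `model_id` is forwarded as a metric input. This is an untested sketch on my part, based only on the internals pasted above; the class name is mine, and the import paths (`evaluate.evaluator.text_generation`, `evaluate.evaluator.utils`) are from my reading of the repo and may be off:

from typing import Dict, List, Tuple

from datasets import Dataset, load_dataset
from evaluate.evaluator.text_generation import TextGenerationEvaluator
from evaluate.evaluator.utils import DatasetColumn


class PerplexityTextGenerationEvaluator(TextGenerationEvaluator):
    # Hypothetical subclass (not part of evaluate): route the pipeline outputs
    # into the inputs that the `perplexity` metric expects.

    def __init__(self, model_id: str = "gpt2"):
        super().__init__(task="text-generation")
        self.model_id = model_id

    def prepare_data(self, data: Dataset, input_column: str, *args, **kwargs) -> Tuple[Dict, DatasetColumn]:
        self.check_required_columns(data, {"input_column": input_column})
        # Instead of an empty dict, forward `model_id` so it reaches
        # perplexity.compute() via metric_inputs in Evaluator.compute().
        return {"model_id": self.model_id}, DatasetColumn(data, input_column)

    def predictions_processor(self, predictions, *args, **kwargs) -> Dict[str, List[str]]:
        # The text-generation pipeline returns a list of lists of dicts;
        # keep only the generated strings, under the `predictions` key.
        return {"predictions": [pred[0]["generated_text"] for pred in predictions]}


data = load_dataset("EleutherAI/lambada_openai", split="test").shuffle(seed=42).select(range(10))

task_evaluator = PerplexityTextGenerationEvaluator(model_id="gpt2")
eval_results = task_evaluator.compute(
    model_or_pipeline="gpt2",
    data=data,
    metric="perplexity",
    input_column="text",
)
print(eval_results)

If that's roughly the intended way to extend the evaluator, it might be worth documenting; if not, please correct me.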

I'm wondering whether this is still in progress, whether it only works with some metrics for the time being, or what the deal is?

Final question: all the logic for determining how a metric is computed is held within the metric itself, correct?

So, e.g., if I were trying to compute accuracy on lambada, I'd have to implement that myself?

Many thanks!

@DiabolicDev

Hey @inf3rnus, did you figure it out?

@kaustubholpadkar

It's crazy that it's not solved yet.

@inf3rnus (Author)

@DiabolicDev Nope. It's been 84 years since I've looked at this, but I believe that if the dataset doesn't have an evaluator defined for it, then you need to implement the code for whatever metric you're trying to evaluate yourself.

Depending on how badly you need this to work, I'd try to step through what I did via a debugger, or I'd check out https://github.com/EleutherAI/lm-evaluation-harness in the meantime.
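E.g., a hand-rolled accuracy check on lambada would look something like this. Rough, untested sketch: the last-word split and the exact_match scoring are just my own simplifications, nothing built into evaluate, and a proper LAMBADA eval compares at the token level.

import evaluate
from datasets import load_dataset
from transformers import pipeline

data = load_dataset("EleutherAI/lambada_openai", split="test").select(range(10))
pipe = pipeline("text-generation", model="gpt2", max_new_tokens=5)
exact_match = evaluate.load("exact_match")

predictions, references = [], []
for text in data["text"]:
    context, target = text.rsplit(" ", 1)  # LAMBADA target is the final word
    # return_full_text=False keeps only the newly generated continuation
    generated = pipe(context, return_full_text=False)[0]["generated_text"].strip()
    predictions.append(generated.split(" ")[0] if generated else "")
    references.append(target)

print(exact_match.compute(predictions=predictions, references=references))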


@DiabolicDev

Yeah, definitely man. I've been looking at other evaluation libraries; I just thought it'd be super easy to run these evaluations on Colab using models from Hugging Face and their evaluation library, if they had built it out properly. Anyway, thanks for letting me know!

@inf3rnus (Author)

@DiabolicDev Anytime! Happy hunting
