Is TextGenerationEvaluator incomplete? #537

Open · inf3rnus opened this issue Jan 20, 2024 · 5 comments

inf3rnus commented Jan 20, 2024

Hey all, fascinating library you got going on here.

I was trying to get a working example of computing perplexity on EleutherAI/lambada_openai with gpt2.

Unfortunately, the only way I could get it working was by doing something like:

from transformers import (
    AutoTokenizer,
    pipeline as trans_pipeline,
)
import evaluate
from datasets import load_dataset

task = "text-generation"

task_evaluator = evaluate.evaluator(task)

dataset_name = "EleutherAI/lambada_openai"


data = load_dataset(dataset_name, split="test").shuffle(seed=42).select(range(10))

model = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model)

pipe = trans_pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # accelerator="ort",
)

perplexity = evaluate.load("perplexity", module_type="metric")


references = data["text"]

predictions = pipe(references)

predictions = list(map(lambda prediction: prediction[0]["generated_text"], predictions))

perplexity.add_batch(predictions=predictions, references=references)

value = perplexity.compute(model_id="gpt2")

print("Perplexity is: ", value)

Okay, so that works. What doesn't work is the task evaluator class for text generation. This code throws:

import evaluate
from datasets import load_dataset

task = "text-generation"

task_evaluator = evaluate.evaluator(task)

dataset_name = "EleutherAI/lambada_openai"


data = load_dataset(dataset_name, split="test").shuffle(seed=42).select(range(10))

model = "gpt2"


eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    metric="perplexity",
    # label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

print(eval_results)

I get the error:

Exception has occurred: ValueError
Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.

And this is likely because `compute()` in `evaluator/base.py` needs the metric inputs returned from `self.prepare_data()` in the `TextGenerationEvaluator` class, but the first element of the tuple returned here is empty:

    def prepare_data(self, data: Dataset, input_column: str, *args, **kwargs) -> Tuple[Dict, DatasetColumn]:
        """
        Prepare data.

        Args:
            data ([`Dataset`]):
                Specifies the dataset we will run evaluation on.
            input_column (`str`, defaults to `"text"`):
                The name of the column containing the text feature in the dataset specified by `data`.
        Returns:
            `dict`:  metric inputs.
            `list`:  pipeline inputs.
        """

        self.check_required_columns(data, {"input_column": input_column})

        return {}, DatasetColumn(data, input_column)

Those (empty) metric inputs are then consumed by `Evaluator.compute()` in `evaluator/base.py` (pasted below), so the perplexity metric ends up with nothing to compare against. That's why the error is produced: in `module.py`, inside `metric.compute()`, the branch

        if any(v is not None for v in inputs.values()):
            self.add_batch(**inputs)

never gets called before `self._finalize()`.

For reference, here is `Evaluator.compute()` from `evaluator/base.py`:
    def compute(
        self,
        model_or_pipeline: Union[
            str,
            "Pipeline",
            Callable,
            "PreTrainedModel",
            "TFPreTrainedModel",  # noqa: F821
        ] = None,
        data: Union[str, Dataset] = None,
        subset: Optional[str] = None,
        split: Optional[str] = None,
        metric: Union[str, EvaluationModule] = None,
        tokenizer: Optional[Union[str, "PreTrainedTokenizer"]] = None,  # noqa: F821
        feature_extractor: Optional[
            Union[str, "FeatureExtractionMixin"]
        ] = None,  # noqa: F821
        strategy: Literal["simple", "bootstrap"] = "simple",
        confidence_level: float = 0.95,
        n_resamples: int = 9999,
        device: int = None,
        random_state: Optional[int] = None,
        input_column: str = "text",
        label_column: str = "label",
        label_mapping: Optional[Dict[str, Number]] = None,
    ) -> Dict[str, float]:
        result = {}

        self.check_for_mismatch_in_device_setup(device, model_or_pipeline)

        # Prepare inputs
        data = self.load_data(data=data, subset=subset, split=split)
        metric_inputs, pipe_inputs = self.prepare_data(
            data=data, input_column=input_column, label_column=label_column
        )
        pipe = self.prepare_pipeline(
            model_or_pipeline=model_or_pipeline,
            tokenizer=tokenizer,
            feature_extractor=feature_extractor,
            device=device,
        )
        metric = self.prepare_metric(metric)

        # Compute predictions
        predictions, perf_results = self.call_pipeline(pipe, pipe_inputs)
        predictions = self.predictions_processor(predictions, label_mapping)

        metric_inputs.update(predictions)

        # Compute metrics from references and predictions
        metric_results = self.compute_metric(
            metric=metric,
            metric_inputs=metric_inputs,
            strategy=strategy,
            confidence_level=confidence_level,
            n_resamples=n_resamples,
            random_state=random_state,
        )

        # TODO: To clarify why `wer` and `cer` return float
        # even though metric.compute contract says that it
        # returns Optional[dict].
        if type(metric_results) == float:
            metric_results = {metric.name: metric_results}

        result.update(metric_results)
        result.update(perf_results)

        return result
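For what it's worth, one possible workaround would be to subclass TextGenerationEvaluator so the generated texts land under the `predictions` key the perplexity metric expects, and so `model_id` is forwarded as a metric input. This is an untested sketch on my part, based only on the internals pasted above; the class name is mine, and the import paths (`evaluate.evaluator.text_generation`, `evaluate.evaluator.utils`) are from my reading of the repo and may be off:

from typing import Dict, List, Tuple

from datasets import Dataset, load_dataset
from evaluate.evaluator.text_generation import TextGenerationEvaluator
from evaluate.evaluator.utils import DatasetColumn


class PerplexityTextGenerationEvaluator(TextGenerationEvaluator):
    # Hypothetical subclass (not part of evaluate): route the pipeline outputs
    # into the inputs that the `perplexity` metric expects.

    def __init__(self, model_id: str = "gpt2"):
        super().__init__(task="text-generation")
        self.model_id = model_id

    def prepare_data(self, data: Dataset, input_column: str, *args, **kwargs) -> Tuple[Dict, DatasetColumn]:
        self.check_required_columns(data, {"input_column": input_column})
        # Instead of an empty dict, forward `model_id` so it reaches
        # perplexity.compute() via metric_inputs in Evaluator.compute().
        return {"model_id": self.model_id}, DatasetColumn(data, input_column)

    def predictions_processor(self, predictions, *args, **kwargs) -> Dict[str, List[str]]:
        # The text-generation pipeline returns a list of lists of dicts;
        # keep only the generated strings, under the `predictions` key.
        return {"predictions": [pred[0]["generated_text"] for pred in predictions]}


data = load_dataset("EleutherAI/lambada_openai", split="test").shuffle(seed=42).select(range(10))

task_evaluator = PerplexityTextGenerationEvaluator(model_id="gpt2")
eval_results = task_evaluator.compute(
    model_or_pipeline="gpt2",
    data=data,
    metric="perplexity",
    input_column="text",
)
print(eval_results)

If that's roughly the intended way to extend the evaluator, it might be worth documenting; if not, please correct me.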

I'm wondering whether this is still in progress, whether it only works with some metrics for the time being, or what the deal is?

Final question: all the logic for determining how a metric is computed is held within the metric itself, correct?

So, e.g., if I were trying to compute accuracy on lambada, I'd have to implement that myself?

Many thanks!

@DiabolicDev

Hey @inf3rnus, did you figure it out?

@kaustubholpadkar

It's crazy that it's not solved yet.

@inf3rnus (Author)

@DiabolicDev Nope. It's been 84 years since I've looked at this, but I believe that if the dataset doesn't have an evaluator defined for it, then you need to implement the code for whatever metric you're trying to evaluate yourself.

Depending on how badly you need this to work, I'd try to step through what I did via a debugger, or I'd check out https://github.com/EleutherAI/lm-evaluation-harness in the meantime.
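E.g., a hand-rolled accuracy check on lambada would look something like this. Rough, untested sketch: the last-word split and the exact_match scoring are just my own simplifications, nothing built into evaluate, and a proper LAMBADA eval compares at the token level.

import evaluate
from datasets import load_dataset
from transformers import pipeline

data = load_dataset("EleutherAI/lambada_openai", split="test").select(range(10))
pipe = pipeline("text-generation", model="gpt2", max_new_tokens=5)
exact_match = evaluate.load("exact_match")

predictions, references = [], []
for text in data["text"]:
    context, target = text.rsplit(" ", 1)  # LAMBADA target is the final word
    # return_full_text=False keeps only the newly generated continuation
    generated = pipe(context, return_full_text=False)[0]["generated_text"].strip()
    predictions.append(generated.split(" ")[0] if generated else "")
    references.append(target)

print(exact_match.compute(predictions=predictions, references=references))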


@DiabolicDev

Yeah, definitely man. I've been looking at other evaluation libraries; I just thought it'd be super easy to run these evaluations on Colab using models from Hugging Face and their evaluation library, if they had built it out properly. Anyway, thanks for letting me know!

@inf3rnus (Author)

@DiabolicDev Anytime! Happy hunting
