
1.2.0

@gabrielmbmb released this 18 Jun 12:40
· 75 commits to main since this release
3910aca

✨ Release highlights

Structured generation with instructor, structured generation support in InferenceEndpointsLLM, and a new StructuredGeneration task

  • instructor has been integrated, bringing support for structured generation with OpenAILLM, AnthropicLLM, LiteLLM, MistralLLM, CohereLLM and GroqLLM:

    Structured generation with `instructor` example
    from typing import List
    
    from distilabel.llms import MistralLLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import TextGeneration
    from pydantic import BaseModel, Field
    
    
    class Node(BaseModel):
        id: int
        label: str
        color: str
    
    
    class Edge(BaseModel):
        source: int
        target: int
        label: str
        color: str = "black"
    
    
    class KnowledgeGraph(BaseModel):
        nodes: List[Node] = Field(default_factory=list)
        edges: List[Edge] = Field(default_factory=list)
    
    
    with Pipeline(
        name="Knowledge-Graphs",
        description=(
            "Generate knowledge graphs to answer questions, this type of dataset can be used to "
            "steer a model to answer questions with a knowledge graph."
        ),
    ) as pipeline:
        sample_questions = [
            "Teach me about quantum mechanics",
            "Who is who in The Simpsons family?",
            "Tell me about the evolution of programming languages",
        ]
    
        load_dataset = LoadDataFromDicts(
            name="load_instructions",
            data=[
                {
                    "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                    "instruction": f"{question}",
                }
                for question in sample_questions
            ],
        )
    
        text_generation = TextGeneration(
            name="knowledge_graph_generation",
            llm=MistralLLM(
                model="open-mixtral-8x22b",
                structured_output={"schema": KnowledgeGraph}
            ),
        )
        load_dataset >> text_generation
  • InferenceEndpointsLLM now supports structured generation.

  • New StructuredGeneration task that allows defining the schema of the structured generation per input row.
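
    Below is a minimal sketch of the new task. It assumes that StructuredGeneration reads the schema from a structured_output column in each input row and that this column is a dict with format and schema keys, and it reuses the model_id/tokenizer_id style of InferenceEndpointsLLM shown elsewhere in these notes; treat the exact signatures as assumptions rather than the definitive API.

    `StructuredGeneration` example (sketch)
    from distilabel.llms import InferenceEndpointsLLM
    from distilabel.pipeline import Pipeline
    from distilabel.steps import LoadDataFromDicts
    from distilabel.steps.tasks import StructuredGeneration


    with Pipeline(name="structured-generation-per-row") as pipeline:
        # Each row carries its own schema, so different rows can ask for
        # different structured outputs from the same task.
        load_dataset = LoadDataFromDicts(
            name="load_instructions",
            data=[
                {
                    "instruction": "Create a character for my game.",
                    # Assumed column format: a dict with the output format and a JSON schema.
                    "structured_output": {
                        "format": "json",
                        "schema": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "level": {"type": "integer"},
                            },
                            "required": ["name", "level"],
                        },
                    },
                },
            ],
        )

        structured_generation = StructuredGeneration(
            name="structured_generation",
            llm=InferenceEndpointsLLM(
                model_id="meta-llama/Meta-Llama-3-70B-Instruct",
                tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
            ),
        )

        load_dataset >> structured_generation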

New tasks for generating datasets for training embedding models

sentence-transformers v3 was recently released, and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
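
As a taste of what these enable, here is a minimal sketch built around the GenerateSentencePair task; the task name, its action/triplet parameters and the "anchor" input column are assumptions about the new API, so double-check them against the docs.

`GenerateSentencePair` example (sketch)
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import GenerateSentencePair


with Pipeline(name="sentence-pairs") as pipeline:
    # Assumed input column: an "anchor" sentence per row.
    load_dataset = LoadDataFromDicts(
        name="load_anchors",
        data=[{"anchor": "What is the capital of France?"}],
    )

    # Generate a positive (and, with triplet=True, also a negative) sentence for
    # each anchor, the kind of data used to train sentence-transformers models.
    generate_pairs = GenerateSentencePair(
        name="generate_sentence_pairs",
        action="paraphrase",  # assumed parameter: relation of the generated sentence to the anchor
        triplet=True,         # assumed parameter: also generate a negative example
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
    )

    load_dataset >> generate_pairs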

New steps for loading data from different sources and saving/loading a Distiset to disk

We've added a few new steps for loading data from different sources:

  • LoadDataFromDisk allows loading a Distiset or datasets.Dataset that was previously saved using the save_to_disk method (see the sketch after the save_to_disk example below).
  • LoadDataFromFileSystem allows loading a datasets.Dataset from a file system.

Thanks to @rasdani for helping us test these new steps!

In addition, we have added a save_to_disk method to Distiset, akin to datasets.Dataset.save_to_disk, that allows saving the generated distiset to disk along with the pipeline.yaml and pipeline.log.

`save_to_disk` example
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...
    
if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
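
The distiset saved above can then be loaded back with the new LoadDataFromDisk step. A minimal sketch follows; the dataset_path and is_distiset arguments are assumptions about the step's signature based on the description in these notes.

`LoadDataFromDisk` example (sketch)
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDisk

with Pipeline(name="my-other-pipeline") as pipeline:
    # Load the distiset previously written by `save_to_disk`.
    load_distiset = LoadDataFromDisk(
        name="load_distiset",
        dataset_path="my-distiset",  # assumed argument: directory created by `save_to_disk`
        is_distiset=True,            # assumed argument: set to False to load a plain `datasets.Dataset`
    )
    ...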

MixtureOfAgentsLLM implementation

We've added a new LLM called MixtureOfAgentsLLM derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new LLM allows generating improved outputs thanks to the collective expertise of several LLMs.

`MixtureOfAgentsLLM` example
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)

Optimizations for saving the cache and passing batches to GlobalSteps

  • The cache logic of the _BatchManager has been improved to update the cache incrementally, making the process much faster.
  • The data of the input batches of GlobalSteps can now be passed to the step using the file system, as this is faster than passing it through the queue. This is possible thanks to the new fsspec integration, which can be configured to use a local file system or cloud storage as the backend for passing the data of the batches (see the sketch below).
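
A minimal sketch of how this could look when running a pipeline; the use_fs_to_pass_data and storage_parameters arguments to pipeline.run, as well as the gcs:// path, are assumptions used for illustration.

`use_fs_to_pass_data` example (sketch)
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        # Assumed arguments: pass the data of the batches received by GlobalSteps
        # through an fsspec-backed file system (local or cloud) instead of the queue.
        use_fs_to_pass_data=True,
        storage_parameters={"path": "gcs://my-bucket/distilabel-batches"},
    )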

BasePipeline and _BatchManager refactor

The logic around BasePipeline and _BatchManager has been refactored, which will make it easier to implement new pipelines in the future.

Added ArenaHard as an example of how to use distilabel to implement a benchmark

distilabel can be easily used to create an LLM benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with distilabel: Arena Hard

📚 Improved documentation structure

We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.


What's Changed

New Contributors

Full Changelog: 1.1.1...1.2.0