# CAT-LM

Official release of CAT-LM: Aligned Code And Tests Language Model.

## Overview

CAT-LM is a GPT-style language model with 2.7 billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code files and their corresponding test files when such a mapping is available. We also drastically increase the maximum input sequence length to 8,192 tokens, four times that of typical code generation models, to ensure that the code under test is available in context when generating test code. Our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way toward more powerful automated test generation.
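
For intuition, below is a minimal sketch of how an aligned code-test training instance might be assembled. It is an illustration only: it assumes the `<|codetestpair|>` separator shown in the Usage example, and the helper `build_code_test_instance` is hypothetical rather than part of the released training pipeline, whose exact preprocessing may differ.

```python
# Hypothetical sketch: concatenate a code file and its matching test file into
# one training sequence, separated by the <|codetestpair|> token used in the
# Usage example. Not the official CAT-LM preprocessing code.
from pathlib import Path


def build_code_test_instance(code_path: str, test_path: str) -> str:
    """Join a code file and its aligned test file into a single sequence."""
    code = Path(code_path).read_text()
    tests = Path(test_path).read_text()
    return f"{code}\n<|codetestpair|>\n{tests}"
```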

## Publication

CAT-LM: Training Language Models on Aligned Code And Tests
Nikitha Rao*, Kush Jain*, Uri Alon, Claire Le Goues, and Vincent J. Hellendoorn
38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023)

## Usage

CAT-LM is available on Hugging Face as [`nikitharao/catlm`](https://huggingface.co/nikitharao/catlm). The snippet below loads the model and samples a test for a small Python function:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('nikitharao/catlm', use_fast=False)
model = AutoModelForCausalLM.from_pretrained('nikitharao/catlm')

# The <|codetestpair|> token marks the end of the code under test and prompts
# the model to generate a corresponding test.
prompt = """
def add(x,y):
    \"\"\"Add two numbers x and y\"\"\"
    return x+y
<|codetestpair|>
"""

print('Input prompt:')
print(prompt)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# The model was trained without the `</s>` token, so remove it from the
# tokenized input if the tokenizer appended it.
if tokenizer.decode(input_ids[0, -1]) == '</s>':
    input_ids = input_ids[:, :-1]

print(input_ids)
len_input = input_ids.shape[1]

sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=512,
    top_k=50,
    top_p=0.95,
    temperature=0.2
)

# Decode only the newly generated tokens (the test), dropping the prompt.
generated_output = sample_output[0][len_input:]
output = tokenizer.decode(generated_output, skip_special_tokens=True)
print('Output:')
print(output)
```

Note: The model was trained without the `</s>` token, so it should be removed from the end of the tokenized prompt, as done above.
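
To draw several candidate tests at once, the standard Hugging Face `generate` options can be used. The sketch below reuses `model`, `tokenizer`, and `input_ids` from the snippet above; the sampling settings (`num_return_sequences=3`, `temperature=0.8`) are illustrative, not values recommended by the authors.

```python
# Sketch: sample several candidate tests in one call using standard
# Hugging Face generate options (not specific to CAT-LM). Assumes `model`,
# `tokenizer`, and `input_ids` from the snippet above.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

outputs = model.generate(
    input_ids.to(device),
    do_sample=True,
    max_new_tokens=512,
    top_k=50,
    top_p=0.95,
    temperature=0.8,
    num_return_sequences=3,  # draw three candidate tests
)

# Print each candidate, dropping the prompt tokens from the decoded text.
for i, sample in enumerate(outputs):
    candidate = tokenizer.decode(sample[input_ids.shape[1]:], skip_special_tokens=True)
    print(f"--- Candidate {i + 1} ---")
    print(candidate)
```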

## Data and Model Training

The code and datasets for training and evaluating CAT-LM, along with the results of additional experiments and comparisons against TeCo, CodeGen, and StarCoder, are available at:

https://doi.org/10.5281/zenodo.7901830