When searching for useful code snippets, it helps to reformulate search queries in terms of code classes or methods. For example, the query "How to download a file by URL in Java" can be expanded to include "InputStream, Vector, DataOutputStream". These terms are extracted from a pre-built index created from example Java code.

Code searches are usually formulated as natural-language queries, but the relevant code may contain esoteric terms and phrases that such queries fail to match. This tool translates and expands natural-language queries into a format that can be fed to a code search system to get more relevant results.
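As a rough illustration of the idea (not the tool's actual algorithm), a query can be expanded by looking up API terms that co-occur with its words in example snippets. The SNIPPETS data and the expand_query helper below are hypothetical and exist only for this sketch.

```python
from collections import Counter

# Hypothetical training data: (description words, API classes used in the snippet)
SNIPPETS = [
    ({"download", "file", "url"}, ["URL", "InputStream", "FileOutputStream"]),
    ({"read", "file", "lines"}, ["BufferedReader", "FileReader"]),
    ({"download", "page"}, ["URL", "InputStream"]),
]


def expand_query(query, num_res=2):
    """Return the API terms that most often co-occur with the query words."""
    words = set(query.lower().split())
    counts = Counter()
    for description, api_terms in SNIPPETS:
        # Weight each API term by how many query words the snippet shares
        overlap = len(words & description)
        for term in api_terms:
            counts[term] += overlap
    return [term for term, count in counts.most_common(num_res) if count > 0]


print(expand_query("how to download a file by URL"))
```

Terms from snippets that share no word with the query get a zero score and are filtered out, so an unrelated query yields an empty expansion.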


pip and virtualenv

Setting up the application requires pip. You may also want to use a Python virtual environment.

# activate a virtualenv

# install requirements
pip install -r requirements.txt


To set up the application with the Conda package management system, conda must be installed.

# Install conda

# Create a conda environment from conda.yml
conda env create -f conda.yml

# Activate the environment
source activate reformulate

Check installation

To check that the package is properly installed, you can run the tests and perform a simple reformulation:

# (optional) run tests
python -m unittest discover

# build a sample index
python index ./data/samples/Java/

# run a sample reformulation
python reform "how to write a file"


Before reformulation can be performed, an index has to be created. The index is built from training data (code snippets). It contains the frequency of each normalized term in the corpus and the ASTs that are used for reformulation. The index is created once per corpus and then reused for every query.
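A minimal sketch of the frequency part of such an index, assuming the snippets have already been normalized into lists of terms. The build_index helper and the file name are hypothetical; the tool's real index format, including its AST data, is not reproduced here.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def build_index(snippets):
    """Count how often each normalized term occurs across all snippets."""
    frequencies = Counter()
    for terms in snippets:
        frequencies.update(terms)
    return frequencies


snippets = [
    ["http", "client", "request", "http", "get"],
    ["file", "reader", "read", "line"],
    ["http", "respons", "read", "line"],
]
index = build_index(snippets)

# Store the index once, then reload it for every query
index_path = Path(tempfile.mkdtemp()) / "samples.pickle"
index_path.write_bytes(pickle.dumps(index))
reloaded = pickle.loads(index_path.read_bytes())

print(reloaded.most_common(3))
```

Persisting the counts is what makes the "create once, reuse for every query" workflow cheap: the corpus is only parsed at indexing time.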


To create an index, run:

python index /path/to/code/snippets

The specified path is scanned (non-recursively) for code files. Each code file is parsed and snippets are extracted from it. The index is stored in the ./index folder under the index base name. The default path can be changed; python index --help shows the available parameters. Note that indexing is slow when many code files are involved. Once the index is created, it can be used for reformulation.
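The difference between a non-recursive scan and a recursive one can be sketched with pathlib. The find_code_files helper and its ".java" default suffix are assumptions made for illustration, not the tool's actual implementation.

```python
import tempfile
from pathlib import Path


def find_code_files(root, suffix=".java", recursive=False):
    """List code files directly in root, or in all subfolders if recursive."""
    root = Path(root)
    pattern = f"*{suffix}"
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(path for path in matches if path.is_file())


# Build a tiny throwaway tree to demonstrate the difference
root = Path(tempfile.mkdtemp())
(root / "Top.java").write_text("class Top {}")
(root / "nested").mkdir()
(root / "nested" / "Inner.java").write_text("class Inner {}")

print(len(find_code_files(root)))                  # files directly in root
print(len(find_code_files(root, recursive=True)))  # files in root and subfolders
```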


To reformulate a query, run:

python reform "how to download file by URL"

The command reformulates the input query using the index in the ./index folder and prints the result to the console. By default, the output contains only public API methods. The --all parameter can be used to loosen this limit.

Using different languages

The default language for reformulation is Java. You can change it with the --lang="Python" parameter. The parameter must be specified at both the indexing and the reformulation stages.

Interactive mode

In interactive mode, the application accepts queries and prints the reformulation results in the terminal. To enter interactive mode, use the -i option of the reform command.
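How the -i mode is implemented is not described here, but a read-reformulate-print loop of this kind can be sketched as follows. The interactive_loop and fake_reformulate names are hypothetical, and the input is scripted so the example runs without a terminal.

```python
def interactive_loop(reformulate, read_line=input, write=print):
    """Read queries until an empty line or EOF, printing each reformulation."""
    while True:
        try:
            query = read_line("query> ").strip()
        except EOFError:
            break
        if not query:
            break
        write(", ".join(reformulate(query)))


# Stand-in reformulation function, for demonstration only
def fake_reformulate(query):
    return [word.capitalize() for word in query.split()]


# Scripted input: one query followed by an empty line that ends the session
scripted = iter(["read file", ""])
outputs = []
interactive_loop(
    fake_reformulate,
    read_line=lambda prompt: next(scripted),
    write=outputs.append,
)
print(outputs)
```

Passing the reader and writer as arguments keeps the loop testable; with the defaults it reads from the keyboard and prints to the console.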

Additional parameters

To list all available parameters, run a command with --help:

python index --help
python reform --help

Additional parameters let you change the index path (to build multiple index files), the language, the logging output, and so on.

Useful snippets

Try out sample data:

python index data/samples/Java/ --name="samples"

python reform "how to download file by URL" --name="samples"
python reform "output random numbers" --name="samples"
python reform "connect to a server" --name="samples"
python reform "read input lines" --name="samples" --all

Use multiple index files at the same time:

python index PATH_TO_SAMPLES --name="samples"
python index PATH_TO_DEFAULT --name="default"

python reform "how to download file by URL" --name="samples"
python reform "output random numbers" --name="default"

Loosen the public API limit:

python reform "how to download file by URL" --name="samples" --all

Build index and reformulate terms using Java methods for reformulation:

python index ./data/samples/Java --name="samples" --methods
python reform "how to download file by URL" --name="samples" --methods

Build index and reformulate terms using Python methods for reformulation:

python index ./data/samples/Python --name="samples_python" --methods --lang="Python"
python reform "segment length to remove" --name="samples_python" --methods --lang="Python"

Build index for all files in a repository recursively:

python index PATH -r

Downloading a big dataset

The dataset is reached through a chain of links ending in the file ERA_BigCloneBench_IJaDataset.tar.gz. Once extracted, the archive contains three folders that are subsets of one another, so it is better to pass only one of them to a single index command.

Running indexing on a big dataset

Indexing big datasets (200k+ files) requires a lot of RAM, so intermediate data can be dumped from RAM to storage with the --dump option.

python index data/samples/Java/ --name="samples" -d
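One way such dumping could work, sketched under the assumption that partial results are term counters flushed to disk after every few files and merged at the end. The helper names, the batch size, and the pickle format are hypothetical and do not describe the tool's actual --dump behavior.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def _dump(counter, dump_dir, number):
    """Write one partial counter to disk and return its path."""
    path = dump_dir / f"part-{number}.pickle"
    path.write_bytes(pickle.dumps(counter))
    return path


def index_in_batches(snippets, dump_dir, batch_size=2):
    """Count terms, writing partial counts to disk after every batch."""
    dump_dir = Path(dump_dir)
    partial, dumps = Counter(), []
    for number, terms in enumerate(snippets, start=1):
        partial.update(terms)
        if number % batch_size == 0:
            dumps.append(_dump(partial, dump_dir, len(dumps)))
            partial = Counter()  # release the memory held by this batch
    if partial:
        dumps.append(_dump(partial, dump_dir, len(dumps)))
    return dumps


def merge_dumps(dumps):
    """Combine the partial counts back into one index."""
    total = Counter()
    for path in dumps:
        total.update(pickle.loads(path.read_bytes()))
    return total


snippets = [["http", "get"], ["http", "client"], ["file", "read"]]
dumps = index_in_batches(snippets, tempfile.mkdtemp())
merged = merge_dumps(dumps)
print(len(dumps), merged["http"])
```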

API usage examples

Parts of the code can be reused for ad hoc reformulation-related tasks.


Normalization is the process of removing common words from a query and stemming the remaining terms.

from reform.processing.normalize import Normalizer

normalizer = Normalizer()

# splits the query into terms, removes stop words, stems the terms
normalizer.process_query("how to play sound using java")

# ['play', 'sound', 'use', 'java']
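For intuition, the two steps can be approximated without the library. The stop-word list and suffix rules below are deliberately tiny and hypothetical; the real Normalizer may use a different stemmer and produce different stems.

```python
STOP_WORDS = {"how", "to", "a", "the", "in", "by", "of"}
SUFFIXES = ("ing", "s")  # toy rules, far cruder than a real stemmer


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_normalize(query):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    words = query.lower().split()
    return [toy_stem(word) for word in words if word not in STOP_WORDS]


print(toy_normalize("How to read the lines of a file"))
```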


Vectorization is the process of counting the most common words in the corpus; the resulting model can then be used for reformulation.

from reform.parse.models import APINode
from reform.processing.vectorize import Vectorizer

# input snippets
corpus = [
            ['downloadwebpagesampl', 'address', 'string', 'string', 'client', 'http', 'client',
             'build', 'http', 'client', 'httpclient', 'request', 'address', 'http', 'get',
             'httpget', 'http', 'get',
             'httpget', 'respons', 'request', 'client', 'execut', 'http', 'respons', 'httprespons',
             'string', 'line', 'br', 'read', 'line', 'page', 'line', '"\\n"', 'br', 'close',
             'client', 'protocol', 'except', 'clientprotocolexcept', 'io', 'except', 'ioexcept']
]

# terms storage to output
ast_classes_storage = {"httpclient": APINode("HttpClient", count=1)}

# trains the model
vectorizer = Vectorizer()
vectorizer.train(corpus, ast_classes_storage, min_count=0)

# performs reformulation
res = vectorizer.reformulate(["request"], num_res=1)

# ['HttpClient']

Language-specific reformulations

Find a language using the input parameters

from reform.parse.selector import CorpusSelector

corpus_storage, parse =, self.methods, **kwargs)

if corpus_storage == NotImplemented: