When searching for useful code snippets, it helps to reformulate search queries in terms of code classes or methods. For example, the query "How to download a file by URL in Java" can be expanded to include "InputStream, Vector, DataOutputStream". These terms are extracted from a pre-built index that is created from example Java code.
Code searches are usually formulated as natural-language queries, but the actual code may contain esoteric terms and phrases that are difficult to retrieve. This tool translates and expands natural-language queries into a format that can be fed to a code search system for more relevant search results. See details.
Setting up the application requires pip. You may also want to use a Python virtual environment.
# activate a virtualenv
# install requirements
pip install -r requirements.txt
Setting up the application with the Conda package management system requires conda.
# Install conda
# Create a conda environment from conda.yml
conda env create -f conda.yml
# Activate the environment
source activate reformulate
To check that the package is properly installed, you can run the tests and perform a simple reformulation:
# (optional) run tests
python -m unittest discover
# build a sample index
python reformulation.py index ./data/samples/Java/
# run a sample reformulation
python reformulation.py reform "how to write a file"
Before reformulation can be performed, the index has to be created. The index is based on training data - code snippets. It contains frequencies of each normalized term in the corpus and ASTs that will be used for reformulation. The index can be created once for each corpus and then reused for every query.
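To make the idea of "frequencies of each normalized term" concrete, here is a minimal sketch of counting terms across already-normalized snippets. It is an illustration only: the function name `build_term_index` and the sample data are hypothetical, and the tool's real index also stores ASTs and may be organized quite differently.

```python
from collections import Counter

def build_term_index(snippets):
    """Count how often each normalized term occurs across all snippets."""
    index = Counter()
    for terms in snippets:
        index.update(terms)
    return index

# Hypothetical, already-normalized snippets (each one is a list of terms).
snippets = [
    ["http", "client", "request", "http"],
    ["file", "write", "client"],
]
index = build_term_index(snippets)
print(index["http"], index["client"], index["file"])
```

Because the counts depend only on the corpus, a structure like this can be computed once and reused for every query, which matches the build-once, reuse-often workflow described above.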
To create the index, run:
python reformulation.py index /path/to/code/snippets
The specified path is scanned (non-recursively) to find code files. Code files are parsed and snippets are extracted from each file. The index is stored in the ./index folder under the index base name. The default path can be changed; python reformulation.py --help shows the available parameters. Note that indexing is slow when many code files are involved. Once the index is created, it can be used for reformulation.
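The difference between a non-recursive and a recursive scan can be sketched with the standard library. This is not the tool's own scanner: `find_code_files` and its parameters are hypothetical names used only to show what "non-recursively" means in practice.

```python
import tempfile
from pathlib import Path

def find_code_files(root, extension=".java", recursive=False):
    """Return code files under root; descend into subfolders only if recursive."""
    root = Path(root)
    pattern = f"*{extension}"
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(path for path in matches if path.is_file())

# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Top.java").write_text("class Top {}")
    (root / "nested").mkdir()
    (root / "nested" / "Inner.java").write_text("class Inner {}")

    flat = [p.name for p in find_code_files(root)]
    deep = [p.name for p in find_code_files(root, recursive=True)]
    print(flat)
    print(deep)
```

The non-recursive call sees only the file at the top level, while the recursive call also picks up the file in the nested folder.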
To reformulate queries run:
python reformulation.py reform "how to download file by URL"
The command reformulates the input query using the index in the ./index folder. The output is printed to the console. By default, the output contains only public API methods. The --all parameter can be used to loosen this limit.
The default language for reformulation is Java. You can change the setting with the --lang="Python" parameter. The parameter must be specified at both the index and the reformulation stage.
In interactive mode the application accepts queries and prints the reformulation results in the terminal. To enter interactive mode, use the -i flag of the reform command.
To see all available parameters, run:
python reformulation.py --help
With additional parameters you can change the index path to build multiple index files, change the language, configure logging output, and so on.
Try out sample data:
python reformulation.py index data/samples/Java/ --name="samples"
python reformulation.py reform "how to download file by URL" --name="samples"
python reformulation.py reform "output random numbers" --name="samples"
python reformulation.py reform "connect to a server" --name="samples"
python reformulation.py reform "read input lines" --name="samples" --all
Use multiple index files at the same time:
python reformulation.py index PATH_TO_SAMPLES --name="samples"
python reformulation.py index PATH_TO_DEFAULT --name="default"
python reformulation.py reform "how to download file by URL" --name="samples"
python reformulation.py reform "output random numbers" --name="default"
Loosen the public API limit:
python reformulation.py reform "how to download file by URL" --name="samples" --all
Build an index and reformulate queries using Java methods:
python reformulation.py index ./data/samples/Java --name="samples" --methods
python reformulation.py reform "how to download file by URL" --name="samples" --methods
Build an index and reformulate queries using Python methods:
python reformulation.py index ./data/samples/Python --name="samples_python" --methods --lang="Python"
python reformulation.py reform "segment length to remove" --name="samples_python" --methods --lang="Python"
Build an index for all files in a repository recursively:
python reformulation.py index PATH -r
The dataset can be reached through the following chain of links: https://github.com/clonebench/BigCloneBench -> https://drive.google.com/file/d/0B70GNOiQD-X7ZDVBMzRUWktDUWs/view -> download the file ERA_BigCloneBench_IJaDataset.tar.gz. The unpacked archive contains 3 folders that are subsets of one another, so it is better to pass only one of them to a single index command.
Indexing big datasets with 200k+ files takes a lot of RAM, so it is possible to dump intermediate data from RAM to storage with the --dump option.
python reformulation.py index data/samples/Java/ --name="samples" -d
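The general idea of trading RAM for storage can be sketched as follows. This is not the tool's dump format, which is not described here: the functions `dump_in_batches` and `count_terms`, the pickle files, and the batch size are all assumptions chosen for illustration. Parsed snippets are written to disk in small batches and then read back one batch at a time, so only one batch of token lists is held in memory at once.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

def _write_batch(batch, dump_dir, number):
    """Pickle one batch of token lists and return the file path."""
    path = dump_dir / f"batch_{number}.pkl"
    with path.open("wb") as handle:
        pickle.dump(batch, handle)
    return path

def dump_in_batches(token_lists, dump_dir, batch_size=2):
    """Consume token lists lazily, flushing every batch_size lists to disk."""
    dump_dir = Path(dump_dir)
    paths, batch = [], []
    for tokens in token_lists:
        batch.append(tokens)
        if len(batch) == batch_size:
            paths.append(_write_batch(batch, dump_dir, len(paths)))
            batch = []
    if batch:  # flush the final, possibly smaller batch
        paths.append(_write_batch(batch, dump_dir, len(paths)))
    return paths

def count_terms(paths):
    """Aggregate term counts by loading one dumped batch at a time."""
    totals = Counter()
    for path in paths:
        with path.open("rb") as handle:
            for tokens in pickle.load(handle):
                totals.update(tokens)
    return totals

with tempfile.TemporaryDirectory() as tmp:
    parsed = [["http", "client"], ["file", "write"], ["http", "get"]]
    paths = dump_in_batches(iter(parsed), tmp, batch_size=2)
    totals = count_terms(paths)
    print(len(paths), totals["http"])
```

The aggregated counts still have to fit in memory; what the batching avoids is keeping every parsed snippet resident at the same time.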
Parts of the code can be reused for ad hoc reformulation-related tasks.
Normalization is the process of removing common words (stop words) from a query and stemming the remaining terms.
from reform.processing.normalize import Normalizer

normalizer = Normalizer()
# splits the query into terms, removes stop words, stems the terms
normalizer.process_query("how to play sound using java")
# ['play', 'sound', 'use', 'java']
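For readers who want to see the two steps in isolation, here is a deliberately naive, standard-library-only sketch of stop-word removal followed by stemming. It is not how `Normalizer` is implemented: the stop-word set, the suffix list, and the helper names are assumptions, and a real stemmer (for example a Porter stemmer) handles far more cases and will produce different stems for some words.

```python
import re

# Tiny illustrative stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "by", "how", "in", "of", "the", "to"}
# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(query):
    """Lowercase, tokenize, drop stop words, then stem each remaining term."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]

print(normalize("How to download files by URL"))
# ['download', 'file', 'url']
```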
Vectorization is the process of counting the most common words in the corpus; the resulting model can then be used for reformulation.
from reform.parse.models import APINode
from reform.processing.vectorize import Vectorizer

# input snippets
corpus = [
    ['downloadwebpagesampl', 'address', 'string', 'string', 'client', 'http', 'client',
     'build', 'http', 'client', 'httpclient', 'request', 'address', 'http', 'get',
     'httpget', 'http', 'get', 'httpget', 'respons', 'request', 'client', 'execut',
     'http', 'respons', 'httprespons', 'string', 'line', 'br', 'read', 'line', 'page',
     'line', '"\\n"', 'br', 'close', 'client', 'protocol', 'except',
     'clientprotocolexcept', 'io', 'except', 'ioexcept']
]

# terms storage to output
ast_classes_storage = {"httpclient": APINode("HttpClient", count=1)}

# trains the model
vectorizer = Vectorizer()
vectorizer.train(corpus, ast_classes_storage, min_count=0)

# performs reformulation
res = vectorizer.reformulate(["request"], num_res=1)
# ['HttpClient']
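How the `Vectorizer` links query terms to API classes is not described here. One simple way to get a similar effect, shown purely as an assumption-laden sketch, is co-occurrence counting: an API class scores a point for every snippet it shares with a query term. The function name, the `api_names` mapping, and the sample corpus below are hypothetical.

```python
from collections import Counter

def reformulate(query_terms, corpus, api_names, num_res=1):
    """Rank API classes by how many snippets they share with a query term."""
    query = set(query_terms)
    scores = Counter()
    for snippet in corpus:
        terms = set(snippet)
        if not query & terms:
            continue  # this snippet mentions none of the query terms
        for term in terms:
            if term in api_names:
                scores[api_names[term]] += 1
    return [name for name, _ in scores.most_common(num_res)]

# Hypothetical normalized snippets and a term-to-class mapping.
corpus = [
    ["client", "httpclient", "request", "execut"],
    ["file", "write", "filewrit", "close"],
    ["request", "httpclient", "httpget"],
]
api_names = {"httpclient": "HttpClient", "httpget": "HttpGet", "filewrit": "FileWriter"}

print(reformulate(["request"], corpus, api_names))
# ['HttpClient']
```

Raw co-occurrence favors classes that appear everywhere, so a practical model would normally weight terms by how rare they are in the corpus.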
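Select the language-specific corpus storage and parser from the input parameters:
from reform.parse.selector import CorpusSelector

# excerpt from inside a method; self.lang and self.methods presumably
# carry the --lang and --methods settings
corpus_storage, parse = CorpusSelector.select(self.lang, self.methods, **kwargs)
# stop early when the requested combination is not supported
if corpus_storage is NotImplemented:
    return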