[#1010] Add Solr instructions for prebuilt Lucene index (#1379)
adamyy committed Sep 13, 2020
1 parent 748f949 commit bc2628b
Showing 1 changed file with 42 additions and 0 deletions.
42 changes: 42 additions & 0 deletions docs/solrini.md
@@ -48,6 +48,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resource
We can use Anserini as a common "frontend" for indexing into SolrCloud, thus supporting the same range of test collections that are already included in Anserini (when directly building local Lucene indexes).
Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters.
Most notably, we replace the `-index` parameter (which specifies the Lucene index path on disk) with Solr parameters.
Alternatively, Solr can also be configured to [read a prebuilt Lucene index](#solr-with-prebuilt-lucene-index), since Solr uses Lucene indexes under the hood.
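
To make the first option concrete, an indexing invocation has roughly the following shape; the exact command (and the complete set of options) appears later in this guide, and the Solr-specific flag names shown here (`-solr`, `-solr.index`, `-solr.zkUrl`) should be verified against `IndexCollection`'s help output:

```
# The -solr.* parameters take the place of the usual -index path.
sh target/appassembler/bin/IndexCollection \
  -collection TrecCollection -generator DefaultLuceneDocumentGenerator \
  -threads 8 -input /path/to/disk45 \
  -solr -solr.index robust04 -solr.zkUrl localhost:9983 \
  -storePositions -storeDocvectors -storeRaw
```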

We'll index [robust04](regressions-robust04.md) as an example.
First, create the `robust04` collection in Solr:
@@ -98,6 +99,47 @@ Make sure `core18` collection is created and `/path/to/WashingtonPost` is update
Solrini has also been verified to work with the [MS MARCO Passage Retrieval Corpus](experiments-msmarco-passage.md).
There should be no major issues with other collections that are supported by Anserini, but we have not tested them.

## Solr with Prebuilt Lucene Index

Solr can be considered a frontend for Lucene, so it is entirely possible for Solr to read a prebuilt Lucene index.
To achieve this, some housekeeping is required.
The following uses [Robust04](regressions-robust04.md) as an example, assuming your index files are stored under `indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/`.
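
As a quick, purely illustrative sanity check, you can list that directory and confirm it looks like a Lucene index (the exact file names vary with the Lucene version and index options):

```
ls indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/
# expect Lucene index files such as segments_N plus per-segment files (*.si, *.fdt, *.fdx, ...)
```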

First, a Solr collection must be created to house the index.
Here we create a collection `robust04` with the `anserini` configset:

```
solrini/bin/solr create -n anserini -c robust04
```

Along with the collection, Solr will create a core instance, whose name can be found in the Solr UI under the collection overview.
It might look something like `<collection_name>_shard<id>_replica_<id>` (e.g., `robust04_shard1_replica_n1`).
Solr stores the configuration and data for core instances under Solr home, which in our case is `solrini/server/solr/` by default.
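
If you prefer the command line over the Solr UI, the core's instance directory can also be found by listing Solr home:

```
ls solrini/server/solr/
# look for a directory named something like robust04_shard1_replica_n1
```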

Second, make any Solr schema adjustments that may be required.
Here, `robust04` is a TREC collection whose schema is already taken care of by the [managed-schema](https://github.com/castorini/anserini/blob/master/src/main/resources/solr/anserini/conf/managed-schema) in the Solr configset.
However, if you are dealing with a collection such as `cord19`, remember to make the appropriate adjustments to the Solr schema, as [previously described](#setting-up-a-single-node-solrcloud-instance):

```
curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema
```
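
For instance, for `cord19` the command would look something like the following; the schema file name used here (`cord19.json`) is only a placeholder, so check `src/main/resources/solr/schemas/` for the actual file:

```
curl -X POST -H 'Content-type:application/json' \
  --data-binary @src/main/resources/solr/schemas/cord19.json \
  http://localhost:8983/solr/cord19/schema
```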

Then, copy (or move) the index files to where Solr expects them.
As previously established, Solr stores its index data in a `data` directory under the core's instance directory (`solrini/server/solr/<core-instance-directory>/data`).
You can simply copy your Lucene index files into `data/index` and Solr will pick them up from there:

```
cp indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/* solrini/server/solr/robust04_shard1_replica_n1/data/index
```
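
As an optional check, list the destination directory to confirm the index files are in place:

```
ls solrini/server/solr/robust04_shard1_replica_n1/data/index
# should list the copied Lucene index files
```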

Lastly, restart Solr so that the changes take effect:

```
solrini/bin/solr stop
solrini/bin/solr start -c -m 8G
```
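
Once Solr is back up, a simple way to sanity-check that the prebuilt index is visible is to issue a match-all query against the collection using Solr's standard select handler; `numFound` in the response should match the number of documents in the index:

```
curl 'http://localhost:8983/solr/robust04/select?q=*:*&rows=0'
```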

## Solr integration test

We have an end-to-end integration testing script `run_solr_regression.py`.
