[#1010] Add Solr instructions for prebuilt Lucene index (#1379)
adamyy committed Sep 13, 2020
1 parent 748f949 commit bc2628b
Showing 1 changed file with 42 additions and 0 deletions.
42 changes: 42 additions & 0 deletions docs/solrini.md
@@ -48,6 +48,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resource
We can use Anserini as a common "frontend" for indexing into SolrCloud, thus supporting the same range of test collections that are already included in Anserini (when directly building local Lucene indexes).
Indexing into Solr is similar to indexing to disk with Lucene, with a few added parameters.
Most notably, we replace the `-index` parameter (which specifies the Lucene index path on disk) with Solr parameters.
Alternatively, Solr can also be configured to [read a prebuilt Lucene index](#solr-with-prebuilt-lucene-index), since Solr uses Lucene indexes under the hood.
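
To make the first option concrete, an indexing invocation has roughly the following shape; the exact command (and the complete set of options) appears later in this guide, and the Solr-specific flag names shown here (`-solr`, `-solr.index`, `-solr.zkUrl`) should be verified against `IndexCollection`'s help output:

```
# The -solr.* parameters take the place of the usual -index path.
sh target/appassembler/bin/IndexCollection \
  -collection TrecCollection -generator DefaultLuceneDocumentGenerator \
  -threads 8 -input /path/to/disk45 \
  -solr -solr.index robust04 -solr.zkUrl localhost:9983 \
  -storePositions -storeDocvectors -storeRaw
```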

We'll index [robust04](regressions-robust04.md) as an example.
First, create the `robust04` collection in Solr:
@@ -98,6 +99,47 @@ Make sure `core18` collection is created and `/path/to/WashingtonPost` is update
Solrini has also been verified to work with the [MS MARCO Passage Retrieval Corpus](experiments-msmarco-passage.md).
There should be no major issues with other collections that are supported by Anserini, but we have not tested them.

## Solr with Prebuilt Lucene Index

Solr can be considered a frontend for Lucene, so it is entirely possible for Solr to read a prebuilt Lucene index.
To achieve this, some housekeeping is required.
The following uses [Robust04](regressions-robust04.md) as an example, assuming your index files are stored under `indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/`.
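
As a quick, purely illustrative sanity check, you can list that directory and confirm it looks like a Lucene index (the exact file names vary with the Lucene version and index options):

```
ls indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/
# expect Lucene index files such as segments_N plus per-segment files (*.si, *.fdt, *.fdx, ...)
```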

First, a Solr collection must be created to house the index.
Here we create a collection `robust04` with the `anserini` configset:

```
solrini/bin/solr create -n anserini -c robust04
```

Along with the collection, Solr will create a core instance, whose name can be found in the Solr UI under the collection overview.
It might look something like `<collection_name>_shard<id>_replica_<id>` (e.g., `robust04_shard1_replica_n1`).
Solr stores the configuration and data for core instances under Solr home, which in our case is `solrini/server/solr/` by default.
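
If you prefer the command line over the Solr UI, the core's instance directory can also be found by listing Solr home:

```
ls solrini/server/solr/
# look for a directory named something like robust04_shard1_replica_n1
```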

Second, make any Solr schema adjustments that may be required.
Here, `robust04` is a TREC collection whose schema is already taken care of by the [managed-schema](https://github.com/castorini/anserini/blob/master/src/main/resources/solr/anserini/conf/managed-schema) in the Solr configset.
However, if you are dealing with a collection such as `cord19`, remember to make the appropriate adjustments to the Solr schema, as [previously described](#setting-up-a-single-node-solrcloud-instance):

```
curl -X POST -H 'Content-type:application/json' --data-binary @src/main/resources/solr/schemas/SCHEMA_NAME.json http://localhost:8983/solr/COLLECTION_NAME/schema
```
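
For instance, for `cord19` the command would look something like the following; the schema file name used here (`cord19.json`) is only a placeholder, so check `src/main/resources/solr/schemas/` for the actual file:

```
curl -X POST -H 'Content-type:application/json' \
  --data-binary @src/main/resources/solr/schemas/cord19.json \
  http://localhost:8983/solr/cord19/schema
```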

Then, copy (or move) the index files to where Solr expects them.
As previously established, Solr stores its index data in a `data` directory under the core's instance directory (`solrini/server/solr/<core-instance-directory>/data`).
You can simply copy your Lucene index files into `data/index` and Solr will pick them up from there:

```
cp indexes/robust04/lucene-index.robust04.pos+docvectors+rawdocs/* solrini/server/solr/robust04_shard1_replica_n1/data/index
```
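
As an optional check, list the destination directory to confirm the index files are in place:

```
ls solrini/server/solr/robust04_shard1_replica_n1/data/index
# should list the copied Lucene index files
```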

Lastly, restart Solr so that the changes take effect:

```
solrini/bin/solr stop
solrini/bin/solr start -c -m 8G
```
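
Once Solr is back up, a simple way to sanity-check that the prebuilt index is visible is to issue a match-all query against the collection using Solr's standard select handler; `numFound` in the response should match the number of documents in the index:

```
curl 'http://localhost:8983/solr/robust04/select?q=*:*&rows=0'
```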

## Solr integration test

We have an end-to-end integration testing script `run_solr_regression.py`.
