
searcher: make indexing of repos concurrent #272

Merged
merged 6 commits into from
Jan 18, 2018

Conversation

sahildua2305
Contributor

This patch attempts to vastly improve hound's startup time in cases where
there is a huge number of repositories; previously hound could take a long
time to start because it indexed them sequentially.

Now, startup indexing runs concurrently while respecting the
config.MaxConcurrentIndexers parameter set by users in config.json.

Fixes #250
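For reference, a minimal config.json sketch with the limiter set. The key `max-concurrent-indexers` is the one this patch respects; the surrounding keys (`dbpath`, `repos`, `url`) follow hound's documented config format but are written from memory here, so treat them as illustrative:

```json
{
  "max-concurrent-indexers" : 3,
  "dbpath" : "data",
  "repos" : {
    "Hound" : { "url" : "https://github.com/hound-search/hound.git" }
  }
}
```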

@sahildua2305
Contributor Author

It still needs some work with respect to logging for the different searchers. Right now it's impossible to make sense of the logs, which come out of order because of the concurrency.

With `"max-concurrent-indexers" : 3`, it looks like this:

2017/12/27 00:46:25 Searcher started for checkroot
2017/12/27 00:46:25 Searcher started for hackIDE2
2017/12/27 00:46:25 Searcher started for Hound2
2017/12/27 00:46:29 merge 0 files + mem
2017/12/27 00:46:29 5135 data bytes, 33552 index bytes
2017/12/27 00:46:29 Searcher started for social-media-share
2017/12/27 00:46:36 merge 0 files + mem
2017/12/27 00:46:36 25360 data bytes, 72254 index bytes
2017/12/27 00:46:36 Searcher started for Hound1
2017/12/27 00:46:37 merge 0 files + mem
2017/12/27 00:46:37 1982263 data bytes, 572876 index bytes
2017/12/27 00:46:37 Searcher started for hackIDE1
2017/12/27 00:46:47 merge 0 files + mem
2017/12/27 00:46:48 11896668 data bytes, 2077298 index bytes
2017/12/27 00:46:49 merge 0 files + mem
2017/12/27 00:46:49 1980573 data bytes, 572415 index bytes
2017/12/27 00:46:57 merge 0 files + mem
2017/12/27 00:46:58 11880955 data bytes, 2068178 index bytes
2017/12/27 00:46:58 All indexes built!
2017/12/27 00:46:58 running server at http://localhost:6080...

The rest of the implementation is good for review.

cc @kellegous

@dgryski

dgryski commented Dec 27, 2017

You can simplify the goroutine logic by having a single channel that returns a single type holding either a searcher or an error. A quick demo is here: https://play.golang.org/p/7ucBkx6MH_-. This way you know exactly how many results are coming back from the channel: one per spawned goroutine. This layout also removes the WaitGroup and the extra collector goroutine, and it makes it clear there are no data races on the maps. The current code may well be fine, but updating the maps only in the main goroutine makes that clear by inspection.
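The pattern above can be sketched as follows. This is not hound's actual code; `Searcher`, `newSearcher`, and `makeAll` are hypothetical stand-ins chosen to illustrate the single-channel layout:

```go
package main

import "fmt"

// Searcher is a hypothetical stand-in for hound's per-repo searcher.
type Searcher struct{ repo string }

// result carries either a searcher or an error back on one channel,
// exactly one per spawned goroutine.
type result struct {
	name     string
	searcher *Searcher
	err      error
}

// newSearcher is an illustrative constructor that can fail.
func newSearcher(name string) (*Searcher, error) {
	if name == "bad-repo" {
		return nil, fmt.Errorf("cannot index %s", name)
	}
	return &Searcher{repo: name}, nil
}

func makeAll(repos []string) (map[string]*Searcher, map[string]error) {
	resultCh := make(chan result, len(repos)) // buffered so no sender blocks

	for _, name := range repos {
		go func(name string) {
			s, err := newSearcher(name)
			resultCh <- result{name: name, searcher: s, err: err}
		}(name)
	}

	searchers := map[string]*Searcher{}
	errs := map[string]error{}
	// The main goroutine consumes exactly len(repos) results, so both maps
	// are only ever touched here: no data races, no WaitGroup needed.
	for i := 0; i < len(repos); i++ {
		r := <-resultCh
		if r.err != nil {
			errs[r.name] = r.err
		} else {
			searchers[r.name] = r.searcher
		}
	}
	return searchers, errs
}

func main() {
	s, e := makeAll([]string{"hound", "hackIDE", "bad-repo"})
	fmt.Println(len(s), len(e))
}
```

Because each goroutine sends exactly one value, counting receives in the main routine replaces both the WaitGroup and the collector goroutine.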

For the issue with interleaved log lines, you'd need to thread the repository name into the logging code so that each log line is prefixed appropriately. This could be done by creating separate log.Logger instances and calling SetPrefix, instead of using the standard global logger.
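A minimal sketch of the per-repo logger idea, using the standard library's `log.New` with a prefix (hound's actual logging call sites may differ):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
)

// prefixedLogs gives each repo its own log.Logger with a prefix so that
// interleaved lines remain attributable to a repository.
func prefixedLogs(repos []string) string {
	var buf bytes.Buffer
	for _, repo := range repos {
		// Flags set to 0 here to keep the example output deterministic;
		// hound would keep its timestamp flags.
		logger := log.New(&buf, "["+repo+"] ", 0)
		logger.Print("Searcher started")
	}
	return buf.String()
}

func main() {
	fmt.Print(prefixedLogs([]string{"hound", "hackIDE"}))
}
```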

@kellegous
Member

I agree with @dgryski. The structure he proposes is almost exactly what I had in mind. Specifically, I like to think of this as N processes that run independently (ignoring the limiter) and return their results (either good or bad) over a single channel. The main orchestrating routine consumes all N results in sequence and builds the searcher map before returning.

@sahildua2305
Contributor Author

Thanks for dropping in. Great suggestions, @dgryski! I'll make changes accordingly.

Change the concurrent searcher init routine to simplify the
implementation: use a single common channel to return both successfully
created searchers and errors. This also removes the goroutine that
collected results from the channel; the main routine now collects the
results itself, blocking on the channel.

`for range cfg.Repos` is a syntax error in Go 1.3, so iterate with
`len(cfg.Repos)` instead.
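The compatibility workaround can be sketched like this (`indexAll` and the repo list are illustrative, not hound's API):

```go
package main

import "fmt"

// indexAll spawns work per repo. Go 1.4+ allows a bare `for range repos`
// with no loop variables, but Go 1.3 rejects that form, so the patch
// counts iterations with len() instead.
func indexAll(repos []string) int {
	started := 0
	for i := 0; i < len(repos); i++ {
		started++ // spawn one indexer per repo here
	}
	return started
}

func main() {
	fmt.Println(indexAll([]string{"hound", "hackIDE", "checkroot"}))
}
```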
```diff
@@ -287,7 +287,7 @@ func MakeAll(cfg *config.Config) (map[string]*Searcher, map[string]error, error)
 	lim := makeLimiter(cfg.MaxConcurrentIndexers)

 	// Channel to receive the results from newSearcherConcurrent function.
-	resultCh := make(chan searcherResult, 1)
+	resultCh := make(chan searcherResult)
```
Contributor Author

@dgryski IMO, making this channel unbuffered won't solve the problem of blocking writes from the goroutines. How is an unbuffered channel better than a buffered channel of length 1?

A buffered channel of length len(cfg.Repos), however, would make sense.


Yes, I meant more that having a buffer size of 1 didn't make sense. Either make it totally unbuffered (so they all block), or with enough space for them all to put the results.

Contributor Author

Ok, since we aren't doing any heavy work after receiving on the channel, I feel it's okay to let the goroutines block when sending on the channel. What do you think? Do you have a preference?


Standard practice for launching a set of goroutines with a response channel is to have it buffered with the number of known entries so that none of them block.
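A small self-contained demonstration of that practice. The names here (`squareSum`, the squaring work) are invented for the example; the point is only that a channel buffered to the number of senders lets every goroutine finish without a receiver running:

```go
package main

import (
	"fmt"
	"sync"
)

// squareSum launches n goroutines that each send one result. Because the
// channel is buffered with capacity n, no sender ever blocks, even before
// the receiver starts draining.
func squareSum(n int) int {
	results := make(chan int, n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results <- i * i // never blocks: buffer has room for every result
		}(i)
	}
	wg.Wait() // returns promptly; every send completed into the buffer
	close(results)

	sum := 0
	for v := range results {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(squareSum(5)) // 0+1+4+9+16 = 30
}
```

With an unbuffered channel instead, `wg.Wait()` before any receive would deadlock, which is why the receive loop in hound's patch runs in the main routine when the channel is unbuffered, or the buffer is sized to len(cfg.Repos).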

Contributor Author

Ok, I'll use the standard practice then. Thanks 🙂

Member

@kellegous left a comment

Thanks for seeing this through.

@kellegous kellegous merged commit 92e228e into hound-search:master Jan 18, 2018