
searcher: make indexing of repos concurrent #272

Merged
merged 6 commits into from
Jan 18, 2018

Conversation

sahildua2305
Contributor

This patch attempts to vastly improve hound's startup time in cases where
there is a huge number of repositories; previously hound could take a long
time to start because it indexed them sequentially.

Now, startup indexing runs concurrently while respecting the
config.MaxConcurrentIndexers parameter set by users in config.json.

Fixes #250
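For reference, a minimal config.json sketch with the limiter set. The key `max-concurrent-indexers` is the one this patch respects; the surrounding keys (`dbpath`, `repos`, `url`) follow hound's documented config format but are written from memory here, so treat them as illustrative:

```json
{
  "max-concurrent-indexers" : 3,
  "dbpath" : "data",
  "repos" : {
    "Hound" : { "url" : "https://github.com/hound-search/hound.git" }
  }
}
```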

@sahildua2305
Contributor Author

It still needs some work with respect to logging for the different searchers. Right now it's impossible to make sense of the logs, which come out of order because of the concurrency.

With `"max-concurrent-indexers" : 3`, it looks like this:

2017/12/27 00:46:25 Searcher started for checkroot
2017/12/27 00:46:25 Searcher started for hackIDE2
2017/12/27 00:46:25 Searcher started for Hound2
2017/12/27 00:46:29 merge 0 files + mem
2017/12/27 00:46:29 5135 data bytes, 33552 index bytes
2017/12/27 00:46:29 Searcher started for social-media-share
2017/12/27 00:46:36 merge 0 files + mem
2017/12/27 00:46:36 25360 data bytes, 72254 index bytes
2017/12/27 00:46:36 Searcher started for Hound1
2017/12/27 00:46:37 merge 0 files + mem
2017/12/27 00:46:37 1982263 data bytes, 572876 index bytes
2017/12/27 00:46:37 Searcher started for hackIDE1
2017/12/27 00:46:47 merge 0 files + mem
2017/12/27 00:46:48 11896668 data bytes, 2077298 index bytes
2017/12/27 00:46:49 merge 0 files + mem
2017/12/27 00:46:49 1980573 data bytes, 572415 index bytes
2017/12/27 00:46:57 merge 0 files + mem
2017/12/27 00:46:58 11880955 data bytes, 2068178 index bytes
2017/12/27 00:46:58 All indexes built!
2017/12/27 00:46:58 running server at http://localhost:6080...

The rest of the implementation is good for review.

cc @kellegous

@dgryski

dgryski commented Dec 27, 2017

You can simplify the goroutine logic by having a single channel that returns a single type holding either a searcher or an error. A quick demo is here: https://play.golang.org/p/7ucBkx6MH_-. This way you know exactly how many results are coming back from the channel: one per spawned goroutine. This layout also removes the WaitGroup and the extra collector goroutine, and it makes it clear there are no data races on the maps. The current code may well be fine, but updating the maps only in the main goroutine makes that clear by inspection.
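The pattern above can be sketched as follows. This is not hound's actual code; `Searcher`, `newSearcher`, and `makeAll` are hypothetical stand-ins chosen to illustrate the single-channel layout:

```go
package main

import "fmt"

// Searcher is a hypothetical stand-in for hound's per-repo searcher.
type Searcher struct{ repo string }

// result carries either a searcher or an error back on one channel,
// exactly one per spawned goroutine.
type result struct {
	name     string
	searcher *Searcher
	err      error
}

// newSearcher is an illustrative constructor that can fail.
func newSearcher(name string) (*Searcher, error) {
	if name == "bad-repo" {
		return nil, fmt.Errorf("cannot index %s", name)
	}
	return &Searcher{repo: name}, nil
}

func makeAll(repos []string) (map[string]*Searcher, map[string]error) {
	resultCh := make(chan result, len(repos)) // buffered so no sender blocks

	for _, name := range repos {
		go func(name string) {
			s, err := newSearcher(name)
			resultCh <- result{name: name, searcher: s, err: err}
		}(name)
	}

	searchers := map[string]*Searcher{}
	errs := map[string]error{}
	// The main goroutine consumes exactly len(repos) results, so both maps
	// are only ever touched here: no data races, no WaitGroup needed.
	for i := 0; i < len(repos); i++ {
		r := <-resultCh
		if r.err != nil {
			errs[r.name] = r.err
		} else {
			searchers[r.name] = r.searcher
		}
	}
	return searchers, errs
}

func main() {
	s, e := makeAll([]string{"hound", "hackIDE", "bad-repo"})
	fmt.Println(len(s), len(e))
}
```

Because each goroutine sends exactly one value, counting receives in the main routine replaces both the WaitGroup and the collector goroutine.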

For the issue with interleaved log lines, you'd need to thread the repository name into the logging code so that each log line is prefixed appropriately. This could be done by creating separate log.Logger instances and calling SetPrefix, instead of using the standard global logger.
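A minimal sketch of the per-repo logger idea, using the standard library's `log.New` with a prefix (hound's actual logging call sites may differ):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
)

// prefixedLogs gives each repo its own log.Logger with a prefix so that
// interleaved lines remain attributable to a repository.
func prefixedLogs(repos []string) string {
	var buf bytes.Buffer
	for _, repo := range repos {
		// Flags set to 0 here to keep the example output deterministic;
		// hound would keep its timestamp flags.
		logger := log.New(&buf, "["+repo+"] ", 0)
		logger.Print("Searcher started")
	}
	return buf.String()
}

func main() {
	fmt.Print(prefixedLogs([]string{"hound", "hackIDE"}))
}
```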

@kellegous
Member

I agree with @dgryski. The structure he proposes is almost exactly what I had in mind. Specifically, I like to think of this as N processes that run independently (ignoring the limiter) and return their results (either good or bad) over a single channel. The main orchestrating routine consumes all N results in sequence and builds the searcher map before returning.

@sahildua2305
Contributor Author

Thanks for dropping in. Great suggestions, @dgryski! I'll make changes accordingly.

Change the concurrent searcher init routine to simplify the
implementation: use a single common channel to return both successfully
created searchers and errors. This also removes the goroutine that
collected results from the channel; the main routine now collects the
results itself, blocking on the channel.

`for range cfg.Repos` is a syntax error in Go 1.3, so iterate with
`len(cfg.Repos)` instead.
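The compatibility workaround can be sketched like this (`indexAll` and the repo list are illustrative, not hound's API):

```go
package main

import "fmt"

// indexAll spawns work per repo. Go 1.4+ allows a bare `for range repos`
// with no loop variables, but Go 1.3 rejects that form, so the patch
// counts iterations with len() instead.
func indexAll(repos []string) int {
	started := 0
	for i := 0; i < len(repos); i++ {
		started++ // spawn one indexer per repo here
	}
	return started
}

func main() {
	fmt.Println(indexAll([]string{"hound", "hackIDE", "checkroot"}))
}
```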
```diff
@@ -287,7 +287,7 @@ func MakeAll(cfg *config.Config) (map[string]*Searcher, map[string]error, error)
 	lim := makeLimiter(cfg.MaxConcurrentIndexers)

 	// Channel to receive the results from newSearcherConcurrent function.
-	resultCh := make(chan searcherResult, 1)
+	resultCh := make(chan searcherResult)
```
Contributor Author

@dgryski IMO, making this channel unbuffered won't solve the problem of blocking writes from the goroutines. How is an unbuffered channel better than a buffered channel of length 1?

A buffered channel of length len(cfg.Repos), however, would make sense.


Yes, I meant more that having a buffer size of 1 didn't make sense. Either make it totally unbuffered (so they all block), or with enough space for them all to put the results.

Contributor Author

Ok, since we aren't doing any heavy work after receiving on the channel, I feel it's okay to let the goroutines block when sending on the channel. What do you think? Do you have a preference?


Standard practice for launching a set of goroutines with a response channel is to have it buffered with the number of known entries so that none of them block.
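A small self-contained demonstration of that practice. The names here (`squareSum`, the squaring work) are invented for the example; the point is only that a channel buffered to the number of senders lets every goroutine finish without a receiver running:

```go
package main

import (
	"fmt"
	"sync"
)

// squareSum launches n goroutines that each send one result. Because the
// channel is buffered with capacity n, no sender ever blocks, even before
// the receiver starts draining.
func squareSum(n int) int {
	results := make(chan int, n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results <- i * i // never blocks: buffer has room for every result
		}(i)
	}
	wg.Wait() // returns promptly; every send completed into the buffer
	close(results)

	sum := 0
	for v := range results {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(squareSum(5)) // 0+1+4+9+16 = 30
}
```

With an unbuffered channel instead, `wg.Wait()` before any receive would deadlock, which is why the receive loop in hound's patch runs in the main routine when the channel is unbuffered, or the buffer is sized to len(cfg.Repos).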

Contributor Author

Ok, I'll use the standard practice then. Thanks 🙂

Member

@kellegous left a comment

Thanks for seeing this through.

@kellegous kellegous merged commit 92e228e into hound-search:master Jan 18, 2018