How to get all documents? #144

Closed
sfeast opened this issue Mar 30, 2022 · 8 comments

Comments

@sfeast

sfeast commented Mar 30, 2022

Is there a way to get all documents returned as results?

For example:

miniSearch.search("")

returns an empty array, but I'm looking for a way to get the opposite, all documents.

A use case I have is that I want to only filter by a numeric range in some cases. Something like this:

// get all documents with val property >= minVal
miniSearch.search('', {
    filter: (result) => {
        return result.val >= minVal
    }
})

but that currently returns nothing since no results are given to the filter.

I know it's not the best use of this library, as mentioned in #119 (comment); however, it's just one of several scenarios I'm using it for, and it would be great to be able to leverage it for this as well.

& awesome library btw 🙌 🙏

@sfeast sfeast changed the title How to return all documents How to get all documents Mar 30, 2022
@sfeast sfeast changed the title How to get all documents How to get all documents? Mar 30, 2022
@lucaong
Owner

lucaong commented Mar 31, 2022

Hi @sfeast ,
at the moment there is no built-in way to return all documents. I am evaluating a possibility, so it might be provided as a feature in the near future, but unfortunately not yet.

In general, it's often easier to filter outside of MiniSearch if you do not need to perform a full-text search. Something like:

documents.filter((doc) => doc.val >= minVal)

That said, if you really want to do that within MiniSearch, one way to do that is by creating a dummy field that always has a certain value:

const miniSearch = new MiniSearch({
  fields: ['dummy', 'val', /* ...your other fields here */],
  storeFields: ['val'],
  extractField: (document, fieldName) => {
    // Create a dummy field that always has the value "xxx"
    if (fieldName === 'dummy') { return 'xxx' }
    return document[fieldName]
  }
})

// Searching for "xxx" in field "dummy" should return all documents
miniSearch.search('xxx', {
  fields: ['dummy'],
  filter: (result) => result.val >= minVal
})

If, instead of xxx, you use a value that is guaranteed not to appear in any other field (say, a specific random alphanumeric string), you could even avoid restricting the search to the dummy field.

Now, this is admittedly a little hacky, but it should get it done.

@sfeast
Author

sfeast commented Mar 31, 2022

Thanks @lucaong - both current options are workable for me & I appreciate the detailed example 🙇

One last question - with this approach:

documents.filter((doc) => doc.val >= minVal)

documents here would be my own copy of the documents right? ie there's no way to get that from MiniSearch directly? Just trying to avoid keeping my own copy if possible.

@lucaong
Owner

lucaong commented Mar 31, 2022

Happy to help :)

Yes, that’s correct, in the first option that would be your own copy of the documents.

@lucaong lucaong closed this as completed Apr 1, 2022
@samuelstroschein

I am looking for an alternative to fuse.js that optionally returns the original list given an empty search query "". The need for such a feature seems big, see krisk/Fuse#229 (PS your chance ;) )

@lucaong
Owner

lucaong commented May 1, 2022

Hi @samuelstroschein ,
thanks for your comment.
The way I would implement a solution for such a need is:

const documents = [/* your documents…*/]

const docsById = documents.reduce((byId, doc) => {
  byId[doc.id] = doc
  return byId
}, {})

const miniSearch = new MiniSearch({
  fields: [/* your fields… */]
})

const search = (query, options = {}) => {
  if (query == null) { return [] }
  if (query.trim().length === 0) {
    return documents
  } else {
    return miniSearch.search(query, options).map((result) => docsById[result.id])
  }
}

// Usage
search('')
// => …all documents 

search('something')
// => documents matching 'something'

As you can see, I had to make some choices that depend on the use case, such as:

  • A search query containing only spaces is considered empty
  • A null search query is not the same as an empty search query
  • The search function returns the matching documents in order of relevance, as opposed to search results (with match info, etc.)
  • When passed an empty query, the search function returns the original documents in their original order

Each of these decisions could vary depending on the specific needs of a project. Therefore, also considering that it is simple to define such a wrapper function, I think it is better that MiniSearch does not offer “natively” the option to return all results on empty search, and leaves the choice of implementation details to the developers.
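The choices listed above could also be made explicit as parameters of a small factory. The following is only a sketch under stated assumptions: `createSearch`, `isEmptyQuery`, and `onEmptyQuery` are hypothetical names, not part of MiniSearch, and `index` is assumed to be any object with a `search(query, options)` method whose results carry an `id`. A MiniSearch instance would fit that shape; a stub is used here so the snippet runs on its own. Note that, unlike the example above, the default here treats a null query as empty.

```javascript
// Hypothetical helper, not part of MiniSearch: wraps any index exposing
// search(query, options) and makes the empty-query policy explicit.
const createSearch = ({
  documents,
  index,
  // Assumption: null/undefined and whitespace-only queries count as empty
  isEmptyQuery = (query) => query == null || query.trim().length === 0,
  // Assumption: an empty query returns the documents in their original order
  onEmptyQuery = (docs) => docs
}) => {
  const docsById = new Map(documents.map((doc) => [doc.id, doc]))

  return (query, options = {}) => {
    if (isEmptyQuery(query)) { return onEmptyQuery(documents) }
    // Map search results back to the original documents, in result order
    return index.search(query, options).map((result) => docsById.get(result.id))
  }
}

const documents = [
  { id: 1, title: 'Moby Dick' },
  { id: 2, title: 'Neuromancer' }
]

// Stub standing in for a MiniSearch instance, for illustration only
const stubIndex = {
  search: (query) => documents
    .filter((doc) => doc.title.toLowerCase().includes(query.toLowerCase()))
    .map((doc) => ({ id: doc.id }))
}

const search = createSearch({ documents, index: stubIndex })
```

Calling `search('')` or `search('   ')` would then return the original list, while `search('moby')` goes through the index. A project that wants a null query to return nothing could pass its own `isEmptyQuery` and handle null before calling the wrapper.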

@samuelstroschein

samuelstroschein commented May 1, 2022

@lucaong

I had to make some choices that depend on the use case

Implement the option with a callback instead of a boolean flag.

const search = new MiniSearch({
  fields: [/* your fields… */],
  returnOriginalDocumentsWhen: (searchQuery) => searchQuery === ""
})

I think it is better that MiniSearch does not offer “natively” the option to return all results on empty search, and leaves the choice of implementation details to the developers.

Hmm, I mean that's why people install an npm package in the first place. I don't want to write code. The discussion in the other package is quite big, indicating that a lot of people want that feature.

Edit

Actually what it really needs is just a filter instead of search function.

minisearch.filter('search query')

@lucaong
Owner

lucaong commented May 2, 2022

Implement the option with a callback instead of a boolean flag.

It's unfortunately more complicated than that. For example, how should documents be sorted? It would seem reasonable to return them in the original order, but what if one defines a boostDocument function? Then it makes more sense to compute the boost for each document and re-sort them. But since the original list is static, a smart developer would prefer to pre-sort the list only once, and skip the search-time boosting calculation when returning all documents.
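The pre-sorting idea could look roughly like the sketch below. It assumes a hypothetical numeric `popularity` field as the static boost factor; `boostFactor`, `allArticlesSorted`, and `searchOrAll` are illustrative names, not MiniSearch API. The list is sorted once, up front, and the empty-query path returns it without doing any boosting work at search time.

```javascript
const articles = [
  { id: 1, title: 'Moby Dick', popularity: 10 },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', popularity: 80 },
  { id: 3, title: 'Neuromancer', popularity: 40 }
]

// Hypothetical static boost: depends only on the document, not on the query
const boostFactor = (doc) => 1 + doc.popularity / 100

// Sort a copy once, highest boost first; the original order stays untouched
const allArticlesSorted = [...articles].sort(
  (a, b) => boostFactor(b) - boostFactor(a)
)

// runSearch is whatever performs the real full-text search for non-empty queries
const searchOrAll = (query, runSearch) =>
  query.trim().length === 0 ? allArticlesSorted : runSearch(query)
```

The trade-off is that the pre-sorted list has to be rebuilt whenever documents are added or removed, which is one more decision a built-in option would have to make on the developer's behalf.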

Similarly, since MiniSearch returns an array of SearchResult, not documents, when returning all results it would have to first map each document into a search result. But depending on the use case, developers might map results back to documents (like I did in my example before). In that case, it's a lot more efficient to avoid mapping to SearchResult[] in the first place (especially as it maps the whole collection, potentially tens of thousands of documents).

Moreover, at the moment MiniSearch does not keep a reference to the original collection of documents, so it cannot return it. This is by choice: it is possible to make some documents searchable without storing the document itself in memory.

Of course, it is theoretically possible to implement options for each of these choices, but that would make the API surface huge and hard to learn. Instead, these details are better defined in code. The reason why code is better than configuration in this case is that configuration is something that has to be learned for each and every library, while code is general purpose: for a configuration option to be ergonomic, it has to save the developer a non-trivial amount of code or cognitive load. If it generates more open questions, it is not worth it, because learning all the implications takes more effort than taking control of the issue with code.

Hmm, I mean that's why people install an npm package in the first place. I don't want to write code.

I would say, one does not want to write code at the wrong level of abstraction. What I mean is: even when using a library, one does have to write code. The point is that one normally prefers to avoid writing code that pertains to the internal details of the problem solved by the library, and instead focus on code pertaining to the higher level goal of the application.

Therefore, a library has to choose its own boundaries and goals. MiniSearch, as its design document outlines, "enables developers to build [turn-key opinionated solutions] on top of its core API, but does not provide them out of the box". MiniSearch takes care, for example, of implementation details of the inverted index or of the document scoring, but it leaves to the developers the responsibility to write code that defines their specific full-text search problem.

It would be absolutely appropriate to build a library on top of MiniSearch that makes some of these decisions and builds a higher level of abstraction. That would save developers from writing some code, but also restrict their options. For developers that have those specific needs, such a library would facilitate things. MiniSearch itself, though, also has to enable developers that have different needs. In other words, your request is completely legitimate; it just lies outside of MiniSearch's self-assigned boundaries of abstraction.

The discussion in the other package is quite big, indicating that a lot of people want that feature.

I understand and respect the fact that many people have this need. As a matter of fact, even some of my own apps have the same need. But apart from using MiniSearch in my production applications, I do not profit from MiniSearch: my motivation in maintaining it stems from the satisfaction of what I consider a well-crafted piece of software. I am happy if more people use it, because it means that it is solving more problems than it was originally conceived for, but I would not sacrifice the solidity of its design for popularity. By open-sourcing my library, I get to keep the satisfaction of crafting software the way I consider best, without having to sacrifice it to chase more users. Users, in turn, get the freedom to use my library, and to create applications or other libraries on top of it.

In sum, I do agree with you that yours is a common need. My opinion, though, is that such a need is better served by writing a thin layer of code, like the example I provided, than by adding more configuration options. But it is perfectly reasonable to disagree with that, and such a thin layer can be packaged in a library for convenience.

@samuelstroschein

samuelstroschein commented May 2, 2022

@lucaong

Thank you for the in-depth reply and explanation. I overlooked the stated goal of "[...] enables developers to build [turn-key opinionated solutions] on top of its core API, but does not provide them out of the box" and was looking for (expecting) a drop-in replacement for fuse.js.


On a side note, I have a question regarding your i18n workflow at megaloop. Can you send me a DM on Twitter or via email?
