when file is large, seek is very slow #155

Open
observerss opened this issue Mar 29, 2023 · 1 comment

Comments

@observerss

In stream.py, the seek function is:

    def seek(self, offset: int, whence: int = 0) -> int:  # noqa: C901
        """Seek the file object."""
        if whence == 0:
            loc = offset
        elif whence == 1:
            if offset >= 0:
                self.read(offset)
                return self.loc
            loc = self.loc + offset
        elif whence == 2:
            if not self.size:
                raise ValueError("cannot seek to the end of file")
            loc = self.size + offset
        else:
            raise ValueError(f"invalid whence ({whence}, should be 0, 1 or 2)")
        if loc < 0:
            raise ValueError("Seek before start of file")
        if loc and not self.supports_ranges:
            raise ValueError("server does not support ranges")

        self.close()
        self._cm = iter_url(self.client, self.url, pos=loc, chunk_size=self.chunk_size)
        #  pylint: disable=no-member
        _, self._iterator = self._cm.__enter__()
        self.loc = loc
        return loc

When whence == 1 and offset >= 0, seek reads forward through the stream until it reaches the target offset:

            if offset >= 0:
                self.read(offset)
                return self.loc
            loc = self.loc + offset

Seeking 1 GB forward therefore reads 1 GB of content first, which is very inefficient.
If I comment out the if statement, the seek operation still works: it creates a new iterator and uses the Range header to jump straight to the target position.
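
To make the cost difference concrete, here is a minimal sketch that uses an in-memory stand-in for the remote file. `FakeRangeStream`, `seek_by_reading`, `seek_by_range` and the `transferred` counter are hypothetical names made up for this illustration and are not part of stream.py; the sketch only counts how many bytes each approach would pull.

```python
class FakeRangeStream:
    """Toy stand-in for a ranged download; counts the bytes pulled."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.loc = 0
        self.transferred = 0  # bytes "downloaded" so far

    def read(self, n: int) -> bytes:
        chunk = self._data[self.loc : self.loc + n]
        self.loc += len(chunk)
        self.transferred += len(chunk)
        return chunk

    def seek_by_reading(self, offset: int) -> int:
        # Mirrors the whence == 1, offset >= 0 branch: read and discard.
        self.read(offset)
        return self.loc

    def seek_by_range(self, offset: int) -> int:
        # Mirrors reopening the iterator at the new position (a Range
        # request in the real code): nothing in between is transferred.
        self.loc += offset
        return self.loc


data = bytes(1_000_000)

a = FakeRangeStream(data)
a.seek_by_reading(900_000)

b = FakeRangeStream(data)
b.seek_by_range(900_000)

print(a.transferred, b.transferred)  # 900000 0
```

Both streams end up at the same position, but only the first one transferred the skipped bytes.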

@skshetry
Owner

I think it was added on the assumption that SEEK_CUR offsets are small and may already be cached in our buffer, and because I wanted to avoid resetting the iterator as much as possible (not all WebDAV servers support ranges).

Feel free to propose a PR. 🙂
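
One possible shape for such a PR, sketched under assumptions: keep reading forward for small relative offsets or when the server lacks range support, and reopen with a Range request otherwise. The helper name `choose_seek_strategy` and the 1 MiB default for `read_threshold` are hypothetical and not taken from the library; a sensible threshold would need measuring.

```python
def choose_seek_strategy(
    offset: int,
    supports_ranges: bool,
    read_threshold: int = 1 << 20,  # assumed cut-off, not from the library
) -> str:
    """Pick how to serve seek(offset, whence=1): 'read' or 'reopen'."""
    if offset < 0:
        # A backward seek cannot be served by reading forward; the
        # existing range check would still apply after this decision.
        return "reopen"
    if not supports_ranges or offset <= read_threshold:
        # Small hops are cheap to read through (and may be buffered),
        # and without range support reading is the only way forward.
        return "read"
    return "reopen"


print(choose_seek_strategy(4096, supports_ranges=True))      # read
print(choose_seek_strategy(1 << 30, supports_ranges=True))   # reopen
print(choose_seek_strategy(1 << 30, supports_ranges=False))  # read
```

This keeps the original behaviour for the small-offset case the check was written for, while large forward seeks no longer download everything in between.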
