diff --git a/README.ipynb b/README.ipynb index ef25721..f90e3ab 100644 --- a/README.ipynb +++ b/README.ipynb @@ -50,9 +50,10 @@ " numbers = ''\n", " return f'[{title}]({file}{numbers})'\n", "\n", - "dmd = lambda x: display(md(x))\n", + "dmd = lambda x: display(md(x.strip()))\n", "\n", "import cachew\n", + "import cachew.extra\n", "import cachew.experimental\n", "import cachew.tests.test_cachew as tests" ] @@ -82,18 +83,19 @@ } }, "source": [ - "[![CircleCI](https://circleci.com/gh/karlicoss/cachew.svg?style=svg)](https://circleci.com/gh/karlicoss/cachew)\n", - "\n", "# What is Cachew?\n", "TLDR: cachew lets you **cache function calls** into an sqlite database on your disk in a matter of **single decorator** (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.\n", "Cache is **invalidated automatically** if your function's arguments change, so you don't have to think about maintaining it.\n", "\n", - "In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is generator, tuple or list) of simple data types:\n", + "In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is a generator, tuple or list) of simple data types:\n", "\n", "- primitive types: `str`/`int`/`float`/`datetime`\n", + "- JSON-like types\n", + "- `Exception` (useful for [error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss) )\n", "- [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple)\n", "- [dataclasses](https://docs.python.org/3/library/dataclasses.html)\n", "\n", + "\n", "That allows to **automatically infer schema from type hints** ([PEP 
526](https://www.python.org/dev/peps/pep-0526)) and not think about serializing/deserializing.\n", "\n", "## Motivation\n", @@ -153,7 +155,7 @@ "\n", "I'm using an [environment sensor](https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger) to log stats about temperature and humidity.\n", "Data is synchronized via bluetooth in the sqlite database, which is easy to access. However sensor has limited memory (e.g. 1000 latest measurements).\n", - "That means that I end up with a new database every few days which contains, each of them containing only slice of data I need: e.g.:\n", + "That means that I end up with a new database every few days, each of them containing only a slice of data I need, e.g.:\n", "\n", " ...\n", " 20190715100026.db\n", @@ -182,9 +184,9 @@ " - requires careful scheduling, ideally you want to access new data without having to refresh cache\n", "\n", " \n", - "Cachew gives me best of two worlds and makes it **easy and efficient**. Only thing you have to do is to decorate your function:\n", + "Cachew gives the best of two worlds and makes it both **easy and efficient**. 
The only thing you have to do is to decorate your function:\n", "\n", - " @cachew(\"/data/cache/measurements.sqlite\") \n", + " @cachew \n", " def measurements(chunks: List[Path]) -> Iterator[Measurement]:\n", " # ...\n", " \n", @@ -247,11 +249,12 @@ "* supported types: \n", "\n", " * primitive: {', '.join(types)}\n", + " \n", + " See {flink('tests.test_types')}, {flink('tests.test_primitive')}, {flink('tests.test_dates')}\n", " * {flink('Optional', 'tests.test_optional')} types\n", " * {flink('Union', 'tests.test_union')} types\n", " * {flink('nested datatypes', 'tests.test_nested')}\n", - " * {flink('Exceptions', 'tests.test_exceptions')} (experimental, enabled by calling {flink('`cachew.experimental.enable_exceptions`')})\n", - " {cachew.experimental.enable_exceptions.__doc__.replace(' ', ' ' * 7)}\n", + " * {flink('Exceptions', 'tests.test_exceptions')}\n", " \n", "* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically \n", "\"\"\")\n", @@ -267,7 +270,18 @@ "\n", "During reading cache all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead would be from reading sqlite, which is quite fast.\n", "\n", - "I haven't set up formal benchmarking/regression tests yet, so don't want to make specific claims, however that would almost certainly make your programm faster if computations take more than several seconds." + "I haven't set up proper benchmarks/performance regressions yet, so don't want to make specific claims, however that would almost certainly make your programm faster if computations take more than several seconds." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dmd(f'''\n", + "If you want to experiment for youself, check out {flink('tests.test_many')}\n", + "''')" ] }, { @@ -288,19 +302,21 @@ "See {flink('docstring', 'cachew.cachew')} for up-to-date documentation on parameters and return types. \n", "You can also use {flink('extensive unit tests', 'tests')} as a reference.\n", " \n", - "Some useful arguments of `@cachew` decorator:\n", + "Some useful (but optional) arguments of `@cachew` decorator:\n", " \n", - "* `cache_path` can be a filename, or you can specify a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.\n", - " \n", - " It's not required to specify the path (it will be created in `/tmp`) but recommended.\n", + "* `cache_path` can be a directory, or a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.\n", + " \n", + " By default, `settings.DEFAULT_CACHEW_DIR` is used.\n", " \n", - "* `hashf` is a function that determines whether your arguments have changed.\n", + "* `depends_on` is a function which determines whether your inputs have changed, and the cache needs to be invalidated.\n", " \n", " By default it just uses string representation of the arguments, you can also specify a custom callable.\n", " \n", " For instance, it can be used to {flink('discard cache', 'tests.test_custom_hash')} if the input file was modified.\n", " \n", - "* `cls` is the type that would be serialized. It is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache. \n", + "* `cls` is the type that would be serialized.\n", + "\n", + " By default, it is inferred from return type annotations, but can be specified if you don't control the code you want to cache. 
\n", "\"\"\")" ] }, @@ -316,10 +332,10 @@ "# Installing\n", "Package is available on [pypi](https://pypi.org/project/cachew/).\n", "\n", - " pip install cachew\n", + " pip3 install --user cachew\n", " \n", "## Developing\n", - "I'm using [tox](tox.ini) to run tests, and [circleci](.circleci/config.yml)." + "I'm using [tox](tox.ini) to run tests, and [Github Actions](.github/workflows/main.yml) for CI." ] }, { @@ -389,9 +405,9 @@ "metadata": {}, "outputs": [], "source": [ - "import cachew.misc\n", + "import cachew.extra\n", "dmd(f\"\"\"```python\n", - "{inspect.getsource(cachew.misc.mcachew)}\n", + "{inspect.getsource(cachew.extra.mcachew)}\n", "```\"\"\")" ] }, @@ -399,7 +415,34 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing." + "Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing.\n", + "\n", + "## Settings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dmd(f'''\n", + "{flink('cachew.settings')} exposes some parameters that allow you to control `cachew` behaviour:\n", + "- `ENABLE`: set to `False` if you want to disable caching without removing the decorators (useful for testing and debugging).\n", + " You can also use {flink('cachew.extra.disabled_cachew')} context manager to do it temporarily.\n", + "- `DEFAULT_CACHEW_DIR`: override to set a different base directory.\n", + "- `THROW_ON_ERROR`: by default, cachew is defensive and simply attempts to call the original function on caching issues.\n", + " Set to `True` to catch errors earlier.\n", + "\n", + "''')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Updating this readme\n", + "This is a literate readme, implemented as a Jupyter notebook: [README.ipynb](README.ipynb). 
To update the (autogenerated) [README.md](README.md), use [generate-readme](generate-readme) script." ] } ], @@ -420,7 +463,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.3" + "version": "3.8.2" }, "name": "README.ipynb" }, diff --git a/README.md b/README.md index fd1204d..e7ce643 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,19 @@ - - - + # What is Cachew? TLDR: cachew lets you **cache function calls** into an sqlite database on your disk in a matter of **single decorator** (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache. Cache is **invalidated automatically** if your function's arguments change, so you don't have to think about maintaining it. -In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is generator, tuple or list) of simple data types: +In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is a generator, tuple or list) of simple data types: - primitive types: `str`/`int`/`float`/`datetime` +- JSON-like types +- `Exception` (useful for [error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss) ) - [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple) - [dataclasses](https://docs.python.org/3/library/dataclasses.html) + That allows to **automatically infer schema from type hints** ([PEP 526](https://www.python.org/dev/peps/pep-0526)) and not think about serializing/deserializing. ## Motivation @@ -73,7 +74,7 @@ This is my most common usecase of cachew, which I'll illustrate with example. 
I'm using an [environment sensor](https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger) to log stats about temperature and humidity. Data is synchronized via bluetooth in the sqlite database, which is easy to access. However sensor has limited memory (e.g. 1000 latest measurements). -That means that I end up with a new database every few days which contains, each of them containing only slice of data I need: e.g.: +That means that I end up with a new database every few days, each of them containing only a slice of data I need, e.g.: ... 20190715100026.db @@ -102,9 +103,9 @@ To access **all** of historic temperature data, I have two options: - requires careful scheduling, ideally you want to access new data without having to refresh cache -Cachew gives me best of two worlds and makes it **easy and efficient**. Only thing you have to do is to decorate your function: +Cachew gives the best of two worlds and makes it both **easy and efficient**. The only thing you have to do is to decorate your function: - @cachew("/data/cache/measurements.sqlite") + @cachew def measurements(chunks: List[Path]) -> Iterator[Measurement]: # ... @@ -114,12 +115,11 @@ Cachew gives me best of two worlds and makes it **easy and efficient**. Only thi All the complexity of handling database is hidden in `cachew` implementation. - # How it works -Basically, your data objects get [flattened out](src/cachew/__init__.py#L356) -and python types are mapped [onto sqlite types and back](src/cachew/__init__.py#L426). +Basically, your data objects get [flattened out](src/cachew/__init__.py#L444) +and python types are mapped [onto sqlite types and back](src/cachew/__init__.py#L514). 
-When the function is called, cachew [computes the hash of your function's arguments ](src/cachew/__init__.py:#L690) +When the function is called, cachew [computes the hash of your function's arguments ](src/cachew/__init__.py:#L844) and compares it against the previously stored hash value. - If they match, it would deserialize and yield whatever is stored in the cache database @@ -127,30 +127,22 @@ and compares it against the previously stored hash value. - # Features - -* automatic schema inference: [1](src/cachew/tests/test_cachew.py#L200), [2](src/cachew/tests/test_cachew.py#L214) +* automatic schema inference: [1](src/cachew/tests/test_cachew.py#L275), [2](src/cachew/tests/test_cachew.py#L289) * supported types: - * primitive: `str`, `int`, `float`, `bool`, `datetime`, `date`, `dict` - * [Optional](src/cachew/tests/test_cachew.py#L340) types - * [Union](src/cachew/tests/test_cachew.py#L521) types - * [nested datatypes](src/cachew/tests/test_cachew.py#L256) - * [Exceptions](src/cachew/tests/test_cachew.py#L644) (experimental, enabled by calling [`cachew.experimental.enable_exceptions`](src/cachew/__init__.py#L28)) - - Enables support for caching Exceptions. Exception arguments are going to be serialized as strings. - - It's useful for defensive error handling, in case of cachew in particular for preserving error state. - - I elaborate on it here: [mypy-driven error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss). 
- + * primitive: `str`, `int`, `float`, `bool`, `datetime`, `date`, `dict`, `Exception` -* detects [datatype schema changes](src/cachew/tests/test_cachew.py#L286) and discards old data automatically - + See [tests.test_types](src/cachew/tests/test_cachew.py#L555), [tests.test_primitive](src/cachew/tests/test_cachew.py#L600), [tests.test_dates](src/cachew/tests/test_cachew.py#L515) + * [Optional](src/cachew/tests/test_cachew.py#L414) types + * [Union](src/cachew/tests/test_cachew.py#L670) types + * [nested datatypes](src/cachew/tests/test_cachew.py#L331) + * [Exceptions](src/cachew/tests/test_cachew.py#L912) + +* detects [datatype schema changes](src/cachew/tests/test_cachew.py#L361) and discards old data automatically # Performance @@ -158,37 +150,41 @@ Updating cache takes certain overhead, but that would depend on how complicated During reading cache all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead would be from reading sqlite, which is quite fast. -I haven't set up formal benchmarking/regression tests yet, so don't want to make specific claims, however that would almost certainly make your programm faster if computations take more than several seconds. +I haven't set up proper benchmarks/performance regressions yet, so don't want to make specific claims, however that would almost certainly make your programm faster if computations take more than several seconds. + + +If you want to experiment for youself, check out [tests.test_many](src/cachew/tests/test_cachew.py#L220) # Using -See [docstring](src/cachew/__init__.py#L588) for up-to-date documentation on parameters and return types. +See [docstring](src/cachew/__init__.py#L708) for up-to-date documentation on parameters and return types. You can also use [extensive unit tests](src/cachew/tests/test_cachew.py) as a reference. 
-Some useful arguments of `@cachew` decorator: +Some useful (but optional) arguments of `@cachew` decorator: -* `cache_path` can be a filename, or you can specify a callable that [returns a path](src/cachew/tests/test_cachew.py#L236) and depends on function's arguments. - - It's not required to specify the path (it will be created in `/tmp`) but recommended. +* `cache_path` can be a directory, or a callable that [returns a path](src/cachew/tests/test_cachew.py#L311) and depends on function's arguments. + + By default, `settings.DEFAULT_CACHEW_DIR` is used. -* `hashf` is a function that determines whether your arguments have changed. +* `depends_on` is a function which determines whether your inputs have changed, and the cache needs to be invalidated. By default it just uses string representation of the arguments, you can also specify a custom callable. - For instance, it can be used to [discard cache](src/cachew/tests/test_cachew.py#L66) if the input file was modified. + For instance, it can be used to [discard cache](src/cachew/tests/test_cachew.py#L91) if the input file was modified. -* `cls` is the type that would be serialized. It is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache. +* `cls` is the type that would be serialized. + By default, it is inferred from return type annotations, but can be specified if you don't control the code you want to cache. # Installing Package is available on [pypi](https://pypi.org/project/cachew/). - pip install cachew + pip3 install --user cachew ## Developing -I'm using [tox](tox.ini) to run tests, and [circleci](.circleci/config.yml). +I'm using [tox](tox.ini) to run tests, and [Github Actions](.github/workflows/main.yml) for CI. # Implementation @@ -256,3 +252,17 @@ def mcachew(*args, **kwargs): Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing. 
+ +## Settings + + +[cachew.settings](src/cachew/__init__.py#L61) exposes some parameters that allow you to control `cachew` behaviour: +- `ENABLE`: set to `False` if you want to disable caching without removing the decorators (useful for testing and debugging). + You can also use [cachew.extra.disabled_cachew](src/cachew/__init__.py#L18) context manager to do it temporarily. +- `DEFAULT_CACHEW_DIR`: override to set a different base directory. +- `THROW_ON_ERROR`: by default, cachew is defensive and simply attempts to call the original function on caching issues. + Set to `True` to catch errors earlier. + + +## Updating this readme +This is a literate readme, implemented as a Jupyter notebook: [README.ipynb](README.ipynb). To update the (autogenerated) [README.md](README.md), use [generate-readme](generate-readme) script.
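
Note on the "How it works" text in the README diff above: the described flow (hash the function's arguments, compare against the stored hash, then either read from the cache or recompute and store) can be sketched with the standard library alone. This is only an illustration of the idea, not cachew's actual implementation. The decorator name `sqlite_cached`, the table names, and the restriction to iterators of strings are all assumptions made for the example.

```python
import hashlib
import sqlite3
from typing import Callable, Iterator


def sqlite_cached(db_path: str) -> Callable:
    """Hypothetical decorator (not part of cachew): caches an iterator of
    strings in an sqlite file, keyed by a hash of the call arguments."""

    def decorator(func: Callable[..., Iterator[str]]) -> Callable[..., Iterator[str]]:
        def wrapper(*args, **kwargs) -> Iterator[str]:
            # Hash the string representation of the arguments.
            key = repr((args, sorted(kwargs.items())))
            new_hash = hashlib.sha256(key.encode()).hexdigest()

            conn = sqlite3.connect(db_path)
            try:
                conn.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
                conn.execute('CREATE TABLE IF NOT EXISTS data (item TEXT)')
                row = conn.execute('SELECT value FROM hash').fetchone()
                if row is not None and row[0] == new_hash:
                    # Hashes match: read the stored rows instead of recomputing.
                    items = [r[0] for r in conn.execute('SELECT item FROM data ORDER BY rowid')]
                else:
                    # Hash changed (or the cache is empty): recompute and overwrite.
                    items = list(func(*args, **kwargs))
                    with conn:  # commit all writes as a single transaction
                        conn.execute('DELETE FROM hash')
                        conn.execute('DELETE FROM data')
                        conn.execute('INSERT INTO hash VALUES (?)', (new_hash,))
                        conn.executemany('INSERT INTO data VALUES (?)', [(i,) for i in items])
            finally:
                conn.close()
            yield from items

        return wrapper

    return decorator
```

Decorating a generator function with `@sqlite_cached("/some/path.sqlite")` and calling it twice with the same arguments would run the underlying function only once; changing the arguments discards the stored rows. The real library additionally infers a schema from type hints, which this sketch does not attempt.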