refresh readme
karlicoss committed Oct 8, 2020
1 parent 2688e11 commit 6f3c4d5
Showing 2 changed files with 114 additions and 61 deletions.
87 changes: 65 additions & 22 deletions README.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -50,9 +50,10 @@
" numbers = ''\n",
" return f'[{title}]({file}{numbers})'\n",
"\n",
"dmd = lambda x: display(md(x))\n",
"dmd = lambda x: display(md(x.strip()))\n",
"\n",
"import cachew\n",
"import cachew.extra\n",
"import cachew.experimental\n",
"import cachew.tests.test_cachew as tests"
]
Expand Down Expand Up @@ -82,18 +83,19 @@
}
},
"source": [
"[![CircleCI](https://circleci.com/gh/karlicoss/cachew.svg?style=svg)](https://circleci.com/gh/karlicoss/cachew)\n",
"\n",
"# What is Cachew?\n",
"TLDR: cachew lets you **cache function calls** into an sqlite database on your disk in a matter of **single decorator** (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.\n",
"Cache is **invalidated automatically** if your function's arguments change, so you don't have to think about maintaining it.\n",
"\n",
"In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is generator, tuple or list) of simple data types:\n",
"In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is a generator, tuple or list) of simple data types:\n",
"\n",
"- primitive types: `str`/`int`/`float`/`datetime`\n",
"- JSON-like types\n",
"- `Exception` (useful for [error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss) )\n",
"- [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple)\n",
"- [dataclasses](https://docs.python.org/3/library/dataclasses.html)\n",
"\n",
"\n",
"That allows to **automatically infer schema from type hints** ([PEP 526](https://www.python.org/dev/peps/pep-0526)) and not think about serializing/deserializing.\n",
"\n",
"## Motivation\n",
Expand Down Expand Up @@ -153,7 +155,7 @@
"\n",
"I'm using an [environment sensor](https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger) to log stats about temperature and humidity.\n",
"Data is synchronized via bluetooth in the sqlite database, which is easy to access. However sensor has limited memory (e.g. 1000 latest measurements).\n",
"That means that I end up with a new database every few days which contains, each of them containing only slice of data I need: e.g.:\n",
"That means that I end up with a new database every few days, each of them containing only a slice of data I need, e.g.:\n",
"\n",
" ...\n",
" 20190715100026.db\n",
Expand Down Expand Up @@ -182,9 +184,9 @@
" - requires careful scheduling, ideally you want to access new data without having to refresh cache\n",
"\n",
" \n",
"Cachew gives me best of two worlds and makes it **easy and efficient**. Only thing you have to do is to decorate your function:\n",
"Cachew gives the best of two worlds and makes it both **easy and efficient**. The only thing you have to do is to decorate your function:\n",
"\n",
" @cachew(\"/data/cache/measurements.sqlite\") \n",
" @cachew \n",
" def measurements(chunks: List[Path]) -> Iterator[Measurement]:\n",
" # ...\n",
" \n",
Expand Down Expand Up @@ -247,11 +249,12 @@
"* supported types: \n",
"\n",
" * primitive: {', '.join(types)}\n",
" \n",
" See {flink('tests.test_types')}, {flink('tests.test_primitive')}, {flink('tests.test_dates')}\n",
" * {flink('Optional', 'tests.test_optional')} types\n",
" * {flink('Union', 'tests.test_union')} types\n",
" * {flink('nested datatypes', 'tests.test_nested')}\n",
" * {flink('Exceptions', 'tests.test_exceptions')} (experimental, enabled by calling {flink('`cachew.experimental.enable_exceptions`')})\n",
" {cachew.experimental.enable_exceptions.__doc__.replace(' ', ' ' * 7)}\n",
" * {flink('Exceptions', 'tests.test_exceptions')}\n",
" \n",
"* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically \n",
"\"\"\")\n",
Expand All @@ -267,7 +270,18 @@
"\n",
"When reading from the cache, all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead is from reading sqlite, which is quite fast.\n",
"\n",
"I haven't set up formal benchmarking/regression tests yet, so don't want to make specific claims, however that would almost certainly make your program faster if computations take more than several seconds."
"I haven't set up proper benchmarks/performance regressions yet, so I don't want to make specific claims; however, that would almost certainly make your program faster if computations take more than several seconds."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dmd(f'''\n",
"If you want to experiment for yourself, check out {flink('tests.test_many')}\n",
"''')"
]
},
{
Expand All @@ -288,19 +302,21 @@
"See {flink('docstring', 'cachew.cachew')} for up-to-date documentation on parameters and return types. \n",
"You can also use {flink('extensive unit tests', 'tests')} as a reference.\n",
" \n",
"Some useful arguments of `@cachew` decorator:\n",
"Some useful (but optional) arguments of `@cachew` decorator:\n",
" \n",
"* `cache_path` can be a filename, or you can specify a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.\n",
" \n",
" It's not required to specify the path (it will be created in `/tmp`) but recommended.\n",
"* `cache_path` can be a directory, or a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.\n",
" \n",
" By default, `settings.DEFAULT_CACHEW_DIR` is used.\n",
" \n",
"* `hashf` is a function that determines whether your arguments have changed.\n",
"* `depends_on` is a function which determines whether your inputs have changed, and the cache needs to be invalidated.\n",
" \n",
" By default it just uses string representation of the arguments, you can also specify a custom callable.\n",
" \n",
" For instance, it can be used to {flink('discard cache', 'tests.test_custom_hash')} if the input file was modified.\n",
" \n",
"* `cls` is the type that would be serialized. It is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache. \n",
"* `cls` is the type that would be serialized.\n",
"\n",
" By default, it is inferred from return type annotations, but can be specified if you don't control the code you want to cache. \n",
"\"\"\")"
]
},
Expand All @@ -316,10 +332,10 @@
"# Installing\n",
"Package is available on [pypi](https://pypi.org/project/cachew/).\n",
"\n",
" pip install cachew\n",
" pip3 install --user cachew\n",
" \n",
"## Developing\n",
"I'm using [tox](tox.ini) to run tests, and [circleci](.circleci/config.yml)."
"I'm using [tox](tox.ini) to run tests, and [Github Actions](.github/workflows/main.yml) for CI."
]
},
{
Expand Down Expand Up @@ -389,17 +405,44 @@
"metadata": {},
"outputs": [],
"source": [
"import cachew.misc\n",
"import cachew.extra\n",
"dmd(f\"\"\"```python\n",
"{inspect.getsource(cachew.misc.mcachew)}\n",
"{inspect.getsource(cachew.extra.mcachew)}\n",
"```\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing."
"Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing.\n",
"\n",
"## Settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dmd(f'''\n",
"{flink('cachew.settings')} exposes some parameters that allow you to control `cachew` behaviour:\n",
"- `ENABLE`: set to `False` if you want to disable caching without removing the decorators (useful for testing and debugging).\n",
" You can also use the {flink('cachew.extra.disabled_cachew')} context manager to do it temporarily.\n",
"- `DEFAULT_CACHEW_DIR`: override to set a different base directory.\n",
"- `THROW_ON_ERROR`: by default, cachew is defensive and simply attempts to call the original function on caching issues.\n",
" Set to `True` to catch errors earlier.\n",
"\n",
"''')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Updating this readme\n",
"This is a literate readme, implemented as a Jupyter notebook: [README.ipynb](README.ipynb). To update the (autogenerated) [README.md](README.md), use the [generate-readme](generate-readme) script."
]
}
],
Expand All @@ -420,7 +463,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
"version": "3.8.2"
},
"name": "README.ipynb"
},
Expand Down
88 changes: 49 additions & 39 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,19 @@



<!--THIS FILE IS AUTOGENERATED BY README.ipynb. Use generate-readme to update it.-->


# What is Cachew?
TLDR: cachew lets you **cache function calls** into an sqlite database on your disk with a **single decorator** (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.
The cache is **invalidated automatically** if your function's arguments change, so you don't have to think about maintaining it.

In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is generator, tuple or list) of simple data types:
In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is a generator, tuple or list) of simple data types:

- primitive types: `str`/`int`/`float`/`datetime`
- JSON-like types
- `Exception` (useful for [error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss) )
- [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple)
- [dataclasses](https://docs.python.org/3/library/dataclasses.html)


That allows cachew to **automatically infer the schema from type hints** ([PEP 526](https://www.python.org/dev/peps/pep-0526)), so you don't have to think about serializing/deserializing.
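
To make the idea of schema inference concrete, here is a minimal standard-library sketch. It is not cachew's implementation: the `SQLITE_TYPES` mapping, the `infer_columns` helper and the `Measurement` fields are assumptions made up for illustration. It only shows how type hints on a `NamedTuple` can drive both the table definition and the round trip through sqlite.

```python
import sqlite3
from datetime import datetime
from typing import List, NamedTuple, Tuple, get_type_hints

# Hypothetical mapping chosen for this sketch; cachew's real mapping lives in its source.
SQLITE_TYPES = {str: 'TEXT', int: 'INTEGER', float: 'REAL', bool: 'INTEGER', datetime: 'TEXT'}


class Measurement(NamedTuple):
    # Field names are illustrative, not taken from the original project.
    dt: datetime
    temp: float
    humidity: float


def infer_columns(cls: type) -> List[Tuple[str, str]]:
    """Derive (column name, sqlite type) pairs from the class's type hints."""
    return [(name, SQLITE_TYPES[tp]) for name, tp in get_type_hints(cls).items()]


columns = infer_columns(Measurement)
ddl = 'CREATE TABLE cache ({})'.format(', '.join(f'{n} {t}' for n, t in columns))

with sqlite3.connect(':memory:') as conn:
    conn.execute(ddl)
    row = Measurement(datetime(2019, 7, 15, 10, 0, 26), 21.5, 40.0)
    # datetimes are stored as ISO strings in this sketch
    conn.execute('INSERT INTO cache VALUES (?, ?, ?)', (row.dt.isoformat(), row.temp, row.humidity))
    dt, temp, humidity = conn.execute('SELECT dt, temp, humidity FROM cache').fetchone()
    restored = Measurement(datetime.fromisoformat(dt), temp, humidity)
```

Because the column list comes from the annotations, no hand-written serializer is needed in this sketch; changing the class changes the schema.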

## Motivation
Expand Down Expand Up @@ -73,7 +74,7 @@ This is my most common usecase of cachew, which I'll illustrate with example.

I'm using an [environment sensor](https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger) to log stats about temperature and humidity.
Data is synchronized via bluetooth into an sqlite database, which is easy to access. However, the sensor has limited memory (e.g. the 1000 latest measurements).
That means that I end up with a new database every few days which contains, each of them containing only slice of data I need: e.g.:
That means that I end up with a new database every few days, each of them containing only a slice of data I need, e.g.:

...
20190715100026.db
Expand Down Expand Up @@ -102,9 +103,9 @@ To access **all** of historic temperature data, I have two options:
- requires careful scheduling, ideally you want to access new data without having to refresh cache


Cachew gives me best of two worlds and makes it **easy and efficient**. Only thing you have to do is to decorate your function:
Cachew gives the best of two worlds and makes it both **easy and efficient**. The only thing you have to do is to decorate your function:

@cachew("/data/cache/measurements.sqlite")
@cachew
def measurements(chunks: List[Path]) -> Iterator[Measurement]:
# ...
Expand All @@ -114,81 +115,76 @@ Cachew gives me best of two worlds and makes it **easy and efficient**. Only thi
All the complexity of handling the database is hidden in the `cachew` implementation.



# How it works
Basically, your data objects get [flattened out](src/cachew/__init__.py#L356)
and python types are mapped [onto sqlite types and back](src/cachew/__init__.py#L426).
Basically, your data objects get [flattened out](src/cachew/__init__.py#L444)
and python types are mapped [onto sqlite types and back](src/cachew/__init__.py#L514).

When the function is called, cachew [computes the hash of your function's arguments ](src/cachew/__init__.py:#L690)
When the function is called, cachew [computes the hash of your function's arguments ](src/cachew/__init__.py:#L844)
and compares it against the previously stored hash value.

- If they match, it would deserialize and yield whatever is stored in the cache database
- If the hash mismatches, the original function is called and new data is stored along with the new hash
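
The two branches above can be sketched with the standard library alone. This is a toy illustration, not cachew's actual code: `toy_cached`, the table names and the use of SHA-256 over JSON-encoded arguments are all assumptions made for the example.

```python
import hashlib
import json
import sqlite3
from typing import Callable, Iterator, List


def toy_cached(conn: sqlite3.Connection, func: Callable[..., Iterator[int]]) -> Callable[..., List[int]]:
    """Toy illustration of the hash check; NOT cachew's actual implementation."""
    conn.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
    conn.execute('CREATE TABLE IF NOT EXISTS data (item INTEGER)')

    def wrapper(*args):
        new_hash = hashlib.sha256(json.dumps(args).encode()).hexdigest()
        row = conn.execute('SELECT value FROM hash').fetchone()
        if row is not None and row[0] == new_hash:
            # hashes match: return whatever is stored in the cache
            return [item for (item,) in conn.execute('SELECT item FROM data ORDER BY rowid')]
        # hash mismatch: call the original function, store new data and the new hash
        items = list(func(*args))
        conn.execute('DELETE FROM hash')
        conn.execute('DELETE FROM data')
        conn.executemany('INSERT INTO data VALUES (?)', [(i,) for i in items])
        conn.execute('INSERT INTO hash VALUES (?)', (new_hash,))
        return items

    return wrapper


calls = []  # records every real invocation of the underlying function


def squares(n: int) -> Iterator[int]:
    calls.append(n)
    yield from (i * i for i in range(n))


conn = sqlite3.connect(':memory:')
cached_squares = toy_cached(conn, squares)
first = cached_squares(4)   # computed and stored
second = cached_squares(4)  # same arguments: served from the cache
third = cached_squares(3)   # arguments changed: recomputed
```

After the three calls, `calls` contains only the two invocations that actually ran `squares`; the second call never reached it.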




# Features




* automatic schema inference: [1](src/cachew/tests/test_cachew.py#L200), [2](src/cachew/tests/test_cachew.py#L214)
* automatic schema inference: [1](src/cachew/tests/test_cachew.py#L275), [2](src/cachew/tests/test_cachew.py#L289)
* supported types:

* primitive: `str`, `int`, `float`, `bool`, `datetime`, `date`, `dict`
* [Optional](src/cachew/tests/test_cachew.py#L340) types
* [Union](src/cachew/tests/test_cachew.py#L521) types
* [nested datatypes](src/cachew/tests/test_cachew.py#L256)
* [Exceptions](src/cachew/tests/test_cachew.py#L644) (experimental, enabled by calling [`cachew.experimental.enable_exceptions`](src/cachew/__init__.py#L28))

Enables support for caching Exceptions. Exception arguments are going to be serialized as strings.

It's useful for defensive error handling, in case of cachew in particular for preserving error state.

I elaborate on it here: [mypy-driven error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss).

* primitive: `str`, `int`, `float`, `bool`, `datetime`, `date`, `dict`, `Exception`

* detects [datatype schema changes](src/cachew/tests/test_cachew.py#L286) and discards old data automatically

See [tests.test_types](src/cachew/tests/test_cachew.py#L555), [tests.test_primitive](src/cachew/tests/test_cachew.py#L600), [tests.test_dates](src/cachew/tests/test_cachew.py#L515)
* [Optional](src/cachew/tests/test_cachew.py#L414) types
* [Union](src/cachew/tests/test_cachew.py#L670) types
* [nested datatypes](src/cachew/tests/test_cachew.py#L331)
* [Exceptions](src/cachew/tests/test_cachew.py#L912)

* detects [datatype schema changes](src/cachew/tests/test_cachew.py#L361) and discards old data automatically


# Performance
Updating the cache incurs some overhead, but how much depends on how complicated your datatype is in the first place, so I'd suggest measuring if you're not sure.

When reading from the cache, all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead is from reading sqlite, which is quite fast.

I haven't set up formal benchmarking/regression tests yet, so don't want to make specific claims, however that would almost certainly make your program faster if computations take more than several seconds.
I haven't set up proper benchmarks/performance regressions yet, so I don't want to make specific claims; however, that would almost certainly make your program faster if computations take more than several seconds.


If you want to experiment for yourself, check out [tests.test_many](src/cachew/tests/test_cachew.py#L220)



# Using
See [docstring](src/cachew/__init__.py#L588) for up-to-date documentation on parameters and return types.
See [docstring](src/cachew/__init__.py#L708) for up-to-date documentation on parameters and return types.
You can also use [extensive unit tests](src/cachew/tests/test_cachew.py) as a reference.

Some useful arguments of `@cachew` decorator:
Some useful (but optional) arguments of `@cachew` decorator:

* `cache_path` can be a filename, or you can specify a callable that [returns a path](src/cachew/tests/test_cachew.py#L236) and depends on function's arguments.

It's not required to specify the path (it will be created in `/tmp`) but recommended.
* `cache_path` can be a directory, or a callable that [returns a path](src/cachew/tests/test_cachew.py#L311) and depends on function's arguments.
By default, `settings.DEFAULT_CACHEW_DIR` is used.

* `hashf` is a function that determines whether your arguments have changed.
* `depends_on` is a function which determines whether your inputs have changed, and the cache needs to be invalidated.

By default it just uses string representation of the arguments, you can also specify a custom callable.

For instance, it can be used to [discard cache](src/cachew/tests/test_cachew.py#L66) if the input file was modified.
For instance, it can be used to [discard cache](src/cachew/tests/test_cachew.py#L91) if the input file was modified.

* `cls` is the type that would be serialized. It is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache.
* `cls` is the type that would be serialized.

By default, it is inferred from return type annotations, but can be specified if you don't control the code you want to cache.
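
As an illustration of the "discard cache if the input file was modified" use of `depends_on`, the sketch below builds a value that changes whenever a file changes. The `file_state` helper is hypothetical, and the decorator usage in the comment is an assumption about how such a callable would be passed; check the docstring linked above for the exact contract.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def file_state(path: Path) -> Tuple[str, int, int]:
    """Hypothetical helper: a value that changes whenever the input file changes."""
    st = path.stat()
    return (str(path), st.st_size, st.st_mtime_ns)


# Assumed usage, requires cachew to be installed (not executed in this sketch):
#
#   @cachew(depends_on=file_state)
#   def measurements(path: Path) -> Iterator[Measurement]:
#       ...

with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / 'chunk.db'
    db.write_bytes(b'old')
    before = file_state(db)
    unchanged = file_state(db)      # nothing was written in between
    db.write_bytes(b'newer data')   # different size, so the state must differ
    after = file_state(db)
```

Including the size as well as the modification time keeps the comparison robust on filesystems with coarse timestamps.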


# Installing
Package is available on [pypi](https://pypi.org/project/cachew/).

pip install cachew
pip3 install --user cachew

## Developing
I'm using [tox](tox.ini) to run tests, and [circleci](.circleci/config.yml).
I'm using [tox](tox.ini) to run tests, and [GitHub Actions](.github/workflows/main.yml) for CI.

# Implementation

Expand Down Expand Up @@ -256,3 +252,17 @@ def mcachew(*args, **kwargs):


Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing.

## Settings


[cachew.settings](src/cachew/__init__.py#L61) exposes some parameters that allow you to control `cachew` behaviour:
- `ENABLE`: set to `False` if you want to disable caching without removing the decorators (useful for testing and debugging).
You can also use the [cachew.extra.disabled_cachew](src/cachew/__init__.py#L18) context manager to do it temporarily.
- `DEFAULT_CACHEW_DIR`: override to set a different base directory.
- `THROW_ON_ERROR`: by default, cachew is defensive and simply attempts to call the original function on caching issues.
Set to `True` to catch errors earlier.
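
The behaviour these flags describe can be pictured with a small stand-in. Nothing below is cachew's code: the `settings` class only mirrors two of the flag names listed above, and `defensive`, `broken_cache` and `compute` are hypothetical names for the sketch.

```python
import logging
from typing import Callable, Iterator, List


class settings:
    """Stand-in for illustration only; mirrors two of the flag names listed above."""
    ENABLE: bool = True
    THROW_ON_ERROR: bool = False


def defensive(cache_lookup: Callable[[], List[int]], func: Callable[[], Iterator[int]]) -> List[int]:
    """Try the cache first; on failure either re-raise or fall back to the original function."""
    if not settings.ENABLE:
        return list(func())  # caching disabled: always call the original function
    try:
        return cache_lookup()
    except Exception as e:
        if settings.THROW_ON_ERROR:
            raise  # surface caching problems early
        logging.warning('caching failed (%s), calling the original function', e)
        return list(func())


def broken_cache() -> List[int]:
    raise RuntimeError('cache database is corrupted')


def compute() -> Iterator[int]:
    yield from [1, 2, 3]


fallback = defensive(broken_cache, compute)  # defensive default: falls back to compute()

settings.THROW_ON_ERROR = True
try:
    defensive(broken_cache, compute)
    raised = False
except RuntimeError:
    raised = True
finally:
    settings.THROW_ON_ERROR = False  # restore the default
```

The defensive default keeps a program working when the cache is unusable, while the strict mode is handy in tests, where a silent fallback would hide the problem.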


## Updating this readme
This is a literate readme, implemented as a Jupyter notebook: [README.ipynb](README.ipynb). To update the (autogenerated) [README.md](README.md), use the [generate-readme](generate-readme) script.
