Update readme #7

Merged
merged 3 commits into from
Jan 5, 2020
Changes from 1 commit
more readme updates
karlicoss committed Jan 5, 2020
commit dfc980f243d8947fd7bd8eb3dcc17a8bd2489dac
56 changes: 36 additions & 20 deletions README.ipynb
@@ -81,20 +81,20 @@
"[![CircleCI](https://circleci.com/gh/karlicoss/cachew.svg?style=svg)](https://circleci.com/gh/karlicoss/cachew)\n",
"\n",
"# What is Cachew?\n",
"TLDR: cachew lets you cache function calls into an sqlite database on your disk in a matter of single decorator (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that data is preserved between program runs, so next time you call your function, it will only be a matter of reading from the cache.\n",
"Cache is invalidated automatically if your function's arguments change, so you don't have to think about maintaing it.\n",
"TLDR: cachew lets you **cache function calls** into an sqlite database on your disk in a matter of **single decorator** (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that cached data is persisted between program runs, so next time you call your function, it will only be a matter of reading from the cache.\n",
"Cache is **invalidated automatically** if your function's arguments change, so you don't have to think about maintaining it.\n",
"\n",
"In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is generator, tuple or list) of simple data types:\n",
"\n",
"- primitive types like string, ints, floats and datetimes\n",
"- primitive types: `str`/`int`/`float`/`datetime`\n",
"- [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple)\n",
"- [dataclasses](https://docs.python.org/3/library/dataclasses.html)\n",
"\n",
"That allows to **automatically infer schema from type hints** ([PEP 526](https://www.python.org/dev/peps/pep-0526)) and not think about serializing/deserializing.\n",
"\n",
"## Motivation\n",
"\n",
"I often find myself processing big chunks of data, computing some aggregates on it or extracting few bits I'm interested at. While I'm trying to utilize REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.\n",
"I often find myself processing big chunks of data, merging data together, computing some aggregates on it or extracting few bits I'm interested at. While I'm trying to utilize REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.\n",
"\n",
"Conventional way of dealing with it is serializing results along with some sort of hash (e.g. md5) of input files,\n",
"comparing on the next run and returning cached data if nothing changed.\n",
@@ -161,24 +161,24 @@
"\n",
"To access **all** of historic temperature data, I have two options:\n",
"\n",
"- Go through all the data chunks every time I wan to access them and 'merge' into a unified stream of measurements, e.g. somethingg like:\n",
"- Go through all the data chunks every time I wan to access them and 'merge' into a unified stream of measurements, e.g. something like:\n",
" \n",
" def measurements(chunks: List[Path]) -> Iterator[Measurement]:\n",
" for chunk in chunks:\n",
" # read measurements from 'chunk' and yeild unseen ones\n",
" # read measurements from 'chunk' and yield unseen ones\n",
"\n",
" This is very **easy, but slow** and you waste CPU for no reason every time you need data.\n",
"\n",
"- Keep a 'master' database and write code to merge chunks in it.\n",
"\n",
" This is very **effecient, but tedious**:\n",
" This is very **efficient, but tedious**:\n",
" \n",
" - requires serializing/deserializing data -- boilerplate\n",
" - requires manually managing sqlite database -- error prone, hard to get right every time\n",
" - requires careful scheduling, ideally you want to access new data without having to refresh cache\n",
"\n",
" \n",
"Cachew gives me best of two worlds and makes it **easy and effecient**. Only thing you have to do is to decorate your function:\n",
"Cachew gives me best of two worlds and makes it **easy and efficient**. Only thing you have to do is to decorate your function:\n",
"\n",
" @cachew(\"/data/cache/measurements.sqlite\") \n",
" def measurements(chunks: List[Path]) -> Iterator[Measurement]:\n",
@@ -240,15 +240,29 @@
"types = [f'`{c.__name__}`' for c in cachew.PRIMITIVES.keys()]\n",
"dmd(f\"\"\"\n",
"* automatic schema inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}\n",
"* supports primitive types: {', '.join(types)}\n",
"* supports {flink('Optional', 'tests.test_optional')} types\n",
"* supports {flink('Union', 'tests.test_union')} types\n",
"* supports {flink('nested datatypes', 'tests.test_nested')}\n",
"* supported types: \n",
"\n",
" * primitive: {', '.join(types)}\n",
" * {flink('Optional', 'tests.test_optional')} types\n",
" * {flink('Union', 'tests.test_union')} types\n",
" * {flink('nested datatypes', 'tests.test_nested')}\n",
"* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically \n",
"\"\"\")\n",
"# * custom hash function TODO example with mtime?"
]
},
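{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a cached function returning a nested datatype with an `Optional` field could look like this (the types and path here are hypothetical, just to illustrate the kind of schemas that are supported):\n",
"\n",
"    from typing import NamedTuple, Optional, Iterator\n",
"\n",
"    from cachew import cachew\n",
"\n",
"    class Station(NamedTuple):\n",
"        name: str\n",
"        altitude: Optional[float]   # Optional fields are supported\n",
"\n",
"    class Reading(NamedTuple):\n",
"        station: Station            # nested NamedTuples work too\n",
"        temperature: float\n",
"\n",
"    @cachew('/tmp/readings.sqlite')  # illustrative path\n",
"    def readings() -> Iterator[Reading]:\n",
"        yield Reading(station=Station(name='home', altitude=None), temperature=20.5)"
]
},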
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performance\n",
"Updating cache takes certain overhead, but that would depend on how complicated your datatype in the first place, so I'd suggest measuring if you're not sure.\n",
"\n",
"During reading cache all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead would be from reading sqlite, which is quite fast.\n",
"\n",
"I haven't set up formal benchmarking/regression tests yet, so don't want to make specific claims, however that would almost certainly make your programm faster if computations take more than several seconds."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -267,17 +281,19 @@
"See {flink('docstring', 'cachew.cachew')} for up-to-date documentation on parameters and return types. \n",
"You can also use {flink('extensive unit tests', 'tests')} as a reference.\n",
" \n",
"Some highlights:\n",
"Some useful arguments of `@cachew` decorator:\n",
" \n",
"* `cache_path` can be a filename, or you can specify a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.\n",
" \n",
" It's not required to specify the path (it will be created in `/tmp`) but recommended.\n",
" \n",
"* `hashf` by default just hashes all the arguments, you can also specify a custom callable.\n",
"* `hashf` is a function that determines whether your arguments have changed.\n",
" \n",
" By default it just uses string representation of the arguments, you can also specify a custom callable.\n",
" \n",
" For instance, it can be used to {flink('discard cache', 'tests.test_custom_hash')} if the input file was modified.\n",
" \n",
"* `cls` is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache. \n",
"* `cls` is the type that would be serialized. It is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache. \n",
"\"\"\")"
]
},
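{
"cell_type": "markdown",
"metadata": {},
"source": [
"A rough sketch of how `cache_path` and `hashf` could be combined. Note the assumptions: `hashf` is treated here as receiving the same arguments as the decorated function (check the docstring to be sure), and the file, path and 'parsing' are made up:\n",
"\n",
"    from pathlib import Path\n",
"    from typing import Iterator\n",
"\n",
"    from cachew import cachew\n",
"\n",
"    def mtime_hash(path: Path):\n",
"        # cache gets discarded whenever the input file is modified\n",
"        return (str(path), path.stat().st_mtime)\n",
"\n",
"    @cachew(\n",
"        cache_path=lambda path: '/tmp/cache/' + path.name + '.sqlite',  # illustrative\n",
"        hashf=mtime_hash,\n",
"    )\n",
"    def parse_log(path: Path) -> Iterator[str]:\n",
"        # made-up 'parsing': just yield the lines of the file\n",
"        yield from path.read_text().splitlines()"
]
},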
@@ -313,8 +329,8 @@
"* why tuples and dataclasses?\n",
" \n",
" Tuples are natural in Python for quickly grouping together return results.\n",
" `NamedTuple` and `dataclass` specifically provide a very straighforward and self documenting way to represent data in Python.\n",
" Very compact syntax makes it extremely convenitent even for one-off means of communicating between couple of functions.\n",
" `NamedTuple` and `dataclass` specifically provide a very straightforward and self documenting way to represent data in Python.\n",
" Very compact syntax makes it extremely convenient even for one-off means of communicating between couple of functions.\n",
" \n",
" If you want to find out more why you should use more dataclasses in your code I suggest these links:\n",
" \n",
@@ -328,12 +344,12 @@
"\n",
"* why `sqlite` database for storage?\n",
"\n",
" It's pretty effecient and sequence of namedtuples maps onto database rows in a very straighforward manner.\n",
" It's pretty efficient and sequence of namedtuples maps onto database rows in a very straightforward manner.\n",
"\n",
"* why not `pandas.DataFrame`?\n",
"\n",
" DataFrames are great and can be serialised to csv or pickled.\n",
" They are good to have as one of the ways you can interface with your data, however hardly convenitent to think about it abstractly due to their dynamic nature.\n",
" They are good to have as one of the ways you can interface with your data, however hardly convenient to think about it abstractly due to their dynamic nature.\n",
" They also can't be nested.\n",
" \n",
"* why not [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping)?\n",
@@ -345,7 +361,7 @@
"\n",
"* why not [marshmallow](https://marshmallow.readthedocs.io/en/3.0/nesting.html)?\n",
" \n",
" Marshmallow is a common way to map data into db-friendly format, but it requires explicit schema which is an overhead when you have it already in the form of type annotations. I've looked at existing projects to utilise type annotations, but didn't find them covering all I wanted:\n",
" Marshmallow is a common way to map data into db-friendly format, but it requires explicit schema which is an overhead when you have it already in the form of type annotations. I've looked at existing projects to utilize type annotations, but didn't find them covering all I wanted:\n",
" \n",
" * https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api\n",
" * https://pypi.org/project/marshmallow-dataclass"