TLDR: cachew can persistently cache any sequence (strictly speaking, anything that’s an Iterator over NamedTuples) into an sqlite database on your disk.
Imagine you’re working on a data analysis pipeline for some huge dataset, say, extracting URLs and their titles from a Wikipedia archive. Parsing it takes hours, but the archive is presumably not updated very often. Normally, to get around this, you would have to serialize your pipeline results along with some sort of hash (e.g. md5) of the input files, compare the hash on the next query, and return the stored results if it matches, or discard them and compute new ones if the hash (i.e. the input data) has changed.
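The manual approach described above might look roughly like the sketch below. The helper names (`file_md5`, `cached_by_input_hash`) and the JSON cache file layout are made up for illustration, and it assumes the results are JSON-serializable.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def file_md5(path: Path) -> str:
    """Hash the input file so a change in the data can be detected."""
    # Reads the whole file at once, which is fine for a sketch.
    return hashlib.md5(path.read_bytes()).hexdigest()


def cached_by_input_hash(
    input_path: Path,
    cache_path: Path,
    compute: Callable[[Path], List],
) -> List:
    """Reuse stored results if the input hash matches, otherwise recompute."""
    current_hash = file_md5(input_path)
    if cache_path.exists():
        stored = json.loads(cache_path.read_text())
        if stored["hash"] == current_hash:
            return stored["results"]  # input unchanged: skip the slow part
    results = compute(input_path)  # input is new or changed: do the slow part
    cache_path.write_text(json.dumps({"hash": current_hash, "results": results}))
    return results
```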
This is pretty tedious to do every time you need to memoize some data; it clutters your code with routine boilerplate and distracts you from your main task. This library is meant to solve that problem with a single line of decorator code.
>>> from typing import NamedTuple, Iterator
>>> from cachew import cachew
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1) # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that should take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]
>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items
Basically, your NamedTuple gets flattened out (if it contains nested NamedTuples) and python types are mapped onto sqlite types and back.
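To make the flattening idea concrete, here is a minimal sketch. It is not cachew's actual code: the `SQLITE_TYPES` table, the `parent_child` column naming and the helper names are assumptions made up for illustration, and `Optional` fields are not handled.

```python
from datetime import datetime
from typing import Any, Iterator, NamedTuple, Tuple, get_type_hints

# Illustrative mapping only; the real library's mapping may differ.
SQLITE_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT", datetime: "TEXT"}


def is_namedtuple_type(tp: Any) -> bool:
    """True for classes created via NamedTuple."""
    return isinstance(tp, type) and issubclass(tp, tuple) and hasattr(tp, "_fields")


def flatten_schema(tp: type, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (column_name, sqlite_type) pairs, recursing into nested NamedTuples."""
    for name, field_type in get_type_hints(tp).items():
        if is_namedtuple_type(field_type):
            yield from flatten_schema(field_type, prefix=f"{prefix}{name}_")
        else:
            yield f"{prefix}{name}", SQLITE_TYPES[field_type]


def flatten_row(obj: tuple) -> Iterator[Any]:
    """Yield the values of a (possibly nested) NamedTuple as one flat row."""
    for value in obj:
        if is_namedtuple_type(type(value)):
            yield from flatten_row(value)
        elif isinstance(value, datetime):
            yield value.isoformat()  # store datetimes as ISO 8601 text
        else:
            yield value


class Point(NamedTuple):
    lat: float
    lon: float


class Visit(NamedTuple):
    place: str
    when: datetime
    where: Point


columns = list(flatten_schema(Visit))
# [('place', 'TEXT'), ('when', 'TEXT'), ('where_lat', 'REAL'), ('where_lon', 'REAL')]
row = list(flatten_row(Visit("home", datetime(2019, 7, 30, 21, 6), Point(1.0, 2.0))))
# ['home', '2019-07-30T21:06:00', 1.0, 2.0]
```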
When the function is called, cachew computes the hash of your function’s arguments and compares it against the previously stored hash value.
If they match, it deserializes and yields whatever is stored in the cache database. If the hash doesn’t match, the original data provider is called and the new data is stored along with the new hash.
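The check described above can be sketched as a toy decorator. This is a simplified illustration rather than cachew's implementation: `sqlite_cached` is a hypothetical name, it only handles keyword arguments and flat NamedTuples, and hashing the `repr` of the arguments is an assumption made to keep it short.

```python
import hashlib
import os
import sqlite3
import tempfile
from typing import Callable, Iterator, NamedTuple


def sqlite_cached(db_path: str, row_type: type) -> Callable:
    """Hypothetical decorator for functions that yield flat NamedTuples."""
    columns = ", ".join(f'"{name}"' for name in row_type._fields)
    placeholders = ", ".join("?" for _ in row_type._fields)

    def decorator(func: Callable[..., Iterator]) -> Callable[..., Iterator]:
        def wrapper(**kwargs) -> Iterator:
            # Simplification: hash the repr of the keyword arguments.
            new_hash = hashlib.md5(repr(sorted(kwargs.items())).encode()).hexdigest()
            conn = sqlite3.connect(db_path)
            try:
                conn.execute("CREATE TABLE IF NOT EXISTS hash (value TEXT)")
                conn.execute(f"CREATE TABLE IF NOT EXISTS data ({columns})")
                stored = conn.execute("SELECT value FROM hash").fetchone()
                if stored is not None and stored[0] == new_hash:
                    # Hashes match: yield whatever is stored in the database.
                    for row in conn.execute(f"SELECT {columns} FROM data"):
                        yield row_type(*row)
                    return
                # Hash mismatch: drop old rows, call the original function, store new rows.
                conn.execute("DELETE FROM hash")
                conn.execute("DELETE FROM data")
                for item in func(**kwargs):
                    conn.execute(f"INSERT INTO data VALUES ({placeholders})", tuple(item))
                    yield item
                # The new hash is only stored once the whole sequence was consumed.
                conn.execute("INSERT INTO hash VALUES (?)", (new_hash,))
                conn.commit()
            finally:
                conn.close()

        return wrapper

    return decorator


class Link(NamedTuple):
    url: str
    text: str


calls = []  # records how often the original function body actually runs
db_path = os.path.join(tempfile.mkdtemp(), "links.sqlite")


@sqlite_cached(db_path, Link)
def extract_links(archive: str) -> Iterator[Link]:
    calls.append(archive)
    for i in range(3):
        yield Link(url=f"http://link{i}.org", text=f"text {i}")


first = list(extract_links(archive="wikipedia_20190830.zip"))   # computed and stored
second = list(extract_links(archive="wikipedia_20190830.zip"))  # read back from SQLite
```

In this sketch, calling `extract_links` with a different `archive` value changes the hash, so the stored rows are discarded and the function body runs again.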
[2019-07-30 Tue 21:06] ugh, really would be nice to write that along the code… e.g. currently supported types would be nice
- relies on typing annotations. TODO link to test
- supports nested NamedTuples
- supports datetime
- supports Optional
- detects schema changes and discards old data automatically
- custom hash function TODO example with mtime?
- TODO iterative so works if data doesn’t fit in memory??
Mainly this was inspired by ~functools.lru_cache~, which is excellent if you need to cache something within a single python process run.
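For reference, a small example of the in-process caching that `functools.lru_cache` provides. The function `slow_square` and the counter are made up for illustration; the point is that the cache lives in memory, so it is gone once the interpreter exits.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    global call_count
    call_count += 1
    return n * n


slow_square(12)  # computed
slow_square(12)  # served from the in-memory cache

print(call_count)                     # 1
print(slow_square.cache_info().hits)  # 1
```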
Tuples are natural in Python for quickly grouping together return results.
NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent a bit of data in Python.
Their very compact syntax makes them extremely convenient even for one-off communication between a couple of functions.
why not ~pickle~?
Pickling is a bit heavyweight for data. There are many reports of pickle being slower than even JSON, and it’s also a security risk. Lastly, it can only be loaded via Python.
SQLite is pretty efficient, and a sequence of NamedTuples maps onto database rows in a very straightforward manner.
DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data, but they are hardly convenient to reason about abstractly due to their dynamic nature. They also can’t be nested.
why not ORM?
ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. An ORM is also somewhat overkill for such a specific purpose.
E.g. SQLAlchemy requires you to use custom SQLAlchemy-specific types and inherit from a base class.
It also doesn’t support nested types.
why not marshmallow?
Marshmallow is great, but it requires an explicit schema, which is overhead when you already have one in the form of type annotations.
https://github.com/justanr/marshmallow-annotations TODO has support for NamedTuples
https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api