
Build status: https://circleci.com/gh/karlicoss/cachew.png

Cachew: quick NamedTuple cache

TLDR: cachew can persistently cache any sequence (strictly speaking, anything that’s an Iterator over NamedTuples) into an sqlite database on your disk.

Imagine you’re working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive. Parsing it takes hours; however, the archive presumably isn’t updated very frequently. Normally, to get around this you would have to serialize your pipeline results along with some sort of hash (e.g. md5) of the input files, compare the hashes on the next run, return the cached results if they match, or discard them and recompute if the hash (i.e. the input data) has changed.

This is pretty tedious to do every time you need to memoize some data, clutters your code with boilerplate and distracts you from your main task. This library is meant to solve that problem with a single line of decorator code.

Installing
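Assuming the package is published on PyPI under the same name as the repository, installation is a one-liner:

```shell
pip install cachew
```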

Example

>>> from cachew import cachew
>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1) # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that should take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]

>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items

How it works

Basically, your NamedTuple gets flattened out (if it contains nested NamedTuples), and Python types are mapped onto sqlite types and back.
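As an illustration of the flattening idea, a nested NamedTuple can be turned into a single flat row of primitive values. This is a simplified sketch, not cachew’s actual code; the `flatten` helper is made up for illustration:

```python
from typing import NamedTuple

class Point(NamedTuple):
    x: int
    y: int

class Measurement(NamedTuple):
    label: str
    point: Point  # nested NamedTuple

def flatten(value) -> tuple:
    """Recursively flatten nested NamedTuples into one flat tuple of primitives."""
    if isinstance(value, tuple) and hasattr(value, '_fields'):  # it's a NamedTuple
        return tuple(v for item in value for v in flatten(item))
    return (value,)

m = Measurement(label='sensor1', point=Point(x=1, y=2))
print(flatten(m))  # ('sensor1', 1, 2) -- one flat database row
```

The flat tuple maps directly onto the columns of an sqlite row; reconstructing the NamedTuple on the way back is the reverse walk over the same type annotations.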

When the function is called, cachew computes the hash of your function’s arguments and compares it against the previously stored hash value. If they match, it deserializes and yields whatever is stored in the cache database; if the hash differs, the original data provider is called and the new data is stored along with the new hash.
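That control flow can be sketched roughly like this. This is a deliberately simplified stand-in, not cachew’s implementation: it stores JSON on disk instead of sqlite rows, and all names are made up:

```python
import hashlib
import json
from pathlib import Path

def cached(func):
    """Simplified sketch of the hash-compare-or-recompute logic."""
    def wrapper(*args):
        cache_file = Path(f'{func.__name__}.cache.json')
        # hash the arguments (cachew also mixes the type schema into the hash)
        h = hashlib.md5(repr(args).encode()).hexdigest()
        if cache_file.exists():
            stored = json.loads(cache_file.read_text())
            if stored['hash'] == h:  # hash matches: serve cached rows
                yield from stored['rows']
                return
        # hash mismatch (or no cache yet): recompute and store with the new hash
        rows = list(func(*args))
        cache_file.write_text(json.dumps({'hash': h, 'rows': rows}))
        yield from rows
    return wrapper

@cached
def squares(n):
    for i in range(n):
        yield i * i

print(list(squares(4)))  # [0, 1, 4, 9] -- second call with n=4 reads the cache
```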

[2019-07-30 Tue 21:06] ugh, really would be nice to write that along the code… e.g. currently supported types would be nice

Features

  • relies on typing annotations. TODO link to test
  • supports nested NamedTuples
  • supports datetime
  • supports Optional
  • detects schema changes and discards old data automatically
  • custom hash function TODO example with mtime?
  • TODO iterative so works if data doesn’t fit in memory??

links to tests
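The custom hash idea (e.g. with mtime) can be sketched generically; this is not cachew’s actual API, just the pattern: keying the cache on a file’s modification time means editing the file invalidates the cache automatically.

```python
import os

_cache = {}  # maps (path, mtime) -> cached result

def expensive_parse(path: str):
    with open(path) as f:
        return f.read().splitlines()

def parse_with_mtime_hash(path: str):
    # the 'hash' is the file's modification time: touching the file
    # changes the key, so stale results are never served
    key = (path, os.stat(path).st_mtime)
    if key not in _cache:
        _cache[key] = expensive_parse(path)
    return _cache[key]
```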

Inspiration

Mainly this was inspired by ~functools.lru_cache~, which is excellent if you need to cache something within a single Python process run.
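For reference, ~functools.lru_cache~ gives you in-memory memoization with the same one-decorator ergonomics, but the cache is lost when the process exits:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # results are memoized in memory for the lifetime of the process
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, computed in linear rather than exponential time
```

cachew trades the in-memory dict for an sqlite database on disk, so the cache survives across runs.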

Implementation

why tuples and dataclasses?

Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent a bit of data in Python. The compact syntax makes them extremely convenient even as a one-off means of communicating between a couple of functions.

[2019-07-30 Tue 21:02] some link to data class

why not ~pickle~?

Pickling is a bit heavyweight for plain data. There are many reports of pickle being slower than even JSON, and it’s also a security risk. Lastly, it can only be loaded via Python.

why sqlite database for storage?

It’s pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.

why not pandas.DataFrame?

DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data; however, their dynamic nature makes them hard to reason about abstractly. They also can’t be nested.

why not ORM?

ORMs tend to be pretty invasive, which might complicate your scripts or even hurt performance. It’s also somewhat overkill for such a specific purpose.

E.g. SQLAlchemy requires you to use custom SQLAlchemy-specific types and inherit from a base class.

It also doesn’t support nested types.

why not marshmallow?

Marshmallow is great, but it requires an explicit schema, which is overhead when you already have one in the form of type annotations.

https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api

https://pypi.org/project/marshmallow-dataclass/

mention that in code?

[2019-07-30 Tue 19:00] post some link to data classes?

examples

[2019-07-30 Tue 20:15] e.g. if hash is date you can ensure you only serve one piece of data a day