
Build status: https://circleci.com/gh/karlicoss/cachew.png

Cachew: quick NamedTuple cache

TLDR: cachew can persistently cache any sequence (strictly speaking, anything that’s an Iterator over NamedTuples) into an sqlite database on your disk.

Imagine you’re working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive. Parsing it takes hours; however, the archive presumably isn’t updated very frequently. Normally, to get around this you would have to serialize your pipeline results along with some sort of hash (e.g. md5) of the input files, compare the hashes on the next run, return the cached results if they match, or discard them and recompute if the hash (i.e. the input data) has changed.

This is pretty tedious to do every time you need to memoize some data, clutters your code with boilerplate and distracts you from your main task. This library is meant to solve that problem with a single line of decorator code.

Installing
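Assuming the package is published on PyPI under the same name as the repository, installation is a one-liner:

```shell
pip install cachew
```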

Example

>>> from cachew import cachew
>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive: str) -> Iterator[Link]:
...     for i in range(5):
...         import time; time.sleep(1) # simulate slow IO
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive='wikipedia_20190830.zip')) # that should take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]

>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items

How it works

Basically, your NamedTuple gets flattened out (if it contains nested NamedTuples), and Python types are mapped onto sqlite types and back.
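As an illustration of the flattening idea, a nested NamedTuple can be turned into a single flat row of primitive values. This is a simplified sketch, not cachew’s actual code; the `flatten` helper is made up for illustration:

```python
from typing import NamedTuple

class Point(NamedTuple):
    x: int
    y: int

class Measurement(NamedTuple):
    label: str
    point: Point  # nested NamedTuple

def flatten(value) -> tuple:
    """Recursively flatten nested NamedTuples into one flat tuple of primitives."""
    if isinstance(value, tuple) and hasattr(value, '_fields'):  # it's a NamedTuple
        return tuple(v for item in value for v in flatten(item))
    return (value,)

m = Measurement(label='sensor1', point=Point(x=1, y=2))
print(flatten(m))  # ('sensor1', 1, 2) -- one flat database row
```

The flat tuple maps directly onto the columns of an sqlite row; reconstructing the NamedTuple on the way back is the reverse walk over the same type annotations.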

When the function is called, cachew computes the hash of your function’s arguments and compares it against the previously stored hash value. If they match, it deserializes and yields whatever is stored in the cache database; if the hash differs, the original data provider is called and the new data is stored along with the new hash.
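That control flow can be sketched roughly like this. This is a deliberately simplified stand-in, not cachew’s implementation: it stores JSON on disk instead of sqlite rows, and all names are made up:

```python
import hashlib
import json
from pathlib import Path

def cached(func):
    """Simplified sketch of the hash-compare-or-recompute logic."""
    def wrapper(*args):
        cache_file = Path(f'{func.__name__}.cache.json')
        # hash the arguments (cachew also mixes the type schema into the hash)
        h = hashlib.md5(repr(args).encode()).hexdigest()
        if cache_file.exists():
            stored = json.loads(cache_file.read_text())
            if stored['hash'] == h:  # hash matches: serve cached rows
                yield from stored['rows']
                return
        # hash mismatch (or no cache yet): recompute and store with the new hash
        rows = list(func(*args))
        cache_file.write_text(json.dumps({'hash': h, 'rows': rows}))
        yield from rows
    return wrapper

@cached
def squares(n):
    for i in range(n):
        yield i * i

print(list(squares(4)))  # [0, 1, 4, 9] -- second call with n=4 reads the cache
```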

[2019-07-30 Tue 21:06] ugh, really would be nice to write that along the code… e.g. currently supported types would be nice

Features

  • relies on typing annotations. TODO link to test
  • supports nested NamedTuples
  • supports datetime
  • supports Optional
  • detects schema changes and discards old data automatically
  • custom hash function TODO example with mtime?
  • TODO iterative so works if data doesn’t fit in memory??

links to tests
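The custom hash idea (e.g. with mtime) can be sketched generically; this is not cachew’s actual API, just the pattern: keying the cache on a file’s modification time means editing the file invalidates the cache automatically.

```python
import os

_cache = {}  # maps (path, mtime) -> cached result

def expensive_parse(path: str):
    with open(path) as f:
        return f.read().splitlines()

def parse_with_mtime_hash(path: str):
    # the 'hash' is the file's modification time: touching the file
    # changes the key, so stale results are never served
    key = (path, os.stat(path).st_mtime)
    if key not in _cache:
        _cache[key] = expensive_parse(path)
    return _cache[key]
```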

Inspiration

Mainly this was inspired by ~functools.lru_cache~, which is excellent if you need to cache something within a single Python process run.
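For reference, ~functools.lru_cache~ gives you in-memory memoization with the same one-decorator ergonomics, but the cache is lost when the process exits:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # results are memoized in memory for the lifetime of the process
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # 832040, computed in linear rather than exponential time
```

cachew trades the in-memory dict for an sqlite database on disk, so the cache survives across runs.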

Implementation

why tuples and dataclasses?

Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent a bit of data in Python. The compact syntax makes them extremely convenient even as a one-off means of communicating between a couple of functions.

[2019-07-30 Tue 21:02] some link to data class

why not ~pickle~?

Pickling is a bit heavyweight for plain data. There are many reports of pickle being slower than even JSON, and it’s also a security risk. Lastly, it can only be loaded via Python.

why sqlite database for storage?

It’s pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.

why not pandas.DataFrame?

DataFrames are great and can be serialised to csv or pickled. They are good to have as one of the ways you can interface with your data; however, their dynamic nature makes them hard to reason about abstractly. They also can’t be nested.

why not ORM?

ORMs tend to be pretty invasive, which might complicate your scripts or even hurt performance. It’s also somewhat overkill for such a specific purpose.

E.g. SQLAlchemy requires you to use custom SQLAlchemy-specific types and inherit from a base class.

It also doesn’t support nested types.

why not marshmallow?

Marshmallow is great, but it requires an explicit schema, which is overhead when you already have one in the form of type annotations.

https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api

https://pypi.org/project/marshmallow-dataclass/

mention that in code?

[2019-07-30 Tue 19:00] post some link to data classes?

examples

[2019-07-30 Tue 20:15] e.g. if hash is date you can ensure you only serve one piece of data a day