Update readme #7

Merged
merged 3 commits on Jan 5, 2020

Changes from 1 commit
Update readme
karlicoss committed Jan 5, 2020
commit 115f93a63462b0130b8249a3448e57c1cea3da7f
187 changes: 156 additions & 31 deletions README.ipynb
@@ -4,6 +4,12 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
},
"tags": [
"noexport"
]
@@ -28,9 +34,9 @@
" else:\n",
" [modname, fname] = split\n",
" module = globals()[modname]\n",
" \n",
"\n",
" file = Path(module.__file__).relative_to(cwd)\n",
" \n",
"\n",
" if fname is not None:\n",
" func = module\n",
" for p in fname.split('.'):\n",
@@ -40,7 +46,7 @@
" else:\n",
" numbers = ''\n",
" return f'[{title}]({file}{numbers})'\n",
" \n",
"\n",
"dmd = lambda x: display(md(x))\n",
"\n",
"import cachew\n",
@@ -50,39 +56,56 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"autoscroll": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"dmd(f'<!--THIS FILE IS AUTOGENERATED BY README.ipynb. Use generate-readme to update it.-->')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"[![CircleCI](https://circleci.com/gh/karlicoss/cachew.svg?style=svg)](https://circleci.com/gh/karlicoss/cachew)\n",
"\n",
"# Cachew: quick NamedTuple/dataclass cache\n",
"TLDR: cachew can persistently cache any sequence (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator)) over [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple) or [dataclasses](https://docs.python.org/3/library/dataclasses.html) into an sqlite database on your disk.\n",
"Database schema is automatically inferred from type annotations ([PEP 526](https://www.python.org/dev/peps/pep-0526)).\n",
"# What is Cachew?\n",
"TLDR: cachew lets you cache function calls into an sqlite database on your disk in a matter of single decorator (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that data is preserved between program runs, so next time you call your function, it will only be a matter of reading from the cache.\n",
"Cache is invalidated automatically if your function's arguments change, so you don't have to think about maintaing it.\n",
"\n",
"It works in a similar manner to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache): caching your data is just a matter of decorating it.\n",
"In order to be cacheable, your function needs to return (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator), that is generator, tuple or list) of simple data types:\n",
"\n",
"The difference from `functools.lru_cache` is that data is preserved between program runs.\n",
"- primitive types like string, ints, floats and datetimes\n",
"- [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple)\n",
"- [dataclasses](https://docs.python.org/3/library/dataclasses.html)\n",
"\n",
"That allows to **automatically infer schema from type hints** ([PEP 526](https://www.python.org/dev/peps/pep-0526)) and not think about serializing/deserializing.\n",
"\n",
"## Motivation\n",
"\n",
"I often find myself processing big chunks of data, computing some aggregates on it or extracting only bits I'm interested at. While I'm trying to utilize REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases. \n",
"I often find myself processing big chunks of data, computing some aggregates on it or extracting few bits I'm interested at. While I'm trying to utilize REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.\n",
"\n",
"Conventional way of dealing with it is serializing results along with some sort of hash (e.g. md5) of input files,\n",
"comparing on the next run and returning cached data if nothing changed.\n",
"\n",
"Simple as it sounds, it is pretty tedious to do every time you need to memorize some data, contaminates your code with routine and distracts you from your main task.\n",
"\n",
"\n",
"# Example\n",
"# Examples\n",
"## Processing Wikipedia\n",
"Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from Wikipedia archive.\n",
"Parsing it (`extract_links` function) takes hours, however, the archive is presumably updated not very frequently.\n",
"Parsing it (`extract_links` function) takes hours, however, as long as the archive is same you will always get same results. So it would be nice to be able to cache the results somehow.\n",
"\n",
"\n",
"With this library your can achieve it through single `@cachew` decorator."
@@ -91,7 +114,14 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"autoscroll": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"doc = inspect.getdoc(cachew.cachew)\n",
@@ -101,10 +131,76 @@
"```\"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next time you call `extract_links` with the same archive, you will start getting results in a matter of milliseconds, as fast as sqlite reads it.\n",
"\n",
"When you use newer archive, `archive_path` will change, which will make cachew invalidate old cache and recompute it, so you don't need to think about maintaining it separately."
]
},
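To make this concrete, below is a minimal, self-contained sketch of the Wikipedia example; the `Link` fields, the parsing logic and the archive filename are hypothetical placeholders, with only the `@cachew` decorator coming from the library:

```python
# Hypothetical sketch: Link, the parsing logic and the archive path are made
# up for illustration; only the @cachew decorator comes from the library.
from pathlib import Path
from typing import Iterator, NamedTuple

from cachew import cachew

class Link(NamedTuple):
    url: str
    title: str

@cachew  # the cache path is optional; by default it is created under /tmp
def extract_links(archive_path: str) -> Iterator[Link]:
    # pretend this loops over a multi-gigabyte dump and takes hours
    for line in Path(archive_path).read_text().splitlines():
        url, _, title = line.partition(' ')
        yield Link(url=url, title=title)

links = list(extract_links('dump.xml'))  # slow: parses the archive, fills the cache
links = list(extract_links('dump.xml'))  # fast: streams rows back from sqlite
```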
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Incremental data exports\n",
"This is my most common usecase of cachew, which I'll illustrate with example.\n",
"\n",
"I'm using an [environment sensor](https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger) to log stats about temperature and humidity.\n",
"Data is synchronized via bluetooth in the sqlite database, which is easy to access. However sensor has limited memory (e.g. 1000 latest measurements).\n",
"That means that I end up with a new database every few days which contains, each of them containing only slice of data I need: e.g.:\n",
"\n",
" ...\n",
" 20190715100026.db\n",
" 20190716100138.db\n",
" 20190717101651.db\n",
" 20190718100118.db\n",
" 20190719100701.db\n",
" ...\n",
"\n",
"To access **all** of historic temperature data, I have two options:\n",
"\n",
"- Go through all the data chunks every time I wan to access them and 'merge' into a unified stream of measurements, e.g. somethingg like:\n",
" \n",
" def measurements(chunks: List[Path]) -> Iterator[Measurement]:\n",
" for chunk in chunks:\n",
" # read measurements from 'chunk' and yeild unseen ones\n",
"\n",
" This is very **easy, but slow** and you waste CPU for no reason every time you need data.\n",
"\n",
"- Keep a 'master' database and write code to merge chunks in it.\n",
"\n",
" This is very **effecient, but tedious**:\n",
" \n",
" - requires serializing/deserializing data -- boilerplate\n",
" - requires manually managing sqlite database -- error prone, hard to get right every time\n",
" - requires careful scheduling, ideally you want to access new data without having to refresh cache\n",
"\n",
" \n",
"Cachew gives me best of two worlds and makes it **easy and effecient**. Only thing you have to do is to decorate your function:\n",
"\n",
" @cachew(\"/data/cache/measurements.sqlite\") \n",
" def measurements(chunks: List[Path]) -> Iterator[Measurement]:\n",
" # ...\n",
" \n",
"- as long as `chunks` stay same, data stays same so you always read from sqlite cache which is very fast\n",
"- you don't need to maintain the database, cache is automatically refreshed when `chunks` change (i.e. you got new data)\n",
"\n",
" All the complexity of handling database is hidden in `cachew` implementation."
]
},
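Below is a fuller, runnable sketch of this pattern; the `Measurement` shape, the `read_chunk` helper and the sensor database schema are assumptions made up for illustration:

```python
# Hypothetical sketch: Measurement, read_chunk and the sensor database schema
# are made up; only the @cachew decorator comes from the library.
import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Iterator, List, NamedTuple

from cachew import cachew

class Measurement(NamedTuple):
    dt: datetime
    temperature: float
    humidity: float

def read_chunk(db: Path) -> Iterator[Measurement]:
    # hypothetical reader; the real sensor database layout will differ
    with sqlite3.connect(str(db)) as conn:
        for dt, temp, hum in conn.execute('SELECT dt, temperature, humidity FROM log'):
            yield Measurement(datetime.fromisoformat(dt), temp, hum)

@cachew('/data/cache/measurements.sqlite')
def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    emitted = set()
    for chunk in chunks:
        for m in read_chunk(chunk):
            if m.dt not in emitted:  # chunks overlap, so skip rows seen already
                emitted.add(m.dt)
                yield m
```

The expensive merge only runs when `chunks` changes; otherwise every call streams the merged history straight from the cache.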
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"autoscroll": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"[composite] = [x\n",
@@ -117,28 +213,37 @@
"dmd(f'''\n",
"# How it works\n",
"Basically, your data objects get {flink('flattened out', 'cachew.NTBinder.to_row')}\n",
"and python types are mapped {flink('onto sqlite types and back', 'cachew.NTBinder.iter_columns')}\n",
"and python types are mapped {flink('onto sqlite types and back', 'cachew.NTBinder.iter_columns')}.\n",
"\n",
"When the function is called, cachew [computes the hash]({link}) of your function's arguments \n",
"When the function is called, cachew [computes the hash of your function's arguments ]({link})\n",
"and compares it against the previously stored hash value.\n",
" \n",
"If they match, it would deserialize and yield whatever is stored in the cache database, if the hash mismatches, the original data provider is called and new data is stored along with the new hash.\n",
"- If they match, it would deserialize and yield whatever is stored in the cache database\n",
"- If the hash mismatches, the original function is called and new data is stored along with the new hash\n",
"''')"
]
},
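For intuition, here is a toy illustration of that flow; it is not cachew's actual implementation, just a sketch of the hash-compare-recompute logic with rows flattened into a single hypothetical sqlite table:

```python
# Toy illustration of the flow described above -- NOT cachew's actual code.
# The (url, title) row shape and table names are made up for the example.
import sqlite3
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple[str, str]  # a flattened data object, e.g. (url, title)

def cached_rows(conn: sqlite3.Connection, arg_hash: str,
                compute: Callable[[], Iterable[Row]]) -> Iterator[Row]:
    conn.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
    conn.execute('CREATE TABLE IF NOT EXISTS rows (url TEXT, title TEXT)')
    stored = conn.execute('SELECT value FROM hash').fetchone()
    if stored is not None and stored[0] == arg_hash:
        # hash matches: deserialize and yield whatever the cache holds
        yield from conn.execute('SELECT url, title FROM rows')
    else:
        # hash mismatch: call the original provider, store new rows and hash
        rows = list(compute())
        conn.execute('DELETE FROM rows')
        conn.execute('DELETE FROM hash')
        conn.executemany('INSERT INTO rows VALUES (?, ?)', rows)
        conn.execute('INSERT INTO hash VALUES (?)', (arg_hash,))
        conn.commit()
        yield from rows
```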
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"autoscroll": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"dmd('# Features')\n",
"types = [f'`{c.__name__}`' for c in cachew.PRIMITIVES.keys()]\n",
"dmd(f\"\"\"\n",
"* automatic schema inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}\n",
"* supports primitive types: {', '.join(types)}\n",
"* supports {flink('Optional', 'tests.test_optional')}\n",
"* supports {flink('Optional', 'tests.test_optional')} types\n",
"* supports {flink('Union', 'tests.test_union')} types\n",
"* supports {flink('nested datatypes', 'tests.test_nested')}\n",
"* supports return type inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}\n",
"* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically \n",
"\"\"\")\n",
"# * custom hash function TODO example with mtime?"
@@ -147,7 +252,14 @@
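As a quick, hedged taste of the features listed above (the dataclass shapes and the cache path are invented for illustration):

```python
# Hypothetical example exercising schema inference, Optional fields and
# nested datatypes from the feature list above.
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, Optional

from cachew import cachew

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Reading:
    dt: datetime
    location: Point             # nested datatype
    note: Optional[str] = None  # Optional field

@cachew('/tmp/readings.sqlite')
def readings() -> Iterator[Reading]:  # schema is inferred from this annotation
    yield Reading(dt=datetime(2020, 1, 5), location=Point(1.0, 2.0))
```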
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"autoscroll": false,
"ein.hycell": false,
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"outputs": [],
"source": [
"dmd(f\"\"\"\n",
@@ -157,21 +269,26 @@
" \n",
"Some highlights:\n",
" \n",
"* `cache_path` can be a filename, or you can specify a callable {flink('returning path', 'tests.test_callable_cache_path')} and depending on function's arguments.\n",
"* `cache_path` can be a filename, or you can specify a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.\n",
" \n",
" It's not required to specify the path (it will be created in `/tmp`) but recommended.\n",
" \n",
"* `hashf` by default just hashes all the arguments, you can also specify a custom callable.\n",
" \n",
" For instance, it can be used to {flink('discard cache', 'tests.test_custom_hash')} the input file was modified.\n",
" For instance, it can be used to {flink('discard cache', 'tests.test_custom_hash')} if the input file was modified.\n",
" \n",
"* `cls` is deduced from return type annotations by default, but can be specified if you don't control the code you want to cache. \n",
"* `cls` is inferred from return type annotations by default, but can be specified if you don't control the code you want to cache. \n",
"\"\"\")"
]
},
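A sketch combining these two knobs follows; `cache_path` and `hashf` are the parameter names from the highlights above, while the exact callable signatures, the mtime-based hash and the path layout are assumptions for illustration:

```python
# Hypothetical sketch of a callable cache_path plus a custom hashf; the
# mtime-based invalidation and the /tmp layout are assumptions.
from pathlib import Path
from typing import Iterator

from cachew import cachew

def file_mtime(path: Path) -> str:
    # custom hash: modifying the input file discards the cache
    return str(path.stat().st_mtime)

@cachew(
    cache_path=lambda path: f'/tmp/cache/{path.name}.sqlite',  # depends on the argument
    hashf=file_mtime,
)
def lines(path: Path) -> Iterator[str]:
    yield from path.read_text().splitlines()
```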
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# Installing\n",
"Package is available on [pypi](https://pypi.org/project/cachew/).\n",
@@ -184,23 +301,30 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"ein.tags": "worksheet-0",
"slideshow": {
"slide_type": "-"
}
},
"source": [
"# Implementation\n",
"\n",
"* why tuples and dataclasses?\n",
" \n",
" Tuples are natural in Python for quickly grouping together return results.\n",
" `NamedTuple` and `dataclass` specifically provide a very straighforward and self documenting way way to represent a bit of data in Python.\n",
" `NamedTuple` and `dataclass` specifically provide a very straighforward and self documenting way to represent data in Python.\n",
" Very compact syntax makes it extremely convenitent even for one-off means of communicating between couple of functions.\n",
" \n",
" If you want to find out more why you should use more dataclasses in your code I suggest these links:\n",
" [What are data classes?](https://stackoverflow.com/questions/47955263/what-are-data-classes-and-how-are-they-different-from-common-classes), [basic data classes](https://realpython.com/python-data-classes/#basic-data-classes).\n",
" \n",
" - [What are data classes?](https://stackoverflow.com/questions/47955263/what-are-data-classes-and-how-are-they-different-from-common-classes)\n",
" - [basic data classes](https://realpython.com/python-data-classes/#basic-data-classes)\n",
" \n",
" \n",
"* why not [pickle](https://docs.python.org/3/library/pickle.html)?\n",
"\n",
" Pickling is a bit heavyweight for plain data class. There are many reports of pickle being slower than even JSON and it's also security risk. Lastly, it can only be loaded via Python.\n",
" Pickling is a bit heavyweight for plain data class. There are many reports of pickle being slower than even JSON and it's also security risk. Lastly, it can only be loaded via Python, whereas sqlite has numerous bindings and tools to explore and interface.\n",
"\n",
"* why `sqlite` database for storage?\n",
"\n",
@@ -246,7 +370,8 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"name": "README.ipynb"
},
"nbformat": 4,
"nbformat_minor": 2