more work on ipython readme
karlicoss committed Aug 18, 2019
1 parent 1e69cda commit cda7748
Showing 5 changed files with 150 additions and 46 deletions.
161 changes: 134 additions & 27 deletions README.ipynb
@@ -3,15 +3,38 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": [
"noexport"
]
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import sys; sys.path.insert(0, str(Path('src').absolute()))\n",
"import os\n",
"print(os.getcwd())\n",
"print(sys.path)\n",
"import cachew; print(cachew.__file__)"
"cwd = os.getcwd()\n",
"\n",
"import ast\n",
"import inspect\n",
"\n",
"from IPython.display import Markdown as md\n",
"\n",
"def flink(title: str, name: str):\n",
" [modname, fname] = name.split('.', maxsplit=1)\n",
" module = globals()[modname]\n",
" \n",
" func = module\n",
" for p in fname.split('.'):\n",
" func = getattr(func, p)\n",
" file = Path(inspect.getsourcefile(func)).relative_to(cwd)\n",
" _, number = inspect.getsourcelines(func)\n",
" return f'[{title}]({file}:{number})'\n",
" \n",
"dmd = lambda x: display(md(x))\n",
"\n",
"import cachew\n",
"import cachew.tests.test_cachew as tests"
]
},
{
@@ -20,7 +43,29 @@
"metadata": {},
"outputs": [],
"source": [
"import cachew.tests"
"dmd(f'<!--THIS FILE IS AUTOGENERATED BY README.ipynb. Use generate-readme to update it.-->')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cachew: quick NamedTuple/dataclass cache\n",
"TLDR: cachew can persistently cache any sequence (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator)) over [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple) or [dataclasses](https://docs.python.org/3/library/dataclasses.html) into an sqlite database on your disk.\n",
"\n",
"Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive.\n",
"Parsing it takes hours, but the archive itself is updated fairly infrequently.\n",
"Normally, to get around this you would serialize your pipeline results along with some hash (e.g. md5) of the input files,\n",
"compare the hashes on the next query and return the cached results on a match, or discard them and compute new ones if the input data changed.\n",
"\n",
"This is pretty tedious to do every time you need to memoize some data; it clutters your code with bookkeeping routine and distracts you from your main task.\n",
"This library is meant to solve that problem with a single line of decorator code.\n",
"\n",
"TODO move this^ to Example section?\n",
"\n",
"# Installing\n",
"\n",
"TODO\n"
]
},
{
@@ -29,8 +74,12 @@
"metadata": {},
"outputs": [],
"source": [
"import cachew.tests.test_cachew as tests\n",
"from IPython.display import Markdown as md"
"dmd('# Example')\n",
"doc = inspect.getdoc(cachew.cachew)\n",
"doc = doc.split('Usage example:')[-1].lstrip()\n",
"dmd(f\"\"\"```python\n",
"{doc}\n",
"```\"\"\")"
]
},
{
@@ -39,13 +88,16 @@
"metadata": {},
"outputs": [],
"source": [
"# dir(tests.test_nested)\n",
"import inspect\n",
"def flink(title: str, fname: str):\n",
" test = getattr(tests, fname)\n",
" file = Path(inspect.getsourcefile(test)).relative_to(os.getcwd())\n",
" _, number = inspect.getsourcelines(tests.test_nested)\n",
" return f'[{title}]({file}:{number})'"
"dmd('# Features')\n",
"types = [f'`{c.__name__}`' for c in cachew.PRIMITIVES.keys()]\n",
"dmd('Supported primitive types:' + ', '.join(types))\n",
"dmd(f\"\"\"\n",
"* supports Optional TODO\n",
"* supports {flink('nested datatypes', 'tests.test_nested')}\n",
"* supports return type inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}\n",
"* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically \n",
"\"\"\")\n",
"# * custom hash function TODO example with mtime?"
]
},
{
@@ -54,29 +106,84 @@
"metadata": {},
"outputs": [],
"source": [
"types = [f'`{c.__name__}`' for c in cachew.PRIMITIVES.keys()]\n",
"display(md('Supported primitive types:' + ', '.join(types)))\n",
"display(md(f\"\"\"\n",
"* supports {flink('nested datatypes', 'test_nested')}\n",
"* supports return type inference: {flink('1', 'test_return_type_inference')}, {flink('2', 'test_return_type_mismatch')}\n",
"* supports Optional\n",
"* detects {flink('datatype schema changes', 'test_schema_change')} and discards old data automatically \n",
"\"\"\"))"
"[composite] = [x\n",
" for x in ast.walk(ast.parse(inspect.getsource(cachew))) \n",
" if isinstance(x, ast.FunctionDef) and x.name == 'composite_hash'\n",
"]\n",
"\n",
"link = f'{Path(cachew.__file__).relative_to(cwd)}:{composite.lineno}'\n",
"\n",
"dmd(f'''\n",
"# How it works\n",
"Basically, your data objects get {flink('flattened out', 'cachew.NTBinder.to_row')}\n",
"and python types are mapped {flink('onto sqlite types and back', 'cachew.NTBinder.iter_columns')}.\n",
"\n",
"When the function is called, cachew [computes the hash]({link}) of your function's arguments \n",
"and compares it against the previously stored hash value.\n",
" \n",
"If they match, cachew deserializes and yields whatever is stored in the cache database; if the hash mismatches, the original data provider is called and the new data is stored along with the new hash.\n",
"''')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* supports nested NamedTuples\n",
"* supports datetime\n",
"* supports Optional\n",
"* detects schema changes and discards old data automatically \n",
"* custom hash function TODO example with mtime?\n"
"# Inspiration\n",
"Mainly this was inspired by [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache), which is excellent if you need to cache something within a single python process run.\n",
"\n",
"## Implementation\n",
"\n",
"* why tuples and dataclasses?\n",
" \n",
" Tuples are natural in Python for quickly grouping together return results.\n",
"  `NamedTuple` and `dataclass` specifically provide a very straightforward and self-documenting way to represent a bit of data in Python.\n",
"  Their very compact syntax makes them extremely convenient even as a one-off means of communicating between a couple of functions.\n",
" \n",
" * TODO [2019-07-30 Tue 21:02] some link to data class\n",
" \n",
"* why not [pickle](https://docs.python.org/3/library/pickle.html)?\n",
"\n",
"  Pickling is a bit heavyweight for a plain data class. There are many reports of pickle being slower than even JSON, it's also a security risk, and lastly, pickled data can only be loaded from Python.\n",
"\n",
"* why `sqlite` database for storage?\n",
"\n",
"  It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.\n",
"\n",
"* why not `pandas.DataFrame`?\n",
"\n",
" DataFrames are great and can be serialised to csv or pickled.\n",
"  They are good to have as one of the ways you can interface with your data, but their dynamic nature makes them hardly convenient to reason about abstractly.\n",
" They also can't be nested.\n",
" \n",
"* why not [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping)?\n",
" \n",
"  ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. They are also somewhat of an overkill for such a specific purpose.\n",
"\n",
"    * E.g. [SQLAlchemy](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping) requires you to use custom sqlalchemy-specific types and to inherit from a base class.\n",
" Also it doesn't support nested types.\n",
"\n",
"* why not [marshmallow](https://marshmallow.readthedocs.io/en/3.0/nesting.html)?\n",
" \n",
"  Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is an overhead when you already have it in the form of type annotations.\n",
" \n",
" * https://github.com/justanr/marshmallow-annotations TODO has support for NamedTuples\n",
" https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api\n",
"\n",
" https://pypi.org/project/marshmallow-dataclass/\n",
" * TODO mention that in code?\n",
"\n",
"* TODO [2019-07-30 Tue 19:00] post some link to data classes?\n",
" \n",
"# examples\n",
"* [2019-07-30 Tue 20:15] e.g. if hash is date you can ensure you only serve one piece of data a day\n",
"\n",
"\n"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
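The bookkeeping routine the README's intro describes (hash the input, serialize results, recompute on mismatch) can be sketched with the standard library alone. This is a minimal illustration of the problem cachew automates, not cachew's API; `Link`, `cached_extract`, and the `extract` callback are hypothetical names:

```python
import hashlib
import json
from pathlib import Path
from typing import List, NamedTuple

class Link(NamedTuple):
    url: str
    title: str

def file_hash(path: Path) -> str:
    # md5 of the raw input bytes; any change to the input invalidates the cache
    return hashlib.md5(path.read_bytes()).hexdigest()

def cached_extract(source: Path, cache: Path, extract) -> List[Link]:
    # return cached results if the stored hash matches the input, else recompute
    h = file_hash(source)
    if cache.exists():
        stored = json.loads(cache.read_text())
        if stored['hash'] == h:
            return [Link(*row) for row in stored['data']]
    data = list(extract(source))
    cache.write_text(json.dumps({'hash': h, 'data': data}))
    return data
```

cachew's decorator collapses this into a single line and additionally derives the storage schema from the type annotations.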
5 changes: 0 additions & 5 deletions cog

This file was deleted.

6 changes: 6 additions & 0 deletions generate-readme
@@ -0,0 +1,6 @@
#!/bin/bash

cd "$(dirname "$0")"

# '--TagRemovePreprocessor.remove_cell_tags={"noexport"}'
jupyter nbconvert --to markdown --template readme.tpl README.ipynb
10 changes: 10 additions & 0 deletions readme.tpl
@@ -0,0 +1,10 @@
{# disable code used to generate readme #}
{# based on https://stackoverflow.com/a/55305881/706389 #}

{%- extends 'markdown.tpl' -%}

{% block input_group %}
{%- if cell.metadata.get('nbconvert', {}).get('show_code', False) -%}
((( super() )))
{%- endif -%}
{% endblock input_group %}
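The `flink` helper in the notebook builds markdown links from live objects via `inspect`. The same idea in isolation, as a rough sketch (the real helper resolves the path relative to the repo root; here only the file name is kept):

```python
import inspect
from pathlib import Path

def source_link(title: str, func) -> str:
    # markdown link pointing at the file and line where func is defined
    file = Path(inspect.getsourcefile(func)).name
    _, lineno = inspect.getsourcelines(func)
    return f'[{title}]({file}:{lineno})'
```

Any pure-Python function works as an argument, e.g. `source_link('dumps', json.dumps)` yields a `[dumps](__init__.py:N)`-style link.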
14 changes: 0 additions & 14 deletions src/cachew/__init__.py
@@ -479,20 +479,7 @@ def cachew(
:param logger: custom logger, if not specified will use logger named `cachew`. See :func:`get_logger`.
:return: iterator over original or cached items
# [[[cog
# import cog
# lines = open('README.org').readlines()
# l = lines.index('#+BEGIN_SRC python\n')
# r = lines.index('#+END_SRC\n')
# src = lines[l + 1: r]
# cog.outl("'''")
# for line in src:
# cog.out(line)
# cog.outl("'''")
# ]]]
Usage example:
>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
... url : str
@@ -512,7 +499,6 @@ def cachew(
>>> print(f"took {int(res)} seconds to query cached items")
took 0 seconds to query cached items
"""
# [[[end]]]

# func is optional just to make pylint happy https://github.com/PyCQA/pylint/issues/259
# kassert(func is not None)
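The notebook locates `composite_hash` by walking the parsed AST of cachew's source rather than relying on import-time attributes. A minimal sketch of the same technique, applied to a throwaway stand-in source string:

```python
import ast

# a stand-in for cachew's real module source
source = '''
def helper():
    pass

def composite_hash(self, *args):
    return tuple(args)
'''

# find the FunctionDef node by name, the same trick the notebook uses
# to compute a file:line link without the function being easily reachable
[node] = [
    n for n in ast.walk(ast.parse(source))
    if isinstance(n, ast.FunctionDef) and n.name == 'composite_hash'
]
```

`node.lineno` then feeds directly into the markdown link, just as in the notebook cell above.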
