Lineapy is a Python library for capturing, analyzing, and automating data science workflows.
At a high level, Linea traces the code you execute to understand both the code and its context. This understanding of your development process allows Linea to provide a set of tools that help you get more value out of your work.
A natural unit of organization for this code is the variable: both its value and the code used to create it. Our features revolve around these units, which we call artifacts.
We are still early in the process, and we currently support the following features:
- Code cleanup: often when working with data, we don't know what efforts
will pan out. When we do have something we want to keep, we can save it as an
artifact and create a version of the code that only includes the pieces necessary
to recreate that artifact. This is called "Program Slicing" in the literature.
Linea's slicing feature makes it easy to share and re-execute this work.
- This is done automatically by calling the lineapy.save API on the variable of interest. /tests/housing.py contains an example.
- Pipeline extraction: Automatic creation of Airflow DAGs (and related systems) from Linea artifacts. Note that we take a radically different approach than tools like Papermill, because we analyze the code itself to automatically instrument the optimizations.
- This is done via the .to_airflow() API on a Linea artifact that is returned through either .save or .get. By default, the generated Airflow file is placed under your home directory's airflow/dag folders. examples/Demo_1_Preprocessing.ipynb and examples/Demo_2_Modeling.ipynb contain two end-to-end examples.
- Artifact store: Saving and getting artifacts from different sessions/notebooks/scripts.
- This is done via the lineapy.get API for artifacts that were saved via lineapy.save.
- You can also view a catalog of saved artifacts through the lineapy.catalog API. examples/1_Explorations.ipynb and /examples/2_APIs.ipynb contain examples of all three APIs.
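To make the idea of program slicing from the code cleanup item concrete, here is a minimal sketch of a backward slice over straight-line Python, written with only the standard library ast module. It is not Linea's implementation: the function name backward_slice is hypothetical, and the sketch assumes top-level assignments only, ignoring imports, control flow, function definitions, and mutation through method calls, all of which a real slicer has to handle.

```python
import ast


def _names(stmt: ast.stmt, ctx: type) -> set:
    """Collect variable names in `stmt` used with the given context (Load or Store)."""
    return {
        node.id
        for node in ast.walk(stmt)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ctx)
    }


def backward_slice(source: str, target: str) -> str:
    """Return the top-level statements of `source` needed to recompute `target`."""
    needed = {target}  # names whose definitions we still have to find
    kept = []          # statements in the slice, collected last-to-first
    for stmt in reversed(ast.parse(source).body):
        written = _names(stmt, ast.Store)
        if not written & needed:
            continue  # this statement defines nothing we need
        read = _names(stmt, ast.Load)
        if isinstance(stmt, ast.AugAssign):
            read |= written  # `x += 1` also reads the old value of x
        # The names defined here are now explained; their inputs become needed.
        needed = (needed - written) | read
        kept.append(stmt)
    return "\n".join(ast.get_source_segment(source, s) for s in reversed(kept))


example = "x = 100\ny = x + 500\nz = x - 10\n"
print(backward_slice(example, "z"))
# x = 100
# z = x - 10
```

The assignment to y is dropped because nothing that z depends on reads it. Walking the statements in reverse keeps the sketch to a single pass.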
We are working towards a number of other features and have created issues on Github to describe some of them, tagged with User Story, which include:
- Metadata search, e.g., "Find all charts that use this column from this table" (see issues on analyzing data sources and analyzing SQL).
- Enhanced execution-based versioning.
- Integration with existing infra, e.g., AWS, Airflow.
- Support for scaling up execution, e.g., automatically creating the same version of the code with Dask.
If you have any feedback for us, please get in touch! We welcome feedback on Github, either by commenting on existing issues or creating new ones. You can also find us on Twitter and Slack!
You can run lineapy through three options:
- Github CodeSpaces
- Docker image
- DIY: clone the repository
We'll describe the options below.
Click the green "<> Code" button above (in the homepage), and in the "Codespaces" tab you can click on the gray button "New codespace".
The first time you load it, it might take a while to download Docker. Once the VS Code interface loads, the "PORTS" tab on the lower panel should load after a few seconds (per the image below).
If you click on the globe icon (🌐) next to JupyterLab, it will open port 8888. If you click the same globe icon next to Airflow, it will open port 8080.
By default, JupyterLab will have two demo notebooks open. Run Demo 1 and then Demo 2 to the end, and you will see the Airflow jobs deployed in the dashboard!
- First install Docker and then authenticate to the Github Container Registry so you can pull our private image.
- Now you can pull and run our image to slice Python code:
$ docker run --rm -v $PWD:/app -w /app ghcr.io/linealabs/lineapy:main lineapy --slice "p value" tests/housing.py
...
You can also run Linea by cloning this repository and running the lineapy command:
$ git clone git@github.com:LineaLabs/lineapy.git
$ cd lineapy
# Linea currently requires Python 3.8+
$ pip install -r requirements.txt
$ python setup.py install
$ lineapy --slice "p value" tests/housing.py
...
Note that if you are not using Codespaces and are running Airflow and JupyterLab manually, we also created convenient Makefile targets to start Airflow (make airflow_start) on localhost:8080 and JupyterLab (make jupyterlab_start) on localhost:8888.
These features are currently exposed via two surfaces: the CLI and Jupyter (supporting all notebook interfaces).
Currently, you can run Linea as a CLI command to slice your Python code, extracting only the code that is necessary to recompute some result. Along the way, Linea stores the semantics of your code in a database, which we are working on exposing as well.
$ lineapy --help
Usage: lineapy [OPTIONS] FILE_NAME

Options:
  --db-url TEXT                   Set the DB URL. If None, will default to
                                  reading from the LINEA_DATABASE_URL env
                                  variable and if that is not set then will
                                  default to sqlite:///{LINEA_HOME}/db.sqlite.
                                  Note that {LINEA_HOME} will be replaced with
                                  the root linea home directory. This is the
                                  first directory found which has a .linea
                                  folder
  --slice TEXT                    Print the sliced code that this artifact
                                  depends on
  --export-slice TEXT             Requires --slice. Export the sliced code
                                  that {slice} depends on to {export_slice}.py
  --export-slice-to-airflow-dag, --airflow TEXT
                                  Requires --slice. Export the sliced code
                                  from all slices to an Airflow DAG {export-
                                  slice-to-airflow-dag}.py
  --airflow-task-dependencies TEXT
                                  Optional flag for --airflow. Specifies tasks
                                  dependencies in Airflow format, i.e. 'p
                                  value' >> 'y' or 'p value', 'x' >> 'y'. Put
                                  slice names under single quotes.
  --print-source                  Whether to print the source code
  --print-graph                   Whether to print the generated graph code
  --verbose                       Print out logging for graph creation and
                                  execution
  --visualize                     Visualize the resulting graph with Graphviz
  --help                          Show this message and exit.
# Run linea on a Python file to analyze it.
# --visualize creates a visual representation of the underlying graph and displays it
$ lineapy --print-source --visualize tests/simple.py
...
# Use --slice to slice the code to that which is needed to recompute an artifact
$ lineapy --print-source tests/housing.py --slice 'p value'
...
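The --airflow-task-dependencies option in the help text above takes a string such as 'p value', 'x' >> 'y'. The sketch below shows one plausible reading of that format as a list of (upstream, downstream) task pairs, following Airflow's convention that a >> b means a runs before b. The helper parse_task_dependencies is hypothetical and is not how Linea parses the flag; it also assumes slice names contain neither single quotes nor ">>".

```python
import re
from typing import List, Tuple


def parse_task_dependencies(spec: str) -> List[Tuple[str, str]]:
    """Turn "'a', 'b' >> 'c'" into [("a", "c"), ("b", "c")]."""
    # Each ">>"-separated group holds one or more single-quoted slice names.
    groups = [re.findall(r"'([^']*)'", part) for part in spec.split(">>")]
    edges: List[Tuple[str, str]] = []
    # Every task in a group runs before every task in the next group.
    for upstream, downstream in zip(groups, groups[1:]):
        edges.extend((u, d) for u in upstream for d in downstream)
    return edges


print(parse_task_dependencies("'p value', 'x' >> 'y'"))
# [('p value', 'y'), ('x', 'y')]
```

Under this reading, both 'p value' and 'x' must finish before 'y' starts, which is why slice names containing spaces need the single quotes.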
You can also run Linea interactively in a notebook or IPython.
However, to do so, the lineapy extension needs to be loaded. We have provided wrapper CLI commands to do this transparently: lineapy ipython and lineapy jupyter. For example, you can run lineapy jupyter lab to start JupyterLab with the required extension loaded automatically.
This sets the InteractiveShellApp.extensions configuration option to include lineapy for the kernel.
$ lineapy ipython
Python 3.9.7 (default, Sep 16 2021, 08:50:36)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.
[16:48:07] INFO Connecting to Linea DB at sqlite:///.linea/db.sqlite
In [1]: import lineapy
...: x = 100
...: y = x + 500
...: z = x - 10
...: print(lineapy.save(z, "z").code)
x = 100
z = x - 10
This also works for starting jupyter notebook or jupyter lab, or any other frontend which uses the IPython kernel.
You can also add c.InteractiveShellApp.extensions = ["lineapy"] to your own IPython config (found by running ipython locate profile default).
See IPython's documentation on configuration for more information.
For a larger example, you can look at examples/Explorations.ipynb.
If you have an existing notebook, you can try running it through Linea to see if it still works and to save the resulting graph. For example:
lineapy jupyter nbconvert --to notebook --execute examples/Explorations.ipynb --inplace --allow-errors
If you would like to change the database that Linea talks to, you can use the LINEA_DATABASE_URL env variable. For example, set it to sqlite:///:memory: to use an in-memory database instead of writing to disk.
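The precedence described for the database URL (the --db-url option first, then the LINEA_DATABASE_URL env variable, then the sqlite default) can be sketched as follows. This is an illustration of the documented order and not Linea's own code: resolve_db_url is a hypothetical name, the Linea home directory is passed in explicitly to avoid guessing how the .linea folder is located, and it is an assumption that the {LINEA_HOME} placeholder is filled only in the default URL.

```python
import os
from typing import Mapping, Optional

DEFAULT_TEMPLATE = "sqlite:///{LINEA_HOME}/db.sqlite"


def resolve_db_url(
    cli_value: Optional[str],
    linea_home: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the DB URL: explicit option, then env variable, then default."""
    if env is None:
        env = os.environ
    if cli_value is not None:
        return cli_value  # an explicit --db-url always wins
    env_value = env.get("LINEA_DATABASE_URL")
    if env_value:
        return env_value  # fall back to the environment variable
    # Otherwise use the sqlite file under the Linea home directory.
    return DEFAULT_TEMPLATE.format(LINEA_HOME=linea_home)


# An in-memory database requested through the environment:
print(resolve_db_url(None, ".linea", {"LINEA_DATABASE_URL": "sqlite:///:memory:"}))
# sqlite:///:memory:
```

Passing env as a parameter keeps the function easy to test without modifying the real process environment.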
In order to properly slice your code, we have to understand different Python
language features and libraries. We are working to add coverage to support all
of Python, as well as make our analysis more accurate. We have
a number of open issues, tagged under Language Support, to track what we know we don't yet support in Python.
Feel free to open more if you come across code that doesn't run or doesn't slice properly.