Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame / Panel price handler #58

Closed
femtotrader opened this issue Jun 25, 2016 · 5 comments
Closed

DataFrame / Panel price handler #58

femtotrader opened this issue Jun 25, 2016 · 5 comments

Comments

@femtotrader
Copy link
Contributor

femtotrader commented Jun 25, 2016

According discussion #53

Maybe a more general price handler should be done
This price handler could have

  • a DataFrame as input (when a backtest for only one ticker is necessary)
  • a Panel as input (when a backtest with several tickers is given)

Index will be Time

Depending of columns names it could be either

  • a tick price handler (need at least a column Price or Bid/Ask and optionally Volume(s))
  • a bar price handler (need at least columns named Open High Low Close)
import pandas as pd
import pandas_datareader.data as web
import datetime
import requests_cache


expire_after = datetime.timedelta(days=3)
session = requests_cache.CachedSession(cache_name='cache', backend='sqlite', expire_after=expire_after)
panel = web.DataReader(["GOOG", "IBM"], "yahoo", session=session)

Panel looks like:

<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 1630 (major_axis) x 2 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2010-01-04 00:00:00 to 2016-06-23 00:00:00
Minor_axis axis: GOOG to IBM
def data_from_dataframe(df):
    for (dt, row) in df.iterrows():
        yield(dt, row)

def data_from_panel(panel):
    for dt, data in panel.transpose(1,0,2).iteritems():
        for (ticker, bar) in data.iteritems():
            yield(dt, ticker, bar)

ticker = "GOOG"
df = panel[:,:,ticker]
for (dt, bar) in data_from_dataframe(df):
    print(dt, ticker, bar)

for (dt, ticker, bar) in data_from_panel(panel):
    print(dt, ticker, bar)

@mhallsmoore
Copy link
Owner

Is the idea here to separate the loading of the data from the handling of the data? I can see that this makes sense. At the moment it would be necessary to clutter PriceHandlers with a lot of data connection logic for DBs, or CSV loading for files.

@femtotrader
Copy link
Contributor Author

femtotrader commented Jul 2, 2016

Yes you can load from database a chunk of data using Pandas quite easily and have this data into memory (a DataFrame or a Panel).

See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html

But it's really important to avoid to load all data into memory if you process tick data (because it can be very long before backtest starts)

An other approach (for DB) could be to use Blaze http://blaze.readthedocs.io/en/latest/index.html

table_uri = 'dialect+driver://user:password@host:port/database::table'
for i, chunk in enumerate(odo(table_uri, chunks(pd.DataFrame), chunksize=chunksize)):
    print("="*5 + " chunk %d " % (i + 1) + "="*5)
    for j, row in chunk.iterrows():
        print(row)

Anyway it's important to process by chunk.

@mhallsmoore
Copy link
Owner

Yeah, I had this problem with QSForex when I wanted to load tick data in day-by-day. I wrote a method that would basically load each day's data and process that, then repeat for the next day. That was manageable.

However we definitely need to move to a streaming model, not only from the point of view of memory, but also as this will ease the transition to live data.

@femtotrader
Copy link
Contributor Author

femtotrader commented Jul 2, 2016

The problem with this Pandas approach is that if your tick data come in a monthly CSV file (an not daily!!) like TrueFX http://www.truefx.com/ does it take a long to load all tick into memory.
So that's the reason why I did https://github.com/femtotrader/femtotrading/blob/master/femtotrading/data_iterator/monthly_csv_tick.py
So I think we need both approach depending of data when want to process.

@femtotrader
Copy link
Contributor Author

Some code (just for loader... not for handler) that might help:

#!/usr/bin/env python

from abc import ABCMeta, abstractmethod
import six
from collections import OrderedDict
import pandas as pd


class AbstractPandasPriceLoader(object):

    __metaclass__ = ABCMeta

    def __init__(self, filenames):
        if isinstance(filenames, tuple):
            filenames = OrderedDict(filenames)
        self.filenames = filenames

    @abstractmethod
    def _get_df_from_filename(self, filename):
        raise NotImplementedError("Should implement _get_df_from_filename()")

    @property
    def panel(self):
        # if len(self.filenames) <= 1:
        #     raise NotImplementedError("filenames must contain several filenames")
        tickers_data = OrderedDict()
        for ticker, filename in self.filenames.items():
            tickers_data[ticker] = self._get_df_from_filename(filename)
        panel = pd.Panel.from_dict(tickers_data).transpose(2, 1, 0)
        return panel

    @property
    def dataframe(self):
        if len(self.filenames) == 1:
            df = self._get_df_from_filename(list(self.filenames.values())[0])
            return df
        else:
            raise NotImplementedError("filenames must contain only one filename")


class CsvTickPandas(AbstractPandasPriceLoader):
    def _get_df_from_filename(self, filename):
        df = pd.io.parsers.read_csv(
            filename, header=0, parse_dates=True,
            dayfirst=True, index_col=1,
            names=("Ticker", "Time", "Bid", "Ask")
        )
        df.index.name = "Date"
        del df["Ticker"]
        return df


class CsvOHLCVPandas(AbstractPandasPriceLoader):
    def _get_df_from_filename(self, filename):
        df = pd.io.parsers.read_csv(
            filename, header=0, parse_dates=True,
            index_col=0, names=(
                "Date", "Open", "High", "Low",
                "Close", "Volume", "Adj Close"
            )
        )
        return df

pd.set_option("max_rows", 10)

# filenames = OrderedDict([
#     ("GOOG", "data/GOOG.csv"),
# ])
# filenames = ("GOOG", "data/GOOG.csv"),
# csv_tick_pandas = CsvTickPandas(filenames)
# df = csv_tick_pandas.dataframe
# print(df)

# filenames = OrderedDict([
#     ("GOOG", "data/GOOG.csv"),
#     ("MSFT", "data/MSFT.csv")
# ])
# csv_tick_pandas = CsvTickPandas(filenames)
# panel = csv_tick_pandas.panel
# print(panel)

filenames = OrderedDict([
    ("SP500TR", "data/SP500TR.csv"),
])
csv_ohlcv_pandas = CsvOHLCVPandas(filenames)
df = csv_ohlcv_pandas.dataframe
print(df)
# panel = csv_ohlcv_pandas.panel
# print(panel)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants