-
Notifications
You must be signed in to change notification settings - Fork 851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFrame / Panel price handler #58
Comments
Is the idea here to separate the loading of the data from the handling of the data? I can see that this makes sense. At the moment it would be necessary to clutter PriceHandlers with a lot of data connection logic for DBs, or CSV loading for files. |
Yes you can load from database a chunk of data using Pandas quite easily and have this data into memory (a DataFrame or a Panel). See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html But it's really important to avoid to load all data into memory if you process tick data (because it can be very long before backtest starts) An other approach (for DB) could be to use Blaze http://blaze.readthedocs.io/en/latest/index.html table_uri = 'dialect+driver://user:password@host:port/database::table'
for i, chunk in enumerate(odo(table_uri, chunks(pd.DataFrame), chunksize=chunksize)):
print("="*5 + " chunk %d " % (i + 1) + "="*5)
for j, row in chunk.iterrows():
print(row) Anyway it's important to process by chunk. |
Yeah, I had this problem with QSForex when I wanted to load tick data in day-by-day. I wrote a method that would basically load each day's data and process that, then repeat for the next day. That was manageable. However we definitely need to move to a streaming model, not only from the point of view of memory, but also as this will ease the transition to live data. |
The problem with this Pandas approach is that if your tick data come in a monthly CSV file (an not daily!!) like TrueFX http://www.truefx.com/ does it take a long to load all tick into memory. |
Some code (just for loader... not for handler) that might help: #!/usr/bin/env python
from abc import ABCMeta, abstractmethod
import six
from collections import OrderedDict
import pandas as pd
class AbstractPandasPriceLoader(object):
__metaclass__ = ABCMeta
def __init__(self, filenames):
if isinstance(filenames, tuple):
filenames = OrderedDict(filenames)
self.filenames = filenames
@abstractmethod
def _get_df_from_filename(self, filename):
raise NotImplementedError("Should implement _get_df_from_filename()")
@property
def panel(self):
# if len(self.filenames) <= 1:
# raise NotImplementedError("filenames must contain several filenames")
tickers_data = OrderedDict()
for ticker, filename in self.filenames.items():
tickers_data[ticker] = self._get_df_from_filename(filename)
panel = pd.Panel.from_dict(tickers_data).transpose(2, 1, 0)
return panel
@property
def dataframe(self):
if len(self.filenames) == 1:
df = self._get_df_from_filename(list(self.filenames.values())[0])
return df
else:
raise NotImplementedError("filenames must contain only one filename")
class CsvTickPandas(AbstractPandasPriceLoader):
def _get_df_from_filename(self, filename):
df = pd.io.parsers.read_csv(
filename, header=0, parse_dates=True,
dayfirst=True, index_col=1,
names=("Ticker", "Time", "Bid", "Ask")
)
df.index.name = "Date"
del df["Ticker"]
return df
class CsvOHLCVPandas(AbstractPandasPriceLoader):
def _get_df_from_filename(self, filename):
df = pd.io.parsers.read_csv(
filename, header=0, parse_dates=True,
index_col=0, names=(
"Date", "Open", "High", "Low",
"Close", "Volume", "Adj Close"
)
)
return df
pd.set_option("max_rows", 10)
# filenames = OrderedDict([
# ("GOOG", "data/GOOG.csv"),
# ])
# filenames = ("GOOG", "data/GOOG.csv"),
# csv_tick_pandas = CsvTickPandas(filenames)
# df = csv_tick_pandas.dataframe
# print(df)
# filenames = OrderedDict([
# ("GOOG", "data/GOOG.csv"),
# ("MSFT", "data/MSFT.csv")
# ])
# csv_tick_pandas = CsvTickPandas(filenames)
# panel = csv_tick_pandas.panel
# print(panel)
filenames = OrderedDict([
("SP500TR", "data/SP500TR.csv"),
])
csv_ohlcv_pandas = CsvOHLCVPandas(filenames)
df = csv_ohlcv_pandas.dataframe
print(df)
# panel = csv_ohlcv_pandas.panel
# print(panel) |
According discussion #53
Maybe a more general price handler should be done
This price handler could have
Index will be Time
Depending of columns names it could be either
Price
orBid
/Ask
and optionallyVolume
(s))Open
High
Low
Close
)Panel looks like:
The text was updated successfully, but these errors were encountered: