This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Dataset remix #5372

Merged on Aug 25, 2021 (39 commits)
Changes from 1 commit
Commits
10c5479
Adds a dataset that can be read and written lazily
dirkgr Aug 7, 2021
36f9b67
This approach might work better.
dirkgr Aug 7, 2021
e74540d
Make ShuffledSequence take indices
dirkgr Aug 7, 2021
0eb53bf
Formatting
dirkgr Aug 7, 2021
dcedfd5
Adds failing test
dirkgr Aug 7, 2021
36948ce
Merge remote-tracking branch 'origin/main' into TangoBigData
dirkgr Aug 11, 2021
44eccf9
Fix sparse sequence tests
dirkgr Aug 11, 2021
f305de7
Fixes the Sqlite format
dirkgr Aug 11, 2021
61f8810
Quality-of-life hack
dirkgr Aug 11, 2021
989f15c
Makes an internal string less alarming
dirkgr Aug 11, 2021
9c461b7
Save the files to the right place
dirkgr Aug 11, 2021
15e0be4
Merge remote-tracking branch 'origin/main' into TangoBigData
dirkgr Aug 18, 2021
ca26abe
Formatting
dirkgr Aug 19, 2021
f2f0a34
Merge remote-tracking branch 'origin/main' into TangoBigData
dirkgr Aug 19, 2021
bb572b3
Fix for SqliteDatasetFormat
dirkgr Aug 20, 2021
6953d7d
Performance improvement for SqliteSparseSequence
dirkgr Aug 20, 2021
3f99be7
Changelog
dirkgr Aug 20, 2021
d69ea38
Merge branch 'main' into TangoBigData
dirkgr Aug 20, 2021
d58a52f
Global imports
dirkgr Aug 20, 2021
104777d
More Sequence classes
dirkgr Aug 21, 2021
b6b5f05
Say DatasetDict when we mean DatasetDict
dirkgr Aug 21, 2021
05c4dd6
Test for the sequences
dirkgr Aug 21, 2021
4304a93
Use the step name correctly in the error message
dirkgr Aug 21, 2021
d6cb8ab
Use and consume step_name correctly in Step.from_params()
dirkgr Aug 21, 2021
fd305a6
Uncacheable steps don't get cached even if they have a name
dirkgr Aug 21, 2021
3ae61eb
Adds a step that can remix a dataset
dirkgr Aug 21, 2021
2004fd2
Improve log message
dirkgr Aug 21, 2021
b0c3626
Fix relative import
dirkgr Aug 21, 2021
fcf651f
Changelog
dirkgr Aug 21, 2021
aa82e3d
Merge branch 'main' into DatasetRemix
dirkgr Aug 23, 2021
ca5cad3
Adds documentation
dirkgr Aug 23, 2021
d5f11f4
Merge branch 'DatasetRemix' of https://github.com/allenai/allennlp in…
dirkgr Aug 23, 2021
c52b050
Give the option of changing a det_hash simply()
dirkgr Aug 23, 2021
a32c7f2
Tix fypo
dirkgr Aug 23, 2021
6cccd64
Adds ability to shuffle datasets
dirkgr Aug 24, 2021
765575d
Test for det_hash
dirkgr Aug 24, 2021
c69df7e
Merge branch 'main' into DatasetRemix
dirkgr Aug 24, 2021
451e4ee
We don't use relative imports
dirkgr Aug 25, 2021
1d71b69
Merge branch 'DatasetRemix' of https://github.com/allenai/allennlp in…
dirkgr Aug 25, 2021
Save the files to the right place
dirkgr committed Aug 11, 2021
commit 9c461b7456c90a789ad1791f8718b691a9cf7a07
4 changes: 2 additions & 2 deletions allennlp/tango/sqlite_format.py
@@ -13,7 +13,7 @@

 @Format.register("sqlite")
 class SqliteDictFormat(Format[DatasetDict]):
-    VERSION = 1
+    VERSION = 2

     def write(self, artifact: DatasetDict, dir: Union[str, PathLike]):
         dir = pathlib.Path(dir)
@@ -28,7 +28,7 @@ def write(self, artifact: DatasetDict, dir: Union[str, PathLike]):
             if isinstance(split, SqliteSparseSequence):
                 split.copy_to(filename)
             else:
-                sqlite = SqliteSparseSequence(filename)
+                sqlite = SqliteSparseSequence(dir / filename)
                 sqlite.extend(split)

     def read(self, dir: Union[str, PathLike]) -> DatasetDict:
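
For readers skimming this commit: the change to `SqliteSparseSequence(dir / filename)` matters because a bare filename resolves against the current working directory rather than the directory passed to `write()`. The following is a minimal sketch of that behaviour using only the standard library. `write_split` is a hypothetical stand-in for constructing a `SqliteSparseSequence` and calling `extend()`; it is not the AllenNLP implementation, and the table layout and the `train.sqlite` name are assumptions made for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def write_split(db_path: Path, rows: list) -> None:
    """Hypothetical stand-in for SqliteSparseSequence(db_path).extend(rows)."""
    conn = sqlite3.connect(str(db_path))
    try:
        # Assumed one-column table, purely for illustration.
        conn.execute("CREATE TABLE IF NOT EXISTS items (value TEXT)")
        conn.executemany("INSERT INTO items (value) VALUES (?)", [(r,) for r in rows])
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    target_dir = Path(tmp) / "artifact"
    target_dir.mkdir()
    filename = "train.sqlite"  # assumed split file name

    # A bare filename is relative to the current working directory,
    # not to the directory the format was asked to write into.
    bare_path = Path(filename)
    # Joining with the target directory mirrors the fix in this commit.
    joined_path = target_dir / filename

    write_split(joined_path, ["a", "b"])
    wrote_to_target = joined_path.exists()
    bare_resolves_inside_target = bare_path.resolve().parent == target_dir.resolve()
```

Under these assumptions, only the joined path produces a file inside the target directory; the bare path would point wherever the process happened to be started.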