Converts OSCAR's jsonl files into parquet

License

Apache-2.0 (LICENSE-APACHE) and MIT (LICENSE-MIT)

pjox/oscar2parquet

OSCAR2Parquet

This CLI tool converts OSCAR's JSONL files into Parquet. It takes Ungoliant's output as input and writes the Parquet files to the destination folder. It is intended to replace the splitting and compression steps of OSCAR generation that were previously performed by oscar-tools.
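
To make the "splitting" step concrete, here is a minimal Python sketch of reading JSONL records and grouping them into fixed-size batches, one batch per output file. It is an illustration only, not this tool's implementation: the helper names `read_jsonl` and `batch_records`, the `max_rows` parameter, and the `content` field in the sample are assumptions made for the example. Writing each batch to Parquet is left out because it needs a third-party library.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def batch_records(records: Iterable[dict], max_rows: int) -> Iterator[List[dict]]:
    """Group records into lists of at most max_rows items (hypothetical helper).

    Each yielded batch stands for the rows of one output file.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, max_rows))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Illustrative input; the field name is an assumption, not a schema guarantee.
    sample = ['{"content": "a"}', "", '{"content": "b"}', '{"content": "c"}']
    for index, batch in enumerate(batch_records(read_jsonl(sample), max_rows=2)):
        print(index, len(batch))
```

Streaming with a generator keeps memory use bounded by `max_rows` records at a time, which matters for corpus-sized inputs.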

Todo

  • Add Python bindings
  • Add tests
  • Add option to control the maximum number of rows per parquet file

Usage

oscar2parquet -h
Converts OSCAR's jsonl files into parquet.

Usage: oscar2parquet [OPTIONS] <INPUT FOLDER> <DESTINATION FOLDER>

Arguments:
  <INPUT FOLDER>        Folder containing the indices
  <DESTINATION FOLDER>  Parquet file to write

Options:
  -t, --threads <NUMBER OF THREADS>  Number of threads to use [default: 10]
  -h, --help                         Print help
  -V, --version                      Print version
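
If the conversion is driven from a script, the invocation above could be wrapped roughly as follows. This is a sketch that assumes the `oscar2parquet` binary is installed and on `PATH`; the function names and the example folder paths are hypothetical, and only the `--threads` option and the two positional arguments from the help text are used.

```python
import shutil
import subprocess
from typing import List


def build_command(input_dir: str, dest_dir: str, threads: int = 10) -> List[str]:
    """Build the argument list for oscar2parquet (hypothetical helper).

    The default of 10 threads mirrors the default shown in the help text.
    """
    return ["oscar2parquet", "--threads", str(threads), str(input_dir), str(dest_dir)]


def run_conversion(input_dir: str, dest_dir: str, threads: int = 10) -> None:
    """Run the conversion and raise if the binary is missing or exits non-zero."""
    if shutil.which("oscar2parquet") is None:
        raise FileNotFoundError("oscar2parquet was not found on PATH")
    subprocess.run(build_command(input_dir, dest_dir, threads), check=True)


if __name__ == "__main__":
    # Hypothetical folders: replace with Ungoliant's output and your destination.
    print(" ".join(build_command("ungoliant-output", "parquet-output", threads=4)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when folder names contain spaces.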
