A small effort to quantify how many papers on arXiv just don't bother with metadata.
This project is a small Rust binary that prints some summary info and saves more comprehensive data to results.csv. Run it with:
cargo run --release
You can then generate some plots with Julia by running:
julia --project=.
> include("analysis.jl")
> main()
If you don't have the space, time, or need to download extra datasets, you can use the all.csv file in the root of this repo instead.
The notebook does not currently download the datasets itself; instead, it expects them to already be in a local directory (the commands below use ./arxiv-pdfs):
# list available pdf groups
gsutil ls gs://arxiv-dataset/arxiv/arxiv/pdf
# make the data dir
mkdir -p arxiv-pdfs
# then, download one set
gsutil -m cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/<yearmonth> ./arxiv-pdfs
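Before running the binary, it can help to confirm that the download actually landed where you expect. The following is a minimal sketch, not part of this repo: it assumes the data lives in ./arxiv-pdfs (as in the commands above), and the DATA_DIR variable and count_pdfs helper are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Assumption: the datasets live in ./arxiv-pdfs, matching the mkdir/gsutil
# commands above. Override by setting DATA_DIR (a hypothetical variable).
DATA_DIR="${DATA_DIR:-./arxiv-pdfs}"

# Hypothetical helper: count files ending in .pdf anywhere under a directory.
count_pdfs() {
    find "$1" -type f -name '*.pdf' | wc -l | tr -d ' '
}

if [ -d "$DATA_DIR" ]; then
    echo "found $(count_pdfs "$DATA_DIR") PDF(s) under $DATA_DIR"
else
    echo "no data directory at $DATA_DIR; run the gsutil commands first" >&2
fi
```

If the count is zero, the gsutil copy most likely did not complete or was written to a different directory.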